CN108364302B - Unmarked augmented reality multi-target registration tracking method - Google Patents

Unmarked augmented reality multi-target registration tracking method

Info

Publication number
CN108364302B
CN108364302B
Authority
CN
China
Prior art keywords
image
target
tracking
thread
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810096334.1A
Other languages
Chinese (zh)
Other versions
CN108364302A (en)
Inventor
张宇
卢明林
李雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810096334.1A priority Critical patent/CN108364302B/en
Publication of CN108364302A publication Critical patent/CN108364302A/en
Application granted granted Critical
Publication of CN108364302B publication Critical patent/CN108364302B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Abstract

The invention discloses an unmarked augmented reality multi-target registration tracking method, which comprises the following steps: 1) an offline stage: a vocabulary tree model is built by hierarchical k-means clustering, the corresponding image ids are registered in the inverted indexes of all leaf nodes of the vocabulary tree, and the vocabulary tree is finally updated into a vocabulary tree with tf-idf weights according to the frequency of the leaf nodes of the whole tree and the total number of images to be registered; 2) an online stage: the online stage is a real-time response system that retrieves the most similar image for the image input by the camera in real time according to the trained vocabulary tree, then calculates the initial pose of the 3D object to be loaded through camera pose estimation, tracks the motion of each target with the KLT tracking algorithm, and finally constructs the augmented reality scene of each target through a rendering thread. The invention provides an efficient and reliable solution for unmarked multi-target augmented reality.

Description

Unmarked augmented reality multi-target registration tracking method
Technical Field
The invention relates to the field of computer graphics, and in particular to an unmarked augmented reality multi-target registration tracking method.
Background
Augmented reality is a technology for seamlessly fusing information from a virtual world with the real world: experience that could not originally be obtained in the real world is simulated by computers and other technologies and then superimposed onto the real world, producing a sensory experience that goes beyond reality; it can act on the visual, auditory and other sensory systems. The origin of augmented reality can be traced back to the birth of computer technology. The head-mounted display device invented in 1968 by Ivan Sutherland, then an associate professor of electrical engineering at Harvard University, is the prototype of augmented reality: the system mounted the display device on the ceiling above the user's head, connected it to the head-mounted device through a linkage, and could convert simple wireframe graphics into images with a 3D effect. The augmented reality application scenario addressed by the invention is mainly visual. Traditional marker-based augmented reality technology can no longer keep pace with the times or satisfy people's increasingly rich and diverse requirements, whereas unmarked augmented reality technology is more flexible and universally applicable. The invention discloses a technology for unmarked multi-target augmented reality registration and tracking, which mainly involves visual feature clustering, a vocabulary tree model, inverted indexes, camera pose estimation, target tracking and other algorithms.
Since the beginning of the 21st century, a large number of companies have begun to develop augmented reality applications; the main development platforms they use are: (1) Vuforia; (2) EasyAR; (3) Wikitude; (4) ARToolKit; (5) Maxst; (6) Xzimg. In addition, Metaio was acquired early on by Apple Inc., which released the ARKit development kit for the iOS system in June 2017. Apart from ARToolKit, the other platforms only provide open interfaces for users to develop applications and charge fees once certain usage limits are exceeded. Although these commercial open platforms provide relatively complete solutions, they cannot provide researchers with a detailed and definite algorithmic theoretical basis. Only ARToolKit, the only platform with open-source algorithm code, provides a marker-based multi-target registration tracking method; it does not implement an unmarked multi-target registration tracking method, and it traverses the whole image database in the recognition stage, so recognition is very inefficient when a large number of images are registered. Meanwhile, the image-based unmarked augmented reality system and method invented by Zhang et al. still traverse the whole image database for comparison in the retrieval stage.
The core technology of augmented reality mainly comprises three aspects: (1) image retrieval; (2) camera pose estimation; (3) online tracking. At the beginning of the 21st century, a widely used augmented reality image recognition method employed templates based on square markers. A marker-based square template consists of an embedded pattern, a white background and a black frame: the embedded pattern determines the uniqueness of the marker, and the black frame is identified, tracked and detected first and used to estimate the camera pose. However, marker-based templates can store few distinct patterns and the number of supported patterns is limited, so they are not suitable for scenarios with a massive number of patterns to recognize.
With the subsequent development of the technology, unmarked natural-feature tracking has broken through the limitations of the traditional marker-based method, making augmented reality application scenarios more flexible and the storage capacity larger. Unmarked augmented reality relies on natural feature points; commonly used feature point extraction algorithms include SIFT, SURF, FAST, ORB and the like. Because the extracted feature points are numerous and complex in unmarked augmented reality, the well-known BOW (bag-of-words) model is used in the image retrieval stage; however, experiments show that the bag-of-words model often needs a very large number of cluster centers in the clustering stage, such as 10^6 or more, to obtain good results, and the time spent in the retrieval stage grows linearly as the number of registered images increases. The scalable recognition using vocabulary trees proposed by Nister D and Stewenius H, together with the KLT target tracking algorithm based on optical flow, provides the basis for the present invention.
Disclosure of Invention
The invention aims to overcome the defects of marker-based templates in augmented reality, namely small storage capacity and inflexible use, and provides an unmarked augmented reality multi-target registration tracking method with fast recognition, large storage capacity, and guaranteed efficiency and quality under multi-target tracking.
In order to achieve the above purpose, the technical solution provided by the invention is as follows: an unmarked augmented reality multi-target registration tracking method, comprising the following steps:
1) An offline stage: a vocabulary tree model is built by hierarchical k-means clustering, the corresponding image ids are registered in the inverted indexes of all leaf nodes of the vocabulary tree, and the vocabulary tree is finally updated into a vocabulary tree with tf-idf weights according to the frequency of the leaf nodes of the whole tree and the total number of images to be registered.
2) An online stage: the online stage is a real-time response system; for the image input by the camera in real time, the most similar image is retrieved according to the vocabulary tree trained in step 1), then the initial pose of the 3D object to be loaded is calculated through camera pose estimation, the motion of the target is tracked with the KLT tracking algorithm, and finally the augmented reality scene of each target is constructed through a rendering thread.
In step 1), the off-line phase comprises the following steps:
a) extracting all SIFT feature point descriptors from the image set to be registered;
b) performing k-means hierarchical clustering according to the branch number of the vocabulary tree;
c) repeating operation b) on all SIFT feature point descriptors under each branch in turn until the leaf nodes are reached;
d) finally, linking one inverted index file to each leaf node at the bottommost layer;
e) extracting SIFT feature point descriptors from each image to be registered again, counting the score weight of each feature point descriptor on each leaf node according to the trained vocabulary tree, and registering the corresponding image in the inverted index;
f) recalculating the vocabulary tree with tf-idf weights according to the length of the inverted index file on each leaf node of the whole tree; if the total number of registered images is N and any leaf node j has N_j images registered in its inverted index, the tf-idf weight of leaf node j is

w_j = ln(N / N_j)

For a registered image d', if m_j feature points fall on leaf node j, the score of that leaf node is d'_j = m_j · w_j; for an image q' to be retrieved, if n_j feature points fall on leaf node j, the score of that leaf node is q'_j = n_j · w_j.
In step 2), the whole system has one recognition thread, a plurality of tracking threads and one rendering thread; the details are as follows:
The representation of multi-target tracking is as follows: given an image sequence I_1, I_2, ..., I_t, the number of targets in each image is M_t, and the state of each target consists of position and attitude information represented by a 3 × 4 pose matrix

T_t^i = [R | t]

A 3D point in the world coordinate system is mapped to the position of a 2D point in the camera coordinate system through the following equation (equality up to a scale factor):

[u, v, 1]^T ∝ A · [R | t] · [x_w, y_w, z_w, 1]^T

where u, v are the image coordinates of the camera projection, x_w, y_w, z_w are the world coordinates of a feature point, and A is the intrinsic parameter matrix of the camera, as follows:

A = [ α_x  s  c_x ; 0  α_y  c_y ; 0  0  1 ]

where α_x and α_y represent the focal lengths, c_x and c_y the coordinates of the principal point, and s the skew coefficient; these intrinsic parameters of the camera are used to correct the effects of optical distortion and can be obtained in the offline stage. Pose estimation calculates the accurate extrinsic parameters of the camera, i.e. the rotation matrix R and the translation vector t, from the views of n ≥ 3 reference points using an ICP (Iterative Closest Point) iterative method.

All target states in each image are then represented as S_t = {T_t^1, T_t^2, ..., T_t^(M_t)}, the motion trajectory of the i-th target is represented as T_(1:t)^i = {T_1^i, T_2^i, ..., T_t^i}, and the state sequence composed of all image targets is S_(1:t) = {S_1, S_2, ..., S_t};
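To make the projection model above concrete, the following sketch recovers the 3 × 4 pose matrix [R | t] of one planar target from point correspondences with OpenCV's default iterative solvePnP solver; this is only an illustrative stand-in for the ICP-style estimation described above, and the intrinsic matrix, target size and pixel coordinates are placeholder values:

    #include <opencv2/core/core.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <vector>

    int main() {
        // Camera intrinsic matrix A from offline calibration (placeholder values).
        cv::Mat A = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                               0, 800, 240,
                                               0,   0,   1);
        cv::Mat distCoeffs = cv::Mat::zeros(4, 1, CV_64F);       // assume no lens distortion

        // World coordinates of the target's four corners on the z = 0 plane.
        float w = 0.20f, h = 0.15f;                              // placeholder size in meters
        std::vector<cv::Point3f> worldPts = {
            {0, 0, 0}, {0, h, 0}, {w, h, 0}, {w, 0, 0}};
        // Matched image coordinates of the same corners (placeholder pixels).
        std::vector<cv::Point2f> imagePts = {
            {210, 120}, {205, 330}, {450, 335}, {455, 125}};

        cv::Mat rvec, tvec;
        cv::solvePnP(worldPts, imagePts, A, distCoeffs, rvec, tvec);  // default iterative solver

        cv::Mat R, T;
        cv::Rodrigues(rvec, R);                                  // rotation vector -> 3x3 matrix
        cv::hconcat(R, tvec, T);                                 // 3x4 pose matrix [R | t]
        // T is the per-target pose that is handed to a tracking thread and the renderer.
        return 0;
    }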
Work content of the recognition thread:
In contrast to conventional single-target recognition, multi-target recognition aims to recognize as many targets as possible in an image; suppose I_t contains h targets. The procedure is as follows:
a) according to the input image I_t of the camera, first detect the set P of all feature points using the SIFT algorithm, and extract SIFT feature point descriptors for all points in P;
b) according to the vocabulary tree trained in step 1), take the inner product of each descriptor with the nodes of each layer of the vocabulary tree from top to bottom, determine the final leaf node into which each feature point descriptor falls, and add the tf-idf weight of that node to the score of the corresponding leaf node;
c) take out all the images in the inverted indexes of the leaf nodes whose scores for the input image are nonzero, and obtain the most similar reference image d̂ in the candidate image set using the following similarity calculation. Considering that the image may contain several identical targets, the Euclidean distance calculation would make an image containing more instances of a target appear farther from its reference image even though it is more similar, so the similarity calculation for multi-target detection and recognition adopts the cosine similarity formula:

sim(q, d) = (q · d) / (‖q‖ · ‖d‖)

where q is the vector of all nonzero leaf-node scores of the query image, and d is the vector of all nonzero leaf-node scores of an image to be compared in the image database.
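A minimal sketch of this cosine-similarity score, assuming the nonzero leaf-node scores of the query image and of each candidate image are kept as sparse maps from leaf id to score (a layout choice made here for illustration):

    #include <cmath>
    #include <map>

    double cosineSimilarity(const std::map<int, double>& q, const std::map<int, double>& d) {
        double dot = 0.0, nq = 0.0, nd = 0.0;
        for (const auto& kv : q) {
            nq += kv.second * kv.second;
            auto it = d.find(kv.first);                 // only leaves shared by q and d contribute
            if (it != d.end()) dot += kv.second * it->second;
        }
        for (const auto& kv : d) nd += kv.second * kv.second;
        if (nq == 0.0 || nd == 0.0) return 0.0;
        return dot / (std::sqrt(nq) * std::sqrt(nd));   // sim(q, d) = q·d / (|q||d|)
    }

The recognition thread only needs to evaluate this score for images that appear in the inverted indexes of the nonzero leaves, so the whole image database is never traversed.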
d) search for the homography matrix H, a 3 × 3 matrix; if H exists, recalculate the inlier point set Q using the RANSAC algorithm, and if it does not exist, terminate. Point pairs on two images taken from different perspectives can be related by a projective transformation, as follows:
x'=H·x
where x', x are the coordinates of point pairs on two images from different viewing angles, respectively.
e) transform the 4 boundary points of the reference image d̂ through the homography matrix H using the above formula, and judge whether the quadrilateral a'b'c'e' obtained after the matrix transformation is convex; if not, terminate. Let the height of the original image be h and its width be w; the four points a, b, c, e are transformed by the following formula to obtain the coordinates of a', b', c', e', and whether they form a convex quadrilateral is judged by elementary geometry:

[x_i', y_i', 1]^T ∝ H · [x_i, y_i, 1]^T,  i ∈ {a, b, c, e}

where point a has coordinates (0, 0), point b has coordinates (0, h), point c has coordinates (w, h), point e has coordinates (w, 0), and the transformed points a', b', c', e' have coordinates (x_a, y_a), (x_b, y_b), (x_c, y_c), (x_e, y_e), respectively.
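A minimal sketch of this corner transformation and convexity test, assuming the reference image corners are ordered a, b, c, e as above; the projected quadrilateral a'b'c'e' is accepted only if the cross products of consecutive edges all have the same sign:

    #include <opencv2/core/core.hpp>
    #include <vector>

    bool isConvexAfterH(const cv::Mat& H, double w, double h) {
        std::vector<cv::Point2f> corners = {
            {0, 0}, {0, (float)h}, {(float)w, (float)h}, {(float)w, 0}};   // a, b, c, e
        std::vector<cv::Point2f> warped;
        cv::perspectiveTransform(corners, warped, H);                      // a', b', c', e'
        int positive = 0, negative = 0;
        for (int i = 0; i < 4; ++i) {
            cv::Point2f e1 = warped[(i + 1) % 4] - warped[i];
            cv::Point2f e2 = warped[(i + 2) % 4] - warped[(i + 1) % 4];
            double cross = e1.x * e2.y - e1.y * e2.x;                      // z-component of e1 x e2
            (cross >= 0 ? positive : negative)++;
        }
        return positive == 4 || negative == 4;         // all turns in one consistent direction
    }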
f) calculate the initial pose matrix T_t^i of the target through the ICP iterative algorithm, hand it over to a new tracking thread, and mask the target region;
g) let P = P - (P ∩ Q), and hand the target over to the tracking thread, which updates the pose matrix T_t^i in real time; the recognition thread continues to repeat steps a) to f) on the unmasked region.
Work content of each tracking thread:
Multi-target tracking means that, given an image sequence I_1, I_2, ..., I_t, the set of moving targets in the image sequence is found through multi-target recognition, the moving targets in subsequent frames are associated with one of them, and the motion trajectories of the different targets are given. The tracking algorithm adopted here is the KLT algorithm based on optical flow, which is implemented as follows:
The KLT principle assumes that the same target appears in two frames of images I and J and looks the same over a local window W; then, within the local window W: I(x, y, l) = J(x', y', l + τ). This means that on image I every point (x, y) has moved in one direction by a displacement g = (d_x, d_y), and at time l + τ it corresponds to the point (x', y') = (x + d_x, y + d_y) on image J; the matching problem can therefore be posed as minimizing the following equation:

ε(g) = Σ_{(x, y) ∈ W} [ I(x, y) - J(x + d_x, y + d_y) ]²

The above equation, also called the difference equation, is equivalent to the integral form:

ε(g) = ∫∫_W [ I(x) - J(x + g) ]² w(x) dx

where w(x) is a weighting function, usually set to the constant 1. That is, find the difference between the local window of radius (w_x, w_y) centered at (u_x, u_y) in image I and the window centered at (u_x + g_x, u_y + g_y) in image J; to make ε(g) minimal, set the derivative of the above equation to zero, from which the displacement g can be solved.
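A minimal sketch of the per-frame KLT update used by a tracking thread, based on OpenCV's pyramidal Lucas-Kanade implementation; the window size and pyramid depth are placeholder choices:

    #include <opencv2/core/core.hpp>
    #include <opencv2/video/tracking.hpp>
    #include <vector>

    void trackKLT(const cv::Mat& prevGray, const cv::Mat& currGray,
                  std::vector<cv::Point2f>& points) {
        if (points.empty()) return;
        std::vector<cv::Point2f> nextPoints;
        std::vector<unsigned char> status;
        std::vector<float> err;
        cv::calcOpticalFlowPyrLK(prevGray, currGray, points, nextPoints, status, err,
                                 cv::Size(21, 21), 3);          // 21x21 window, 3 pyramid levels
        std::vector<cv::Point2f> kept;
        for (size_t i = 0; i < points.size(); ++i)
            if (status[i]) kept.push_back(nextPoints[i]);       // keep successfully tracked points
        points.swap(kept);
        // If too few points survive, the tracking thread reports a loss to the recognition
        // thread, which restores this region to the set of regions to be detected.
    }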
Therefore, the specific implementation steps of each tracking thread are as follows:
a) tracking the identified target in real time by using a KLT tracking algorithm;
b) if target tracking is lost, feed back to the recognition thread and require it to restore the region to be detected, i.e. P = P ∪ Q;
c) if target tracking succeeds, feed back to the recognition thread and update the recognition thread's masked region and the pose matrix T_t^i.
The work content of the rendering thread is as follows:
According to the pose matrix T_t^i of each tracking thread, the corresponding 3D models are placed in turn, and the augmented reality scene is rendered through OpenGL.
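As an illustration of how a tracking thread's pose could reach the renderer, the following sketch (an assumption about the rendering setup, not the patent's code) converts a 3 × 4 pose matrix [R | t] expressed in OpenCV's camera convention (y down, z forward) into the column-major modelview matrix expected by a fixed-function OpenGL pipeline:

    #include <opencv2/core/core.hpp>

    void poseToModelview(const cv::Mat& T, double modelview[16]) {   // T: 3x4, CV_64F
        // Flip the y and z axes to move from the OpenCV to the OpenGL camera frame.
        cv::Mat flip = (cv::Mat_<double>(3, 3) << 1,  0,  0,
                                                  0, -1,  0,
                                                  0,  0, -1);
        cv::Mat Tgl = flip * T;                                       // still 3x4
        for (int col = 0; col < 4; ++col)
            for (int row = 0; row < 3; ++row)
                modelview[col * 4 + row] = Tgl.at<double>(row, col);  // column-major layout
        modelview[3] = modelview[7] = modelview[11] = 0.0;
        modelview[15] = 1.0;
        // glMatrixMode(GL_MODELVIEW); glLoadMatrixd(modelview); then draw the 3D model.
    }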
Compared with the prior art, the invention has the following advantages and benefits:
1. The method removes the limitation that marker-based augmented reality depends on auxiliary markers in the image retrieval stage, and can be applied to unmarked augmented reality application scenarios.
2. Compared with the open-source ARToolKit platform, the method abandons the inefficient approach of traversing the whole image library when retrieving unmarked natural feature points, and shortens retrieval time by utilizing the index structure of the vocabulary tree.
3. The method provides a solution for unmarked multi-target tracking; compared with the ARToolKit platform, it can simultaneously track a plurality of identical or different target objects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram illustrating the hierarchical clustering using vocabulary trees according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the results of a search using a trained lexical tree in accordance with an embodiment of the present invention.
FIG. 4 is a system framework diagram of the present invention.
FIG. 5 is a convex quadrilateral after homography transformation of an original image according to an embodiment of the present invention.
Fig. 6 shows an embodiment of the present invention in which an input image containing two objects is captured by a USB camera.
FIG. 7 is a graph showing the results of the search of FIG. 6 according to the embodiment of the present invention.
Fig. 8 is an image of the left object being homography matched according to an embodiment of the present invention.
Fig. 9 is an effect image of performing homography matching on a right object in the embodiment of the present invention.
Fig. 10 is an effect diagram of performing augmented reality tracking on 2 different objects in the embodiment of the present invention.
Fig. 11 is an effect diagram of performing augmented reality tracking on 9 identical objects in the embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in fig. 1, the method for multi-target registration and tracking in unmarked augmented reality provided by this embodiment includes the following two stages:
1) An offline stage: a vocabulary tree model is built by hierarchical k-means clustering, the corresponding image ids are registered in the inverted indexes of all leaf nodes of the vocabulary tree, and the vocabulary tree is finally updated into a vocabulary tree with tf-idf weights according to the frequency of the leaf nodes of the whole tree and the total number of images to be registered; the implementation process is as follows:
a) extracting all SIFT feature point descriptors from the image set to be registered;
b) performing k-means hierarchical clustering according to the branch number of the vocabulary tree;
c) repeating operation b) on all SIFT feature point descriptors under each branch in turn until the leaf nodes are reached;
d) finally, linking one inverted index file to each leaf node at the bottommost layer;
e) extracting SIFT feature point descriptors from each image to be registered again, counting the score weight of each feature point descriptor on each leaf node according to the trained vocabulary tree, and registering the corresponding image in the inverted index;
f) recalculating the vocabulary tree with tf-idf weights according to the length of the inverted index file on each leaf node of the whole tree; if the total number of registered images is N and any leaf node j has N_j images registered in its inverted index, the tf-idf weight of leaf node j is

w_j = ln(N / N_j)

For a registered image d', if m_j feature points fall on leaf node j, the score of that leaf node is d'_j = m_j · w_j; for an image q' to be retrieved, if n_j feature points fall on leaf node j, the score of that leaf node is q'_j = n_j · w_j.
2) An online stage: the online stage is a real-time response system; for the image input by the camera in real time, the most similar image is retrieved according to the vocabulary tree trained in step 1), then the initial pose of the 3D object to be loaded is calculated through camera pose estimation, the motion of the target is tracked with the KLT tracking algorithm, and finally the augmented reality scene of each target is constructed through a rendering thread. The whole system has one recognition thread, a plurality of tracking threads and one rendering thread.
The expression form of multi-target tracking is as follows: given a set of image sequences I1,I2,...,ItThe number of targets in each image is MtThe state of each target has position, attitude and other information, and is composed of a 3 × 4 matrix
Figure GDA0002528701010000094
Represented by the formula:
Figure GDA0002528701010000092
the 3D point of a world coordinate system can be mapped to obtain the position of a 2D point of a camera coordinate system through the following equation:
Figure GDA0002528701010000093
where u, v are the image coordinate system of the camera projection, xw,yw,zwIs the world coordinate system coordinate of a feature point, and A is the internal parameter of the camera, as follows:
Figure GDA0002528701010000101
in the formula, is axAnd ∈yRepresents the focal length, cxAnd cyCoordinates representing principal points, s representing tilt coefficients, which are intrinsic parameters of the camera used to correct the effects of optical deformations, which can be done in an off-line phase; the pose estimation is to calculate an accurate external parameter s of the camera through the visual angles of n more than or equal to 3 reference pointstCalculating a rotation matrix R and a translation matrix f by using an ICP (inductively coupled plasma) iterative method;
then all the object states in each image are represented as
Figure GDA0002528701010000102
The motion track corresponding to the ith target is expressed as
Figure GDA0002528701010000103
Sequence of states S consisting of all image objects1:t={S1,S2,...,St};
Work content of the recognition thread:
In contrast to conventional single-target recognition, multi-target recognition aims to recognize as many targets as possible in an image; suppose I_t contains h targets. The procedure is as follows:
a) according to the input image I_t of the camera, first detect the set P of all feature points using the SIFT algorithm, and extract SIFT feature point descriptors for all points in P;
b) according to the vocabulary tree trained in step 1), take the inner product of each descriptor with the nodes of each layer of the vocabulary tree from top to bottom, determine the final leaf node into which each feature point descriptor falls, and add the tf-idf weight of that node to the score of the corresponding leaf node;
c) take out all the images in the inverted indexes of the leaf nodes whose scores for the input image are nonzero, and obtain the most similar reference image d̂ in the candidate image set using the following similarity calculation. Considering that the image may contain several identical targets, the Euclidean distance calculation would make an image containing more instances of a target appear farther from its reference image even though it is more similar, so the similarity calculation for multi-target detection and recognition adopts the cosine similarity formula:

sim(q, d) = (q · d) / (‖q‖ · ‖d‖)

where q is the vector of all nonzero leaf-node scores of the query image, and d is the vector of all nonzero leaf-node scores of an image to be compared in the image database.
d) search for the homography matrix H, a 3 × 3 matrix; if H exists, recalculate the inlier point set Q using the RANSAC algorithm, and if it does not exist, terminate. Point pairs on two images taken from different perspectives can be related by a projective transformation, as follows:
x'=H·x
where x', x are the coordinates of point pairs on two images from different viewing angles, respectively.
e) transform the 4 boundary points of the reference image d̂ through the homography matrix H using the above formula, and judge whether the quadrilateral a'b'c'e' obtained after the matrix transformation is convex; if not, terminate. Let the height of the original image be h and its width be w; the four points a, b, c, e are transformed by the following formula to obtain the coordinates of a', b', c', e', and whether they form a convex quadrilateral is judged by elementary geometry:

[x_i', y_i', 1]^T ∝ H · [x_i, y_i, 1]^T,  i ∈ {a, b, c, e}

where point a has coordinates (0, 0), point b has coordinates (0, h), point c has coordinates (w, h), point e has coordinates (w, 0), and the transformed points a', b', c', e' have coordinates (x_a, y_a), (x_b, y_b), (x_c, y_c), (x_e, y_e), respectively.
f) calculate the initial pose matrix T_t^i of the target through the ICP iterative algorithm, hand it over to a new tracking thread, and mask the target region;
g) let P = P - (P ∩ Q), and hand the target over to the tracking thread, which updates the pose matrix T_t^i in real time; the recognition thread continues to repeat steps a) to f) on the unmasked region.
Work content of each tracking thread:
Multi-target tracking means that, given an image sequence I_1, I_2, ..., I_t, the set of moving targets in the image sequence is found through multi-target recognition, the moving targets in subsequent frames are associated with one of them, and the motion trajectories of the different targets are given. The tracking algorithm adopted here is the KLT algorithm based on optical flow, which is implemented as follows:
The KLT principle assumes that the same target appears in two frames of images I and J and looks the same over a local window W; then, within the local window W: I(x, y, l) = J(x', y', l + τ). This means that on image I every point (x, y) has moved in one direction by a displacement g = (d_x, d_y), and at time l + τ it corresponds to the point (x', y') = (x + d_x, y + d_y) on image J; the matching problem can therefore be posed as minimizing the following equation:

ε(g) = Σ_{(x, y) ∈ W} [ I(x, y) - J(x + d_x, y + d_y) ]²

The above equation, also called the difference equation, is equivalent to the integral form:

ε(g) = ∫∫_W [ I(x) - J(x + g) ]² w(x) dx

where w(x) is a weighting function, usually set to the constant 1. That is, find the difference between the local window of radius (w_x, w_y) centered at (u_x, u_y) in image I and the window centered at (u_x + g_x, u_y + g_y) in image J; to make ε(g) minimal, set the derivative of the above equation to zero, from which the displacement g can be solved.
Therefore, the specific implementation steps of each tracking thread are as follows:
a) tracking the identified target in real time by using a KLT tracking algorithm;
b) if target tracking is lost, feed back to the recognition thread and require it to restore the region to be detected, i.e. P = P ∪ Q;
c) if target tracking succeeds, feed back to the recognition thread and update the recognition thread's masked region and the pose matrix T_t^i.
The work content of the rendering thread is as follows:
According to the pose matrix T_t^i of each tracking thread, the corresponding 3D models are placed in turn, and the augmented reality scene is rendered through OpenGL.
The unmarked augmented reality multi-target registration tracking method of the embodiment is further described below with reference to specific data and accompanying drawings, which are as follows:
1) off-line phase
The offline stage mainly consists of training the vocabulary tree. The method runs on the Windows 7 operating system, relies on the OpenCV 2.4.10 graphics library, and the code was written and debugged under VS 2012. The image set to be registered is the test image database provided by Wang J Z et al., which comprises 1000 test images of size 384 × 256 or 256 × 384 stored in JPEG format. In the training process, the SIFT algorithm is used to extract feature points from the image set to be registered, and SIFT descriptors (128-dimensional integer vectors) are generated for the feature points; the branch factor of the k-means clustering, i.e. the number of clusters per level, is chosen to be 10 and the depth is chosen to be 6; the structure of the vocabulary tree is shown in FIG. 2. During the experiments, images captured by an ordinary USB camera (resolution 640 × 480) are used as input, and the algorithm runs on a 4-core AMD A6-3400M processor with a main frequency of 1.4 GHz.
First, according to step a), all SIFT descriptors are extracted from the 1000 input images, taking 125.4 seconds in total; according to steps b), c) and d), hierarchical k-means clustering is performed and the inverted index files are linked, and constructing the vocabulary tree takes 70.2 seconds, its size being 19 MB; according to steps e) and f), registering all the images takes 17.7 seconds, and the vocabulary tree with the registered image index has a size of 27 MB.
The more training feature data is extracted, and the larger the branch factor and depth of the vocabulary tree, the more discriminative the vocabulary tree will be during recognition. For a new image registered later, its extracted feature points are traversed through the tree once and the image is registered in the inverted indexes of the leaf nodes they fall into, which takes only milliseconds. A well-trained vocabulary tree can generally be used as a common dictionary, so that a user only needs to be concerned with registering new images and retrieving the images to be recognized. As shown in FIG. 3, retrieving a single image takes 0.226 seconds.
2) On-line phase
The online stage is a real-time response system whose framework is shown in FIG. 4. Recognition and tracking are relatively independent: a new thread is allocated to each recognized target for tracking, each tracking thread feeds its result back to the recognition thread so that recognized regions are not detected again, and finally the tracking threads hand their results to the rendering thread to complete the construction of the augmented reality scene.
The input image captured by the USB camera contains at least two targets; FIG. 6 shows an input image captured by the USB camera that contains two reference targets. The specific steps are as follows:
First, according to step a) of the recognition thread, all key points in the image are extracted with the SIFT detector and SIFT descriptor feature vectors are generated;
according to step b) of the recognition thread, the descriptors of all detected points are assigned to leaf nodes using the trained vocabulary tree, finally yielding the nonzero score vector over all leaf nodes;
according to step c) of the recognition thread, similarity comparison is performed against the weights of the inverted-index image library, and the most similar image is found using the cosine similarity calculation; the retrieval result is shown in FIG. 7;
according to steps d) and e) of the recognition thread, homography detection is performed between the image and the matched image; experiments show that solving the homography on the grayscale image is better than on the color image, as shown in FIG. 8 and FIG. 9, and RANSAC outlier screening is performed at the same time. Whether the original image remains convex after the transformation is then checked, as shown in FIG. 5;
according to step f) of the recognition thread, the initial pose matrix T_t^i of each target is calculated by the ICP iterative method;
finally, according to step g) of the recognition thread, the retrieved image is taken out of the image database and its region is marked as a recognized region, so that the region is not detected again next time.
In the tracking stage, in order to speed up feature point extraction, the feature extraction algorithm is replaced by an ORB detector with a FREAK descriptor, and the KLT optical flow tracking method is adopted. FIG. 10 shows the recognition stage capturing two different registered images that were retrieved, with a cube rendered at the lower left corner of each target; FIG. 11 shows the recognition stage capturing 9 identical registered images, where a tracking thread is added for each recognized target and each target is tracked, with a cube rendered at the lower left corner of each target. The real-time tracking rate is about 30 fps, and the picture is smooth without stuttering.
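A minimal sketch of this feature-extraction swap, written against the OpenCV 2.4.x API used in this embodiment (in OpenCV 3 and 4, FREAK moved to the xfeatures2d contrib module); the file name and keypoint budget are placeholder assumptions:

    #include <opencv2/core/core.hpp>
    #include <opencv2/features2d/features2d.hpp>
    #include <opencv2/highgui/highgui.hpp>
    #include <vector>

    int main() {
        cv::Mat frame = cv::imread("frame.jpg", 0);   // placeholder frame, loaded as grayscale
        cv::OrbFeatureDetector detector(500);         // detect up to 500 ORB keypoints
        cv::FREAK extractor;                          // binary FREAK descriptors
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat descriptors;
        detector.detect(frame, keypoints);
        extractor.compute(frame, keypoints, descriptors);
        // The keypoint locations seed the KLT optical-flow tracker; ORB detection and
        // FREAK description are much cheaper per frame than SIFT, which is why the
        // tracking stage swaps them in.
        return 0;
    }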
The above embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; any modification and optimization made according to the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (2)

1. A multi-target registration tracking method for unmarked augmented reality is characterized by comprising the following steps:
1) an offline stage: a vocabulary tree model is built by hierarchical k-means clustering, the corresponding image ids are registered in the inverted indexes of all leaf nodes of the vocabulary tree, and the vocabulary tree is finally updated into a vocabulary tree with tf-idf weights according to the frequency of the leaf nodes of the whole tree and the total number of images to be registered;
2) an online stage: the online stage is a real-time response system; for the image input by the camera in real time, the most similar image is retrieved according to the vocabulary tree trained in step 1), then the initial pose of the 3D object to be loaded is calculated through camera pose estimation, the motion of the target is tracked with the KLT tracking algorithm, and finally the augmented reality scene of each target is constructed through a rendering thread; the whole system has one recognition thread, a plurality of tracking threads and one rendering thread, and the details are as follows:
The representation of multi-target tracking is as follows: given an image sequence I_1, I_2, ..., I_t, the number of targets in each image is M_t, and the state of each target consists of position and attitude information represented by a 3 × 4 pose matrix

T_t^i = [R | t]

A 3D point in the world coordinate system is mapped to the position of a 2D point in the camera coordinate system through the following equation (equality up to a scale factor):

[u, v, 1]^T ∝ A · [R | t] · [x_w, y_w, z_w, 1]^T

where u, v are the image coordinates of the camera projection, x_w, y_w, z_w are the world coordinates of a feature point, and A is the intrinsic parameter matrix of the camera, as follows:

A = [ α_x  s  c_x ; 0  α_y  c_y ; 0  0  1 ]

where α_x and α_y represent the focal lengths, c_x and c_y the coordinates of the principal point, and s the skew coefficient; these intrinsic parameters of the camera are used to correct the effects of optical distortion and can be obtained in the offline stage; pose estimation calculates the accurate extrinsic parameters of the camera, i.e. the rotation matrix R and the translation vector t, from the views of n ≥ 3 reference points using an ICP (Iterative Closest Point) iterative method;

all target states in each image are then represented as S_t = {T_t^1, T_t^2, ..., T_t^(M_t)}, the motion trajectory of the i-th target is represented as T_(1:t)^i = {T_1^i, T_2^i, ..., T_t^i}, and the state sequence composed of all image targets is S_(1:t) = {S_1, S_2, ..., S_t};
Work content of the recognition thread:
In contrast to conventional single-target recognition, multi-target recognition aims to recognize as many targets as possible in an image; suppose I_t contains h targets. The procedure is as follows:
a) according to the input image I_t of the camera, first detect the set P of all feature points using the SIFT algorithm, and extract SIFT feature point descriptors for all points in P;
b) according to the vocabulary tree trained in the offline stage, take the inner product of each descriptor with the nodes of each layer of the vocabulary tree from top to bottom, determine the final leaf node into which each feature point descriptor falls, and add the tf-idf weight of that node to the score of the corresponding leaf node;
c) take out all the images in the inverted indexes of the leaf nodes whose scores for the input image are nonzero, and obtain the most similar reference image d̂ in the candidate image set using the following similarity calculation; considering that the image may contain several identical targets, the Euclidean distance calculation would make an image containing more instances of a target appear farther from its reference image even though it is more similar, so the similarity calculation for multi-target detection and recognition adopts the cosine similarity formula:

sim(q, d) = (q · d) / (‖q‖ · ‖d‖)

where q is the vector of all nonzero leaf-node scores of the query image, and d is the vector of all nonzero leaf-node scores of an image to be compared in the image database;
d) search for the homography matrix H, a 3 × 3 matrix; if H exists, recalculate the inlier point set Q using the RANSAC algorithm, and if it does not exist, terminate; point pairs on two images taken from different perspectives can be related by a projective transformation, as follows:
x'=H·x
in the formula, x' and x are point pair coordinates on two images with different visual angles respectively;
e) transform the 4 boundary points of the reference image d̂ through the homography matrix H using the above formula, and judge whether the quadrilateral a'b'c'e' obtained after the matrix transformation is convex; if not, terminate; let the height of the original image be h and its width be w; the four points a, b, c, e are transformed by the following formula to obtain the coordinates of a', b', c', e', and whether they form a convex quadrilateral is judged by elementary geometry:

[x_i', y_i', 1]^T ∝ H · [x_i, y_i, 1]^T,  i ∈ {a, b, c, e}

where point a has coordinates (0, 0), point b has coordinates (0, h), point c has coordinates (w, h), point e has coordinates (w, 0), and the transformed points a', b', c', e' have coordinates (x_a, y_a), (x_b, y_b), (x_c, y_c), (x_e, y_e), respectively;
f) calculate the initial pose matrix T_t^i of the target through the ICP iterative algorithm, hand it over to a new tracking thread, and mask the target region;
g) let P = P - (P ∩ Q), and hand the target over to the tracking thread, which updates the pose matrix T_t^i in real time;
the recognition thread continues to repeat steps a) to f) on the unmasked region;
work content of each tracking thread:
multi-target tracking means that, given an image sequence I_1, I_2, ..., I_t, the set of moving targets in the image sequence is found through multi-target recognition, the moving targets in subsequent frames are associated with one of them, and the motion trajectories of the different targets are given; the tracking algorithm adopted here is the KLT algorithm based on optical flow, which is implemented as follows:
the KLT principle assumes that the same target appears in two frames of images I and J and looks the same over a local window W; then, within the local window W: I(x, y, l) = J(x', y', l + τ). This means that on image I every point (x, y) has moved in one direction by a displacement g = (d_x, d_y), and at time l + τ it corresponds to the point (x', y') = (x + d_x, y + d_y) on image J; the matching problem can therefore be posed as minimizing the following equation:

ε(g) = Σ_{(x, y) ∈ W} [ I(x, y) - J(x + d_x, y + d_y) ]²

the above equation, also called the difference equation, is equivalent to the integral form:

ε(g) = ∫∫_W [ I(x) - J(x + g) ]² w(x) dx

where w(x) is a weighting function, usually set to the constant 1; that is, find the difference between the local window of radius (w_x, w_y) centered at (u_x, u_y) in image I and the window centered at (u_x + g_x, u_y + g_y) in image J; to make ε(g) minimal, set the derivative of the above equation to zero, from which the displacement g can be solved;
therefore, the specific implementation steps of each tracking thread are as follows:
a) tracking the identified target in real time by using a KLT tracking algorithm;
b) if target tracking is lost, feed back to the recognition thread and require it to restore the region to be detected, i.e. P = P ∪ Q;
c) if target tracking succeeds, feed back to the recognition thread and update the recognition thread's masked region and the pose matrix T_t^i;
The work content of the rendering thread is as follows:
according to the pose matrix T_t^i of each tracking thread, the corresponding 3D models are placed in turn, and the augmented reality scene is rendered through OpenGL.
2. The multi-target registration tracking method for unmarked augmented reality as claimed in claim 1, wherein in step 1), the off-line stage comprises the following steps:
a) extracting all SIFT feature point descriptors from the image set to be registered;
b) performing k-means hierarchical clustering according to the branch number of the vocabulary tree;
c) repeating operation b) on all SIFT feature point descriptors under each branch in turn until the leaf nodes are reached;
d) finally, linking one inverted index file to each leaf node at the bottommost layer;
e) extracting SIFT feature point descriptors from each image to be registered again, counting the score weight of each feature point descriptor on each leaf node according to the trained vocabulary tree, and registering the corresponding image in the inverted index;
f) recalculating the vocabulary tree with tf-idf weights according to the length of the inverted index file on each leaf node of the whole tree; if the total number of registered images is N and any leaf node j has N_j images registered in its inverted index, the tf-idf weight of leaf node j is

w_j = ln(N / N_j)

for a registered image d', if m_j feature points fall on leaf node j, the score of that leaf node is d'_j = m_j · w_j; for an image q' to be retrieved, if n_j feature points fall on leaf node j, the score of that leaf node is q'_j = n_j · w_j.
CN201810096334.1A 2018-01-31 2018-01-31 Unmarked augmented reality multi-target registration tracking method Expired - Fee Related CN108364302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810096334.1A CN108364302B (en) 2018-01-31 2018-01-31 Unmarked augmented reality multi-target registration tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810096334.1A CN108364302B (en) 2018-01-31 2018-01-31 Unmarked augmented reality multi-target registration tracking method

Publications (2)

Publication Number Publication Date
CN108364302A CN108364302A (en) 2018-08-03
CN108364302B true CN108364302B (en) 2020-09-22

Family

ID=63007579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810096334.1A Expired - Fee Related CN108364302B (en) 2018-01-31 2018-01-31 Unmarked augmented reality multi-target registration tracking method

Country Status (1)

Country Link
CN (1) CN108364302B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978829B (en) * 2019-02-26 2021-09-28 深圳市华汉伟业科技有限公司 Detection method and system for object to be detected
CN110473259A (en) * 2019-07-31 2019-11-19 深圳市商汤科技有限公司 Pose determines method and device, electronic equipment and storage medium
CN112734797A (en) * 2019-10-29 2021-04-30 浙江商汤科技开发有限公司 Image feature tracking method and device and electronic equipment
CN111402579A (en) * 2020-02-29 2020-07-10 深圳壹账通智能科技有限公司 Road congestion degree prediction method, electronic device and readable storage medium
CN112000219B (en) * 2020-03-30 2022-06-14 华南理工大学 Movable gesture interaction method for augmented reality game
CN112884048A (en) * 2021-02-24 2021-06-01 浙江商汤科技开发有限公司 Method for determining registration image in input image, and related device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103177468A (en) * 2013-03-29 2013-06-26 渤海大学 Three-dimensional motion object augmented reality registration method based on no marks
CN104966307A (en) * 2015-07-10 2015-10-07 成都品果科技有限公司 AR (augmented reality) algorithm based on real-time tracking
WO2016048366A1 (en) * 2014-09-26 2016-03-31 Hewlett Packard Enterprise Development Lp Behavior tracking and modification using mobile augmented reality
CN106843493A (en) * 2017-02-10 2017-06-13 深圳前海大造科技有限公司 A kind of augmented reality implementation method of picture charge pattern method and application the method
KR20180005430A (en) * 2016-07-06 2018-01-16 윤상현 Augmented reality realization system for image

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103177468A (en) * 2013-03-29 2013-06-26 渤海大学 Three-dimensional motion object augmented reality registration method based on no marks
WO2016048366A1 (en) * 2014-09-26 2016-03-31 Hewlett Packard Enterprise Development Lp Behavior tracking and modification using mobile augmented reality
CN104966307A (en) * 2015-07-10 2015-10-07 成都品果科技有限公司 AR (augmented reality) algorithm based on real-time tracking
KR20180005430A (en) * 2016-07-06 2018-01-16 윤상현 Augmented reality realization system for image
CN106843493A (en) * 2017-02-10 2017-06-13 深圳前海大造科技有限公司 A kind of augmented reality implementation method of picture charge pattern method and application the method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lin Yi. "Research on the Construction and Optimization of Context-Aware Mobile Augmented Reality Browsers". China Master's Theses Full-text Database, Information Science and Technology, 2016-03-15 (Vol. 2016, No. 3), Section 2.3 of the main text. *
Nistér D et al. "Scalable Recognition with a Vocabulary Tree". CVPR 2006, 2006, Vol. 2, pp. 2161-2168. *

Also Published As

Publication number Publication date
CN108364302A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108364302B (en) Unmarked augmented reality multi-target registration tracking method
Rogez et al. Lcr-net++: Multi-person 2d and 3d pose detection in natural images
Ramesh et al. Dart: distribution aware retinal transform for event-based cameras
US10755128B2 (en) Scene and user-input context aided visual search
Zimmermann et al. Learning to estimate 3d hand pose from single rgb images
CN109753940B (en) Image processing method and device
Lim et al. Real-time image-based 6-dof localization in large-scale environments
Kendall et al. Posenet: A convolutional network for real-time 6-dof camera relocalization
Hagbi et al. Shape recognition and pose estimation for mobile augmented reality
Xu et al. Pano2cad: Room layout from a single panorama image
CN110472585B (en) VI-S L AM closed-loop detection method based on inertial navigation attitude track information assistance
Rahman et al. 3D object detection: Learning 3D bounding boxes from scaled down 2D bounding boxes in RGB-D images
CN106203423B (en) Weak structure perception visual target tracking method fusing context detection
CN113537208A (en) Visual positioning method and system based on semantic ORB-SLAM technology
Rafi et al. Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos
Pilet et al. Virtually augmenting hundreds of real pictures: An approach based on learning, retrieval, and tracking
CN109272577B (en) Kinect-based visual SLAM method
Ratan et al. Object detection and localization by dynamic template warping
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
Madadi et al. Occlusion aware hand pose recovery from sequences of depth images
Szeliski et al. Feature detection and matching
Ren et al. An investigation of skeleton-based optical flow-guided features for 3D action recognition using a multi-stream CNN model
CN111402331A (en) Robot repositioning method based on visual word bag and laser matching
Guan et al. Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems
CN112861808B (en) Dynamic gesture recognition method, device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200922

CF01 Termination of patent right due to non-payment of annual fee