CN110009683A - Real-time on-plane object detection method based on MaskRCNN - Google Patents

Real-time on-plane object detection method based on MaskRCNN

Info

Publication number
CN110009683A
Authority
CN
China
Prior art keywords
frame
pose
perspective
view
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910250262.6A
Other languages
Chinese (zh)
Other versions
CN110009683B (en)
Inventor
Lin Chunyu
Wang Xudong
Zhao Yao
Liu Meiqin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN201910250262.6A priority Critical patent/CN110009683B/en
Publication of CN110009683A publication Critical patent/CN110009683A/en
Application granted granted Critical
Publication of CN110009683B publication Critical patent/CN110009683B/en
Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

An embodiment of the invention provides a real-time on-plane object detection method based on MaskRCNN, comprising: step 1, acquiring each frame image of a video stream, computing the pose of the frame with ORB-SLAM2, and saving the pose and the corresponding image into a global array; step 2, adding a deep learning detection thread to ORB-SLAM2, the thread extracting two adjacent frame images from the global array, computing the projection of each of the two frames from its pose, detecting with MaskRCNN the pixels belonging to objects on the second frame's projection, obtaining the translation of feature points by matching features between the two adjacent projections so that the object pixels on the first frame's projection can be derived, inversely transforming the pixels of the two frame images according to the poses, and triangulating the match points after the inverse transformation to obtain the world coordinates of the objects; step 3, computing the pixel coordinates of the objects in the current frame from the pose of the current frame and the world coordinates of the objects; if the camera pose fits a plane, the detected objects are not rendered, only the non-detected regions are rendered, and AR objects are inserted onto the detected objects.

Description

Real-time on-plane object detection method based on MaskRCNN
Technical field
The present invention relates to the technical field of object recognition in images, and more particularly to a real-time on-plane object detection method based on MaskRCNN.
Background Art
With the rapid development of science and technology, fantasies are gradually becoming reality. The movie "Iron Man" brought us a strong visual shock, and augmented reality was used in it to realize Iron Man's in-suit operations. Early augmented reality was realized by placing a marker image in the real scene and comparing each frame of the video stream against the marker in real time, but in many scenes no marker can be placed, so realizing augmented reality by recognizing natural features has become a focus of research.
There are many kinds of natural features, and the plane is the most common and most easily exploited structure. When a plane is recognized, a virtual object can be rendered onto it by three-dimensional registration: for example, two dinosaurs can walk on a table, a football match can be played on the table, or the desktop wallpaper can be changed freely. However, there are often objects on the plane; to improve realism, the boundaries of those objects must be detected accurately, and only the plane rendered. Deep learning is more robust than traditional algorithms, and there are many instance segmentation algorithms, among which MaskRCNN, proposed by Kaiming He's team, achieves good recognition results; however, it takes about 2 s to process one frame, which makes the high accuracy of deep learning difficult to apply in a real-time system.
Summary of the invention
Embodiments of the present invention provide a real-time on-plane object detection method based on MaskRCNN to overcome the defects of the prior art.
To achieve the above goal, the present invention adopts the following technical solution.
A real-time on-plane object detection method based on MaskRCNN, comprising:
Step 1: acquire each frame image of a video stream, compute the pose of the frame with ORB-SLAM2, and save the pose of the frame and the corresponding image into a global array.
Step 2: add a deep learning detection thread to ORB-SLAM2. The deep learning thread extracts two adjacent frame images from the global array and computes the projection of each of the two frames from its pose. MaskRCNN detects the pixels belonging to objects on the second frame's projection; matching features between the two adjacent projections yields the translation of the feature points, from which the object pixels on the first frame's projection are derived. The pixels of the two frame images are then inversely transformed according to the poses, and the match points after the inverse transformation are triangulated to obtain the world coordinates of the objects.
Step 3: compute the pixel coordinates of the objects in the current frame from the pose of the current frame and the world coordinates of the objects. If the camera pose fits a plane, the detected objects are not rendered, only the non-detected regions are rendered, and AR objects are inserted onto the detected objects.
Preferably, step 2 includes:
1) the deep learning thread obtains and saves two adjacent, sharp frame images and projects both onto the xoy plane;
2) the feature points of the projections of the two frame images are matched with ORB, and mismatched points are filtered out with the RANSAC algorithm;
3) the translation of the feature points between the two projections is computed, and MaskRCNN detects the pixels belonging to objects on the second frame's projection, i.e. the match points of the objects on the second projection;
4) from the known translation of the feature points and the object match points on the second projection, the match points of the objects on the first projection are computed;
5) the object match points on the two projections are transformed back into the two original frame images, and the world coordinates of the objects corresponding to the match points are recovered by triangulating the corresponding match points.
Preferably, transforming the object match points on the two projections back into the two original frame images and recovering the world coordinates of the objects by triangulating the corresponding match points includes:
For the two frame images I1, I2, taking I1 as reference, the transformation matrix of I2 is T. In theory the lines O1P1 and O2P2 should intersect at a point P, but due to noise the two lines often do not intersect. According to the definition of epipolar geometry, $x_1$ and $x_2$ should satisfy the relation below, where $x_1$, $x_2$ denote the normalized coordinates of corresponding feature points in the two frames:

$$s_1 x_1 = s_2 R x_2 + t \qquad (1)$$

The camera rotation R and translation t have already been estimated. To obtain the depths $s_1$, $s_2$ of a match point, first solve for $s_2$ by left-multiplying both sides of formula (1) by $x_1^{\wedge}$ (the skew-symmetric matrix of $x_1$):

$$s_1 x_1^{\wedge} x_1 = 0 = s_2 x_1^{\wedge} R x_2 + x_1^{\wedge} t \qquad (2)$$

The left side of formula (2) is zero, so the right side is an equation in $s_2$ alone, from which $s_2$ is solved; $s_1$ follows similarly. Because of noise, the estimated R and t do not make formula (2) exactly zero, so the result is obtained by least squares.
Preferably, step 3 includes: computing, from the computed world coordinates of the objects and the pose of the current frame, the pixel coordinates in the current frame, as in the following formula:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/du & 0 & u_0 \\ 0 & f/dv & v_0 \\ 0 & 0 & 1 \end{bmatrix} [R\ \ t] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (3)$$

where $[R\ t]$ is the pose of the current frame, $[X_w\ Y_w\ Z_w\ 1]^T$ is the homogeneous world coordinate of the object, $f/du$, $f/dv$, $u_0$, $v_0$ are the camera intrinsics, $Z_c$ is a scale parameter, and $(u, v)$ is the pixel coordinate of the object in the current frame. According to this formula, the world coordinate $(x, y, z)$ of the object is transformed into the pixel coordinate system of the current frame via the pose $[R\ t]$; the object pixels are determined from the transformation result, and when the plane is rendered the object pixels are not rendered.
As can be seen from the technical solution provided by the above embodiment, the embodiment of the present invention uses multithreading to separate the front end and the back end: the front end computes the pose of each image, and the back end detects objects with MaskRCNN. Since the world coordinates of an object do not change, the object's world coordinates are still available even if the current frame is not detected; the projection-based idea proposed herein computes the world coordinates of objects from earlier frames, and once the world coordinates are obtained they can be converted with the world-to-pixel conversion formula. The invention thus guarantees both real-time speed and high accuracy, making real-time deep learning feasible.
Additional aspects and advantages of the invention will be set forth in part in the following description; they will become apparent from the description, or may be learned by practice of the invention.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the ORB-SLAM2 algorithm framework according to an embodiment of the present invention;
Fig. 2 compares the results of the RCNN and MaskRCNN algorithms according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of real-time on-plane object detection based on MaskRCNN according to an embodiment of the present invention;
Fig. 4 shows the detection result of the second frame according to an embodiment of the present invention;
Fig. 5 shows the matching process between the first and second frame images according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of recovering depth by triangulation;
Fig. 7 is an augmented reality result image according to an embodiment of the present invention.
Detailed Description of the Embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numbers denote the same or similar elements or elements with the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. Furthermore, "connected" or "coupled" as used herein may include wireless connection or coupling. The wording "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meaning in the context of the prior art and, unless defined as here, will not be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the present invention, several specific embodiments are further explained below with reference to the accompanying drawings, and none of the embodiments limits the embodiments of the present invention.
The embodiment of the invention provides a real-time on-plane object detection method based on MaskRCNN, comprising:
Step 1: acquire each frame image of a video stream, compute the pose of the frame with ORB-SLAM2, and save the pose of the frame and the corresponding image into a global array.
Step 2: add a deep learning detection thread to ORB-SLAM2. The deep learning thread extracts two adjacent frame images from the global array and computes the projection of each of the two frames from its pose. MaskRCNN detects the pixels belonging to objects on the second frame's projection; matching features between the two adjacent projections yields the translation of the feature points, from which the object pixels on the first frame's projection are derived. The pixels of the two frame images are then inversely transformed according to the poses, and the match points after the inverse transformation are triangulated to obtain the world coordinates of the objects.
Step 3: compute the pixel coordinates of the objects in the current frame from the pose of the current frame and the world coordinates of the objects. If the camera pose fits a plane, the detected objects are not rendered, only the non-detected regions are rendered, and AR objects are inserted onto the detected objects.
Fig. 1 shows the flowchart of ORB-SLAM2. The present invention is developed on the ORB-SLAM2 open framework, which contains four threads running simultaneously: front-end odometry, back-end optimization, loop-closure detection, and map building. The front-end odometry roughly computes the camera pose, i.e. the camera's rotation angle and translation, from the input video stream. Back-end optimization refines the camera pose, because pose estimation is usually recursive from the previous frame and accumulates error. Loop-closure detection checks whether the camera has passed a position before, so that the error can be reduced. Map building computes the positions of the feature points in world coordinates and draws the map.
Fig. 2 compares the results of the RCNN and MaskRCNN algorithms. To improve realism, we need a deep learning method to detect object boundaries accurately, but detection algorithms can only output a rectangular bounding box, as shown in Fig. 2(a). An accurate object boundary therefore requires a segmentation algorithm. MaskRCNN, proposed by Kaiming He's team, won the ICCV 2017 best paper award; the algorithm is highly accurate, and its result is shown in Fig. 2(b). The present invention therefore uses the MaskRCNN algorithm to detect the objects on the plane.
Fig. 3 shows the flow of real-time on-plane object detection based on MaskRCNN. The present invention is developed on ORB-SLAM2, which provides the feature points of every frame image in the video stream, the world coordinates corresponding to those feature points, and the pose of the frame. ORB-SLAM2 processes the incoming video stream frame by frame in a loop. If the object segmentation result were computed directly on the current frame with MaskRCNN, then, since MaskRCNN takes about 2 seconds per image, the thread would block and the next frame image would only appear after two seconds, falling far short of real time. Therefore the present invention adds a thread to ORB-SLAM2 that detects objects with MaskRCNN in the background, here named the deep learning detection thread, so that it can compute results independently. After the deep learning thread finishes detection, the world coordinates of the objects are kept in a global array. The main thread reads the object coordinate array; if values exist, the detection is considered successful, and the world coordinates of the objects are transformed into the pixel coordinate system of the current frame according to the current pose. If the camera pose fits a plane, the detected objects are not rendered, only the non-detected regions are rendered, and AR objects are inserted onto them. The specific detection method is as follows:
1. Adding a deep learning detection thread to ORB-SLAM2
Running MaskRCNN detection on the current frame would be far too slow to be real time, so a new thread, named the deep learning detection thread, is created. The poses computed by ORB-SLAM2 and the corresponding images are stored in a global array; the deep learning thread extracts data from the global array only when detecting and returns the computed object depth. Since an object does not move in world coordinates, only the object's position in the pixel coordinate system of the current frame needs to be computed from the calculated depth. After several runs of the deep learning thread, i.e. about half a minute, fairly accurate world coordinates of the objects are obtained and marked on the current frame, which ensures real-time detection.
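As an illustration only, the two-thread structure could look like the Python sketch below. The patent discloses no source code (ORB-SLAM2 itself is a C++ framework), so the names global_array, slam.track, run_maskrcnn and triangulate are hypothetical placeholders for the components described above:

```python
import threading
import time
from collections import deque

global_array = deque(maxlen=100)   # (pose, image) pairs from the tracking front end
object_world_coords = []           # detection results read by the main/render thread
lock = threading.Lock()

def tracking_thread(video_stream, slam):
    """Front end: compute the pose of every frame and store it globally."""
    for image in video_stream:
        pose = slam.track(image)               # ORB-SLAM2-style pose estimate
        with lock:
            global_array.append((pose, image))

def deep_learning_detection_thread(run_maskrcnn, triangulate):
    """Back end: runs MaskRCNN (about 2 s per frame) without blocking tracking."""
    while True:
        with lock:
            frames = list(global_array)[-2:]   # two most recent adjacent frames
        if len(frames) < 2:
            time.sleep(0.01)
            continue
        (pose1, img1), (pose2, img2) = frames
        mask2 = run_maskrcnn(img2)             # object pixels on the second frame
        pts_w = triangulate(pose1, img1, pose2, img2, mask2)
        with lock:
            object_world_coords[:] = pts_w     # world coordinates never change
```

Because the detected world coordinates never change, the render loop can keep projecting the last result into every new frame while the slow back end catches up.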
2. Computing world coordinates based on projection
Detecting one frame image yields the pixels of the objects, but a single frame cannot be triangulated to recover depth. The deep learning thread therefore has to detect two frames, and the object pixels are matched as feature points; in this invention, the object pixels detected by deep learning are tentatively called match points.
The invention proposes a projection-based matching idea. The time interval between two adjacent frames is about 30 ms, and the image changes very little in such a short time. If the two adjacent frames are projected onto the xoy plane, it can be assumed that there is almost no change along the z axis and that the two frames differ only by a displacement within the xoy plane. Based on this idea, the match points of the previous frame's xoy-plane projection can easily be derived. The main flow is as follows (a code sketch follows the list):
Step 1: the deep learning thread obtains and saves two adjacent, sharp frame images and projects both onto the xoy plane;
Step 2: the feature points of the two projections are matched with ORB, and mismatched points are filtered out with the RANSAC algorithm;
Step 3: the translation of the feature points between the two projections is computed, and MaskRCNN detects the pixels belonging to objects on the second frame's projection, i.e. the match points of the objects on the second projection;
Step 4: from the known translation of the feature points and the object match points on the second projection, the match points of the objects on the first projection are computed;
Step 5: the object match points on the two projections are transformed back into the two original frame images.
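A minimal OpenCV sketch of Steps 2–4 is given below, under stated assumptions: the two projected images proj1, proj2 and the MaskRCNN object pixels mask_points2 (an N×2 array on the second projection) are already available, and the projections differ by a pure translation as argued above. The function name is ours, not the patent's:

```python
import cv2
import numpy as np

def transfer_object_points(proj1, proj2, mask_points2):
    """Match ORB features between the two xoy-plane projections, filter
    mismatches with RANSAC, and shift the object match points of the
    second projection onto the first (Steps 2-4)."""
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(proj1, None)
    kp2, des2 = orb.detectAndCompute(proj2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # A robust RANSAC fit rejects the mismatched feature pairs
    M, inliers = cv2.estimateAffinePartial2D(pts2, pts1, method=cv2.RANSAC)
    t = M[:, 2]               # keep only the translation column of the 2x3 fit
    return mask_points2 + t   # object match points on the first projection
```

Under the pure-translation assumption, the linear part of M should be close to the identity, so keeping only the translation column t is consistent with Step 3.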
The detection result of the second frame is shown in Fig. 4, and the matching computation of the first frame is shown in Fig. 5.
After the results of the two frames are obtained, the two projections are inversely transformed according to the poses, and the world coordinates corresponding to the match points are recovered by triangulating the match points after the inverse transformation.
The triangulation process is shown in Fig. 6. For the two frames I1, I2, taking the left image as reference, the transformation matrix of the right image is T. In theory the lines O1P1 and O2P2 should intersect at a point P, but due to noise the two lines often do not intersect. According to the definition of epipolar geometry, $x_1$ and $x_2$ should satisfy the following formula, where $x_1$, $x_2$ denote the normalized coordinates of corresponding feature points in the two frames:

$$s_1 x_1 = s_2 R x_2 + t \qquad (1)$$

The camera rotation R and translation t have already been estimated. To obtain the depths $s_1$, $s_2$ of a match point, first solve for $s_2$ by left-multiplying both sides of formula (1) by $x_1^{\wedge}$ (the skew-symmetric matrix of $x_1$):

$$s_1 x_1^{\wedge} x_1 = 0 = s_2 x_1^{\wedge} R x_2 + x_1^{\wedge} t \qquad (2)$$

The left side of formula (2) is zero, so the right side is an equation in $s_2$ alone, from which $s_2$ can be solved; $s_1$ follows similarly. Since noise is present, the estimated R and t do not necessarily make formula (2) exactly zero, so the common practice is to obtain the result by least squares.
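For illustration, a direct least-squares solution of formula (2) can be sketched in a few lines of numpy; x1 and x2 are the normalized homogeneous coordinates defined above, and the helper names are ours, not the patent's:

```python
import numpy as np

def skew(v):
    """x^ : skew-symmetric matrix such that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0,  -v[2],  v[1]],
                     [v[2],  0.0,  -v[0]],
                     [-v[1], v[0],  0.0]])

def triangulate_depths(R, t, x1, x2):
    """Solve s1*x1 = s2*R*x2 + t (formula (1)) for the depths s1, s2.
    Left-multiplying by x1^ gives s2*(x1^ R x2) = -(x1^ t) (formula (2)),
    three equations in the single scalar s2, solved by least squares."""
    A = skew(x1) @ R @ x2              # 3-vector coefficient of s2
    b = -skew(x1) @ t                  # right-hand side
    s2 = float(A @ b) / float(A @ A)   # least-squares solution of A*s2 = b
    # back-substitute into formula (1) to get s1, again by least squares
    s1 = float(x1 @ (s2 * (R @ x2) + t)) / float(x1 @ x1)
    return s1, s2
```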
Once the world coordinates of an object are obtained, the corresponding pixels can be computed from the pose of the current frame. MaskRCNN therefore only needs to detect in the background, while the result is computed and rendered on the image in real time, combining high accuracy with real-time performance.
3. Fusing the virtual object
Having computed the world coordinates of the object as above, it only remains to project them with the pose of the current frame, as in the following formula:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/du & 0 & u_0 \\ 0 & f/dv & v_0 \\ 0 & 0 & 1 \end{bmatrix} [R\ \ t] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (3)$$

where $[R\ t]$ is the pose of the current frame, $[X_w\ Y_w\ Z_w\ 1]^T$ is the homogeneous world coordinate of the object, $f/du$, $f/dv$, $u_0$, $v_0$ are the camera intrinsics, $Z_c$ is a scale parameter, and $(u, v)$ is the pixel coordinate of the object in the current frame. According to this formula, the world coordinate $(x, y, z)$ of the object is transformed into the pixel coordinate system of the current frame via the pose $[R\ t]$.
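A short numpy sketch of formula (3) follows; the helper is our own, assuming the intrinsics fx = f/du, fy = f/dv, u0, v0 are known from calibration:

```python
import numpy as np

def world_to_pixel(Xw, R, t, fx, fy, u0, v0):
    """Project a 3-D world point into the current frame's pixel coordinate
    system: Zc*[u, v, 1]^T = K [R|t] [Xw, Yw, Zw, 1]^T (formula (3))."""
    K = np.array([[fx,  0.0, u0],
                  [0.0, fy,  v0],
                  [0.0, 0.0, 1.0]])
    Xc = R @ Xw + t    # world -> camera coordinates via the pose [R t]
    uvw = K @ Xc       # homogeneous pixel coordinates
    Zc = uvw[2]        # scale parameter
    return uvw[0] / Zc, uvw[1] / Zc   # pixel coordinate (u, v)
```

Every world-coordinate match point of a detected object is projected this way; the resulting pixels are the ones skipped when the plane is rendered.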
According to the transformation result, the pixels belonging to objects in the current frame can be determined; when the plane is rendered, the object pixels are not rendered. The effect is shown in Fig. 7.
In conclusion the thought that the embodiment of the present invention uses multithreading realizes front and back end separation, calculating figure is responsible in front end The pose of picture, rear end then use MaskRCNN to carry out detection object, due to the world coordinates of object be there is no transformation, Even if its world coordinates can still be obtained by not detecting current frame image, the thought proposed in this paper based on projection being capable of basis The world coordinates that frame before calculates object can be according to world coordinates and pixel coordinate after obtaining the world coordinates of object Conversion formula is converted, this invention ensures that high rate and high-accuracy in real time, so that deep learning becomes possible in real time.
Those of ordinary skill in the art will appreciate that the drawings are schematic diagrams of one embodiment, and the modules or processes in the drawings are not necessarily required for implementing the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in each embodiment, or in certain parts of the embodiments, of the present invention.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A real-time on-plane object detection method based on MaskRCNN, characterized by comprising:
Step 1: acquiring each frame image of a video stream, computing the pose of the frame with ORB-SLAM2, and saving the pose of the frame and the corresponding image into a global array;
Step 2: adding a deep learning detection thread to ORB-SLAM2, the deep learning thread extracting data from the global array, extracting two adjacent frame images from the array, computing the projection of each of the two adjacent frames from its pose, detecting with MaskRCNN the pixels belonging to objects on the second frame's projection, obtaining the translation of feature points by matching features between the two adjacent projections so that the object pixels on the first frame's projection can be derived, inversely transforming the pixels of the two frame images according to the poses, and triangulating the match points after the inverse transformation to obtain the world coordinates of the objects;
Step 3: computing the pixel coordinates of the objects in the current frame from the pose of the current frame and the world coordinates of the objects; if the camera pose fits a plane, the detected objects are not rendered, only the non-detected regions are rendered, and AR objects are inserted onto the detected objects.
2. The method according to claim 1, characterized in that step 2 includes:
1) the deep learning thread obtains and saves two adjacent, sharp frame images and projects both onto the xoy plane;
2) the feature points of the projections of the two frame images are matched with ORB, and mismatched points are filtered out with the RANSAC algorithm;
3) the translation of the feature points between the two projections is computed, and MaskRCNN detects the pixels belonging to objects on the second frame's projection, i.e. the match points of the objects on the second projection;
4) from the known translation of the feature points and the object match points on the second projection, the match points of the objects on the first projection are computed;
5) the object match points on the two projections are transformed back into the two original frame images, and the world coordinates of the objects corresponding to the match points are recovered by triangulating the corresponding match points.
3. The method according to claim 2, characterized in that transforming the object match points on the two projections back into the two original frame images and recovering the world coordinates of the objects by triangulating the corresponding match points comprises:
for the two frame images I1, I2, taking I1 as reference, the transformation matrix of I2 is T; in theory the lines O1P1 and O2P2 should intersect at a point P, but due to noise the two lines often do not intersect; according to the definition of epipolar geometry, $x_1$ and $x_2$ should satisfy the relation below, where $x_1$, $x_2$ denote the normalized coordinates of corresponding feature points in the two frames:

$$s_1 x_1 = s_2 R x_2 + t \qquad (1)$$

the camera rotation R and translation t having been estimated, to obtain the depths $s_1$, $s_2$ of a match point, $s_2$ is solved first by left-multiplying both sides of formula (1) by $x_1^{\wedge}$:

$$s_1 x_1^{\wedge} x_1 = 0 = s_2 x_1^{\wedge} R x_2 + x_1^{\wedge} t \qquad (2)$$

the left side of formula (2) is zero, so the right side is an equation in $s_2$ alone, from which $s_2$ is solved, and $s_1$ follows similarly; since noise is present and the estimated R and t do not make formula (2) exactly zero, the result is obtained by least squares.
4. The method according to claim 1, characterized in that step 3 includes: computing, from the computed world coordinates of the objects and the pose of the current frame, the pixel coordinates in the current frame, as in the following formula:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/du & 0 & u_0 \\ 0 & f/dv & v_0 \\ 0 & 0 & 1 \end{bmatrix} [R\ \ t] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (3)$$

where $[R\ t]$ is the pose of the current frame, $[X_w\ Y_w\ Z_w\ 1]^T$ is the homogeneous world coordinate of the object, $f/du$, $f/dv$, $u_0$, $v_0$ are the camera intrinsics, $Z_c$ is a scale parameter, and $(u, v)$ is the pixel coordinate of the object in the current frame; according to this formula, the world coordinate $(x, y, z)$ of the object is transformed into the pixel coordinate system of the current frame via the pose $[R\ t]$, the object pixels are determined from the transformation result, and when the plane is rendered the object pixels are not rendered.
CN201910250262.6A 2019-03-29 2019-03-29 Real-time on-plane object detection method based on MaskRCNN Expired - Fee Related CN110009683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910250262.6A CN110009683B (en) 2019-03-29 2019-03-29 Real-time on-plane object detection method based on MaskRCNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910250262.6A CN110009683B (en) 2019-03-29 2019-03-29 Real-time on-plane object detection method based on MaskRCNN

Publications (2)

Publication Number Publication Date
CN110009683A true CN110009683A (en) 2019-07-12
CN110009683B CN110009683B (en) 2021-03-30

Family

ID=67168909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910250262.6A Expired - Fee Related CN110009683B (en) 2019-03-29 2019-03-29 Real-time on-plane object detection method based on MaskRCNN

Country Status (1)

Country Link
CN (1) CN110009683B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110599542A (en) * 2019-08-30 2019-12-20 北京影谱科技股份有限公司 Method and device for local mapping of adaptive VSLAM (virtual local area model) facing to geometric area
CN110852211A (en) * 2019-10-29 2020-02-28 北京影谱科技股份有限公司 Neural network-based method and device for filtering obstacles in SLAM

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6151639A (en) * 1997-06-19 2000-11-21 Sun Microsystems, Inc. System and method for remote object invocation
US20120240134A1 (en) * 2000-12-14 2012-09-20 Borland Software Corporation Method for efficient location of corba objects based on an unmarshaled object key in a request
CN102129708A (en) * 2010-12-10 2011-07-20 北京邮电大学 Fast multilevel imagination and reality occlusion method at actuality enhancement environment
CN103955889A (en) * 2013-12-31 2014-07-30 广东工业大学 Drawing-type-work reviewing method based on augmented reality technology
CN106845435A (en) * 2017-02-10 2017-06-13 深圳前海大造科技有限公司 A kind of augmented reality Implementation Technology based on material object detection tracing algorithm
CN106843493A (en) * 2017-02-10 2017-06-13 深圳前海大造科技有限公司 A kind of augmented reality implementation method of picture charge pattern method and application the method
CN106971406A (en) * 2017-03-06 2017-07-21 广州视源电子科技股份有限公司 The detection method and device of object pose
CN108520554A (en) * 2018-04-12 2018-09-11 无锡信捷电气股份有限公司 A kind of binocular three-dimensional based on ORB-SLAM2 is dense to build drawing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
John McCormac et al., "Fusion++: Volumetric Object-Level SLAM", 2018 International Conference on 3D Vision. *
Pengcheng Zhao et al., "Panoramic Image and Three-Axis Laser Scanner Integrated Approach for Indoor 3D Mapping", Remote Sensing. *


Also Published As

Publication number Publication date
CN110009683B (en) 2021-03-30

Similar Documents

Publication Publication Date Title
Hu et al. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries
Xu et al. Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection
CN110427905A (en) Pedestrian tracting method, device and terminal
Huang et al. A pointing gesture based egocentric interaction system: Dataset, approach and application
CN109166149A (en) A kind of positioning and three-dimensional wire-frame method for reconstructing and system of fusion binocular camera and IMU
CN108256504A (en) A kind of Three-Dimensional Dynamic gesture identification method based on deep learning
CN102075686B (en) Robust real-time on-line camera tracking method
Ricco et al. Dense lagrangian motion estimation with occlusions
CN109341703A (en) A kind of complete period uses the vision SLAM algorithm of CNNs feature detection
CN104123529A (en) Human hand detection method and system thereof
CN102222341A (en) Method and device for detecting motion characteristic point and method and device for detecting motion target
CN110263605A (en) Pedestrian's dress ornament color identification method and device based on two-dimension human body guise estimation
CN105869166A (en) Human body action identification method and system based on binocular vision
Chalom et al. Measuring image similarity: an overview of some useful applications
Hallquist et al. Single view pose estimation of mobile devices in urban environments
CN110009683A (en) Object detecting method on real-time planar based on MaskRCNN
You et al. MISD-SLAM: multimodal semantic SLAM for dynamic environments
CN112001217A (en) Multi-person human body posture estimation algorithm based on deep learning
Abdellali et al. L2D2: Learnable line detector and descriptor
CN104504162B (en) A kind of video retrieval method based on robot vision platform
Afif et al. Vision-based tracking technology for augmented reality: a survey
Kondori et al. Direct hand pose estimation for immersive gestural interaction
Park et al. Strategy for Creating AR Applications in Static and Dynamic Environments Using SLAM- and Marker Detector-Based Tracking.
Li et al. Temporal feature correlation for human pose estimation in videos
Teng et al. Tree Segmentation from an Image.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210330