CN110084850A

CN110084850A - Dynamic scene visual positioning method based on image semantic segmentation

Info

Publication number: CN110084850A
Application number: CN201910270280.0A
Authority: CN
Inventors: 潘树国; 盛超; 曾攀; 黄砺枭; 赵涛; 王帅; 高旺
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-08-02
Anticipated expiration: 2039-04-04
Also published as: CN110084850B

Abstract

The invention discloses a dynamic scene visual positioning method based on image semantic segmentation, belonging to the field of SLAM (Simultaneous Localization and Mapping). The method first uses supervised deep learning to segment the dynamic objects in the original image, obtaining a semantic image. On this basis, ORB feature points are extracted from the original image, and the feature points on dynamic objects are rejected according to the semantic image. Finally, using the feature points that remain after rejection, a point-feature-based monocular SLAM method localizes and tracks the camera motion. Positioning results show that, compared with the conventional method, the disclosed method improves positioning accuracy in dynamic scenes by 13% to 30%.

Description

Dynamic scene visual positioning method based on image semantic segmentation

Technical field

The present invention relates to the application of deep learning to visual SLAM and belongs to the field of SLAM (Simultaneous Localization and Mapping).

Background art

Simultaneous localization and mapping (SLAM) is a key technology for autonomous robot operation in unknown environments. Based on the environmental data detected by the robot's external sensors, SLAM constructs a map of the robot's surroundings while also giving the robot's position within that map. Compared with ranging instruments such as radar and sonar, visual sensors are small, consume little power, and collect rich information, and they can provide abundant texture information about the external environment. Visual SLAM has therefore become a focus of current research and is applied in fields such as autonomous navigation and VR/AR.

Traditional point-feature-based visual SLAM algorithms rely on a static-environment assumption when recovering scene information and camera motion, so dynamic objects in the scene degrade positioning accuracy. At present, such algorithms handle simple dynamic scenes by detecting dynamic points and marking them as outliers. ORB-SLAM reduces the influence of dynamic objects on positioning accuracy through RANSAC, the chi-square test, the keyframe method, and a local map. Direct methods handle the occlusion caused by dynamic objects by optimizing a cost function. In 2013, researchers proposed a new keyframe representation and update method for adaptively modeling dynamic environments, which effectively detects and handles appearance or structural changes in such environments. In the same year, other researchers introduced a multi-camera pose estimation and mapping method for handling dynamic scenes. Nevertheless, the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes still need improvement.

Summary of the invention

The technical problem to be solved by the present invention is:

To improve the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes, a dynamic scene visual positioning method based on image semantic segmentation is provided, which segments the dynamic objects in the scene and rejects the feature points on those objects.

The present invention adopts the following technical solution to solve the above technical problem:

The present invention proposes a dynamic scene visual positioning method based on image semantic segmentation, comprising the following steps:

Step 1: acquire an original image, construct a convolutional neural network, and use the convolutional neural network to segment the dynamic objects in the original image, obtaining a semantic image;

Step 2: extract ORB feature points from the original image;

Step 3: according to the semantic image obtained in step 1, reject the dynamic-object feature points among the ORB feature points obtained in step 2, retaining only the static-object feature points;

Step 4: based on the static-object feature points obtained in step 3, localize and track the camera motion using a traditional point-feature-based SLAM method.

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 1, constructing the convolutional neural network comprises:

Step 1.1.1: downsample the original image to 1/4 size and input it to PSPNet, successively obtaining feature maps of 1/8 and 1/16 size and finally outputting a feature map F1 of 1/32 size;

Step 1.1.2: downsample the original image to 1/2 size and input it to the PSPNet, successively obtaining feature maps of 1/4 and 1/8 size and finally outputting a feature map F2 of 1/16 size;

Step 1.1.3: input the feature maps F1 and F2, together with a ground-truth label whose size is 1/16 of the original image, to a first CFF unit for fusion, outputting a feature map F¹ of 1/16 size and the loss term L₁ of the first branch;

Step 1.1.4: input the original image to the PSPNet, successively obtaining feature maps of 1/2 and 1/4 size and finally outputting a feature map F3 of 1/8 size; input the feature map F¹ and the feature map F3, together with a ground-truth label whose size is 1/8 of the original image, to a second CFF unit for fusion, outputting a feature map F² of 1/8 size and the loss term L₂ of the second branch;

Step 1.1.5: upsample the feature map F² to obtain a feature map F³ of 1/4 size; after the feature map F³ is processed with a ground-truth label of 1/4 size, the loss term L₃ of the third branch is output;

Step 1.1.6: sum the loss terms L₁, L₂, and L₃ and use the result to train the convolutional neural network.
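The scale bookkeeping in steps 1.1.1 to 1.1.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `branch_scales` and the 2048×1024 example resolution are hypothetical, and the factor of 8 per branch is simply what the stated input and output ratios imply (for example, 1/4 in and 1/32 out).

```python
from fractions import Fraction

# Input scale of each branch relative to the original image (steps 1.1.1, 1.1.2, 1.1.4).
BRANCH_INPUT_SCALE = {"F1": Fraction(1, 4), "F2": Fraction(1, 2), "F3": Fraction(1, 1)}
# Each branch shrinks its own input by a further factor of 8 (e.g. 1/4 -> 1/32).
BRANCH_STRIDE = 8


def branch_scales(width, height):
    """Return {name: (scale, (w, h))} for the three branch output feature maps."""
    out = {}
    for name, in_scale in BRANCH_INPUT_SCALE.items():
        scale = in_scale / BRANCH_STRIDE
        out[name] = (scale, (int(width * scale), int(height * scale)))
    return out


if __name__ == "__main__":
    # Hypothetical 2048x1024 input image, chosen only for illustration.
    for name, (scale, size) in branch_scales(2048, 1024).items():
        print(name, scale, size)
```

Keeping the scales as exact fractions makes it easy to check that F1 (after 2× upsampling) and F2 meet at 1/16, and that F¹ (after 2× upsampling) and F3 meet at 1/8, which is what the two CFF units require.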

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: the image processing performed by the CFF units in steps 1.1.3 and 1.1.4 comprises:

The smaller of the two input feature maps is upsampled by a factor of 2 and then fed separately into a classification convolution layer and a dilated convolution layer; the classification convolution layer has a kernel size of 1*1*1, and the dilated convolution layer has a kernel size of 3*3*C₃ and a dilation rate of 2. The larger of the two input feature maps is fed into a projection convolution layer with a kernel size of 1*1*C₃. The outputs of the dilated convolution layer and the projection convolution layer are each batch-normalized and then summed; the sum is passed through the ReLU function to output the feature map F_c. The output of the classification convolution layer and the ground-truth label are substituted into the Softmax function to obtain the loss term of the branch corresponding to the CFF unit.

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 1.1.6, summing the loss terms L₁, L₂, and L₃ to train the convolutional neural network specifically comprises:

The loss terms L₁, L₂, and L₃ are summed to obtain the final loss term L_total:

$$L_{total} = -\sum_{i=1}^{3} \omega_i \frac{1}{Y_i X_i} \sum_{y=1}^{Y_i} \sum_{x=1}^{X_i} \log \frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N} e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the weight of each branch's loss term, F^i is the feature map used to compute the loss function in each branch, Y_i×X_i is the size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of feature map F^i at position (n, y, x), and n̂ is the corresponding ground-truth label at (y, x).
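The summed loss can be sketched in plain Python as a weighted softmax cross-entropy over the three branches. This is a minimal sketch under assumptions: each branch's feature map is taken to be a nested list indexed `[n][y][x]`, the label map is indexed `[y][x]`, and the function names `branch_loss` and `total_loss` are hypothetical, not taken from the patent.

```python
import math


def branch_loss(feature, label):
    """Mean softmax cross-entropy of one branch.

    feature[n][y][x] is the score of class n at pixel (y, x);
    label[y][x] is the ground-truth class index at that pixel.
    """
    num_classes = len(feature)
    height, width = len(label), len(label[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            scores = [feature[n][y][x] for n in range(num_classes)]
            m = max(scores)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
            # -log(softmax score of the ground-truth class)
            total += log_sum - scores[label[y][x]]
    return total / (height * width)


def total_loss(features, labels, weights):
    """Weighted sum of the per-branch losses."""
    return sum(w * branch_loss(f, l) for f, l, w in zip(features, labels, weights))
```

Each branch is averaged over its own Y_i×X_i pixels before weighting, so a branch with a larger feature map does not dominate the sum merely because it has more pixels.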

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 1, using the convolutional neural network to segment the dynamic objects in the original image and obtain a semantic image comprises the following steps:

Step 1.2.1: downsample the original image to 1/4 size and input it to PSPNet, successively obtaining feature maps of 1/8 and 1/16 size and finally outputting a feature map F1 of 1/32 size;

Step 1.2.2: downsample the original image to 1/2 size and input it to the PSPNet, successively obtaining feature maps of 1/4 and 1/8 size and finally outputting a feature map F2 of 1/16 size;

Step 1.2.3: input the feature maps F1 and F2, together with a ground-truth label whose size is 1/16 of the original image, to the first CFF unit for fusion, outputting a feature map F¹ of 1/16 size;

Step 1.2.4: input the original image to the PSPNet, successively obtaining feature maps of 1/2 and 1/4 size and finally outputting a feature map F3 of 1/8 size; input the feature map F¹ and the feature map F3 to the second CFF unit for fusion, outputting a feature map F² of 1/8 size;

Step 1.2.5: upsample the feature map F² to obtain a feature map F³ of 1/4 size; during testing, upsample F³ to output a feature map at full size (ratio 1), which is the semantic segmentation map;

Step 1.2.6: binarize the semantic segmentation map: mark the dynamic objects in the semantic segmentation map with black pixels (0) and all other objects with white pixels (1), obtaining a black-and-white semantic image i′_t that contains only the dynamic objects;

Step 1.2.7: apply steps 1.2.1 to 1.2.6 to the image sequence composed of original images, finally obtaining the semantic image sequence I′={i′₁, i′₂, i′₃, i′₄, ..., i′_t} that contains only the dynamic objects.
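The binarization in steps 1.2.6 and 1.2.7 can be sketched as follows. The sketch assumes the segmentation output is available as a per-pixel map of integer class ids; the ids in `DYNAMIC_CLASS_IDS` and the function names are hypothetical, since the actual ids depend on the label set the network was trained with.

```python
# Hypothetical class ids treated as dynamic; the real ids depend on the label set used.
DYNAMIC_CLASS_IDS = frozenset({11, 13})


def binarize(class_map, dynamic_ids=DYNAMIC_CLASS_IDS):
    """Map a per-pixel class-id image to a mask: 0 = dynamic object, 1 = everything else."""
    return [[0 if c in dynamic_ids else 1 for c in row] for row in class_map]


def binarize_sequence(class_maps, dynamic_ids=DYNAMIC_CLASS_IDS):
    """Apply the binarization to every frame of an image sequence."""
    return [binarize(m, dynamic_ids) for m in class_maps]
```

For example, `binarize([[0, 11], [13, 2]])` returns `[[1, 0], [0, 1]]`: the two pixels with dynamic class ids become 0 and the rest become 1.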

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 2, extracting ORB feature points from the original image specifically comprises:

Set the number of features to be extracted according to the complexity of the scene, and use an ORB feature extractor to extract the feature points i_t(x, y) in the input image i_t, where x and y are the horizontal and vertical coordinates of a feature point.

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 3, rejecting the dynamic-object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and retaining only the static-object feature points, comprises:

For each feature point i_t(x, y) in the original image i_t, determine the corresponding position i′_t(x, y) in its semantic image i′_t;

If i′_t(x, y)=0, the point is a black pixel, that is, it belongs to a dynamic object, and the rejection operation is executed;

If i′_t(x, y)=1, the point is a white pixel, that is, it belongs to a static object, and the retention operation is executed.
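The rejection rule above amounts to a mask lookup per feature point. A minimal sketch, assuming feature points are given as (x, y) pixel coordinates and the binary semantic image is a nested list indexed `[y][x]`; the function name `reject_dynamic_points`, the rounding of sub-pixel coordinates, and the handling of out-of-range points are assumptions of this sketch rather than details from the patent.

```python
def reject_dynamic_points(points, mask):
    """Keep only feature points whose mask value is 1 (static object).

    points: iterable of (x, y) pixel coordinates, possibly sub-pixel floats.
    mask:   binary semantic image indexed mask[y][x], 0 = dynamic, 1 = static.
    """
    height, width = len(mask), len(mask[0])
    kept = []
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        # Points that fall outside the mask are discarded (an assumption of this sketch).
        if 0 <= row < height and 0 <= col < width and mask[row][col] == 1:
            kept.append((x, y))
    return kept
```

Note the index order: the point is given as (x, y), but the mask is addressed as row y, column x.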

In the foregoing dynamic scene visual positioning method based on image semantic segmentation, further: in step 4, localizing and tracking the camera motion with a traditional point-feature-based SLAM method, based on the static-object feature points obtained in step 3, is specifically:

For the image sequence I={i₁, i₂, i₃, i₄, ..., i_t}, based on the ORB feature points remaining after the rejection in step 3, a traditional point-feature-based SLAM framework computes and optimizes the camera pose, completing the positioning and tracking of the camera.

Compared with the prior art, the present invention, by adopting the above technical solution, has the following technical effects:

1. The present invention first uses supervised deep learning to segment the dynamic objects in the original image, obtaining a semantic image; on this basis, ORB feature points are extracted from the original image and the dynamic-object feature points are rejected according to the semantic image, thereby improving the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes;

2. The positioning results of the proposed method are better than those of traditional ORB-SLAM, with positioning accuracy improved by 13% to 30%.

Brief description of the drawings

Fig. 1 is a flow chart of the method;

Fig. 2 is a structure diagram of the image semantic segmentation network of the method;

Fig. 3 is a structure diagram of the cascade feature fusion unit of the method;

Fig. 4 is a flow chart of the dynamic object segmentation of the method;

Fig. 5 shows the image semantic segmentation results of the method;

Fig. 6 shows the results of rejecting dynamic-object feature points with the method;

Fig. 7 shows plan views of the positioning trajectories of the method and of complete ORB-SLAM on four sequences;

Fig. 8 shows plan views of the positioning trajectories of the method and of incomplete ORB-SLAM on four sequences.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings:

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as herein, will not be interpreted in an idealized or overly formal sense.

With the development of deep learning, researchers have begun to exploit the semantic information of images to improve the performance of visual SLAM. Semantic segmentation is a fundamental task in computer vision, in which the visual input is divided into different semantically interpretable classes. The present invention proposes a dynamic scene visual positioning method based on image semantic segmentation, which aims to improve the positioning accuracy of SLAM in dynamic scenes by rejecting dynamic-object feature points, while also obtaining rich semantic information about the scene.

The present invention proposes a dynamic scene visual positioning method based on image semantic segmentation; Fig. 1 is a flow chart of the method, and Fig. 4 is a flow chart of its dynamic object segmentation. First, supervised deep learning is used to segment the dynamic objects in the original image, obtaining a semantic image. On this basis, ORB feature points are extracted from the original image and the dynamic-object feature points are rejected according to the semantic image. Finally, using the feature points that remain after rejection, a point-feature-based monocular SLAM method localizes and tracks the camera motion.

Step 1: construct a convolutional neural network and segment the dynamic objects in the original image to obtain a semantic image:

Step 1.1: construct the convolutional neural network used for semantic segmentation

The constructed neural network structure is shown in Fig. 2. The network structure depicted in Fig. 2 comprises three branches: top, middle, and bottom. The numbers in parentheses are size ratios relative to the original input image; 'CFF' denotes the cascade feature fusion unit; the first three layers of the top and middle branches share the same parameters.

The network structure is now described in further detail:

Cascaded image input: the top branch of the network in Fig. 2 first downsamples the original image to 1/4 size and then inputs it to PSPNet, which outputs a feature map of 1/32 size. This is a coarse segmentation result that lacks many details and boundaries. The middle and bottom branches use the 1/2-size image and the original image to recover and refine the details of this coarse result. Although the segmentation result of the top branch is coarse, it contains rich semantic information; the middle and bottom branch networks used for detail recovery and refinement are therefore lightweight. The cascade feature fusion (CFF) unit fuses the output feature maps of the different branches, and cascaded label guidance strengthens the learning process of the different branches.

Cascade feature fusion: Fig. 3 shows the structure of the cascade feature fusion unit, where F1 and F2 are feature maps output by different branches and the spatial size of F2 is twice that of F1. The unit fuses the feature maps output by different branches; its inputs are the two feature maps F1 and F2 and a ground-truth label. The size of F1 is Y₁×X₁×C₁, the size of F2 is Y₂×X₂×C₂, and the size of the label is Y₁×X₁×1. F1 is first upsampled by a factor of 2, giving a feature map with the same spatial size as F2. A dilated convolution layer with kernel size 3×3×C₃ and dilation rate 2 then refines this upsampled feature map, so the size of F1 becomes Y₂×X₂×C₃. F2 passes through a convolution layer with kernel size 1×1×C₃, giving a feature map of size Y₂×X₂×C₃. The outputs of the F1 and F2 paths are each batch-normalized and then passed through a summation layer and a 'ReLU' function layer, finally outputting the fused feature map F2′.
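The shape bookkeeping of the fusion unit can be checked with a short sketch. It assumes the dilated convolution uses stride 1 and a padding of 2, which the text does not state but which is what keeps the Y₂×X₂ size unchanged for a 3×3 kernel with dilation rate 2; the function names `conv_out_size` and `cff_output_shape` are hypothetical.

```python
def conv_out_size(size, kernel, dilation=1, padding=0, stride=1):
    """Output length of a convolution along one spatial axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def cff_output_shape(f1_shape, f2_shape, c3):
    """Trace (Y, X, C) shapes through the fusion unit and return the fused shape."""
    y1, x1, _ = f1_shape
    y2, x2, _ = f2_shape
    # F1 path: 2x upsampling, then a 3x3 convolution with dilation 2 (padding 2 assumed).
    up_y, up_x = 2 * y1, 2 * x1
    f1_out = (conv_out_size(up_y, 3, dilation=2, padding=2),
              conv_out_size(up_x, 3, dilation=2, padding=2), c3)
    # F2 path: 1x1 projection convolution to C3 channels.
    f2_out = (conv_out_size(y2, 1), conv_out_size(x2, 1), c3)
    if f1_out != f2_out:
        raise ValueError(f"cannot sum shapes {f1_out} and {f2_out}")
    return f1_out
```

A 3×3 kernel with dilation rate 2 covers a 5×5 window, so without padding each spatial axis would shrink by 4; the element-wise sum with the F2 path only works if both paths end at exactly Y₂×X₂×C₃.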

Cascaded label guidance: in the network structure of Fig. 2, three ground-truth labels of different sizes (1/16, 1/8, and 1/4 of the original image, respectively) are used to generate three independent loss terms in the top, middle, and bottom branches of the network. The three loss terms are summed to obtain the final loss term:

$$L = -\sum_{t=1}^{3} \omega_t \frac{1}{Y_t X_t} \sum_{y=1}^{Y_t} \sum_{x=1}^{X_t} \log \frac{e^{F^{t}_{\hat{n},y,x}}}{\sum_{n=1}^{N} e^{F^{t}_{n,y,x}}}$$

where ω_t is the weight of each branch's loss term, F^t is the feature map output by each branch, Y_t×X_t is the size of F^t, N is the preset number of object classes to be segmented in the image, F^t_{n,y,x} is the value of feature map F^t at position (n, y, x), and n̂ is the corresponding ground-truth label at (y, x).

Step 1.2: segment the dynamic objects in the original input image:

Fig. 4 shows how this step is carried out. For a given image sequence I={i₁, i₂, i₃, i₄, ..., i_t}, where i_t is the image captured by the camera at time t:

(1) Input an image i_t to the semantic segmentation network constructed in step 1.1 and output a segmented color semantic image, in which the pixels of objects such as cars, pedestrians, buildings, and signs are labeled in different colors;

(2) Binarize the semantic image from (1): mark the dynamic objects in the image (pedestrians, cars) with black pixels (0) and all other objects with white pixels (1), obtaining a black-and-white semantic image i′_t that contains only the dynamic objects;

(3) Repeat steps (1) and (2) for every image in the image sequence I.

This finally gives the semantic image sequence I′={i′₁, i′₂, i′₃, i′₄, ..., i′_t} that contains only the dynamic objects.

Step 2: extract ORB feature points from the original image and reject the dynamic-object feature points according to the semantic image, retaining only the static-object feature points:

Step 2.1: extract the ORB feature points in the original image:

Step 3: reject the dynamic-object feature points according to the semantic image, retaining only the static-object feature points:

(1) For each feature point i_t(x, y) in i_t, determine the corresponding position i′_t(x, y) in the semantic image i′_t;

(2) If i′_t(x, y)=0, the point is a black pixel, that is, it belongs to a dynamic object, and the rejection operation is executed;

(3) If i′_t(x, y)=1, the point is a white pixel, that is, it belongs to a static object, and the retention operation is executed.

Step 4: based on the ORB feature points remaining after the rejection in step 3, localize and track the camera using a traditional point-feature-based SLAM framework:

For the image sequence I={i₁, i₂, i₃, i₄, ..., i_t}, based on the ORB feature points remaining after the rejection in step 3, a traditional point-feature-based SLAM framework computes and optimizes the camera pose, completing the positioning and tracking of the camera.

Embodiment 1

The present invention is evaluated on the Frankfurt monocular image sequence, which is part of the Cityscapes dataset. The entire Frankfurt sequence provides more than 100,000 frames of outdoor images, along with positioning results that can be used as ground truth. The sequence is divided into several smaller sequences of 1300 to 2500 frames that contain dynamic objects such as moving cars or pedestrians. The experimental platform is configured as follows: Intel Xeon E5-2690 v4; 128 GB of RAM; NVIDIA Titan V GPU.

The sequences separated from the original Frankfurt sequence are as follows:

Seq.01: frankfurt_000001_054140_leftImg8bit.png - frankfurt_000001_056555_leftImg8bit.png

Seq.02: frankfurt_000001_012745_leftImg8bit.png - frankfurt_000001_014100_leftImg8bit.png

Seq.03: frankfurt_000001_003311_leftImg8bit.png - frankfurt_000001_005555_leftImg8bit.png

Seq.04: frankfurt_000001_010580_leftImg8bit.png - frankfurt_000001_012739_leftImg8bit.png

Fig. 5 shows the semantic segmentation results. The middle column shows that trees, buildings, roads, traffic signs, and other objects in the scene are finely segmented. The right column retains only the segmentation results for dynamic objects (cars and pedestrians). Although the boundaries are not perfectly accurate, the results are sufficient for rejecting feature points.

Fig. 6 shows the results of rejecting dynamic-object feature points. The white car is a dynamic object travelling on the road. The two images in the left column show the result before rejection, in which many feature points belong to the moving vehicle. The right column shows the result after rejection, in which the feature points on the car have been completely removed.

Fig. 7 shows plan views of the positioning trajectories of the present method built on complete ORB-SLAM and of complete ORB-SLAM itself on the four video sequences Seq.01, Seq.02, Seq.03, and Seq.04. The four plots show that the trajectory obtained by the proposed method (Ours) deviates less from the real trajectory (Ground Truth) than the trajectory computed by complete ORB-SLAM (ORB-SLAM Full). Because Seq.01 contains many moving vehicles and pedestrians, the results of both methods deviate considerably from the ground truth, but the present method is still more accurate than complete ORB-SLAM. Because the system tracks position on the basis of keyframes, the positioning trajectories show local discontinuities.

Complete ORB-SLAM uses the chi-square test, which reduces the influence of dynamic feature points on positioning accuracy to some extent. Fig. 8 shows plan views of the positioning trajectories of the present method built on incomplete ORB-SLAM (with the chi-square test removed) and of incomplete ORB-SLAM itself on the four video sequences Seq.01, Seq.02, Seq.03, and Seq.04. The four plots show that the trajectory obtained by the proposed method (Ours) deviates less from the real trajectory (Ground Truth) than the trajectory computed by incomplete ORB-SLAM (ORB-SLAM Incomplete). Because the scene contains many pedestrians and therefore a large number of dynamic feature points, incomplete ORB-SLAM fails completely to localize in Seq.02, which demonstrates that the proposed method is more robust. Because the system tracks position on the basis of keyframes, the positioning trajectories show local discontinuities.

Finally, the positioning results of the four image sequences under complete ORB-SLAM, incomplete ORB-SLAM, and the present method are given. Tables 1 and 2 show that the positioning results of the proposed method are better than those of traditional ORB-SLAM, with positioning accuracy improved by 13% to 30%.

Table 1: Comparison of the positioning results of the two methods on image sequences Seq.01-Seq.04

Table 2: Comparison of the positioning results of the two methods on image sequences Seq.01-Seq.04

The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A dynamic scene visual positioning method based on image semantic segmentation, characterized by comprising the following steps:

Step 1: acquire an original image, construct a convolutional neural network, and use the convolutional neural network to segment the dynamic objects in the original image, obtaining a semantic image;

Step 2: extract ORB feature points from the original image;

Step 3: according to the semantic image obtained in step 1, reject the dynamic-object feature points among the ORB feature points obtained in step 2, retaining only the static-object feature points;

Step 4: based on the static-object feature points obtained in step 3, localize and track the camera motion using a traditional point-feature-based SLAM method.

2. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that: in step 1, constructing the convolutional neural network comprises:

Step 1.1.1: downsample the original image to 1/4 size and input it to PSPNet, successively obtaining feature maps of 1/8 and 1/16 size and finally outputting a feature map F1 of 1/32 size;

Step 1.1.2: downsample the original image to 1/2 size and input it to the PSPNet, successively obtaining feature maps of 1/4 and 1/8 size and finally outputting a feature map F2 of 1/16 size;

Step 1.1.3: input the feature maps F1 and F2, together with a ground-truth label whose size is 1/16 of the original image, to a first CFF unit for fusion, outputting a feature map F¹ of 1/16 size and the loss term L₁ of the first branch;

Step 1.1.4: input the original image to the PSPNet, successively obtaining feature maps of 1/2 and 1/4 size and finally outputting a feature map F3 of 1/8 size; input the feature map F¹ and the feature map F3, together with a ground-truth label whose size is 1/8 of the original image, to a second CFF unit for fusion, outputting a feature map F² of 1/8 size and the loss term L₂ of the second branch;

Step 1.1.5: upsample the feature map F² to obtain a feature map F³ of 1/4 size; after the feature map F³ is processed with a ground-truth label of 1/4 size, the loss term L₃ of the third branch is output;

3. The dynamic scene visual positioning method based on image semantic segmentation according to claim 2, characterized in that: the image processing performed by the CFF units in steps 1.1.3 and 1.1.4 comprises:

The smaller of the two input feature maps is upsampled by a factor of 2 and then fed separately into a classification convolution layer and a dilated convolution layer; the classification convolution layer has a kernel size of 1*1*1, and the dilated convolution layer has a kernel size of 3*3*C₃ and a dilation rate of 2. The larger of the two input feature maps is fed into a projection convolution layer with a kernel size of 1*1*C₃. The outputs of the dilated convolution layer and the projection convolution layer are each batch-normalized and then summed; the sum is passed through the ReLU function to output the feature map F_c. The output of the classification convolution layer and the ground-truth label are substituted into the Softmax function to obtain the loss term of the branch corresponding to the CFF unit.

4. The dynamic scene visual positioning method based on image semantic segmentation according to claim 2, characterized in that: in step 1.1.6, summing the loss terms L₁, L₂, and L₃ to train the convolutional neural network specifically comprises:

The loss terms L₁, L₂, and L₃ are summed to obtain the final loss term L_total:

$$L_{total} = -\sum_{i=1}^{3} \omega_i \frac{1}{Y_i X_i} \sum_{y=1}^{Y_i} \sum_{x=1}^{X_i} \log \frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N} e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the weight of each branch's loss term, F^i is the feature map used to compute the loss function in each branch, Y_i×X_i is the size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of feature map F^i at position (n, y, x), and n̂ is the corresponding ground-truth label at (y, x).

5. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that: in step 1, using the convolutional neural network to segment the dynamic objects in the original image and obtain a semantic image comprises the following steps:

Step 1.2.1: downsample the original image to 1/4 size and input it to PSPNet, successively obtaining feature maps of 1/8 and 1/16 size and finally outputting a feature map F1 of 1/32 size;

Step 1.2.2: downsample the original image to 1/2 size and input it to the PSPNet, successively obtaining feature maps of 1/4 and 1/8 size and finally outputting a feature map F2 of 1/16 size;

Step 1.2.3: input the feature maps F1 and F2, together with a ground-truth label whose size is 1/16 of the original image, to the first CFF unit for fusion, outputting a feature map F¹ of 1/16 size;

Step 1.2.4: input the original image to the PSPNet, successively obtaining feature maps of 1/2 and 1/4 size and finally outputting a feature map F3 of 1/8 size; input the feature map F¹ and the feature map F3 to the second CFF unit for fusion, outputting a feature map F² of 1/8 size;

Step 1.2.5: upsample the feature map F² to obtain a feature map F³ of 1/4 size; during testing, upsample F³ to output a feature map at full size (ratio 1), which is the semantic segmentation map;

Step 1.2.6: binarize the semantic segmentation map: mark the dynamic objects in the semantic segmentation map with black pixels (0) and all other objects with white pixels (1), obtaining a black-and-white semantic image i′_t that contains only the dynamic objects;

Step 1.2.7: apply steps 1.2.1 to 1.2.6 to the image sequence composed of original images, finally obtaining the semantic image sequence I′={i′₁, i′₂, i′₃, i′₄, ..., i′_t} that contains only the dynamic objects.

6. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that: in step 2, extracting ORB feature points from the original image specifically comprises:

Set the number of features to be extracted according to the complexity of the scene, and use an ORB feature extractor to extract the feature points i_t(x, y) in the input image i_t, where x and y are the horizontal and vertical coordinates of a feature point.

7. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that: in step 3, rejecting the dynamic-object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and retaining only the static-object feature points, comprises:

For each feature point i_t(x, y) in the original image i_t, determine the corresponding position i′_t(x, y) in its semantic image i′_t;

8. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that: in step 4, localizing and tracking the camera motion with a traditional point-feature-based SLAM method, based on the static-object feature points obtained in step 3, is specifically:

For the image sequence I={i₁, i₂, i₃, i₄, ..., i_t}, based on the ORB feature points remaining after the rejection in step 3, a traditional point-feature-based SLAM framework computes and optimizes the camera pose, completing the positioning and tracking of the camera.