CN117495965A - Method, device, equipment and medium for determining camera pose of visual SLAM - Google Patents


Info

Publication number
CN117495965A
Authority
CN
China
Prior art keywords
object detection
processed
detection frame
feature points
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311481179.2A
Other languages
Chinese (zh)
Inventor
张松林
张天奇
张建
王宇
曹天书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Group Corp
Original Assignee
FAW Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FAW Group Corp filed Critical FAW Group Corp
Priority to CN202311481179.2A
Publication of CN117495965A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses a method, a device, equipment and a medium for determining the camera pose of a visual SLAM. During driving of a target vehicle, video images are acquired by a camera device deployed in the target vehicle. For each video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed are determined in the current video frame, and the attribute category of each feature point to be processed is determined. The feature points to be processed that belong to the same object detection frame are determined, and the state category of each object detection frame is determined according to the number of feature points to be processed in that frame whose attribute category is the abnormal category. The feature points to be processed are screened based on the state category of each object detection frame to obtain a plurality of target feature points, and the camera pose of the camera device deployed in the target vehicle is determined based on the target feature points. The method and the device improve the positioning accuracy of the visual SLAM in a dynamic environment.

Description

Method, device, equipment and medium for determining camera pose of visual SLAM
Technical Field
The embodiment of the invention relates to the technical field of vehicle control, in particular to a method, a device, equipment and a medium for determining the pose of a camera of a visual SLAM.
Background
Simultaneous localization and mapping (SLAM) plays a very important role in robot vision: it can estimate the pose of a camera and reconstruct an unknown environment in various ways, and it is widely applied in the field of automatic driving. SLAM technology rests on a basic assumption: the application environment is static. However, in automatic driving applications most environments are dynamic, for example a vehicle performs localization and mapping in real time while driving. To obtain a high-precision SLAM result, the current pose of the camera on the vehicle must be determined in real time, and in a dynamic environment the movement of objects reduces the pose positioning accuracy of the SLAM algorithm. Accordingly, an accurate camera pose in a dynamic environment is critical to improving SLAM accuracy.
At present, the camera pose is calculated either by tracking stable static feature points, or by using high-performance image pickup equipment to acquire images with depth information so that more feature information is available for calculating the camera pose. However, the number of static feature points is very small, and calculating the camera pose from a small number of static feature points leads to low positioning and mapping accuracy; the approach of using high-performance image pickup equipment suffers from high cost.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for determining the camera pose of a visual SLAM, which improve the pose positioning precision of the visual SLAM in a dynamic environment and reduce the application cost of the SLAM.
According to a first aspect of the present invention, there is provided a camera pose determination method of a visual SLAM, the method comprising:
acquiring a video image based on a camera device deployed in a target vehicle during the driving of the target vehicle;
for each video frame, determining, in the current video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed, and determining the attribute category of each feature point to be processed;
determining feature points to be processed which belong to the same object detection frame;
determining the state category of each object detection frame according to the number of feature points to be processed in that frame whose attribute category is the abnormal category;
screening the feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points;
and determining the camera pose of the camera device arranged in the target vehicle based on the target feature points.
According to a second aspect of the present invention, there is provided a camera pose determination apparatus of a visual SLAM, the apparatus comprising:
The video frame acquisition module is used for acquiring video images based on a camera device arranged in the target vehicle in the running process of the target vehicle;
the feature point attribute determining module is used for determining, for each video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed in the current video frame, and determining the attribute category of each feature point to be processed;
the characteristic point attribution determining module is used for determining characteristic points to be processed which belong to the same object detecting frame;
the detection frame state determining module is used for determining the state category of each object detection frame according to the number of feature points to be processed in that frame whose attribute category is the abnormal category;
the target point determining module is used for screening the plurality of feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points;
and the camera pose determining module is used for determining the camera pose of the camera device arranged in the target vehicle based on the target characteristic points.
According to a third aspect of the present invention, there is provided an electronic device comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores a computer program executable by the at least one processor, so that the at least one processor can execute the camera pose determining method of the visual SLAM according to any embodiment of the present invention.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the camera pose determination method of the visual SLAM according to any embodiment of the present invention when executed.
According to the technical scheme, during driving of the target vehicle, video images are acquired by the camera device deployed in the target vehicle. For each video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed are determined in the current video frame, and the attribute category of each feature point to be processed is determined. The feature points to be processed that belong to the same object detection frame are then determined, and the state category of each object detection frame is determined according to the number of feature points to be processed of the abnormal category in that frame. The feature points to be processed are screened based on the state category of each object detection frame to obtain a plurality of target feature points, and finally the camera pose of the camera device deployed in the target vehicle is determined based on the target feature points. By judging the state category of each object detection frame and screening out the dynamic feature points in the current video frame according to that state category, the number of stable static points is guaranteed, and the camera pose is determined from more stable static points, so the pose positioning precision of the visual SLAM in a dynamic environment is improved. In addition, the method does not depend on depth information, so no special requirement is placed on the performance of the imaging device; the method is highly practical and can reduce the cost of simultaneous localization and mapping.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a camera pose determining method of a visual SLAM according to a first embodiment of the present invention;
Fig. 2 is a schematic diagram of a current video frame according to a first embodiment of the present invention;
Fig. 3 is a schematic diagram of an output image of a current video frame processed by an object detection model according to a first embodiment of the present invention;
Fig. 4 is a flowchart of a camera pose determining method of a visual SLAM according to a second embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a camera pose determining apparatus of a visual SLAM according to a third embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device implementing a camera pose determining method of a visual SLAM according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a camera pose determining method of a visual SLAM according to an embodiment of the present invention, where the method may be performed by a camera pose determining device of a visual SLAM in a case of performing high-precision positioning in an indoor parking lot where a positioning signal is weak, and the camera pose determining device of the visual SLAM may be implemented in a form of hardware and/or software, and the camera pose determining device of the visual SLAM may be configured in a terminal and/or a server. As shown in fig. 1, the method includes:
s110, acquiring video images based on an imaging device arranged in a target vehicle during running of the target vehicle.
The target vehicle is a vehicle which is to be subjected to synchronous positioning and mapping. The camera device is a camera deployed on the target vehicle, and the camera device only has a video acquisition function.
Specifically, during the driving process of the target vehicle, the video image can be acquired in real time through the image pickup device mounted on the target vehicle.
In particular, among the sensors typically used by an autonomous vehicle, a camera is relatively inexpensive and provides rich environmental information. To achieve high-precision simultaneous localization and mapping, recognition based on video images with depth information is often required; however, only expensive RGB-D cameras can acquire depth information. The image pickup apparatus in this embodiment is not limited by depth information: a video image with depth information may be used, or a video image without depth information may be used.
S120, for each video frame, determining a plurality of object detection frames and a plurality of feature points to be processed, wherein the object detection frames and the feature points to be processed comprise at least one preset object, in the current video frame, and determining attribute categories of the feature points to be processed.
The video frame is a certain picture in the video image collected by the camera device. The current video frame is the video frame collected at the current moment.
The preset object is a preset identification object of various types, for example, the preset object can be a building, a person, a pet, a plant, and the like. In practical applications, the preset object may be a local part in a building, a person, a pet, and a plant, for example, the preset object may be a hand part of a user. The object detection frame is a rectangular frame for performing frame selection annotation on the object identified in the video frame. The feature points to be processed are feature points capable of representing video frames, namely synchronous positioning and mapping are realized based on the feature points to be processed.
The attribute categories of the feature points to be processed include a normal category and an abnormal category. The attribute category indicates whether a feature point to be processed in the current video frame has been displaced relative to the historical feature point at the corresponding position in the previous video frame. If the attribute category is normal, the feature point to be processed is considered a static point, with no displacement relative to the historical feature point at the corresponding position in the previous video frame; if the attribute category is abnormal, the feature point to be processed is considered a dynamic point, with a displacement relative to the historical feature point at the corresponding position in the previous video frame.
Specifically, the current video frame may be input into an object detection model trained in advance, to obtain a plurality of object detection frames and a plurality of feature points to be processed of at least one preset object.
Wherein the object detection model is trained in advance. The object detection model may be any deep neural network semantic model; for example, the object detection model may be a YOLOX network model.
Illustratively, this embodiment employs a YOLOX network architecture for detecting the preset objects, which offers excellent accuracy and a high calculation speed. In complex scenarios every object may move, and identifying as many object classes as possible is advantageous for building a semantic map, so YOLOX is pre-trained to identify more classes.
The YOLOX main structure is divided into a backbone network layer (Backbone), a neck layer (Neck) and a prediction layer (Prediction). When labels are assigned to each preset object, the SimOTA technique is used to dynamically allocate the number of positive samples for different labels, which shortens training time and avoids additional parameters. The convolutions in the original Neck and Prediction layers of YOLOX are too heavy, so in this embodiment a Ghost module replaces the 3×3 convolution module in these two layers, reducing parameter redundancy and the model size. In addition, considering that replacing the convolution module weakens the feature extraction capability and thus reduces model precision, this embodiment adds a CA attention mechanism to the Neck layer, which reduces the noise influence of the background and enhances the extraction of key information at low cost, meeting the requirement of a lightweight model.
For example, referring to fig. 2, a schematic view of a current video frame is input into an object detection model, and the object detection model recognizes that two objects exist in the current video frame, and thus two object detection frames are labeled. An output image schematic diagram of the object detection model after processing the current video frame is shown in fig. 3, and a plurality of feature points to be processed and two object detection frames are marked in the output image.
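As a non-limiting illustration of this step, the Python sketch below shows one way the detection frames and feature points to be processed could be obtained for a frame. The `detect_objects` callable is a hypothetical wrapper around a pre-trained YOLOX-style detector (not the exact network of this embodiment), and ORB keypoints from OpenCV stand in for the feature points to be processed.

```python
import cv2
import numpy as np

def detect_boxes_and_features(frame_bgr, detect_objects):
    """Return object detection boxes and feature points to be processed for one frame.

    detect_objects(frame) is assumed to return a list of (l, t, r, b, class_name)
    tuples from a pre-trained detector; ORB keypoints stand in for the feature
    points to be processed.
    """
    boxes = detect_objects(frame_bgr)                 # [(l, t, r, b, cls), ...]
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    points = np.float32([kp.pt for kp in keypoints])  # one (u, v) row per feature point
    return boxes, points, descriptors
```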
Further, determining the attribute category of each feature point to be processed, which specifically includes the following steps:
(1) Performing epipolar processing on the current video frame and the previous video frame to obtain a plurality of feature point matching pairs and the epipolar line of each feature point matching pair.
The previous video frame is the video frame acquired at the acquisition time immediately before the acquisition time of the current video frame. The historical feature points are the feature points in the previous video frame. Since the previous video frame was itself the current video frame at the previous acquisition time, the object detection model can likewise be used to determine its feature points, so the historical feature points are readily available.
In this embodiment, the fundamental matrix maps the historical feature points of the previous video frame into the search domain of the current video frame, i.e. epipolar processing. Each historical feature point of the previous video frame corresponds to a feature point to be processed of the current video frame, together forming a feature point matching pair, and a plurality of feature point matching pairs and the epipolar line of each feature point matching pair are obtained on this basis.
(2) And for each characteristic point matching pair, determining the attribute category of the characteristic point to be processed in the characteristic point matching pair based on the epipolar distance between the characteristic point to be processed and the corresponding epipolar line.
The characteristic point matching pair comprises a history characteristic point in the previous video frame and a characteristic point to be processed matched with the history characteristic point in the current video frame.
In this embodiment, RANSAC is used to estimate the fundamental matrix for calculating the epipolar distances in the current video frame. A specific example of calculating the epipolar distance is as follows:

$$P_1 = [u_1, v_1, 1],\quad P_2 = [u_2, v_2, 1] \qquad (1)$$

$$p_1 = [u_1, v_1],\quad p_2 = [u_2, v_2] \qquad (2)$$

where $p_1$ and $p_2$ denote a pair of successfully matched feature points in the previous video frame and the current video frame, $P_1$ and $P_2$ denote their homogeneous coordinate forms, and u and v are coordinate values in the image. The epipolar line is calculated as follows:

$$L = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = F P_1^{T} \qquad (3)$$

where L denotes the epipolar line, X, Y and Z denote the components of the line vector, and F denotes the fundamental matrix. The distance between the feature point to be processed and the corresponding epipolar line is calculated as follows:

$$D = \frac{\left| P_2 F P_1^{T} \right|}{\sqrt{X^2 + Y^2}} \qquad (4)$$

where D represents the epipolar distance between the feature point to be processed and the corresponding epipolar line.
Further, determining the attribute category of the feature point to be processed in the feature point matching pair based on the epipolar distance between the feature point to be processed and the corresponding epipolar line specifically comprises: if the polar distance between the feature point to be processed and the corresponding polar line is larger than a preset polar line threshold value, the attribute category of the feature point to be processed in the feature point matching pair is abnormal; if the polar distance between the feature point to be processed and the corresponding polar line is smaller than the preset polar line threshold value, the attribute category of the feature point to be processed in the feature point matching pair is normal.
In this embodiment, the preset epipolar threshold is a preset reference threshold. The attribute category of the feature point to be processed in the feature point matching pair is determined by comparing the epipolar distance between the feature point to be processed and the corresponding epipolar line with the preset epipolar threshold.
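A minimal Python sketch of this attribute-category test is shown below. It assumes matched pixel coordinates `pts_prev` and `pts_curr` of shape (N, 2) for the feature point matching pairs, uses OpenCV's RANSAC fundamental-matrix estimation, and the epipolar threshold value is illustrative rather than taken from the embodiment.

```python
import cv2
import numpy as np

def classify_points_by_epipolar_distance(pts_prev, pts_curr, epi_thresh=1.0):
    """Label each matched feature point in the current frame as 'normal' (static)
    or 'abnormal' (dynamic) from its distance to the epipolar line, per Eqs. (1)-(4)."""
    F, _ = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        raise ValueError("fundamental matrix estimation failed")
    # Homogeneous coordinates P1, P2 of the matched pairs
    P1 = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])
    P2 = np.hstack([pts_curr, np.ones((len(pts_curr), 1))])
    lines = (F @ P1.T).T                      # epipolar lines L = [X, Y, Z] in the current frame
    num = np.abs(np.sum(P2 * lines, axis=1))  # |P2 F P1^T| for each pair
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    D = num / den                             # epipolar distance of each point
    return np.where(D > epi_thresh, "abnormal", "normal")
```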
S130, determining feature points to be processed belonging to the same object detection frame.
In this embodiment, the object detection model outputs a plurality of feature points to be processed; some of them are located inside an object detection frame and some are located outside the object detection frames. In this step, the feature points to be processed that belong to the same object detection frame are determined, so that the state category of the current object detection frame can then be determined based on the attribute categories of the feature points to be processed inside it.
Specifically, determining the feature points to be processed that belong to the same object detection frame includes: determining the abscissa range and the ordinate range of the object detection frame based on the four-corner coordinates corresponding to the object detection frame; and determining, as feature points to be processed belonging to the same object detection frame, those feature points whose abscissa lies in the abscissa range of the object detection frame and whose ordinate lies in the ordinate range of the object detection frame.
In this embodiment, the method for determining whether the feature points to be processed belong to the same object detection frame is the same for each object detection frame, and here, an example is given by taking any one of the object detection frames as an example.
Specifically, the object detection model can output the four-corner coordinates of each object detection frame. Assume that the set of all feature points to be processed in the current video frame is $A = \{p_1, p_2, \ldots, p_n\}$, the coordinates of a single feature point to be processed are $p_i = (u_i, v_i)$, and the object detection frame is defined as $d = (l, t, r, b)$, where l is the left x-coordinate value of the object detection frame, t is the upper y-coordinate value, r is the right x-coordinate value and b is the lower y-coordinate value. The abscissa range of the object detection frame is $[l, r]$ and the ordinate range is $[t, b]$. Each feature point to be processed is judged as follows: if a point p satisfies $\{(u, v) \mid l < u < r,\ t < v < b\}$, then p belongs to the object detection frame.
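A minimal sketch of this membership test, using the (l, t, r, b) box convention defined above:

```python
import numpy as np

def points_in_box(points, box):
    """Return a boolean mask of the feature points that belong to one detection frame.

    points: (N, 2) array of (u, v) pixel coordinates
    box:    (l, t, r, b) with l/r the left/right x values and t/b the top/bottom y values
    """
    l, t, r, b = box
    u, v = points[:, 0], points[:, 1]
    return (l < u) & (u < r) & (t < v) & (v < b)
```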
S140, determining the state category of each object detection frame according to the number of feature points to be processed in that frame whose attribute category is the abnormal category.
The state category of the object detection frame comprises a dynamic detection frame and a static detection frame.
In this embodiment, an arbitrary object detection frame is taken as an example. Having determined the feature points to be processed that belong to the same object detection frame, the total number of feature points to be processed in that frame can be counted. Since the attribute category of each feature point to be processed has already been determined in S120, the number of feature points to be processed whose attribute category is the abnormal category is readily obtained, and the state category of the corresponding object detection frame is then determined based on the ratio of the number of abnormal feature points to be processed to the total number of feature points to be processed.
Optionally, determining the status category of the corresponding object detection frame specifically includes:
(1) And determining the occupation ratio of the abnormal characteristic points based on the total number of the characteristic points to be processed belonging to the object detection frame and the number of the characteristic points to be processed with the attribute category being the abnormal category.
For example, the total number of feature points to be processed belonging to the object detection frame is denoted A, the number of those feature points to be processed whose attribute category is the abnormal category is denoted B, and the abnormal feature point ratio is B/A.
(2) And determining the state category of the corresponding object detection frame based on the abnormal feature point occupation ratio and a preset discrimination threshold.
In this embodiment, the objects within different object detection frames may belong to different categories; for example, the object within an object detection frame may be a person or an inanimate object. It will be appreciated that a person and an object move differently: a person often moves only local joints, i.e. only one or two points are dynamic. Therefore, when the object category detected in the object detection frame is a person, the preset discrimination threshold corresponding to that object detection frame should be sufficiently small. Other objects do not usually move only a small part the way a person does, so more feature points should be identified as dynamic. On the other hand, only a few feature points are extracted from some objects while others need many feature points to be described, so a single fixed threshold is not applicable to all objects. Therefore, different preset discrimination thresholds may be preset for different object categories.
Specifically, the preset discrimination threshold corresponding to the object category is obtained based on the object category in the object detection frame. The state category of the corresponding object detection frame is then determined by comparing the abnormal feature point ratio with the preset discrimination threshold. Specifically: if the abnormal feature point ratio is larger than the preset discrimination threshold, the state category of the corresponding object detection frame is determined to be a dynamic detection frame; if the abnormal feature point ratio is smaller than the preset discrimination threshold, the state category of the corresponding object detection frame is determined to be a static detection frame.
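A minimal sketch of this decision rule is shown below; the per-category discrimination thresholds are illustrative placeholders, not values from the embodiment.

```python
def classify_box_state(attr_labels_in_box, object_class, thresholds=None):
    """Classify one object detection frame as 'dynamic' or 'static' from the
    ratio of abnormal feature points it contains."""
    if thresholds is None:
        # Illustrative per-class thresholds: a person moves locally, so its
        # threshold is kept small; other classes use a larger default.
        thresholds = {"person": 0.1, "default": 0.5}
    total = len(attr_labels_in_box)
    if total == 0:
        return "static"
    abnormal = sum(1 for a in attr_labels_in_box if a == "abnormal")
    ratio = abnormal / total                      # B / A
    threshold = thresholds.get(object_class, thresholds["default"])
    return "dynamic" if ratio > threshold else "static"
```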
And S150, screening the feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points.
The target feature points are the final remaining feature points to be processed after a part of feature points are removed from the feature points to be processed. And inputting the target feature points into SLAM systems corresponding to the target vehicles, so that the SLAM systems synchronously position and build the map based on the target feature points.
In this embodiment, for each feature point to be processed, it is determined whether the following two conditions are met at the same time: 1. the feature point is located inside a dynamic detection frame; 2. the feature point is located outside every static detection frame. If both conditions are met, the feature point to be processed is eliminated. The remaining feature points to be processed are the target feature points. Thus, the more static frames there are, the more stable static points remain, and the higher the accuracy of the system.
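A minimal sketch of this screening step, assuming each detection frame has already been labeled 'static' or 'dynamic' and is given in the (l, t, r, b) convention used earlier:

```python
import numpy as np

def select_target_points(points, boxes):
    """Keep the feature points to be processed that survive the screening rule:
    a point is eliminated only if it lies inside a dynamic detection frame and
    outside every static detection frame.

    points: (N, 2) array of (u, v) coordinates
    boxes:  list of ((l, t, r, b), state) pairs, state in {"static", "dynamic"}
    """
    u, v = points[:, 0], points[:, 1]
    in_static = np.zeros(len(points), dtype=bool)
    in_dynamic = np.zeros(len(points), dtype=bool)
    for (l, t, r, b), state in boxes:
        mask = (l < u) & (u < r) & (t < v) & (v < b)
        if state == "static":
            in_static |= mask
        else:
            in_dynamic |= mask
    keep = ~(in_dynamic & ~in_static)
    return points[keep]
```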
S160, determining the camera pose of the camera device deployed in the target vehicle based on the target feature points.
In the present embodiment, the spatial point corresponding to a target feature point $p_i = (u_i, v_i)$ is $P_i = (X_i, Y_i, Z_i)$. The correspondence between the target feature point $p_i$ and the spatial point $P_i$ is as follows:

$$s_i\, p_i = K \exp\left(\xi^{\wedge}\right) P_i \qquad (5)$$

where s is a scale factor, K is the camera intrinsic matrix, ξ is the Lie algebra of the pose, ∧ is the operator that converts the Lie algebra into a matrix, and e is the error. Rewriting formula (5) in matrix form gives the error of each target feature point:

$$e_i = p_i - \frac{1}{s_i} K \exp\left(\xi^{\wedge}\right) P_i \qquad (6)$$

All errors are accumulated and summed to construct a least squares problem:

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{i \in B} \left\| p_i - \frac{1}{s_i} K \exp\left(\xi^{\wedge}\right) P_i \right\|_2^2 \qquad (7)$$

where $\xi^{*}$ is the camera pose to be calculated, B is the set of target feature points, and the spatial points $P_i$ with $i \in B$ in the above formula are fixed values.
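The least squares problem in formula (7) is a standard reprojection-error minimization over the camera pose. As a hedged illustration only, an equivalent pose estimate can be obtained with OpenCV's PnP solver, given the spatial points P_i associated with the target feature points and the camera intrinsic matrix K; lens distortion is assumed to be negligible or already corrected, which is an assumption of this sketch rather than a statement of the embodiment.

```python
import cv2
import numpy as np

def estimate_camera_pose(object_points_3d, target_points_2d, K):
    """Solve for the camera pose from the target feature points and their spatial points.

    object_points_3d: (N, 3) array of P_i = (X_i, Y_i, Z_i)
    target_points_2d: (N, 2) array of p_i = (u_i, v_i)
    K:                (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec = cv2.solvePnP(
        object_points_3d.astype(np.float64),
        target_points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,           # distortion assumed already corrected
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    R, _ = cv2.Rodrigues(rvec)     # rotation matrix from the rotation vector
    return ok, R, tvec
```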
According to the technical scheme of this embodiment, during driving of the target vehicle, video images are acquired by the camera device deployed in the target vehicle. For each video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed are determined in the current video frame, and the attribute category of each feature point to be processed is determined. The feature points to be processed that belong to the same object detection frame are then determined, and the state category of each object detection frame is determined according to the number of feature points to be processed of the abnormal category in that frame. The feature points to be processed are screened based on the state category of each object detection frame to obtain a plurality of target feature points, and finally the camera pose of the camera device deployed in the target vehicle is determined based on the target feature points. By judging the state category of each object detection frame and screening out the dynamic feature points in the current video frame according to that state category, the number of stable static points is guaranteed, and the camera pose is determined from more stable static points, so the pose positioning precision of the visual SLAM in a dynamic environment is improved. In addition, the method does not depend on depth information, so no special requirement is placed on the performance of the imaging device; the method is highly practical and can reduce the cost of simultaneous localization and mapping.
Example two
Fig. 4 is a flowchart of a camera pose determining method of a visual SLAM according to a second embodiment of the present invention, and S140 is further refined based on the foregoing embodiment. Wherein, the technical terms identical to or corresponding to the above embodiments are not repeated herein.
As shown in fig. 4, the method includes:
s210, acquiring video images based on an image pickup device arranged in a target vehicle during running of the target vehicle.
S220, for each video frame, determining a plurality of object detection frames and a plurality of feature points to be processed, wherein the object detection frames and the feature points to be processed comprise at least one preset object, in the current video frame, and determining attribute categories of the feature points to be processed.
In this embodiment, after the plurality of feature points to be processed are obtained, a pre-screening process may be performed on them to remove some interference feature points. A specific implementation may be: eliminating interference feature points from the plurality of feature points to be processed based on preset screening conditions. Wherein the preset screening conditions include at least one of the following: feature points to be processed whose distance to the edge pixel points of the object detection frame is smaller than a preset distance threshold; and feature points to be processed whose pixel difference from the central pixel point of the object detection frame is larger than a preset pixel threshold.
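A minimal sketch of this pre-screening, assuming a grayscale frame and illustrative values for the distance and pixel thresholds (neither is specified by the embodiment):

```python
import numpy as np

def prescreen_points(points, box, gray, dist_thresh=3, pixel_thresh=60):
    """Drop interference feature points of one detection frame: points too close to
    the frame edges, or whose gray value differs too much from the center pixel."""
    l, t, r, b = box
    u, v = points[:, 0], points[:, 1]
    # Distance of each point to the nearest edge of the detection frame
    edge_dist = np.minimum.reduce([u - l, r - u, v - t, b - v])
    near_edge = edge_dist < dist_thresh
    # Pixel difference between each point and the center pixel of the frame
    cu, cy = int((l + r) / 2), int((t + b) / 2)
    center_val = int(gray[cy, cu])
    vals = gray[v.astype(int), u.astype(int)].astype(int)
    too_different = np.abs(vals - center_val) > pixel_thresh
    keep = ~(near_edge | too_different)
    return points[keep]
```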
S230, determining feature points to be processed belonging to the same object detection frame.
S240, if the preset object category in the object detection frame is a person, determining the height and the width of the person image in the object detection frame, and judging whether the ratio of the height to the width is smaller than a preset value.
It will be appreciated that a person and an object move differently: a person often moves only local joints, i.e. only one or two points are dynamic, and a person most often moves the hands. Thus, the dynamic determination condition for a person can be distinguished from that for an object.
Typically, only half of the body of a seated person is in motion, while the entire body of a standing person is in motion. Thus, this embodiment designs a special algorithm to determine whether a person is seated. In figure drawing, the length of the head is commonly used to describe the proportions of the parts of the human body. For example, when a person is seated, the body is about two head-lengths wide and about five head-lengths tall. The aspect ratio (height to width) of a standing person is typically 9/2 or 8/2, i.e. greater than 3 and typically about 4, while that of a seated person is about 5/2, i.e. less than 3. Based on this, the experience is applied in an algorithm to determine the human body posture: the aspect ratio of the person image in the object detection frame is compared with a preset value, and the ratio is determined as follows:

$$R_i = \frac{H_i}{W_i}$$

where $R_i$ represents the aspect ratio of the person image in the i-th object detection frame, and $W_i$ and $H_i$ represent its width and height respectively.
And S250, if so, dividing the object detection frame into a first object detection frame and a second object detection frame based on the horizontal direction.
The ratio of the first object detection frame height to the second object detection frame height is a preset ratio, and the first object detection frame height is larger than the second object detection frame height.
In this embodiment, the human hand is taken as the main detection target for feature recognition. To ensure that the hands always fall in the upper part, based on the human activity in the data set, when the aspect ratio of the person is smaller than the preset value the body is divided into two parts of 1/1.3 and 0.3/1.3 of the height. On this basis, the object detection frame is divided into a first object detection frame and a second object detection frame along the horizontal direction, the first detection frame corresponding to the upper part of the human body and the second detection frame corresponding to the lower part. In this way, the human hands are always assigned to the upper part of the human body, and when the person occupies most of the view, the movement of the person can easily be checked.
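A minimal sketch of the seated-person check and the 1/1.3 vs. 0.3/1.3 split described above, using the preset aspect-ratio value of 3 from the head-length reasoning in this embodiment and the (l, t, r, b) box convention used earlier:

```python
def split_person_box(box, preset_ratio=3.0, upper_fraction=1.0 / 1.3):
    """If the person box looks seated (height/width below the preset value),
    split it horizontally into an upper and a lower sub-box."""
    l, t, r, b = box
    width, height = r - l, b - t
    R = height / width                      # aspect ratio R_i = H_i / W_i
    if R >= preset_ratio:
        return None                         # standing person: keep the frame whole
    split_y = t + height * upper_fraction   # upper part takes 1/1.3 of the height
    upper_box = (l, t, r, split_y)          # first object detection frame (contains hands)
    lower_box = (l, split_y, r, b)          # second object detection frame
    return upper_box, lower_box
```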
S260, determining the state type of the object detection frame based on the total number of the feature points to be processed in the first object detection frame and the number of the feature points to be processed with the attribute type of the first object detection frame as the abnormal type.
Specifically, the ratio of the number of feature points to be processed whose attribute category is the abnormal category in the first object detection frame to the total number of feature points to be processed in the first object detection frame is calculated. The preset discrimination threshold corresponding to a person may be set sufficiently small; after obtaining the preset discrimination threshold for a person, the state category of the corresponding object detection frame is determined by comparing the ratio with that preset discrimination threshold. Specifically: if the ratio is larger than the preset discrimination threshold, the state category of the corresponding object detection frame is determined to be a dynamic detection frame; if the ratio is smaller than the preset discrimination threshold, the state category of the corresponding object detection frame is determined to be a static detection frame.
S270, determining the feature points to be processed, which are located outside the static object detection frame and in the dynamic object detection frame, as feature points to be removed.
The feature points to be removed are feature points to be processed, which are to be removed from a plurality of feature points to be processed.
In this embodiment, it is determined whether each feature point to be processed is located outside the static object detection frame and located inside the dynamic object detection frame, and if so, the corresponding feature point to be processed is determined as the feature point to be removed.
And S280, deleting the feature points to be removed from a plurality of feature points to be processed corresponding to the current video frame to obtain target feature points.
S290, determining the camera pose of the camera device arranged in the target vehicle based on the target feature points.
According to the technical scheme of this embodiment, when the state category of an object detection frame is determined and the preset object category in the object detection frame is a person, the height and width of the person image in the object detection frame are determined and it is judged whether the ratio of the height to the width is smaller than a preset value. If so, the object detection frame is divided into a first object detection frame and a second object detection frame along the horizontal direction, and the state category of the object detection frame is determined based on the total number of feature points to be processed in the first object detection frame and the number of feature points to be processed in the first object detection frame whose attribute category is the abnormal category. Because the hands are always located in the upper part of the human body and are the most frequently detected dynamic feature points, when the object in the object detection frame is a person, detection is focused on the upper part; when the person occupies most of the view, the movement of the person can easily be checked, so the static feature points can be determined more accurately, further improving the positioning precision of the visual SLAM in a dynamic environment.
Example III
Fig. 5 is a schematic structural diagram of a camera pose determining device of a vision SLAM according to a third embodiment of the present invention. As shown in fig. 5, the apparatus includes: the video frame acquisition module 310, the feature point attribute determination module 320, the feature point attribution determination module 330, the detection frame state determination module 340, the target point determination module 350, and the camera pose determination module 360.
The video frame acquisition module 310 is configured to acquire a video image based on a camera device deployed in a target vehicle during a driving process of the target vehicle;
the feature point attribute determining module 320 is configured to determine, for each video frame, a plurality of object detection frames and a plurality of feature points to be processed, where the object detection frames and the feature points to be processed include at least one preset object in a current video frame, and determine attribute categories of the feature points to be processed;
the feature point attribution determining module 330 is configured to determine feature points to be processed that belong to the same object detection frame;
the detection frame state determining module 340 is configured to determine a state class of a corresponding object detection frame according to the number of feature points to be processed in which the attribute class is an abnormal class in each object detection frame;
the target point determining module 350 is configured to screen the feature points to be processed based on the status category of each object detection frame, so as to obtain a plurality of target feature points;
A camera pose determination module 360, configured to determine a pose of the camera disposed in the target vehicle based on the target feature points.
According to the technical scheme of this embodiment, during driving of the target vehicle, video images are acquired by the camera device deployed in the target vehicle. For each video frame, a plurality of object detection frames containing at least one preset object and a plurality of feature points to be processed are determined in the current video frame, and the attribute category of each feature point to be processed is determined. The feature points to be processed that belong to the same object detection frame are then determined, and the state category of each object detection frame is determined according to the number of feature points to be processed of the abnormal category in that frame. The feature points to be processed are screened based on the state category of each object detection frame to obtain a plurality of target feature points, and finally the camera pose of the camera device deployed in the target vehicle is determined based on the target feature points. By judging the state category of each object detection frame and screening out the dynamic feature points in the current video frame according to that state category, the number of stable static points is guaranteed, and the camera pose is determined from more stable static points, so the pose positioning precision of the visual SLAM in a dynamic environment is improved. In addition, the method does not depend on depth information, so no special requirement is placed on the performance of the imaging device; the method is highly practical and can reduce the cost of simultaneous localization and mapping.
Optionally, the feature point attribute determining module 320 includes:
the feature point matching pair determining unit is used for performing epipolar processing on the current video frame and the previous video frame to obtain a plurality of feature point matching pairs and the epipolar line of each feature point matching pair; wherein the previous video frame is the video frame acquired at the acquisition time immediately before that of the current video frame;
an attribute category determining unit, for each of the feature point matching pairs, determining an attribute category of the feature point to be processed in the feature point matching pair based on a epipolar distance between the feature point to be processed and the corresponding epipolar line; the characteristic point matching pair comprises a history characteristic point in the previous video frame and a characteristic point to be processed matched with the history characteristic point in the current video frame.
Optionally, the attribute category determining unit includes:
the first judging subunit is used for judging that the attribute category of the feature point to be processed in the feature point matching pair is abnormal if the epipolar distance of the feature point matching pair is larger than the preset epipolar threshold;
and the second judging subunit is used for judging that the attribute category of the feature point to be processed in the feature point matching pair is normal if the epipolar distance of the feature point matching pair is smaller than the preset epipolar threshold.
Optionally, the feature point attribution determining module 330 includes:
a coordinate range determining unit, configured to determine an abscissa range of the object detection frame and an ordinate range of the object detection frame based on four-corner coordinates corresponding to the object detection frame;
and the characteristic point attribution determining unit is used for determining the characteristic points to be processed, which belong to the same object detection frame, of which the abscissa of the coordinate information is positioned in the abscissa range of the object detection frame and the ordinate is positioned in the ordinate range of the object detection frame.
Optionally, the detection frame status determining module 340 includes:
the abnormal point occupation ratio determining unit is used for determining an abnormal characteristic point occupation ratio based on the total number of the characteristic points to be processed belonging to the object detection frame and the number of the characteristic points to be processed with the attribute category being the abnormal category;
and the detection frame type determining unit is used for determining the state type of the corresponding object detection frame based on the abnormal characteristic point occupation value and a preset judging threshold value.
Optionally, the detection frame status determining module 340 further includes:
the preset object identification unit is used for determining the height and the width of the human image in the object detection frame if the preset object category in the object detection frame is a human, and judging whether the ratio of the height to the width is smaller than a preset value or not;
the detection frame dividing unit is used for dividing the object detection frame into a first object detection frame and a second object detection frame along the horizontal direction if the ratio is smaller than the preset value; the ratio of the height of the first object detection frame to the height of the second object detection frame is a preset ratio, and the height of the first object detection frame is larger than the height of the second object detection frame;
and the detection frame type determining unit is used for determining the state type of the object detection frame based on the total number of the feature points to be processed in the first object detection frame and the number of the feature points to be processed, wherein the attribute type in the first object detection frame is an abnormal type.
Optionally, the target point determining module 350 includes:
the rejecting point determining unit is used for determining to-be-processed characteristic points which are positioned outside the static object detection frame and in the dynamic object detection frame as to-be-rejected characteristic points;
and the target point determining unit is used for deleting the feature points to be removed from a plurality of feature points to be processed corresponding to the current video frame to obtain target feature points.
The camera pose determining device of the visual SLAM provided by the embodiment of the invention can execute the camera pose determining method of the visual SLAM provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example IV
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, a camera pose determination method of the visual SLAM.
In some embodiments, the camera pose determination method of the visual SLAM may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the camera pose determination method of the visual SLAM described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the camera pose determination method of the visual SLAM by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
It should be appreciated that the flows shown above may be reordered, and steps may be added or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A camera pose determining method based on visual SLAM is characterized by comprising the following steps:
acquiring a video image based on a camera device deployed in a target vehicle during the driving of the target vehicle;
for each video frame, determining, in the current video frame, a plurality of object detection frames each containing at least one preset object and a plurality of feature points to be processed, and determining an attribute category of each feature point to be processed;
determining the feature points to be processed which belong to the same object detection frame;
determining a state category of the corresponding object detection frame according to the number of feature points to be processed in each object detection frame whose attribute category is an abnormal category;
screening the feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points;
and determining the camera pose of the camera device arranged in the target vehicle based on the target feature points.
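For illustration only (the claim does not prescribe a particular solver), one conventional way to recover a relative camera pose from the retained target feature point matches is essential-matrix decomposition with OpenCV; the function name, the variable names, and the availability of the camera intrinsic matrix K are assumptions.

```python
import cv2
import numpy as np

def estimate_relative_pose(prev_pts, curr_pts, K):
    """Estimate relative rotation R and translation direction t between two
    frames from matched 2D feature points (e.g. the target feature points
    kept after dynamic-object filtering). Sketch only, not the claimed method."""
    prev_pts = np.asarray(prev_pts, dtype=np.float64)
    curr_pts = np.asarray(curr_pts, dtype=np.float64)
    # Essential matrix with RANSAC to suppress remaining outliers.
    E, inliers = cv2.findEssentialMat(prev_pts, curr_pts, K,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Decompose E and select the physically valid (R, t) by cheirality check.
    _, R, t, _ = cv2.recoverPose(E, prev_pts, curr_pts, K, mask=inliers)
    return R, t
```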
2. The method of claim 1, wherein determining the attribute category of each feature point to be processed comprises:
performing epipolar-geometry processing on the current video frame and the previous video frame to obtain a plurality of feature point matching pairs and the epipolar line of each feature point matching pair; wherein the previous video frame is the video frame acquired at the acquisition time immediately preceding that of the current video frame;
for each feature point matching pair, determining the attribute category of the feature point to be processed in the feature point matching pair based on the epipolar distance between the feature point to be processed and the corresponding epipolar line; wherein the feature point matching pair comprises a historical feature point in the previous video frame and the feature point to be processed in the current video frame which matches the historical feature point.
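For illustration only, a hedged sketch of how feature point matching pairs and their epipolar lines between the previous and the current video frame might be obtained with OpenCV; using the fundamental matrix with RANSAC is an assumption, not a statement of the claimed implementation.

```python
import cv2
import numpy as np

def epipolar_lines_for_matches(prev_pts, curr_pts):
    """For already matched feature points, compute the epipolar line in the
    current frame induced by each feature point of the previous frame."""
    prev_pts = np.asarray(prev_pts, dtype=np.float32)
    curr_pts = np.asarray(curr_pts, dtype=np.float32)
    # Fundamental matrix between the two frames (RANSAC rejects gross outliers).
    F, mask = cv2.findFundamentalMat(prev_pts, curr_pts, cv2.FM_RANSAC)
    # Lines are returned as (a, b, c) with a*x + b*y + c = 0, one per point,
    # expressed in the second (current) image.
    lines = cv2.computeCorrespondEpilines(prev_pts.reshape(-1, 1, 2), 1, F)
    return F, lines.reshape(-1, 3), mask
```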
3. The method of claim 1, wherein the determining the attribute category of the feature point to be processed in the feature point matching pair based on the epipolar distance between the feature point to be processed and the corresponding epipolar line comprises:
if the epipolar distance of a feature point matching pair is larger than a preset epipolar threshold, the attribute category of the feature point to be processed in the feature point matching pair is the abnormal category;
if the epipolar distance of a feature point matching pair is smaller than the preset epipolar threshold, the attribute category of the feature point to be processed in the feature point matching pair is the normal category.
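For illustration only, a minimal sketch of the epipolar-distance test of claim 3; the default threshold of 1.0 pixel is an assumed example value, not the preset epipolar threshold of the claim.

```python
import numpy as np

def classify_feature_point(pt, line, epipolar_threshold=1.0):
    """Label a feature point as 'abnormal' (likely on a moving object) or
    'normal' according to its distance to the corresponding epipolar line.

    line is (a, b, c) with a*x + b*y + c = 0; pt is (x, y)."""
    a, b, c = line
    x, y = pt
    dist = abs(a * x + b * y + c) / np.hypot(a, b)
    return "abnormal" if dist > epipolar_threshold else "normal"
```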
4. The method according to claim 1, wherein the determining feature points to be processed belonging to the same object detection frame comprises:
determining an abscissa range of the object detection frame and an ordinate range of the object detection frame based on the four-corner coordinates corresponding to the object detection frame;
and determining feature points to be processed whose abscissa lies within the abscissa range of the object detection frame and whose ordinate lies within the ordinate range of the object detection frame as feature points belonging to that object detection frame.
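For illustration only, a minimal sketch of the containment test of claim 4, deriving the abscissa and ordinate ranges from the four corner coordinates; the corner ordering and data layout are assumptions.

```python
def points_in_detection_frame(points, corners):
    """Group feature points that fall inside an object detection frame
    described by its four corner coordinates (corner order not assumed)."""
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    x_min, x_max = min(xs), max(xs)   # abscissa range of the frame
    y_min, y_max = min(ys), max(ys)   # ordinate range of the frame
    return [(x, y) for (x, y) in points
            if x_min <= x <= x_max and y_min <= y <= y_max]
```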
5. The method according to claim 1, wherein the determining the state category of the corresponding object detection frame according to the number of feature points to be processed in each object detection frame whose attribute category is the abnormal category comprises:
determining an abnormal feature point proportion based on the total number of feature points to be processed belonging to the object detection frame and the number of those feature points whose attribute category is the abnormal category;
and determining the state category of the corresponding object detection frame based on the abnormal feature point proportion and a preset discrimination threshold.
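For illustration only, a minimal sketch of the decision rule of claim 5; the 0.5 discrimination threshold and the handling of empty detection frames are assumptions.

```python
def frame_state(total_points, abnormal_points, discrimination_threshold=0.5):
    """Decide whether an object detection frame is 'dynamic' or 'static'
    from the proportion of abnormal feature points it contains."""
    if total_points == 0:
        return "static"  # no evidence of motion inside the frame (assumption)
    ratio = abnormal_points / total_points
    return "dynamic" if ratio > discrimination_threshold else "static"
```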
6. The method of claim 1, wherein the preset objects include persons and objects, and wherein the determining the state category of the corresponding object detection frame based on the abnormal feature point proportion and the preset discrimination threshold comprises:
if the preset object category in the object detection frame is a person, determining the height and the width of the person image in the object detection frame, and judging whether the ratio of the height to the width is smaller than a preset value;
if yes, dividing the object detection frame into a first object detection frame and a second object detection frame along the horizontal direction; wherein the ratio of the height of the first object detection frame to the height of the second object detection frame is a preset ratio, and the height of the first object detection frame is larger than the height of the second object detection frame;
and determining the state category of the object detection frame based on the total number of feature points to be processed in the first object detection frame and the number of feature points to be processed in the first object detection frame whose attribute category is the abnormal category.
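For illustration only, a minimal sketch of the frame-splitting rule of claim 6; the preset value, the preset ratio, and the assumption that the first (taller) frame is the upper part of the person image are illustrative assumptions.

```python
def split_person_frame(box, preset_value=1.2, preset_ratio=2.0):
    """Split a 'person' detection frame horizontally when its height/width
    ratio is below the preset value, so that the first (upper, taller) part
    can dominate the motion decision. Box is (x_min, y_min, x_max, y_max)
    in image coordinates (y grows downward)."""
    x_min, y_min, x_max, y_max = box
    height, width = y_max - y_min, x_max - x_min
    if height / width >= preset_value:
        return [box]                                  # no split needed
    # First-frame height : second-frame height = preset_ratio : 1.
    upper_h = height * preset_ratio / (preset_ratio + 1.0)
    split_y = y_min + upper_h
    first = (x_min, y_min, x_max, split_y)            # first (upper) frame
    second = (x_min, split_y, x_max, y_max)           # second (lower) frame
    return [first, second]
```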
7. The method according to claim 1, wherein the state category of an object detection frame is either a static object detection frame or a dynamic object detection frame, and the screening the plurality of feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points comprises:
determining feature points to be processed which are located outside the static object detection frames and inside the dynamic object detection frames as feature points to be removed;
and deleting the feature points to be removed from the plurality of feature points to be processed corresponding to the current video frame to obtain the target feature points.
8. A camera pose determining device based on visual SLAM, comprising:
the video frame acquisition module is used for acquiring video images based on a camera device arranged in the target vehicle in the running process of the target vehicle;
the feature point attribute determining module is used for determining, for each video frame, a plurality of object detection frames each containing at least one preset object and a plurality of feature points to be processed in the current video frame, and determining an attribute category of each feature point to be processed;
the feature point attribution determining module is used for determining the feature points to be processed which belong to the same object detection frame;
the detection frame state determining module is used for determining a state category of the corresponding object detection frame according to the number of feature points to be processed in each object detection frame whose attribute category is an abnormal category;
the target point determining module is used for screening the plurality of feature points to be processed based on the state category of each object detection frame to obtain a plurality of target feature points;
and the camera pose determining module is used for determining the camera pose of the camera device arranged in the target vehicle based on the target characteristic points.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by one or more processors, cause the one or more processors to implement the vision SLAM-based camera pose determination method of any of claims 1-7.
10. A storage medium containing computer executable instructions, which when executed by a computer processor are for performing the vision SLAM-based camera pose determination method of any of claims 1-7.
CN202311481179.2A 2023-11-08 2023-11-08 Method, device, equipment and medium for determining camera pose of visual SLAM Pending CN117495965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311481179.2A CN117495965A (en) 2023-11-08 2023-11-08 Method, device, equipment and medium for determining camera pose of visual SLAM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311481179.2A CN117495965A (en) 2023-11-08 2023-11-08 Method, device, equipment and medium for determining camera pose of visual SLAM

Publications (1)

Publication Number Publication Date
CN117495965A true CN117495965A (en) 2024-02-02

Family

ID=89682406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311481179.2A Pending CN117495965A (en) 2023-11-08 2023-11-08 Method, device, equipment and medium for determining camera pose of visual SLAM

Country Status (1)

Country Link
CN (1) CN117495965A (en)

Similar Documents

Publication Publication Date Title
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
CN111060115B (en) Visual SLAM method and system based on image edge features
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN109559330B (en) Visual tracking method and device for moving target, electronic equipment and storage medium
CN113313763B (en) Monocular camera pose optimization method and device based on neural network
CN113012200B (en) Method and device for positioning moving object, electronic equipment and storage medium
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN112085789A (en) Pose estimation method, device, equipment and medium
CN112597837A (en) Image detection method, apparatus, device, storage medium and computer program product
CN113378712A (en) Training method of object detection model, image detection method and device thereof
CN116740355A (en) Automatic driving image segmentation method, device, equipment and storage medium
CN116229248A (en) Ocean species distribution prediction method, device, equipment and storage medium
CN114677655A (en) Multi-sensor target detection method and device, electronic equipment and storage medium
CN117372928A (en) Video target detection method and device and related equipment
CN114299192B (en) Method, device, equipment and medium for positioning and mapping
CN117495965A (en) Method, device, equipment and medium for determining camera pose of visual SLAM
CN113916223B (en) Positioning method and device, equipment and storage medium
CN114461078A (en) Man-machine interaction method based on artificial intelligence
CN113936158A (en) Label matching method and device
CN115170914A (en) Pose estimation method and device, electronic equipment and storage medium
CN113177545B (en) Target object detection method, target object detection device, electronic equipment and storage medium
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN116228834B (en) Image depth acquisition method and device, electronic equipment and storage medium
CN113361524B (en) Image processing method and device
CN116597213A (en) Target detection method, training device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination