CN113362363B - Image automatic labeling method and device based on visual SLAM and storage medium - Google Patents

Image automatic labeling method and device based on visual SLAM and storage medium Download PDF

Info

Publication number
CN113362363B
Authority
CN
China
Prior art keywords
point cloud
labeling
image
area
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110676419.9A
Other languages
Chinese (zh)
Other versions
CN113362363A (en)
Inventor
曾碧
王俊丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110676419.9A priority Critical patent/CN113362363B/en
Publication of CN113362363A publication Critical patent/CN113362363A/en
Application granted granted Critical
Publication of CN113362363B publication Critical patent/CN113362363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/187Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an image automatic labeling method, device and storage medium based on visual SLAM; the method comprises the steps of processing a current frame image to obtain a labeling frame, and then automatically labeling video images of other different frames to correct the initial labeling frame to obtain final accurate labeling data; the method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use. The invention can be widely applied to the technical field of image processing.

Description

Image automatic labeling method and device based on visual SLAM and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a device for automatically labeling images based on visual SLAM and a storage medium.
Background
With the development of computer vision, artificial intelligence techniques such as machine learning and deep learning have been widely applied to image classification, target detection, image segmentation and related research, and have achieved breakthrough results. However, training these models depends on a large amount of task-related annotated image data, and before a model can be put into application, image data must be collected and annotated manually for the business scenario. At present this stage is time-consuming and restricts the overall research and development progress.
At present, images to be labeled are sometimes identified with a deep-learning model and labeled accordingly, but this approach requires image data to be collected in advance for training and a corresponding candidate label set to be established, so it cannot be used directly when a scene is labeled for the first time. Moreover, the recognition accuracy depends on the scale of the data set and the probability distribution of each label, the objects that can be labeled are limited, and the approach therefore lacks general applicability.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides an image automatic annotation method and device based on visual SLAM and a storage medium.
The technical scheme adopted by the invention is as follows:
in one aspect, an embodiment of the present invention includes an automatic image annotation method based on visual SLAM, including:
acquiring at least two continuous images of a real-time camera or a video, and constructing by utilizing an ORB-SLAM algorithm to obtain a three-dimensional point cloud space;
acquiring a new frame of image as the current frame image, and solving the PnP (Perspective-n-Point) problem to obtain the camera pose;
taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space;
processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest point cloud number;
calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method;
constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters;
projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
performing image segmentation on the rectangular frame to identify a subject;
adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
comparing the areas of the rectangular frame and the labeling frame;
if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a target frame image from the video, wherein the target frame image is any subsequent frame image except the current frame image;
acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum oriented bounding box;
and if the quantity of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area.
Further, the step of acquiring at least two frames of images in a real-time camera or video and constructing a three-dimensional point cloud space by using an ORB-SLAM algorithm includes:
calibrating a camera to obtain a camera internal parameter matrix and a distortion parameter;
acquiring at least two continuous frames of images in a real-time camera or video;
sequentially extracting ORB characteristic points from the two frames of continuous images and calculating corresponding binary descriptors;
according to the ORB feature points, feature point matching is carried out on the two frames of continuous images to obtain image feature points;
estimating the camera motion between two frames of continuous images by using epipolar geometric constraint, and calculating to obtain a camera pose;
and calculating the three-dimensional coordinates of the image characteristic points by utilizing triangulation measurement according to the camera internal reference matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
Further, the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area includes:
taking the current frame image as the selection area and framing a first area within the selection area;
converting the first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
Further, the step of obtaining a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space includes:
acquiring two-dimensional feature points in the labeling area;
and performing back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
Further, the step of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set includes:
selecting a first point cloud as a target point cloud from the first three-dimensional point cloud set, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
judging whether the target point cloud is an edge noise point;
if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
if the target point cloud is not an edge noise point, finding out all point clouds with the reachable density from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;
traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
and selecting the point cloud cluster with the largest number of point clouds as a second three-dimensional point cloud set.
Further, the step of constructing a camera perspective matrix according to the camera pose and the camera internal parameters includes:
transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to a first formula, the first formula being: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image BDA0003120715080000031);
constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images BDA0003120715080000032 and BDA0003120715080000033; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
Further, the step of filtering the first newly added point cloud data to obtain second newly added point cloud data includes:
filtering the first new point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;
and performing conditional filtering according to the three-dimensional space range of the minimum directed bounding box to obtain second newly-added point cloud data.
Further, if the number of the second newly added point cloud data is not greater than a second threshold, taking the target frame image as a current frame image, and performing a step of projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box.
In another aspect, an embodiment of the present invention further includes an apparatus for automatically labeling an image based on a visual SLAM, including:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method for automatic annotation of images based on visual SLAM.
In another aspect, an embodiment of the present invention further includes a computer readable storage medium, on which a processor-executable program is stored, where the processor-executable program is used to implement the method for automatically labeling images based on visual SLAM when being executed by a processor.
The invention has the beneficial effects that:
the invention provides an automatic image annotation method based on visual SLAM, which comprises the following steps: acquiring a video image, constructing and obtaining a three-dimensional point cloud space and a camera pose by utilizing an ORB-SLAM algorithm, selecting a marking area in a selection area frame by taking a current frame image as a selection area, and marking corresponding semantic information in the marking area; acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to the three-dimensional point cloud space; processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set; calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method; constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters; projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box; performing image segmentation on the rectangular frame to identify a subject; adjusting the rectangular frame according to the identified main body to obtain a labeling frame; comparing the areas of the rectangular frame and the marking frame; if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold value, taking the labeling frame as a selection area, selecting a labeling area in the selection area frame, and labeling corresponding semantic information in the labeling area; so that the labeling result of the current frame image can be generated. And finally, automatically adjusting the labeling frame by combining the newly added point cloud data of different video frames, thereby automatically generating a labeling result of each frame of image. The method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating steps of a method for automatic image annotation based on visual SLAM according to an embodiment of the present invention;
fig. 2 is a specific flowchart of an automatic image annotation method based on visual SLAM according to an embodiment of the present invention.
FIG. 3 is a flowchart of acquiring at least two frames of images in a video and constructing a three-dimensional point cloud space by using an ORB-SLAM algorithm according to an embodiment of the present invention;
FIG. 4 is a flow chart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set according to the embodiment of the present invention;
FIG. 5 is a detailed flowchart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set according to the embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a comparison between a rectangular frame and a labeled frame according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an automatic image labeling apparatus based on visual SLAM according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; terms such as greater than, less than and exceeding are understood as excluding the stated number, while terms such as above, below and within are understood as including the stated number. Where "first" and "second" are used only to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, an embodiment of the present invention provides an automatic image annotation method based on visual SLAM, which includes, but is not limited to, the following steps:
s100, acquiring at least two continuous images in a real-time camera or video, and constructing to obtain a three-dimensional point cloud space by utilizing an ORB-SLAM algorithm;
s200, acquiring a new frame of image as a current frame of image, and solving through a PnP problem to obtain a camera pose;
s300, taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
s400, acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to a three-dimensional point cloud space;
s500, processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest number of point clouds;
s600, calculating to obtain a minimum directed bounding box of a second three-dimensional point cloud set based on a principal component analysis method;
s700, constructing to obtain a camera perspective matrix according to the camera pose and camera internal parameters;
s800, projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
s900, carrying out image segmentation on the rectangular frame to identify a main body;
s1000, adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
s1100, comparing the areas of the rectangular frame and the marking frame;
s1200, if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
s1300, acquiring a target frame image from the video, wherein the target frame image is any one frame image except the current frame image;
s1400, acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
s1500, filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum directed bounding box;
and S1600, if the quantity of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area.
In this embodiment, an automatic image annotation method based on visual SLAM is provided to address the problems that manual annotation of image data takes a long time and restricts research and development progress. The method mainly comprises the following steps: firstly, acquiring a video image, constructing a three-dimensional point cloud space and the camera pose based on a visual SLAM algorithm, and constructing an augmented reality scene based on an OpenGL coordinate system in combination with the camera intrinsic matrix; secondly, according to the two-dimensional feature points in the area framed once by the user, performing back projection to obtain the corresponding three-dimensional point cloud data, clustering the three-dimensional point cloud data, and solving the minimum oriented bounding box; the virtual bounding box wrapping the entity can then be projected onto the two-dimensional image, and a preliminary rectangular labeling frame is generated by combining the camera pose and the projection matrix; next, the subject is identified based on an image segmentation algorithm, the framed area is adjusted accordingly, and the current labeling result is generated; finally, the virtual bounding box is further adjusted by combining the point cloud data newly added under different viewing angles with the automatically adjusted two-dimensional framed area, so that the labeling data of each frame of image are generated automatically. By performing data annotation at the image source, this automatic image annotation method can greatly reduce manual image annotation work, improve the efficiency of acquiring image annotation data, and effectively accelerate the research and development of computer vision model training.
In the embodiment, the visual SLAM algorithm adopts an ORB-SLAM algorithm; the point cloud clustering algorithm adopts a density-based DBSCAN algorithm; solving the minimum directed Bounding Box algorithm adopts a directed Bounding Box algorithm (Oriented Bounding Box); the image segmentation algorithm adopts a region growing algorithm.
Specifically, regarding step S100, in this embodiment a three-dimensional point cloud space can be constructed with the ORB-SLAM algorithm from at least two consecutive frames acquired from a real-time camera or a video; of course, more than two consecutive frames can also be used. Referring to fig. 3, the specific operation process is as follows:
s101, calibrating a camera to obtain an internal parameter matrix and distortion parameters of the camera;
s102, acquiring at least two continuous frames of images in a real-time camera or video;
s103, sequentially extracting ORB characteristic points from two continuous frames of images and calculating corresponding binary descriptors;
s104, according to the ORB feature points, performing feature point matching on two continuous frames of images to obtain image feature points;
s105, estimating the camera motion between two continuous frames of images by using epipolar geometric constraint, and calculating to obtain the camera pose;
and S106, calculating the three-dimensional coordinates of the image feature points by utilizing triangulation measurement according to the camera internal parameter matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
In this embodiment, the camera used to acquire the video images is calibrated to obtain its intrinsic matrix and distortion parameters. ORB feature points are extracted from two consecutive frames and their ORB descriptors are computed; feature point matching is then performed between the two consecutive frames to obtain image feature points. Finally, the camera motion between the two consecutive frames is solved from the epipolar constraint, and the three-dimensional coordinates of the image feature points are computed by triangulation, so that the three-dimensional point cloud space is constructed. If more than two frames of video images are acquired, for example a video clip is captured, the camera is likewise calibrated to obtain its intrinsic matrix and distortion parameters, ORB feature points and their descriptors are extracted from the video images in time order, feature point matching is performed between adjacent frames, the camera motion between adjacent frames is solved from the epipolar constraint, and the three-dimensional coordinates of the image feature points are computed by triangulation to construct the three-dimensional point cloud space.
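As an illustration of this two-view initialization, the following sketch uses OpenCV primitives (ORB extraction, Hamming matching, essential-matrix pose recovery, triangulation) rather than the full ORB-SLAM pipeline; the camera matrix K and distortion coefficients dist are assumed to come from the calibration step described above.

```python
import cv2
import numpy as np

def initialize_map(img1, img2, K, dist):
    """Two-view initialization: ORB matching, epipolar geometry, triangulation."""
    # Undistort the two consecutive frames with the calibrated intrinsics
    img1 = cv2.undistort(img1, K, dist)
    img2 = cv2.undistort(img2, K, dist)

    # Extract ORB feature points and their binary descriptors
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Match feature points between the two frames (Hamming distance for binary descriptors)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Epipolar constraint: essential matrix and relative camera motion (R, t)
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the matched feature points to build the initial 3D point cloud
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    cloud = (pts4d[:3] / pts4d[3]).T   # N x 3 points, first camera taken as the world origin
    return cloud, R, t
```

In the full ORB-SLAM system, keyframe management, loop closing and map optimization further refine this initial map and the camera trajectory.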
In this embodiment, for step S200, the PnP problem is solved for each subsequent image frame in the video, so that the camera pose is computed. In addition, the ORB-SLAM algorithm performs loop-closure detection, relocalization and graph optimization on key frames, which yields a more accurate camera pose, namely the rotation matrix R and translation vector t of the camera.
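A minimal sketch of the PnP pose estimation for a subsequent frame, using OpenCV's RANSAC-based solver; the 2D–3D correspondences are assumed to be supplied by the SLAM front end's keypoint-to-map-point matching, and the threshold values are illustrative.

```python
import cv2
import numpy as np

def estimate_pose_pnp(map_points, image_points, K, dist):
    """Solve the PnP problem from 3D map points and their 2D observations in the current frame."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(map_points, dtype=np.float64),    # N x 3 world coordinates
        np.asarray(image_points, dtype=np.float64),  # N x 2 pixel coordinates
        K, dist,
        reprojectionError=3.0, iterationsCount=100)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent 2D-3D correspondences")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec               # camera pose: rotation matrix R and translation vector t
```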
For step S300, that is, the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area, the step specifically includes:
s301, taking the current frame image as the selection area and framing a first area within the selection area;
s302, converting a first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and S303, taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
In this embodiment, a user needs to perform first frame selection on a target object of a current frame image, specifically, the current frame image is used as a selection area, and an area where the target object is located is selected as a labeling area in the selection area. Then, according to the labeled area framed and selected by the user on the current frame image, the labeled area is converted into a pixel area according to the ratio of the window size to the image pixel width and height, and the semantic information labeled in the area is stored.
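The window-to-pixel conversion is a simple scaling by the ratio between the display-window size and the image pixel size; a minimal sketch (parameter names are illustrative):

```python
def window_box_to_pixel_box(box, window_w, window_h, img_w, img_h):
    """Scale a box framed in display-window coordinates into image pixel coordinates."""
    sx, sy = img_w / window_w, img_h / window_h
    x0, y0, x1, y1 = box
    return x0 * sx, y0 * sy, x1 * sx, y1 * sy
```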
In this embodiment, step S400, that is, the step of obtaining the first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space, specifically includes:
s401, acquiring two-dimensional feature points in a labeling area;
s402, performing back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
In this embodiment, it is considered that the labeled area on the current frame image is a two-dimensional plane, and therefore, the point clouds in the labeled area are all two-dimensional feature points. Therefore, the two-dimensional feature points need to be subjected to back projection according to the previously constructed three-dimensional point cloud space, so as to obtain a first three-dimensional point cloud set corresponding to the two-dimensional feature points.
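One way to realize this back projection, assuming the SLAM front end already associates each tracked keypoint of the current frame with a map point, is to collect the map points whose observations fall inside the labeling area; the structure below is a hypothetical sketch of that lookup.

```python
import numpy as np

def first_point_cloud_set(keypoints, map_points, label_box):
    """Collect the 3D map points whose 2D observations lie inside the labeling area.

    keypoints : N x 2 pixel coordinates of tracked feature points in the current frame
    map_points: length-N list of 3D world points associated with those keypoints (None if untracked)
    label_box : (u_min, v_min, u_max, v_max) labeling area in pixels
    """
    u_min, v_min, u_max, v_max = label_box
    cloud = [p for (u, v), p in zip(keypoints, map_points)
             if p is not None and u_min <= u <= u_max and v_min <= v <= v_max]
    return np.array(cloud)   # first three-dimensional point cloud set
```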
In this embodiment, after obtaining the first three-dimensional point cloud set, the first three-dimensional point cloud set needs to be further processed by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, where the second three-dimensional point cloud set is a point cloud cluster with the largest number of point clouds. That is to say, after obtaining the first three-dimensional point cloud set, step S500 is executed, and referring to fig. 4, step S500 specifically includes:
s501, selecting a first point cloud from a first three-dimensional point cloud set as a target point cloud, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
s502, judging whether the target point cloud is an edge noise point;
s503, if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
s504, if the target point cloud is not an edge noise point, finding out all point clouds with the reachable density from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;
s505, traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
s506, selecting the point cloud cluster with the largest point cloud number as a second three-dimensional point cloud set.
Specifically, referring to fig. 5, in this embodiment a neighborhood radius and a minimum neighborhood point-count threshold are set according to the business scenario; then one three-dimensional point cloud is arbitrarily selected from the first three-dimensional point cloud set obtained in step S400 as a core point P, and all three-dimensional point clouds that are density-reachable from the point cloud P are found to form a point cloud cluster; if the selected point cloud is an edge noise point, another three-dimensional point cloud is selected as the core point, and this procedure is repeated until all point clouds in the first three-dimensional point cloud set have been visited. Finally, the point cloud cluster containing the largest number of three-dimensional point clouds is kept as the second three-dimensional point cloud set.
In this embodiment, if the number of the neighboring points of the point cloud P is not greater than the preset threshold, the point cloud P is marked as an edge noise point, and at this time, another point cloud is reselected from the first three-dimensional point cloud set as a core point. And finally, selecting the point cloud cluster with the largest number of point clouds from all the obtained point cloud clusters as a second three-dimensional point cloud set. In this embodiment, after the second three-dimensional point cloud set is obtained, a small amount of remaining external noise points need to be removed.
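A sketch of this clustering step using scikit-learn's DBSCAN; eps and min_samples stand for the neighborhood radius and the neighborhood point-count threshold chosen for the business scenario, and only the cluster with the most points is kept.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_cluster(cloud, eps=0.05, min_samples=10):
    """Cluster the point set with DBSCAN and keep the cluster with the most points."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(cloud)
    valid = labels[labels != -1]           # -1 marks edge/noise points
    if valid.size == 0:
        return np.empty((0, 3))
    best = np.bincount(valid).argmax()     # cluster id with the largest member count
    return cloud[labels == best]           # second three-dimensional point cloud set
```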
In step S600, in this embodiment, after the second three-dimensional point cloud set is obtained, the minimum oriented bounding box of the second three-dimensional point cloud set is solved based on Principal Component Analysis (PCA), that is, the center position, size and rotation matrix of the minimum oriented bounding box are obtained.
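A minimal PCA-based oriented-bounding-box computation consistent with the description above: the principal axes of the cluster give the box rotation, and the extents along those axes give its center and size.

```python
import numpy as np

def oriented_bounding_box(points):
    """Oriented bounding box of a point cluster from its principal axes (PCA)."""
    mean = points.mean(axis=0)
    centered = points - mean
    cov = np.cov(centered.T)                  # 3 x 3 covariance matrix
    _, axes = np.linalg.eigh(cov)             # columns of `axes` are the principal directions
    local = centered @ axes                   # express points in the principal-axis frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((lo + hi) / 2.0)  # box center back in world coordinates
    size = hi - lo                            # box extents along each principal axis
    return center, size, axes                 # `axes` doubles as the box rotation matrix
```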
In this embodiment, step S700, namely, the step of constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal reference, specifically includes:
S701, transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to the first formula: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image BDA0003120715080000091);
S702, constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images BDA0003120715080000092 and BDA0003120715080000093; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
In this embodiment, the camera pose data are thus transferred from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o), and the camera perspective matrix P is constructed by incorporating the camera intrinsics (u_0, v_0, f_u, f_v).
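The exact matrix entries appear in the drawings as formula images; as a hedged illustration, the sketch below builds a standard OpenGL off-center frustum from the intrinsics (u_0, v_0, f_u, f_v) and the image size (w, h). The particular left/right/top/bottom expressions are assumptions consistent with the quantities named above, not a verbatim transcription of the patent formula.

```python
import numpy as np

def perspective_from_intrinsics(u0, v0, fu, fv, w, h, near=0.01, far=100.0):
    """OpenGL-style perspective matrix from pinhole intrinsics (assumed frustum mapping)."""
    # Assumption: the principal point (u0, v0) offsets an off-center frustum on the near plane
    left = -near * u0 / fu
    right = near * (w - u0) / fu
    top = near * v0 / fv
    bottom = -near * (h - v0) / fv
    return np.array([
        [2 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ])
```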
In this embodiment, after the camera perspective matrix P is constructed, step S800 is performed, that is, the minimum directional bounding box is projected to the current frame image according to the camera perspective matrix, so as to form an augmented reality scene, and a rectangular frame corresponding to the minimum directional bounding box is obtained. Specifically, the minimum directional bounding box is projected onto the current image, forming an augmented reality scene. And comparing the pixel coordinates after 8 vertexes of the minimum directed bounding box are projected, and keeping the maximum and minimum u and v coordinate values to form the upper left corner coordinates and the lower right corner coordinates of the rectangular frame.
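A sketch of this projection step: the eight corners of the oriented bounding box are projected into the current image and the extreme u, v values are kept as the rectangle. For clarity the sketch projects with the pinhole model K[R|t] rather than through the OpenGL pipeline; either route yields equivalent pixel coordinates up to the viewport mapping.

```python
import itertools
import numpy as np

def obb_to_rect(center, size, axes, R, t, K):
    """Project the 8 corners of the oriented bounding box and keep the extreme u, v values."""
    half = size / 2.0
    # Enumerate the 8 corners in the box frame, then move them to world coordinates
    corners = np.array([center + axes @ (np.array(s) * half)
                        for s in itertools.product((-1.0, 1.0), repeat=3)])
    cam = R @ corners.T + t.reshape(3, 1)   # world -> camera frame
    pix = K @ cam
    uv = (pix[:2] / pix[2]).T               # perspective division -> pixel coordinates
    u_min, v_min = uv.min(axis=0)
    u_max, v_max = uv.max(axis=0)
    return u_min, v_min, u_max, v_max       # top-left and bottom-right of the rectangular frame
```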
In this embodiment, after the rectangular frame is obtained, considering that it covers a large range and may not tightly wrap the labeling object (the subject), the rectangular frame needs further processing, and steps S900 and S1000 are performed: the image inside the rectangular frame is segmented to identify the subject, and the rectangular frame is then adjusted according to the identified subject to obtain the labeling frame. Specifically, the rectangular frame is enlarged appropriately, the image inside it is segmented to identify the subject, the size of the rectangular frame is adjusted so that it tightly wraps the subject, and the adjusted rectangular frame is used as the labeling frame. Referring to fig. 6, fig. 6 is a schematic diagram comparing the rectangular frame and the labeling frame.
In this embodiment, image segmentation is performed by using a region growing algorithm, specifically, a central point of a rectangular frame is taken as a seed point, the seed point is marked as an accessed point, 4 neighborhood expansion is performed at the seed point, if an absolute value of a gray value difference between a neighborhood pixel and the seed pixel is smaller than a threshold value, the neighborhood pixel and the seed pixel are merged into the same region, and the neighborhood point is added into a set for storage. And taking points which are not visited from the set as seed points, and repeating the actions until the set is empty. And taking a rectangular frame formed by the new area as a marking frame.
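A compact sketch of the 4-neighborhood region growing described above: the rectangle center is the initial seed, each popped point acts as the next seed, a neighbor is merged when its gray-value difference from the seed is below the threshold, and the bounding rectangle of the grown region becomes the labeling frame.

```python
import numpy as np
from collections import deque

def region_grow_box(gray, seed, thresh=10):
    """Grow a region from the seed pixel over the 4-neighborhood by gray-value similarity."""
    h, w = gray.shape
    visited = np.zeros((h, w), dtype=bool)
    region, queue = [], deque([seed])
    visited[seed[1], seed[0]] = True                 # seed given as (u, v) = (column, row)
    while queue:
        u, v = queue.popleft()                       # each popped point acts as the next seed
        region.append((u, v))
        for du, dv in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nu, nv = u + du, v + dv
            if (0 <= nu < w and 0 <= nv < h and not visited[nv, nu]
                    and abs(int(gray[nv, nu]) - int(gray[v, u])) < thresh):
                visited[nv, nu] = True
                queue.append((nu, nv))
    pts = np.array(region)
    # Bounding rectangle of the grown region = adjusted labeling frame
    return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()
```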
In this embodiment, after the labeling frame is obtained, the areas of the rectangular frame and the labeling frame are further compared. If the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than the first threshold, the labeling frame is taken as the selection area, and the step of framing a labeling area within the selection area and labeling corresponding semantic information in the labeling area is executed; that is, step S300 is repeated with the labeling frame as the selection area until the ratio of the area of the resulting labeling frame to the area of the rectangular frame is not smaller than the first threshold. If the ratio of the area of the labeling frame to the area of the rectangular frame is not smaller than the first threshold, the labeling frame is the labeling data required for the current frame image.
After the current frame image is labeled, new three-dimensional point cloud data are added for other video frame images because each frame corresponds to a different viewing angle; therefore, when other video frame images are processed, the labeling data are corrected through steps S1300–S1600. Specifically, the next video frame is selected and step S1400 is executed first, that is, the first newly added point cloud data of that frame are acquired based on the visual SLAM algorithm; step S1500 is then executed, that is, the first newly added point cloud data are filtered to obtain the second newly added point cloud data, which are the newly added point cloud data inside the minimum oriented bounding box; this step specifically includes:
s1501, filtering the first new point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;
s1502, performing conditional filtering according to the three-dimensional space range of the minimum directed bounding box to obtain second newly-added point cloud data.
After the second newly added point cloud data are obtained, step S1600 is executed, that is, it is judged whether the quantity of the second newly added point cloud data is greater than the second threshold. If it is greater than the second threshold, the frame of video image is taken as the selection area, a labeling area is framed within the selection area, and corresponding semantic information is labeled in the labeling area. In other words, if the quantity of the second newly added point cloud data is greater than the second threshold, the frame differs considerably from the previously processed current frame image; in that case the frame of video image is taken as the selection area and step S300 is executed again to obtain a new labeling frame corresponding to that frame of video image.
In this embodiment, if the number of the second newly added point cloud data is not greater than the second threshold, the target frame image is used as the current frame image, and the step of projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box is performed. That is to say, if the number of the second newly added point cloud data is not greater than the second threshold, it indicates that the difference between the frame of video image and the current frame of image processed before is not great, and at this time, the step S800 is executed again only by recalculating the projection condition of the minimum directional bounding box according to the view angle of the frame of video image. In this embodiment, a specific flow of the automatic image annotation method based on the visual SLAM may refer to fig. 2.
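A sketch of this per-frame decision: the ground plane is removed from the newly added points with a plain random-sample-consensus plane fit (a NumPy stand-in for the algorithm named above), the remaining points are conditionally filtered to the oriented bounding box, and the count is compared with the second threshold; the function names and default threshold values are illustrative.

```python
import numpy as np

def remove_ground(points, dist_thresh=0.02, iters=200, seed=0):
    """Random-sample-consensus plane fit; drop the points on the dominant (ground) plane."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:
            continue                                  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        inliers = np.abs((points - p1) @ n) < dist_thresh
        if inliers.sum() > best.sum():
            best = inliers
    return points[~best]

def points_in_obb(points, center, size, axes):
    """Conditional filter: keep only points inside the minimum oriented bounding box."""
    local = (points - center) @ axes                  # express points in the box frame
    return points[np.all(np.abs(local) <= size / 2.0, axis=1)]

def needs_relabel(new_points, center, size, axes, second_threshold=30):
    """True when enough new points fall inside the box that the labeling step should be redone."""
    second_new = points_in_obb(remove_ground(new_points), center, size, axes)
    return len(second_new) > second_threshold
```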
The image automatic labeling method based on the visual SLAM has the following technical effects:
the embodiment of the invention provides an automatic image annotation method based on visual SLAM, which comprises the following steps: acquiring a video image, constructing and obtaining a three-dimensional point cloud space and a camera pose by utilizing an ORB-SLAM algorithm, selecting a marking area in a selection area frame by taking a current frame image as a selection area, and marking corresponding semantic information in the marking area; acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to the three-dimensional point cloud space; processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set; calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method; constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters; projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box; performing image segmentation on the rectangular frame to identify a subject; adjusting the rectangular frame according to the identified main body to obtain a marking frame; comparing the areas of the rectangular frame and the marking frame; if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold value, taking the labeling frame as a selection area, selecting a labeling area in the selection area frame, and labeling corresponding semantic information in the labeling area; so that the labeling result of the current frame image can be generated. And finally, automatically adjusting the labeling frame by combining the newly added point cloud data of different video frames, thereby automatically generating a labeling result of each frame of image. The method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use.
Referring to fig. 7, an embodiment of the present invention further provides an automatic image annotation apparatus 200 based on a visual SLAM, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
when the at least one program is executed by the at least one processor 210, the at least one processor 210 is caused to implement the method as shown in fig. 1.
The memory 220, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. The memory 220 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 220 may optionally include remote memory located remotely from processor 210, and such remote memory may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood that the device structure shown in fig. 7 does not constitute a limitation of device 200, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
In the apparatus 200 shown in fig. 7, the processor 210 may retrieve the program stored in the memory 220 and execute, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiments of the apparatus 200 are merely illustrative, and the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores a program executable by a processor, and the program executable by the processor is used for implementing the method shown in fig. 1 when being executed by the processor.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
It will be understood that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media, as is well known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (9)

1. An automatic image annotation method based on visual SLAM is characterized by comprising the following steps:
acquiring at least two continuous images of a real-time camera or a video, and constructing by utilizing an ORB-SLAM algorithm to obtain a three-dimensional point cloud space;
acquiring a new frame of image as a current frame of image, and solving through a PnP problem to obtain a camera pose;
taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space;
processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest point cloud number;
calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method;
constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters;
projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
performing image segmentation on the rectangular frame to identify a subject;
adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
comparing the areas of the rectangular frame and the labeling frame;
if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, executing the step of framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a target frame image from the video, wherein the target frame image is any subsequent frame image except the current frame image;
acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum oriented bounding box;
if the number of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, executing the step of framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
the step of constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters comprises the following steps:
transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to a first formula, the first formula being: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image FDA0003775332240000011);
constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images FDA0003775332240000021 and FDA0003775332240000022; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
2. The method as claimed in claim 1, wherein the step of obtaining at least two frames of images from a real-time camera or video and constructing a three-dimensional point cloud space using an ORB-SLAM algorithm comprises:
calibrating a camera to obtain a camera internal parameter matrix and a distortion parameter;
acquiring at least two continuous frames of images in a real-time camera or video;
sequentially extracting ORB characteristic points from the two frames of continuous images and calculating corresponding binary descriptors;
according to the ORB feature points, feature point matching is carried out on the two frames of continuous images to obtain image feature points;
estimating the camera motion between two frames of continuous images by using epipolar geometric constraint, and calculating to obtain a camera pose;
and calculating the three-dimensional coordinates of the image characteristic points by utilizing triangulation measurement according to the camera internal parameter matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
3. The method as claimed in claim 1, wherein the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area comprises:
taking the current frame image as the selection area and framing a first area within the selection area;
converting the first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
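A one-function sketch of that window-to-pixel conversion (argument names are illustrative):

```python
def window_rect_to_pixel_rect(rect, window_w, window_h, img_w, img_h):
    # rect = (x, y, w, h) drawn in window coordinates; rescale it by the
    # ratio between the image pixel size and the display window size.
    x, y, w, h = rect
    sx, sy = img_w / window_w, img_h / window_h
    return (x * sx, y * sy, w * sx, h * sy)
```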
4. The method as claimed in claim 1, wherein the step of obtaining a first three-dimensional point cloud set corresponding to the two-dimensional feature points in the labeling area according to the three-dimensional point cloud space comprises:
acquiring two-dimensional feature points in the labeling area;
and carrying out back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
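In a SLAM front end this back projection amounts to collecting the map points already associated with the feature observations that fall inside the labeling area; a sketch under that assumption (the data layout is illustrative):

```python
import numpy as np

def region_to_point_cloud(keypoints_2d, map_points_3d, rect):
    # keypoints_2d: iterable of (u, v) pixel coordinates of ORB features
    # map_points_3d: matching iterable of 3-D map points (None if untriangulated)
    # rect: (x, y, w, h) labeling area in pixel coordinates
    x, y, w, h = rect
    cloud = [X for (u, v), X in zip(keypoints_2d, map_points_3d)
             if X is not None and x <= u <= x + w and y <= v <= y + h]
    return np.asarray(cloud)
```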
5. The method of claim 1, wherein the step of processing the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set using a point cloud clustering algorithm comprises:
selecting a first point cloud as a target point cloud from the first three-dimensional point cloud set, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
judging whether the target point cloud is an edge noise point;
if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
if the target point cloud is not an edge noise point, finding, in the first three-dimensional point cloud set, all point clouds that are density-reachable from the target point cloud to form a point cloud cluster;
traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
and selecting the point cloud cluster with the largest number of point clouds as a second three-dimensional point cloud set.
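The clustering described here is density-based (noise points discarded, largest cluster kept); a compact sketch using scikit-learn's DBSCAN, with illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_density_cluster(points, eps=0.05, min_samples=8):
    # points: (N, 3) array of the first three-dimensional point cloud set
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    valid = labels[labels >= 0]                  # label -1 marks edge noise points
    if valid.size == 0:
        return np.empty((0, 3))
    best = np.bincount(valid).argmax()           # cluster containing the most points
    return points[labels == best]                # second three-dimensional point cloud set
```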
6. The method as claimed in claim 1, wherein the step of filtering the first newly added point cloud data to obtain the second newly added point cloud data comprises:
filtering the first newly added point cloud data based on a random sample consensus (RANSAC) algorithm to remove ground point cloud data;
and performing conditional filtering according to the three-dimensional spatial range of the minimum oriented bounding box to obtain the second newly added point cloud data.
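A sketch of this two-stage filter using the Open3D library (assuming a recent Open3D where select_by_index is available; the thresholds are illustrative):

```python
import numpy as np
import open3d as o3d

def filter_new_points(new_points, obb):
    # new_points: (N, 3) array of first newly added point cloud data
    # obb: open3d.geometry.OrientedBoundingBox of the minimum oriented bounding box
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(new_points))

    # RANSAC plane fit: the dominant plane is taken as the ground and removed
    _, ground_idx = pcd.segment_plane(distance_threshold=0.02,
                                      ransac_n=3, num_iterations=200)
    non_ground = pcd.select_by_index(ground_idx, invert=True)

    # Conditional filter: keep only points inside the oriented bounding box
    inside_idx = obb.get_point_indices_within_bounding_box(non_ground.points)
    return np.asarray(non_ground.select_by_index(inside_idx).points)
```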
7. The method as claimed in claim 1, wherein if the number of the second newly added point cloud data is not greater than the second threshold value, the target frame image is taken as the current frame image, and the step of projecting the minimum oriented bounding box onto the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain the rectangular frame corresponding to the minimum oriented bounding box is executed.
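A sketch of the re-projection that claim 7 loops back to: the eight corners of the oriented bounding box are pushed through a 4x4 view matrix and the perspective matrix P from the earlier sketch, and the axis-aligned rectangle enclosing them becomes the new labeling frame (matrix conventions and names are assumptions):

```python
import numpy as np

def project_obb_to_rect(corners_world, P, view, img_w, img_h):
    # corners_world: (8, 3) corners of the minimum oriented bounding box
    # P: 4x4 perspective matrix; view: 4x4 world-to-camera (OpenGL) matrix
    pts_h = np.hstack([corners_world, np.ones((8, 1))])      # homogeneous coordinates
    clip = (P @ view @ pts_h.T).T                            # clip space
    ndc = clip[:, :3] / clip[:, 3:4]                         # perspective divide
    u = (ndc[:, 0] * 0.5 + 0.5) * img_w                      # viewport transform
    v = (1.0 - (ndc[:, 1] * 0.5 + 0.5)) * img_h              # image y grows downwards
    return u.min(), v.min(), u.max(), v.max()                # enclosing rectangular frame
```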
8. An automatic image labeling device based on visual SLAM, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1 to 7.
9. A computer-readable storage medium on which a processor-executable program is stored, wherein the program, when executed by a processor, is configured to carry out the method according to any one of claims 1 to 7.
CN202110676419.9A 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium Active CN113362363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110676419.9A CN113362363B (en) 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium

Publications (2)

Publication Number Publication Date
CN113362363A CN113362363A (en) 2021-09-07
CN113362363B (en) 2022-11-04

Family

ID=77534882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110676419.9A Active CN113362363B (en) 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium

Country Status (1)

Country Link
CN (1) CN113362363B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241384B (en) * 2021-12-20 2024-01-19 北京安捷智合科技有限公司 Continuous frame picture marking method, electronic equipment and storage medium
CN114973056B (en) * 2022-03-28 2023-04-18 华中农业大学 Information density-based fast video image segmentation and annotation method
CN114782438B (en) * 2022-06-20 2022-09-16 深圳市信润富联数字科技有限公司 Object point cloud correction method and device, electronic equipment and storage medium
CN114792343B (en) * 2022-06-21 2022-09-30 阿里巴巴达摩院(杭州)科技有限公司 Calibration method of image acquisition equipment, method and device for acquiring image data
CN116721246B (en) * 2023-07-14 2024-03-19 酷哇科技有限公司 Continuous frame point cloud rapid labeling method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800255A (en) * 2019-11-14 2021-05-14 阿里巴巴集团控股有限公司 Data labeling method, data labeling device, object tracking method, object tracking device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112601641B (en) * 2018-08-23 2024-03-08 实时机器人有限公司 Collision detection for robotic motion planning
US10878282B2 (en) * 2018-10-15 2020-12-29 Tusimple, Inc. Segmentation processing of image data for LiDAR-based vehicle tracking system and method
CN110827395B (en) * 2019-09-09 2023-01-20 广东工业大学 Instant positioning and map construction method suitable for dynamic environment
CN111461245B (en) * 2020-04-09 2022-11-04 武汉大学 Wheeled robot semantic mapping method and system fusing point cloud and image
CN112907760B (en) * 2021-02-09 2023-03-24 浙江商汤科技开发有限公司 Three-dimensional object labeling method and device, tool, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113362363A (en) 2021-09-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant