CN113362363B - Image automatic labeling method and device based on visual SLAM and storage medium - Google Patents

Image automatic labeling method and device based on visual SLAM and storage medium Download PDF

Info

Publication number
CN113362363B
Authority
CN
China
Prior art keywords
point cloud
labeling
image
area
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110676419.9A
Other languages
Chinese (zh)
Other versions
CN113362363A (en)
Inventor
曾碧
王俊丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110676419.9A priority Critical patent/CN113362363B/en
Publication of CN113362363A publication Critical patent/CN113362363A/en
Application granted granted Critical
Publication of CN113362363B publication Critical patent/CN113362363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/187Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an image automatic labeling method, device and storage medium based on visual SLAM; the method comprises the steps of processing a current frame image to obtain a labeling frame, and then automatically labeling video images of other different frames to correct the initial labeling frame to obtain final accurate labeling data; the method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use. The invention can be widely applied to the technical field of image processing.

Description

Image automatic labeling method and device based on visual SLAM and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a device for automatically labeling images based on visual SLAM and a storage medium.
Background
With the development of computer vision, artificial intelligence techniques such as machine learning and deep learning have been widely applied to image classification, target detection, image segmentation and related research, and have achieved breakthrough results. However, training these models depends on a large amount of task-related annotated image data, and before a model can be put into application, image data must be collected and annotated manually for the business scenario. At present this stage is time-consuming and restricts the overall research and development progress.
At present, images to be labeled are sometimes identified with a deep-learning model and labeled accordingly, but this approach requires image data to be collected in advance for training and a corresponding candidate label set to be established, so it cannot be used directly when a scene is labeled for the first time. Moreover, the recognition accuracy depends on the scale of the data set and the probability distribution of each label, the objects that can be labeled are limited, and the approach therefore lacks general applicability.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides an image automatic annotation method and device based on visual SLAM and a storage medium.
The technical scheme adopted by the invention is as follows:
in one aspect, an embodiment of the present invention includes an automatic image annotation method based on visual SLAM, including:
acquiring at least two continuous images of a real-time camera or a video, and constructing by utilizing an ORB-SLAM algorithm to obtain a three-dimensional point cloud space;
acquiring a new frame of image as the current frame image, and solving the PnP (Perspective-n-Point) problem to obtain the camera pose;
taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space;
processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest point cloud number;
calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method;
constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters;
projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
performing image segmentation on the rectangular frame to identify a subject;
adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
comparing the areas of the rectangular frame and the labeling frame;
if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a target frame image from the video, wherein the target frame image is any subsequent frame image except the current frame image;
acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum oriented bounding box;
and if the quantity of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area.
Further, the step of acquiring at least two frames of images in a real-time camera or video and constructing a three-dimensional point cloud space by using an ORB-SLAM algorithm includes:
calibrating a camera to obtain a camera internal parameter matrix and a distortion parameter;
acquiring at least two continuous frames of images in a real-time camera or video;
sequentially extracting ORB characteristic points from the two frames of continuous images and calculating corresponding binary descriptors;
according to the ORB feature points, feature point matching is carried out on the two frames of continuous images to obtain image feature points;
estimating the camera motion between two frames of continuous images by using epipolar geometric constraint, and calculating to obtain a camera pose;
and calculating the three-dimensional coordinates of the image characteristic points by utilizing triangulation measurement according to the camera internal reference matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
Further, the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area includes:
taking the current frame image as the selection area and framing a first area within the selection area;
converting the first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
Further, the step of obtaining a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space includes:
acquiring two-dimensional feature points in the labeling area;
and performing back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
Further, the step of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set includes:
selecting a first point cloud as a target point cloud from the first three-dimensional point cloud set, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
judging whether the target point cloud is an edge noise point;
if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
if the target point cloud is not an edge noise point, finding out all point clouds with the reachable density from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;
traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
and selecting the point cloud cluster with the largest number of point clouds as a second three-dimensional point cloud set.
Further, the step of constructing a camera perspective matrix according to the camera pose and the camera internal parameters includes:
transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to a first formula, the first formula being: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image BDA0003120715080000031);
constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images BDA0003120715080000032 and BDA0003120715080000033; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
Further, the step of filtering the first newly added point cloud data to obtain second newly added point cloud data includes:
filtering the first new point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;
and performing conditional filtering according to the three-dimensional space range of the minimum directed bounding box to obtain second newly-added point cloud data.
Further, if the number of the second newly added point cloud data is not greater than a second threshold, taking the target frame image as a current frame image, and performing a step of projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box.
In another aspect, an embodiment of the present invention further includes an apparatus for automatically labeling an image based on a visual SLAM, including:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method for automatic annotation of images based on visual SLAM.
In another aspect, an embodiment of the present invention further includes a computer readable storage medium, on which a processor-executable program is stored, where the processor-executable program is used to implement the method for automatically labeling images based on visual SLAM when being executed by a processor.
The invention has the beneficial effects that:
the invention provides an automatic image annotation method based on visual SLAM, which comprises the following steps: acquiring a video image, constructing and obtaining a three-dimensional point cloud space and a camera pose by utilizing an ORB-SLAM algorithm, selecting a marking area in a selection area frame by taking a current frame image as a selection area, and marking corresponding semantic information in the marking area; acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to the three-dimensional point cloud space; processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set; calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method; constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters; projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box; performing image segmentation on the rectangular frame to identify a subject; adjusting the rectangular frame according to the identified main body to obtain a labeling frame; comparing the areas of the rectangular frame and the marking frame; if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold value, taking the labeling frame as a selection area, selecting a labeling area in the selection area frame, and labeling corresponding semantic information in the labeling area; so that the labeling result of the current frame image can be generated. And finally, automatically adjusting the labeling frame by combining the newly added point cloud data of different video frames, thereby automatically generating a labeling result of each frame of image. The method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating steps of a method for automatic image annotation based on visual SLAM according to an embodiment of the present invention;
fig. 2 is a specific flowchart of an automatic image annotation method based on visual SLAM according to an embodiment of the present invention.
FIG. 3 is a flowchart of acquiring at least two frames of images in a video and constructing a three-dimensional point cloud space by using an ORB-SLAM algorithm according to an embodiment of the present invention;
FIG. 4 is a flow chart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set according to the embodiment of the present invention;
FIG. 5 is a detailed flowchart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set according to the embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a comparison between a rectangular frame and a labeled frame according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an automatic image labeling apparatus based on visual SLAM according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; terms such as greater than, less than and exceeding are understood as excluding the stated number, while terms such as above, below and within are understood as including the stated number. Where "first" and "second" are used only to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, an embodiment of the present invention provides an automatic image annotation method based on visual SLAM, which includes, but is not limited to, the following steps:
s100, acquiring at least two continuous images in a real-time camera or video, and constructing to obtain a three-dimensional point cloud space by utilizing an ORB-SLAM algorithm;
s200, acquiring a new frame of image as a current frame of image, and solving through a PnP problem to obtain a camera pose;
s300, taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
s400, acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to a three-dimensional point cloud space;
s500, processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest number of point clouds;
s600, calculating to obtain a minimum directed bounding box of a second three-dimensional point cloud set based on a principal component analysis method;
s700, constructing to obtain a camera perspective matrix according to the camera pose and camera internal parameters;
s800, projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
s900, carrying out image segmentation on the rectangular frame to identify a main body;
s1000, adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
s1100, comparing the areas of the rectangular frame and the marking frame;
s1200, if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
s1300, acquiring a target frame image from the video, wherein the target frame image is any one frame image except the current frame image;
s1400, acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
s1500, filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum directed bounding box;
and S1600, if the quantity of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area.
In this embodiment, an automatic image annotation method based on visual SLAM is provided to address the problems that manual annotation of image data takes a long time and restricts research and development progress. The method mainly comprises the following steps: firstly, acquiring a video image, constructing a three-dimensional point cloud space and the camera pose based on a visual SLAM algorithm, and constructing an augmented reality scene based on an OpenGL coordinate system in combination with the camera intrinsic matrix; secondly, according to the two-dimensional feature points in the area framed once by the user, performing back projection to obtain the corresponding three-dimensional point cloud data, clustering the three-dimensional point cloud data, and solving the minimum oriented bounding box; the virtual bounding box wrapping the entity can then be projected onto the two-dimensional image, and a preliminary rectangular labeling frame is generated by combining the camera pose and the projection matrix; next, the subject is identified based on an image segmentation algorithm, the framed area is adjusted accordingly, and the current labeling result is generated; finally, the virtual bounding box is further adjusted by combining the point cloud data newly added under different viewing angles with the automatically adjusted two-dimensional framed area, so that the labeling data of each frame of image are generated automatically. By performing data annotation at the image source, this automatic image annotation method can greatly reduce manual image annotation work, improve the efficiency of acquiring image annotation data, and effectively accelerate the research and development of computer vision model training.
In the embodiment, the visual SLAM algorithm adopts an ORB-SLAM algorithm; the point cloud clustering algorithm adopts a density-based DBSCAN algorithm; solving the minimum directed Bounding Box algorithm adopts a directed Bounding Box algorithm (Oriented Bounding Box); the image segmentation algorithm adopts a region growing algorithm.
Specifically, regarding step S100, in this embodiment a three-dimensional point cloud space can be constructed with the ORB-SLAM algorithm from at least two consecutive frames acquired from a real-time camera or a video; of course, more than two consecutive frames can also be used. Referring to fig. 3, the specific operation process is as follows:
s101, calibrating a camera to obtain an internal parameter matrix and distortion parameters of the camera;
s102, acquiring at least two continuous frames of images in a real-time camera or video;
s103, sequentially extracting ORB characteristic points from two continuous frames of images and calculating corresponding binary descriptors;
s104, according to the ORB feature points, performing feature point matching on two continuous frames of images to obtain image feature points;
s105, estimating the camera motion between two continuous frames of images by using epipolar geometric constraint, and calculating to obtain the camera pose;
and S106, calculating the three-dimensional coordinates of the image feature points by utilizing triangulation measurement according to the camera internal parameter matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
In this embodiment, the camera used to acquire the video images is calibrated to obtain its intrinsic matrix and distortion parameters. ORB feature points are extracted from two consecutive frames and their ORB descriptors are computed; feature point matching is then performed between the two consecutive frames to obtain image feature points. Finally, the camera motion between the two consecutive frames is solved from the epipolar constraint, and the three-dimensional coordinates of the image feature points are computed by triangulation, so that the three-dimensional point cloud space is constructed. If more than two frames of video images are acquired, for example a video clip is captured, the camera is likewise calibrated to obtain its intrinsic matrix and distortion parameters, ORB feature points and their descriptors are extracted from the video images in time order, feature point matching is performed between adjacent frames, the camera motion between adjacent frames is solved from the epipolar constraint, and the three-dimensional coordinates of the image feature points are computed by triangulation to construct the three-dimensional point cloud space.
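As an illustration of this two-view initialization, the following sketch uses OpenCV primitives (ORB extraction, Hamming matching, essential-matrix pose recovery, triangulation) rather than the full ORB-SLAM pipeline; the camera matrix K and distortion coefficients dist are assumed to come from the calibration step described above.

```python
import cv2
import numpy as np

def initialize_map(img1, img2, K, dist):
    """Two-view initialization: ORB matching, epipolar geometry, triangulation."""
    # Undistort the two consecutive frames with the calibrated intrinsics
    img1 = cv2.undistort(img1, K, dist)
    img2 = cv2.undistort(img2, K, dist)

    # Extract ORB feature points and their binary descriptors
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Match feature points between the two frames (Hamming distance for binary descriptors)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Epipolar constraint: essential matrix and relative camera motion (R, t)
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the matched feature points to build the initial 3D point cloud
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    cloud = (pts4d[:3] / pts4d[3]).T   # N x 3 points, first camera taken as the world origin
    return cloud, R, t
```

In the full ORB-SLAM system, keyframe management, loop closing and map optimization further refine this initial map and the camera trajectory.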
In this embodiment, for step S200, the PnP problem is solved for each subsequent image frame in the video, so that the camera pose is computed. In addition, the ORB-SLAM algorithm performs loop-closure detection, relocalization and graph optimization on key frames, which yields a more accurate camera pose, namely the rotation matrix R and translation vector t of the camera.
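A minimal sketch of the PnP pose estimation for a subsequent frame, using OpenCV's RANSAC-based solver; the 2D–3D correspondences are assumed to be supplied by the SLAM front end's keypoint-to-map-point matching, and the threshold values are illustrative.

```python
import cv2
import numpy as np

def estimate_pose_pnp(map_points, image_points, K, dist):
    """Solve the PnP problem from 3D map points and their 2D observations in the current frame."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(map_points, dtype=np.float64),    # N x 3 world coordinates
        np.asarray(image_points, dtype=np.float64),  # N x 2 pixel coordinates
        K, dist,
        reprojectionError=3.0, iterationsCount=100)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent 2D-3D correspondences")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec               # camera pose: rotation matrix R and translation vector t
```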
For step S300, that is, the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area, the step specifically includes:
s301, taking the current frame image as the selection area and framing a first area within the selection area;
s302, converting a first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and S303, taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
In this embodiment, a user needs to perform first frame selection on a target object of a current frame image, specifically, the current frame image is used as a selection area, and an area where the target object is located is selected as a labeling area in the selection area. Then, according to the labeled area framed and selected by the user on the current frame image, the labeled area is converted into a pixel area according to the ratio of the window size to the image pixel width and height, and the semantic information labeled in the area is stored.
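The window-to-pixel conversion is a simple scaling by the ratio between the display-window size and the image pixel size; a minimal sketch (parameter names are illustrative):

```python
def window_box_to_pixel_box(box, window_w, window_h, img_w, img_h):
    """Scale a box framed in display-window coordinates into image pixel coordinates."""
    sx, sy = img_w / window_w, img_h / window_h
    x0, y0, x1, y1 = box
    return x0 * sx, y0 * sy, x1 * sx, y1 * sy
```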
In this embodiment, step S400, that is, the step of obtaining the first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space, specifically includes:
s401, acquiring two-dimensional feature points in a labeling area;
s402, performing back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
In this embodiment, it is considered that the labeled area on the current frame image is a two-dimensional plane, and therefore, the point clouds in the labeled area are all two-dimensional feature points. Therefore, the two-dimensional feature points need to be subjected to back projection according to the previously constructed three-dimensional point cloud space, so as to obtain a first three-dimensional point cloud set corresponding to the two-dimensional feature points.
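One way to realize this back projection, assuming the SLAM front end already associates each tracked keypoint of the current frame with a map point, is to collect the map points whose observations fall inside the labeling area; the structure below is a hypothetical sketch of that lookup.

```python
import numpy as np

def first_point_cloud_set(keypoints, map_points, label_box):
    """Collect the 3D map points whose 2D observations lie inside the labeling area.

    keypoints : N x 2 pixel coordinates of tracked feature points in the current frame
    map_points: length-N list of 3D world points associated with those keypoints (None if untracked)
    label_box : (u_min, v_min, u_max, v_max) labeling area in pixels
    """
    u_min, v_min, u_max, v_max = label_box
    cloud = [p for (u, v), p in zip(keypoints, map_points)
             if p is not None and u_min <= u <= u_max and v_min <= v <= v_max]
    return np.array(cloud)   # first three-dimensional point cloud set
```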
In this embodiment, after obtaining the first three-dimensional point cloud set, the first three-dimensional point cloud set needs to be further processed by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, where the second three-dimensional point cloud set is a point cloud cluster with the largest number of point clouds. That is to say, after obtaining the first three-dimensional point cloud set, step S500 is executed, and referring to fig. 4, step S500 specifically includes:
s501, selecting a first point cloud from a first three-dimensional point cloud set as a target point cloud, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
s502, judging whether the target point cloud is an edge noise point;
s503, if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
s504, if the target point cloud is not an edge noise point, finding out all point clouds with the reachable density from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;
s505, traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
s506, selecting the point cloud cluster with the largest point cloud number as a second three-dimensional point cloud set.
Specifically, referring to fig. 5, in this embodiment a neighborhood radius and a minimum neighborhood point-count threshold are set according to the business scenario; then one three-dimensional point cloud is arbitrarily selected from the first three-dimensional point cloud set obtained in step S400 as a core point P, and all three-dimensional point clouds that are density-reachable from the point cloud P are found to form a point cloud cluster; if the selected point cloud is an edge noise point, another three-dimensional point cloud is selected as the core point, and this procedure is repeated until all point clouds in the first three-dimensional point cloud set have been visited. Finally, the point cloud cluster containing the largest number of three-dimensional point clouds is kept as the second three-dimensional point cloud set.
In this embodiment, if the number of the neighboring points of the point cloud P is not greater than the preset threshold, the point cloud P is marked as an edge noise point, and at this time, another point cloud is reselected from the first three-dimensional point cloud set as a core point. And finally, selecting the point cloud cluster with the largest number of point clouds from all the obtained point cloud clusters as a second three-dimensional point cloud set. In this embodiment, after the second three-dimensional point cloud set is obtained, a small amount of remaining external noise points need to be removed.
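A sketch of this clustering step using scikit-learn's DBSCAN; eps and min_samples stand for the neighborhood radius and the neighborhood point-count threshold chosen for the business scenario, and only the cluster with the most points is kept.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_cluster(cloud, eps=0.05, min_samples=10):
    """Cluster the point set with DBSCAN and keep the cluster with the most points."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(cloud)
    valid = labels[labels != -1]           # -1 marks edge/noise points
    if valid.size == 0:
        return np.empty((0, 3))
    best = np.bincount(valid).argmax()     # cluster id with the largest member count
    return cloud[labels == best]           # second three-dimensional point cloud set
```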
In step S600, in this embodiment, after the second three-dimensional point cloud set is obtained, the minimum oriented bounding box of the second three-dimensional point cloud set is solved based on Principal Component Analysis (PCA), that is, the center position, size and rotation matrix of the minimum oriented bounding box are obtained.
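A minimal PCA-based oriented-bounding-box computation consistent with the description above: the principal axes of the cluster give the box rotation, and the extents along those axes give its center and size.

```python
import numpy as np

def oriented_bounding_box(points):
    """Oriented bounding box of a point cluster from its principal axes (PCA)."""
    mean = points.mean(axis=0)
    centered = points - mean
    cov = np.cov(centered.T)                  # 3 x 3 covariance matrix
    _, axes = np.linalg.eigh(cov)             # columns of `axes` are the principal directions
    local = centered @ axes                   # express points in the principal-axis frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((lo + hi) / 2.0)  # box center back in world coordinates
    size = hi - lo                            # box extents along each principal axis
    return center, size, axes                 # `axes` doubles as the box rotation matrix
```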
In this embodiment, step S700, namely, the step of constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal reference, specifically includes:
S701, transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to the first formula: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image BDA0003120715080000091);
S702, constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images BDA0003120715080000092 and BDA0003120715080000093; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
In this embodiment, the camera pose data are thus transferred from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o), and the camera perspective matrix P is constructed by incorporating the camera intrinsics (u_0, v_0, f_u, f_v).
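The exact matrix entries appear in the drawings as formula images; as a hedged illustration, the sketch below builds a standard OpenGL off-center frustum from the intrinsics (u_0, v_0, f_u, f_v) and the image size (w, h). The particular left/right/top/bottom expressions are assumptions consistent with the quantities named above, not a verbatim transcription of the patent formula.

```python
import numpy as np

def perspective_from_intrinsics(u0, v0, fu, fv, w, h, near=0.01, far=100.0):
    """OpenGL-style perspective matrix from pinhole intrinsics (assumed frustum mapping)."""
    # Assumption: the principal point (u0, v0) offsets an off-center frustum on the near plane
    left = -near * u0 / fu
    right = near * (w - u0) / fu
    top = near * v0 / fv
    bottom = -near * (h - v0) / fv
    return np.array([
        [2 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ])
```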
In this embodiment, after the camera perspective matrix P is constructed, step S800 is performed, that is, the minimum directional bounding box is projected to the current frame image according to the camera perspective matrix, so as to form an augmented reality scene, and a rectangular frame corresponding to the minimum directional bounding box is obtained. Specifically, the minimum directional bounding box is projected onto the current image, forming an augmented reality scene. And comparing the pixel coordinates after 8 vertexes of the minimum directed bounding box are projected, and keeping the maximum and minimum u and v coordinate values to form the upper left corner coordinates and the lower right corner coordinates of the rectangular frame.
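A sketch of this projection step: the eight corners of the oriented bounding box are projected into the current image and the extreme u, v values are kept as the rectangle. For clarity the sketch projects with the pinhole model K[R|t] rather than through the OpenGL pipeline; either route yields equivalent pixel coordinates up to the viewport mapping.

```python
import itertools
import numpy as np

def obb_to_rect(center, size, axes, R, t, K):
    """Project the 8 corners of the oriented bounding box and keep the extreme u, v values."""
    half = size / 2.0
    # Enumerate the 8 corners in the box frame, then move them to world coordinates
    corners = np.array([center + axes @ (np.array(s) * half)
                        for s in itertools.product((-1.0, 1.0), repeat=3)])
    cam = R @ corners.T + t.reshape(3, 1)   # world -> camera frame
    pix = K @ cam
    uv = (pix[:2] / pix[2]).T               # perspective division -> pixel coordinates
    u_min, v_min = uv.min(axis=0)
    u_max, v_max = uv.max(axis=0)
    return u_min, v_min, u_max, v_max       # top-left and bottom-right of the rectangular frame
```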
In this embodiment, after the rectangular frame is obtained, considering that it covers a large range and may not tightly wrap the labeling object (the subject), the rectangular frame needs further processing, and steps S900 and S1000 are performed: the image inside the rectangular frame is segmented to identify the subject, and the rectangular frame is then adjusted according to the identified subject to obtain the labeling frame. Specifically, the rectangular frame is enlarged appropriately, the image inside it is segmented to identify the subject, the size of the rectangular frame is adjusted so that it tightly wraps the subject, and the adjusted rectangular frame is used as the labeling frame. Referring to fig. 6, fig. 6 is a schematic diagram comparing the rectangular frame and the labeling frame.
In this embodiment, image segmentation is performed by using a region growing algorithm, specifically, a central point of a rectangular frame is taken as a seed point, the seed point is marked as an accessed point, 4 neighborhood expansion is performed at the seed point, if an absolute value of a gray value difference between a neighborhood pixel and the seed pixel is smaller than a threshold value, the neighborhood pixel and the seed pixel are merged into the same region, and the neighborhood point is added into a set for storage. And taking points which are not visited from the set as seed points, and repeating the actions until the set is empty. And taking a rectangular frame formed by the new area as a marking frame.
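A compact sketch of the 4-neighborhood region growing described above: the rectangle center is the initial seed, each popped point acts as the next seed, a neighbor is merged when its gray-value difference from the seed is below the threshold, and the bounding rectangle of the grown region becomes the labeling frame.

```python
import numpy as np
from collections import deque

def region_grow_box(gray, seed, thresh=10):
    """Grow a region from the seed pixel over the 4-neighborhood by gray-value similarity."""
    h, w = gray.shape
    visited = np.zeros((h, w), dtype=bool)
    region, queue = [], deque([seed])
    visited[seed[1], seed[0]] = True                 # seed given as (u, v) = (column, row)
    while queue:
        u, v = queue.popleft()                       # each popped point acts as the next seed
        region.append((u, v))
        for du, dv in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nu, nv = u + du, v + dv
            if (0 <= nu < w and 0 <= nv < h and not visited[nv, nu]
                    and abs(int(gray[nv, nu]) - int(gray[v, u])) < thresh):
                visited[nv, nu] = True
                queue.append((nu, nv))
    pts = np.array(region)
    # Bounding rectangle of the grown region = adjusted labeling frame
    return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()
```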
In this embodiment, after the labeling frame is obtained, the areas of the rectangular frame and the labeling frame are further compared. If the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than the first threshold, the labeling frame is taken as the selection area, and the step of framing a labeling area within the selection area and labeling corresponding semantic information in the labeling area is executed; that is, step S300 is repeated with the labeling frame as the selection area until the ratio of the area of the resulting labeling frame to the area of the rectangular frame is not smaller than the first threshold. If the ratio of the area of the labeling frame to the area of the rectangular frame is not smaller than the first threshold, the labeling frame is the labeling data required for the current frame image.
After the current frame image is labeled, new three-dimensional point cloud data are added for other video frame images because each frame corresponds to a different viewing angle; therefore, when other video frame images are processed, the labeling data are corrected through steps S1300–S1600. Specifically, the next video frame is selected and step S1400 is executed first, that is, the first newly added point cloud data of that frame are acquired based on the visual SLAM algorithm; step S1500 is then executed, that is, the first newly added point cloud data are filtered to obtain the second newly added point cloud data, which are the newly added point cloud data inside the minimum oriented bounding box; this step specifically includes:
s1501, filtering the first new point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;
s1502, performing conditional filtering according to the three-dimensional space range of the minimum directed bounding box to obtain second newly-added point cloud data.
After the second newly added point cloud data are obtained, step S1600 is executed, that is, it is judged whether the quantity of the second newly added point cloud data is greater than the second threshold. If it is greater than the second threshold, the frame of video image is taken as the selection area, a labeling area is framed within the selection area, and corresponding semantic information is labeled in the labeling area. In other words, if the quantity of the second newly added point cloud data is greater than the second threshold, the frame differs considerably from the previously processed current frame image; in that case the frame of video image is taken as the selection area and step S300 is executed again to obtain a new labeling frame corresponding to that frame of video image.
In this embodiment, if the number of the second newly added point cloud data is not greater than the second threshold, the target frame image is used as the current frame image, and the step of projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box is performed. That is to say, if the number of the second newly added point cloud data is not greater than the second threshold, it indicates that the difference between the frame of video image and the current frame of image processed before is not great, and at this time, the step S800 is executed again only by recalculating the projection condition of the minimum directional bounding box according to the view angle of the frame of video image. In this embodiment, a specific flow of the automatic image annotation method based on the visual SLAM may refer to fig. 2.
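A sketch of this per-frame decision: the ground plane is removed from the newly added points with a plain random-sample-consensus plane fit (a NumPy stand-in for the algorithm named above), the remaining points are conditionally filtered to the oriented bounding box, and the count is compared with the second threshold; the function names and default threshold values are illustrative.

```python
import numpy as np

def remove_ground(points, dist_thresh=0.02, iters=200, seed=0):
    """Random-sample-consensus plane fit; drop the points on the dominant (ground) plane."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:
            continue                                  # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        inliers = np.abs((points - p1) @ n) < dist_thresh
        if inliers.sum() > best.sum():
            best = inliers
    return points[~best]

def points_in_obb(points, center, size, axes):
    """Conditional filter: keep only points inside the minimum oriented bounding box."""
    local = (points - center) @ axes                  # express points in the box frame
    return points[np.all(np.abs(local) <= size / 2.0, axis=1)]

def needs_relabel(new_points, center, size, axes, second_threshold=30):
    """True when enough new points fall inside the box that the labeling step should be redone."""
    second_new = points_in_obb(remove_ground(new_points), center, size, axes)
    return len(second_new) > second_threshold
```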
The image automatic labeling method based on the visual SLAM has the following technical effects:
the embodiment of the invention provides an automatic image annotation method based on visual SLAM, which comprises the following steps: acquiring a video image, constructing and obtaining a three-dimensional point cloud space and a camera pose by utilizing an ORB-SLAM algorithm, selecting a marking area in a selection area frame by taking a current frame image as a selection area, and marking corresponding semantic information in the marking area; acquiring a first three-dimensional point cloud set corresponding to a two-dimensional feature point in a labeling area according to the three-dimensional point cloud space; processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set; calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method; constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters; projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box; performing image segmentation on the rectangular frame to identify a subject; adjusting the rectangular frame according to the identified main body to obtain a marking frame; comparing the areas of the rectangular frame and the marking frame; if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold value, taking the labeling frame as a selection area, selecting a labeling area in the selection area frame, and labeling corresponding semantic information in the labeling area; so that the labeling result of the current frame image can be generated. And finally, automatically adjusting the labeling frame by combining the newly added point cloud data of different video frames, thereby automatically generating a labeling result of each frame of image. The method does not need a prior data set, can realize online automatic labeling while acquiring the image, supports multi-target and multi-class labeling, only needs single labeling of a user, can greatly reduce manual image labeling work, improves the acquisition efficiency of image labeling data, and is simple to operate and easy to use.
Referring to fig. 7, an embodiment of the present invention further provides an automatic image annotation apparatus 200 based on a visual SLAM, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
when the at least one program is executed by the at least one processor 210, the at least one processor 210 is caused to implement the method as shown in fig. 1.
The memory 220, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. The memory 220 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 220 may optionally include remote memory located remotely from processor 210, and such remote memory may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood that the device structure shown in fig. 7 does not constitute a limitation of device 200, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
In the apparatus 200 shown in fig. 7, the processor 210 may retrieve the program stored in the memory 220 and execute, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiments of the apparatus 200 are merely illustrative, and the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores a program executable by a processor, and the program executable by the processor is used for implementing the method shown in fig. 1 when being executed by the processor.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
It will be understood that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media, as is well known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (9)

1. An automatic image annotation method based on visual SLAM is characterized by comprising the following steps:
acquiring at least two continuous images of a real-time camera or a video, and constructing by utilizing an ORB-SLAM algorithm to obtain a three-dimensional point cloud space;
acquiring a new frame of image as a current frame of image, and solving through a PnP problem to obtain a camera pose;
taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a first three-dimensional point cloud set corresponding to the two-dimensional feature point in the labeling area according to the three-dimensional point cloud space;
processing the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is a point cloud cluster with the largest point cloud number;
calculating to obtain a minimum directed bounding box of the second three-dimensional point cloud set based on a principal component analysis method;
constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters;
projecting the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain a rectangular frame corresponding to the minimum directional bounding box;
performing image segmentation on the rectangular frame to identify a subject;
adjusting the rectangular frame according to the identified main body to obtain a labeling frame;
comparing the areas of the rectangular frame and the labeling frame;
if the ratio of the area of the labeling frame to the area of the rectangular frame is smaller than a first threshold, taking the labeling frame as the selection area, executing the step of framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
acquiring a target frame image from the video, wherein the target frame image is any subsequent frame image except the current frame image;
acquiring first new point cloud data of the target frame image based on a visual SLAM algorithm;
filtering the first newly added point cloud data to obtain second newly added point cloud data, wherein the second newly added point cloud data is newly added point cloud data in the minimum oriented bounding box;
if the number of the second newly-added point cloud data is larger than a second threshold, taking the target frame image as the selection area, executing the step of framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area;
the step of constructing and obtaining a camera perspective matrix according to the camera pose and the camera internal parameters comprises the following steps:
transferring the camera pose data from the world coordinate system (R_w, t_w) to the OpenGL coordinate system (R_o, t_o) according to a first formula, the first formula being: R_o = R_ow·R_w·R_ow, t_o = R_ow·t_w; wherein R_ow is the fixed rotation between the world coordinate system and the OpenGL coordinate system (given by formula image FDA0003775332240000011);
constructing the camera perspective matrix P according to the OpenGL coordinate system in combination with the camera intrinsic parameters (u_0, v_0, f_u, f_v), the matrix and its entries being given by formula images FDA0003775332240000021 and FDA0003775332240000022; wherein near denotes the distance from the camera optical center to the near clipping plane, far the distance from the camera optical center to the far clipping plane, right the distance from the near-plane center to the right clipping plane, left the distance from the near-plane center to the left clipping plane, top the distance from the near-plane center to the top clipping plane, bottom the distance from the near-plane center to the bottom clipping plane, w the image pixel width, and h the image pixel height.
2. The method as claimed in claim 1, wherein the step of obtaining at least two frames of images from a real-time camera or video and constructing a three-dimensional point cloud space using an ORB-SLAM algorithm comprises:
calibrating a camera to obtain a camera internal parameter matrix and a distortion parameter;
acquiring at least two continuous frames of images in a real-time camera or video;
sequentially extracting ORB characteristic points from the two frames of continuous images and calculating corresponding binary descriptors;
according to the ORB feature points, feature point matching is carried out on the two frames of continuous images to obtain image feature points;
estimating the camera motion between two frames of continuous images by using epipolar geometric constraint, and calculating to obtain a camera pose;
and calculating the three-dimensional coordinates of the image characteristic points by utilizing triangulation measurement according to the camera internal parameter matrix, the distortion parameters and the camera pose, and constructing to obtain a three-dimensional point cloud space.
3. The method as claimed in claim 1, wherein the step of taking the current frame image as the selection area, framing a labeling area within the selection area, and labeling corresponding semantic information in the labeling area comprises:
taking the current frame image as the selection area and framing a first area within the selection area;
converting the first area into a pixel area according to the ratio of the window size of the current frame image to the pixel width and height of the current frame image;
and taking the pixel area as a labeling area, and labeling corresponding semantic information in the labeling area.
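A one-function sketch of that window-to-pixel conversion (argument names are illustrative):

```python
def window_rect_to_pixel_rect(rect, window_w, window_h, img_w, img_h):
    # rect = (x, y, w, h) drawn in window coordinates; rescale it by the
    # ratio between the image pixel size and the display window size.
    x, y, w, h = rect
    sx, sy = img_w / window_w, img_h / window_h
    return (x * sx, y * sy, w * sx, h * sy)
```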
4. The method as claimed in claim 1, wherein the step of obtaining a first three-dimensional point cloud set corresponding to the two-dimensional feature points in the labeling area according to the three-dimensional point cloud space comprises:
acquiring two-dimensional feature points in the labeling area;
and carrying out back projection on the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.
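In a SLAM front end this back projection amounts to collecting the map points already associated with the feature observations that fall inside the labeling area; a sketch under that assumption (the data layout is illustrative):

```python
import numpy as np

def region_to_point_cloud(keypoints_2d, map_points_3d, rect):
    # keypoints_2d: iterable of (u, v) pixel coordinates of ORB features
    # map_points_3d: matching iterable of 3-D map points (None if untriangulated)
    # rect: (x, y, w, h) labeling area in pixel coordinates
    x, y, w, h = rect
    cloud = [X for (u, v), X in zip(keypoints_2d, map_points_3d)
             if X is not None and x <= u <= x + w and y <= v <= y + h]
    return np.asarray(cloud)
```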
5. The method of claim 1, wherein the step of processing the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set using a point cloud clustering algorithm comprises:
selecting a first point cloud as a target point cloud from the first three-dimensional point cloud set, wherein the first point cloud is any point cloud in the first three-dimensional point cloud set;
judging whether the target point cloud is an edge noise point;
if the target point cloud is an edge noise point, selecting a second point cloud as the target point cloud from the first three-dimensional point cloud set again, wherein the second point cloud is any point cloud except the first point cloud in the first three-dimensional point cloud set;
if the target point cloud is not an edge noise point, finding, in the first three-dimensional point cloud set, all point clouds that are density-reachable from the target point cloud to form a point cloud cluster;
traversing all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;
and selecting the point cloud cluster with the largest number of point clouds as a second three-dimensional point cloud set.
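The clustering described here is density-based (noise points discarded, largest cluster kept); a compact sketch using scikit-learn's DBSCAN, with illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_density_cluster(points, eps=0.05, min_samples=8):
    # points: (N, 3) array of the first three-dimensional point cloud set
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    valid = labels[labels >= 0]                  # label -1 marks edge noise points
    if valid.size == 0:
        return np.empty((0, 3))
    best = np.bincount(valid).argmax()           # cluster containing the most points
    return points[labels == best]                # second three-dimensional point cloud set
```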
6. The method as claimed in claim 1, wherein the step of filtering the first newly added point cloud data to obtain the second newly added point cloud data comprises:
filtering the first newly added point cloud data based on a random sample consensus (RANSAC) algorithm to remove ground point cloud data;
and performing conditional filtering according to the three-dimensional spatial range of the minimum oriented bounding box to obtain the second newly added point cloud data.
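A sketch of this two-stage filter using the Open3D library (assuming a recent Open3D where select_by_index is available; the thresholds are illustrative):

```python
import numpy as np
import open3d as o3d

def filter_new_points(new_points, obb):
    # new_points: (N, 3) array of first newly added point cloud data
    # obb: open3d.geometry.OrientedBoundingBox of the minimum oriented bounding box
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(new_points))

    # RANSAC plane fit: the dominant plane is taken as the ground and removed
    _, ground_idx = pcd.segment_plane(distance_threshold=0.02,
                                      ransac_n=3, num_iterations=200)
    non_ground = pcd.select_by_index(ground_idx, invert=True)

    # Conditional filter: keep only points inside the oriented bounding box
    inside_idx = obb.get_point_indices_within_bounding_box(non_ground.points)
    return np.asarray(non_ground.select_by_index(inside_idx).points)
```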
7. The method as claimed in claim 1, wherein if the number of the second newly added point cloud data is not greater than the second threshold value, the target frame image is taken as the current frame image, and the step of projecting the minimum oriented bounding box onto the current frame image according to the camera perspective matrix to form an augmented reality scene and obtain the rectangular frame corresponding to the minimum oriented bounding box is executed.
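A sketch of the re-projection that claim 7 loops back to: the eight corners of the oriented bounding box are pushed through a 4x4 view matrix and the perspective matrix P from the earlier sketch, and the axis-aligned rectangle enclosing them becomes the new labeling frame (matrix conventions and names are assumptions):

```python
import numpy as np

def project_obb_to_rect(corners_world, P, view, img_w, img_h):
    # corners_world: (8, 3) corners of the minimum oriented bounding box
    # P: 4x4 perspective matrix; view: 4x4 world-to-camera (OpenGL) matrix
    pts_h = np.hstack([corners_world, np.ones((8, 1))])      # homogeneous coordinates
    clip = (P @ view @ pts_h.T).T                            # clip space
    ndc = clip[:, :3] / clip[:, 3:4]                         # perspective divide
    u = (ndc[:, 0] * 0.5 + 0.5) * img_w                      # viewport transform
    v = (1.0 - (ndc[:, 1] * 0.5 + 0.5)) * img_h              # image y grows downwards
    return u.min(), v.min(), u.max(), v.max()                # enclosing rectangular frame
```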
8. An automatic image labeling device based on visual SLAM, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1 to 7.
9. A computer-readable storage medium on which a processor-executable program is stored, wherein the program, when executed by a processor, is configured to carry out the method according to any one of claims 1 to 7.
CN202110676419.9A 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium Active CN113362363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110676419.9A CN113362363B (en) 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium

Publications (2)

Publication Number Publication Date
CN113362363A CN113362363A (en) 2021-09-07
CN113362363B (en) 2022-11-04

Family

ID=77534882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110676419.9A Active CN113362363B (en) 2021-06-18 2021-06-18 Image automatic labeling method and device based on visual SLAM and storage medium

Country Status (1)

Country Link
CN (1) CN113362363B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241384B (en) * 2021-12-20 2024-01-19 北京安捷智合科技有限公司 Continuous frame picture marking method, electronic equipment and storage medium
CN114973056B (en) * 2022-03-28 2023-04-18 华中农业大学 Information density-based fast video image segmentation and annotation method
CN114782438B (en) * 2022-06-20 2022-09-16 深圳市信润富联数字科技有限公司 Object point cloud correction method and device, electronic equipment and storage medium
CN114792343B (en) * 2022-06-21 2022-09-30 阿里巴巴达摩院(杭州)科技有限公司 Calibration method of image acquisition equipment, method and device for acquiring image data
CN116721246B (en) * 2023-07-14 2024-03-19 酷哇科技有限公司 Continuous frame point cloud rapid labeling method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800255A (en) * 2019-11-14 2021-05-14 阿里巴巴集团控股有限公司 Data labeling method, data labeling device, object tracking method, object tracking device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112601641B (en) * 2018-08-23 2024-03-08 实时机器人有限公司 Collision detection for robotic motion planning
US10878282B2 (en) * 2018-10-15 2020-12-29 Tusimple, Inc. Segmentation processing of image data for LiDAR-based vehicle tracking system and method
CN110827395B (en) * 2019-09-09 2023-01-20 广东工业大学 Instant positioning and map construction method suitable for dynamic environment
CN111461245B (en) * 2020-04-09 2022-11-04 武汉大学 Wheeled robot semantic mapping method and system fusing point cloud and image
CN112907760B (en) * 2021-02-09 2023-03-24 浙江商汤科技开发有限公司 Three-dimensional object labeling method and device, tool, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113362363A (en) 2021-09-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant