CN113096003B - Labeling method, device, equipment and storage medium for multiple video frames


Info

Publication number
CN113096003B
CN113096003B (application CN202110362493.3A)
Authority
CN
China
Prior art keywords
dimensional reconstruction
labeling
dimensional
video frames
scene
Prior art date
Legal status
Active
Application number
CN202110362493.3A
Other languages
Chinese (zh)
Other versions
CN113096003A (en)
Inventor
石佳
侯文博
李翔
李俊桥
Current Assignee
Beijing CHJ Automotive Information Technology Co Ltd
Original Assignee
Beijing CHJ Automotive Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing CHJ Automotive Information Technology Co Ltd
Priority to CN202110362493.3A
Publication of CN113096003A
Application granted
Publication of CN113096003B
Legal status: Active
Anticipated expiration

Classifications

    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts (under G06T19/00 Manipulating 3D models or images for computer graphics)
    • G06T3/08
    • G06T7/70 Determining position or orientation of objects or cameras (under G06T7/00 Image analysis)
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration (under G06T7/00 Image analysis)
    • Y02T10/40 Engine management systems (under Y02T10/10 Internal combustion engine [ICE] based vehicles; Y02T Climate change mitigation technologies related to transportation)

Abstract

The application provides a labeling method, device, equipment and storage medium for multiple video frames, relating to the technical field of image processing. The labeling method for multiple video frames comprises the following steps: acquiring a plurality of video frames captured of the same region; performing three-dimensional reconstruction from the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the region; labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene; and projecting the labeling information in the three-dimensional reconstructed scene into the plurality of video frames to obtain the labeling information in the plurality of video frames. The method enables batch labeling of video frames, which effectively improves the labeling efficiency for multiple video frames and reduces the labor cost of labeling work.

Description

Labeling method, device, equipment and storage medium for multiple video frames
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for labeling multiple video frames.
Background
In the field of autonomous driving, image perception algorithms are widely applied as core algorithms: based on the continuous image frames acquired by unmanned equipment, they perform localization, obstacle recognition and similar tasks through machine learning. During the training process of machine learning, the target objects in the image frames that serve as training sample data must be labeled.
At present, most labels used by image perception algorithms come from manual annotation, with annotators labeling one image frame at a time. This consumes a large amount of manpower and material resources, and since the autonomous driving field requires labeling of continuous frames, the labeling workload and labeling cost grow substantially.
In view of the foregoing, it is desirable to provide a solution that can improve the labeling efficiency of multiple video frames.
Disclosure of Invention
The embodiment of the application aims to provide a labeling method, a labeling device, labeling equipment and a storage medium for multiple video frames, so as to at least solve the problem of how to improve the labeling efficiency of the multiple video frames.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
the first aspect of the present application provides a method for labeling multiple video frames, the method comprising:
acquiring a plurality of video frames acquired for the same region;
performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
labeling is carried out based on the three-dimensional reconstruction scene, and labeling information in the three-dimensional reconstruction scene is obtained;
and projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames.
A second aspect of the present application provides an annotation device for multiple video frames, the device comprising:
the video frame acquisition module is used for acquiring a plurality of video frames acquired for the same area;
the three-dimensional reconstruction module is used for carrying out three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
the three-dimensional labeling module is used for labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
and the annotation projection module is used for projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames.
A third aspect of the present application provides an electronic apparatus, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to perform the method according to the first aspect of the application.
A fourth aspect of the application provides a computer readable storage medium having stored thereon computer readable instructions executable by a processor to implement the method of the first aspect of the application.
According to the labeling method for multiple video frames provided in the first aspect of the application, a plurality of video frames captured of the same region are acquired; three-dimensional reconstruction is performed from the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the region; labeling is then performed based on the three-dimensional reconstructed scene to obtain labeling information in that scene; and the labeling information in the three-dimensional reconstructed scene is projected into the plurality of video frames to obtain the labeling information in them. Compared with the prior-art approach in which every video frame must be labeled manually, a user only needs to label once in the three-dimensional reconstructed scene, and the labeling information is automatically back-projected into the plurality of video frames. Video frames can thus be labeled in batches, which effectively improves the labeling efficiency for multiple video frames and reduces the labor cost of labeling work.
Because they share the same inventive concept, the labeling device for multiple video frames provided in the second aspect, the electronic device provided in the third aspect, and the computer-readable storage medium provided in the fourth aspect have the same beneficial effects as the labeling method for multiple video frames provided in the first aspect of the application.
Drawings
The above, as well as additional purposes, features and advantages of exemplary embodiments of the present application, will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, in which like or corresponding reference numerals indicate like or corresponding parts, several embodiments of the application are shown by way of illustration and not limitation:
FIG. 1 schematically illustrates a first flowchart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 2 schematically illustrates a second flowchart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 3 schematically illustrates a third flow chart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 4 schematically illustrates a fourth flowchart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 5 schematically illustrates a fifth flowchart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 6 schematically illustrates a sixth flowchart of a labeling method for multiple video frames provided by some embodiments of the application;
FIG. 7 schematically illustrates annotation based on a video frame, as provided by some embodiments of the application;
FIG. 8 schematically illustrates labeling based on a top view, as provided by some embodiments of the application;
FIG. 9 schematically illustrates an annotation device for multiple video frames provided by some embodiments of the application;
FIG. 10 schematically illustrates an electronic device provided by some embodiments of the application;
FIG. 11 schematically illustrates a computer-readable storage medium provided by some embodiments of the application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
In the following, some terms used in the embodiments of the present application are explained as follows:
four-dimensional (4D) labeling: unlike traditional two-dimensional (2D) annotations, 4D annotations map the annotated static into the real world coordinate system, thereby preserving annotations for real objects in the real world. The labels of the real objects can be used for back projection into the 2D picture through the corresponding relation, and then the labels are added. Thus, only one annotation is needed, and annotations can be added to all pictures of the road section in the manner described above.
Pixel coordinate system: the coordinate system of a 2D picture. It comprises only an x-axis and a y-axis, i.e. it has no depth, and its origin is the upper-left corner of the picture.
Camera coordinate system: the coordinate system whose origin is the camera's optical center. It comprises an x-axis, a y-axis and a z-axis.
World coordinate system: the coordinate system of the real world. It comprises an x-axis, a y-axis and a z-axis, and may take the camera position at the first video frame as its reference point.
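To make these three coordinate systems concrete, the following is a minimal numpy sketch of the standard pinhole-camera conversions between them. All numeric values, variable names and the identity pose are illustrative assumptions of this sketch, not parameters taken from the patent:

    import numpy as np

    # Pinhole intrinsics K: maps camera coordinates to pixel coordinates.
    # fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    # The pixel origin is the upper-left corner of the picture.
    K = np.array([[1000.0,    0.0, 960.0],
                  [   0.0, 1000.0, 540.0],
                  [   0.0,    0.0,   1.0]])

    # Extrinsics: rigid transform from world coordinates to camera coordinates.
    R = np.eye(3)                     # rotation (illustrative identity pose)
    t = np.zeros(3)                   # translation

    def world_to_pixel(p_world):
        """Project a 3D world point into 2D pixel coordinates."""
        p_cam = R @ p_world + t       # world coordinate system -> camera
        uv = K @ p_cam                # camera -> homogeneous pixel coordinates
        return uv[:2] / uv[2]         # divide out depth: pixels have no depth

    def pixel_to_camera_ray(u, v):
        """A pixel alone fixes no depth; it defines a ray from the optical center."""
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
        return ray / np.linalg.norm(ray)

Here world_to_pixel is the back-projection used when a 3D label is written into a 2D picture, and pixel_to_camera_ray is the starting point of the ray construction used later when labeling on a video frame.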
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The embodiment of the application provides a labeling method, a labeling device, labeling equipment and a storage medium for multiple video frames, which at least solve the problem of how to improve the labeling efficiency of the multiple video frames. The following is an exemplary description with reference to the accompanying drawings.
Referring to fig. 1, a first flowchart of a labeling method for multiple video frames according to some embodiments of the present application is schematically shown, and as shown in fig. 1, a labeling method for multiple video frames may include the following steps S101-S104:
step S101: multiple video frames acquired for the same region are acquired.
The region may be any area to be labeled, such as a highway, a lane or a parking lot, and a camera or other image-capturing device installed on a vehicle can capture it as the vehicle passes through. A video frame of the region may be a single captured image or one frame taken from a continuous video; correspondingly, the plurality of video frames may be a plurality of singly captured images, a plurality of frames from one group of continuous video frames, or a plurality of frames from several groups of continuous video frames. Any of these serves the purpose of the embodiments of the present application.
In addition, the plurality of video frames can be captured by a plurality of vehicles. A single vehicle is occluded by other vehicles while driving, so the video frames it captures contain occlusions, and a three-dimensional scene reconstructed from a single vehicle's frames alone is therefore incomplete. Acquiring video frames captured by multiple vehicles effectively eliminates the positions that cannot be reconstructed because of occlusion, so that a complete three-dimensional reconstructed scene of the region can subsequently be obtained.
Step S102: and carrying out three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region.
The three-dimensional reconstructed scene is a three-dimensional virtual scene obtained by performing three-dimensional reconstruction in a computer and simulating the real world.
Step S103: and labeling the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene.
Step S104: and projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames.
With the labeling method for multiple video frames provided by the embodiments of the application, a plurality of video frames captured of the same region can be acquired; three-dimensional reconstruction is performed from them to obtain a three-dimensional reconstructed scene corresponding to the region; labeling is then performed based on the three-dimensional reconstructed scene to obtain labeling information in that scene; and the labeling information is projected into the plurality of video frames to obtain the labeling information in each of them.
In the embodiments of the present application, three-dimensional reconstruction refers to establishing a mathematical model of a three-dimensional object that is suitable for computer representation and processing. It is the basis for processing, operating on and analyzing three-dimensional objects in a computer environment, and a key technology for building, in a computer, a virtual reality that expresses the objective world. In computer vision, three-dimensional reconstruction is the process of recovering three-dimensional information from single-view or multi-view images: the camera is first calibrated, i.e. the relationship between the camera's pixel coordinate system and the world coordinate system is computed, and the information in several two-dimensional images (i.e. video frames) is then used to reconstruct the three-dimensional information. An example of a three-dimensional reconstruction process based on two-dimensional images is given in steps (1)-(5) below, followed by a brief code sketch:
(1) Image acquisition: prior to image processing, a two-dimensional image of a three-dimensional object (e.g., a video frame of the present application) is acquired using an imaging device (e.g., a camera).
(2) Calibrating a camera: an effective imaging model is established through camera calibration, internal and external parameters of a camera are solved, and therefore three-dimensional point coordinates in space can be obtained by combining the matching result of images, and the purpose of three-dimensional reconstruction is achieved.
(3) Feature extraction: the features mainly comprise feature points, feature lines and regions. In most cases feature points are used as the matching primitives, and the way they are extracted is closely tied to the matching strategy. Feature point extraction algorithms include, but are not limited to, methods based on directional derivatives, methods based on image brightness contrast relationships, and methods based on mathematical morphology.
(4) Stereo matching: stereo matching establishes correspondences between different images according to the extracted features, i.e. imaging points of the same physical point in two different images are put in one-to-one correspondence. Care must be taken with disturbances in the scene caused by factors such as lighting conditions, noise, geometric distortion, surface physical properties and camera characteristics.
(5) Three-dimensional reconstruction: with the camera's calibrated internal and external parameters and a reasonably accurate matching result, three-dimensional scene information can be recovered. Because reconstruction precision is affected by factors such as matching precision and errors in the camera's internal and external parameters, the preceding steps must be done well: only if each link is precise and its error small can three-dimensional reconstruction be realized accurately.
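Steps (3)-(5) map naturally onto widely used OpenCV primitives. The following is a minimal two-view sketch under assumed inputs (a calibrated intrinsic matrix K and two overlapping frames); the choice of ORB features, brute-force matching and essential-matrix pose recovery is this sketch's assumption, since the patent does not prescribe particular algorithms:

    import cv2
    import numpy as np

    def reconstruct_pair(img1, img2, K):
        """Sparse two-view reconstruction covering steps (3)-(5):
        feature extraction, stereo matching, pose recovery, triangulation."""
        orb = cv2.ORB_create(5000)                     # (3) feature extraction
        kp1, des1 = orb.detectAndCompute(img1, None)
        kp2, des2 = orb.detectAndCompute(img2, None)

        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des1, des2)            # (4) stereo matching
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

        # Relative camera pose recovered from the matched correspondences.
        E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

        # (5) Triangulate the matches into a sparse 3D point cloud.
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = K @ np.hstack([R, t])
        pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
        return (pts4d[:3] / pts4d[3]).T                # N x 3 points

A production pipeline would run this over many frames, refine the poses with bundle adjustment and fuse GPS, as step two of the lane-line example later in this description suggests; the sketch shows only the geometric core.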
The foregoing illustrates a three-dimensional reconstruction process based on two-dimensional images; a person skilled in the art may refer to the above exemplary description and flexibly adapt it to the actual scene to achieve the purposes of the embodiments of the present application. For example, referring to fig. 2, which schematically illustrates a second flowchart of a labeling method for multiple video frames provided by some embodiments of the application, step S102 above (performing three-dimensional reconstruction from the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the region) may include the following substeps S1021-S1022:
Step S1021: determining camera position information corresponding to each video frame respectively;
step S1022: and carrying out three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame, and obtaining a three-dimensional reconstruction scene of the region.
The camera position information can be collected at the same time as the video frames by using a GPS positioning device.
As for the specific three-dimensional reconstruction algorithm, in some modified embodiments, step S1022 (performing three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frames themselves, to obtain a three-dimensional reconstructed scene of the region) may include:
determining, with a dense reconstruction algorithm, the position in the world coordinate system of the three-dimensional point corresponding to each pixel point, according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in that frame; and determining the three-dimensional reconstructed scene of the region from the three-dimensional point cloud formed by those three-dimensional points.
Dense reconstruction (Multiple View Stereo, MVS) is multi-view stereo geometry: on the premise that the camera poses are known, it computes, pixel by pixel, the three-dimensional point corresponding to each pixel in the image, yielding a dense three-dimensional point cloud of the surfaces of the scene's objects.
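Once an MVS pipeline has produced a per-pixel depth map and the camera pose is known, lifting pixels into world-coordinate 3D points is closed-form. A minimal sketch, where the depth map, intrinsics K and camera-to-world pose (R_cw, t_cw) are assumed inputs of this sketch:

    import numpy as np

    def pixel_to_world(u, v, depth, K, R_cw, t_cw):
        """Lift pixel (u, v) with known depth into a world-frame 3D point."""
        p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
        return R_cw @ p_cam + t_cw    # camera -> world rigid transform

    def dense_cloud(depth_map, K, R_cw, t_cw):
        """Back-project a whole depth map into a dense world-frame point cloud."""
        h, w = depth_map.shape
        us, vs = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.linalg.inv(K) @ np.stack(
            [us.ravel(), vs.ravel(), np.ones(h * w)])  # one ray per pixel
        p_cam = rays * depth_map.ravel()               # scale rays by depth
        return (R_cw @ p_cam).T + t_cw                 # (h*w) x 3 points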
Through this implementation, three-dimensional reconstruction can be realized accurately and rapidly, improving the accuracy and efficiency of the labeling as a whole.
In the foregoing embodiment corresponding to fig. 1, step S103 may be implemented in various ways. For example, in some embodiments labeling may be performed on the basis of video frames. Referring to fig. 3, which schematically illustrates a third flowchart of a labeling method for multiple video frames provided by some embodiments of the application, step S103 (labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene) may include the following substeps S1031-S1032:
step S1031: determining the corresponding relation between the pixel points in each video frame and the three-dimensional points in the three-dimensional reconstruction scene;
step S1032: and responding to a first labeling operation of a user for pixel points in the video frame, converting the first labeling operation into a second labeling operation for three-dimensional points in the three-dimensional reconstruction scene according to the corresponding relation, and generating labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
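One simple way to realize the correspondence of substeps S1031-S1032 is to record, during reconstruction, which reconstructed 3D point each pixel observes, and then translate a 2D labeling click through that lookup table. The per-pixel index map and all variable names here are assumptions of this sketch:

    import numpy as np

    def label_click_in_3d(click_uv, point_index_map, cloud, labels, text):
        """Turn a first labeling operation (a 2D click on a video frame) into
        a second labeling operation on the reconstructed cloud.
        point_index_map[v, u] holds the index of the 3D point observed at
        pixel (u, v), or -1 where no 3D point was reconstructed."""
        u, v = click_uv
        idx = point_index_map[v, u]
        if idx < 0:
            raise ValueError("no reconstructed 3D point under this pixel")
        labels[idx] = text            # the label now lives in the 3D scene
        return cloud[idx]             # the labeled world-coordinate point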
As another example, in other embodiments of step S103, the object to be labeled is located on the ground, and the video frames include a top view taken by a camera facing the ground. Referring to fig. 4, which schematically illustrates a fourth flowchart of a labeling method for multiple video frames provided by some embodiments of the application, step S103 (labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene) may include the following substep S1033:
Step S1033: responding to a third labeling operation of a user on the pixel points in the top view, converting the third labeling operation into a fourth labeling operation on the three-dimensional points in the three-dimensional reconstruction scene according to a coordinate conversion relation between a camera coordinate system corresponding to a camera shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
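A minimal sketch of this top-view conversion: the clicked pixel is first placed on the ground plane of the top-view camera's coordinate system, then carried into the world coordinate system by the camera-to-world rigid transform. The fixed ground depth ground_z is this sketch's assumption:

    import numpy as np

    def topview_click_to_world(u, v, K_top, ground_z, R_cw, t_cw):
        """Convert a third labeling operation (a click in the top view) into a
        world-coordinate point. The click is taken to lie on the ground, at
        depth ground_z in the top-view camera's coordinate system."""
        p_cam = ground_z * (np.linalg.inv(K_top) @ np.array([u, v, 1.0]))
        return R_cw @ p_cam + t_cw    # camera -> world coordinate conversion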
In addition, on the basis of the foregoing embodiment corresponding to fig. 1, referring to fig. 5, which schematically illustrates a fifth flowchart of a labeling method for multiple video frames provided by some embodiments of the application, step S104 (projecting the labeling information in the three-dimensional reconstructed scene into the plurality of video frames to obtain the labeling information in them) may include the following substep S1041:
step S1041: and projecting the annotation information in the three-dimensional reconstruction scene into a plurality of video frames according to the camera position information and shooting time information corresponding to each video frame respectively to obtain the annotation information in the plurality of video frames, wherein the shooting objects contained in each projected video frame comprise objects marked by the annotation information.
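A sketch of substep S1041 under an assumed per-frame metadata layout (each frame carrying its world-to-camera pose, intrinsics and capture time; none of these field names come from the patent): the labeled 3D point is projected into every frame, and only frames in which the projection lands inside the image, i.e. frames whose shot actually contains the labeled object, receive the annotation:

    import numpy as np

    def project_annotation(p_world, frames, image_size):
        """Project one labeled 3D point into every frame whose shot contains it.
        Each frame dict carries its world-to-camera pose (R_wc, t_wc), its
        intrinsics K and its capture time."""
        w, h = image_size
        hits = {}
        for frame in frames:
            p_cam = frame["R_wc"] @ p_world + frame["t_wc"]
            if p_cam[2] <= 0:              # point is behind this camera: skip
                continue
            uv = frame["K"] @ p_cam
            u, v = uv[:2] / uv[2]
            if 0 <= u < w and 0 <= v < h:  # the labeled object is in view
                hits[frame["time"]] = (u, v)
        return hits                        # pixel annotation per shooting time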
According to the labeling method for multiple video frames provided by at least one embodiment of the application, a plurality of video frames captured of the same region can be acquired; three-dimensional reconstruction is performed from them to obtain a three-dimensional reconstructed scene corresponding to the region; labeling is performed based on that scene to obtain labeling information in it; and the labeling information is projected into the plurality of video frames to obtain the labeling information in them. Compared with the prior-art approach in which every video frame must be labeled manually, the user labels once in the three-dimensional reconstructed scene and the labeling information is automatically back-projected into the plurality of video frames, so video frames can be labeled in batches, the labeling efficiency for multiple video frames is effectively improved, and the labor cost of labeling work is reduced.
In practical applications, the above labeling method may be applied to labeling static objects on a lane. Referring to fig. 6, which schematically illustrates a sixth flowchart of a labeling method for multiple video frames provided by some embodiments of the application, step S101 (acquiring a plurality of video frames captured of the same region) may include substep S1011: acquiring multiple groups of continuous video frames captured by multiple vehicles on the same lane;
The step S102 of performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region may include the substep S1023: carrying out three-dimensional reconstruction on the static object on the lane according to the plurality of groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the step S103 of labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene may include the substep S1034: labeling the static objects in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static objects in the three-dimensional reconstruction scene;
the step S104 of projecting the annotation information in the three-dimensional reconstructed scene to a plurality of video frames to obtain the annotation information in the plurality of video frames may include the substep S1042: and projecting the annotation information of the static object in the three-dimensional reconstruction scene to the plurality of groups of continuous video frames to obtain the annotation information in the plurality of groups of continuous video frames.
The static objects may include lane lines, lane isolation guardrails, roadside trees and the like; the embodiments of the present application place no limitation on this, and any such object can be labeled by the method of the embodiments of the present application.
The following description is made in connection with specific examples:
in some specific implementations of the embodiments of the application, the above method is used to label lane lines. The overall idea is as follows: first, the GPS positioning information fed back with the vehicles' videos is combined with map information to find all videos driven on the same lane; a three-dimensional reconstruction technique is used to recover the real road scene at 1:1, forming a three-dimensional reconstructed scene; an annotator then labels in three-dimensional space based on the recovered 3D point cloud; a time dimension is added on top of the 3D basis; and the 3D labeling result is projected onto the continuous video-frame images of multiple videos. This greatly improves labeling efficiency and yields full-dimension (3D position, angle, speed) labels. The specific steps are described below:
Step one: according to the satellite map network, purposefully select the returned dashboard-camera data (i.e. the captured video frames) of a specific road section of interest.
Step two: perform three-dimensional reconstruction from the multiple passes of returned video over the same road selected in step one and, in combination with GPS, recover the absolute three-dimensional position of the camera at each moment (i.e. the camera position information), the three-dimensional point cloud of the environment (i.e. the three-dimensional reconstructed scene) and the ground equation.
Step three: perform 4D labeling from the recovered three-dimensional point cloud and ground equation, i.e. label the static objects in the real three-dimensional world; the high-precision labeling information can then be back-projected into every frame image of every pass. The specific labeling procedure and principle are illustrated below, taking a point as the example:
First, the labeling method for multiple video frames provided by the embodiments of the application may be implemented with a dedicated labeling tool, which may be software or a hardware device realized on the basis of that software. Using the labeling tool, an annotator (i.e. the user) can label in at least the following two ways:
In the first way, referring to fig. 7, the annotator creates a point in the image (at the lane arrow in fig. 7) by double-clicking the left mouse button, and the 4D labeling tool converts the point from the image coordinate system to the real-world coordinate system. The principle is that the point represents a ray emitted from the camera's optical center into three-dimensional space, and the intersection of that ray with the ground equation is the point's three-dimensional position in the world coordinate system.
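The ray-ground intersection described in this first way reduces to a few lines of algebra. A minimal sketch, assuming the recovered ground equation is expressed as a plane n . x + d = 0 in world coordinates (that representation, like the variable names, is an assumption of this sketch):

    import numpy as np

    def click_to_ground_point(u, v, K, R_cw, t_cw, n, d):
        """Intersect the ray through pixel (u, v) with the ground plane.
        R_cw, t_cw: camera-to-world pose (t_cw is the optical center);
        n, d: the ground equation n . x + d = 0 in world coordinates."""
        ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
        ray_world = R_cw @ ray_cam             # ray direction in world frame
        denom = n @ ray_world
        if abs(denom) < 1e-9:
            raise ValueError("ray is parallel to the ground plane")
        s = -(n @ t_cw + d) / denom            # solve n.(c + s*r) + d = 0
        if s <= 0:
            raise ValueError("ground intersection is behind the camera")
        return t_cw + s * ray_world            # the labeled 3D world point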
In the second way, referring to fig. 8, the annotator can also create a point directly in the top view (at the lane arrow in fig. 8), and the 4D labeling tool converts this point from the camera coordinate system to the real-world coordinate system. The principle is that the point lies on the ground of the camera coordinate system, so its three-dimensional point in the world coordinate system can be obtained through the conversion between the camera coordinate system and the world coordinate system.
The first and second ways are alternatives; either may be used.
After the point's three-dimensional position has been determined in the first or second way, the poses between the cameras of different passes at different times generated in step two (i.e. the cameras of the multiple vehicles passing over the same road surface at different times) can be used to re-project the three-dimensional point onto any frame image of any pass. For example, a three-dimensional lane-line point labeled in the world coordinate system from the current frame can be projected into any video frame containing that lane line, such as onto the camera image taken 100 meters earlier on the current pass's track, or onto other cameras' images. This realizes the so-called "4D labeling": labeling once generates a huge amount of labeled data.
As the above embodiments show, three-dimensional reconstruction technology can restore the returned videos into the real three-dimensional world and obtain full-scene, full-dimension information; labeling is then done in that three-dimensional world (i.e. the three-dimensional reconstructed scene); and finally the points, lines and planes labeled in the three-dimensional world can be projected into every frame image. Compared with 2D labeling, this greatly improves labeling efficiency (for example, about 170 traditional 2D lane-line labels per day versus the equivalent of about 3,000 per day with 4D labeling). It yields full-dimension labeling information (3D position and angle) and full-scene labeling information (all weather conditions of the same road, such as rain, snow, backlight, traffic jams and haze), and a previously built three-dimensional model and its labels can be directly reused for newly added cameras or for sensors such as lidar, so compatibility is good.
The above embodiments provide a labeling method for multiple video frames; correspondingly, the application also provides a labeling device for multiple video frames. The labeling device provided by the embodiments of the application can implement the above labeling method, and may be realized in software, in hardware, or in a combination of the two. For example, the device may comprise integrated or separate functional modules or units that perform the corresponding steps of the methods described above. Referring to fig. 9, which schematically illustrates an annotation device for multiple video frames provided by some embodiments of the application. Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant points, refer to the description of the method embodiments. The device embodiments described below are merely illustrative.
As shown in fig. 9, an embodiment of the present application provides an labeling apparatus 10 for multiple video frames, which may include:
a video frame acquisition module 101, configured to acquire a plurality of video frames acquired for the same area;
the three-dimensional reconstruction module 102 is configured to perform three-dimensional reconstruction according to the plurality of video frames, so as to obtain a three-dimensional reconstruction scene corresponding to the region;
the three-dimensional labeling module 103 is used for labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
and the annotation projection module 104 is configured to project the annotation information in the three-dimensional reconstructed scene to a plurality of video frames to obtain the annotation information in the plurality of video frames.
In some variations of the present embodiment, the three-dimensional reconstruction module 102 includes:
a camera position information determining unit, configured to determine camera position information corresponding to each video frame;
and the three-dimensional reconstruction unit is used for carrying out three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame to obtain a three-dimensional reconstruction scene of the region.
In some variations of the embodiments of the present application, the three-dimensional reconstruction unit includes:
and the dense reconstruction subunit is used for determining the position information of each pixel point corresponding to the three-dimensional point in the world coordinate system by adopting a dense reconstruction algorithm according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frame, and determining the three-dimensional reconstruction scene of the region according to the three-dimensional point cloud formed by the three-dimensional points.
In some variations of the embodiment of the present application, the three-dimensional labeling module 103 includes:
the corresponding relation determining unit is used for determining the corresponding relation between the pixel point in each video frame and the three-dimensional point in the three-dimensional reconstruction scene;
the video frame annotation unit is used for responding to a first annotation operation of a user for pixel points in the video frame, converting the first annotation operation into a second annotation operation for three-dimensional points in the three-dimensional reconstruction scene according to the corresponding relation, and generating annotation information in the three-dimensional reconstruction scene according to the second annotation operation.
In some variations of the present embodiment, the object marked by the device 10 is located on the ground, and the video frame includes a top view taken by a camera facing down toward the ground;
the three-dimensional labeling module 103 includes:
and the top view labeling unit is used for responding to a third labeling operation of a user on the pixel points in the top view, converting the third labeling operation into a fourth labeling operation on the three-dimensional points in the three-dimensional reconstruction scene according to the coordinate conversion relation between a camera coordinate system corresponding to a camera shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
In some variations of the embodiments of the present application, the video frame acquisition module 101 includes:
the lane video frame acquisition unit is used for acquiring a plurality of groups of continuous video frames acquired by a plurality of vehicles on the same lane;
the three-dimensional reconstruction module 102 includes:
the lane scene three-dimensional reconstruction unit is used for carrying out three-dimensional reconstruction on the static objects on the lane according to the plurality of groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the three-dimensional labeling module 103 includes:
the lane object three-dimensional labeling unit is used for labeling the static objects in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static objects in the three-dimensional reconstruction scene.
In some variations of the present embodiment, the labeling projection module 104 includes:
the annotation projection unit is used for projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames according to the camera position information and the shooting time information corresponding to each video frame respectively to obtain the annotation information in the plurality of video frames, wherein the shooting objects contained in each projected video frame comprise the objects annotated by the annotation information.
The labeling device 10 for multiple video frames provided in the embodiment of the present application has the same beneficial effects as the labeling method for multiple video frames provided in the foregoing embodiment of the present application due to the same inventive concept, and will not be described herein.
The embodiment of the application also provides an electronic device corresponding to the labeling method for the multiple video frames provided by the previous embodiment, so as to execute the labeling method for the multiple video frames.
Referring to fig. 10, a schematic diagram of an electronic device according to some embodiments of the present application is schematically shown. As shown in fig. 10, the electronic device 20 includes: a processor 200, a memory 201, a bus 202 and a communication interface 203, the processor 200, the communication interface 203 and the memory 201 being connected by the bus 202; the memory 201 stores a computer program that can be executed on the processor 200, and when the processor 200 executes the computer program, the labeling method for multiple video frames provided in any of the foregoing embodiments of the present application is executed.
The memory 201 may include a high-speed Random Access Memory (RAM), and may further include non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is implemented via at least one communication interface 203 (wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like, and buses may be classified as address buses, data buses, control buses, etc. The memory 201 is configured to store a program; after receiving an execution instruction, the processor 200 executes the program, and the labeling method for multiple video frames disclosed in any of the foregoing embodiments of the application may be applied to, or implemented by, the processor 200.
The processor 200 may be an integrated circuit chip with signal-processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory 201; the processor 200 reads the information in the memory 201 and, in combination with its hardware, performs the steps of the above method.
The electronic device provided by this embodiment of the application shares the same inventive concept as the labeling method for multiple video frames provided by the previous embodiments, and has the same beneficial effects as the method it adopts, runs or implements.
The embodiments of the present application further provide a computer-readable medium corresponding to the labeling method for multiple video frames provided by the foregoing embodiments. Referring to fig. 11, the computer-readable storage medium is shown as an optical disc 30 on which a computer program (i.e. a program product) is stored; when executed by a processor, the computer program performs the labeling method for multiple video frames provided by any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
Because it shares the same inventive concept, the computer-readable storage medium provided by the above embodiment of the application has the same beneficial effects as the labeling method for multiple video frames provided by the foregoing embodiments, and as the methods adopted, run or implemented by the application program it stores.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application, and are intended to be included within the scope of the appended claims and description.

Claims (14)

1. A method for labeling multiple video frames, comprising:
acquiring a plurality of video frames acquired for the same region;
performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
labeling is carried out based on the three-dimensional reconstruction scene, and labeling information in the three-dimensional reconstruction scene is obtained;
projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames;
the projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames includes:
and projecting the annotation information in the three-dimensional reconstruction scene into a plurality of video frames according to the camera position information and shooting time information corresponding to each video frame respectively to obtain the annotation information in the plurality of video frames, wherein the shooting objects contained in each projected video frame comprise objects marked by the annotation information.
2. The method according to claim 1, wherein the performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region comprises:
determining camera position information corresponding to each video frame respectively;
and carrying out three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame, and obtaining a three-dimensional reconstruction scene of the region.
3. The method according to claim 2, wherein the performing three-dimensional reconstruction according to the camera position information and the video frame corresponding to each video frame respectively to obtain a three-dimensional reconstruction scene of the region includes:
and determining the position information of each pixel point corresponding to a three-dimensional point in a world coordinate system by adopting a dense reconstruction algorithm according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frame, and determining a three-dimensional reconstruction scene of the region according to a three-dimensional point cloud formed by the three-dimensional points.
4. The method according to claim 1, wherein the labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene comprises:
determining the corresponding relation between the pixel points in each video frame and the three-dimensional points in the three-dimensional reconstruction scene;
and responding to a first labeling operation of a user for pixel points in the video frame, converting the first labeling operation into a second labeling operation for three-dimensional points in the three-dimensional reconstruction scene according to the corresponding relation, and generating labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
5. The method of claim 1, wherein the object marked by the method is located on the ground, and the video frame comprises a top view taken with a camera facing down toward the ground;
the labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene comprises the following steps:
responding to a third labeling operation of a user on the pixel points in the top view, converting the third labeling operation into a fourth labeling operation on the three-dimensional points in the three-dimensional reconstruction scene according to a coordinate conversion relation between a camera coordinate system corresponding to a camera shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
6. The method of claim 1, wherein the acquiring a plurality of video frames acquired for the same region comprises:
acquiring a plurality of groups of continuous video frames acquired by a plurality of vehicles in the same lane;
the three-dimensional reconstruction is performed according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region, including:
carrying out three-dimensional reconstruction on the static object on the lane according to the plurality of groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
The labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene comprises the following steps:
and labeling the static objects in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static objects in the three-dimensional reconstruction scene.
7. An annotation device for multiple video frames, comprising:
the video frame acquisition module is used for acquiring a plurality of video frames acquired for the same area;
the three-dimensional reconstruction module is used for carrying out three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
the three-dimensional labeling module is used for labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
the annotation projection module is used for projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the annotation information in the video frames;
wherein the annotation projection module comprises:
the annotation projection unit is used for projecting the annotation information in the three-dimensional reconstruction scene to a plurality of video frames according to the camera position information and the shooting time information corresponding to each video frame, to obtain the annotation information in the plurality of video frames, wherein the objects captured in each projected video frame include the object annotated by the annotation information.
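As an illustrative sketch only (the `frames` structure with per-frame intrinsics `K` and world-to-camera pose `(R, t)` is an assumption, not the patent's interface), projecting the 3D annotation back into the video frames is a standard pinhole projection, keeping only frames in which the labeled points lie in front of the camera:

```python
import numpy as np

def project_labels(world_pts, frames):
    """Project labeled 3D points into each video frame.

    world_pts : (N, 3) labeled points in world coordinates
    frames    : list of dicts with keys "id", "K" (3x3 intrinsics) and
                "R", "t" (world-to-camera pose per frame, derived from
                the stored camera position / shooting time information)
    """
    annotations = {}
    for f in frames:
        cam = world_pts @ f["R"].T + f["t"]   # world -> camera coords
        in_front = cam[:, 2] > 0              # points the camera can see
        uv = (f["K"] @ cam[in_front].T).T
        annotations[f["id"]] = uv[:, :2] / uv[:, 2:3]  # pixel coords
    return annotations
```

A complete version would also discard projections falling outside the image bounds, so that each projected frame genuinely contains the annotated object.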
8. The apparatus of claim 7, wherein the three-dimensional reconstruction module comprises:
a camera position information determining unit, configured to determine camera position information corresponding to each video frame;
the three-dimensional reconstruction unit is used for performing three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frames to obtain a three-dimensional reconstruction scene of the region.
9. The apparatus of claim 8, wherein the three-dimensional reconstruction unit comprises:
the dense reconstruction subunit is used for determining, by using a dense reconstruction algorithm, the position information of the three-dimensional point in the world coordinate system corresponding to each pixel point, according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frame, and for determining the three-dimensional reconstruction scene of the region according to the three-dimensional point cloud formed by the three-dimensional points.
10. The apparatus of claim 7, wherein the three-dimensional labeling module comprises:
the correspondence determining unit is used for determining the correspondence between the pixel points in each video frame and the three-dimensional points in the three-dimensional reconstruction scene;
the video frame annotation unit is used for, in response to a first annotation operation of a user on pixel points in the video frame, converting the first annotation operation into a second annotation operation on three-dimensional points in the three-dimensional reconstruction scene according to the correspondence, and generating annotation information in the three-dimensional reconstruction scene according to the second annotation operation.
11. The apparatus of claim 7, wherein the object labeled by the apparatus is located on the ground, and the video frames comprise a top view captured by a camera facing downward toward the ground;
the three-dimensional labeling module comprises:
the top view labeling unit is used for, in response to a third labeling operation of a user on pixel points in the top view, converting the third labeling operation into a fourth labeling operation on three-dimensional points in the three-dimensional reconstruction scene according to the coordinate conversion relation between the camera coordinate system of the camera capturing the top view and the world coordinate system of the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
12. The apparatus of claim 7, wherein the video frame acquisition module comprises:
the lane video frame acquisition unit is used for acquiring a plurality of groups of continuous video frames collected by a plurality of vehicles on the same lane;
the three-dimensional reconstruction module comprises:
the lane scene three-dimensional reconstruction unit is used for performing three-dimensional reconstruction on the static objects on the lane according to the plurality of groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the three-dimensional labeling module comprises:
the lane object three-dimensional labeling unit is used for labeling the static objects in the three-dimensional reconstruction scene to obtain labeling information of the static objects in the three-dimensional reconstruction scene.
13. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 6.
14. A computer readable storage medium having stored thereon computer readable instructions executable by a processor to implement the method of any one of claims 1 to 6.
CN202110362493.3A 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames Active CN113096003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110362493.3A CN113096003B (en) 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames

Publications (2)

Publication Number Publication Date
CN113096003A (en) 2021-07-09
CN113096003B (en) 2023-08-18

Family

ID=76673862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110362493.3A Active CN113096003B (en) 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames

Country Status (1)

Country Link
CN (1) CN113096003B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168836A1 (en) * 2022-03-11 2023-09-14 亮风台(上海)信息科技有限公司 Projection interaction method, and device, medium and program product
CN115880690B (en) * 2022-11-23 2023-08-11 郑州大学 Method for quickly labeling objects in point cloud under assistance of three-dimensional reconstruction
CN115601512B (en) * 2022-12-14 2023-03-31 深圳思谋信息科技有限公司 Interactive three-dimensional reconstruction method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859612A (en) * 2019-01-16 2019-06-07 中德(珠海)人工智能研究院有限公司 A kind of method and its system of the three-dimensional annotation of streetscape data
DE102018128184A1 (en) * 2018-11-12 2020-05-14 Bayerische Motoren Werke Aktiengesellschaft Method, device, computer program and computer program product for generating a labeled image
CN111640181A (en) * 2020-05-14 2020-09-08 佳都新太科技股份有限公司 Interactive video projection method, device, equipment and storage medium
CN112367514A (en) * 2020-10-30 2021-02-12 京东方科技集团股份有限公司 Three-dimensional scene construction method, device and system and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130215239A1 (en) * 2012-02-21 2013-08-22 Sen Wang 3d scene model from video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Measurable three-dimensional reconstruction method for highways based on UAV video; Yu Fei et al.; 《工程勘察》 (Geotechnical Investigation & Surveying), No. 2; 59-63 *

Also Published As

Publication number Publication date
CN113096003A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113096003B (en) Labeling method, device, equipment and storage medium for multiple video frames
Rameau et al. A real-time augmented reality system to see-through cars
CN107784038B (en) Sensor data labeling method
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
CN111830953A (en) Vehicle self-positioning method, device and system
CN112258574A (en) Method and device for marking pose information and computer readable storage medium
CN111383204A (en) Video image fusion method, fusion device, panoramic monitoring system and storage medium
US11069071B1 (en) System and method for egomotion estimation
Zhou et al. Developing and testing robust autonomy: The university of sydney campus data set
CN110736472A (en) indoor high-precision map representation method based on fusion of vehicle-mounted all-around images and millimeter wave radar
Busch et al. Lumpi: The leibniz university multi-perspective intersection dataset
CN113706633B (en) Three-dimensional information determination method and device for target object
CN114549542A (en) Visual semantic segmentation method, device and equipment
CN114299230A (en) Data generation method and device, electronic equipment and storage medium
WO2024055966A1 (en) Multi-camera target detection method and apparatus
Li et al. Gyroflow+: Gyroscope-guided unsupervised deep homography and optical flow learning
CN116205973A (en) Laser point cloud continuous frame data labeling method and system
CN110827340B (en) Map updating method, device and storage medium
CN113033426B (en) Dynamic object labeling method, device, equipment and storage medium
CN116091595A (en) Labeling method and system for 360 panoramic images
CN116917936A (en) External parameter calibration method and device for binocular camera
Lee et al. Semi-automatic framework for traffic landmark annotation
CN117372632B (en) Labeling method and device for two-dimensional image, computer equipment and storage medium
Tu et al. Extrinsic Parameter Co-calibration of a Monocular Camera and a LiDAR Using Only a Chessboard
CN113362236B (en) Point cloud enhancement method, point cloud enhancement device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant