CN113096003A - Labeling method, device, equipment and storage medium for multiple video frames - Google Patents

Labeling method, device, equipment and storage medium for multiple video frames

Info

Publication number
CN113096003A
Authority
CN
China
Prior art keywords
three-dimensional reconstruction
labeling
scene
video frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110362493.3A
Other languages
Chinese (zh)
Other versions
CN113096003B (en)
Inventor
石佳
侯文博
李翔
李俊桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing CHJ Automotive Information Technology Co Ltd
Original Assignee
Beijing CHJ Automotive Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing CHJ Automotive Information Technology Co Ltd
Priority to CN202110362493.3A
Publication of CN113096003A
Application granted
Publication of CN113096003B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T3/08
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application provides a labeling method, apparatus, device and storage medium for multiple video frames, relating to the technical field of image processing. The labeling method for multiple video frames comprises the following steps: acquiring a plurality of video frames collected for the same area; performing three-dimensional reconstruction from the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the area; labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene; and projecting the labeling information in the three-dimensional reconstructed scene into the plurality of video frames to obtain labeling information in the plurality of video frames. The method enables batch annotation of video frames, effectively improves the annotation efficiency for multiple video frames, and reduces the labor cost of annotation work.

Description

Labeling method, device, equipment and storage medium for multiple video frames
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for labeling multiple video frames.
Background
In the field of autonomous (unmanned) driving technology, image perception algorithms are widely used as core algorithms: based on consecutive image frames acquired by an unmanned vehicle, they perform positioning, obstacle recognition and similar tasks by means of machine learning. During the training of such machine-learning models, the target objects in the image frames used as training samples need to be labeled.
At present, most of the labels used by image perception algorithms come from manual annotation, and annotators label a single image frame at a time. This approach consumes a great deal of manpower and material resources; moreover, the autonomous driving field requires consecutive frames to be labeled, which greatly increases the annotation workload and cost.
In view of the above, it is desirable to provide a scheme capable of improving the annotation efficiency of multiple video frames.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a device and a storage medium for labeling multiple video frames, so as to at least solve the problem of how to improve the efficiency of labeling multiple video frames.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
a first aspect of the present application provides an annotation method for multiple video frames, the method comprising:
acquiring a plurality of video frames collected aiming at the same area;
performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
and projecting the labeling information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the labeling information in the plurality of video frames.
A second aspect of the present application provides an annotation device for multiple video frames, the device comprising:
the video frame acquisition module is used for acquiring a plurality of video frames acquired aiming at the same area;
the three-dimensional reconstruction module is used for performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
the three-dimensional labeling module is used for labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
and the annotation projection module is used for projecting the annotation information in the three-dimensional reconstruction scene into the plurality of video frames to obtain the annotation information in the plurality of video frames.
A third aspect of the present application provides an electronic device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the method of the first aspect of the present application.
A fourth aspect of the present application provides a computer readable storage medium having computer readable instructions stored thereon which are executable by a processor to implement the method of the first aspect of the present application.
In the first aspect of the present application, an annotation method for multiple video frames is provided. The method acquires multiple video frames collected for the same area, performs three-dimensional reconstruction from those video frames to obtain a three-dimensional reconstructed scene corresponding to the area, labels within the three-dimensional reconstructed scene to obtain labeling information there, and then projects that labeling information into the multiple video frames to obtain labeling information in each of them. Compared with the prior-art approach of manually labeling every video frame, the user only needs to label once in the three-dimensional reconstructed scene, and the annotation information can be automatically back-projected into the multiple video frames. This realizes batch annotation of video frames, effectively improves the annotation efficiency for multiple video frames, and reduces the labor cost of annotation work.
The annotation device for multiple video frames provided by the second aspect of the present application, the electronic device provided by the third aspect of the present application, and the computer-readable storage medium provided by the fourth aspect of the present application have the same advantages and are based on the same inventive concept as the annotation method for multiple video frames provided by the first aspect of the present application.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 schematically illustrates a first flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 2 schematically illustrates a second flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 3 schematically illustrates a third flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 4 schematically illustrates a fourth flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 5 schematically illustrates a fifth flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 6 schematically illustrates a sixth flowchart of an annotation method for multiple video frames provided by some embodiments of the present application;
FIG. 7 schematically illustrates annotation based on a video frame provided by some embodiments of the present application;
FIG. 8 schematically illustrates annotation based on a top view provided by some embodiments of the present application;
FIG. 9 is a schematic diagram of an annotation device for multiple video frames according to some embodiments of the present application;
FIG. 10 schematically illustrates a schematic view of an electronic device provided by some embodiments of the present application;
FIG. 11 schematically illustrates a schematic diagram of a computer-readable storage medium provided by some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In the following, some terms used in the examples of the present application are explained as follows:
four-dimensional (4D) labeling: different from the traditional two-dimensional (2D) labeling, the 4D labeling maps the labeled static object into the real world coordinate system, and then the labeling of the real object in the real world is reserved. The labels of these real objects can be back-projected back into the 2D picture by correspondence, and then added. Thus, only one annotation is needed, and the annotations can be added in the above manner for all pictures of the road section.
Pixel coordinate system: the coordinate system of the 2D picture, comprising only an x-axis and a y-axis (i.e., without depth); its origin is the upper-left corner of the picture.
Camera coordinate system: the coordinate system with the optical center of the camera as the origin, comprising an x-axis, a y-axis and a z-axis.
World coordinate system: the coordinate system of the real world, which may include an x-axis, a y-axis and a z-axis, with the position of the first frame of the video taken as the reference point.
In addition, the terms "first" and "second", etc. are used to distinguish different objects, rather than to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The embodiment of the application provides a labeling method, a labeling device, labeling equipment and a storage medium for multiple video frames, so as to at least solve the problem of how to improve the labeling efficiency of the multiple video frames. The following description is made by way of example with reference to the accompanying drawings.
Referring to fig. 1, which schematically illustrates a first flowchart of an annotation method for multiple video frames provided in some embodiments of the present application, as shown in fig. 1, an annotation method for multiple video frames may include the following steps S101 to S104:
step S101: multiple video frames captured for the same region are acquired.
The area may be an area to be labeled, such as a highway, a lane or a parking lot. A camera device, such as a camera, may be mounted on a vehicle and acquire video frames of the area as the vehicle passes through it. A video frame may be a single captured image or one frame extracted from a continuous video; accordingly, the plurality of video frames may be a plurality of single-shot images, a plurality of frames from one group of consecutive video frames, or a plurality of frames from several groups of consecutive video frames, any of which can achieve the purpose of the embodiments of the present application.
In addition, the plurality of video frames may be collected by multiple vehicles. Video frames captured by a single vehicle may be occluded by other vehicles during driving, so a three-dimensional scene reconstructed subsequently from a single vehicle's video frames alone may be incomplete. Acquiring video frames collected by multiple vehicles effectively eliminates the positions that cannot be reconstructed because of occlusion, so that a complete three-dimensional reconstructed scene of the area can be obtained later.
Step S102: and performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region.
The three-dimensional reconstructed scene is a three-dimensional virtual scene that simulates the real world, obtained by performing three-dimensional reconstruction in a computer.
Step S103: and labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene.
Step S104: and projecting the labeling information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the labeling information in the plurality of video frames.
The labeling method for multiple video frames provided by the embodiments of the present application performs three-dimensional reconstruction from multiple video frames collected for the same area to obtain a three-dimensional reconstructed scene corresponding to the area, labels within the three-dimensional reconstructed scene to obtain labeling information there, and projects that labeling information into the multiple video frames to obtain labeling information in each of them. Compared with the prior-art approach of manually labeling each video frame separately, the user only needs to label in the three-dimensional reconstructed scene, and the labeling information can be automatically back-projected into the multiple video frames. This realizes batch labeling of video frames, effectively improves the labeling efficiency for multiple video frames, and reduces the labor cost of labeling work.
In the embodiments of the present application, three-dimensional reconstruction refers to establishing, for a three-dimensional object, a mathematical model suitable for representation and processing by a computer. It is the basis for processing and analyzing the properties of three-dimensional objects in a computer environment, and a key technology for building, inside the computer, a virtual reality that expresses the objective world. In computer vision, three-dimensional reconstruction is the process of recovering three-dimensional information from single-view or multi-view images. It may be implemented by calibrating the camera, i.e., computing the relationship between the camera's pixel coordinate system and the world coordinate system, and then reconstructing three-dimensional information from the information in multiple two-dimensional images (i.e., video frames). An example of such a three-dimensional reconstruction process based on two-dimensional images is illustrated in (1) to (5) below:
(1) Image acquisition: before image processing, a two-dimensional image (e.g., a video frame of the present application) of a three-dimensional object is acquired by an imaging device (e.g., a camera).
(2) Camera calibration: an effective imaging model is established through camera calibration and the intrinsic and extrinsic parameters of the camera are solved, so that the three-dimensional coordinates of points in space can be obtained by combining them with the matching results of the images, achieving the purpose of three-dimensional reconstruction.
(3) Feature extraction: the features mainly include feature points, feature lines, and regions. In most cases, feature points are used as matching primitives, and the form of extracting the feature points is closely related to the matching strategy. The feature point extraction algorithm may include, but is not limited to: directional derivative based methods, image brightness contrast relationship based methods, mathematical morphology based methods, and the like.
(4) Stereo matching: stereo matching is to establish a corresponding relationship between different images according to the extracted features, that is, imaging points of the same physical space point in two different images are in one-to-one correspondence. Some factors in the scene, such as the illumination condition, noise interference, geometric distortion of the scene, surface physical characteristics, and camera characteristics, are considered in matching.
(5) Three-dimensional reconstruction: with an accurate matching result, the three-dimensional scene information can be restored by combining it with the camera's calibrated intrinsic and extrinsic parameters. Because the reconstruction accuracy is affected by factors such as the matching accuracy and the errors in the camera's intrinsic and extrinsic parameters, the preceding steps must be performed carefully, so that each stage is accurate and its error small; only then can the three-dimensional reconstruction be realized more accurately.
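As a concrete illustration of steps (3) and (4) above, the minimal sketch below uses OpenCV (an assumed tool; the embodiments do not prescribe a specific library) to extract and match feature points between two frames; the matches, combined with the calibrated camera parameters, are what drive the reconstruction in step (5). The two synthetic textured frames merely stand in for consecutive video frames of the same road area.

```python
import cv2
import numpy as np

# Two synthetic textured frames standing in for consecutive video frames of the
# same area; the second is a shifted copy, mimicking camera motion.
rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, (480, 640), dtype=np.uint8)
frame2 = np.roll(frame1, 15, axis=1)

orb = cv2.ORB_create(nfeatures=500)                # (3) feature point extraction
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

matches = []
if des1 is not None and des2 is not None:          # (4) stereo matching of the descriptors
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matched feature points between the two frames")
```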
For example, please refer to fig. 2, which schematically illustrates a second flowchart of a labeling method for multiple video frames according to some embodiments of the present application, where, as shown in fig. 2, the step S102 performs three-dimensional reconstruction according to the multiple video frames to obtain a three-dimensional reconstructed scene corresponding to the region, and may include the following sub-steps S1021 to S1022:
step S1021: determining camera position information corresponding to each video frame;
step S1022: and performing three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame to obtain a three-dimensional reconstruction scene of the region.
The camera position information can be collected by using GPS positioning equipment when video frames are collected.
As to a specific algorithm for three-dimensional reconstruction, in some modified embodiments, the step S1022 of performing three-dimensional reconstruction according to the camera position information and the video frame respectively corresponding to each video frame to obtain a three-dimensional reconstructed scene of the region may include:
and determining the position information of each pixel point corresponding to a three-dimensional point in a world coordinate system by adopting a dense reconstruction algorithm according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frames, and determining the three-dimensional reconstruction scene of the region according to a three-dimensional point cloud formed by the three-dimensional points.
The dense reconstruction algorithm is Multi-View Stereo (MVS). Its principle is that, on the premise that the camera poses are known, the three-dimensional point corresponding to each pixel in the image is computed one by one, yielding a dense three-dimensional point cloud of the surfaces of the objects in the scene.
Through the embodiment, the three-dimensional reconstruction can be accurately and quickly realized, and the whole marking accuracy and marking efficiency are improved.
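The geometric core of this step can be sketched as follows (a minimal illustration under assumed values, not the embodiment's actual implementation): once a frame's intrinsics and pose are known, every pixel with an estimated depth can be lifted to a three-dimensional point in the world coordinate system, and the union of such points over all frames forms the dense point cloud. Real MVS pipelines estimate the per-pixel depths by matching across views; the constant depth below is only a placeholder.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],        # hypothetical camera intrinsics
              [0.0, 1000.0, 540.0],
              [0.0,    0.0,   1.0]])

def pixel_to_world(u, v, depth, R, t):
    """Lift pixel (u, v) with an estimated depth to the world coordinate system.
    R, t map world -> camera coordinates, so the inverse transform is applied."""
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # camera coordinates
    return np.linalg.inv(R) @ (p_cam - t)                        # world coordinates

R, t = np.eye(3), np.zeros(3)              # pose of this frame (hypothetical)
point_cloud = [pixel_to_world(u, v, 20.0, R, t)
               for u in range(0, 1920, 64) for v in range(0, 1080, 64)]
print(len(point_cloud), "points reconstructed from one frame")
```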
In the embodiment corresponding to fig. 1 above, step S103 can be implemented in various ways. For example, in some implementations the labeling may be performed on the basis of a video frame. Please refer to fig. 3, which schematically illustrates a third flowchart of a labeling method for multiple video frames provided in some implementations of the present application. As shown in fig. 3, step S103 of labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene may include the following sub-steps S1031 to S1032:
step S1031: determining the corresponding relation between the pixel point in each video frame and the three-dimensional point in the three-dimensional reconstruction scene;
step S1032: responding to a first labeling operation of a user for a pixel point in the video frame, converting the first labeling operation into a second labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to the corresponding relation, and generating labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
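A hypothetical sketch of these two sub-steps is given below; the data layout and function names are illustrative assumptions, not taken from the embodiments. The reconstruction provides, per video frame, a lookup from pixel coordinates to reconstructed three-dimensional points, so a user's click in a frame (the first labeling operation) can be converted into an annotation on the corresponding three-dimensional point of the reconstructed scene (the second labeling operation).

```python
# Hypothetical per-frame correspondence: frame id -> {(u, v): (X, Y, Z) in world coordinates}
correspondence = {
    "frame_0001": {(640, 360): (12.3, 1.5, 25.7)},
}
annotations_3d = []                        # labeling information in the 3D reconstructed scene

def on_first_labeling_operation(frame_id, u, v, label):
    """Convert a click on a pixel of one video frame into a 3D annotation."""
    point_3d = correspondence[frame_id].get((u, v))
    if point_3d is None:
        return                             # this pixel has no reconstructed 3D point
    annotations_3d.append({"label": label, "xyz": point_3d})   # second labeling operation

on_first_labeling_operation("frame_0001", 640, 360, "lane_line")
print(annotations_3d)
```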
For another example, in another embodiment of step S103, the object to be labeled is located on the ground, and the video frames include a top view taken by a camera looking down at the ground. Please refer to fig. 4, which schematically illustrates a fourth flowchart of the labeling method for multiple video frames provided in some embodiments of the present application. As shown in fig. 4, step S103 of labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene may include the following sub-step S1033:
step S1033: responding to a third labeling operation of a user for a pixel point in the top view, converting the third labeling operation into a fourth labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to a coordinate conversion relation between a camera coordinate system corresponding to a camera for shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
In addition, on the basis of the corresponding embodiment in fig. 1, please refer to fig. 5, which schematically illustrates a fifth flowchart of the annotation method for multiple video frames according to some embodiments of the present application, as shown in fig. 5, in which step S104 projects annotation information in the three-dimensional reconstructed scene into multiple video frames to obtain annotation information in the multiple video frames, the method may include the following sub-step S1041:
step S1041: and projecting the annotation information in the three-dimensional reconstruction scene into a plurality of video frames according to the camera position information and the shooting time information respectively corresponding to each video frame to obtain the annotation information in the plurality of video frames, wherein a shooting object contained in each projected video frame comprises an object marked by the annotation information.
At least one of the foregoing embodiments of the present application provides an annotation method for multiple video frames that obtains multiple video frames captured for the same area, performs three-dimensional reconstruction from those video frames to obtain a three-dimensional reconstructed scene corresponding to the area, labels within the three-dimensional reconstructed scene to obtain labeling information there, and projects that labeling information into the multiple video frames to obtain labeling information in each of them. Compared with the prior-art approach of manually labeling every video frame, the user only needs to label once in the three-dimensional reconstructed scene, and the annotation information can be automatically back-projected into the multiple video frames. This realizes batch annotation of video frames, effectively improves the annotation efficiency for multiple video frames, and reduces the labor cost of annotation work.
In practical applications, the above annotation method may be applied to annotating static objects on a lane. Please refer to fig. 6, which schematically shows a sixth flowchart of the annotation method for multiple video frames provided in some embodiments of the present application. As shown in fig. 6, step S101 of obtaining multiple video frames captured for the same area may include sub-step S1011: acquiring multiple groups of consecutive video frames collected by multiple vehicles in the same lane;
the step S102 of performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the region may include the substep S1023: performing three-dimensional reconstruction on the static object on the lane according to the groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the step S103 of labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene may include the substep S1034: labeling the static object in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static object in the three-dimensional reconstruction scene;
the step S104 of projecting the annotation information in the three-dimensional reconstructed scene to a plurality of video frames to obtain the annotation information in the plurality of video frames may include the substep S1042: and projecting the labeling information of the static object in the three-dimensional reconstruction scene into the multiple groups of continuous video frames to obtain the labeling information in the multiple groups of continuous video frames.
The static objects may include lane lines, lane isolation guardrails, roadside trees and the like, all of which can be labeled with the above method; the embodiments of the present application impose no limitation on this.
The following is described with reference to specific examples:
In some specific implementations of the embodiments of the present application, the method is used for labeling lane lines. The overall concept is as follows: first, the GPS positioning information of the videos returned by vehicles is combined with map information to find all videos driven on the same lane; three-dimensional reconstruction is then used to recover the real road scene at 1:1, forming a three-dimensional reconstructed scene; an annotator labels in three-dimensional space based on the recovered 3D point cloud; a time dimension is added on top of the 3D data; and the 3D labeling result is projected onto the consecutive frame images of multiple videos. This greatly improves labeling efficiency and yields full-dimensional (3D position, angle, speed) annotations. The steps are introduced below:
Step one: according to the satellite imagery network, select the data returned by the vehicle event data recorders (i.e., the collected video frames) for the specific road section of interest for perception.
Step two: perform three-dimensional reconstruction from the multiple returned videos of the same road selected in step one, and, in combination with GPS, recover the absolute three-dimensional position of the camera at each moment (i.e., the camera position information), the three-dimensional point cloud of the environment (i.e., the three-dimensional reconstructed scene) and the ground equation.
Step three: perform 4D labeling according to the recovered three-dimensional point cloud and the ground equation, i.e., label static objects in the real three-dimensional world; the labeled high-precision information can then be back-projected into every frame of every video. The detailed labeling procedure and principle are illustrated below with a single point:
First, the annotation method for multiple video frames provided in the embodiments of the present application may be implemented by a dedicated annotation tool, which may be software or a software-based hardware device; the embodiments of the present application are not limited in this respect. Using the annotation tool, an annotator (i.e., the user) can perform annotation in at least the following two ways:
First way: referring to fig. 7, the annotator creates a point in the image (at the lane-line arrow in fig. 7) by double-clicking the left mouse button, and the 4D annotation tool converts the point from the image coordinate system into the real-world coordinate system. The principle is that the pixel defines a ray emitted from the optical center of the camera into three-dimensional space, and the intersection of this ray with the ground plane given by the ground equation is the three-dimensional point of that pixel in the world coordinate system.
Second way: referring to fig. 8, the annotator can also create a point directly in the top view (at the lane-line arrow in fig. 8), and the 4D annotation tool converts the point from the camera coordinate system into the real-world coordinate system. The principle is that the point lies on the ground in the camera coordinate system, so its three-dimensional point in the world coordinate system can be obtained through the transformation between the camera coordinate system and the world coordinate system.
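Both ways rest on standard projective geometry. The sketch below is a hedged illustration with hypothetical intrinsics, poses and ground equation, not the tool's actual code: for the first way, the clicked pixel defines a ray from the optical center that is intersected with the ground plane; for the second way (corresponding to sub-step S1033 above), a ground point picked in the top view is transformed from the camera coordinate system into the world coordinate system.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],        # hypothetical camera intrinsics
              [0.0, 1000.0, 540.0],
              [0.0,    0.0,   1.0]])
n, d = np.array([0.0, 1.0, 0.0]), -1.5     # hypothetical ground equation n.X + d = 0
                                           # (world frame = first camera, y down, camera 1.5 m up)

def pixel_to_ground_point(u, v, R, t):
    """First way: intersect the viewing ray of pixel (u, v) with the ground plane.
    R, t map world -> camera coordinates."""
    center = -np.linalg.inv(R) @ t                                    # optical center, world coords
    ray = np.linalg.inv(R) @ (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    s = -(n @ center + d) / (n @ ray)                                 # ray parameter at the plane
    return center + s * ray

def camera_to_world(p_cam, R, t):
    """Second way: a ground point picked in the top view, expressed in camera
    coordinates, is mapped to world coordinates via the known pose."""
    return np.linalg.inv(R) @ (p_cam - t)

# A pixel clicked in the image of the first camera (pose = identity):
print(pixel_to_ground_point(960, 700, np.eye(3), np.zeros(3)))        # ~9.4 m ahead on the ground
# A ground point picked in the top view of a camera 30 m further along the road:
print(camera_to_world(np.array([3.0, 1.5, 10.0]), np.eye(3), np.array([0.0, 0.0, -30.0])))
```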
The first and second ways are alternatives; either may be used.
After the three-dimensional position of the point has been determined in the first or second way, the poses between the cameras of different passes at different times generated in step two (i.e., the cameras of multiple vehicles passing over the same road surface at different times) can be used to re-project the three-dimensional point onto any frame of any pass. For example, a lane-line point labeled in the world coordinate system based on the current frame can be projected onto any video frame that contains that lane line: onto a camera image of the current pass 100 meters further along the trajectory, or onto camera images of other passes. This is what is meant by "4D labeling": a single annotation generates an enormous amount of labeling data.
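A minimal sketch of this back-projection is given below, under assumed values: one lane-line point labeled once in the world coordinate system is projected into every frame whose reconstructed pose is known, whichever pass it belongs to, and only frames that actually see the point receive the annotation. The frame identifiers, poses and image size are hypothetical.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],           # hypothetical camera intrinsics
              [0.0, 1000.0, 540.0],
              [0.0,    0.0,   1.0]])

def project(p_world, R, t, width=1920, height=1080):
    p_cam = R @ p_world + t                   # world -> camera coordinates
    if p_cam[2] <= 0:
        return None                           # behind the camera, not visible
    u, v, _ = K @ (p_cam / p_cam[2])          # camera -> pixel coordinates
    return (u, v) if 0 <= u < width and 0 <= v < height else None

labeled_point = np.array([1.8, 1.5, 40.0])    # lane-line point, world coordinates
frames = {                                    # frame id -> (R, t) recovered in step two
    "pass1_frame_0100": (np.eye(3), np.zeros(3)),
    "pass2_frame_0042": (np.eye(3), np.array([0.5, 0.0, -30.0])),
}
for frame_id, (R, t) in frames.items():
    uv = project(labeled_point, R, t)
    if uv is not None:
        print(frame_id, uv)                   # annotation propagated to this frame
```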
Through this implementation, the returned videos can be restored into a real three-dimensional world using three-dimensional reconstruction, obtaining full-scene, full-dimensional information. Annotation is then performed in that three-dimensional world (i.e., the three-dimensional reconstructed scene), and finally the points, lines and surfaces annotated in the three-dimensional world can be projected into every frame image. Compared with 2D annotation, this greatly improves labeling efficiency (for example, traditional 2D lane-line labeling yields about 170 per person-day, while 4D labeling yields about 3000 per person-day), provides full-dimensional labeling information (3D position and angle), and provides full-scene labeling information (all-weather scenes on the same road section, such as rain, snow, backlight, traffic congestion and haze). For newly added sensors such as cameras or lidar, the previously built three-dimensional model and annotations can be reused directly, giving better compatibility.
In the foregoing embodiments, a method for labeling multiple video frames is provided, and correspondingly, an apparatus for labeling multiple video frames is also provided. The annotation device for multiple video frames provided in the embodiment of the present application can implement the above annotation method for multiple video frames, and the annotation device for multiple video frames can be implemented by software, hardware, or a combination of software and hardware. For example, the annotation device for multiple video frames may comprise integrated or separate functional modules or units to perform the corresponding steps of the above methods. Please refer to fig. 9, which schematically illustrates a schematic diagram of an annotation apparatus for multiple video frames according to some embodiments of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
As shown in fig. 9, an annotating device 10 for multiple video frames according to an embodiment of the present application may include:
a video frame acquisition module 101, configured to acquire a plurality of video frames acquired for the same area;
a three-dimensional reconstruction module 102, configured to perform three-dimensional reconstruction according to the multiple video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
the three-dimensional labeling module 103 is configured to label based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene;
and the annotation projection module 104 is configured to project annotation information in the three-dimensional reconstructed scene into the plurality of video frames to obtain annotation information in the plurality of video frames.
In some variations of the embodiments of the present application, the three-dimensional reconstruction module 102 includes:
the camera position information determining unit is used for determining the camera position information corresponding to each video frame;
and the three-dimensional reconstruction unit is used for performing three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame to obtain a three-dimensional reconstruction scene of the region.
In some variations of embodiments of the present application, the three-dimensional reconstruction unit includes:
and the dense reconstruction subunit is used for determining the position information of the pixel point corresponding to the three-dimensional point in the world coordinate system by adopting a dense reconstruction algorithm according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frame, and determining the three-dimensional reconstruction scene of the region according to the three-dimensional point cloud formed by the three-dimensional points.
In some variations of the embodiments of the present application, the three-dimensional labeling module 103 includes:
the corresponding relation determining unit is used for determining the corresponding relation between the pixel point in each video frame and the three-dimensional point in the three-dimensional reconstruction scene;
and the video frame labeling unit is used for responding to a first labeling operation of a user for a pixel point in the video frame, converting the first labeling operation into a second labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to the corresponding relation, and generating labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
In some variations of the embodiments of the present application, the object labeled by the apparatus 10 is located on the ground, and the video frames include a top view taken by a camera looking down at the ground;
the three-dimensional labeling module 103 includes:
and the top view labeling unit is used for responding to a third labeling operation of a user for a pixel point in the top view, converting the third labeling operation into a fourth labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to a coordinate conversion relation between a camera coordinate system corresponding to a camera for shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
In some variations of the embodiments of the present application, the video frame acquiring module 101 includes:
the lane video frame acquisition unit is used for acquiring a plurality of groups of continuous video frames acquired by a plurality of vehicles in the same lane;
the three-dimensional reconstruction module 102 includes:
the lane scene three-dimensional reconstruction unit is used for performing three-dimensional reconstruction on the static object on the lane according to the multiple groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the three-dimensional labeling module 103 includes:
and the lane object three-dimensional labeling unit is used for labeling the static objects in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain the labeling information of the static objects in the three-dimensional reconstruction scene.
In some variations of the embodiments of the present application, the annotation projection module 104 includes:
and the annotation projection unit is used for projecting annotation information in the three-dimensional reconstruction scene into the plurality of video frames according to the camera position information and the shooting time information which are respectively corresponding to each video frame to obtain the annotation information in the plurality of video frames, wherein a shooting object contained in each projected video frame comprises an object marked by the annotation information.
The labeling device 10 for multiple video frames provided in the embodiment of the present application and the labeling method for multiple video frames provided in the foregoing embodiment of the present application have the same inventive concept and the same beneficial effects, and are not described herein again.
The embodiment of the present application further provides an electronic device corresponding to the annotation method for multiple video frames provided in the foregoing embodiment, so as to execute the annotation method for multiple video frames.
Please refer to fig. 10, which schematically illustrates a schematic diagram of an electronic device provided in some embodiments of the present application. As shown in fig. 10, the electronic device 20 includes a processor 200, a memory 201, a bus 202 and a communication interface 203; the processor 200, the communication interface 203 and the memory 201 are connected through the bus 202. The memory 201 stores a computer program executable on the processor 200, and when executing the computer program the processor 200 performs the annotation method for multiple video frames provided by any of the foregoing embodiments.
The memory 201 may include a random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface 203 (wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network and the like can be used.
Bus 202 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 201 is configured to store a program, and the processor 200 executes the program after receiving an execution instruction, where the method for labeling multiple video frames disclosed in any of the foregoing embodiments of the present application may be applied to the processor 200, or implemented by the processor 200.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 200. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the above method in combination with its hardware.
The electronic device provided by the embodiment of the present application and the labeling method for multiple video frames provided by the foregoing embodiments of the present application have the same inventive concept and the same beneficial effects as the method adopted, operated or implemented by the electronic device.
Referring to fig. 11, a computer-readable storage medium is shown as an optical disc 30, on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the method for labeling multiple video frames provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above-mentioned embodiments of the present application and the labeling method for multiple video frames provided by the foregoing embodiments of the present application have the same advantages as the method adopted, run or implemented by the application program stored in the computer-readable storage medium.
It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure, and the present disclosure should be construed as being covered by the claims and the specification.

Claims (16)

1. An annotation method for multiple video frames, comprising:
acquiring a plurality of video frames collected aiming at the same area;
performing three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene;
and projecting the labeling information in the three-dimensional reconstruction scene to a plurality of video frames to obtain the labeling information in the plurality of video frames.
2. The method according to claim 1, wherein the performing three-dimensional reconstruction from the plurality of video frames to obtain a three-dimensional reconstructed scene corresponding to the region comprises:
determining camera position information corresponding to each video frame;
and performing three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frame to obtain a three-dimensional reconstruction scene of the region.
3. The method according to claim 2, wherein the three-dimensional reconstruction according to the camera position information and the video frame corresponding to each video frame respectively to obtain a three-dimensional reconstructed scene of the region comprises:
and determining the position information of each pixel point corresponding to a three-dimensional point in a world coordinate system by adopting a dense reconstruction algorithm according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frames, and determining the three-dimensional reconstruction scene of the region according to a three-dimensional point cloud formed by the three-dimensional points.
4. The method according to claim 1, wherein the labeling based on the three-dimensional reconstructed scene to obtain labeling information in the three-dimensional reconstructed scene comprises:
determining the corresponding relation between the pixel point in each video frame and the three-dimensional point in the three-dimensional reconstruction scene;
responding to a first labeling operation of a user for a pixel point in the video frame, converting the first labeling operation into a second labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to the corresponding relation, and generating labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
5. The method of claim 1, wherein the object annotated by the method is located on the ground, and wherein the video frame comprises a top view taken by a camera looking down at the ground;
the labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene comprises the following steps:
responding to a third labeling operation of a user for a pixel point in the top view, converting the third labeling operation into a fourth labeling operation for a three-dimensional point in the three-dimensional reconstruction scene according to a coordinate conversion relation between a camera coordinate system corresponding to a camera for shooting the top view and a world coordinate system corresponding to the three-dimensional reconstruction scene, and generating labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
6. The method of claim 1, wherein the projecting the annotation information in the three-dimensional reconstructed scene into a plurality of video frames to obtain annotation information in the plurality of video frames comprises:
and projecting the annotation information in the three-dimensional reconstruction scene into a plurality of video frames according to the camera position information and the shooting time information respectively corresponding to each video frame to obtain the annotation information in the plurality of video frames, wherein a shooting object contained in each projected video frame comprises an object marked by the annotation information.
7. The method of claim 1, wherein the obtaining a plurality of video frames captured for a same region comprises:
acquiring a plurality of groups of continuous video frames acquired by a plurality of vehicles in the same lane;
the three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region comprises:
performing three-dimensional reconstruction on the static object on the lane according to the groups of continuous video frames to obtain a three-dimensional reconstruction scene of the lane;
the labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene comprises the following steps:
and labeling the static object in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static object in the three-dimensional reconstruction scene.
8. An apparatus for labeling multiple video frames, comprising:
a video frame acquisition module configured to acquire a plurality of video frames collected for the same region;
a three-dimensional reconstruction module configured to perform three-dimensional reconstruction according to the plurality of video frames to obtain a three-dimensional reconstruction scene corresponding to the region;
a three-dimensional labeling module configured to perform labeling based on the three-dimensional reconstruction scene to obtain labeling information in the three-dimensional reconstruction scene; and
a labeling projection module configured to project the labeling information in the three-dimensional reconstruction scene into the plurality of video frames to obtain labeling information in the plurality of video frames.
9. The apparatus of claim 8, wherein the three-dimensional reconstruction module comprises:
a camera position information determining unit configured to determine the camera position information corresponding to each video frame; and
a three-dimensional reconstruction unit configured to perform three-dimensional reconstruction according to the camera position information corresponding to each video frame and the video frames, to obtain the three-dimensional reconstruction scene of the region.
10. The apparatus of claim 9, wherein the three-dimensional reconstruction unit comprises:
a dense reconstruction subunit configured to determine, by using a dense reconstruction algorithm, the position information in the world coordinate system of the three-dimensional point corresponding to each pixel point, according to the camera position information corresponding to each video frame and the pixel position information of each pixel point in the video frame, and to determine the three-dimensional reconstruction scene of the region from the three-dimensional point cloud formed by the three-dimensional points.
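
For illustration only (not part of the claims): a full dense-reconstruction algorithm such as multi-view stereo is beyond a short example, but the core relationship the subunit relies on, namely that camera position information plus pixel positions jointly determine a 3D point in the world coordinate system, can be illustrated with linear (DLT) triangulation of a single pixel track. The projection matrices P = K[R | t] passed in below are an assumed input format for this sketch.

    import numpy as np

    def triangulate_track(pixels, projection_matrices):
        # Linear (DLT) triangulation: each observation (u, v) with its 3x4
        # projection matrix P contributes two homogeneous linear constraints
        # on the unknown world point X.
        A = []
        for (u, v), P in zip(pixels, projection_matrices):
            A.append(u * P[2] - P[0])
            A.append(v * P[2] - P[1])
        _, _, vt = np.linalg.svd(np.asarray(A))
        X = vt[-1]                 # null-space vector = homogeneous world point
        return X[:3] / X[3]        # Euclidean world coordinates
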
11. The apparatus of claim 8, wherein the three-dimensional labeling module comprises:
a correspondence determining unit configured to determine a correspondence between pixel points in each video frame and three-dimensional points in the three-dimensional reconstruction scene; and
a video frame labeling unit configured to, in response to a first labeling operation performed by a user on a pixel point in a video frame, convert the first labeling operation into a second labeling operation on a three-dimensional point in the three-dimensional reconstruction scene according to the correspondence, and generate labeling information in the three-dimensional reconstruction scene according to the second labeling operation.
12. The apparatus of claim 8, wherein the object labeled by the apparatus is located on the ground, and wherein the video frames comprise a top view captured by a camera looking down at the ground;
the three-dimensional labeling module comprises:
a top view labeling unit configured to, in response to a third labeling operation performed by a user on a pixel point in the top view, convert the third labeling operation into a fourth labeling operation on a three-dimensional point in the three-dimensional reconstruction scene according to a coordinate transformation between the camera coordinate system of the camera that captured the top view and the world coordinate system of the three-dimensional reconstruction scene, and generate labeling information in the three-dimensional reconstruction scene according to the fourth labeling operation.
13. The apparatus of claim 8, wherein the labeling projection module comprises:
a labeling projection unit configured to project the labeling information in the three-dimensional reconstruction scene into the plurality of video frames according to the camera position information and the shooting time information corresponding to each video frame, to obtain the labeling information in the plurality of video frames, wherein the shooting objects contained in each projected video frame include the object marked by the labeling information.
14. The apparatus of claim 8, wherein the video frame acquisition module comprises:
a lane video frame acquisition unit configured to acquire multiple groups of consecutive video frames collected by multiple vehicles on the same lane;
the three-dimensional reconstruction module comprises:
a lane scene three-dimensional reconstruction unit configured to perform three-dimensional reconstruction of the static objects on the lane according to the multiple groups of consecutive video frames to obtain a three-dimensional reconstruction scene of the lane;
the three-dimensional labeling module comprises:
a lane object three-dimensional labeling unit configured to label the static objects in the three-dimensional reconstruction scene based on the three-dimensional reconstruction scene to obtain labeling information of the static objects in the three-dimensional reconstruction scene.
15. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to any one of claims 1 to 7.
16. A computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the method of any one of claims 1 to 7.
CN202110362493.3A 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames Active CN113096003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110362493.3A CN113096003B (en) 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames

Publications (2)

Publication Number Publication Date
CN113096003A (en) 2021-07-09
CN113096003B CN113096003B (en) 2023-08-18

Family

ID=76673862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110362493.3A Active CN113096003B (en) 2021-04-02 2021-04-02 Labeling method, device, equipment and storage medium for multiple video frames

Country Status (1)

Country Link
CN (1) CN113096003B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130215239A1 (en) * 2012-02-21 2013-08-22 Sen Wang 3d scene model from video
DE102018128184A1 (en) * 2018-11-12 2020-05-14 Bayerische Motoren Werke Aktiengesellschaft Method, device, computer program and computer program product for generating a labeled image
CN109859612A (en) * 2019-01-16 2019-06-07 中德(珠海)人工智能研究院有限公司 A kind of method and its system of the three-dimensional annotation of streetscape data
CN111640181A (en) * 2020-05-14 2020-09-08 佳都新太科技股份有限公司 Interactive video projection method, device, equipment and storage medium
CN112367514A (en) * 2020-10-30 2021-02-12 京东方科技集团股份有限公司 Three-dimensional scene construction method, device and system and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENJAMIN NUERMBERGER et al.: "Multi-view gesture annotations in image-based 3D reconstructed scenes", The 22nd ACM Conference, pages 129-138 *
余飞 et al.: "基于无人机视频的公路可量测三维重建方法" [Measurable three-dimensional reconstruction of highways based on UAV video], 《工程勘察》 (Geotechnical Investigation & Surveying), no. 2, pages 59-63 *
侯文博: "基于图像的三维场景重建" [Image-based three-dimensional scene reconstruction], 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology), no. 3, pages 138-4137 *
史颖 et al.: "多特征三维稠密重建方法" [Multi-feature three-dimensional dense reconstruction method], 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology), vol. 9, no. 5, pages 594-63 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168836A1 (en) * 2022-03-11 2023-09-14 亮风台(上海)信息科技有限公司 Projection interaction method, and device, medium and program product
CN115880690A (en) * 2022-11-23 2023-03-31 郑州大学 Method for quickly marking object in point cloud under assistance of three-dimensional reconstruction
CN115880690B (en) * 2022-11-23 2023-08-11 郑州大学 Method for quickly labeling objects in point cloud under assistance of three-dimensional reconstruction
CN115601512A (en) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司(Cn) Interactive three-dimensional reconstruction method and device, computer equipment and storage medium
CN115601512B (en) * 2022-12-14 2023-03-31 深圳思谋信息科技有限公司 Interactive three-dimensional reconstruction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113096003B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN113096003B (en) Labeling method, device, equipment and storage medium for multiple video frames
Rameau et al. A real-time augmented reality system to see-through cars
US11954813B2 (en) Three-dimensional scene constructing method, apparatus and system, and storage medium
CN109961522B (en) Image projection method, device, equipment and storage medium
CN107784038B (en) Sensor data labeling method
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
CN112258574A (en) Method and device for marking pose information and computer readable storage medium
KR20200110120A (en) A system implementing management solution of road facility based on 3D-VR multi-sensor system and a method thereof
CN105701828A (en) Image-processing method and device
CN111950426A (en) Target detection method and device and delivery vehicle
CN111383204A (en) Video image fusion method, fusion device, panoramic monitoring system and storage medium
CN115641401A (en) Construction method and related device of three-dimensional live-action model
CN116883610A (en) Digital twin intersection construction method and system based on vehicle identification and track mapping
Busch et al. Lumpi: The leibniz university multi-perspective intersection dataset
CN115830135A (en) Image processing method and device and electronic equipment
KR100574227B1 (en) Apparatus and method for separating object motion from camera motion
WO2024055966A1 (en) Multi-camera target detection method and apparatus
CN114549542A (en) Visual semantic segmentation method, device and equipment
Li et al. Gyroflow+: Gyroscope-guided unsupervised deep homography and optical flow learning
CN113379815A (en) Three-dimensional reconstruction method and device based on RGB camera and laser sensor and server
CN110827340B (en) Map updating method, device and storage medium
CN116205973A (en) Laser point cloud continuous frame data labeling method and system
WO2023283929A1 (en) Method and apparatus for calibrating external parameters of binocular camera
CN113033426B (en) Dynamic object labeling method, device, equipment and storage medium
Kim et al. Vision-based all-in-one solution for augmented reality and its storytelling applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant