CN108875730B - Deep learning sample collection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN108875730B
Authority
CN
China
Prior art keywords
target object
image
region
interest
video data
Prior art date
Legal status
Active
Application number
CN201710342890.8A
Other languages
Chinese (zh)
Other versions
CN108875730A (en)
Inventor
陈文杰
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201710342890.8A priority Critical patent/CN108875730B/en
Publication of CN108875730A publication Critical patent/CN108875730A/en
Application granted granted Critical
Publication of CN108875730B publication Critical patent/CN108875730B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a deep learning sample collection method, device, equipment and storage medium. The method comprises the following steps: determining a target object in one frame of image of video data and marking a region of interest of the target object; marking, according to the target object determined in the image and the region of interest of the target object, the region of interest of the target object in multi-frame images of the video data by using a simultaneous localization and mapping (SLAM) system; and acquiring images marked with the region of interest of the target object as training sample images of the target object. Because the method uses the region of interest as the annotation of the image, each training sample image of the target object is already annotated, and annotated accurately, at the time it is stored. The embodiment removes the need to manually annotate large numbers of training sample images and avoids the labeling errors that manual annotation is prone to.

Description

Deep learning sample collection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a method, an apparatus, a device, and a storage medium for collecting deep learning samples.
Background
The concept of deep learning was proposed by Hinton et al. in 2006. An unsupervised, greedy layer-by-layer training algorithm based on deep belief networks (DBNs) brought hope for solving the optimization problems associated with deep structures, and multi-layer auto-encoder deep structures were subsequently proposed on this basis. The convolutional neural network was the first truly multi-layer structure learning algorithm; it exploits spatial correlation to reduce the number of parameters and improve training performance. To date, owing to its strong feature-extraction capability, deep learning has decisively surpassed many traditional vision algorithms in computer-vision fields such as object recognition, object detection, object segmentation and object tracking.
Although deep learning clearly outperforms traditional computer-vision methods, it needs a large number of training samples to achieve good results; otherwise a highly accurate deep model cannot be obtained. The reality, however, is twofold. On the one hand, obtaining a large number of accurately labeled training samples is often an extremely difficult task, which greatly hinders the application of deep learning in many fields. On the other hand, even when substantial manpower and material resources are spent on labeling training samples, the sheer size and complexity of the workload make some mislabeled samples almost inevitable, and these have a large influence on the accuracy of the final deep model.
Disclosure of Invention
The technical problem addressed by the present invention is that, in the prior art, large numbers of training samples must be labeled manually and labeling errors easily occur.
To solve this problem, the present invention adopts the following technical solutions:
the invention provides a deep learning sample collection method, which comprises the following steps: determining a target object in one frame of image of video data and marking an interested region of the target object; according to the target object determined in the image and the region of interest of the target object, utilizing an instant positioning and mapping SLAM system to mark the region of interest of the target object in a multi-frame image of the video data; and acquiring an image of the region of interest marked with the target object as a training sample image of the target object.
Marking the region of interest of the target object in the video data by using the SLAM system comprises: marking, with the SLAM system, a region of interest of the target object in each frame of image containing the target object, for video data being shot or video data that has been shot.
For video data being shot, marking the region of interest of the target object in each frame of image containing the target object with the SLAM system comprises: shooting the target object under different light conditions and/or in different poses, and marking the region of interest of the target object in each frame of image containing the target object with the SLAM system.
Shooting the target object under different light conditions and in different poses comprises: step 2, adjusting the current light brightness; step 4, shooting the target object at different scales to obtain images of the target object at different scales under the current light brightness; step 6, shooting around the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness; and step 8, judging whether shooting under all preset light brightness levels has been finished, and if not, jumping back to step 2.
Determining a target object in one frame of image of the video data and marking the region of interest of the target object comprises: detecting a target object in an image of the video data by using a preset target detection model, and marking the region of interest of the target object.
Marking the region of interest of the target object in multi-frame images of the video data with the SLAM system, according to the target object determined in the image and the region of interest of the target object, comprises: determining, through the SLAM system, a pose change of the target object in the current frame image relative to the target object in the previous frame image; and marking the region of interest of the target object in the current frame image according to the pose change of the target object and the region of interest of the target object in the previous frame image.
The invention provides a deep learning sample collection device, which comprises the following program modules: a target determining module, configured to determine a target object in one frame of image of video data and mark a region of interest of the target object; a target labeling module, configured to mark the region of interest of the target object in multi-frame images of the video data by using a SLAM system, according to the target object determined in the image and the region of interest of the target object; and an image acquisition module, configured to acquire images marked with the region of interest of the target object as training sample images of the target object.
The target labeling module is configured to: mark, with the SLAM system, a region of interest of the target object in each frame of image containing the target object, for video data being shot or video data that has been shot.
The target labeling module is configured to: shoot the target object under different light conditions and/or in different poses, and mark, with the SLAM system, the region of interest of the target object in each frame of image containing the target object.
The target labeling module comprises: an adjusting unit, configured to adjust the current light brightness; a scale shooting unit, configured to shoot the target object at different scales to obtain images of the target object at different scales under the current light brightness; a viewing-angle shooting unit, configured to shoot around the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness; and a judging unit, configured to judge whether shooting under all preset light brightness levels has been finished, and if not, call the adjusting unit to adjust the current light brightness.
The target determining module is configured to detect a target object in an image of the video data by using a preset target detection model and mark the region of interest of the target object.
The target labeling module is configured to: determine, through the SLAM system, a pose change of the target object in the current frame image relative to the target object in the previous frame image; and mark the region of interest of the target object in the current frame image according to the pose change of the target object and the region of interest of the target object in the previous frame image.
The invention provides a storage medium storing a computer program which, when executed by a processor, implements the deep learning sample collection method described above.
The invention provides a deep learning sample collection device, which comprises a processor and a memory; the processor is configured to execute the deep learning sample collection program stored in the memory, so as to implement the deep learning sample collection method described above.
The invention has the following beneficial effects:
according to the invention, the region of interest of the target object can be marked in the image by utilizing the SLAM system, the training sample image containing the region of interest of the target object is obtained, and the region of interest can be further used as the mark of the image containing the target object, so that the training sample image of the target object is marked when the training sample image of the target object is stored, and the mark has accuracy. According to the embodiment, a large number of training sample images do not need to be manually marked, the problem that error marking is easy to occur in manual marking is avoided, marking efficiency is high, accuracy is high, and accuracy of a finally trained depth model can be improved.
Drawings
FIG. 1 is a flow chart of a deep learning sample collection method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a deep learning sample collection method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a deep learning sample collection method according to a third embodiment of the present invention;
FIG. 4 is a schematic view of an environmental image acquired when a camera shoots according to the third embodiment of the present invention;
FIG. 5 is a schematic view of an image acquired when the camera is far from the target object according to the third embodiment of the present invention;
FIG. 6 is a schematic view of an image acquired when the camera is near the target object according to the third embodiment of the present invention;
FIG. 7 is a schematic view of an image acquired when the camera shoots around the target object according to the third embodiment of the present invention;
FIG. 8 is a schematic view of an image acquired when the camera shoots around the target object according to the third embodiment of the present invention;
FIG. 9 is a schematic view of an image acquired when the camera shoots around the target object according to the third embodiment of the present invention;
FIG. 10 is a structural view of a deep learning sample collection device according to the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Embodiment 1
The embodiment of the invention provides a deep learning sample collection method. Fig. 1 is a flowchart of a deep learning sample collection method according to a first embodiment of the present invention.
In step S110, a target object is determined in a frame of image of the video data and a region of interest (ROI) of the target object is marked.
The video data may be video data being photographed or video data having been photographed.
The region of interest may be a circumscribed region of the target object, outlined as a rectangle, circle, ellipse, irregular polygon or the like, so that the region of interest marks the target object in the image. In this embodiment, the region of interest is preferably the circumscribed rectangle of the target object.
In this embodiment, a preset target detection model may be used to detect a target object in an image of the video data and mark the region of interest of the target object. The target detection model may be a coarsely trained model; it is used to detect objects, and an object it detects can be determined as the target object, whose region of interest is then marked.
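For illustration only, here is a minimal sketch of this first-frame detection step. It assumes torchvision (>= 0.13) and its pretrained Faster R-CNN standing in for the coarse detector; the patent does not prescribe a specific model at this point (Embodiment 3 below uses Faster R-CNN), so the model choice and score threshold are assumptions:

```python
# Sketch of first-frame ROI detection with a coarse, pretrained detector.
# Assumption: a COCO-pretrained Faster R-CNN plays the role of the
# patent's "preset target detection model".
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_rois(frame_rgb, score_threshold=0.5):
    """Return [x1, y1, x2, y2] boxes and labels for objects in one frame."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]  # frame_rgb: HxWx3 uint8 array
    keep = out["scores"] > score_threshold
    return out["boxes"][keep].tolist(), out["labels"][keep].tolist()
```

Each returned box can be taken directly as the region of interest of a candidate target object.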
In this embodiment, the user may also select an object in a frame of image; the object selected by the user is determined as the target object, and its region of interest is marked. For example, the user draws a rectangular box around an object in the image: that object is the target object, and the rectangular box can be taken as the region of interest labeling it.
Step S120, marking the region of interest of the target object in multi-frame images of the video data by using a simultaneous localization and mapping (SLAM) system, according to the target object determined in the image and the region of interest of the target object.
The SLAM system is a visual SLAM system with environment-modeling and target-pose-estimation capabilities, and thus has the ability to automatically annotate a target object with a region of interest in an image. In the present embodiment, for video data being shot or video data that has been shot, the region of interest of the target object is marked in each frame of image containing the target object by using the SLAM system. The SLAM system may be built in advance or in real time.
For video data being shot, the target object may be photographed under different light conditions and/or in different poses, and the region of interest of the target object is marked in each frame of image containing the target object by using the SLAM system. Target objects in different poses means the target object at different scales and from different viewing angles. In this embodiment, preferably, each photographed image contains the entire target object, so that continuous training sample images can be acquired from the video data.
The SLAM system can reconstruct the target object in the video data in three dimensions. The SLAM system determines the pose change of the target object in the current frame image relative to the target object in the previous frame image, and marks the region of interest of the target object in the current frame image according to that pose change and the region of interest of the target object in the previous frame image. Pose changes include scale changes and viewing-angle changes. The pose change value comprises the change in the distance (scale) between the camera and the target object and the change in the viewing angle between the camera and the target object, where a change in viewing angle covers changes in both the shooting position and the shooting angle of the camera.
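As a sketch of the pose-change step, assuming the SLAM system exposes a 4x4 homogeneous camera pose per frame (a common convention in visual SLAM implementations, not something the text mandates):

```python
import numpy as np

def relative_pose(T_prev, T_curr):
    """Pose change between adjacent frames, from 4x4 world-to-camera poses.

    The returned transform maps points expressed in the previous camera
    frame into the current camera frame; it captures both the scale
    (camera-target distance) change and the viewing-angle change
    described above.
    """
    return T_curr @ np.linalg.inv(T_prev)
```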
Step S130, collecting an image of the region of interest marked with the target object as a training sample image of the target object.
Multiple frames of the video data contain the region of interest of the target object. For an image containing the region of interest, either the image within the region of interest or the whole image may be acquired as a training sample image of the target object.
The training sample images of the target object are then saved. Training sample images of the same target object, or of the same class of target objects (for example, all target objects of the display class), can be stored in one folder, and the folder can be named after the target object.
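A minimal sketch of this saving convention follows; the folder-per-object layout comes from the text, while the file naming and the use of OpenCV for image I/O are assumptions:

```python
import os
import cv2  # assumption: OpenCV handles image I/O

def save_sample(frame_bgr, roi, object_name, index, root="samples", crop=True):
    """Save the ROI crop (or the whole frame) into a folder named after
    the target object, as suggested above."""
    x1, y1, x2, y2 = roi
    image = frame_bgr[y1:y2, x1:x2] if crop else frame_bgr
    folder = os.path.join(root, object_name)
    os.makedirs(folder, exist_ok=True)
    cv2.imwrite(os.path.join(folder, f"{object_name}_{index:06d}.jpg"), image)
```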
In this embodiment, a plurality of target objects may be selected, and the region of interest of each target object is marked in the video data by the SLAM system, so that training sample images containing the region of interest of each target object are obtained separately. The region of interest may serve as the label of an image containing the target object, so that when a training sample image of the target object is stored it is already labeled, and labeled accurately. The embodiment removes the need to manually annotate large numbers of training sample images, avoids the labeling errors that manual annotation is prone to, achieves high labeling efficiency and accuracy, and can improve the accuracy of the finally trained deep model.
Embodiment 2
A more specific embodiment is given below to illustrate the SLAM-based deep learning sample collection method of the present invention. Fig. 2 is a flowchart of a deep learning sample collection method according to a second embodiment of the present invention.
Step S210, selecting and starting a SLAM system.
The SLAM system is started in an environment where the target object exists.
The SLAM system is, for example, an ORB-SLAM system or a DSO (Direct Sparse Odometry) system.
Step S220, initializing the SLAM system, and calling a camera to shoot the environment within the camera's field of view.
The camera is connected with the SLAM system.
According to the training task, the target object to be shot can be determined in advance, and the camera can shoot while moving in the environment where the target object is located, for example moving toward and away from the target object, moving around the target object, and so on.
Step S230, starting a preset target detection model, detecting a target object in the first frame image shot by the camera through the target detection model, and marking the region of interest of the target object.
According to the training task, a coarse target detection model is prepared. This is an initially trained model for detecting the target object, and its detection accuracy is low. After enough training sample images of the target object have been acquired according to this embodiment, the target detection model can be retrained or optimized, finally yielding the deep-learned model.
Step S240, marking the region of interest of the target object in subsequent images shot by the camera by using the SLAM system.
Based on the geometric principles of camera imaging, the SLAM system can accurately model the three-dimensional space of the environment within the camera's field of view, obtaining a three-dimensional reconstruction of the target object. The relative pose of the camera and the target object can be computed on the basis of the SLAM system, which determines the pose change of the target object between two adjacent frames. From the pose change of the target object and its pose data in the previous frame image (the distance and the angle between the camera and the target object), the pose data of the target object in the current frame image is obtained. This pose data can be converted into two-dimensional data (such as two-dimensional coordinates) through the camera projection model that maps the three-dimensional reconstruction of the target object onto the two-dimensional image, yielding a precise bounding box (region of interest) for the target object.
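To make the projection step concrete, here is a hedged numpy sketch: given the corners of the target's 3D bounding volume in the SLAM map and the current camera pose, the corners are projected through a pinhole model and the 2D bounding box (the ROI annotation) is their image-plane extent. The pose convention and the intrinsic matrix K are assumptions, and the corners are assumed to lie in front of the camera:

```python
import numpy as np

def project_bbox(corners_world, T_world_to_cam, K):
    """Project the 8 corners of the target's 3D box into the current image
    and return the enclosing 2D bounding box.

    corners_world:   (8, 3) corner coordinates in the SLAM map frame.
    T_world_to_cam:  4x4 pose of the current frame from the SLAM system.
    K:               3x3 camera intrinsic matrix.
    """
    pts = np.hstack([corners_world, np.ones((8, 1))])  # homogeneous coords
    cam = (T_world_to_cam @ pts.T)[:3]                 # into the camera frame
    uv = K @ cam                                       # pinhole projection
    uv = uv[:2] / uv[2]                                # divide by depth
    x1, y1 = uv.min(axis=1)
    x2, y2 = uv.max(axis=1)
    return int(x1), int(y1), int(x2), int(y2)
```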
If the SLAM technique of this embodiment is not used and video data are instead collected directly and training sample images of target objects stored, then an accurate circumscribed region of each individual target object cannot be obtained, so the training sample images have to be labeled manually afterwards.
Step S250, the current light brightness is adjusted to change the light condition.
A plurality of light intensities may be preset so as to obtain training sample images of the target object at the plurality of light intensities.
Step S260, shooting the target object at different scales to obtain images of the target object at different scales under the current light brightness, so as to obtain and save training sample images of the target object.
The camera can be held or otherwise controlled to shoot the target object at different scales; the region of interest of the target object is marked in each shot image through SLAM, the image can then be used as a training sample image of the target object, and the training sample image is saved. In this embodiment, preferably, the handheld camera moves toward and away from the target object: when close to it, the shot image should still contain the complete target, and when far from it, the target object should occupy no less than 1/10 of the shot image.
Step S270, shooting the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness, so as to obtain and save training sample images of the target object.
The camera can be held and shots taken at different angles around the target object; the region of interest of the target object is marked in each shot image through SLAM, the image can then be used as a training sample image of the target object, and the training sample image is saved. For example, the handheld camera moves around the target object, performing 360-degree orbiting, pitching, yawing and other moving shots at an angular speed below 5 degrees per second.
Step S280, judging whether training sample images of the target object under all light brightness levels have been acquired; if yes, go to step S290; if not, go to step S250.
The light brightness levels can be preset, and whether all preset levels have been collected is judged: if yes, jump to step S290; if not, jump to step S250.
Step S290, judging whether the number of collected training sample images of the target object reaches a preset threshold; if yes, ending the flow; if not, going to step S250.
The preset threshold may be set according to specific requirements.
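The control flow of steps S250 to S290 can be sketched as the loop below; set_light, capture_scales and capture_viewpoints are placeholder stubs for the hardware and SLAM-labeling calls described above, not functions named in the text:

```python
def set_light(level):
    """Placeholder: drive the light source to the given brightness."""

def capture_scales():
    """Placeholder for step S260: shoot at different scales; return count saved."""
    return 10

def capture_viewpoints():
    """Placeholder for step S270: orbit the target; return count saved."""
    return 10

def collect_samples(light_levels, sample_threshold):
    saved = 0
    while saved < sample_threshold:        # step S290: repeat until enough
        for level in light_levels:         # step S280: cover each preset level
            set_light(level)               # step S250
            saved += capture_scales()      # step S260
            saved += capture_viewpoints()  # step S270
    return saved
```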
In this embodiment, the flow may be executed for target objects in different background environments, for example: for a target object in a home environment, a target object in an office environment, and a target object in an outdoor environment.
The flow may also be executed for multiple target objects of the same type, so as to obtain a training sample image set for that type of target object, for example: for target objects of different brands and different models.
A plurality of target objects may also be determined in the image, and the region of interest of each target object is marked in the image by the SLAM system. The plurality of target objects may be of the same type or of different types.
Compared with manually labeling and collecting large numbers of training sample images, automatic collection with the SLAM technique of this embodiment has advantages in collection speed and in the accuracy of the resulting training sample images. These advantages are especially notable when large numbers of training sample images must be collected, and the diversity and richness of the training sample images are also better.
Embodiment 3
The present embodiment provides an application example to further explain the deep learning sample collection method of the present invention. Fig. 3 is a flowchart of a deep learning sample collection method according to a third embodiment of the present invention.
Step S310, selecting the ORB-SLAM system and taking Faster R-CNN as the target detection model.
Step S320, starting and initializing the ORB-SLAM system, and calling the camera to shoot the environment within the camera's field of view.
Step S330, starting Faster R-CNN, detecting the target objects in the image through Faster R-CNN, and marking the regions of interest of the target objects.
In this embodiment, training sample images containing a display, a keyboard, a mouse pad and double-sided tape need to be acquired, so the ORB-SLAM system and Faster R-CNN are started in an environment containing a display, a keyboard, a mouse pad and double-sided tape. Faster R-CNN is used to detect the display, keyboard, mouse pad and double-sided tape in the image.
Step S340, marking the bounding boxes of the display, the keyboard, the mouse pad and the double-sided tape in the images shot by the camera.
As shown in FIG. 4, a schematic diagram of an environmental image acquired when the camera shoots: the image shows a desk on which there are a display, a keyboard, a mouse pad and double-sided tape; Faster R-CNN detects the target objects in the image, namely the display, keyboard, mouse pad and double-sided tape, and the bounding boxes marking them are displayed.
Step S350, moving the camera toward and away from the target object while shooting, and saving training sample images of the target object at each scale.
As shown in FIG. 5, a schematic diagram of an image acquired when the camera is far from the target object.
As shown in FIG. 6, a schematic diagram of an image acquired when the camera is near the target object.
Step S360, making the camera shoot around the target object from different viewing angles, and saving training sample images of the target object at each viewing angle.
As shown in FIGS. 7 to 9, schematic diagrams of images acquired when the camera shoots around the target object; the viewing angles of FIGS. 7 to 9 differ.
Step S370, judging whether training sample images under all light brightness levels have been acquired; if yes, go to step S380, otherwise go to step S390.
Step S380, judging whether the number of training sample images reaches the preset threshold; if yes, ending the flow, otherwise going to step S320.
Step S390, changing the ambient light brightness and going to step S350.
This embodiment saves a great deal of manpower, material resources and time. With the SLAM system, training sample images of multiple target objects can be obtained automatically simply by recording a video around the target objects; manually framing the target objects in every original picture is avoided, and acquisition efficiency is improved by orders of magnitude.
Embodiment 4
The embodiment provides a deep learning sample collection device. Fig. 10 is a structural view of a deep learning sample collection device according to a fourth embodiment of the present invention.
The target determining module 1010 is configured to determine a target object in a frame of image of video data and label a region of interest of the target object.
The target labeling module 1020 is configured to mark the region of interest of the target object in multi-frame images of the video data by using the SLAM system, according to the target object determined in the image and the region of interest of the target object.
The image acquisition module 1030 is configured to acquire an image of a region of interest labeled with the target object, as a training sample image of the target object.
Further, the target labeling module 1020 is configured to mark, with the SLAM system, a region of interest of the target object in each frame of image containing the target object, for video data being shot or video data that has been shot.
Further, the target labeling module 1020 is configured to shoot the target object under different light conditions and/or in different poses, and mark, with the SLAM system, the region of interest of the target object in each frame of image containing the target object.
Further, the target labeling module 1020 includes:
an adjusting unit (not shown), for adjusting the current light brightness;
a scale shooting unit (not shown), for shooting the target object at different scales to obtain images of the target object at different scales under the current light brightness;
a viewing-angle shooting unit (not shown), for shooting around the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness;
and a judging unit (not shown), for judging whether shooting under all preset light brightness levels has been finished, and if not, calling the adjusting unit to adjust the current light brightness.
Further, the target determining module 1010 is configured to detect a target object in an image in the video data and mark a region of interest of the target object by using a preset target detection model.
Further, the target labeling module 1020 is configured to:
determining, through the SLAM system, a pose change of the target object in the current frame image relative to the target object in the previous frame image;
and marking the region of interest of the target object in the current frame image according to the pose change of the target object and the region of interest of the target object in the previous frame image.
The functions of the apparatus of this embodiment have been described in the method embodiments shown in FIGS. 1 to 9; reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here.
Because the number of training sample images required by a deep learning algorithm is very large, sometimes on the order of tens of millions or even hundreds of millions, manually labeling training sample images at such magnitudes inevitably introduces some mislabeled samples, and these erroneous samples negatively affect the accuracy of the final deep-learning model. The training sample images collected by the SLAM-based method of this embodiment are accurately labeled, and even if tracking occasionally fails, the SLAM system can simply be restarted to continue collecting training sample images, so mislabeling does not occur.
In this embodiment, the training sample images are obtained from video, and the change between adjacent video frames is small, so richer training sample images of the target object under different state changes can be obtained. The training sample set therefore describes the target object more finely and completely, which can to some extent reduce the deep learning algorithm's dependence on sample-augmentation preprocessing and improve the accuracy of the final deep model.
Embodiment 5
The present embodiment provides a storage medium in which a computer program is stored, the computer program being executable by a processor. Wherein the storage medium may comprise volatile memory, such as random access memory; the storage medium may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid state disk; the storage medium may also comprise a combination of memories of the kind described above.
When the computer program stored in the storage medium is executed by a processor, the deep learning sample collection methods described in Embodiments 1 to 4 can be implemented.
Embodiment 6
The present embodiment provides a deep learning sample collection apparatus. The deep learning sample collection device may be a server or a terminal device. The SLAM system may be run on a deep learning sample collection device.
The deep learning sample collection device includes a processor and a memory; the processor is configured to execute the deep learning sample collection program stored in the memory, so as to implement the deep learning sample collection method described in Embodiments 1 to 4.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and accordingly the scope of the invention is not limited to the embodiments described above.

Claims (12)

1. A deep learning sample collection method, comprising:
determining a target object in one frame of image of video data and marking a region of interest of the target object;
marking, by using a simultaneous localization and mapping (SLAM) system, the region of interest of the target object in multi-frame images of the video data according to the target object determined in the image and the region of interest of the target object; and
collecting images marked with the region of interest of the target object as training sample images of the target object;
the method for marking the target object in the multi-frame image of the video data by utilizing the real-time positioning and mapping SLAM system according to the target object determined in the image and the target object region of interest comprises the following steps:
determining, through the SLAM system, a pose change of the target object in the current frame image relative to the target object in the previous frame image; and
marking the region of interest of the target object in the current frame image according to the pose change of the target object and the region of interest of the target object in the previous frame image.
2. The method of claim 1, wherein marking the region of interest of the target object in the video data by using the SLAM system comprises:
and marking a region of interest of the target object in each frame of image containing the target object by using the SLAM system according to the video data being shot or the video data which is shot.
3. The method of claim 2, wherein marking, with the SLAM system, the region of interest of the target object in each frame of image containing the target object for the video data being shot comprises:
shooting the target object under different light conditions and/or in different poses, and marking, with the SLAM system, the region of interest of the target object in each frame of image containing the target object.
4. The method of claim 3, wherein shooting the target object under different light conditions and in different poses comprises:
step 2, adjusting the current light brightness;
step 4, shooting the target object at different scales to obtain images of the target object at different scales under the current light brightness;
step 6, shooting around the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness; and
step 8, judging whether shooting under all preset light brightness levels has been finished, and if not, jumping back to step 2.
5. The method according to any one of claims 1 to 4, wherein determining a target object in one frame of image of video data and marking the region of interest of the target object comprises:
detecting a target object in an image of the video data by using a preset target detection model, and marking the region of interest of the target object.
6. A deep learning sample collection device, comprising the following program modules:
the target determining module is used for determining a target object in one frame of image of the video data and marking out a region of interest of the target object;
the target labeling module is used for labeling the region of interest of the target object in a multi-frame image of video data by utilizing a SLAM system according to the target object determined in the image and the region of interest of the target object;
the image acquisition module is used for acquiring an image marked with the region of interest of the target object as a training sample image of the target object;
the target labeling module is used for:
determining, through the SLAM system, a pose change of the target object in the current frame image relative to the target object in the previous frame image; and
marking the region of interest of the target object in the current frame image according to the pose change of the target object and the region of interest of the target object in the previous frame image.
7. The apparatus of claim 6, wherein the targeting module is to:
and marking a region of interest of the target object in each frame of image containing the target object by using the SLAM system according to the video data being shot or the video data which is shot.
8. The apparatus of claim 7, wherein the targeting module is to:
shooting the target object under different light conditions and/or in different poses, and marking, with the SLAM system, the region of interest of the target object in each frame of image containing the target object.
9. The apparatus of claim 8, wherein the targeting module comprises:
the adjusting unit is used for adjusting the current light brightness;
the scale shooting unit is used for shooting the target object at different scales to obtain images of the target object at different scales under the current light brightness;
the viewing-angle shooting unit is used for shooting around the target object from different viewing angles to obtain images of the target object at different viewing angles under the current light brightness; and
the judging unit is used for judging whether shooting under all preset light brightness levels has been finished, and if not, calling the adjusting unit to adjust the current light brightness.
10. The apparatus according to any one of claims 6 to 9, wherein the object determining module is configured to detect an object in an image in the video data and label a region of interest of the object using a preset object detection model.
11. A storage medium storing a computer program, characterized in that the computer program stored in the storage medium, when executed by a processor, implements the method of any one of claims 1-5.
12. A deep learning sample collection device, the deep learning sample collection device comprising a processor and a memory; the processor is configured to execute a deep learning sample collection program stored in the memory to implement the method of any one of claims 1-5.
CN201710342890.8A 2017-05-16 2017-05-16 Deep learning sample collection method, device, equipment and storage medium Active CN108875730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710342890.8A CN108875730B (en) 2017-05-16 2017-05-16 Deep learning sample collection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710342890.8A CN108875730B (en) 2017-05-16 2017-05-16 Deep learning sample collection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108875730A CN108875730A (en) 2018-11-23
CN108875730B (en) 2023-08-08

Family

ID=64320506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710342890.8A Active CN108875730B (en) 2017-05-16 2017-05-16 Deep learning sample collection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108875730B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685002B (en) * 2018-12-21 2020-12-15 创新奇智(广州)科技有限公司 Data set acquisition method and system and electronic device
CN109743497B (en) * 2018-12-21 2020-06-30 创新奇智(重庆)科技有限公司 Data set acquisition method and system and electronic device
CN109919010A (en) * 2019-01-24 2019-06-21 北京三快在线科技有限公司 Image processing method and device
CN110210328B (en) * 2019-05-13 2020-08-07 北京三快在线科技有限公司 Method and device for marking object in image sequence and electronic equipment
CN110503047A (en) * 2019-08-26 2019-11-26 西南交通大学 A kind of rds data processing method and processing device based on machine learning
CN110955243B (en) * 2019-11-28 2023-10-20 新石器慧通(北京)科技有限公司 Travel control method, apparatus, device, readable storage medium, and mobile apparatus
CN111563503A (en) * 2020-05-09 2020-08-21 南宁市第三中学 Minority culture symbol identification method
CN111930225B (en) * 2020-06-28 2022-12-02 北京理工大学 Virtual-real converged keyboard system and method for mobile devices
CN113569841B (en) * 2021-09-23 2021-12-28 上海启迪睿视智能科技有限公司 Data acquisition and marking device for linear array camera and marking method thereof
CN113936340B (en) * 2021-12-16 2022-04-08 佛山市霖云艾思科技有限公司 AI model training method and device based on training data acquisition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012085833A (en) * 2010-10-20 2012-05-10 Hitachi Medical Corp Image processing system for three-dimensional medical image data, image processing method for the same, and program
JP2012088787A (en) * 2010-10-15 2012-05-10 Canon Inc Image processing device, image processing method
CN104966097A (en) * 2015-06-12 2015-10-07 成都数联铭品科技有限公司 Complex character recognition method based on deep learning
CN106096622A (en) * 2016-04-26 2016-11-09 北京航空航天大学 Semi-supervised Classification of hyperspectral remote sensing image mask method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100754385B1 (en) * 2004-09-30 2007-08-31 삼성전자주식회사 Apparatus and method for object localization, tracking, and separation using audio and video sensors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012088787A (en) * 2010-10-15 2012-05-10 Canon Inc Image processing device, image processing method
JP2012085833A (en) * 2010-10-20 2012-05-10 Hitachi Medical Corp Image processing system for three-dimensional medical image data, image processing method for the same, and program
CN104966097A (en) * 2015-06-12 2015-10-07 成都数联铭品科技有限公司 Complex character recognition method based on deep learning
CN106096622A (en) * 2016-04-26 2016-11-09 北京航空航天大学 Semi-supervised Classification of hyperspectral remote sensing image mask method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on automatic image annotation algorithms based on regions of interest; Yuan Sha et al.; Journal of Yunnan Minzu University (Natural Sciences Edition); 2013-03-10 (No. 02); full text *

Also Published As

Publication number Publication date
CN108875730A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875730B (en) Deep learning sample collection method, device, equipment and storage medium
CN110555901B (en) Method, device, equipment and storage medium for positioning and mapping dynamic and static scenes
EP3028252B1 (en) Rolling sequential bundle adjustment
CN108955718B (en) Visual odometer and positioning method thereof, robot and storage medium
CN110176032B (en) Three-dimensional reconstruction method and device
US8897539B2 (en) Using images to create measurements of structures through the videogrammetric process
US10999519B2 (en) Target tracking method and device, movable platform, and storage medium
CN110799921A (en) Shooting method and device and unmanned aerial vehicle
WO2018112788A1 (en) Image processing method and device
US20230021863A1 (en) Monitoring method, electronic device and storage medium
CN111780764A (en) Visual positioning method and device based on visual map
Kim et al. Robust visual localization in changing lighting conditions
WO2021114777A1 (en) Target detection method, terminal device, and medium
Cvišić et al. Recalibrating the KITTI dataset camera setup for improved odometry accuracy
CN109214254B (en) Method and device for determining displacement of robot
CN109451240B (en) Focusing method, focusing device, computer equipment and readable storage medium
US10778916B2 (en) Applying an annotation to an image based on keypoints
CN114766042A (en) Target detection method, device, terminal equipment and medium
CN115861860B (en) Target tracking and positioning method and system for unmanned aerial vehicle
CN111399634B (en) Method and device for recognizing gesture-guided object
CN110766731A (en) Method and device for automatically registering panoramic image and point cloud and storage medium
CN112116068A (en) Annular image splicing method, equipment and medium
JP6304815B2 (en) Image processing apparatus and image feature detection method, program and apparatus thereof
JP2005031044A (en) Three-dimensional error measuring device
WO2021114775A1 (en) Object detection method, object detection device, terminal device, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant