CN114202719A - Video sample labeling method and device, computer equipment and storage medium - Google Patents

Video sample labeling method and device, computer equipment and storage medium

Info

Publication number
CN114202719A
Authority
CN
China
Prior art keywords
image
video
labeling
mask
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111358708.0A
Other languages
Chinese (zh)
Inventor
Yuan Ye
Liu Na
Wang Zhongpan
Wu Guodong
Chen Zi'ang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan Power Intelligent Robot Co ltd
Original Assignee
Zhongyuan Power Intelligent Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan Power Intelligent Robot Co ltd filed Critical Zhongyuan Power Intelligent Robot Co ltd
Priority to CN202111358708.0A priority Critical patent/CN114202719A/en
Publication of CN114202719A publication Critical patent/CN114202719A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video sample labeling method and device, computer equipment and a storage medium. Initial video data is acquired, and preset model parameters of a first video target segmentation model are adjusted based on the initial labeled image to obtain a second video target segmentation model, so that the second model learns the labeling characteristics of the initial labeled image. The second video target segmentation model then finely labels the unlabeled images to obtain first labeled images, which automates image labeling and reduces labor and time costs. Finally, a masking RCNN model evaluates the quality of the first labeled image to produce a first quality score; if the first quality score is greater than a preset threshold, labeling of the initial video data is confirmed as complete and a video sample set is obtained, which improves the labeling quality of the video samples and makes model training easier.

Description

Video sample labeling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of machine learning, and in particular, to a method and an apparatus for labeling a video sample, a computer device, and a storage medium.
Background
At present, intelligent robots are widely applied to daily life of users, wherein the functions of the robots are mainly realized by means of complex deep neural network models such as ResNet and YOLO. In order to ensure the performance of the deep neural network model, a large number of labeled data sets are required to train the model.
At present, the process of constructing a data set for a network model applied to a robot mainly includes: data acquisition personnel shoot a target with a professional camera in a real scene to obtain original video data; data screening personnel screen the original video data and remove footage with abnormal exposure or blurred targets; and data annotation personnel label the original video data with professional software according to preset rules to obtain a labeled sample set. However, annotators inevitably introduce label noise during labeling: owing to uncontrollable factors such as confusion between target classes and cognitive differences among annotators, some samples in the set are assigned the wrong class, so the labeling quality of the sample set is poor and the trained model performs poorly.
Disclosure of Invention
The application provides a method and a device for labeling a video sample, computer equipment and a storage medium, which are used for solving the technical problem of poor quality of a labeling result of the current video sample.
In order to solve the above technical problem, an embodiment of the present application provides a method for annotating a video sample, including:
acquiring initial video data, wherein the initial video data comprises an initial marked image and an unmarked image;
based on the initial annotation image, adjusting preset model parameters of a first video target segmentation model to obtain a second video target segmentation model;
finely marking the unmarked image by using the second video target segmentation model to obtain a first marked image;
performing quality evaluation on the first annotation image by using a masking RCNN model to obtain a first quality score;
and if the first quality score is larger than a preset threshold value, confirming that the initial video data is marked completely, and obtaining a video sample set.
In this embodiment, initial video data is acquired and the preset model parameters of a first video target segmentation model are adjusted based on the initial labeled image to obtain a second video target segmentation model, so that the second model learns the labeling characteristics of the initial labeled image. The second video target segmentation model then finely labels the unlabeled images to obtain first labeled images, which automates image labeling and reduces labor and time costs. Finally, a masking RCNN model evaluates the quality of the first labeled image to produce a first quality score; if the first quality score is greater than the preset threshold, labeling of the initial video data is confirmed as complete and a video sample set is obtained, which improves the labeling quality of the video samples and makes model training easier.
In an embodiment, the step of obtaining the second video target segmentation model by adjusting the preset model parameters of the first video target segmentation model based on the initial annotation image includes:
finely marking the initial marked image according to the initial mask marking data by using the first video target segmentation model to obtain a second marked image, wherein the second marked image comprises second mask marking data;
performing quality evaluation on the second labeled image by using the masking RCNN model to obtain a second quality score;
and if the second quality score is larger than a preset threshold value, updating the model parameters of the first video target segmentation model according to the second mask marking data to obtain the second video target segmentation model.
In the embodiment, the initial labeling model is subjected to fine labeling, and quality evaluation is combined, so that the labeling quality of the initial labeling model is improved, and meanwhile, the second video target segmentation model learns high-quality labeling characteristics, so that the labeling quality of the video sample is further improved.
In an embodiment, the performing, by using the first video target segmentation model, a fine annotation on the initial annotated image according to the initial mask annotation data to obtain a second annotated image includes:
determining a second target position in the initial marked image according to the initial mask marking data by using the first video target segmentation model;
extracting a second target mask of the second target position;
and taking the second target mask as second mask labeling data, and labeling the initial labeling image to obtain a second labeling image.
In the embodiment, the mask characteristics of the acquired initial annotation image are extracted through the first video target segmentation model, and the mask characteristics are adopted to realize fine annotation of the initial annotation image, so that the annotation quality is improved.
In an embodiment, the performing, by using the second video target segmentation model, fine labeling on the unlabeled image to obtain a first labeled image includes:
predicting a first target position in the unmarked image according to second mask marking data in the second marked image by using the second video target segmentation model;
extracting a first target mask of the first target position;
and taking the first target mask as first mask marking data, and marking the unmarked image to obtain the first marked image.
In the embodiment, the mask features of the unmarked image are extracted through the second video target segmentation model, and the mask features are adopted to realize the fine marking of the unmarked image, so that the marking quality is improved.
In an embodiment, the performing quality evaluation on the first labeled image by using a masking RCNN model to obtain a first quality score includes:
predicting third mask annotation data of the first annotation image by using the masking RCNN model;
and calculating the intersection ratio between the first mask marking data and the third mask marking data, wherein the intersection ratio is the first quality score.
In this embodiment, the mask of the first annotation image is predicted through the masking RCNN model, and then the predicted mask is compared with the mask extracted by the second video target segmentation model, so as to realize quality evaluation, thereby facilitating subsequent optimization of annotation quality.
In an embodiment, the predicting the third mask annotation data of the first annotation image by using the masking RCNN model includes:
extracting a plurality of candidate regions of the first annotation image by using the masking RCNN model;
extracting the regional characteristics of each candidate region;
and performing mask prediction on the candidate region according to the region characteristics to obtain third mask labeling data of the first labeling image.
In the embodiment, the mask prediction is realized through target detection and region feature extraction.
In an embodiment, after the performing quality evaluation on the first annotation image by using a masking RCNN model to obtain a first quality score, the method further includes:
if the first quality score is not greater than the preset threshold, executing a cyclic labeling step until the first quality score is greater than the preset threshold or the number of cycles reaches a preset number, and then confirming that the labeling of the initial video data is finished to obtain a video sample; wherein the cyclic labeling step comprises:
performing supplementary annotation on the first annotated image based on annotation information input by a user to obtain a third annotated image;
performing annotation correction on the third annotated image by using the second video target segmentation model to obtain a fourth annotated image;
and evaluating the quality of the fourth annotation image by using the masking RCNN model to obtain a new first quality score.
According to the embodiment, the marked image with poor quality is corrected in a man-machine interaction mode, so that the marking quality of the video data is improved.
In a second aspect, an embodiment of the present application provides an apparatus for annotating a video sample, including:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring initial video data which comprises an initial marked image and an unmarked image;
the adjusting module is used for adjusting preset model parameters of the first video target segmentation model based on the initial annotation image to obtain a second video target segmentation model;
the labeling module is used for carrying out fine labeling on the unmarked image by utilizing the second video target segmentation model to obtain a first labeled image;
the evaluation module is used for evaluating the quality of the first annotation image by using a masking RCNN model to obtain a first quality score;
and the confirming module is used for confirming that the initial video data is marked completely to obtain a video sample set if the first quality score is larger than a preset threshold value.
In a third aspect, an embodiment of the present application provides a computer device, including a processor and a memory, where the memory is used to store a computer program, and the processor, when executing the computer program, implements the method for annotating a video sample according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for annotating a video sample according to the first aspect.
It should be noted that, for the beneficial effects of the second to fourth aspects, reference is made to the relevant description of the first aspect, which is not repeated here.
Drawings
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for annotating a video sample according to an embodiment of the present application;
FIG. 3 is a schematic image diagram illustrating a fine annotation process of a first video object segmentation model according to an embodiment of the present application;
FIG. 4 is a schematic image diagram illustrating a fine annotation process of a second video object segmentation model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for annotating a video sample according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As described in the related art, the process of building a data set for a network model applied to a robot mainly includes: data acquisition personnel shoot a target with a professional camera in a real scene to obtain original video data; data screening personnel screen the original video data and remove footage with abnormal exposure or blurred targets; and data annotation personnel label the original video data with professional software according to preset rules to obtain a labeled sample set. However, annotators inevitably introduce label noise during labeling: owing to uncontrollable factors such as confusion between target classes and cognitive differences among annotators, some samples in the set are assigned the wrong class, so the labeling quality of the sample set is poor and the trained model performs poorly.
Therefore, the embodiments of the present application provide a video sample labeling method and device, computer equipment and a storage medium. Initial video data is acquired, and preset model parameters of a first video target segmentation model are adjusted based on the initial labeled image to obtain a second video target segmentation model, so that the second model learns the labeling characteristics of the initial labeled image. The second video target segmentation model then finely labels the unlabeled images to obtain first labeled images, which automates image labeling and reduces labor and time costs. Finally, a masking RCNN model evaluates the quality of the first labeled image to produce a first quality score; if the first quality score is greater than a preset threshold, labeling of the initial video data is confirmed as complete and a video sample set is obtained, which improves the labeling quality of the video samples and makes model training easier.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application, including a robot system and a human-computer interaction labeling system. The robot system comprises a data acquisition module, a communication module, a data storage module and a central control module. The man-machine interaction labeling system comprises a server and a plurality of computer devices.
A data acquisition module: receiving an instruction of the central control module to complete the setting of parameters such as height, inclination angle and the like of the camera equipment; acquiring image data and transmitting the data to a data storage module; and feeding back the current working state information to the central control module.
A communication module: as the communication interface between the robot system and the outside world, the communication module can complete two-way communication with the central control module, receive external control instructions, and feed the robot's current working state information back to the outside.
A data storage module: and receiving an instruction of the central control module, feeding back the current working state information to the central control module, and storing the large-scale data from the data acquisition module.
The central control module: as a decision center of the robot system, the system can acquire the working state information of the rest of modules and can also send instructions to the rest of modules, so that the work of each module is coordinated.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for annotating a video sample according to an embodiment of the present disclosure. The video sample labeling method of the embodiment of the present application can be applied to the computer devices, which include, but are not limited to, computing devices such as smart phones, tablet computers, notebook computers, desktop computers, and personal digital assistants. As shown in fig. 2, the method for annotating a video sample includes steps S101 to S105, which are detailed as follows:
step S101, obtaining initial video data, wherein the initial video data comprises an initial marked image and an unmarked image.
In this step, the initial video data comprises a plurality of consecutive frames of video images, which include initial labeled images and unlabeled images. An initial labeled image is a video image that has only been simply labeled: for a user target in the image, a full annotation would normally require a mask surrounding the entire target, whereas the simple labeling may just mark the target with a short straight line. It can be understood that the number of initial labeled images is very small compared with the unlabeled images, for example 100 initial labeled frames and 9900 unlabeled frames out of 10000 video frames.
And S102, adjusting preset model parameters of a first video target segmentation model based on the initial annotation image to obtain a second video target segmentation model.
In this step, the first video target segmentation model is constructed based on a Video Object Segmentation (VOS) algorithm, which may be OSVOS, FEELVOS, DyeNet, or the like. Illustratively, a network (such as VGG-16) is pre-trained for classification on ImageNet; when pre-training is finished, the original classification layer is removed and a new loss function is embedded: the pixel-wise sigmoid balanced cross entropy, so that each pixel can be classified as foreground or background. The network with the embedded loss function is then retrained on the DAVIS-2016 training set. At inference time, whenever a new video is to be segmented, the model is first fine-tuned on the ground-truth label of the first frame, and the subsequent video frames are then segmented.
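A minimal PyTorch sketch of the pixel-wise sigmoid balanced cross entropy mentioned above is given below; the function name, tensor shapes and weighting scheme follow the usual OSVOS formulation and are illustrative assumptions rather than a definition taken from this application.

```python
import torch
import torch.nn.functional as F

def balanced_sigmoid_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Class-balanced pixel-wise sigmoid cross entropy.

    logits and target have shape (N, 1, H, W); target holds 0/1 mask values.
    Each pixel is weighted by the prevalence of the opposite class, so the
    (usually much larger) background region does not dominate the gradient.
    """
    target = target.float()
    pos = target.sum()
    total = target.numel()
    neg = total - pos
    w_pos = neg / total                      # weight for foreground pixels
    w_neg = pos / total                      # weight for background pixels
    weights = torch.where(target > 0.5, w_pos, w_neg)
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)
```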
And S103, finely marking the unmarked image by using the second video target segmentation model to obtain a first marked image.
In this step, fine labeling refers to labeling an image with the complete annotation information it requires; for example, for a user target in a video image, the fine label is mask data that surrounds the entire user target. Optionally, for an unlabeled image in the video sample, the second video target segmentation model predicts the position of the target in the unlabeled image from the position of the target in an already labeled image, detects the mask data of the target, and uses that mask data to label the unlabeled image.
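The propagation just described can be organized, for instance, as in the following sketch; `vos_model.predict_mask` is a hypothetical interface standing in for the fine-tuned second video target segmentation model and is not an API defined by this application.

```python
from typing import Any, Dict, List

def propagate_labels(vos_model: Any, frames: List[Any],
                     labeled: Dict[int, Any]) -> Dict[int, Any]:
    """Propagate masks frame by frame: each unlabeled frame is labeled from the
    most recently labeled frame, so the annotation flows along the video."""
    annotations = dict(labeled)              # {frame index: target mask}
    prev_idx = min(annotations)              # start from an initially labeled frame
    for idx in range(prev_idx + 1, len(frames)):
        if idx in annotations:               # frame already labeled by the user
            prev_idx = idx
            continue
        # Hypothetical call: predict the target mask in frames[idx] given the
        # previous frame and its mask (the existing mask labeling data).
        mask = vos_model.predict_mask(frames[idx], frames[prev_idx], annotations[prev_idx])
        annotations[idx] = mask              # the frame is now finely labeled
        prev_idx = idx
    return annotations
```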
And step S104, evaluating the quality of the first annotation image by using a masking RCNN model to obtain a first quality score.
In this embodiment, the masking RCNN model (a Mask Scoring R-CNN) includes a backbone network (Backbone), a bounding-box detection network (R-CNN Head), a mask prediction network (Mask Head), and a mask accuracy evaluation network (MaskIoU Head). The masking RCNN model predicts annotation data for the image, and this predicted annotation data is compared with the annotation data in the first labeled image to determine the quality of the first labeled image.
Step S105, if the first quality score is larger than a preset threshold value, the initial video data is confirmed to be marked completely, and a video sample set is obtained.
In this embodiment, if the first quality score is greater than the preset threshold, it indicates that the annotation quality of the first annotation image meets the requirement, and when the first quality scores of all the first annotation images are greater than the preset threshold, the annotation of the initial video data is completed, so as to obtain the video sample set.
As an example, suppose 10000 pictures of pedestrians need to be labeled; with the conventional approach, a data annotator would have to finely label all 10000 pictures one by one. With the video sample labeling method provided by this embodiment, the procedure is as follows: (1) the annotator simply labels any 100 of the 10000 pictures; (2) the 10000 pictures (9900 unlabeled and 100 simply labeled) are input into the system, and the VOS algorithm finely labels all 10000 pictures based on the simple annotation information on the 100 pictures (at this point the fine-label quality may not yet match that of a human annotator, which is why the labeling quality is scored next and the VOS algorithm is applied iteratively); (3) the system scores the labeling quality of the 10000 pictures with the Mask Scoring R-CNN algorithm; (4) the system picks out the pictures with the lowest quality scores (say 10 pictures), and the annotator supplements their annotation information; (5) the 10000 pictures are input into the system again, and the system finely labels them once more with the VOS algorithm based on the existing annotation information. These steps are repeated until the image labeling quality scores meet the requirement. In this way only 110 pictures need to be labeled manually, far less work than labeling all 10000 pictures with the conventional method.
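The outer loop of this example can be sketched as follows; `vos`, `scorer` and `manual_label_fn` are hypothetical stand-ins for the VOS fine-labeling step, the Mask Scoring R-CNN quality scorer and the human supplementary labeling step, and are not interfaces defined by this application.

```python
from typing import Any, Callable, Dict, List

def annotate_video(frames: List[Any], coarse_labels: Dict[int, Any],
                   vos: Any, scorer: Any,
                   manual_label_fn: Callable[[List[Any], List[int]], Dict[int, Any]],
                   threshold: float = 0.9, max_rounds: int = 5,
                   worst_k: int = 10) -> Dict[int, Any]:
    """Outer loop: VOS fine labeling -> quality scoring -> manual fix-up of the
    worst-scoring frames -> repeat, until every score exceeds the threshold or
    the round limit is reached."""
    labels = dict(coarse_labels)
    for _ in range(max_rounds):
        labels = vos.fine_label(frames, labels)                   # label every frame
        scores = {i: scorer.quality(frames[i], m) for i, m in labels.items()}
        if min(scores.values()) > threshold:                      # all frames pass
            break
        worst = sorted(scores, key=scores.get)[:worst_k]          # e.g. the 10 worst
        labels.update(manual_label_fn(frames, worst))             # human adds labels
    return labels
```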
In an embodiment, the step of obtaining the second video target segmentation model by adjusting the preset model parameters of the first video target segmentation model based on the initial annotation image includes:
finely marking the initial marked image according to the initial mask marking data by using the first video target segmentation model to obtain a second marked image, wherein the second marked image comprises second mask marking data;
performing quality evaluation on the second labeled image by using the masking RCNN model to obtain a second quality score;
and if the second quality score is larger than a preset threshold value, updating the model parameters of the first video target segmentation model according to the second mask marking data to obtain the second video target segmentation model.
In this embodiment, the initial annotated image is annotated finely by the first video object segmentation model, so that the first video object segmentation model learns the annotation characteristics, such as the position, size, shape, and the like of the object in the second annotated image, and thus can be used for subsequent automatic annotation of the unlabeled image.
Optionally, if the second quality score is not greater than the preset threshold, the second quality score is fed back to the user, the user corrects the annotation data through human-computer interaction, so that model parameters of the first video target segmentation model are adjusted, and the model parameter adjustment is completed until the second quality scores of all the second annotation images are greater than the preset threshold.
Optionally, the performing, by using the first video target segmentation model, fine labeling on the initial labeled image according to the initial mask labeling data to obtain a second labeled image includes:
determining a second target position in the initial marked image according to the initial mask marking data by using the first video target segmentation model;
extracting a second target mask of the second target position;
and taking the second target mask as second mask labeling data, and labeling the initial labeling image to obtain a second labeling image.
As an example, as shown in fig. 3, the left image is an initial labeled image and the right image is a second labeled image: the first video target segmentation model identifies the user position from the annotation data in the initial labeled image, extracts the user mask data, labels the initial labeled image with that mask data to obtain the second labeled image, and uses the user mask data as the second mask labeling data.
In an embodiment, the performing, by using the second video target segmentation model, fine labeling on the unlabeled image to obtain a first labeled image includes:
predicting a first target position in the unmarked image according to second mask marking data in the second marked image by using the second video target segmentation model;
extracting a first target mask of the first target position;
and taking the first target mask as first mask marking data, and marking the unmarked image to obtain the first marked image.
As an example, fig. 4 is an image schematic diagram of the fine labeling process of the second video object segmentation model. The left image in fig. 4 is a second labeled image, and the middle and right images are unlabeled images. Taking the user mask in the left image as the second mask labeling data, the model predicts the first target position of the user in the middle image and extracts its first target mask; similarly, taking the user mask in the middle image as first mask labeling data, it predicts the first target position of the user in the right image and extracts its first target mask, and so on.
In an embodiment, the performing quality evaluation on the first labeled image by using a masking RCNN model to obtain a first quality score includes:
predicting third mask annotation data of the first annotation image by using the masking RCNN model;
and calculating the intersection ratio between the first mask marking data and the third mask marking data, wherein the intersection ratio is the first quality score.
In this embodiment, the first mask labeling data obtained from the segmentation by the VOS algorithm is treated as the ground-truth mask, that is, the first mask labeling data is assumed to be the true annotation information of the first labeled image. Third mask labeling data is then predicted by the masking RCNN model. The MaskIoU Head network computes the IoU (intersection over union, an index measuring how accurately the corresponding object is detected on a given data set) between the predicted third mask labeling data and the first mask labeling data, which yields the first quality score of the first labeled image.
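For illustration, the intersection over union between the two binary masks can be computed directly, as in the following NumPy sketch; the function name and array layout are assumptions made for the example.

```python
import numpy as np

def mask_iou(vos_mask: np.ndarray, predicted_mask: np.ndarray) -> float:
    """IoU between the VOS-produced mask (treated as ground truth) and the mask
    predicted by the quality-evaluation model; the ratio is the quality score."""
    a = vos_mask.astype(bool)
    b = predicted_mask.astype(bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0
```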
Optionally, the quality score range may be set to 0 to 1, 0 to 100, or the like; a lower score indicates lower annotation quality of the data, and a higher score indicates higher annotation quality.
In an embodiment, the predicting the third mask annotation data of the first annotation image by using the masking RCNN model includes:
extracting a plurality of candidate regions of the first annotation image by using the masking RCNN model;
extracting the regional characteristics of each candidate region;
and performing mask prediction on the candidate region according to the region characteristics to obtain third mask labeling data of the first labeling image.
In this embodiment, proposals (candidate regions that are independent of the object category) of the first labeled image are extracted through the Backbone network, the features of each proposal are extracted through the R-CNN Head and Mask Head networks, and classification, bounding-box regression and mask prediction are performed on the proposals to obtain the third mask labeling data.
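Purely for illustration, a prediction pipeline of this shape (backbone features, region proposals, per-region classification/box/mask heads) can be exercised with torchvision's stock Mask R-CNN as sketched below; the stock model does not include the MaskIoU Head described in this application, and the weights argument assumes torchvision 0.13 or later.

```python
import torch
import torchvision

# Off-the-shelf Mask R-CNN (ResNet-50 FPN backbone) as a stand-in for the
# mask-prediction part of the quality-evaluation model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def predict_masks(image: torch.Tensor, score_thresh: float = 0.5) -> torch.Tensor:
    """image: float tensor of shape (3, H, W) with values in [0, 1].
    Returns boolean masks of shape (K, H, W) for detections above score_thresh,
    playing the role of the third mask labeling data."""
    with torch.no_grad():
        output = model([image])[0]            # proposals -> region features -> heads
    keep = output["scores"] > score_thresh
    return output["masks"][keep, 0] > 0.5     # threshold soft masks to binary
```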
In an embodiment, after the performing quality evaluation on the first annotation image by using a masking RCNN model to obtain a first quality score, the method further includes:
if the first quality score is not greater than the preset threshold, executing a cyclic labeling step until the first quality score is greater than the preset threshold or the number of cycles reaches a preset number, and then confirming that the labeling of the initial video data is finished to obtain a video sample; wherein the cyclic labeling step comprises:
performing supplementary annotation on the first annotated image based on annotation information input by a user to obtain a third annotated image;
performing annotation correction on the third annotated image by using the second video target segmentation model to obtain a fourth annotated image;
and evaluating the quality of the fourth annotation image by using the masking RCNN model to obtain a new first quality score.
In this embodiment, through a computer of the human-computer annotation interactive system, the user performs supplementary annotation on the first annotation image with a low quality score, and the computer communicates with the server to complete annotation correction.
In order to implement the method for labeling the video sample corresponding to the above method embodiment, corresponding functions and technical effects are achieved. Referring to fig. 5, fig. 5 is a block diagram illustrating a structure of an apparatus for annotating a video sample according to an embodiment of the present application. For convenience of explanation, only the parts related to the present embodiment are shown, and the apparatus for annotating a video sample provided in the embodiment of the present application includes:
an obtaining module 501, configured to obtain initial video data, where the initial video data includes an initial tagged image and an un-tagged image;
an adjusting module 502, configured to adjust a preset model parameter of a first video target segmentation model based on the initial annotation image to obtain a second video target segmentation model;
the labeling module 503 is configured to perform fine labeling on the unlabeled image by using the second video target segmentation model to obtain a first labeled image;
the evaluation module 504 is configured to perform quality evaluation on the first labeled image by using a masking RCNN model to obtain a first quality score;
a determining module 505, configured to determine that the initial video data is labeled completely to obtain a video sample set if the first quality score is greater than a preset threshold.
In one embodiment, the adjusting module 502 includes:
the first labeling unit is used for carrying out fine labeling on the initial labeling image according to the initial mask labeling data by utilizing the first video target segmentation model to obtain a second labeling image, and the second labeling image comprises second mask labeling data;
the evaluation unit is used for evaluating the quality of the second annotation image by using the masking RCNN model to obtain a second quality score;
and the first updating unit is used for updating the model parameters of the first video target segmentation model according to the second mask marking data to obtain the second video target segmentation model if the second quality score is larger than a preset threshold value.
In one embodiment, the first labeling unit includes:
a determining subunit, configured to determine, by using the first video target segmentation model, a second target position in the initial annotation image according to the initial mask annotation data;
a first extraction subunit, configured to extract a second target mask at the second target position;
and the labeling subunit is used for labeling the initial labeled image by taking the second target mask as second mask labeling data to obtain the second labeled image.
In one embodiment, the labeling module 503 includes:
a first prediction unit, configured to predict, by using the second video target segmentation model, a first target position in the unmarked image according to second mask marking data in the second marked image;
an extraction unit for extracting a first target mask of the first target position;
and the second labeling unit is used for labeling the unmarked image by taking the first target mask as first mask labeling data to obtain the first labeled image.
In one embodiment, the evaluation module 504 includes:
the second prediction unit is used for predicting third mask annotation data of the first annotation image by using the masking RCNN model;
and the calculating unit is used for calculating the intersection ratio between the first mask marking data and the third mask marking data, and the intersection ratio is the first quality score.
In one embodiment, the second prediction unit includes:
a second extracting subunit, configured to extract, by using the masking RCNN model, a plurality of candidate regions of the first labeled image;
a third extraction subunit, configured to extract a region feature of each of the candidate regions;
and the predicting subunit is used for performing mask prediction on the candidate region according to the region feature to obtain third mask labeling data of the first labeling image.
In one embodiment, the labeling apparatus further comprises:
the execution module is used for executing a cyclic labeling step if the first quality score is not greater than a preset threshold, and confirming that the initial video data labeling is completed to obtain a video sample until the first quality score is greater than the preset threshold or the cyclic frequency reaches a preset frequency; wherein the step of circularly labeling comprises:
the supplementary module is used for carrying out supplementary annotation on the first annotated image based on the annotation information input by the user to obtain a third annotated image;
the correction module is used for performing annotation correction on the third annotation image by using the second video target segmentation model to obtain a fourth annotation image;
and the second evaluation module is used for evaluating the quality of the fourth annotation image by using the masking RCNN model to obtain a new first quality score.
The device for annotating a video sample can implement the method for annotating a video sample according to the above method embodiment. The alternatives in the above-described method embodiments are also applicable to this embodiment and will not be described in detail here. The rest of the embodiments of the present application may refer to the contents of the above method embodiments, and in this embodiment, details are not described again.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 6, the computer device 6 of this embodiment includes: at least one processor 60 (only one shown in fig. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60, the processor 60 implementing the steps in any of the method embodiments described above when executing the computer program 62.
The computer device 6 may be a computing device such as a smartphone, tablet and desktop computer. The computer device may include, but is not limited to, a processor 60, a memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of the computer device 6 and does not constitute a limitation of the computer device 6, and may include more or less components than those shown, or combine certain components, or different components, such as input output devices, network access devices, etc.
The Processor 60 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may in some embodiments be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. The memory 61 may also be an external storage device of the computer device 6 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the computer device 6. The memory 61 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 61 may also be used to temporarily store data that has been output or is to be output.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in any of the method embodiments described above.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
In several embodiments provided herein, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are further detailed to explain the objects, technical solutions and advantages of the present application, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the present application, may occur to those skilled in the art and are intended to be included within the scope of the present application.

Claims (10)

1. A method for annotating a video sample, comprising:
acquiring initial video data, wherein the initial video data comprises an initial marked image and an unmarked image;
based on the initial annotation image, adjusting preset model parameters of a first video target segmentation model to obtain a second video target segmentation model;
finely marking the unmarked image by using the second video target segmentation model to obtain a first marked image;
performing quality evaluation on the first annotation image by using a masking RCNN model to obtain a first quality score;
and if the first quality score is larger than a preset threshold value, confirming that the initial video data is marked completely, and obtaining a video sample set.
2. The method for labeling a video sample according to claim 1, wherein the initial labeled image contains initial mask labeling data, and the adjusting the preset model parameters of the first video object segmentation model based on the initial labeled image to obtain the second video object segmentation model comprises:
finely marking the initial marked image according to the initial mask marking data by using the first video target segmentation model to obtain a second marked image, wherein the second marked image comprises second mask marking data;
performing quality evaluation on the second labeled image by using the masking RCNN model to obtain a second quality score;
and if the second quality score is larger than a preset threshold value, updating the model parameters of the first video target segmentation model according to the second mask marking data to obtain the second video target segmentation model.
3. The method for labeling a video sample according to claim 2, wherein the step of performing a fine labeling on the initial labeled image according to the initial mask labeling data by using the first video object segmentation model to obtain a second labeled image comprises:
determining a second target position in the initial marked image according to the initial mask marking data by using the first video target segmentation model;
extracting a second target mask of the second target position;
and taking the second target mask as second mask labeling data, and labeling the initial labeling image to obtain a second labeling image.
4. The method for labeling video samples according to claim 2, wherein said performing a fine labeling on the unlabeled image by using the second video object segmentation model to obtain a first labeled image comprises:
predicting a first target position in the unmarked image according to second mask marking data in the second marked image by using the second video target segmentation model;
extracting a first target mask of the first target position;
and taking the first target mask as first mask marking data, and marking the unmarked image to obtain the first marked image.
5. The method for labeling a video sample as claimed in claim 1, wherein said first labeled image comprises first mask labeling data, and said performing quality evaluation on said first labeled image by using a masking RCNN model to obtain a first quality score comprises:
predicting third mask annotation data of the first annotation image by using the masking RCNN model;
and calculating the intersection ratio between the first mask marking data and the third mask marking data, wherein the intersection ratio is the first quality score.
6. The method of claim 5, wherein predicting the third mask annotation data of the first annotated image using a masking RCNN model comprises:
extracting a plurality of candidate regions of the first annotation image by using the masking RCNN model;
extracting the regional characteristics of each candidate region;
and performing mask prediction on the candidate region according to the region characteristics to obtain third mask labeling data of the first labeling image.
7. The method for annotating a video sample according to claim 1, wherein said evaluating the quality of said first annotated image using a masking RCNN model to obtain a first quality score further comprises:
if the first quality score is not greater than the preset threshold, executing a cyclic labeling step until the first quality score is greater than the preset threshold or the number of cycles reaches a preset number, and then confirming that the labeling of the initial video data is finished to obtain a video sample; wherein the cyclic labeling step comprises:
performing supplementary annotation on the first annotated image based on annotation information input by a user to obtain a third annotated image;
performing annotation correction on the third annotated image by using the second video target segmentation model to obtain a fourth annotated image;
and evaluating the quality of the fourth annotation image by using the masking RCNN model to obtain a new first quality score.
8. An apparatus for annotating a video sample, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring initial video data which comprises an initial marked image and an unmarked image;
the adjusting module is used for adjusting preset model parameters of the first video target segmentation model based on the initial annotation image to obtain a second video target segmentation model;
the labeling module is used for carrying out fine labeling on the unmarked image by utilizing the second video target segmentation model to obtain a first labeled image;
the evaluation module is used for evaluating the quality of the first annotation image by using a masking RCNN model to obtain a first quality score;
and the confirming module is used for confirming that the initial video data is marked completely to obtain a video sample set if the first quality score is larger than a preset threshold value.
9. A computer device comprising a processor and a memory, the memory being configured to store a computer program which, when executed by the processor, implements a method of annotating a video sample according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the method of annotating a video sample according to any one of claims 1 to 7.
CN202111358708.0A 2021-11-12 2021-11-12 Video sample labeling method and device, computer equipment and storage medium Pending CN114202719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111358708.0A CN114202719A (en) 2021-11-12 2021-11-12 Video sample labeling method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111358708.0A CN114202719A (en) 2021-11-12 2021-11-12 Video sample labeling method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114202719A true CN114202719A (en) 2022-03-18

Family

ID=80647775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111358708.0A Pending CN114202719A (en) 2021-11-12 2021-11-12 Video sample labeling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114202719A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103417A1 (en) * 2022-11-18 2024-05-23 中国科学院深圳先进技术研究院 Behavior recognition method, storage medium and electronic device

Similar Documents

Publication Publication Date Title
JP6831480B2 (en) Text detection analysis methods, equipment and devices
US10936911B2 (en) Logo detection
CN110348294B (en) Method and device for positioning chart in PDF document and computer equipment
JP6994588B2 (en) Face feature extraction model training method, face feature extraction method, equipment, equipment and storage medium
CN110163076B (en) Image data processing method and related device
CN109409398B (en) Image processing apparatus, image processing method, and storage medium
JP6030240B2 (en) Method and apparatus for face recognition
CN111739027B (en) Image processing method, device, equipment and readable storage medium
CN111476067A (en) Character recognition method and device for image, electronic equipment and readable storage medium
CN109522900B (en) Natural scene character recognition method and device
CN109035370B (en) Picture labeling method and system
KR102002024B1 (en) Method for processing labeling of object and object management server
CN110737785B (en) Picture labeling method and device
CN112633313B (en) Bad information identification method of network terminal and local area network terminal equipment
Shah et al. Efficient portable camera based text to speech converter for blind person
WO2022227218A1 (en) Drug name recognition method and apparatus, and computer device and storage medium
CN108520263B (en) Panoramic image identification method and system and computer storage medium
JP2019220014A (en) Image analyzing apparatus, image analyzing method and program
CN114202719A (en) Video sample labeling method and device, computer equipment and storage medium
CN112365513A (en) Model training method and device
CN112396057A (en) Character recognition method and device and electronic equipment
CN111815689B (en) Semi-automatic labeling method, equipment, medium and device
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN113191942A (en) Method for generating image, method for training human detection model, program, and device
US20230148017A1 (en) Compositional reasoning of gorup activity in videos with keypoint-only modality

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination