CN116311120A - Video annotation model training method, video annotation method, device and equipment - Google Patents

Video annotation model training method, video annotation method, device and equipment

Info

Publication number
CN116311120A
Authority
CN
China
Prior art keywords
frame
information
annotation
frame image
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310178047.6A
Other languages
Chinese (zh)
Inventor
程大治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Rye Data Technology Co ltd
Original Assignee
Jiangsu Rye Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Rye Data Technology Co ltd filed Critical Jiangsu Rye Data Technology Co ltd
Priority to CN202310178047.6A priority Critical patent/CN116311120A/en
Publication of CN116311120A publication Critical patent/CN116311120A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The application relates to a video annotation model training method, a video annotation method, a video annotation device and related equipment. The video annotation model training method comprises the following steps: acquiring an N-th frame image sample in a video data sample, N-th frame annotation information of the N-th frame image sample, an (N+1)-th frame image sample and sample annotation information of the (N+1)-th frame image sample, wherein N is a positive integer; inputting the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample into an initial annotation model to obtain pre-annotation information, output by the initial annotation model, that corresponds to the (N+1)-th frame image sample; and calculating the degree of matching between the pre-annotation information and the sample annotation information, and iteratively updating the model parameters of the initial annotation model according to the degree of matching to obtain a trained video annotation model. By adopting the method, video annotation efficiency can be improved and the object detection performance of the vehicle can be improved.

Description

Video annotation model training method, video annotation method, device and equipment
Technical Field
The present disclosure relates to the field of computer vision, and in particular to a video annotation model training method, a video annotation method, a video annotation device and related equipment.
Background
Object detection is one of the classical problems in computer vision. With the development of machine learning models, target objects in the field of autonomous driving are often detected by computer vision techniques.
At present, for video data collected during autonomous driving, annotators label target objects with 2D bounding boxes, framing the objects that the vehicle needs to detect. However, the amount of annotation work required is large, the cost is high, and the traditional video annotation method is inefficient, which affects the object detection performance of the autonomous vehicle.
Disclosure of Invention
In view of this, a video annotation model training method, a video annotation method, a device and equipment are provided to address the problem of low video annotation efficiency in the prior art.
In one aspect, a method for training a video annotation model is provided, the method comprising:
acquiring an nth frame image sample, nth frame annotation information of the nth frame image sample, an (n+1) th frame image sample and sample annotation information of the (n+1) th frame image sample in a video data sample, wherein N is a positive integer;
inputting the Nth frame image sample, the Nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information which is output by the initial annotation model and corresponds to the (n+1) th frame image sample;
And calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain the trained video labeling model.
In one embodiment, inputting the nth frame image sample, the nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information corresponding to the (n+1) th frame image sample output by the initial annotation model, where the pre-annotation information comprises:
performing feature extraction on the Nth frame image sample and the N+1th frame image sample to obtain an Nth frame feature map and an N+1th frame feature map;
combining the N-th frame characteristic diagram and the N+1th frame characteristic diagram, and calculating the position offset of the pixel point in the N-th frame characteristic diagram in the N+1th frame characteristic diagram to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
And estimating object offset according to the characteristic cut-out graph to obtain the pre-labeling information of the (N+1) th frame image sample.
In one embodiment, feature extraction is performed on the nth frame image sample and the n+1th frame image sample to obtain an nth frame feature map and an n+1th frame feature map, including:
performing convolution operation and downsampling on the Nth frame image sample and the (n+1) th frame image sample to obtain a high-dimensional feature map;
and carrying out up-sampling and layer-jump connection on the high-dimension feature map to obtain the N-th frame feature map and the (n+1) -th frame feature map.
In one embodiment, feature interception is performed on the feature merging graph according to the nth frame annotation information to obtain a feature interception graph, including:
corresponding the feature merging graph and the Nth frame marking information to obtain a marking area in the feature merging graph;
and carrying out interpolation processing on the marked area to intercept the characteristic information in the marked area, and adjusting the length and width specification of the characteristic information to obtain the characteristic intercept drawing.
In one embodiment, performing object offset estimation according to the feature extraction graph to obtain the pre-labeling information of the n+1st frame image sample, including:
And carrying out regression calculation on the characteristic cut-off graph according to a preset loss function to obtain the pre-labeling information of the (N+1) th frame image sample.
In another aspect, a video annotation method is provided, including:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
and if the pre-labeling information is incorrect, correcting the pre-labeling information according to the received correction instruction to obtain the N+1st frame labeling information of the N+1st frame image.
In one embodiment, after the obtaining the pre-labeling information of the n+1st frame image, the method further includes:
judging whether the pre-marked information is correct or not;
if yes, the pre-labeling information, the (N+1) th frame image and the (N+2) th frame image are input into the video labeling model, and iterative labeling is carried out on the video data.
In one embodiment, the video annotation model comprises: the method comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, wherein the N-th frame image, the N-th frame annotation information and the N+1th frame image in the video data are input into a video annotation model to obtain pre-annotation information of the N+1th frame image, and the method comprises the following steps:
respectively inputting the Nth frame image and the (n+1) th frame image into the feature extraction layer to obtain an Nth frame feature image and an (n+1) th frame feature image;
combining the N frame feature map and the N+1st frame feature map to obtain a combination result, and inputting the combination result into the pixel offset estimation layer to calculate the position offset of the pixel point in the N frame feature map in the N+1st frame feature map to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and inputting the characteristic cut-off graph into the object offset estimation layer to perform object offset estimation to obtain the pre-labeling information of the (n+1) th frame image.
In another aspect, a video annotation model training apparatus is provided, the apparatus comprising:
the system comprises a sample acquisition module, a video data acquisition module and a video data analysis module, wherein the sample acquisition module is used for acquiring an nth frame of image sample in a video data sample, nth frame marking information of the nth frame of image sample, an (n+1) th frame of image sample and sample marking information of the (n+1) th frame of image sample, wherein N is a positive integer;
the sample labeling module is used for inputting the Nth frame image sample, the Nth frame labeling information and the (n+1) th frame image sample into an initial labeling model to obtain pre-labeling information which is output by the initial labeling model and corresponds to the (n+1) th frame image sample;
and the model updating module is used for calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain a trained video labeling model.
In another aspect, there is provided a video annotation device, the device comprising:
the data acquisition module is used for acquiring video data to be marked, marking an N frame image in the video data according to a received marking instruction, and obtaining N frame marking information, wherein N is a positive integer;
The data annotation module is used for inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
and the correction module is used for correcting the pre-labeling information according to the received correction instruction if the pre-labeling information is incorrect, so as to obtain the N+1st frame labeling information of the N+1st frame image.
In yet another aspect, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of:
acquiring an nth frame image sample, nth frame annotation information of the nth frame image sample, an (n+1) th frame image sample and sample annotation information of the (n+1) th frame image sample in a video data sample, wherein N is a positive integer;
inputting the Nth frame image sample, the Nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information which is output by the initial annotation model and corresponds to the (n+1) th frame image sample;
Calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain a trained video labeling model;
or the processor when executing the computer program performs the steps of:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
and if the pre-labeling information is incorrect, correcting the pre-labeling information according to the received correction instruction to obtain the N+1st frame labeling information of the N+1st frame image.
In yet another aspect, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring an nth frame image sample, nth frame annotation information of the nth frame image sample, an (n+1) th frame image sample and sample annotation information of the (n+1) th frame image sample in a video data sample, wherein N is a positive integer;
inputting the Nth frame image sample, the Nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information which is output by the initial annotation model and corresponds to the (n+1) th frame image sample;
calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain a trained video labeling model;
or the computer program when executed by a processor performs the steps of:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
And if the pre-labeling information is incorrect, correcting the pre-labeling information according to the received correction instruction to obtain the N+1st frame labeling information of the N+1st frame image.
According to the video annotation model training method, the video annotation method, the device and the equipment, the acquired N-th frame image sample, the N-th frame annotation information of the N-th frame image sample and the (N+1)-th frame image sample from the video data samples are input into the initial annotation model, so that the initial annotation model transfers the annotated features of the N-th frame image sample onto the (N+1)-th frame image sample to obtain pre-annotation information of the (N+1)-th frame image sample; the pre-annotation information is then compared with the sample annotation information of the (N+1)-th frame image sample, and the initial annotation model is iteratively updated to obtain the trained video annotation model. The trained video annotation model can detect objects more accurately and annotate the video data more accurately, so that video annotation efficiency can be improved and the object detection performance of the vehicle improved.
Drawings
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow chart of a video annotation model training method in one embodiment;
FIG. 3 is a schematic diagram of a video annotation model training process in one embodiment;
FIG. 4 is a schematic diagram of a feature extraction process in one embodiment;
FIG. 5 is a schematic diagram of a feature intercept process in one embodiment;
FIG. 6 is a flowchart of a video labeling method according to another embodiment;
FIG. 7 is a block diagram of a video annotation model training device, under one embodiment;
FIG. 8 is a block diagram of a video annotation device in one embodiment;
FIG. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
With the development of deep learning, computer vision has been researched and applied in many fields, including autonomous driving. In the autonomous driving field, annotators can use bounding boxes to label objects in the video data collected by the vehicle, framing the target objects that the vehicle needs to detect, and thereby improving the vehicle's object detection capability.
However, the amount of video annotation work required is large, the cost is high, and the traditional video annotation method is inefficient.
Therefore, a scheme for improving video annotation efficiency without losing object detection accuracy is highly desired.
Based on this, embodiments of the present application provide a video annotation model training method, a video annotation device, a computer device, and a computer readable storage medium.
The video annotation model training method provided by the application can be applied to an implementation environment shown in fig. 1. Wherein the terminal 101 communicates with the server 102 via a network. The terminal 101 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 102 may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
In some embodiments, the server 102 undertakes the primary object detection model training work and the terminal 101 undertakes the secondary object detection model training work; alternatively, the server 102 undertakes the secondary training work and the terminal 101 undertakes the primary training work; alternatively, the server 102 and the terminal 101 may cooperate with each other using a distributed computing architecture.
In some embodiments, the terminal 101 may communicate with an element or device having communication capability and/or logic operation capability disposed in the vehicle end through a network, such that the terminal 101 receives video data collected by the vehicle end, or the terminal 101 is disposed in the vehicle end as an element or device having communication capability and/or logic operation capability and communicates with the server 102, which is not limited herein.
In one embodiment, as shown in fig. 2, a video annotation model training method is provided, and the method is mainly applied to the server in fig. 1 for illustration, and includes the following steps:
step 201, acquiring an nth frame of image sample, nth frame of labeling information of the nth frame of image sample, an (n+1) th frame of image sample and sample labeling information of the (n+1) th frame of image sample in a video data sample, wherein N is a positive integer.
The video data samples are composed of a plurality of frames of image samples according to time sequence, the annotation information is the annotation of the object to be detected in the image, and the annotation information can be in the form of a 2D bounding box or other forms, and is not limited herein. For ease of description, the following labeling information may be understood and described by taking a 2D bounding box as an example.
It should be noted that, the obtained video data samples may be collected in real time by the vehicle end, or may be stored in a terminal or a server. The acquisition process may be real-time acquisition or periodic acquisition, which is not limited in this application.
Step 202, inputting the nth frame image sample, the nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information corresponding to the (n+1) th frame image sample output by the initial annotation model.
The network structure of the initial annotation model is not limited in this application. As shown in fig. 3, the initial annotation model may include a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer.
For example, referring to fig. 3, the overall training process of the video annotation model can be summarized as follows: the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample are input into the initial annotation model, and the feature extraction layer extracts features to obtain an N-th frame feature map and an (N+1)-th frame feature map; the pixel offset estimation layer then estimates the pixel offsets between the N-th frame feature map and the (N+1)-th frame feature map, and the (N+1)-th frame feature map is warped toward the N-th frame feature map according to the pixel offsets to obtain a warped feature map; feature cropping is performed according to the warped feature map, the N-th frame feature map and the N-th frame annotation information to obtain the features of each object corresponding to the annotation information, and the object offset estimation layer performs regression calculation on these features to obtain pre-annotation information of the objects in the (N+1)-th frame image.
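To make this data flow concrete, the following is a minimal PyTorch-style sketch of such a pipeline; the backbone, channel counts, crop size and box-offset parameterization are illustrative assumptions, not the exact architecture claimed in this application.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class InitialAnnotationModel(nn.Module):
    """Illustrative pipeline: feature extraction -> pixel offset estimation ->
    feature warping -> feature cropping -> object (box) offset estimation."""

    def __init__(self, feat_dim=64, crop_size=7):
        super().__init__()
        # Feature extraction layer (stand-in for the ResNet/Unet-style backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU())
        # Pixel offset estimation layer: concatenated features -> 2-channel (dx, dy) field.
        self.pixel_offset = nn.Conv2d(2 * feat_dim, 2, 3, padding=1)
        # Object offset estimation layer: regresses one box offset per cropped region.
        self.box_offset = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * feat_dim * crop_size * crop_size, 256), nn.ReLU(),
            nn.Linear(256, 4))
        self.crop_size = crop_size

    def warp(self, feat, flow):
        # Bilinearly sample the (N+1)-th frame features at positions shifted by
        # the estimated per-pixel offsets, producing the warped feature map.
        b, _, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                                torch.arange(w, device=feat.device), indexing="ij")
        pos = torch.stack((xs, ys), dim=-1).float() + flow.permute(0, 2, 3, 1)
        gx = 2.0 * pos[..., 0] / max(w - 1, 1) - 1.0   # normalize to [-1, 1] for grid_sample
        gy = 2.0 * pos[..., 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

    def forward(self, frame_n, frame_n1, boxes_n):
        # frame_n, frame_n1: (B, 3, H, W) images; boxes_n: list of (K, 4) N-th frame boxes.
        f_n, f_n1 = self.backbone(frame_n), self.backbone(frame_n1)
        flow = self.pixel_offset(torch.cat((f_n, f_n1), dim=1))         # pixel offset information
        f_warp = self.warp(f_n1, flow)                                  # warped (N+1)-th frame features
        merged = torch.cat((f_n, f_warp), dim=1)                        # feature merged map
        crops = roi_align(merged, boxes_n, output_size=self.crop_size)  # one feature crop per box
        return self.box_offset(crops)                                   # pre-annotation box offsets for frame N+1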
And 203, calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain the trained video labeling model.
The obtained pre-annotation information reflects how well the current initial annotation model processes the video data. If the matching accuracy between the pre-annotation information and the sample annotation information is low, the model parameters of the initial annotation model are adjusted and the model is iteratively updated to improve its data processing capability.
The matching accuracy between the pre-annotation information and the sample annotation information can be verified with a loss function: the higher the loss value obtained from the loss function, the lower the accuracy of the detection result, and the training objective of the initial model is to reduce the loss value. The loss function may be a SmoothL1 loss function, a cross-entropy loss function, a range loss function, an exponential loss function, or the like.
For example, the pre-annotation information is verified with a SmoothL1 loss function, whose formula L(a) may be written as:
L(a) = 0.5 * a^2, if |a| < 1
L(a) = |a| - 0.5, otherwise
where a = y - y', y is the pre-annotation information and y' is the sample annotation information, so that a is the difference between the pre-annotation information and the sample annotation information.
While the model parameters of the initial annotation model are iteratively updated, whether the initial annotation model satisfies the model training completion condition is checked. The model training completion condition may be that the number of iterations of the initial annotation model reaches a preset number of iterations, or that the loss function of the initial annotation model converges, which is not limited in the present application. When the initial annotation model satisfies the model training completion condition, the current initial annotation model is taken as the video annotation model.
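As a rough illustration of this training loop, the sketch below combines the SmoothL1 loss with the two stopping rules mentioned above (a preset iteration budget or loss convergence); the dataloader format, optimizer and offset targets are assumptions layered on the InitialAnnotationModel sketch given earlier.

import torch
import torch.nn as nn


def train_annotation_model(model, dataloader, max_iters=10000, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.SmoothL1Loss()          # measures the pre-annotation / sample-annotation mismatch
    prev_loss, it = float("inf"), 0
    for frame_n, frame_n1, boxes_n, boxes_n1 in dataloader:
        # boxes_n: list of (K, 4) N-th frame annotations; boxes_n1: (sum K, 4)
        # matched sample annotations for frame N+1 (assumed one-to-one ordering).
        pred = model(frame_n, frame_n1, boxes_n)            # pre-annotation information
        target = boxes_n1 - torch.cat(boxes_n, dim=0)       # assumed box-offset target
        loss = criterion(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        it += 1
        # Training completion condition: iteration budget reached or loss converged.
        if it >= max_iters or abs(prev_loss - loss.item()) < tol:
            break
        prev_loss = loss.item()
    return model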
According to the video annotation model training method, the N-th frame image sample, the N-th frame annotation information of the N-th frame image sample and the (N+1)-th frame image sample acquired from the video data samples are input into the initial annotation model, so that the initial annotation model transfers the annotated features of the N-th frame image sample onto the (N+1)-th frame image sample to obtain pre-annotation information of the (N+1)-th frame image sample; the pre-annotation information is compared with the sample annotation information of the (N+1)-th frame image sample, and the initial annotation model is iteratively updated to obtain the trained video annotation model. The trained video annotation model can detect objects more accurately and annotate the video data more accurately, so that video annotation efficiency can be improved and the object detection performance of the vehicle improved.
In one embodiment, inputting the nth frame image sample, the nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information corresponding to the (n+1) th frame image sample output by the initial annotation model, where the pre-annotation information comprises:
performing feature extraction on the Nth frame image sample and the N+1th frame image sample to obtain an Nth frame feature map and an N+1th frame feature map;
combining the N-th frame characteristic diagram and the N+1th frame characteristic diagram, and calculating the position offset of the pixel point in the N-th frame characteristic diagram in the N+1th frame characteristic diagram to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and estimating object offset according to the characteristic cut-out graph to obtain the pre-labeling information of the (N+1) th frame image sample.
As shown in fig. 4, the feature extraction process may be performed with a network structure based on a residual network (ResNet): each frame of image in the video data is convolved and sampled, and operations such as identity mapping and skip connections are performed to obtain the corresponding feature map.
The interpolation processing may be bilinear interpolation, which avoids the visual distortion that would be caused by resampling the image at non-integer positions.
The warped feature map refers to the image obtained by bilinearly interpolating the (N+1)-th frame feature map according to the pixel offset information relative to the N-th frame feature map; for ease of understanding, it can be thought of as warping the (N+1)-th frame feature map onto the N-th frame feature map through pixel-level shifts.
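The following toy snippet (with made-up values) shows what bilinear sampling at fractional positions does, which is the operation behind the warped feature map; grid_sample is the PyTorch primitive assumed here.

import torch
import torch.nn.functional as F

# A 1x1x1x4 "feature row" sampled at half-pixel offsets: bilinear interpolation
# blends neighbouring pixels instead of snapping to integer positions.
feat = torch.tensor([[[[0.0, 2.0, 4.0, 6.0]]]])          # (N=1, C=1, H=1, W=4)
xs = torch.tensor([0.5, 1.5, 2.5])                       # fractional x-coordinates to read
gx = 2.0 * xs / (feat.shape[-1] - 1) - 1.0               # normalize to [-1, 1]
grid = torch.stack((gx, torch.zeros_like(gx)), dim=-1).view(1, 1, 3, 2)
print(F.grid_sample(feat, grid, mode="bilinear", align_corners=True))
# tensor([[[[1., 3., 5.]]]]) -- each output is the average of its two neighbours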
Feature interception (feature cropping) refers to extracting, from the feature merged map by bilinear interpolation, the region corresponding to each object's annotation information (for example, each object's bounding box) and resizing these regions to the same length and width; this can be performed, for example, with reference to the ROI Align method.
Object offset estimation refers to performing regression calculation with fully connected layers to obtain the corresponding position of the annotation information in the (N+1)-th frame image, thereby obtaining the pre-annotation information of the (N+1)-th frame image sample.
In one embodiment, feature extraction is performed on the nth frame image sample and the n+1th frame image sample to obtain an nth frame feature map and an n+1th frame feature map, including:
Performing convolution operation and downsampling on the Nth frame image sample and the (n+1) th frame image sample to obtain a high-dimensional feature map;
and carrying out up-sampling and layer-jump connection on the high-dimension feature map to obtain the N-th frame feature map and the (n+1) -th frame feature map.
Referring to fig. 4, which shows a schematic diagram of one alternative implementation of the feature extraction process, operations such as convolution and downsampling are performed on the N-th frame image sample and the (N+1)-th frame image sample by an Encoder in the Unet network structure to obtain a high-dimensional feature map, where the Encoder may be composed of convolution layers, max-pooling layers and the like.
The high-dimensional feature maps of the N-th frame image sample and the (N+1)-th frame image sample are then upsampled by a Decoder, skip connections are made to the Encoder feature maps of the corresponding scales, and the feature maps from the downsampling process are merged by concatenation, i.e. stacked along the channel dimension, so that each scale in the feature extraction process contains more features; the Decoder may be composed of convolution layers, a concat operation for feature splicing, and the like.
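A compact encoder-decoder of this kind might look as follows; the depth, channel counts and bilinear upsampling are assumptions (the figure's exact structure is not reproduced), but the pattern of convolution plus max-pooling on the way down and upsampling plus channel-wise concatenation on the way up matches the description.

import torch
import torch.nn as nn
import torch.nn.functional as F


class UNetFeatureExtractor(nn.Module):
    """Encoder downsamples with conv + max-pooling; decoder upsamples and
    concatenates the encoder feature map of the matching scale (skip connection)."""

    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4 + base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2 + base, base, 3, padding=1), nn.ReLU())

    def forward(self, x):                                   # x: (B, 3, H, W), H and W divisible by 4
        e1 = self.enc1(x)                                   # full resolution
        e2 = self.enc2(self.pool(e1))                       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))                  # 1/4 resolution, high-dimensional features
        d2 = F.interpolate(b, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat((d2, e2), dim=1))          # skip connection by channel concatenation
        d1 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        return self.dec1(torch.cat((d1, e1), dim=1))        # per-frame feature map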
In one embodiment, feature interception is performed on the feature merging graph according to the nth frame annotation information to obtain a feature interception graph, including:
corresponding the feature merging graph and the Nth frame marking information to obtain a marking area in the feature merging graph;
and carrying out interpolation processing on the marked area to intercept the characteristic information in the marked area, and adjusting the length and width specification of the characteristic information to obtain the characteristic intercept drawing.
The interpolation processing may be bilinear interpolation, which avoids visual distortion when the cropped region is resampled at non-integer positions.
Specifically, the feature interception process may follow the ROI Align region feature aggregation approach. As shown in fig. 5, the feature merged map contains three feature objects A, B and C. The feature merged map is matched against the N-th frame annotation information to obtain the annotated regions in the feature merged map, which can be understood here as obtaining the regions enclosed by each object's bounding box in the feature merged map; the feature information within each annotated region is then extracted by bilinear interpolation and resized to preset length and width values, giving the feature crop map corresponding to each annotated region.
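As an illustration with assumed shapes and box coordinates, the cropping-and-resizing step can be performed with torchvision's roi_align, which bilinearly samples each annotated region of the merged map into a fixed-size feature crop:

import torch
from torchvision.ops import roi_align

# Assumed setup: one image in the batch, a 96-channel feature merged map at 1/4
# of the image resolution, and two N-th frame boxes given in image pixels.
merged = torch.randn(1, 96, 128, 128)                    # (batch, channels, H/4, W/4) for a 512x512 image
boxes = [torch.tensor([[40.0, 60.0, 200.0, 180.0],       # object A (x1, y1, x2, y2)
                       [300.0, 80.0, 420.0, 260.0]])]    # object B
crops = roi_align(merged, boxes, output_size=(7, 7), spatial_scale=0.25, aligned=True)
print(crops.shape)                                       # torch.Size([2, 96, 7, 7]): one fixed-size crop per box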
In one embodiment, performing object offset estimation according to the feature extraction graph to obtain the pre-labeling information of the n+1st frame image sample, including:
and carrying out regression calculation on the characteristic cut-off graph according to a preset loss function to obtain the pre-labeling information of the (N+1) th frame image sample.
The regression calculation uses fully connected layers to perform dimension transformation and integrate the previously extracted features; the fully connected layers may perform classification by means of, for example, softmax logistic regression (softmax regression), and a loss function is applied during the regression to verify the output result.
Specifically, three fully connected layers constitute the object offset estimation model, which performs regression calculation on the feature crop map to obtain the position of each object's annotation information in the (N+1)-th frame image (i.e., the position of its 2D bounding box).
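One way such a head could be written is sketched below; the three-fully-connected-layer structure follows the description, while the hidden size, class count and output parameterization are assumptions.

import torch
import torch.nn as nn


class ObjectOffsetHead(nn.Module):
    """Three fully connected layers over a flattened feature crop: two shared
    layers plus a box-regression output, with an optional softmax class branch."""

    def __init__(self, in_features=96 * 7 * 7, hidden=512, num_classes=4):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.box = nn.Linear(hidden, 4)             # 2D bounding-box position (or offset) in frame N+1
        self.cls = nn.Linear(hidden, num_classes)   # class scores, turned into probabilities by softmax

    def forward(self, crops):
        x = crops.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.box(x), self.cls(x).softmax(dim=-1)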
An embodiment of the present application further provides a video annotation method. The video annotation method may be applied to the implementation environment shown in fig. 1 and is specifically executed by the server 102 or the terminal 101 in that environment. As shown in fig. 6, the video annotation method includes at least the following steps:
and 601, obtaining video data to be annotated, and annotating an N frame image in the video data according to a received annotation instruction to obtain N frame annotation information, wherein N is a positive integer.
The video data to be marked can be acquired by the vehicle end in real time or stored in the vehicle end or in a terminal or a server. The acquisition process may be real-time acquisition or periodic acquisition, which is not limited in this application.
Labeling the image may be to apply a 2D bounding box to the object to be detected (object of interest) in the image.
Step 602, inputting the nth frame image, the nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, where the video annotation model can be trained according to any one of the video annotation model training methods provided in the foregoing embodiments.
And step 603, if the pre-labeling information is incorrect, correcting the pre-labeling information according to the received correction instruction to obtain the n+1st frame labeling information of the n+1st frame image.
In one embodiment, after the obtaining the pre-labeling information of the n+1st frame image, the method further includes:
judging whether the pre-marked information is correct or not;
if yes, the pre-labeling information, the (N+1) th frame image and the (N+2) th frame image are input into the video labeling model, and iterative labeling is carried out on the video data.
It should be noted that, in order to improve the accuracy of object detection and judgment and ensure the safety in the automatic driving process, the accuracy of the pre-labeling information can be judged after the pre-labeling information is obtained, so that the video labeling method is more accurate.
If the pre-annotation information is correct, pre-annotation continues with the next frame of image, so that the whole video is annotated;
if the pre-annotation information is incorrect, it can be corrected according to the received correction instruction to obtain correctly annotated, corrected annotation information; annotation then continues with the corrected annotation information and the next frame image, whether the pre-annotation information of the next frame image is accurate is verified in the same way, and the accuracy of the video annotation model currently in use can thereby be assessed.
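The resulting human-in-the-loop labeling flow can be summarized with the following sketch; review_fn stands in for the annotator's confirm-or-correct step, and all names are illustrative rather than an interface defined by this application.

def annotate_video(frames, first_frame_boxes, model, review_fn):
    """Propagate annotations frame by frame: the model pre-annotates frame N+1
    from frame N and its (possibly corrected) annotation, then a human reviews."""
    annotations = [first_frame_boxes]                                 # frame 0 labeled manually
    for n in range(len(frames) - 1):
        pre_boxes = model(frames[n], frames[n + 1], annotations[n])   # pre-annotation for frame N+1
        confirmed = review_fn(frames[n + 1], pre_boxes)               # confirm, or correct if wrong
        annotations.append(confirmed)                                 # corrected labels seed the next step
    return annotations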
In one embodiment, the video annotation model comprises: the method comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, wherein the N-th frame image, the N-th frame annotation information and the N+1th frame image in the video data are input into a video annotation model to obtain pre-annotation information of the N+1th frame image, and the method comprises the following steps:
respectively inputting the Nth frame image and the (n+1) th frame image into the feature extraction layer to obtain an Nth frame feature image and an (n+1) th frame feature image;
Combining the N frame feature map and the N+1st frame feature map to obtain a combination result, and inputting the combination result into the pixel offset estimation layer to calculate the position offset of the pixel point in the N frame feature map in the N+1st frame feature map to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and inputting the characteristic cut-off graph into the object offset estimation layer to perform object offset estimation to obtain the pre-labeling information of the (n+1) th frame image.
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with at least a part of the sub-steps or stages of other steps or other steps.
In one embodiment, as shown in fig. 7, there is provided a video annotation model training apparatus, comprising: sample acquisition module, sample marking module and model update module, wherein:
the system comprises a sample acquisition module, a video data acquisition module and a video data analysis module, wherein the sample acquisition module is used for acquiring an nth frame of image sample in a video data sample, nth frame marking information of the nth frame of image sample, an (n+1) th frame of image sample and sample marking information of the (n+1) th frame of image sample, wherein N is a positive integer;
the sample labeling module is used for inputting the Nth frame image sample, the Nth frame labeling information and the (n+1) th frame image sample into an initial labeling model to obtain pre-labeling information which is output by the initial labeling model and corresponds to the (n+1) th frame image sample;
and the model updating module is used for calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain a trained video labeling model.
In one embodiment, inputting the nth frame image sample, the nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information corresponding to the (n+1) th frame image sample output by the initial annotation model, where the pre-annotation information comprises:
Performing feature extraction on the Nth frame image sample and the N+1th frame image sample to obtain an Nth frame feature map and an N+1th frame feature map;
combining the N-th frame characteristic diagram and the N+1th frame characteristic diagram, and calculating the position offset of the pixel point in the N-th frame characteristic diagram in the N+1th frame characteristic diagram to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and estimating object offset according to the characteristic cut-out graph to obtain the pre-labeling information of the (N+1) th frame image sample.
In one embodiment, feature extraction is performed on the nth frame image sample and the n+1th frame image sample to obtain an nth frame feature map and an n+1th frame feature map, including:
performing convolution operation and downsampling on the Nth frame image sample and the (n+1) th frame image sample to obtain a high-dimensional feature map;
and carrying out up-sampling and layer-jump connection on the high-dimension feature map to obtain the N-th frame feature map and the (n+1) -th frame feature map.
In one embodiment, feature interception is performed on the feature merging graph according to the nth frame annotation information to obtain a feature interception graph, including:
corresponding the feature merging graph and the Nth frame marking information to obtain a marking area in the feature merging graph;
and carrying out interpolation processing on the marked area to intercept the characteristic information in the marked area, and adjusting the length and width specification of the characteristic information to obtain the characteristic intercept drawing.
In one embodiment, performing object offset estimation according to the feature extraction graph to obtain the pre-labeling information of the n+1st frame image sample, including:
and carrying out regression calculation on the characteristic cut-off graph according to a preset loss function to obtain the pre-labeling information of the (N+1) th frame image sample.
In one embodiment, as shown in fig. 8, there is provided a video annotation device comprising: the system comprises a data acquisition module, a data labeling module and a correction module, wherein:
the data acquisition module is used for acquiring video data to be marked, marking an N frame image in the video data according to a received marking instruction, and obtaining N frame marking information, wherein N is a positive integer;
The data annotation module is used for inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
and the correction module is used for correcting the pre-labeling information according to the received correction instruction if the pre-labeling information is incorrect, so as to obtain the N+1st frame labeling information of the N+1st frame image.
In one embodiment, after the obtaining the pre-labeling information of the n+1st frame image, the method further includes:
judging whether the pre-marked information is correct or not;
if yes, the pre-labeling information, the (N+1) th frame image and the (N+2) th frame image are input into the video labeling model, and iterative labeling is carried out on the video data.
In one embodiment, the video annotation model comprises: the method comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, wherein the N-th frame image, the N-th frame annotation information and the N+1th frame image in the video data are input into a video annotation model to obtain pre-annotation information of the N+1th frame image, and the method comprises the following steps:
Respectively inputting the Nth frame image and the (n+1) th frame image into the feature extraction layer to obtain an Nth frame feature image and an (n+1) th frame feature image;
combining the N frame feature map and the N+1st frame feature map to obtain a combination result, and inputting the combination result into the pixel offset estimation layer to calculate the position offset of the pixel point in the N frame feature map in the N+1st frame feature map to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and inputting the characteristic cut-off graph into the object offset estimation layer to perform object offset estimation to obtain the pre-labeling information of the (n+1) th frame image.
For specific limitations on the video annotation model training apparatus and the video annotation apparatus, reference may be made to the above limitations on the video annotation model training method and the video annotation method, which are not described herein. Each of the modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video annotation model training method and/or a video annotation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
acquiring an nth frame image sample, nth frame annotation information of the nth frame image sample, an (n+1) th frame image sample and sample annotation information of the (n+1) th frame image sample in a video data sample, wherein N is a positive integer;
inputting the Nth frame image sample, the Nth frame annotation information and the (n+1) th frame image sample into an initial annotation model to obtain pre-annotation information which is output by the initial annotation model and corresponds to the (n+1) th frame image sample;
and calculating the matching degree of the pre-labeling information and the sample labeling information, and carrying out iterative updating on the model parameters of the initial labeling model according to the matching degree to obtain the trained video labeling model.
In one embodiment, the processor when executing the computer program further performs the steps of:
performing feature extraction on the Nth frame image sample and the N+1th frame image sample to obtain an Nth frame feature map and an N+1th frame feature map;
combining the N-th frame characteristic diagram and the N+1th frame characteristic diagram, and calculating the position offset of the pixel point in the N-th frame characteristic diagram in the N+1th frame characteristic diagram to obtain pixel offset information;
interpolation processing is carried out on the N+1st frame characteristic diagram according to the pixel offset information, and the N+1st frame distortion characteristic diagram is obtained;
combining the N-th frame characteristic diagram and the N+1-th frame distortion characteristic diagram to obtain a characteristic combined diagram, and carrying out characteristic interception on the characteristic combined diagram according to the N-th frame marking information to obtain a characteristic interception diagram;
and estimating object offset according to the characteristic cut-out graph to obtain the pre-labeling information of the (N+1) th frame image sample.
In one embodiment, the processor when executing the computer program further performs the steps of:
performing convolution operation and downsampling on the Nth frame image sample and the (n+1) th frame image sample to obtain a high-dimensional feature map;
and carrying out up-sampling and layer-jump connection on the high-dimension feature map to obtain the N-th frame feature map and the (n+1) -th frame feature map.
In one embodiment, the processor when executing the computer program further performs the steps of:
corresponding the feature merging graph and the Nth frame marking information to obtain a marking area in the feature merging graph;
and carrying out interpolation processing on the marked area to intercept the characteristic information in the marked area, and adjusting the length and width specification of the characteristic information to obtain the characteristic intercept drawing.
In one embodiment, the processor when executing the computer program further performs the steps of:
and carrying out regression calculation on the characteristic cut-off graph according to a preset loss function to obtain the pre-labeling information of the (N+1) th frame image sample.
Or, the processor when executing the computer program performs the steps of:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the Nth frame image, the Nth frame annotation information and the (n+1) th frame image in the video data into a video annotation model to obtain pre-annotation information of the (n+1) th frame image, wherein the video annotation model can be trained according to any video annotation model training method provided in the embodiment;
And if the pre-labeling information is incorrect, correcting the pre-labeling information according to the received correction instruction to obtain the N+1st frame labeling information of the N+1st frame image.
In one embodiment, the processor when executing the computer program further performs the steps of:
judging whether the pre-marked information is correct or not;
if yes, the pre-labeling information, the (N+1) th frame image and the (N+2) th frame image are input into the video labeling model, and iterative labeling is carried out on the video data.
In one embodiment, the video annotation model comprises: the method comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, wherein the N-th frame image, the N-th frame annotation information and the N+1th frame image in the video data are input into a video annotation model to obtain pre-annotation information of the N+1th frame image, and the method comprises the following steps:
inputting the N-th frame image and the (N+1)-th frame image respectively into the feature extraction layer to obtain an N-th frame feature map and an (N+1)-th frame feature map;
combining the N-th frame feature map and the (N+1)-th frame feature map to obtain a combined result, and inputting the combined result into the pixel offset estimation layer to calculate the position offset, in the (N+1)-th frame feature map, of each pixel point in the N-th frame feature map, to obtain pixel offset information;
performing interpolation processing on the (N+1)-th frame feature map according to the pixel offset information to obtain an (N+1)-th frame warped feature map;
combining the N-th frame feature map and the (N+1)-th frame warped feature map to obtain a merged feature map, and cropping the merged feature map according to the N-th frame annotation information to obtain a cropped feature map;
and inputting the cropped feature map into the object offset estimation layer to perform object offset estimation to obtain the pre-annotation information of the (N+1)-th frame image (a sketch composing the three layers follows below).
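A minimal sketch of how the three layers named above could be composed into a forward pass. It reuses the hypothetical FeatureExtractor, OffsetHead, warp, crop_features and ObjectOffsetHead from the earlier sketches in this section; the box-decoding arithmetic at the end is an assumed convention, not the patented implementation.

```python
import torch
import torch.nn as nn

class VideoAnnotationModel(nn.Module):
    def __init__(self, extractor, offset_head, object_head):
        super().__init__()
        self.extractor = extractor        # feature extraction layer
        self.offset_head = offset_head    # pixel offset estimation layer
        self.object_head = object_head    # object offset estimation layer

    def forward(self, frame_n, box_n, frame_n1):
        feat_n = self.extractor(frame_n)                 # N-th frame feature map
        feat_n1 = self.extractor(frame_n1)               # (N+1)-th frame feature map
        offset = self.offset_head(feat_n, feat_n1)       # pixel offset information
        warped = warp(feat_n1, offset)                   # (N+1)-th frame warped feature map
        merged = torch.cat([feat_n, warped], dim=1)      # merged feature map
        crop = crop_features(merged, box_n)              # cropped feature map
        d = self.object_head(crop)                       # regressed (dx, dy, dw, dh)
        # Apply the regressed offset to the frame-N box to obtain the pre-annotation box.
        cx, cy = (box_n[:, 0] + box_n[:, 2]) / 2, (box_n[:, 1] + box_n[:, 3]) / 2
        w, h = box_n[:, 2] - box_n[:, 0], box_n[:, 3] - box_n[:, 1]
        cx, cy = cx + d[:, 0] * w, cy + d[:, 1] * h
        w, h = w * torch.exp(d[:, 2]), h * torch.exp(d[:, 3])
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
```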
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring, from a video data sample, an N-th frame image sample, N-th frame annotation information of the N-th frame image sample, an (N+1)-th frame image sample and sample annotation information of the (N+1)-th frame image sample, wherein N is a positive integer;
inputting the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample into an initial annotation model to obtain pre-annotation information, output by the initial annotation model, corresponding to the (N+1)-th frame image sample;
and calculating the matching degree between the pre-annotation information and the sample annotation information, and iteratively updating the model parameters of the initial annotation model according to the matching degree to obtain a trained video annotation model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing feature extraction on the N-th frame image sample and the (N+1)-th frame image sample to obtain an N-th frame feature map and an (N+1)-th frame feature map;
combining the N-th frame feature map and the (N+1)-th frame feature map, and calculating the position offset, in the (N+1)-th frame feature map, of each pixel point in the N-th frame feature map, to obtain pixel offset information;
performing interpolation processing on the (N+1)-th frame feature map according to the pixel offset information to obtain an (N+1)-th frame warped feature map;
combining the N-th frame feature map and the (N+1)-th frame warped feature map to obtain a merged feature map, and cropping the merged feature map according to the N-th frame annotation information to obtain a cropped feature map;
and performing object offset estimation according to the cropped feature map to obtain the pre-annotation information of the (N+1)-th frame image sample.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing a convolution operation and downsampling on the N-th frame image sample and the (N+1)-th frame image sample to obtain a high-dimensional feature map;
and performing upsampling and skip connections on the high-dimensional feature map to obtain the N-th frame feature map and the (N+1)-th frame feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
mapping the N-th frame annotation information onto the merged feature map to obtain an annotated region in the merged feature map;
and performing interpolation processing on the annotated region to crop out the feature information within the annotated region, and adjusting the length and width of the cropped feature information to obtain the cropped feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and performing regression calculation on the cropped feature map according to a preset loss function to obtain the pre-annotation information of the (N+1)-th frame image sample.
Alternatively, the computer program, when executed by the processor, performs the steps of:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the N-th frame image, the N-th frame annotation information and the (N+1)-th frame image in the video data into a video annotation model to obtain pre-annotation information of the (N+1)-th frame image, wherein the video annotation model can be trained according to any of the video annotation model training methods provided in the above embodiments;
and if the pre-annotation information is incorrect, correcting the pre-annotation information according to a received correction instruction to obtain (N+1)-th frame annotation information of the (N+1)-th frame image.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining whether the pre-annotation information is correct;
and if so, inputting the pre-annotation information, the (N+1)-th frame image and the (N+2)-th frame image into the video annotation model to annotate the video data iteratively.
In one embodiment, the video annotation model comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, and inputting the N-th frame image, the N-th frame annotation information and the (N+1)-th frame image in the video data into the video annotation model to obtain the pre-annotation information of the (N+1)-th frame image comprises the following steps:
inputting the N-th frame image and the (N+1)-th frame image respectively into the feature extraction layer to obtain an N-th frame feature map and an (N+1)-th frame feature map;
combining the N-th frame feature map and the (N+1)-th frame feature map to obtain a combined result, and inputting the combined result into the pixel offset estimation layer to calculate the position offset, in the (N+1)-th frame feature map, of each pixel point in the N-th frame feature map, to obtain pixel offset information;
performing interpolation processing on the (N+1)-th frame feature map according to the pixel offset information to obtain an (N+1)-th frame warped feature map;
combining the N-th frame feature map and the (N+1)-th frame warped feature map to obtain a merged feature map, and cropping the merged feature map according to the N-th frame annotation information to obtain a cropped feature map;
and inputting the cropped feature map into the object offset estimation layer to perform object offset estimation to obtain the pre-annotation information of the (N+1)-th frame image.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments merely represent several implementations of the present application; their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and such modifications and improvements fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (11)

1. A video annotation model training method, characterized by comprising the following steps:
acquiring, from a video data sample, an N-th frame image sample, N-th frame annotation information of the N-th frame image sample, an (N+1)-th frame image sample and sample annotation information of the (N+1)-th frame image sample, wherein N is a positive integer;
inputting the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample into an initial annotation model to obtain pre-annotation information, output by the initial annotation model, corresponding to the (N+1)-th frame image sample;
and calculating the matching degree between the pre-annotation information and the sample annotation information, and iteratively updating the model parameters of the initial annotation model according to the matching degree to obtain a trained video annotation model.
2. The video annotation model training method according to claim 1, wherein inputting the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample into the initial annotation model to obtain the pre-annotation information, output by the initial annotation model, corresponding to the (N+1)-th frame image sample comprises:
performing feature extraction on the N-th frame image sample and the (N+1)-th frame image sample to obtain an N-th frame feature map and an (N+1)-th frame feature map;
combining the N-th frame feature map and the (N+1)-th frame feature map, and calculating the position offset, in the (N+1)-th frame feature map, of each pixel point in the N-th frame feature map, to obtain pixel offset information;
performing interpolation processing on the (N+1)-th frame feature map according to the pixel offset information to obtain an (N+1)-th frame warped feature map;
combining the N-th frame feature map and the (N+1)-th frame warped feature map to obtain a merged feature map, and cropping the merged feature map according to the N-th frame annotation information to obtain a cropped feature map;
and performing object offset estimation according to the cropped feature map to obtain the pre-annotation information of the (N+1)-th frame image sample.
3. The video annotation model training method according to claim 2, wherein performing feature extraction on the N-th frame image sample and the (N+1)-th frame image sample to obtain the N-th frame feature map and the (N+1)-th frame feature map comprises:
performing a convolution operation and downsampling on the N-th frame image sample and the (N+1)-th frame image sample to obtain a high-dimensional feature map;
and performing upsampling and skip connections on the high-dimensional feature map to obtain the N-th frame feature map and the (N+1)-th frame feature map.
4. The video annotation model training method according to claim 2, wherein cropping the merged feature map according to the N-th frame annotation information to obtain the cropped feature map comprises:
mapping the N-th frame annotation information onto the merged feature map to obtain an annotated region in the merged feature map;
and performing interpolation processing on the annotated region to crop out the feature information within the annotated region, and adjusting the length and width of the cropped feature information to obtain the cropped feature map.
5. The video annotation model training method according to any one of claims 2 to 4, wherein performing object offset estimation according to the cropped feature map to obtain the pre-annotation information of the (N+1)-th frame image sample comprises:
performing regression calculation on the cropped feature map according to a preset loss function to obtain the pre-annotation information of the (N+1)-th frame image sample.
6. A method for video annotation, the method comprising:
obtaining video data to be annotated, and annotating an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
inputting the N-th frame image, the N-th frame annotation information and the (N+1)-th frame image in the video data into a video annotation model to obtain pre-annotation information of the (N+1)-th frame image, wherein the video annotation model is trained according to the video annotation model training method of any one of claims 1 to 5;
and if the pre-annotation information is incorrect, correcting the pre-annotation information according to a received correction instruction to obtain (N+1)-th frame annotation information of the (N+1)-th frame image.
7. The video annotation method according to claim 6, further comprising, after obtaining the pre-annotation information of the (N+1)-th frame image:
determining whether the pre-annotation information is correct;
and if so, inputting the pre-annotation information, the (N+1)-th frame image and the (N+2)-th frame image into the video annotation model to annotate the video data iteratively.
8. The video annotation method according to claim 6 or 7, wherein the video annotation model comprises a feature extraction layer, a pixel offset estimation layer and an object offset estimation layer, and inputting the N-th frame image, the N-th frame annotation information and the (N+1)-th frame image in the video data into the video annotation model to obtain the pre-annotation information of the (N+1)-th frame image comprises the following steps:
inputting the N-th frame image and the (N+1)-th frame image respectively into the feature extraction layer to obtain an N-th frame feature map and an (N+1)-th frame feature map;
combining the N-th frame feature map and the (N+1)-th frame feature map to obtain a combined result, and inputting the combined result into the pixel offset estimation layer to calculate the position offset, in the (N+1)-th frame feature map, of each pixel point in the N-th frame feature map, to obtain pixel offset information;
performing interpolation processing on the (N+1)-th frame feature map according to the pixel offset information to obtain an (N+1)-th frame warped feature map;
combining the N-th frame feature map and the (N+1)-th frame warped feature map to obtain a merged feature map, and cropping the merged feature map according to the N-th frame annotation information to obtain a cropped feature map;
and inputting the cropped feature map into the object offset estimation layer to perform object offset estimation to obtain the pre-annotation information of the (N+1)-th frame image.
9. A video annotation model training device, the device comprising:
a sample acquisition module, configured to acquire, from a video data sample, an N-th frame image sample, N-th frame annotation information of the N-th frame image sample, an (N+1)-th frame image sample and sample annotation information of the (N+1)-th frame image sample, wherein N is a positive integer;
a sample labeling module, configured to input the N-th frame image sample, the N-th frame annotation information and the (N+1)-th frame image sample into an initial annotation model to obtain pre-annotation information, output by the initial annotation model, corresponding to the (N+1)-th frame image sample;
and a model updating module, configured to calculate the matching degree between the pre-annotation information and the sample annotation information, and to iteratively update the model parameters of the initial annotation model according to the matching degree to obtain a trained video annotation model.
10. A video annotation device, the device comprising:
a data acquisition module, configured to acquire video data to be annotated and annotate an N-th frame image in the video data according to a received annotation instruction to obtain N-th frame annotation information, wherein N is a positive integer;
a data annotation module, configured to input the N-th frame image, the N-th frame annotation information and the (N+1)-th frame image in the video data into a video annotation model to obtain pre-annotation information of the (N+1)-th frame image, wherein the video annotation model is trained according to the video annotation model training method of any one of claims 1 to 5;
and a correction module, configured to correct the pre-annotation information according to a received correction instruction if the pre-annotation information is incorrect, to obtain (N+1)-th frame annotation information of the (N+1)-th frame image.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that, when executing the computer program, the processor implements the steps of the video annotation model training method of any one of claims 1 to 5 or of the video annotation method of claim 6 or 7.
CN202310178047.6A 2023-02-28 2023-02-28 Video annotation model training method, video annotation method, device and equipment Pending CN116311120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310178047.6A CN116311120A (en) 2023-02-28 2023-02-28 Video annotation model training method, video annotation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310178047.6A CN116311120A (en) 2023-02-28 2023-02-28 Video annotation model training method, video annotation method, device and equipment

Publications (1)

Publication Number Publication Date
CN116311120A true CN116311120A (en) 2023-06-23

Family

ID=86802457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310178047.6A Pending CN116311120A (en) 2023-02-28 2023-02-28 Video annotation model training method, video annotation method, device and equipment

Country Status (1)

Country Link
CN (1) CN116311120A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523565A (en) * 2023-11-13 2024-02-06 拓元(广州)智慧科技有限公司 Tail class sample labeling method, device, electronic equipment and storage medium
CN117523565B (en) * 2023-11-13 2024-05-17 拓元(广州)智慧科技有限公司 Tail class sample labeling method, device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111667011B (en) Damage detection model training and vehicle damage detection method, device, equipment and medium
WO2019201035A1 (en) Method and device for identifying object node in image, terminal and computer readable storage medium
US11929048B2 (en) Method and device for marking target cells, storage medium and terminal device
CN110751149B (en) Target object labeling method, device, computer equipment and storage medium
CN107886082B (en) Method and device for detecting mathematical formulas in images, computer equipment and storage medium
CN111310758A (en) Text detection method and device, computer equipment and storage medium
CN112001399B (en) Image scene classification method and device based on local feature saliency
CN111160303B (en) Eye movement response information detection method and device, mobile terminal and storage medium
CN113012169B (en) Full-automatic image matting method based on non-local attention mechanism
CN111192678B (en) Pathological microscopic image diagnosis and model training method, device, equipment and medium
CN111583184A (en) Image analysis method, network, computer device, and storage medium
CN115731220A (en) Grey cloth defect positioning and classifying method, system, equipment and storage medium
CN113837151A (en) Table image processing method and device, computer equipment and readable storage medium
CN114332883A (en) Invoice information identification method and device, computer equipment and storage medium
CN110245570A (en) Scan text segmentation method, device, computer equipment and storage medium
CN113870276A (en) Device and method for dividing steel microstructure phases
CN114359232B (en) Image change detection method and device based on context covariance matrix
CN110276802B (en) Method, device and equipment for positioning pathological tissue in medical image
CN116311120A (en) Video annotation model training method, video annotation method, device and equipment
CN112435284A (en) Image registration method, electronic device and storage medium
CN113435384B (en) Target detection method, device and equipment for medium-low resolution optical remote sensing image
CN113780131B (en) Text image orientation recognition method, text content recognition method, device and equipment
CN117635626A (en) Remote sensing image segmentation method, device, equipment and medium with direction frame supervision
WO2023070495A9 (en) Image processing method, electronic device and non-transitory computer-readable medium
CN116681616A (en) Convolutional neural network image denoising method based on self-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination