CN117037014A - Object labeling method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN117037014A
Authority
CN
China
Prior art keywords
detection
frame
video
frames
video frame
Prior art date
Legal status
Pending
Application number
CN202211296780.XA
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211296780.XA
Publication of CN117037014A
Legal status: Pending

Classifications

    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/764 - Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 - Image or video recognition or understanding using neural networks
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an object labeling method, an object labeling device, a computer device, a storage medium and a computer program product, which can be applied to the field of artificial intelligence. The object labeling method comprises the following steps: performing object detection on the video frames to be detected in a data set to obtain predicted video frames, each predicted video frame comprising detection frames of virtual objects and corresponding detection probabilities; taking a detection frame whose detection probability is not smaller than a target probability as a clean detection frame; selecting each predicted video frame where a clean detection frame is located to obtain candidate video frames, and determining path information of each virtual object based on the detection frames in the candidate video frames; filtering, from the candidate video frames, the detection frames on the path information that do not meet a labeling condition, to obtain video frames to be labeled; and labeling the detection frames in the video frames to be labeled whose detection probabilities fall within a probability interval, to obtain object labels, wherein the probabilities in the probability interval are less than the target probability. The method can improve labeling effect and efficiency.

Description

Object labeling method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to an object labeling method, an object labeling apparatus, a computer device, a storage medium, and a computer program product.
Background
With the continuous development of artificial intelligence, machine learning models are often used for object detection, and training such models generally requires a large amount of labeled data for the models to reach sufficient accuracy.
In the related art, unlabeled video frames in a video can be detected by a detection model, difficult samples are selected from the unlabeled video frames based on the detection results and labeled by experts, and the resulting labeled data are used to train the detection model, forming a closed loop of active learning. Because the states of the virtual objects in different video frames may differ, selecting only difficult samples in each round of labeling impairs the generalization capability of the model, and the labeling effect is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an object labeling method, apparatus, computer device, computer-readable storage medium, and computer program product that can enhance labeling effects.
In a first aspect, the present application provides an object labeling method. The method comprises the following steps:
Performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame;
selecting each predicted video frame where a clean detection frame is positioned to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
filtering the detection frames which do not meet the marking conditions on the path information from the candidate video frames to obtain video frames to be marked;
labeling a detection frame of which the detection probability in the video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
In a second aspect, the application further provides an object labeling device. The device comprises:
the object detection module is used for carrying out object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
the clean detection frame determining module is used for taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame;
the path information determining module is used for selecting each predicted video frame where the clean detection frame is located to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
the video frame to be marked determining module is used for filtering detection frames which do not meet marking conditions on the path information from the candidate video frames to obtain video frames to be marked;
the object labeling module is used for labeling the detection frames of which the detection probabilities in the video frames to be labeled belong to the probability interval to obtain object labels; wherein the probability in the probability interval is less than the target probability.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the following steps when executing the computer program:
performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame;
selecting each predicted video frame where a clean detection frame is positioned to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
filtering the detection frames which do not meet the marking conditions on the path information from the candidate video frames to obtain video frames to be marked;
labeling a detection frame of which the detection probability in the video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame;
selecting each predicted video frame where a clean detection frame is positioned to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
filtering the detection frames which do not meet the marking conditions on the path information from the candidate video frames to obtain video frames to be marked;
labeling a detection frame of which the detection probability in the video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
In a fifth aspect, the present application also provides a computer program product. Computer program product comprising a computer program which, when executed by a processor, realizes the steps of:
performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame;
selecting each predicted video frame where a clean detection frame is positioned to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
filtering the detection frames which do not meet the marking conditions on the path information from the candidate video frames to obtain video frames to be marked;
labeling a detection frame of which the detection probability in the video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
According to the object labeling method, apparatus, computer device, storage medium and computer program product, object detection is performed on the video frames to be detected in the data set to obtain predicted video frames, and the clean detection frames in the predicted video frames are determined according to the detection probability. The predicted video frames that include a clean detection frame are taken as candidate video frames, the path information of each virtual object is determined according to the detection frames in the candidate video frames, and the detection frames that do not meet the labeling condition are filtered out according to the path information of each virtual object, so that some of the difficult samples are filtered and the video frames to be labeled are obtained. Because the video frames to be labeled include clean samples, that is, simple samples, training a model with the object labels obtained by labeling the video frames to be labeled improves the generalization capability of the model and thus the labeling effect. In addition, during labeling the clean detection frames do not need to be labeled again, and only the detection frames belonging to the probability interval are labeled, so that the object labels can be obtained quickly and the labeling efficiency is greatly improved.
Drawings
FIG. 1 is a diagram of an application environment for an object annotation method in one embodiment;
FIG. 2 is a flow chart of an object labeling method in one embodiment;
FIG. 3 is a schematic diagram of a model structure of yolov3 in one embodiment;
FIG. 4 is a schematic diagram of a model structure of a detection model in one embodiment;
FIG. 5 is an exemplary diagram of an object subgraph corresponding to a detection frame of a hero r2 in a predicted video frame y1;
FIG. 6 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y2;
FIG. 7 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y3;
FIG. 8 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y4;
FIG. 9 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y1;
FIG. 10 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y2;
FIG. 11 is an exemplary diagram of an object subgraph corresponding to a detection frame in a predicted video frame y3;
FIG. 12 is an exemplary diagram of an object subgraph corresponding to a detection frame of a hero r3 in a predicted video frame y4;
FIG. 13 is a schematic diagram of the detection probabilities of the detection frames in a video frame to be labeled in one embodiment;
FIG. 14 is a schematic diagram of the detection probabilities of the detection frames in another video frame to be labeled in one embodiment;
FIG. 15 is a flowchart showing steps prior to object detection of a video frame to be detected within a dataset by a detection model, in one embodiment;
FIG. 16 is a schematic diagram of the alternating model training and object labeling of one embodiment;
FIG. 17 is a schematic diagram of an object labeling method in one embodiment of a scene;
FIG. 18 is a schematic diagram of video frames in a game training video in one embodiment;
FIG. 19 is a schematic diagram of an object labeling method in one embodiment;
FIG. 20 is a block diagram of an object labeling apparatus in one embodiment;
fig. 21 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The object labeling method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be a stand alone device, integrated on server 104, or integrated on the cloud or other network server.
In some embodiments, the terminal 102 and the server 104 may each independently perform the object labeling method provided in the embodiments of the present application. The terminal 102 and the server 104 may also cooperate to perform the object labeling method provided in the embodiments of the present application. When the terminal 102 and the server 104 cooperate to execute the object labeling method provided in the embodiment of the present application, the terminal 102 acquires a data set from the server 104, and the terminal 102 performs object detection on a video frame to be detected in the data set to obtain a predicted video frame; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability; the terminal 102 takes a detection frame with the detection probability not smaller than the target probability as a clean detection frame, the terminal 102 takes each predicted video frame with the clean detection frame as a candidate video frame, the terminal 102 determines the path information of each virtual object based on the detection frame in each candidate video frame, the terminal 102 filters the detection frames which do not meet the marking condition on the path information from the candidate video frames to obtain the video frames to be marked, and the terminal 102 marks the detection frames with the detection probability in the probability interval in the video frames to be marked to obtain the object labels.
The terminal 102 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an internet of things device, and a portable wearable device, and the internet of things device may be a smart speaker, a smart television, a smart air conditioner, and a smart vehicle device. The portable wearable device may be a smart watch, smart bracelet, headset, or the like.
The server 104 may be a separate physical server or may be a service node in a blockchain system; a peer-to-peer (P2P) network is formed between the service nodes in the blockchain system, and the P2P protocol is an application layer protocol that runs on top of the Transmission Control Protocol (TCP).
The server 104 may also be a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms.
The terminal 102 and the server 104 may be connected by a communication connection manner such as bluetooth, USB (Universal Serial Bus ) or a network, which is not limited herein.
In one embodiment, as shown in fig. 2, an object labeling method is provided. The method may be executed by a server or a terminal alone, or by the server and the terminal together; in this embodiment of the present application, execution by the terminal is taken as an example. The method includes the following steps:
step S202, performing object detection on a video frame to be detected in a data set to obtain a predicted video frame; the predicted video frame contains a detection box of the virtual object and a corresponding detection probability.
The data set comprises a plurality of training videos, the video frames to be detected can be determined based on any training video in the data set, and the video frames to be detected can be part or all of the training videos.
The virtual object is the content displayed in the video frame, and the virtual object may be a virtual character, a virtual animal, a virtual building, or the like. Illustratively, the virtual object is a virtual character; the virtual character may be a game hero, in which case the training video is a game video, or a cartoon character, in which case the training video is a cartoon video.
The video frames to be detected comprise a plurality of video frames, and the predicted video frames are a plurality of predicted video frames which are in one-to-one correspondence with the plurality of video frames.
The detection frame is a boundary frame of the virtual object and can be used for reflecting the position of the virtual object in the video frame, and the detection probability corresponding to the detection frame is the probability that the object in the detection frame is the virtual object. For example, a predicted video frame includes two detection frames, a detection frame b1 and a detection frame b2; the detection probability of the detection frame b1 is 0.6 and the detection probability of the detection frame b2 is 0.2, that is, the probability that the object in the detection frame b1 is a virtual object is 0.6, and the probability that the object in the detection frame b2 is a virtual object is 0.2.
Specifically, the terminal acquires a training video from a data set, and acquires a video frame to be detected which does not participate in model training from the training video; the terminal can detect the object of the video frame to be detected through a detection model, wherein the detection model is used for detecting the virtual object in the video frame, and then at least one video frame in the video frame to be detected comprises the virtual object. The terminal inputs the video frame to be detected into the detection model, and outputs the predicted video frame after the object detection is carried out on the video frame to be detected through the detection model.
Illustratively, the video frames to be detected, which are acquired from the data set by the terminal, are respectively f1, f2, …, fn; f1, f2, …, fn are input to the detection model, and the predicted video frames output by the detection model include y1, y2, …, yn.
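For illustration only, the following Python sketch shows one possible way to represent this detection step; the names `DetectionBox`, `PredictedFrame`, `detect_objects` and the `detection_model` callable are hypothetical and are not part of the disclosed method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectionBox:
    """A bounding box predicted for a virtual object (hypothetical structure)."""
    cx: float            # center x coordinate
    cy: float            # center y coordinate
    width: float
    height: float
    probability: float   # probability that the box contains a virtual object

@dataclass
class PredictedFrame:
    frame_id: str
    boxes: List[DetectionBox]

def detect_objects(frames, detection_model):
    """Run the (assumed) detection model on every video frame to be detected."""
    predicted = []
    for frame_id, image in frames:
        boxes = detection_model(image)   # assumed to return a list of DetectionBox
        predicted.append(PredictedFrame(frame_id=frame_id, boxes=boxes))
    return predicted
```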
In step S204, the detection frame with the detection probability not less than the target probability is used as a clean detection frame.
The clean detection frame can be used as a marked detection frame, namely, the object in the clean detection frame is determined to be a virtual object.
Specifically, for each predicted video frame, the terminal determines the clean detection frames in the predicted video frame according to the detection probability of each detection frame in the predicted video frame and a preset target probability: for each detection frame in the predicted video frame, if the detection probability of the detection frame is not less than the preset target probability, the detection frame is taken as a clean detection frame. In this manner, the terminal can determine the clean detection frames included in each predicted video frame. Since the detection probability of a clean detection frame is not less than the target probability, that is, the object in it is detected relatively easily, a clean detection frame can be regarded as a simple sample in object detection.
The target probability may be set according to actual requirements, and illustratively, the target probability belongs to [0.4,0.8], for example, the target probability is 0.4, or the target probability is 0.6, or the target probability is 0.8, which is not limited in the embodiment of the present application.
It should be noted that some of the plurality of predicted video frames may not include any clean detection frame.
Illustratively, the plurality of predicted video frames are y1, y2, …, yn, and y1 includes a detection frame yb11 and a detection frame yb12. Assuming that the detection probability of the detection frame yb11 is not less than the target probability and the detection probability of the detection frame yb12 is less than the target probability, the detection frame yb11 is taken as a clean detection frame, and y1 includes the clean detection frame yb11. y2 includes a detection frame yb21 and a detection frame yb22; assuming that the detection probabilities of both yb21 and yb22 are smaller than the target probability, y2 does not include a clean detection frame.
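A minimal sketch of this screening step, reusing the hypothetical `PredictedFrame` structure from the sketch above; the threshold value shown is only an example taken from the [0.4, 0.8] range mentioned earlier, and the `candidate_frames` helper anticipates the candidate-frame selection of step S206.

```python
TARGET_PROBABILITY = 0.6  # example value within the suggested [0.4, 0.8] range

def clean_boxes(frame, target_probability=TARGET_PROBABILITY):
    """Detection frames whose probability is not less than the target probability."""
    return [box for box in frame.boxes if box.probability >= target_probability]

def candidate_frames(predicted_frames, target_probability=TARGET_PROBABILITY):
    """Predicted video frames containing at least one clean detection frame."""
    return [frame for frame in predicted_frames
            if clean_boxes(frame, target_probability)]
```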
Step S206, selecting each predicted video frame where the clean detection frame is located to obtain candidate video frames, and determining path information of each virtual object based on the detection frame in each candidate video frame.
The candidate video frames are a plurality of candidate video frames, wherein the plurality of candidate video frames comprise part or all of a plurality of predicted video frames, and each candidate video frame comprises at least one clean detection frame.
The path information of the virtual object comprises a plurality of detection frames and detection frame data of each detection frame, wherein the detection frame data can be a representation vector of a corresponding sub-graph of the detection frame or can be the position of the detection frame. The multiple detection frames respectively belong to different candidate video frames, and the path information of the virtual object is used for reflecting the detection condition of the virtual object in each candidate video frame.
Specifically, the terminal uses a predicted video frame including a clean detection frame among the plurality of predicted video frames as a candidate video frame to select and acquire the plurality of candidate video frames among the plurality of predicted video frames. And the terminal arranges the plurality of candidate video frames according to the playing sequence to obtain a candidate video frame sequence, wherein the playing sequence is the playing sequence of the training video to which the candidate video frames belong.
The terminal determines a starting detection frame of each virtual object in the candidate video frame sequence, for each virtual object, the terminal sequentially determines each path detection frame matched with the starting detection frame in the candidate video frame sequence based on the detection frame data of the starting detection frame of the virtual object, and the terminal takes the starting detection frame and each path detection frame of the virtual object as a plurality of detection frames included in the path information of the virtual object and obtains the path information of the virtual object according to the plurality of detection frames and the detection frame data of the plurality of detection frames.
Illustratively, the plurality of predicted video frames are y1, y2, …, yn, and the predicted video frames that include a clean detection frame are taken as candidate video frames, the plurality of candidate video frames being h1, h2, …, hm. The initial detection frame of the virtual object r1 is determined in the plurality of candidate video frames; assuming the initial detection frame of the virtual object r1 is a detection frame hb11 in the candidate video frame h1, matched path detection frames are determined in h2, h3, …, hm based on the initial detection frame hb11. Assuming that the path detection frames include the detection frame hb22 in the candidate video frame h2, the detection frame hb41 in the candidate video frame h4, and the detection frame hbm1 in the candidate video frame hm, the path information of the virtual object r1 is determined according to the detection frame hb11, the detection frame hb22, the detection frame hb41, and the detection frame hbm1.
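The path assembly described above could be sketched as follows, again for illustration only; `start_boxes` and `find_next_match` are hypothetical names, and the matching criterion itself is delegated to the helper (one possible variant is sketched later in connection with adjacent video frames).

```python
def build_paths(candidate_seq, start_boxes, find_next_match):
    """Assemble path information for each virtual object (illustrative sketch).

    start_boxes maps a virtual object id to (frame_index, start_box);
    find_next_match(box, later_frames) is an assumed helper returning the list of
    (frame_id, matched_box) pairs found after the start frame.
    """
    paths = {}
    for object_id, (frame_idx, start_box) in start_boxes.items():
        start_frame = candidate_seq[frame_idx]
        path = [(start_frame.frame_id, start_box)]
        path += find_next_match(start_box, candidate_seq[frame_idx + 1:])
        paths[object_id] = path   # start box plus matched boxes = path information
    return paths
```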
Step S208, filtering the detection frames which do not meet the marking condition on the path information from the candidate video frames to obtain the video frames to be marked.
The video frames to be marked are a plurality of video frames to be marked, and the video frames to be marked correspond to the candidate video frames one by one.
Specifically, for the path information of each virtual object, the terminal determines, from the plurality of detection frames included in the path information, the detection frames that do not meet the labeling condition according to the detection frame data of the clean detection frames and the detection frame data of the non-clean detection frames on the path information. A detection frame that does not meet the labeling condition is a detection frame that differs greatly from the clean detection frames; since a clean detection frame is a simple sample in object detection, a detection frame that differs greatly from the clean detection frames can be regarded as a difficult sample in object detection. For each detection frame that does not meet the labeling condition, the terminal determines, among the plurality of candidate video frames, the candidate video frame to which that detection frame belongs, and filters the detection frame out of the candidate video frame, so as to filter the difficult samples in the candidate video frames and obtain the video frame to be labeled corresponding to the candidate video frame.
For example, the plurality of candidate video frames are h1, h2, …, hm. Taking the case where there is only one virtual object r1 in the plurality of candidate video frames as an example, the detection frame on the path information of the virtual object r1 that does not meet the labeling condition is determined to be the detection frame hb22, which is a detection frame in the candidate video frame h2; the detection frame hb22 is filtered out of the candidate video frame h2 to obtain the video frame b2 to be labeled corresponding to the candidate video frame h2.
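A hedged sketch of this filtering step; `meets_labeling_condition` is an assumed helper that compares a detection frame with the clean detection frames on the same path, and `paths` is the dictionary produced by the earlier path-building sketch.

```python
def filter_candidate_frames(candidate_seq, paths, meets_labeling_condition):
    """For each candidate frame, drop the path detection frames that do not meet
    the labeling condition, yielding the frames to be labeled (illustrative sketch).
    """
    rejected = set()
    for path in paths.values():
        for frame_id, box in path:
            if not meets_labeling_condition(box, path):
                rejected.add((frame_id, id(box)))   # mark this box for removal

    frames_to_label = []
    for frame in candidate_seq:
        kept = [b for b in frame.boxes if (frame.frame_id, id(b)) not in rejected]
        frames_to_label.append((frame.frame_id, kept))
    return frames_to_label
```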
Step S210, labeling a detection frame of which the detection probability in a video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
The video frames to be labeled are a plurality of video frames to be labeled, and each video frame to be labeled includes at least one clean detection frame; the detection frames whose detection probabilities belong to the probability interval do not include the clean detection frames. The detection frames whose detection probabilities belong to the probability interval may be all of the other detection frames in the video frame to be labeled except the clean detection frames, or only some of those other detection frames.
The object label is a labeled video frame corresponding to a video frame to be labeled. The video frame to be labeled is obtained by filtering out the detection frames that do not meet the labeling condition from a candidate video frame, the candidate video frames are the predicted video frames that include a clean detection frame, and the predicted video frames are obtained by performing object detection on the video frames to be detected; the object label is therefore also a labeled video frame corresponding to the video frame to be detected that corresponds to the candidate video frame. In some possible scenarios, the video frame to be detected and the object label corresponding to the candidate video frame may be used to train a network model that implements object detection.
Specifically, for each video frame to be labeled, if the video frame to be labeled includes a first detection frame whose detection probability belongs to the probability interval, a second detection frame whose detection probability does not belong to the probability interval and which is not a clean detection frame, and a clean detection frame, the terminal filters the second detection frame out of the video frame to be labeled, adds a labeled identifier to the clean detection frame, displays the video frame to be labeled including the first detection frame and the clean detection frame with the labeled identifier, and labels the first detection frame in the video frame to be labeled to obtain the object label.
If the video frame to be labeled includes a first detection frame whose detection probability belongs to the probability interval and a clean detection frame, the terminal adds a labeled identifier to the clean detection frame, displays the video frame to be labeled including the first detection frame and the clean detection frame with the labeled identifier, and labels the first detection frame in the video frame to be labeled to obtain the object label.
If the video frame to be labeled includes only clean detection frames, the terminal adds labeled identifiers to the clean detection frames to obtain the object label.
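The three labeling cases above could be prepared, purely as an illustration, by splitting the detection frames of a video frame to be labeled as follows; `interval_low` (the lower bound of the probability interval) is an assumed parameter.

```python
def prepare_for_labeling(frame, interval_low, target_probability):
    """Split the detection frames of a frame to be labeled (illustrative sketch):
    clean frames receive a 'labeled' identifier, frames whose probability falls in
    [interval_low, target_probability) are shown to the annotator, and the
    remaining frames are filtered out."""
    already_labeled, to_label = [], []
    for box in frame.boxes:
        if box.probability >= target_probability:
            already_labeled.append(box)    # clean detection frame
        elif interval_low <= box.probability < target_probability:
            to_label.append(box)           # detection probability in the probability interval
        # any other box (the "second detection frame" above) is dropped
    return already_labeled, to_label
```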
In the object labeling method above, object detection is performed on the video frames to be detected in the data set to obtain predicted video frames, and the clean detection frames in the predicted video frames are determined according to the detection probability. The predicted video frames that include a clean detection frame are taken as candidate video frames, the path information of each virtual object is determined according to the detection frames in the candidate video frames, and the detection frames that do not meet the labeling condition are filtered out according to the path information of each virtual object, so that some of the difficult samples are filtered and the video frames to be labeled are obtained. Because the video frames to be labeled include clean samples, that is, simple samples, training a model with the object labels obtained by labeling the video frames to be labeled improves the generalization capability of the model and thus the labeling effect. In addition, during labeling the clean detection frames do not need to be labeled again, and only the detection frames belonging to the probability interval are labeled, so that the object labels can be obtained quickly and the labeling efficiency is greatly improved.
In some embodiments, the predicted video frames are object detected by a detection model; object detection is carried out on a video frame to be detected in a data set to obtain a predicted video frame, and the method comprises the following steps: extracting features of the video frames to be detected in the data set through a basic network in the detection model; and detecting the feature map extracted by the basic network through at least one detection network in the detection model to obtain a predicted video frame corresponding to each detection network.
The detection model comprises a basic network and at least one detection network connected with the basic network, wherein the basic network is used for extracting the characteristics of a video frame to be detected, and the detection network is used for detecting and obtaining a virtual object in the video frame to be detected according to the characteristics of the video frame to be detected.
Specifically, for each video frame in the video frames to be detected, the terminal is configured with a detection model, the terminal inputs the video frame into the base network to obtain a feature map of the video frame, and the feature map is processed through at least one detection network to obtain at least one predicted video frame corresponding to the video frame.
In one implementation, the detection model includes a base network, and a detection network connected to the base network, the detection model may be implemented by yolov3, which is a multi-scale detection model.
The model structure of yolov3 as shown in fig. 3, yolov3 includes a base network p1 and a detection network p2, where the base network p1 includes a first convolution layer p11, a first downsampling layer p12, a first residual module p13, a second downsampling layer p14, a second residual module p15, a third downsampling layer p16, a third residual module p17, a fourth downsampling layer p18, a fourth residual module p19, a fifth downsampling layer p110, and a fifth residual module p111;
the convolution information of the first convolution layer includes: the number of convolution kernels is 32, and the convolution kernel size is 3×3; inputting a video frame into a first convolution layer, and extracting by the first convolution layer to obtain a first feature map t1, wherein the size of t1 is 416 multiplied by 416;
the convolution information of the first downsampling layer includes: the number of convolution kernels is 64, the size of the convolution kernels is 3 multiplied by 3, and the step length is 2; inputting the first feature map t1 into a first downsampling layer, and extracting by the first downsampling layer to obtain a second feature map t2, wherein the size of t2 is 208×208;
the first residual module includes: a second convolution layer and a third convolution layer, the convolution information of the second convolution layer comprising: the number of convolution kernels is 32, the convolution kernel size is 1×1, and the convolution information of the third convolution layer includes: the number of convolution kernels is 64, and the convolution kernel size is 3×3; inputting the second characteristic diagram t2 into a first residual error module, and obtaining a third characteristic diagram t3 through the first residual error module, wherein the size of t3 is 208 multiplied by 208;
The convolution information of the second downsampling layer includes: the number of convolution kernels is 128, the size of the convolution kernels is 3 multiplied by 3, and the step length is 2; inputting the third characteristic diagram t3 into a second downsampling layer to obtain a fourth characteristic diagram t4, wherein the size of the t4 is 104 multiplied by 104;
the second residual error module comprises 2 cascaded second residual error sub-modules, the second residual error sub-module comprises a fourth convolution layer and a fifth convolution layer, and the convolution information of the fourth convolution layer comprises: the number of convolution kernels is 64, the convolution kernel size is 1×1, and the convolution information of the fifth convolution layer includes: the number of convolution kernels is 128, and the convolution kernel size is 3×3; inputting the fourth characteristic diagram t4 into a second residual error module to obtain a fifth characteristic diagram t5, wherein the size of the t5 is 104 multiplied by 104;
the convolution information of the third downsampling layer includes: the number of convolution kernels is 256, the size of the convolution kernels is 3 multiplied by 3, and the step length is 2; inputting the fifth characteristic diagram t5 into a third downsampling layer to obtain a sixth characteristic diagram t6, wherein the size of the t6 is 52 multiplied by 52;
the third residual error module comprises 8 cascaded third residual error sub-modules, the third residual error sub-module comprises a sixth convolution layer and a seventh convolution layer, and the convolution information of the sixth convolution layer comprises: the number of convolution kernels is 128, the convolution kernel size is 1×1, and the convolution information of the seventh convolution layer includes: the number of convolution kernels is 256, and the convolution kernel size is 3×3; inputting the sixth feature map t6 into a third residual error module to obtain a seventh feature map t7, wherein the size of the t7 is 52×52;
The convolution information of the fourth downsampling layer includes: the number of convolution kernels is 512, the size of the convolution kernels is 3 multiplied by 3, and the step length is 2; inputting the seventh feature map t7 into a fourth downsampling layer to obtain an eighth feature map t8, wherein the size of the t8 is 26 multiplied by 26;
the fourth residual error module comprises 8 cascaded fourth residual error sub-modules, the fourth residual error sub-module comprises an eighth convolution layer and a ninth convolution layer, and the convolution information of the eighth convolution layer comprises: the number of convolution kernels is 256, the convolution kernel size is 1×1, and the convolution information of the ninth convolution layer includes: the number of convolution kernels is 512, and the convolution kernel size is 3×3; inputting the eighth feature map t8 to a fourth residual error module to obtain a ninth feature map t9, wherein the size of the t9 is 26 multiplied by 26;
the convolution information of the fifth downsampling layer includes: the number of convolution kernels is 1024, the size of the convolution kernels is 3 multiplied by 3, the step length is 2, the ninth feature map t9 is input to a fifth downsampling layer, and a tenth feature map t10 is obtained, and the size of t10 is 13 multiplied by 13;
the fifth residual error module comprises 4 cascaded fifth residual error sub-modules, the fifth residual error sub-module comprises a tenth convolution layer and an eleventh convolution layer, the number of convolution kernels of the tenth convolution layer is 512, the convolution kernel size is 1 multiplied by 1, the number of convolution kernels of the eleventh convolution layer is 1024, and the convolution kernel size is 3 multiplied by 3; inputting the tenth characteristic diagram t10 into a fifth residual error module to obtain an eleventh characteristic diagram t11, wherein the size of the t11 is 13 multiplied by 13;
The detection network p2 comprises a first scale detection module p21, a first upsampling module p22, a first merging layer p23, a second scale detection module p24, a second upsampling module p25, a second merging layer p26 and a third scale detection module p27. The first scale detection module is connected with the fifth residual error module; the first up-sampling module is connected with the first scale detection module; the first merging layer is connected with the first up-sampling module and the fourth residual error module; the second scale detection module is connected with the first merging layer; the second up-sampling module is connected with the second scale detection module; the second merging layer is connected with a second up-sampling module and a third residual error module; the third scale detection module is connected with the second merging layer.
The first scale detection module comprises a cascade convolution set, a twelfth convolution layer and two-dimensional convolution; the first scale detection module, the second scale detection module and the third scale detection module have the same structure; the first upsampling module includes: a thirteenth convolution layer and an upsampling layer; the first upsampling module and the second upsampling module have the same structure.
The first scale detection module is connected with the fifth residual error module, which means that the fifth residual error module is cascaded with the convolution set in the first scale detection module; the first up-sampling module is connected with the first scale detection module, and refers to that a volume set of the first scale detection module is cascaded with a thirteenth convolution layer of the first up-sampling module; the first merging layer is connected with the first up-sampling module and the fourth residual error module, which means that the up-sampling layer of the first up-sampling module is cascaded with the first merging layer, and the fourth residual error module is cascaded with the first merging layer; the second scale detection module is connected with the first merging layer, and the first merging layer is cascaded with the convolution set in the second scale detection module; the second up-sampling module is connected with the second scale detection module, and refers to cascade connection of a volume set in the second scale detection module and a thirteenth convolution layer in the second up-sampling module; the second merging layer is connected with the second up-sampling module and the third residual error module, which means that the up-sampling layer in the second up-sampling module is cascaded with the second merging layer, and the third residual error module is cascaded with the second merging layer; the third scale detection module is connected with the second merging layer, which means that the second merging layer is cascaded with the convolution set in the third scale detection module.
The convolution set includes a concatenated fourteenth convolution layer of 1×1, a fifteenth convolution layer of 3×3, a sixteenth convolution layer of 1×1, a seventeenth convolution layer of 3×3, and an eighteenth convolution layer of 1×1.
Processing the feature map extracted by the base network p1 through the detection network p2 to obtain a predicted video frame, including: processing the eleventh feature map t11 through a first scale detection module to obtain a first prediction result; a twelfth feature map t12 output by a convolution set in the first scale detection module is obtained, the t12 is input to the first up-sampling module to obtain a thirteenth feature map t13, the t13 and the ninth feature map t9 are input to the first merging layer, and the t13 and the t9 are processed through the first merging layer to obtain a fourteenth feature map t14; processing t14 through a second scale detection module to obtain a second prediction result; a fifteenth feature map t15 output by a convolution set in a second scale detection module is obtained, the t15 is input to a second up-sampling module to obtain a sixteenth feature map t16, the t16 and a seventh feature map t7 are input to a second merging layer, and the t16 and the t7 are processed through the second merging layer to obtain a seventeenth feature map t17; processing t17 through a third scale detection module to obtain a third prediction result; and obtaining a predicted video frame according to the first predicted result, the second predicted result and the third predicted result.
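For readers who prefer code, the following is a heavily simplified PyTorch sketch of this yolov3-style structure, for illustration only: it keeps the convolution/downsampling/residual pattern of the base network and the three-scale detection head linked by upsampling and merging layers, but reduces the residual counts and omits anchor handling and output decoding.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class Residual(nn.Module):
    """1x1 convolution followed by a 3x3 convolution with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(conv(ch, ch // 2, 1), conv(ch // 2, ch, 3))
    def forward(self, x):
        return x + self.block(x)

class Backbone(nn.Module):
    """Base network: repeated downsampling + residual stages (layer counts reduced)."""
    def __init__(self):
        super().__init__()
        self.stem = conv(3, 32, 3)
        self.stage1 = nn.Sequential(conv(32, 64, 3, 2), Residual(64))
        self.stage2 = nn.Sequential(conv(64, 128, 3, 2), Residual(128))
        self.stage3 = nn.Sequential(conv(128, 256, 3, 2), Residual(256))    # 52x52 map
        self.stage4 = nn.Sequential(conv(256, 512, 3, 2), Residual(512))    # 26x26 map
        self.stage5 = nn.Sequential(conv(512, 1024, 3, 2), Residual(1024))  # 13x13 map
    def forward(self, x):
        x = self.stem(x)
        x = self.stage2(self.stage1(x))
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c3, c4, c5

class Head(nn.Module):
    """Detection network: three scale detection modules linked by upsampling and merging."""
    def __init__(self, num_outputs):
        super().__init__()
        self.scale1 = conv(1024, 512, 1)
        self.pred1 = nn.Conv2d(512, num_outputs, 1)
        self.up1 = nn.Sequential(conv(512, 256, 1), nn.Upsample(scale_factor=2))
        self.scale2 = conv(256 + 512, 256, 1)
        self.pred2 = nn.Conv2d(256, num_outputs, 1)
        self.up2 = nn.Sequential(conv(256, 128, 1), nn.Upsample(scale_factor=2))
        self.scale3 = conv(128 + 256, 128, 1)
        self.pred3 = nn.Conv2d(128, num_outputs, 1)
    def forward(self, c3, c4, c5):
        s1 = self.scale1(c5)
        p1 = self.pred1(s1)                                 # first prediction (coarsest scale)
        s2 = self.scale2(torch.cat([self.up1(s1), c4], 1))  # first merging layer
        p2 = self.pred2(s2)                                 # second prediction
        s3 = self.scale3(torch.cat([self.up2(s2), c3], 1))  # second merging layer
        p3 = self.pred3(s3)                                 # third prediction (finest scale)
        return p1, p2, p3

backbone, head = Backbone(), Head(num_outputs=3 * (5 + 1))  # e.g. 3 anchors x (box + prob + 1 class)
frame = torch.randn(1, 3, 416, 416)
predictions = head(*backbone(frame))
```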
In the above embodiment, object detection is performed on part of the video frames by using the detection model to obtain the predicted video frames; since the detection model includes at least one detection network, a predicted video frame corresponding to each detection network can be obtained, and predicted video frames that include detection frames can thus be obtained.
In some embodiments, the detection network in the detection model includes a first detection network and a second detection network; taking a detection frame with the detection probability not smaller than the target probability as a clean detection frame, comprising: when the detection probability of the target detection frame in the predicted video frame corresponding to the first detection network is not less than the target probability, and the detection probability of the corresponding target detection frame in the predicted video frame corresponding to the second detection network is not less than the target probability, the target detection frame in the predicted video frame corresponding to the first detection network is used as a clean detection frame; or taking the target detection frame in the predicted video frame corresponding to the second detection network as a clean detection frame.
The detection model comprises a base network, a first detection network and a second detection network, wherein the first detection network and the second detection network are connected with the base network, the feature map of the video frame to be detected is processed through the first detection network to obtain a predicted video frame corresponding to the first detection network, and the feature map of the video frame to be detected is processed through the second detection network to obtain a predicted video frame corresponding to the second detection network.
Specifically, for convenience of explanation, the predicted video frame corresponding to the first detection network is referred to as a first predicted video frame, and the predicted video frame corresponding to the second detection network is referred to as a second predicted video frame; the target detection frame in the first predicted video frame is marked as a first target detection frame, and the target detection frame in the second predicted video frame is marked as a second target detection frame.
For each video frame in the video frames to be detected, the terminal inputs the video frame into a detection model to obtain a first predicted video frame and a second predicted video frame; and for a first target detection frame with the detection probability not smaller than the target probability in the first predicted video frame, determining a second target detection frame corresponding to the first target detection frame in the second predicted video frame, and taking the first target detection frame in the first predicted video frame or the second target detection frame in the second predicted video frame as a clean detection frame if the detection probability of the second target detection frame is not smaller than the target probability.
Determining the second target detection frame corresponding to the first target detection frame in the second predicted video frame refers to determining, in the second predicted video frame, the second target detection frame whose position is the same as that of the first target detection frame. The position of the first target detection frame includes a first center coordinate, a first width and a first height, and the position of the second target detection frame includes a second center coordinate, a second width and a second height; the first target detection frame having the same position as the second target detection frame may mean that the first center coordinate is the same as the second center coordinate and the first width is the same as the second width, or that the first center coordinate is the same as the second center coordinate and the first height is the same as the second height.
When the detection probability of the target detection frame in the predicted video frame corresponding to the first detection network is not smaller than the target probability, and the detection probability of the corresponding target detection frame in the predicted video frame corresponding to the second detection network is smaller than the target probability, the target detection frame in the predicted video frame corresponding to the first detection network and the target detection frame in the predicted video frame corresponding to the second detection network are not clean detection frames.
The detection model includes a base network, and when the first detection network and the second detection network are connected to the base network, the model structure of the detection model is shown in fig. 4. A fifth residual module W1-5 in the base network W1 is respectively connected with a first scale detection module W2-1 in the first detection network W2 and a first scale detection module W3-1 in the second detection network W3; a fourth residual module W1-4 in the base network W1 is respectively connected with the first merging layer W2-2 in the first detection network W2 and the first merging layer W3-2 in the second detection network W3; a third residual module W1-3 in the base network W1 is respectively connected with the second merging layer W2-3 in the first detection network W2 and the second merging layer W3-3 in the second detection network W3; the first predicted video frame is obtained through three detection results respectively output by three scale detection modules of the first detection network W2, and the second predicted video frame is obtained through three detection results respectively output by three scale detection modules of the second detection network W3.
The network structure of the first detection network is the same as that of the second detection network, the initialization network parameters of the first detection network are different from those of the second detection network, and the connection mode of the first detection network and the basic network is the same as that of the second detection network and the basic network; the specific network structure of the first detection network and the specific connection manner of the first detection network and the base network may refer to fig. 3 and the corresponding embodiment of fig. 3, which are not described herein again.
Illustratively, a video frame f1 of the video frames to be detected is input to the detection model, a first predicted video frame y11 is output through the first detection network of the detection model, and a second predicted video frame y12 is output through the second detection network of the detection model. Assume that the first predicted video frame y11 includes detection frames yb111, yb112, yb113 and yb114, and the second predicted video frame y12 includes detection frames yb121, yb122 and yb123. If the detection probability of yb111 is not less than the target probability, yb111 is taken as a first target detection frame, and the second target detection frame corresponding to yb111 in the second predicted video frame y12 is determined to be yb123; if the detection probability of yb123 is not less than the target probability, yb111 or yb123 is taken as a clean detection frame. If the detection probability of yb112 is not less than the target probability, yb112 is taken as a first target detection frame, and the second target detection frame corresponding to yb112 in the second predicted video frame y12 is determined to be yb121; if the detection probability of yb121 is less than the target probability, neither yb112 nor yb121 is a clean detection frame.
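A minimal sketch of this dual-network screening, assuming the simple box structure from the earlier sketches; the position-equality test shown here (identical center and size up to a tolerance) is a simplified stand-in for the criterion described above, not the required one.

```python
def same_position(box_a, box_b, tol=1e-6):
    """Simplified stand-in for the position-equality criterion described above."""
    return (abs(box_a.cx - box_b.cx) < tol and abs(box_a.cy - box_b.cy) < tol
            and abs(box_a.width - box_b.width) < tol
            and abs(box_a.height - box_b.height) < tol)

def clean_boxes_from_two_heads(frame_head1, frame_head2, target_probability):
    """Keep a box predicted by the first detection network only if the second
    detection network also predicts a corresponding box at the same position with
    a probability not less than the target probability (illustrative sketch)."""
    clean = []
    for box1 in frame_head1.boxes:
        if box1.probability < target_probability:
            continue
        for box2 in frame_head2.boxes:
            if same_position(box1, box2) and box2.probability >= target_probability:
                clean.append(box1)
                break
    return clean
```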
In the above embodiment, the detection model includes a first detection network and a second detection network that differ only in their initialization parameters. The two networks may respond differently to noise in a video frame, but detect the virtual objects in the video frame in the same way; therefore, screening clean detection frames by combining the predicted video frame of the first detection network with the predicted video frame of the second detection network reduces the chance that a detection frame corresponding to noise is treated as a clean detection frame, and improves the accuracy of determining clean detection frames.
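The screening logic above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: boxes are plain records carrying a detection probability, and the "corresponding" detection frame in the second network's prediction is assumed here to be the one with the highest IoU overlap, which is only one possible way of pairing the two predictions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    prob: float  # detection probability of this detection frame

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    return inter / (area_a + area_b - inter + 1e-9)

def clean_boxes(pred1: List[Box], pred2: List[Box], target_prob: float) -> List[Box]:
    """Keep a box as 'clean' only if its probability reaches target_prob in the
    first network's prediction AND the corresponding box in the second network's
    prediction also reaches target_prob."""
    clean = []
    for b1 in pred1:
        if b1.prob < target_prob:
            continue  # not a first target detection frame
        # corresponding box in the second prediction: here the highest-IoU box (assumption)
        b2 = max(pred2, key=lambda b: iou(b1, b), default=None)
        if b2 is not None and iou(b1, b2) > 0 and b2.prob >= target_prob:
            clean.append(b1)  # b1 (equivalently b2) is a clean detection frame
    return clean
```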
In some embodiments, determining path information for each virtual object based on the detection box in the candidate video frame includes: searching a matching detection frame matched with a clean detection frame in a first video frame in the candidate video frames in each adjacent video frame of the candidate video frames; and determining path information of each virtual object in the candidate video frames based on the clean detection frame and the matching detection frame in the first video frame.
The first video frame in the candidate video frames is the video frame that comes first in playing order among the candidate video frames; each pair of adjacent video frames consists of two candidate video frames that are adjacent in playing order. For example, the candidate video frames are ordered into a candidate video frame sequence according to playing order: h1, h2, h3, ……, hm-1, hm, where h1 and h2 are adjacent video frames, h2 and h3 are adjacent video frames, hm-1 and hm are adjacent video frames, and h1 is the first video frame.
Specifically, a clean detection frame is determined in the first video frame, and a matching detection frame that matches the clean detection frame is determined in each adjacent video frame of the candidate video frame sequence.
In one implementation, each pair of adjacent video frames includes a first video frame and a last video frame. For a pair of adjacent video frames whose first video frame is the first video frame of the candidate video frames, a matching detection frame that matches the clean detection frame is determined in the last video frame of the pair; for a pair of adjacent video frames whose first video frame is not the first video frame of the candidate video frames, the first video frame of the pair already contains a matching detection frame, and a detection frame that matches this matching detection frame is determined in the last video frame of the pair.
In another implementation, a match detection frame that matches the clean detection frame is determined separately in each adjacent video frame of the candidate video frames.
In some embodiments, in determining path information of a virtual object, for an end video frame in adjacent video frames, if a detection frame matching with a clean detection frame of the virtual object is not found in the end video frame, skipping the end video frame, and taking a next candidate video frame of the end video frame as an updated end video frame until a detection frame matching is found in the updated end video frame.
Illustratively, h1 and h2 are adjacent video frames; if no matching detection frame is found in h2, h2 is skipped and h3 is taken as the updated last video frame, and a matching detection frame is searched for in h3. If a matching detection frame is found in h3, then h1 and h3 are treated as adjacent video frames, and h3 and h4 are treated as adjacent video frames.
It should be noted that skipping the last video frame and taking the next candidate video frame as the updated last video frame happens while determining the path information of one particular virtual object. When determining the path information of another virtual object, the skipped last video frame may still contain a matching detection frame for the clean detection frame of that other virtual object, in which case the last video frame does not need to be skipped.
And carrying out feature extraction on the object subgraph corresponding to the clean detection frame in the first video frame and the object subgraph corresponding to each determined matching detection frame to obtain a target characterization vector, and determining the path information of the virtual object corresponding to the clean detection frame according to the target characterization vector of the clean detection frame in the first video frame and the target characterization vector of each matching detection frame.
Illustratively, the first video frame includes a first clean detection box bo1 and a second clean detection box bo2. The matching detection boxes that match the first clean detection box bo1, determined in each pair of adjacent video frames of the candidate video frames, include: bo1-1, bo1-2, ……, bo1-z; the path information of the first virtual object corresponding to the first clean detection box bo1 is determined according to the target characterization vectors of the object subgraphs corresponding to bo1, bo1-1, bo1-2, ……, bo1-z respectively.
The matching detection boxes that match the second clean detection box bo2, determined in each pair of adjacent video frames of the candidate video frames, include: bo2-1, bo2-2, ……, bo2-u; the path information of the second virtual object corresponding to the second clean detection box bo2 is determined according to the target characterization vectors of the object subgraphs corresponding to bo2, bo2-1, bo2-2, ……, bo2-u respectively.
In the above embodiment, for the clean detection frame in the first video frame in the candidate video frames, a matching detection frame matching with the clean detection frame is found in each adjacent video frame of the candidate video frames, and the path information of the virtual object is determined according to the clean detection frame and the matching detection frame, so that the detection frame to be marked is determined based on the path information of the virtual object.
In some embodiments, searching for a matching detection box in each neighboring video frame of the candidate video frames that matches a clean detection box within a first video frame of the candidate video frames comprises: sequentially determining a first distance value between detection frames in each adjacent video frame in the candidate video frames by taking the first video frame in the candidate video frames as a starting frame; and acquiring a matching detection frame corresponding to the first distance value between the candidate video frame and the clean detection frame in the first video frame meeting the matching condition.
The first distance value between the detection frames is Euclidean distance between target characterization vectors of object subgraphs corresponding to the detection frames, and can be used for measuring similarity between the object subgraphs corresponding to the detection frames; the smaller the first distance value between the detection frames, the larger the similarity between the detection frames corresponding to the object subgraphs, and the larger the first distance value between the detection frames, the smaller the similarity between the detection frames corresponding to the object subgraphs.
The first distance value satisfying the matching condition means that the first distance value is smaller than the distance threshold. The distance threshold may be set according to actual requirements, and may be, for example, 0.2.
Specifically, each pair of adjacent video frames includes a first video frame and a last video frame. For a pair of adjacent video frames whose first video frame is the first video frame of the candidate video frames, first distance values between each detection frame in the last video frame and the clean detection frame are determined; if the minimum of these first distance values satisfies the matching condition, the detection frame corresponding to the minimum first distance value is used as the matching detection frame of the clean detection frame. For a pair of adjacent video frames whose first video frame is not the first video frame of the candidate video frames, the first video frame of the pair contains a matching detection frame; first distance values between each detection frame in the last video frame and that matching detection frame are determined, and if the minimum of these first distance values satisfies the matching condition, the detection frame corresponding to the minimum first distance value is used as the matching detection frame in the last video frame.
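A minimal sketch of this frame-to-frame matching follows. It assumes the target characterization vectors of the detection frames have already been extracted, and uses the illustrative distance threshold of 0.2 mentioned above; the function names are hypothetical.

```python
import numpy as np
from typing import List, Optional

def match_next_frame(prev_vec: np.ndarray,
                     next_frame_vecs: List[np.ndarray],
                     dist_threshold: float = 0.2) -> Optional[int]:
    """Return the index of the detection frame in the last video frame whose target
    characterization vector has the smallest Euclidean distance to prev_vec,
    provided that minimum distance satisfies the matching condition."""
    if not next_frame_vecs:
        return None
    dists = [float(np.linalg.norm(prev_vec - v)) for v in next_frame_vecs]
    best = int(np.argmin(dists))
    return best if dists[best] < dist_threshold else None

def track_path(clean_vec: np.ndarray,
               frames_vecs: List[List[np.ndarray]],
               dist_threshold: float = 0.2) -> List[Optional[int]]:
    """Chain matches frame by frame starting from a clean detection frame in the
    first candidate video frame; frames without a match are skipped (None) and the
    previous reference vector is kept for the next adjacent pair."""
    path, cur = [], clean_vec
    for vecs in frames_vecs:
        idx = match_next_frame(cur, vecs, dist_threshold)
        path.append(idx)
        if idx is not None:
            cur = vecs[idx]  # the matched box becomes the reference for the next pair
    return path
```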
In the above embodiment, the matching detection frames meeting the matching condition are determined in the adjacent video frames through the first distance value between the detection frames in each adjacent video frame, so that the probability that the matching detection frames correspond to different virtual objects with the clean detection frames is reduced, and the quality of the path information of the virtual objects obtained by determination is improved.
In some embodiments, the object labeling method further comprises: when a target virtual object which is not matched with the virtual object in the first video frame and corresponds to the clean detection frame exists in the last video frame in the target adjacent video frames, sequentially determining a second distance value between the detection frames in other adjacent video frames by taking the last video frame in the target adjacent video frames as a starting frame; the other adjacent video frames are adjacent video frames following the target adjacent video frame; and when the second distance value meets the matching condition, determining the path information of the target virtual object based on the detection frames of the target virtual object in other adjacent video frames.
The second distance value between the detection frames is Euclidean distance between target characterization vectors of the object subgraphs corresponding to the detection frames, and can be used for measuring the similarity between the object subgraphs corresponding to the detection frames; the smaller the second distance value between the detection frames, the larger the similarity between the detection frames corresponding to the object subgraphs, and the larger the second distance value between the detection frames, the smaller the similarity between the detection frames corresponding to the object subgraphs.
Specifically, that the last video frame of the target adjacent video frames contains a target virtual object which does not match any virtual object corresponding to a clean detection frame in the first video frame means that the last video frame contains a clean detection frame that matches neither the clean detection frames in the first video frame nor the matching detection frames determined in the first video frames of the target adjacent video frames. Such a clean detection frame in the last video frame of the target adjacent video frames is taken as the detection frame of a newly added target virtual object.
The last video frame is then taken as the starting frame of the newly added target virtual object; matching detection frames of the clean detection frame corresponding to the target virtual object in the last video frame are determined in the other adjacent video frames following the target adjacent video frames, and the path information of the target virtual object is determined according to the clean detection frame and the matching detection frames of the target virtual object.
The other adjacent video frames start from this last video frame, and the last video frame is the first video frame of the first pair among the other adjacent video frames. Illustratively, h2 and h3 are adjacent video frames, h3 and h4 are adjacent video frames, and h4 and h5 are adjacent video frames. Assume that h2 and h3 are the target adjacent video frames, that is, h3 includes a clean detection frame of a newly added target virtual object; then the other adjacent video frames are the pair consisting of h3 and h4 and the pair consisting of h4 and h5, and h3 is the first video frame of the first pair (h3 and h4) among the other adjacent video frames.
For the first adjacent video frames in other adjacent video frames, determining a second distance value between each detection frame in the last video frame of the first adjacent video frames and a clean detection frame of the target virtual object, and taking the detection frame corresponding to the second distance value meeting the matching condition as the detection frame of the target virtual object in the last video frame; for the adjacent video frames which are not the first bit in other adjacent video frames, determining a second distance value between each detection frame in the last video frame and the detection frame of the target virtual object in the first video frame, and taking the detection frame corresponding to the second distance value meeting the matching condition as the detection frame of the target virtual object in the last video frame.
And determining the path information of the target virtual object according to the target characterization vector corresponding to the detection frame of the virtual object in other adjacent video frames.
In the above embodiment, the clean detection frame of the newly added target virtual object may be determined in the last video frame of the target adjacent video frames, and the path information of the target virtual object may be determined in other adjacent video frames, so as to avoid missing virtual objects of which the clean detection frame is not in the first video frame, and obtain the path information of all virtual objects in the candidate video frames, so that the subsequently obtained video frames to be marked include detection frames corresponding to all virtual objects, and further obtain richer object labels.
In some embodiments, feature extraction is performed on object subgraphs corresponding to each detection frame in the candidate video frames to obtain a target characterization vector; filtering the detection frame which does not meet the labeling condition on the path information from the candidate video frames, wherein the filtering comprises the following steps: selecting detection frames which do not meet the marking condition from all detection frames positioned on the same path information based on all target characterization vectors; and filtering the detection frames which do not meet the labeling condition from the candidate video frames.
The object subgraph corresponding to a detection frame is the image obtained by cropping the candidate video frame with that detection frame. For example, when the training video is a game video, the virtual objects are heroes in the game. The object subgraph corresponding to the detection frame of hero r2 in the predicted video frame y1 is shown in fig. 5, that in the predicted video frame y2 is shown in fig. 6, that in the predicted video frame y3 is shown in fig. 7, and that in the predicted video frame y4 is shown in fig. 8; figs. 5, 6, 7 and 8 are object subgraphs of hero r2 in different postures. In fig. 5, a51 is the rank of hero r2 and a52 is the blood bar of hero r2; in fig. 6, a61 is the rank and a62 is the blood bar; in fig. 7, a71 is the rank and a72 is the blood bar; in fig. 8, a81 is the rank and a82 is the blood bar of hero r2.
Similarly, the object subgraph corresponding to the detection frame of hero r3 in the predicted video frame y1 is shown in fig. 9, that in the predicted video frame y2 is shown in fig. 10, that in the predicted video frame y3 is shown in fig. 11, and that in the predicted video frame y4 is shown in fig. 12; figs. 9, 10, 11 and 12 are object subgraphs of hero r3 in different postures. In fig. 9, a91 is the rank of hero r3 and a92 is the blood bar of hero r3; in fig. 10, a101 is the rank and a102 is the blood bar; in fig. 11, a111 is the rank and a112 is the blood bar; in fig. 12, a121 is the rank and a122 is the blood bar of hero r3.
Specifically, the object subgraph corresponding to a detection frame is processed through a characterization extraction model to obtain the target characterization vector corresponding to the detection frame. In practical applications, the characterization extraction model may be a residual network model resnet101 trained on a public data set; the object subgraph corresponding to the detection frame is input to resnet101, and a 1×1024-dimensional target characterization vector is output through the pooling layer of resnet101.
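The following sketch illustrates this characterization-extraction step with torchvision's pretrained ResNet-101, taking the globally pooled feature as the target characterization vector. The crop-and-resize preprocessing and the ImageNet weights are assumptions; the pooled vector's dimensionality depends on the backbone configuration (torchvision's ResNet-101 yields a 2048-dimensional pooled feature, whereas the embodiment describes a 1×1024 vector).

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet101, ResNet101_Weights
from PIL import Image

# Backbone used as a characterization extractor: everything up to the global pooling layer.
backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # expose the pooled feature vector instead of class logits
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def target_vector(frame: Image.Image, box: tuple) -> torch.Tensor:
    """Crop the object subgraph given a detection frame (x1, y1, x2, y2) and
    return the pooled characterization vector."""
    subgraph = frame.crop(box)
    x = preprocess(subgraph).unsqueeze(0)   # (1, 3, 224, 224)
    return backbone(x).squeeze(0)           # pooled feature vector
```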
The object subgraph corresponding to the detection frame may include noise, which refers to image content other than the virtual object, such as the content shown as a83 in fig. 8 is a virtual soldier in the game video, such as the content shown as a103 in fig. 10 is a virtual landscape in the game video, so that the detection frame corresponding to the object subgraph including noise may not be a clean detection frame.
The path information includes a plurality of target characterization vectors corresponding to a plurality of detection frames. A mean value is determined from these target characterization vectors; the target characterization vectors of detection frames that satisfy the labeling condition have a higher similarity to this mean value, while the target characterization vectors of detection frames that do not satisfy the labeling condition have a lower similarity to it. The object subgraphs corresponding to detection frames that do not satisfy the labeling condition may therefore contain noise; if such noisy detection frames are kept, the model tends to over-fit during training, which is detrimental to model training.
And determining a detection frame which does not meet the marking condition in the path information of each virtual object, acquiring a candidate video frame to which the detection frame which does not meet the marking condition belongs, and filtering the detection frame which does not meet the marking condition in the candidate video frame.
Illustratively, the path information of the virtual object r1 includes a plurality of detection boxes respectively: the detection frames hb11, hb21, hb31, hb41 and ha91, and according to the target characterization vectors corresponding to the detection frames hb11, hb21, hb31, hb41 and ha91, determining the detection frames which do not meet the labeling condition, and if the detection frames which do not meet the labeling condition are the detection frames hb11 and hb21, the detection frames hb11 belong to the candidate video frame h1, the detection frames hb21 belong to the candidate video frame h2, filtering the detection frames hb11 in the candidate video frame h1, and filtering the detection frames hb21 in the candidate video frame h 2.
In the above embodiment, in each detection frame included in the path information, the detection frame that does not meet the labeling condition is determined according to the target characterization vector of each detection frame, and noise may exist in the object subgraph corresponding to the detection frame that does not meet the labeling condition, so that the detection frame that does not meet the labeling condition is filtered out of the candidate video frames, the quality of the detection frame in the filtered candidate video frames is improved, and the quality of the object label obtained subsequently is further improved.
In some embodiments, selecting a detection box that does not satisfy the labeling condition based on the target token vector of each object subgraph from detection boxes located on the same path information includes: determining a first center token vector for each virtual object based on each target token vector; determining a third distance value between a first center characterization vector of each virtual object and a target characterization vector of each object subgraph for each virtual object in the candidate video frame; selecting a detection frame corresponding to the virtual object with the third distance value exceeding the distance condition from the detection frames positioned on the same path information; and taking the selected detection frame as the detection frame which does not meet the marking condition.
The first center token vector of the virtual object may be a mean vector corresponding to each target token vector in path information of the virtual object.
Illustratively, the target token vectors included in the path information of the virtual object r1 are r1-embedding1, r1-embedding2 and r1-embedding3; the first center token vector of the virtual object r1 may be the mean vector calculated based on r1-embedding1, r1-embedding2 and r1-embedding3.
The third distance value corresponding to a target characterization vector is used to characterize the similarity between that target characterization vector and the first center characterization vector: the smaller the third distance value, the larger the similarity between the target characterization vector and the first center characterization vector; the larger the third distance value, the smaller the similarity.
The third distance value exceeds the distance condition and is used for representing that the similarity between the target representation vector corresponding to the third distance value and the first center representation vector is smaller.
Specifically, for the path information of each virtual object, determining the mean value vector of each target characterization vector in the path information to obtain a first center characterization vector, and respectively calculating the Euclidean distance between each target characterization vector and the first center characterization vector to obtain each third distance value between each target characterization vector and the first center characterization vector; and arranging the third distance values in order from small to large to obtain a third distance value sequence, taking the third distance values belonging to the smallest part and the largest part in the third distance value sequence as the third distance values exceeding the distance condition, and taking the detection frame corresponding to the third distance values exceeding the distance condition as the detection frame of the virtual object which does not meet the marking condition.
The third distance values belonging to the smallest part of the third distance value sequence are the preset percentage of third distance values that are smaller than all remaining third distance values; the third distance values belonging to the largest part are the preset percentage of third distance values that are larger than all remaining third distance values. The preset percentage is set according to actual requirements; illustratively, it may be 5% or 10%, and the embodiment of the application does not limit it.
Illustratively, the third distance values are arranged in ascending order to obtain a third distance value sequence: dis1, dis2, dis3, ……, disn-1, disn. Suppose dis1 and dis2 are the smallest 5% of the third distance values in the sequence and disn-1 and disn are the largest 5%; then the detection frames corresponding to dis1, dis2, disn-1 and disn respectively are taken as the detection frames that do not satisfy the labeling condition.
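A compact sketch of this filtering step, assuming the target characterization vectors of one path are available as NumPy arrays and using 5% as the preset percentage:

```python
import numpy as np
from typing import List

def boxes_failing_label_condition(track_vectors: List[np.ndarray],
                                  percent: float = 0.05) -> List[int]:
    """For one path (one virtual object): compute the first center characterization
    vector as the mean of the target characterization vectors, rank the boxes by
    their Euclidean distance to it (third distance values), and flag the smallest
    and largest `percent` of distances as not satisfying the labeling condition."""
    vecs = np.stack(track_vectors)               # (n, d)
    center = vecs.mean(axis=0)                   # first center characterization vector
    dists = np.linalg.norm(vecs - center, axis=1)
    order = np.argsort(dists)                    # ascending third distance values
    k = max(1, int(round(len(order) * percent)))
    flagged = list(order[:k]) + list(order[-k:]) # smallest part + largest part
    return sorted(set(int(i) for i in flagged))
```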
In the above embodiment, the first center token vector is determined from the target token vectors of the detection frames in the path information, and the third distance value between each detection frame's target token vector and the first center token vector is determined. A detection frame whose third distance value exceeds the distance condition indicates a low similarity between its object subgraph and the object subgraph represented by the first center token vector; such a detection frame is more likely to contain noise, for example virtual objects overlapping inside the detection frame, or content other than the virtual object being included in the detection frame. Taking the detection frames whose third distance values exceed the distance condition as the detection frames that do not satisfy the labeling condition therefore improves the accuracy of identifying noisy detection frames.
In some embodiments, labeling a detection frame whose detection probability in a video frame to be labeled belongs to a probability interval includes: determining the image entropy of the video frame to be marked based on the detection probability of each detection frame in the video frame to be marked; selecting a target video frame to be marked from the video frames to be marked according to the image entropy; and labeling the detection frames of which the detection probabilities in the target video frames belong to the probability interval.
The image entropy is used for representing the difference degree of the detection probabilities of the detection frames in the video frames to be marked, and the larger the difference degree of the detection probabilities of the detection frames in the video frames to be marked is, the larger the image entropy is, and the smaller the difference degree of the detection probabilities of the detection frames in the video frames to be marked is, the smaller the image entropy is. The greater the degree of difference in detection probability of the detection frame (the greater the image entropy), the lower the certainty of the detection result by the detection model. For example, as shown in fig. 13 and 14, the degree of difference of the detection probabilities of the detection frames in the video frame to be annotated shown in fig. 13 is greater than the degree of difference of the detection probabilities of the detection frames in the video frame to be annotated shown in fig. 14, and then the image entropy of the video frame to be annotated shown in fig. 13 is greater than the image entropy of the video frame to be annotated shown in fig. 14.
Labeling the video frames to be labeled with larger image entropy, and training a detection model by adopting object labels obtained by labeling and corresponding video frames, so that the detection model can continuously learn the characteristics in the corresponding video frames to improve the accuracy of the detection model; therefore, the larger the image entropy is, the higher the labeling value of the video frame to be labeled is.
Specifically, for each video frame to be marked, acquiring the detection probability of each detection frame in the video frame to be marked, and calculating the image entropy of the video frame to be marked according to the detection probability of each detection frame; as shown in formula (1).
$$H(X) = -\sum_{x \in X} P(x)\log P(x) \qquad (1)$$
where $H(X)$ is the image entropy, $x$ is one detection frame in the video frame to be labeled, $X$ is the set of all detection frames in the video frame to be labeled, and $P(x)$ is the detection probability of detection frame $x$.
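Formula (1) can be computed directly from the detection probabilities of the detection frames in a video frame to be labeled, for example:

```python
import math
from typing import List

def image_entropy(detection_probs: List[float]) -> float:
    """H(X) = -sum_x P(x) * log P(x), where x ranges over the detection frames in
    the video frame to be labeled and P(x) is the detection probability of frame x."""
    return -sum(p * math.log(p) for p in detection_probs if p > 0)
```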
A target video frame to be labeled is selected from the video frames to be labeled according to the image entropy. For example, a part of the video frames to be labeled whose image entropy is larger than that of the remaining, unselected video frames to be labeled may be selected, so that the target video frames to be labeled include the video frames to be labeled with larger image entropy.
And the terminal marks the detection frames of which the detection probabilities in the target video frames belong to the probability interval. The probability within the probability interval is less than the target probability and greater than a preset minimum probability value, which may be 0.1, assuming a target probability of 0.4, the probability interval is (0.1,0.4), assuming a target probability of 0.6, the probability interval is (0.1,0.6).
In some embodiments, the detection model includes a first detection network and a second detection network. In this case, a clean detection frame requires that its detection probability be not less than the target probability in both the predicted video frame of the first detection network and the predicted video frame of the second detection network; the target video frame to be labeled may therefore still contain detection frames that are not clean detection frames, and the terminal labels the detection frames, other than the clean detection frames, whose detection probabilities belong to the probability interval, so as to obtain the object label.
In the above embodiment, the terminal determines the image entropy of the video frame to be annotated, and obtains the target video frame to be annotated from the video frame to be annotated according to the image entropy, so that the target video frame to be annotated includes the video frame to be annotated with larger image entropy, that is, higher annotation value, and then trains the detection model according to the object label obtained by annotation and the corresponding video frame, thereby improving the accuracy of the detection model and further improving the annotation effect.
In some embodiments, selecting a target video frame to be annotated from the video frames to be annotated according to the image entropy includes: ordering the video frames to be annotated according to the image entropy to obtain a video frame sequence; dividing a video frame sequence into at least two groups to obtain at least two groups of video frames; and respectively selecting at least one video frame to be annotated from at least two groups of video frames to obtain a target video frame to be annotated.
Illustratively, the video frames to be annotated are ordered according to the order of the image entropy from small to large to obtain a video frame sequence, the video frame sequence is divided into a first group of video frames and a second group of video frames according to the image entropy, and the image entropy of the first group of video frames is smaller than that of the second group of video frames; and respectively selecting at least one video frame to be annotated from each group of video frames to obtain a target video frame to be annotated.
Specifically, dividing the video frame sequence into at least two groups to obtain at least two groups of video frames, including: determining maximum image entropy and minimum image entropy in image entropy of all video frames to be annotated of a video frame sequence, determining a plurality of image entropy levels according to the maximum image entropy and the minimum image entropy, dividing the video frame sequence into a plurality of groups of video frames according to the plurality of image entropy levels, selecting part of the video frames to be annotated from the video frames to be annotated corresponding to each image entropy level, and determining a target video frame to be annotated according to the part of the video frames to be annotated corresponding to each image entropy level.
The number of the image entropy levels is the same as the number of the video frames of the plurality of groups; the number of the image entropy levels can be set according to actual requirements, for example, the number of the image entropy levels can be 3, 4 or 6, and the embodiment of the application is not limited to the number; the part of the video frames to be marked corresponding to each image entropy level can be a first preset number of video frames to be marked corresponding to each image entropy level, or can be a first preset proportion of video frames to be marked corresponding to each image entropy level; the first preset number and the first preset proportion can be set according to actual requirements, for example, the first preset number can be 100 or 50; the first preset ratio may be 50%, which is not limited in the embodiment of the present application.
Illustratively, the maximum image entropy Hmax and the minimum image entropy Hmin are determined. Assuming the number of image entropy levels is 4, image entropies from Hmin to Hmin + (Hmax − Hmin) × 1/4 belong to the first image entropy level, image entropies from Hmin + (Hmax − Hmin) × 1/4 to Hmin + (Hmax − Hmin) × 1/2 belong to the second image entropy level, image entropies from Hmin + (Hmax − Hmin) × 1/2 to Hmin + (Hmax − Hmin) × 3/4 belong to the third image entropy level, and image entropies from Hmin + (Hmax − Hmin) × 3/4 to Hmax belong to the fourth image entropy level. 50% of the video frames to be labeled corresponding to each of the first, second, third and fourth image entropy levels are randomly selected to obtain the target video frames to be labeled.
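A sketch of the entropy-level grouping and sampling described above, assuming each video frame to be labeled is identified by a string id and using four equal-width levels and a 50% sampling ratio as in the example:

```python
import random
from typing import Dict, List

def sample_by_entropy_levels(entropies: Dict[str, float],
                             num_levels: int = 4,
                             ratio: float = 0.5,
                             seed: int = 0) -> List[str]:
    """Split video frames to be labeled into equal-width image-entropy levels
    between Hmin and Hmax, then randomly pick `ratio` of the frames in each level
    as target video frames to be labeled."""
    rng = random.Random(seed)
    h_min, h_max = min(entropies.values()), max(entropies.values())
    width = (h_max - h_min) / num_levels or 1.0
    levels: List[List[str]] = [[] for _ in range(num_levels)]
    for frame_id, h in entropies.items():
        idx = min(int((h - h_min) / width), num_levels - 1)
        levels[idx].append(frame_id)
    selected: List[str] = []
    for group in levels:
        k = max(1, int(round(len(group) * ratio))) if group else 0
        selected.extend(rng.sample(group, k))
    return selected
```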
In the above embodiment, the video frames to be marked are divided into at least two groups of video frames according to the image entropy, at least one video frame to be marked is selected from each group of video frames to obtain the target video frame to be marked, so that the target video frame to be marked comprises the video frame to be marked with larger image entropy, namely the video frame to be marked with higher marking value, and also comprises the video frame to be marked with smaller image entropy, thereby improving the diversity of the target video frame to be marked, effectively covering various states of the virtual object, and improving the marking quality.
In some embodiments, labeling a detection box in a target video frame whose detection probability belongs to a probability interval includes: displaying the target video frame; responding to the detection frame labeling operation, and determining a detection frame designated by the detection frame labeling operation in the target video frame; the detection probability of the detection frame designated by the detection frame labeling operation falls into a probability interval; and marking the detection frame designated by the detection frame marking operation.
Specifically, the clean detection frame in the target video frame to be marked does not need to be marked, the terminal adds marked identification for the clean detection frame in the target video frame to be marked, the target video frame with the added identification is obtained, the terminal displays the target video frame, and the clean detection frame and the detection frame to be marked can be determined through the target video frame. The marked mark is added for the clean detection frame, the color of the clean detection frame can be changed, for example, the detection frames in the target video frame to be marked are blue, the terminal adjusts the color of the clean detection frame to red, and the clean detection frame and the detection frame to be marked can be distinguished through the color of the detection frame; the marked mark is added for the clean detection frame, or marked icons can be added on the clean detection frame, for example, a preset marked icon or marked text is added on the frame line of the clean detection frame, and the clean detection frame and the detection frame to be marked can be distinguished by detecting whether the marked icon or the marked text exists on the frame line.
After the terminal displays the target video frame, detecting a detection frame labeling operation aiming at a detection frame to be labeled in the target video frame, wherein the detection frame labeling operation can be triggered by clicking the detection frame, for example, a labeling member clicks a detection frame displayed on the terminal, and the terminal can acquire the detection frame labeling operation aiming at the detection frame. The detection probability of the detection frame designated by the detection frame labeling operation falls into a probability interval, that is, the designated detection frame does not comprise a clean detection frame; the terminal determines a specified detection frame according to the detection frame marking operation, and adds marked marks for the specified detection frame to finish marking.
In the embodiment, the target video frame to be marked comprises the clean detection frame and the detection frame to be marked, and only the detection frame to be marked is required to be marked in the marking process, so that the marking quantity is greatly reduced, and the marking efficiency is improved.
In some embodiments, the video frames to be detected within the data set are a first partial video frame within the data set that has not been labeled. Before object detection is performed on the video frames to be detected in the data set through the detection model, the object labeling method further includes: determining a definition level of each video frame in the data set; selecting at least one video frame from the video frames of each definition level to obtain a second partial video frame; and training the detection model based on the second partial video frame and the object labels corresponding to the virtual objects in the second partial video frame. Performing object detection on the video frames to be detected within the data set includes: when the trained detection model does not reach the quasi-recall condition, performing object detection on the first partial video frame in the data set through the detection model that does not reach the quasi-recall condition; the first partial video frame is obtained by selecting at least one video frame from the video frames of each definition level.
Wherein each video frame in the dataset is a video frame belonging to the same training video, i.e. a second partial video frame is determined in each video frame of the training video in the dataset.
In one possible scenario, the second portion of video frames and the first portion of video frames belong to the same training video; for example, a second partial video frame is determined in the training video, the second partial video frame is used to train the detection model, and a first partial video frame is determined in a non-second partial video frame of the training video.
In another possible scenario, the second portion of video frames and the first portion of video frames belong to different training videos. For example, a second portion of the video frames are determined in the first training video and a first portion of the video frames are determined in the second training video.
Specifically, a definition calculation method of a laplace operator may be adopted to determine the definition of each video frame, the video frames are sequenced according to the sequence from small to large of the definition, an initial video frame sequence is obtained, the maximum definition and the minimum definition are determined in the definition of each video frame, a plurality of definition levels are determined according to the maximum definition and the minimum definition, the initial video frame sequence is divided into a plurality of groups of initial video frames according to the plurality of definition levels, at least one video frame is selected from each group of initial video frames, at least one video frame corresponding to each definition level is obtained, and a second part of video frame is obtained according to at least one video frame corresponding to each definition level.
The number of the definition levels is the same as the number of the plurality of groups of initial video frames; the number of the definition levels can be set according to actual requirements, for example, the number of the definition levels can be 4, 5 or 6, and the embodiment of the application is not limited to the number; the at least one video frame corresponding to each definition level can be a second preset number of video frames corresponding to each definition level, or can be video frames corresponding to each definition level in a second preset proportion; the second preset number and the second preset ratio may be set according to actual requirements, for example, the second preset number may be 100 or may be 50, for example, the second preset ratio may be 50%, which is not limited in the embodiment of the present application.
Illustratively, the maximum definition Dmax and the minimum definition Dmin are determined among the definitions of the video frames. Assuming the number of definition levels is 5, definitions from Dmin to Dmin + (Dmax − Dmin) × 1/5 belong to the first definition level, definitions from Dmin + (Dmax − Dmin) × 1/5 to Dmin + (Dmax − Dmin) × 2/5 belong to the second definition level, definitions from Dmin + (Dmax − Dmin) × 2/5 to Dmin + (Dmax − Dmin) × 3/5 belong to the third definition level, definitions from Dmin + (Dmax − Dmin) × 3/5 to Dmin + (Dmax − Dmin) × 4/5 belong to the fourth definition level, and definitions from Dmin + (Dmax − Dmin) × 4/5 to Dmax belong to the fifth definition level. K video frames are randomly selected from the video frames corresponding to each of the first, second, third, fourth and fifth definition levels, so as to obtain the second partial video frames.
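As an illustration of the Laplacian-based definition (sharpness) measure mentioned above, the score of a frame can be computed with OpenCV as below; the frames can then be grouped into equal-width definition levels between Dmin and Dmax and sampled per level in the same way as the image-entropy levels sketched earlier. The use of Laplacian variance as the concrete score is an assumption.

```python
import cv2

def definition_score(frame_path: str) -> float:
    """Definition (sharpness) of a video frame as the variance of its Laplacian
    response; higher values indicate sharper frames."""
    gray = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(frame_path)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())
```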
And labeling the virtual object in the second part of video frames to obtain object labels corresponding to the second part of video frames, training the detection model by adopting the second part of video frames and the corresponding object labels to obtain a trained detection model, and processing the video frames to be detected to obtain the predicted video frames by using the trained detection model.
In some embodiments, for ease of illustration, the object tag corresponding to the second portion of the video frame is denoted as the initial object tag; the detection model comprises a basic network, a first detection network and a second detection network; the training of the detection model using the second portion of the video frame and the initial object tag includes:
for each initial video frame in the second part of video frames, inputting the initial video frame into a detection model, outputting a first initial predicted video frame through a first detection network of the detection model, and outputting a second initial predicted video frame through a second detection network of the detection model; and calculating a loss value according to the object label corresponding to the initial video frame, the first initial predicted video frame and the second initial predicted video frame, and adjusting model parameters of the detection model according to the loss value.
A first loss value is calculated through a loss function according to the object label corresponding to the initial video frame and the first initial predicted video frame, and the parameters of the base network and the first detection network are adjusted through the first loss value; a second loss value is calculated through the loss function according to the object label corresponding to the initial video frame and the second initial predicted video frame, and the parameters of the base network and the second detection network are adjusted through the second loss value. The first loss value and the second loss value are calculated by the loss function in the same way. Next, taking the case where the detection model is implemented with YOLOv3, the determination of the first loss value is described as an example; the loss function is shown in formula (2).
$$L(O,o,C,c,t,g) = \lambda_1 L_{conf}(o,c) + \lambda_2 L_{cla}(O,C) + \lambda_3 L_{loc}(t,g) \qquad (2)$$
where $L(O,o,C,c,t,g)$ is the first loss value, $L_{conf}(o,c)$ is the target confidence loss, $L_{cla}(O,C)$ is the target class loss, $L_{loc}(t,g)$ is the target location loss, $\lambda_1$ is the weight of the target confidence loss, $\lambda_2$ is the weight of the target class loss, and $\lambda_3$ is the weight of the target location loss.
The target confidence loss can be determined by equation (3).
$$L_{conf}(o,c) = -\sum_{i}\left(o_i \ln \hat{c}_i + (1 - o_i)\ln(1 - \hat{c}_i)\right) \qquad (3)$$
where $\hat{c}_i$ is the value obtained by mapping the detection probability $c_i$ of the $i$-th detection frame to the interval $[0,1]$ through a sigmoid function, and $o_i$ is the ground-truth value corresponding to the $i$-th detection frame: if the $i$-th detection frame has a corresponding labeling frame in the initial object label (that is, a virtual object exists in the labeling frame), $o_i$ is 1; if the detection frame has no corresponding labeling frame in the initial object label, $o_i$ is 0.
The target class loss can be determined by equation (4).
$$L_{cla}(O,C) = -\sum_{i \in Pos}\sum_{j \in cla}\left(O_{ij} \ln \hat{C}_{ij} + (1 - O_{ij})\ln(1 - \hat{C}_{ij})\right) \qquad (4)$$
where $i \in Pos$ denotes the $i$-th detection frame that has a corresponding labeling frame in the initial object label; $j \in cla$ denotes the $j$-th category; $\hat{C}_{ij}$ is the value obtained by mapping the classification predicted value $C_{ij}$ to the interval $[0,1]$ through a sigmoid function; the classification predicted value $C_{ij}$ is the detection probability of the $j$-th category for the $i$-th detection frame that has a corresponding labeling frame in the initial object label; and $O_{ij}$ is the ground-truth value of the $j$-th category for that detection frame. The embodiment of the application has only one category, namely the virtual object category, and the detection model predicts whether the object in a detection frame is a virtual object, so the classification predicted value $C_{ij}$ equals $c_i$ and $O_{ij}$ equals $o_i$.
The target location loss can be determined by equation (5).
$$L_{loc}(t,g) = \sum_{i \in Pos}\sum_{m \in \{x,y,w,h\}}\left(\hat{t}_i^m - \hat{g}_i^m\right)^2 \qquad (5)$$
where $L_{loc}(t,g)$ is the mean square error loss between the detection frames and the corresponding labeling frames; $\hat{t}_i^x$ and $\hat{t}_i^y$ are the center-point abscissa and ordinate offsets between the detection frame and the labeling frame, and $\hat{t}_i^w$ and $\hat{t}_i^h$ are the width and height offsets between the detection frame and the labeling frame; $\hat{g}_i^x$ and $\hat{g}_i^y$ are the center-point abscissa and ordinate offsets between the labeling frame and the default rectangular frame, and $\hat{g}_i^w$ and $\hat{g}_i^h$ are the width and height offsets between the labeling frame and the default rectangular frame; these offsets are computed from the labeling frame parameters and the default rectangular frame parameters.
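A minimal PyTorch sketch of the three-term loss in formula (2): the confidence and class terms are binary cross-entropy on sigmoid-mapped scores, and the location term is the squared error between predicted and ground-truth offsets. The tensor shapes, the positive-box masking and the loss weights are illustrative assumptions, not the exact training code of the embodiment.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_conf: torch.Tensor,    # (N,) raw confidence scores c_i
                   gt_obj: torch.Tensor,       # (N,) o_i in {0, 1}
                   pred_cls: torch.Tensor,     # (N, num_cls) raw class scores C_ij
                   gt_cls: torch.Tensor,       # (N, num_cls) O_ij in {0, 1}
                   pred_off: torch.Tensor,     # (N, 4) predicted offsets t_hat
                   gt_off: torch.Tensor,       # (N, 4) ground-truth offsets g_hat
                   lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    l1, l2, l3 = lambdas
    pos = gt_obj > 0                            # boxes that have a corresponding labeling frame

    # Target confidence loss: binary cross-entropy on sigmoid-mapped scores, formula (3).
    l_conf = F.binary_cross_entropy_with_logits(pred_conf, gt_obj.float(), reduction="sum")

    # Target class loss over positive boxes only, formula (4).
    l_cla = F.binary_cross_entropy_with_logits(pred_cls[pos], gt_cls[pos].float(), reduction="sum")

    # Target location loss: squared error between predicted and ground-truth offsets, formula (5).
    l_loc = F.mse_loss(pred_off[pos], gt_off[pos], reduction="sum")

    return l1 * l_conf + l2 * l_cla + l3 * l_loc
```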
And performing iterative training on the detection model by using the second part of video frames and object labels corresponding to the virtual objects in the second part of video frames for preset rounds, so as to obtain a trained detection model, wherein the preset rounds can be set according to actual requirements, and the preset rounds can be 100 times in an exemplary way.
After the trained detection model is obtained, test video frames are acquired from the data set, and these test video frames are used to verify whether the trained detection model reaches the quasi-recall condition. In one possible scenario, the test video frames and the second partial video frames belong to the same training video, and the second partial video frames do not include the test video frames; the number of test video frames may be 1/10 of the number of the second partial video frames.
The detection model reaching the quasi-recall condition means that the accuracy of the detection model reaches an accuracy threshold and the recall rate reaches a recall threshold. The accuracy threshold and the recall threshold may be set according to actual needs; illustratively, the accuracy threshold may be 95% and the recall threshold may be 85%.
In the case that the detection model includes the first detection network and the second detection network, the accuracy and recall of the detection model may be the accuracy and recall of the detection network with higher accuracy and recall in the first detection network and the second detection network, for example, the accuracy corresponding to the first detection network is 70%, the recall is 60%, the accuracy corresponding to the second detection network is 80%, and the recall is 70%, and then the accuracy and recall corresponding to the second detection network are used as the accuracy and recall of the detection model.
If the trained detection model does not reach the quasi-recall condition, a first partial video frame is acquired from the data set, and object detection is performed on the first partial video frame through the trained detection model.
In one possible scenario, the second portion of video frames and the first portion of video frames belong to the same training video; for example, the second part of video frames are determined in the first training video, and the first part of video frames are obtained from the data set by selecting at least one video frame except for the second part of video frames from video frames of each definition level included in the first training video, so as to obtain the first part of video frames.
In another possible scenario, the second portion of video frames and the first portion of video frames belong to different training videos; for example, determining the second part of video frames in the first training video, and acquiring the first part of video frames from the data set may be acquiring the second training video in the data set, determining the definition of each video frame in the second training video, determining a plurality of definition levels according to the definition in each video frame, and selecting at least one video frame from the video frames of each definition level included in the second training video, so as to obtain the first part of video frames.
Next, by way of a specific example, the procedure of the above embodiment will be described, as shown in fig. 15, before object detection is performed on a video frame to be detected in a data set by a detection model, including:
S1501, acquiring a training video v1 from the data set, determining the definition of each video frame in the training video v1, and dividing the video frames of the training video v1 into 5 definition levels according to their definitions;
S1502, selecting K video frames from the video frames corresponding to each definition level respectively to obtain the second partial video frames;
S1503, training the detection model through the second partial video frames and the object labels corresponding to the virtual objects in the second partial video frames to obtain a trained detection model;
S1504, if the trained detection model does not reach the quasi-recall condition, selecting K video frames from the video frames corresponding to each definition level respectively to obtain the first partial video frames; the first partial video frames do not include the second partial video frames.
In the above embodiment, the definition level of each video frame is determined, and the second part of video frames includes at least one video frame corresponding to each definition level, because the influence of the definition of the video frame on the detection model is larger, the second part of video including the video frame corresponding to each definition level is adopted to train the detection model, so that the generalization performance of the detection model under different definition can be improved; the first part of video frames also comprises at least one video frame corresponding to each definition level, and the detection model is trained according to the first part of video frames and the corresponding object labels, so that the generalization performance of the detection model under different definition can be improved, and the robustness of the detection model is improved.
In some embodiments, the predicted video frames are obtained by performing object detection through the detection model. After the detection frames whose detection probabilities belong to the probability interval in the video frame to be labeled are labeled, the method further includes: training the detection model based on the video frames to be detected and the object labels; and, when the trained detection model does not reach the quasi-recall condition, performing the steps of the object labeling method on other video frames to be detected in the data set until the detection frames whose detection probabilities belong to the probability interval in the other predicted video frames are labeled, so as to obtain object labels.
Specifically, the process of training the detection model according to the video frame to be detected (i.e., the first partial video frame and the corresponding object tag in the above embodiment) is the same as the process of training the detection model through the second partial video frame and the corresponding object tag, so the process of training the detection model through the second partial video frame and the corresponding object tag in the above embodiment may be referred to.
Iterative training of preset rounds is performed on the detection model according to the video frames to be detected and the corresponding object labels to obtain a trained detection model; if the trained detection model does not reach the quasi-recall condition, other video frames to be detected are acquired from the data set, and the object labels corresponding to the other video frames to be detected are determined according to the object labeling method.
In some embodiments, after the object labels corresponding to the other video frames to be detected are determined, iterative training of the preset rounds is performed on the detection model using the other video frames to be detected and the corresponding object labels; the above steps are repeated so that model training and object labeling are performed alternately, until a detection model that meets the quasi-recall condition is obtained.
For example, as shown in fig. 16, a video frame F1 to be detected is obtained in a data set, object detection is performed on the frame F1 through a detection model A1 to obtain a predicted video frame Y1, path information of a virtual object is determined according to the predicted video frame Y1, a video frame D1 to be marked is determined according to the path information of the virtual object, an object tag Q1 corresponding to the video frame F1 to be detected is determined according to the video frame D1 to be marked, and a detection model A1 is trained through the video frame F1 to be detected and the object tag Q1 to obtain a trained detection model A2; if the detection model A2 does not meet the standard, acquiring other video frames F2 to be detected in the data set, performing object detection on the F2 through the detection model A2 to obtain other predicted video frames Y2, determining path information of a virtual object according to the other predicted video frames Y2, determining a video frame D2 to be marked according to the path information of the virtual object, determining an object tag Q2 corresponding to the other video frames F2 to be detected according to the video frame D2 to be marked, and training the detection model A2 through the other video frames F2 to be detected and the object tag Q2 to obtain a trained detection model A3; thus, model training and object labeling are alternately performed until training is performed to obtain a detection model meeting the quasi-recall condition.
In the above embodiment, model training and object labeling are performed alternately. Clean detection frames that need no labeling and valuable detection frames to be labeled can be mined from the predicted video frames output by the detection model, and, as the detection model is continuously optimized, new object tags can be obtained with the detection model after each round of training. The virtual objects in all video frames therefore do not need to be fully labeled, and the object tags are obtained quickly and effectively.
In some embodiments, the object labeling method further includes: after the trained detection model reaches the quasi-recall condition, receiving a video deduplication request; and, in response to the video deduplication request, deduplicating the video to be processed through the detection model that reaches the quasi-recall condition.
The video deduplication request is used to request determination of the videos that duplicate the video to be processed.
Specifically, after the trained detection model reaches the quasi-recall condition, the terminal configures the detection model that reaches the quasi-recall condition. In response to the video deduplication request, the terminal processes the video to be processed through the configured detection model to obtain the virtual objects in each video frame of the video to be processed, and determines a deduplication result according to these virtual objects.
When the detection model includes a first detection network and a second detection network, the detection model that reaches the quasi-recall condition is formed from the basic network and whichever detection network has the higher accuracy and recall. For example, if the accuracy and recall of the first detection network are greater than those of the second detection network, the detection model that reaches the quasi-recall condition includes the basic network with the first detection network connected to it.
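For illustration only, the following sketch shows one way to select the deployed detection network from the measured accuracy and recall of the two networks; the metric dictionaries and the tie-break rule are assumptions, not part of the disclosure.

```python
def pick_deployed_head(metrics_head1: dict, metrics_head2: dict) -> int:
    """Return 1 or 2, the detection network whose precision and recall are both higher.
    The metric dictionaries, e.g. {"precision": 0.93, "recall": 0.91}, and the tie-break
    rule are assumptions; the text only states that the better network is kept."""
    p1, r1 = metrics_head1["precision"], metrics_head1["recall"]
    p2, r2 = metrics_head2["precision"], metrics_head2["recall"]
    if p1 > p2 and r1 > r2:
        return 1
    if p2 > p1 and r2 > r1:
        return 2
    # neither network dominates: fall back to the larger precision + recall sum
    return 1 if (p1 + r1) >= (p2 + r2) else 2
```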
In the above embodiment, the detection model that reaches the quasi-recall condition can be applied in a video scene: object detection is performed on the video to be processed through the detection model to determine the virtual objects in each video frame of the video to be processed, and the deduplication result is determined according to those virtual objects. Performing object detection with the detection model improves the accuracy of the detected virtual objects and thus the accuracy of the subsequent video deduplication.
In some embodiments, deduplicating the video to be processed through the detection model that reaches the quasi-recall condition includes: performing object detection on each video frame in the video to be processed through the detection model that reaches the quasi-recall condition to obtain first object image blocks; performing feature extraction on each first object image block to obtain a first characterization vector of the virtual object in each first object image block; determining path information of each virtual object in the video to be processed based on the first characterization vectors; and determining a deduplication result of the video to be processed based on the path information of each virtual object in the video to be processed.
A first object image block is the image block corresponding to a virtual object in the video to be processed.
Specifically, for convenience of explanation, the video frames in the video to be processed are denoted as video frames to be processed. For each video frame to be processed, the terminal processes the video frame to be processed through the detection model that reaches the quasi-recall condition to obtain a first detection video frame including detection frames, thereby obtaining a first detection video frame corresponding to each video frame to be processed. For each first detection video frame, the terminal takes the image block corresponding to each detection frame in the first detection video frame as a first object image block, and performs feature extraction on the first object image block through the characterization extraction model to obtain the first characterization vector of the virtual object in the first object image block. In this way, the first characterization vector corresponding to each first object image block in each first detection video frame is obtained.
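As an illustrative sketch of this step, the following Python code crops the image block behind each detection frame and embeds it; `detect` and `embed` are hypothetical stand-ins for the detection model that reaches the quasi-recall condition and the characterization extraction model.

```python
import numpy as np
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2)

def extract_characterization_vectors(
    frames: Sequence[np.ndarray],        # video frames to be processed, H x W x 3 arrays
    detect: Callable[[np.ndarray], List[Box]],   # detection model: frame -> detection boxes
    embed: Callable[[np.ndarray], np.ndarray],   # characterization extraction model: image block -> vector
) -> List[List[Tuple[Box, np.ndarray]]]:
    """For each frame, return (detection box, first characterization vector) pairs."""
    results = []
    for frame in frames:
        per_frame = []
        for (x1, y1, x2, y2) in detect(frame):
            tile = frame[y1:y2, x1:x2]                       # first object image block
            vec = np.asarray(embed(tile), dtype=np.float32)  # first characterization vector
            per_frame.append(((x1, y1, x2, y2), vec))
        results.append(per_frame)
    return results
```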
The terminal determines the path information of each virtual object in the video to be processed based on the first characterization vectors of the first object image blocks in the first detection video frames as follows. The first detection video frames are sorted according to the playing order to obtain a first detection video frame sequence. For each first object image block in a first detection video frame of the sequence, the terminal determines, according to the first characterization vector of that block and the first characterization vectors of the first object image blocks in the adjacent detection video frame, the matching image block in the adjacent detection video frame that has the highest similarity to the first object image block, the similarity being greater than the similarity threshold. It then determines, according to the first characterization vector of the matching image block and the first characterization vectors of the first object image blocks in the next adjacent detection video frame, the next matching image block with the highest similarity greater than the similarity threshold, and so on, so as to determine the matching image blocks in the subsequent first detection video frames of the sequence in turn. The path information of the virtual object corresponding to the first object image block is obtained from the first characterization vectors of the first object image block and of its matching image blocks.
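The greedy frame-by-frame matching described above may be sketched as follows; the cosine similarity measure and the 0.8 threshold are assumptions used only for illustration (elsewhere the embodiment also works with Euclidean distance against a distance threshold).

```python
import numpy as np
from typing import List, Sequence, Tuple

def link_path(frame_vectors: Sequence[Sequence[np.ndarray]],
              start_block: int, sim_threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Starting from one object image block in the first detection video frame, walk through
    the following frames and, in each one, keep the block whose characterization vector is
    most similar, provided that best similarity exceeds the threshold; otherwise the path ends."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    path = [(0, start_block)]                    # (frame index, block index) pairs
    current = frame_vectors[0][start_block]
    for t in range(1, len(frame_vectors)):
        candidates = frame_vectors[t]
        if len(candidates) == 0:
            break
        sims = [cosine(current, v) for v in candidates]
        best = int(np.argmax(sims))
        if sims[best] <= sim_threshold:
            break                                # no sufficiently similar matching image block
        path.append((t, best))
        current = candidates[best]               # keep matching from the newly matched block
    return path
```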
The process of determining the path information of each virtual object in the video to be processed based on the first characterization vectors is the same as the process of determining the path information of each virtual object based on the detection frames in the candidate video frames in the above embodiment; therefore, for a specific description of the former, reference may be made to the description of the latter in the above embodiment.
The terminal compares the path information of each virtual object in the video to be processed with the path information of each reference video in the database to determine the deduplication result of the video to be processed. The deduplication result includes the reference video identifiers sorted according to their degree of repetition with the video to be processed.
In one implementation, if the database contains a plurality of reference videos for which the degree of repetition between their path information and the path information of the virtual objects in the video to be processed reaches the repetition threshold, the reference video identifiers of these reference videos are obtained, and the identifiers are sorted according to the degrees of repetition corresponding to the reference videos to obtain the deduplication result of the video to be processed. The repetition threshold may be set according to actual requirements and may be, for example, 0.2 or 0.1.
In the above embodiment, the virtual objects in the video to be processed are detected through the detection model to obtain the first object image blocks containing the virtual objects, the path information of each virtual object in the video to be processed is determined according to the first characterization vector corresponding to each first object image block, and the deduplication result of the video to be processed is then determined according to that path information. Because the detection model detects the virtual objects in the video to be processed effectively, the accuracy of the deduplication result can be improved.
In some embodiments, determining the deduplication result of the video to be processed based on the path information of each virtual object in the video to be processed includes: querying the database for similar path information based on the path information of each virtual object in the video to be processed; determining a path repeatability score based on the similar path information and the path information of each virtual object in the video to be processed; and determining the deduplication result of the video to be processed according to the path repeatability score.
Specifically, for the path information of each virtual object, the terminal queries the database for the similar path information corresponding to that path information, namely path information for which the number of characterization vector pairs, formed between the characterization vectors included in the path information and the characterization vectors included in the candidate similar path information, whose Euclidean distance is smaller than the repetition threshold satisfies the similarity condition.
The number of characterization vectors whose Euclidean distance is smaller than the repetition threshold satisfies the similarity condition when the ratio between this number and a target number reaches a preset similarity ratio, where the target number is the number of characterization vectors in the path information, or the number of characterization vectors in the similar path information, or the smaller of the two.
The repetition threshold and the preset similarity ratio may be set according to actual requirements; for example, the repetition threshold may be 0.2 and the preset similarity ratio may be 30%. Illustratively, suppose the path information L1 in the video to be processed includes 8 characterization vectors, and 5 of them are at a Euclidean distance of less than 0.2 from characterization vectors in the path information L2 of a reference video; the ratio between the number of characterization vectors whose Euclidean distance is smaller than the repetition threshold and the target number then reaches 30%, so the path information L2 is similar path information of the path information L1.
The terminal determines the reference video to which the similar path information belongs, obtains a first quantity, namely the number of pieces of similar path information included in that reference video, determines the ratio between the first quantity and a second quantity, namely the number of pieces of path information included in the video to be processed, and takes this ratio as the path repeatability score between the video to be processed and the reference video. The terminal then obtains the reference video identifiers of the reference videos to which the similar path information belongs and sorts them in descending order of path repeatability score to obtain the deduplication result of the video to be processed.
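A minimal sketch of the similar-path query and the path repeatability score, using the example values (repetition threshold 0.2, similarity ratio 30%), is given below; the way characterization vectors are paired (nearest vector) is an assumption.

```python
import numpy as np
from typing import Dict, List, Sequence, Tuple

def paths_similar(path_a: Sequence[np.ndarray], path_b: Sequence[np.ndarray],
                  repeat_threshold: float = 0.2, similarity_ratio: float = 0.30) -> bool:
    """Two pieces of path information are similar when the number of characterization vectors
    of one lying within `repeat_threshold` (Euclidean distance) of some vector of the other,
    divided by the smaller path length, reaches `similarity_ratio`."""
    if len(path_a) == 0 or len(path_b) == 0:
        return False
    close = sum(1 for va in path_a
                if min(float(np.linalg.norm(va - vb)) for vb in path_b) < repeat_threshold)
    target = min(len(path_a), len(path_b))
    return close / target >= similarity_ratio

def dedup_result(query_paths: List[Sequence[np.ndarray]],
                 reference_videos: Dict[str, List[Sequence[np.ndarray]]]) -> List[Tuple[str, float]]:
    """Path repeatability score of a reference video = (number of its pieces of path information
    similar to some path of the video to be processed) / (number of pieces of path information
    in the video to be processed); identifiers are returned sorted by descending score."""
    scores = []
    for video_id, ref_paths in reference_videos.items():
        first_quantity = sum(1 for rp in ref_paths
                             if any(paths_similar(qp, rp) for qp in query_paths))
        second_quantity = max(len(query_paths), 1)
        scores.append((video_id, first_quantity / second_quantity))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```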
In some embodiments, before the database is queried for similar path information based on the path information of each virtual object in the video to be processed, the method further includes: cleaning the path information of each virtual object in the video to be processed to obtain the cleaned path information of each virtual object.
Specifically, the path information of each virtual object in the video to be processed includes a plurality of first characterization vectors. For the path information of each virtual object, the terminal determines a first mean vector of the first characterization vectors included in the path information, determines the Euclidean distance between each first characterization vector and the first mean vector to obtain a fourth distance value corresponding to each first characterization vector, and filters out the first characterization vectors whose fourth distance values belong to the smallest portion and the largest portion of the fourth distance values, thereby obtaining the cleaned path information of each virtual object.
The fourth distance values belonging to the smallest portion account for a preset percentage of all the fourth distance values and are smaller than the remaining fourth distance values, and the fourth distance values belonging to the largest portion account for a preset percentage of all the fourth distance values and are larger than the remaining fourth distance values. The preset percentage may be, for example, 5% or 10%.
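The cleaning step can be sketched as follows; the 5% trimming percentage follows the example above, and the rounding rule is an assumption.

```python
import numpy as np
from typing import List

def clean_path(vectors: List[np.ndarray], trim_percent: float = 0.05) -> List[np.ndarray]:
    """Trim the characterization vectors whose distance to the mean vector falls in the
    smallest and largest `trim_percent` of the distance values (5% here; 10% is also
    mentioned in the text)."""
    if len(vectors) < 3:
        return list(vectors)                      # too few vectors to trim sensibly
    stacked = np.stack(vectors)
    mean_vec = stacked.mean(axis=0)               # first mean (center) characterization vector
    dists = np.linalg.norm(stacked - mean_vec, axis=1)   # fourth distance values
    k = max(1, int(round(trim_percent * len(vectors))))
    order = np.argsort(dists)
    dropped = set(order[:k].tolist()) | set(order[-k:].tolist())   # smallest and largest portions
    return [v for i, v in enumerate(vectors) if i not in dropped]
```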
In the above embodiment, similar path information is queried in the database through the path information of each virtual object in the video to be processed, and the deduplication result of the video to be processed is determined according to the path information of the virtual objects, which improves the accuracy of the deduplication result.
In some embodiments, the object labeling method further includes: performing object detection on each video frame in a reference video through the detection model that reaches the quasi-recall condition to obtain second object image blocks; performing feature extraction on each second object image block to obtain a second characterization vector of the virtual object in each second object image block; determining path information of each virtual object in the reference video based on the second characterization vectors; and storing the path information of each virtual object in the reference video in the database.
The reference video and the video to be processed are videos of the same type; for example, both are game videos. A second object image block is the image block corresponding to a virtual object in the reference video.
Specifically, for convenience of explanation, the video frames in the reference video are denoted as reference video frames. For each reference video frame, the terminal processes the reference video frame through the detection model that reaches the quasi-recall condition to obtain a second detection video frame including detection frames, thereby obtaining a second detection video frame corresponding to each reference video frame. For each second detection video frame, the terminal takes the image block corresponding to each detection frame in the second detection video frame as a second object image block, and performs feature extraction on the second object image block through the characterization extraction model to obtain the second characterization vector of the virtual object in the second object image block. In this way, the second characterization vector corresponding to each second object image block in each second detection video frame is obtained.
The process by which the terminal determines the path information of each virtual object in the reference video based on the second characterization vectors of the second object image blocks in the second detection video frames is the same as the process, in the above embodiment, by which the terminal determines the path information of each virtual object in the video to be processed based on the first characterization vectors of the first object image blocks in the first detection video frames; therefore, for a specific description of the former, reference may be made to the description of the latter in the above embodiment.
In the above embodiment, the virtual objects in the reference video are detected through the detection model to obtain the second object image blocks containing the virtual objects, and the path information of each virtual object in the reference video is determined according to the second characterization vector corresponding to each second object image block. Because the detection model detects the virtual objects in the reference video effectively, the quality of the path information determined for each virtual object in the reference video can be improved.
In some embodiments, storing the path information of each virtual object in the reference video in the database includes: determining, based on the second characterization vectors included in the path information of each virtual object in the reference video, a second center characterization vector corresponding to each piece of path information; cleaning the second characterization vectors in each piece of path information according to the distance values between the second center characterization vector corresponding to the path information and the second characterization vectors included in the path information to obtain the cleaned path information; and storing the cleaned path information of each virtual object in the reference video in the database.
Specifically, the path information of each virtual object in the reference video includes a plurality of second characterization vectors. For the path information of each virtual object, the terminal determines the second center characterization vector of the second characterization vectors included in the path information, where the second center characterization vector may be the mean vector of these second characterization vectors. The terminal then determines the Euclidean distance between each second characterization vector and the second center characterization vector to obtain a fifth distance value corresponding to each second characterization vector, and filters out the second characterization vectors whose fifth distance values belong to the smallest portion and the largest portion of the fifth distance values, thereby obtaining the cleaned path information.
The fifth distance values belonging to the smallest portion account for a preset percentage of all the fifth distance values and are smaller than the remaining fifth distance values, and the fifth distance values belonging to the largest portion account for a preset percentage of all the fifth distance values and are larger than the remaining fifth distance values. The preset percentage may be, for example, 5% or 10%.
In the above embodiment, the second characterization vectors in the path information are cleaned according to the second center characterization vector, so that poorly performing second characterization vectors can be filtered out, for example, a second characterization vector extracted from a second object image block that contains only a residual artifact of the virtual object, thereby improving the quality of the path information of each virtual object in the reference video.
In one scene embodiment, as shown in fig. 17, the data set includes a game training video, the video frames of which are as shown in fig. 18, and the virtual objects are hero characters in the game training video; the detection model includes a first detection network and a second detection network.
(1) Model training
1.1, acquiring a second partial video frame in the game training video V1 of the data set, wherein the second partial video frame includes g1, g2, …, gn; labeling g1, g2, …, gn to obtain an object tag DB1 corresponding to the second partial video frame;
1.2, training the detection model A1 with the second partial video frame and the object tag DB1 to obtain a detection model A2;
(2) Object annotation
2.1, performing object detection on a first partial video frame (the video frames to be detected) through the detection model to obtain predicted video frames. The first partial video frame in the game training video V1 of the data set is acquired, the first partial video frame including f1, f2, …, fn. Object detection is performed on the first partial video frame through the detection model A2 to obtain the predicted video frames corresponding to the first detection network, including y11, y12, …, y1n, and the predicted video frames corresponding to the second detection network, including y21, y22, …, y2n. Each predicted video frame includes detection frames and the detection probabilities of the detection frames. For example, the detection frames in y11 are the detection frames of the hero characters in f1 detected by the first detection network in the detection model A2, and the detection probability of a detection frame in y11 is the probability, as determined by the first detection network in the detection model A2, that the content in the detection frame is a hero character;
2.2, determining the clean detection frames in the predicted video frames according to the detection probabilities of the detection frames. The clean detection frames in the predicted video frames corresponding to the first detection network are determined according to the detection probabilities of the detection frames in the predicted video frames corresponding to the first detection network and to the second detection network. For example, if the detection probability of a detection frame in y11 is not less than the target probability 0.4 and the detection probability of the corresponding detection frame in y21 is also not less than 0.4, that detection frame is a clean detection frame (a minimal sketch of this check is given after this example);
2.3, determining the path information of the hero characters. Among the predicted video frames corresponding to the first detection network, the predicted video frames that include clean detection frames are taken as candidate video frames, including h1, h2, …, hm. The object subgraphs corresponding to the detection frames in the candidate video frames are obtained, and the target characterization vector of each object subgraph is extracted to obtain the target characterization vectors corresponding to the detection frames in the candidate video frames. Taking h1 as the starting frame of the hero character corresponding to the clean detection frame hb11, first distance values between the target characterization vector of hb11 and the target characterization vectors of the detection frames in h2 are determined; if the minimum of these first distance values is smaller than the distance threshold, the detection frame hb21 corresponding to the minimum distance value is taken as the matching detection frame of the hero character corresponding to hb11. First distance values between the target characterization vector of hb21 and the target characterization vectors of the detection frames in h3 are then determined; if the minimum of these first distance values is smaller than the distance threshold, the detection frame hb31 corresponding to the minimum distance value is taken as a matching detection frame of the hero character corresponding to hb11. The matching detection frames of the hero character corresponding to hb11 in the subsequent candidate video frames are determined in turn in the same way. The path information of each hero character is determined according to the target characterization vectors corresponding to the clean detection frame and the matching detection frames of that hero character;
2.4, cleaning the path information of the hero characters and determining the video frames to be labeled. The mean of all the target characterization vectors in the path information of a hero character is determined to obtain a first center characterization vector, the target characterization vectors that do not meet the distance condition are determined according to the first center characterization vector, and the detection frames corresponding to the target characterization vectors that do not meet the distance condition are filtered out of the candidate video frames to obtain the video frames to be labeled, including b1, b2, …, bm;
2.5, labeling the detection frames other than the clean detection frames in the video frames to be labeled. Each video frame to be labeled includes clean detection frames; the detection frames other than the clean detection frames in the video frames to be labeled are labeled to obtain an object tag DB2 corresponding to the first partial video frame;
(3) Model training
Training the detection model A2 through the first part of video frames and the object tag DB2 to obtain a detection model A3;
(4) Object annotation
The process of this round of object annotation is the same as the process of the object annotation in (2) above.
The above process realizes the alternate execution of model training and object labeling until the trained detection model reaches the quasi-recall condition. While model training and object labeling are executed alternately, clean detection frames that need no labeling and valuable detection frames to be labeled can be mined from the predicted video frames output by the detection model. As the detection model is continuously optimized, new object tags can be obtained with the detection model after each round of training, so that the virtual objects in all video frames do not need to be fully labeled and the labeled object tags are obtained quickly and effectively.
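The clean-detection-frame check of step 2.2 (referenced above) can be sketched as follows; representing predictions as dictionaries and matching corresponding detection frames between the two detection networks by an IoU test are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
Detection = Dict[str, object]            # {"box": Box, "prob": float}

def clean_boxes(pred_head1: List[Detection], pred_head2: List[Detection],
                target_prob: float = 0.4, iou_threshold: float = 0.5) -> List[Detection]:
    """A box from the first detection network is kept as a clean detection frame when its
    probability is at least the target probability (0.4 in the example) and the corresponding
    box from the second detection network also reaches it; 'corresponding' is decided here by
    an IoU test, which is an assumption."""
    def iou(a: Box, b: Box) -> float:
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    clean = []
    for d1 in pred_head1:
        if d1["prob"] < target_prob:
            continue
        for d2 in pred_head2:
            if d2["prob"] >= target_prob and iou(d1["box"], d2["box"]) >= iou_threshold:
                clean.append(d1)         # the first network's box is taken as the clean frame;
                break                    # the second network's corresponding box could equally be used
    return clean
```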
In a specific embodiment, as shown in fig. 19, the object labeling method includes:
S1901, the terminal determines the definition level of each video frame in the data set; selects at least one video frame from the video frames of each definition level to obtain a second partial video frame; and trains the detection model based on the second partial video frame and the object labels corresponding to the virtual objects in the second partial video frame;

S1902, when the trained detection model does not reach the quasi-recall condition, the terminal performs feature extraction on a first partial video frame in the data set through the basic network in the detection model that does not reach the quasi-recall condition; the detection networks in the detection model include a first detection network and a second detection network; the first partial video frame is obtained by selecting at least one video frame from the video frames of each definition level;

S1903, the terminal detects the feature map extracted by the basic network through at least one detection network in the detection model to obtain the predicted video frames corresponding to each detection network;

S1904, when a target detection frame whose detection probability is not smaller than the target probability exists in the predicted video frame corresponding to the first detection network and the detection probability of the corresponding target detection frame in the predicted video frame corresponding to the second detection network is not smaller than the target probability, the terminal takes the target detection frame in the predicted video frame corresponding to the first detection network as a clean detection frame; or takes the target detection frame in the predicted video frame corresponding to the second detection network as a clean detection frame;

S1905, the terminal sequentially determines the first distance values between the detection frames in each pair of adjacent video frames among the candidate video frames, taking the first video frame in the candidate video frames as the starting frame, and obtains, from the candidate video frames, the matching detection frame whose first distance value to the clean detection frame in the first video frame meets the distance condition;
S1906, the terminal determines the path information of each virtual object in the candidate video frames based on the clean detection frame and the matching detection frames in the first video frame;

S1907, when a target virtual object that is not matched with any virtual object in the first video frame and corresponds to a clean detection frame exists in the tail video frame of the target adjacent video frames, the terminal sequentially determines the second distance values between the detection frames in the other adjacent video frames, taking the tail video frame of the target adjacent video frames as the starting frame; the other adjacent video frames are the adjacent video frames following the target adjacent video frames; when a second distance value meets the distance condition, the path information of the target virtual object is determined based on the detection frames of the target virtual object in the other adjacent video frames;

S1908, the terminal determines the first center characterization vector of each virtual object based on the target characterization vectors and, for each virtual object in the candidate video frames, determines the third distance value between the first center characterization vector of the virtual object and the target characterization vector of each object subgraph; among the detection frames located on the same path information, the detection frames corresponding to the virtual object whose third distance value exceeds the distance condition are selected; the selected detection frames are taken as the detection frames that do not meet the labeling condition, and the detection frames that do not meet the labeling condition are filtered out of the candidate video frames to obtain the video frames to be labeled;

S1909, the terminal determines the image entropy of each video frame to be labeled based on the detection probabilities of the detection frames in the video frame to be labeled, sorts the video frames to be labeled according to the image entropy to obtain a video frame sequence, divides the video frame sequence into at least two groups to obtain at least two groups of video frames, and selects at least one video frame to be labeled from each of the at least two groups of video frames to obtain the target video frames to be labeled (a sketch of this selection step is given after this list of steps);
S1910, the terminal displays a target video frame, determines, in response to a detection frame labeling operation, the detection frame designated by the detection frame labeling operation in the target video frame, and labels the designated detection frame to obtain an object label;

S1911, the terminal trains the detection model based on the first partial video frame and the object labels; when the trained detection model does not reach the quasi-recall condition, the above steps are executed on the other partial video frames in the data set until the detection frames whose detection probabilities belong to the probability interval in the other predicted video frames are labeled to obtain object labels;

S1912, after the trained detection model reaches the quasi-recall condition, the terminal receives a video deduplication request and, in response to the video deduplication request, performs object detection on each video frame in the video to be processed through the detection model that reaches the quasi-recall condition to obtain first object image blocks, performs feature extraction on each first object image block to obtain the first characterization vector of the virtual object in each first object image block, and determines the path information of each virtual object in the video to be processed based on the first characterization vectors;
S1913, the terminal queries the database for similar path information based on the path information of each virtual object in the video to be processed, determines a path repeatability score based on the similar path information and the path information of each virtual object in the video to be processed, and determines the deduplication result of the video to be processed according to the path repeatability score.
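The image-entropy-based selection of step S1909 (referenced in the list above) can be sketched as follows; the entropy formula is an assumed stand-in, since the text only states that the image entropy is computed from the detection probabilities.

```python
import math
from typing import Dict, List, Sequence

def select_target_frames(frames_to_label: Sequence[Dict], num_groups: int = 2,
                         per_group: int = 1) -> List[Dict]:
    """Each frame is a dict with a "probs" list holding the detection probabilities of its
    detection frames. Frames are sorted by image entropy, split into groups, and at least one
    frame is taken from each group as a target video frame to be labeled."""
    def image_entropy(probs: Sequence[float]) -> float:
        eps = 1e-12
        # assumed binary-entropy style formula over the detection probabilities
        return sum(-p * math.log(p + eps) - (1.0 - p) * math.log(1.0 - p + eps) for p in probs)

    ordered = sorted(frames_to_label, key=lambda f: image_entropy(f["probs"]))  # video frame sequence
    group_size = max(1, math.ceil(len(ordered) / num_groups))
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    selected: List[Dict] = []
    for group in groups:
        selected.extend(group[:per_group])       # at least one frame to be labeled per group
    return selected
```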
It should be understood that, although the steps in the flowcharts involved in the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict limitation on the execution order of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts involved in the above embodiments may include a plurality of steps or stages, which are not necessarily executed at the same time but may be executed at different times; the execution order of these steps or stages is not necessarily sequential, and they may be executed in turn or alternately with at least part of the other steps or with steps or stages of the other steps.
In the object labeling method, object detection is performed on the video frames to be detected in the data set to obtain predicted video frames, the clean detection frames in the predicted video frames are determined according to the detection probabilities, and the predicted video frames that include clean detection frames are taken as candidate video frames. The path information of each virtual object is determined according to the detection frames in the candidate video frames, and the detection frames that do not meet the labeling condition are filtered out according to the path information of each virtual object, so that some difficult samples are filtered out and the video frames to be labeled are obtained. The video frames to be labeled thus contain clean samples, that is, simple samples, and a model trained with the object labels obtained by labeling the video frames to be labeled has better generalization ability, which further improves the labeling effect. In addition, during labeling, the clean detection frames do not need to be labeled and only the detection frames whose detection probabilities belong to the probability interval are labeled, so the object labels can be obtained quickly and the labeling efficiency is greatly improved.
Based on the same inventive concept, the embodiment of the application also provides an object labeling device for realizing the object labeling method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the object labeling device or devices provided below may refer to the limitation of the object labeling method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 20, there is provided an object labeling apparatus, including: an object detection module 2001, a clean detection frame determination module 2002, a path information determination module 2003, a video frame to be annotated determination module 2004, and an object annotation module 2005, wherein:
the object detection module 2001 is configured to perform object detection on a video frame to be detected in the data set, so as to obtain a predicted video frame; the predicted video frame comprises a detection frame of the virtual object and a corresponding detection probability;
the clean detection frame determining module 2002 is configured to take a detection frame with a detection probability not less than a target probability as a clean detection frame;
the path information determining module 2003 is configured to select each predicted video frame where the clean detection frame is located to obtain a candidate video frame, and determine path information of each virtual object based on the detection frame in the candidate video frame;
the to-be-annotated video frame determining module 2004 is configured to filter a detection frame that does not meet the annotation condition on the path information from the candidate video frames, so as to obtain to-be-annotated video frames;
the object labeling module 2005 is configured to label a detection frame in which the detection probability in the video frame to be labeled belongs to a probability interval, so as to obtain an object label; wherein the probability in the probability interval is less than the target probability.
In some embodiments, the predicted video frames are obtained by performing object detection with a detection model; the clean detection frame determination module 2002 includes:
a detection probability comparison unit, configured to, when a target detection frame whose detection probability is not less than the target probability exists in the predicted video frame corresponding to the first detection network and the detection probability of the corresponding target detection frame in the predicted video frame corresponding to the second detection network is not less than the target probability, take the target detection frame in the predicted video frame corresponding to the first detection network as a clean detection frame; or take the target detection frame in the predicted video frame corresponding to the second detection network as a clean detection frame.
In some embodiments, the path information determination module 2003 includes:
the matching unit is used for searching a matching detection frame matched with a clean detection frame in a first video frame in the candidate video frames in each adjacent video frame of the candidate video frames;
and the path information determining unit is used for determining the path information of each virtual object in the candidate video frame based on the clean detection frame and the matching detection frame in the first video frame.
In some embodiments, the matching unit comprises:
a first distance value determining subunit, configured to sequentially determine a first distance value between detection frames in each adjacent video frame in the candidate video frames by using a first video frame in the candidate video frames as a start frame;
and a matching subunit, configured to acquire, from the candidate video frames, the matching detection frame whose first distance value to the clean detection frame in the first video frame meets the distance condition.
In some embodiments, the path information determination module 2003 further includes:
a target virtual object path information determining unit, configured to, when a target virtual object that is not matched with any virtual object in the first video frame and corresponds to a clean detection frame exists in the tail video frame of the target adjacent video frames, sequentially determine the second distance values between the detection frames in the other adjacent video frames, taking the tail video frame of the target adjacent video frames as the starting frame, the other adjacent video frames being the adjacent video frames following the target adjacent video frames; and, when a second distance value meets the distance condition, determine the path information of the target virtual object based on the detection frames of the target virtual object in the other adjacent video frames.
In some embodiments, the object labeling apparatus further includes: a characterization extraction model;
the characterization extraction model is used for extracting the characteristics of the object subgraphs corresponding to each detection frame in the candidate video frames to obtain a target characterization vector;
in some embodiments, the video frame to be annotated determination module 2004 includes:
The detection frame filtering unit is used for selecting detection frames which do not meet the marking condition from all detection frames positioned on the same path information based on all target characterization vectors; and filtering the detection frames which do not meet the labeling condition from the candidate video frames.
In some embodiments, the detection frame filtering unit includes:
a third distance value determining subunit, configured to determine a first center token vector of each virtual object based on each target token vector; determining a third distance value between a first center characterization vector of each virtual object and a target characterization vector of each object subgraph for each virtual object in the candidate video frame;
the filtering subunit is used for selecting a detection frame corresponding to the virtual object with the third distance value exceeding the distance condition from the detection frames positioned on the same path information; and taking the selected detection frame as the detection frame which does not meet the marking condition.
In some embodiments, the object annotation module 2005 includes:
the image entropy determining unit is used for determining the image entropy of the video frame to be marked based on the detection probability of each detection frame in the video frame to be marked;
the labeling unit is used for selecting a target video frame to be labeled from the video frames to be labeled according to the image entropy; and labeling the detection frames of which the detection probabilities in the target video frames belong to the probability interval.
In some embodiments, the labeling unit comprises:
the annotation interaction subunit is used for displaying the target video frame; responding to the detection frame labeling operation, and determining a detection frame designated by the detection frame labeling operation in the target video frame; the detection probability of the detection frame designated by the detection frame labeling operation falls into a probability interval; and marking the detection frame designated by the detection frame marking operation.
In some embodiments, the labeling unit comprises:
a to-be-labeled target video frame determining subunit, configured to sort the video frames to be labeled according to the image entropy to obtain a video frame sequence; divide the video frame sequence into at least two groups to obtain at least two groups of video frames; and select at least one video frame to be labeled from each of the at least two groups of video frames to obtain the target video frames to be labeled.
In some embodiments, the video frames to be detected in the data set are a first partial video frame in the data set that has not been object-labeled, and the object labeling apparatus further includes:
the first training module is used for determining the definition level of each video frame in the data set; selecting at least one video frame from the video frames of each definition level to obtain a second partial video frame; training the detection model based on the second partial video frame and object labels corresponding to the virtual objects in the second partial video frame;
Correspondingly, the object detection module 2001 is configured to, when the trained detection model does not reach the quasi-recall condition, perform object detection on the first partial video frame in the data set through the detection model that does not reach the quasi-recall condition; the first partial video frame is obtained by selecting at least one video frame from the video frames of each definition level.
In some embodiments, the object labeling apparatus further comprises:
a secondary labeling module, configured to train the detection model based on the video frames to be detected and the object tags; and, when the trained detection model does not reach the quasi-recall condition, execute the steps of the object labeling method on other video frames to be detected in the data set until the detection frames whose detection probabilities belong to the probability interval in the other predicted video frames are labeled to obtain object labels.
In some embodiments, the object labeling apparatus further comprises:
a video deduplication module, configured to receive a video deduplication request after the trained detection model reaches the quasi-recall condition; and, in response to the video deduplication request, deduplicate the video to be processed through the detection model that reaches the quasi-recall condition.
In some embodiments, the video deduplication module includes:
a video deduplication unit, configured to perform object detection on each video frame in the video to be processed through the detection model that reaches the quasi-recall condition to obtain first object image blocks; perform feature extraction on each first object image block to obtain the first characterization vector of the virtual object in each first object image block; determine the path information of each virtual object in the video to be processed based on the first characterization vectors; and determine the deduplication result of the video to be processed based on the path information of each virtual object in the video to be processed.
In some embodiments, the video deduplication unit includes:
a deduplication result determining subunit, configured to query the database for similar path information based on the path information of each virtual object in the video to be processed; determine a path repeatability score based on the similar path information and the path information of each virtual object in the video to be processed; and determine the deduplication result of the video to be processed according to the path repeatability score.
In some embodiments, the object labeling apparatus further comprises:
a database storage module, configured to perform object detection on each video frame in the reference video through the detection model that reaches the quasi-recall condition to obtain second object image blocks; perform feature extraction on each second object image block to obtain the second characterization vector of the virtual object in each second object image block; determine the path information of each virtual object in the reference video based on the second characterization vectors; and store the path information of each virtual object in the reference video in the database.
In some embodiments, the database storage module comprises:
a path information cleaning unit, configured to determine, based on the second characterization vectors included in the path information of each virtual object in the reference video, the second center characterization vector corresponding to each piece of path information; clean the second characterization vectors in each piece of path information according to the distance values between the second center characterization vector corresponding to the path information and the second characterization vectors included in the path information to obtain the cleaned path information; and store the cleaned path information of each virtual object in the reference video in the database.
The modules in the object labeling apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device, which may be a terminal or a server, is provided, and in this embodiment, an example in which the computer device is a terminal is described, and an internal structure thereof may be as shown in fig. 21. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an object labeling method. The display unit of the computer equipment is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device, wherein the display screen can be a liquid crystal display screen or an electronic ink display screen, the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on a shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 21 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing an embodiment of the above-described object labeling method when executing the computer program.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements an embodiment of the above-described object labeling method.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements an embodiment of the above-described object labeling method.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (20)

1. An object labeling method, characterized in that the method comprises:
performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of a virtual object and a corresponding detection probability;
taking the detection frame with the detection probability not smaller than the target probability as a clean detection frame;
selecting each predicted video frame where the clean detection frame is located to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
filtering the detection frames which do not meet the labeling condition on the path information from the candidate video frames to obtain video frames to be labeled;
labeling a detection frame of which the detection probability in the video frame to be labeled belongs to a probability interval to obtain an object label; wherein the probability in the probability interval is less than the target probability.
2. The method of claim 1, wherein the predicted video frames are object detected by a detection model;
the object detection is carried out on the video frames to be detected in the data set to obtain predicted video frames, and the method comprises the following steps:
extracting features of the video frames to be detected in the data set through a basic network in the detection model;
and detecting the feature map extracted by the basic network through at least one detection network in the detection model to obtain a predicted video frame corresponding to each detection network.
3. The method of claim 2, wherein the detection networks in the detection model comprise a first detection network and a second detection network;
the step of taking the detection frame with the detection probability not smaller than the target probability as a clean detection frame comprises the following steps:
When a target detection frame with the detection probability not smaller than the target probability exists in the predicted video frame corresponding to the first detection network, and the detection probability of the corresponding target detection frame in the predicted video frame corresponding to the second detection network is not smaller than the target probability, the target detection frame in the predicted video frame corresponding to the first detection network is used as a clean detection frame; or,
and taking the target detection frame in the predicted video frame corresponding to the second detection network as a clean detection frame.
4. The method of claim 1, wherein the determining path information of each virtual object based on the detection frames in the candidate video frames comprises:
searching, in each adjacent video frame of the candidate video frames, for a matching detection frame that matches a clean detection frame in a first video frame of the candidate video frames;
and determining path information of each virtual object in the candidate video frames based on the clean detection frame in the first video frame and the matching detection frames.
5. The method of claim 4, wherein the searching, in each adjacent video frame of the candidate video frames, for a matching detection frame that matches a clean detection frame in a first video frame of the candidate video frames comprises:
sequentially determining a first distance value between detection frames in each adjacent video frame of the candidate video frames, taking the first video frame in the candidate video frames as a starting frame;
and taking, as the matching detection frame, the detection frame whose first distance value to the clean detection frame in the first video frame meets a matching condition.
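One way to realize the "first distance value" of claim 5 is the Euclidean distance between box centres, with an assumed pixel threshold as the matching condition; the claim itself does not fix either choice:

```python
import numpy as np

def box_center(box):
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def match_next_frame(current_dets, next_dets, max_dist=50.0):
    """Greedy nearest-centre matching from one frame to the adjacent frame.
    Returns {object_id of current detection: matched detection in the next frame}."""
    matches = {}
    for cur in current_dets:
        dists = [np.linalg.norm(box_center(cur.box) - box_center(nxt.box)) for nxt in next_dets]
        if dists and min(dists) <= max_dist:            # assumed matching condition
            matches[cur.object_id] = next_dets[int(np.argmin(dists))]
    return matches
```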
6. The method according to claim 4, wherein the method further comprises:
when a target virtual object which corresponds to a clean detection frame and is not matched with any virtual object in the first video frame exists in the last video frame of the target adjacent video frames, sequentially determining a second distance value between the detection frames in other adjacent video frames, taking the last video frame of the target adjacent video frames as a starting frame; the other adjacent video frames are adjacent video frames following the target adjacent video frames;
and when the second distance value meets a matching condition, determining path information of the target virtual object based on detection frames of the target virtual object in other adjacent video frames.
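Claim 6 handles an object that first appears partway through the candidate frames: its unmatched clean detection simply opens a new path, which is then matched forward through the remaining adjacent frames. A sketch under that reading, using a matcher such as the one sketched after claim 5; the id bookkeeping is an assumption:

```python
def build_paths(frames, matcher):
    """frames: list of lists of detections with an object_id attribute (-1 = unassigned).
    Any detection not reached from an earlier frame (e.g. a target virtual object
    appearing later) starts a new path at that frame."""
    paths, next_id = {}, 0
    for i, frame in enumerate(frames):
        for det in frame:
            if det.object_id < 0:                 # not matched from any earlier frame
                det.object_id = next_id           # open a new path starting at this frame
                paths[next_id] = [(i, det)]
                next_id += 1
        if i + 1 < len(frames):
            for obj_id, nxt in matcher(frame, frames[i + 1]).items():
                nxt.object_id = obj_id
                paths[obj_id].append((i + 1, nxt))
    return paths
```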
7. The method according to claim 1, wherein the method further comprises:
performing feature extraction on object subgraphs corresponding to all detection frames in the candidate video frames to obtain target characterization vectors;
the filtering, from the candidate video frames, the detection frames which do not meet the labeling condition on the path information comprises:
selecting, based on each target characterization vector, detection frames which do not meet the labeling condition from all detection frames positioned on the same path information;
and filtering the detection frames which do not meet the labeling condition from the candidate video frames.
8. The method according to claim 7, wherein the selecting, based on each target characterization vector, detection frames which do not meet the labeling condition from the detection frames positioned on the same path information comprises:
determining a first center characterization vector of each virtual object based on each target characterization vector;
determining, for each virtual object in the candidate video frames, a third distance value between the first center characterization vector of the virtual object and the target characterization vector of each object subgraph of the virtual object;
selecting, from the detection frames positioned on the same path information, the detection frame corresponding to the virtual object whose third distance value exceeds a distance condition;
and taking the selected detection frame as a detection frame which does not meet the labeling condition.
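Claims 7 and 8 filter a path by comparing each box's characterization vector with the object's center vector; boxes that drift too far are treated as not meeting the labeling condition. A sketch, where the embedding function and the distance threshold are assumptions:

```python
import numpy as np

def filter_path_outliers(path_dets, embed, dist_threshold=1.0):
    """path_dets: detections on one object's path; embed(det) -> characterization vector.
    Returns only the detections whose distance to the center vector stays within the threshold."""
    vectors = np.stack([embed(d) for d in path_dets])
    centre = vectors.mean(axis=0)                      # "first center characterization vector"
    dists = np.linalg.norm(vectors - centre, axis=1)   # "third distance values"
    return [d for d, dist in zip(path_dets, dists) if dist <= dist_threshold]
```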
9. The method according to claim 1, wherein the labeling a detection frame whose detection probability belongs to a probability interval in the video frame to be labeled comprises:
determining an image entropy of the video frame to be labeled based on the detection probability of each detection frame in the video frame to be labeled;
selecting a target video frame to be labeled from the video frames to be labeled according to the image entropy;
and labeling the detection frames whose detection probabilities belong to the probability interval in the target video frame.
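Claim 9 does not fix the image-entropy formula. One plausible reading, assumed here, is the sum of the binary entropies of the per-box detection probabilities, which is largest when many boxes sit near 0.5:

```python
import math

def image_entropy(det_probs, eps=1e-12):
    """Sum of binary entropies of the detection probabilities in one frame (assumed formula)."""
    total = 0.0
    for p in det_probs:
        p = min(max(p, eps), 1.0 - eps)       # keep log() finite
        total += -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
    return total
```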
10. The method of claim 9, wherein the labeling the detection frames whose detection probabilities belong to the probability interval in the target video frame comprises:
displaying the target video frame;
responding to a detection frame labeling operation, and determining, in the target video frame, the detection frame designated by the detection frame labeling operation; the detection probability of the detection frame designated by the detection frame labeling operation falls within the probability interval;
and labeling the detection frame designated by the detection frame labeling operation.
11. The method according to claim 9 or 10, wherein the selecting a target video frame to be labeled from the video frames to be labeled according to the image entropy comprises:
sorting the video frames to be labeled according to the image entropy to obtain a video frame sequence;
dividing the video frame sequence into at least two groups to obtain at least two groups of video frames;
and selecting at least one video frame to be labeled from each of the at least two groups of video frames to obtain the target video frame to be labeled.
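The selection in claim 11 can be sketched as: sort frames by entropy, cut the sequence into contiguous groups, and take at least one frame from each group so that both high- and low-entropy frames are represented. The group count and per-group quota below are assumptions:

```python
def select_target_frames(frames, entropies, n_groups=4, per_group=1):
    """frames[i] has image entropy entropies[i]; returns the target video frames to be labeled."""
    order = sorted(range(len(frames)), key=lambda i: entropies[i], reverse=True)
    chunk = max(1, (len(order) + n_groups - 1) // n_groups)   # ceil division into contiguous groups
    groups = [order[i:i + chunk] for i in range(0, len(order), chunk)]
    return [frames[idx] for group in groups for idx in group[:per_group]]
```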
12. The method of claim 2, wherein the video frame to be detected within the data set is a first partial video frame within the data set that has not been labeled; before the object detection is performed on the video frame to be detected in the data set through the detection model, the method further comprises:
determining a sharpness level of each video frame within the dataset;
selecting at least one video frame from the video frames of each sharpness level to obtain a second partial video frame;
training a detection model based on the second partial video frame and object labels corresponding to virtual objects in the second partial video frame;
the object detection for the video frame to be detected in the data set comprises the following steps:
when the trained detection model does not reach a quasi-recall condition, performing object detection on the first partial video frame in the data set through the detection model which does not reach the quasi-recall condition; the first partial video frame is obtained by selecting at least one video frame from the video frames of each sharpness level, respectively.
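For claim 12, sharpness can be scored, for example, by the variance of the Laplacian (an assumption; the claim does not name a measure), with frames bucketed into levels and at least one frame per level reserved for the seed (second partial) training set:

```python
import cv2
import numpy as np

def sharpness_level(frame_bgr, bin_edges=(50.0, 200.0)):
    """Variance of the Laplacian as a sharpness proxy, bucketed into levels (edges assumed)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    score = cv2.Laplacian(gray, cv2.CV_64F).var()
    return int(np.digitize(score, bin_edges))     # 0 = blurry ... 2 = sharp

def split_by_sharpness(frames):
    """Return (second partial frames for seed training, first partial frames for pre-labeling)."""
    by_level = {}
    for f in frames:
        by_level.setdefault(sharpness_level(f), []).append(f)
    seed = [level_frames[0] for level_frames in by_level.values()]
    rest = [f for level_frames in by_level.values() for f in level_frames[1:]]
    return seed, rest
```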
13. The method of claim 1, wherein the predicted video frames are obtained by performing object detection through a detection model; after the labeling the detection frames whose detection probabilities belong to the probability interval in the video frame to be labeled, the method further comprises:
training the detection model based on the video frames to be detected and the object labels;
and when the trained detection model does not reach a quasi-recall condition, performing the steps of the object labeling method on other video frames to be detected in the data set until the detection frames whose detection probabilities belong to the probability interval in the other predicted video frames are labeled to obtain object labels.
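Claims 12 and 13 together describe an iterative loop: train on the labeled frames and, while the model misses the quasi-recall condition, pre-label more frames, add the manually confirmed labels, and retrain. A sketch with assumed callables and an assumed batch size:

```python
def active_labeling_loop(seed_frames, unlabeled_frames, train, meets_quasi_recall, label_round,
                         batch_size=100):
    """train(labeled) -> model; meets_quasi_recall(model) -> bool;
    label_round(model, frames) -> newly labeled frames (the claim-1 labeling pass)."""
    labeled = list(seed_frames)
    remaining = list(unlabeled_frames)
    model = train(labeled)
    while not meets_quasi_recall(model) and remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]  # batch size assumed
        labeled += label_round(model, batch)
        model = train(labeled)
    return model
```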
14. The method of claim 13, wherein the method further comprises:
receiving a video duplication elimination request after the trained detection model reaches the quasi-recall condition;
and in response to the video duplication elimination request, performing duplication elimination on the video to be processed through the detection model that reaches the quasi-recall condition.
15. The method of claim 14, wherein the performing duplication elimination on the video to be processed through the detection model that reaches the quasi-recall condition comprises:
performing object detection on each video frame in the video to be processed through the detection model that reaches the quasi-recall condition to obtain first object image blocks;
extracting features of the first object image blocks to obtain a first characterization vector of the virtual object in each first object image block;
determining path information of each virtual object in the video to be processed based on the first characterization vector;
and determining a duplication eliminating result of the video to be processed based on the path information of each virtual object in the video to be processed.
16. The method of claim 15, wherein determining the de-duplication result of the video to be processed based on the path information of each virtual object in the video to be processed comprises:
inquiring similar path information in a database based on the path information of each virtual object in the video to be processed;
determining a path repeatability score based on the similar path information and the path information of each virtual object in the video to be processed;
and determining a duplication eliminating result of the video to be processed according to the path repeatability score.
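For the duplication-elimination decision of claims 15 and 16, the path repeatability score can be read as the fraction of a video's object paths that have a similar path already stored in the database; the similarity lookup and both thresholds below are assumptions, not terms of the claims:

```python
def path_repeat_score(query_paths, find_similar, sim_threshold=0.8):
    """find_similar(path) -> iterable of (stored_path, similarity) queried from the database."""
    if not query_paths:
        return 0.0
    hits = sum(1 for path in query_paths
               if any(sim >= sim_threshold for _, sim in find_similar(path)))
    return hits / len(query_paths)

def is_duplicate(query_paths, find_similar, repeat_threshold=0.5):
    """The video is treated as a duplicate when enough of its paths repeat stored paths."""
    return path_repeat_score(query_paths, find_similar) >= repeat_threshold
```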
17. An object labeling apparatus, the apparatus comprising:
the object detection module is used for performing object detection on the video frames to be detected in the data set to obtain predicted video frames; the predicted video frame comprises a detection frame of a virtual object and a corresponding detection probability;
the clean detection frame determining module is used for taking the detection frame with the detection probability not smaller than the target probability as a clean detection frame;
the path information determining module is used for selecting each predicted video frame where the clean detection frame is located to obtain a candidate video frame, and determining path information of each virtual object based on the detection frame in the candidate video frame;
the to-be-labeled video frame determining module is used for filtering, from the candidate video frames, the detection frames which do not meet the labeling condition on the path information to obtain video frames to be labeled;
the object labeling module is used for labeling the detection frames of which the detection probabilities in the video frames to be labeled belong to probability intervals to obtain object labels; wherein the probability in the probability interval is less than the target probability.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.