CN115082830A - Training of target detection model, target detection method, device and medium - Google Patents

Training of target detection model, target detection method, device and medium

Info

Publication number
CN115082830A
Authority
CN
China
Prior art keywords
sample
video
loss function
training
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210788464.8A
Other languages
Chinese (zh)
Inventor
曹琼
石鼎丰
陶大程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210788464.8A priority Critical patent/CN115082830A/en
Publication of CN115082830A publication Critical patent/CN115082830A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a training method of a target detection model, a target detection method, an apparatus and a storage medium, wherein the training method comprises: generating a training sample based on a video sample and an enhancement sample; generating query feature information and constructing a first loss function corresponding to the enhancement sample using an encoder module and based on video feature information corresponding to the training sample; generating first classification confidence information corresponding to the video sample, regression information used for characterizing the target position, and second classification confidence information corresponding to the enhancement sample using a decoder module and based on the query feature information, and constructing a second loss function corresponding to the training sample; and performing the adjustment process using the first loss function and the second loss function. The method and the device can enhance the distinguishability of the input video features by increasing the feature differences between actions, and make the classification training of the model sufficient, so that the prediction results are more accurate.

Description

Training of target detection model, target detection method, device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a training method for a target detection model, a target detection method, an apparatus, and a storage medium.
Background
With the increasing amount of video data, the demand for analyzing and processing video data keeps growing. For example, in scenes such as live-content security detection and short-video dangerous-action detection, dangerous actions in video data need to be identified using a video action detection method. Conventionally, such detection is often performed with a DETR (DEtection TRansformer) model. The DETR model uses the Transformer structure to realize query-based two-dimensional image target detection. The Transformer structure is a network structure based on the Attention mechanism, and building the model with a Transformer can effectively improve the performance of the video action detection method. In the process of implementing the present invention, the inventors found that, during training of the DETR model, each input video segment obtains, through matching, positive samples whose number equals the number of labels, and the other extra predictions are set as negative samples; the number of positive samples participating in training is therefore insufficient, which causes insufficient classification training of the DETR model and low classification accuracy.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a method for training an object detection model, an object detection method, an object detection device, and a storage medium.
According to a first aspect of the present disclosure, there is provided a training method of an object detection model, wherein the object detection model includes: an encoder module and a decoder module; the training method comprises the following steps: acquiring an enhancement sample corresponding to a video sample, and generating a training sample based on the video sample and the enhancement sample; generating query feature information based on video feature information corresponding to the training samples and constructing a first loss function corresponding to the enhancement samples using an encoder module; generating, using a decoder module and based on the query feature information, first classification confidence information corresponding to the video samples and regression information for characterizing a target location, second classification confidence information corresponding to the enhancement samples, and constructing a second loss function corresponding to the training samples; and adjusting the target detection model by using the first loss function and the second loss function.
Optionally, the enhancement sample comprises: positive and negative samples corresponding to the video sample; said constructing a first loss function corresponding to said enhancement samples comprises: acquiring segment feature information corresponding to the positive sample and the negative sample based on the video feature information; and generating a first loss function according to the segment characteristic information.
Optionally, the encoder module comprises: a fully connected layer and a region-of-interest Pooling (RoI Pooling) layer; the obtaining of segment feature information corresponding to the positive samples and the negative samples based on the video feature information includes: processing the video feature information through the fully connected layer to obtain fully connected features corresponding to the video feature information; and performing feature extraction processing on the fully connected features through the RoI Pooling layer to obtain the segment feature information.
Optionally, the generating a first loss function according to the segment feature information includes: acquiring a first segment feature corresponding to the video sample and a second segment feature corresponding to the positive sample from the segment information; determining a first sample feature based on the first segment feature and the second segment feature; acquiring a third segment feature corresponding to the negative sample in the segment information; determining a second sample feature based on the first, second, and third segment features; generating the first loss function from the first and second sample characteristics.
Optionally, the constructing a second loss function corresponding to the training samples comprises: determining a first classification loss function corresponding to the first classification confidence information; determining a second classification loss function corresponding to the second classification confidence information; generating the second loss function based on the first classification loss function and the second classification loss function.
Optionally, the first classification loss function comprises: a first cross entropy loss function; the second classification loss function includes: a second cross entropy loss function; the generating the second loss function based on the first classification loss function and the second classification loss function comprises: taking the sum of the first cross-entropy loss function and the second cross-entropy loss function as the second loss function.
Optionally, determining a first action type label for the video sample; obtaining the positive sample corresponding to the video sample from other videos based on the first action type label; wherein the second action type label of the positive sample is the same as the first action type label.
Optionally, based on the first action type label, obtaining the negative sample corresponding to the video sample from other videos; wherein the second action type label of the negative sample is different from the first action type label.
Optionally, determining a first video segment corresponding to the first action type label; extracting a video segment from the video sample as the negative sample based on the first video segment, the extracted segment serving as a second video segment; and the ratio of the overlap length of the first video segment and the second video segment to the sum of their lengths is smaller than a preset intersection ratio threshold.
Optionally, processing the training sample by using a preset backbone network to generate the video feature information; wherein the backbone network comprises: a neural network model.
Optionally, the encoder module comprises: an encoder based on a Transformer structure; the decoder module comprises: a decoder based on the Transformer structure.
According to a second aspect of the present disclosure, there is provided an object detection method, comprising: acquiring a trained target detection model; the target detection model is obtained by training through the training method; and generating classification confidence information corresponding to the video to be detected and regression information for representing the position of the target by using the target detection model and based on the video to be detected.
According to a third aspect of the present disclosure, there is provided a training apparatus for an object detection model, wherein the object detection model includes: an encoder module and a decoder module; the training apparatus includes: the system comprises a sample generation module, a training module and a data processing module, wherein the sample generation module is used for acquiring an enhancement sample corresponding to a video sample and generating a training sample based on the video sample and the enhancement sample; the coding processing module is used for generating inquiry characteristic information based on the video characteristic information corresponding to the training sample by using the coder module and constructing a first loss function corresponding to the enhancement sample; a decoding processing module, configured to generate, using a decoder module and based on the query feature information, first classification confidence information corresponding to the video sample and regression information used to characterize a target location, and second classification confidence information corresponding to the enhancement sample, and construct a second loss function corresponding to the training sample; and the model adjusting module is used for adjusting the target detection model by using the first loss function and the second loss function.
Optionally, the enhancement sample comprises: positive and negative samples corresponding to the video samples; the encoding processing module comprises: a segment feature acquisition unit configured to acquire segment feature information corresponding to the positive sample and the negative sample based on the video feature information; and the first loss determining unit is used for generating a first loss function according to the segment characteristic information.
Optionally, the encoder module comprises: a fully connected layer and a region-of-interest Pooling (RoI Pooling) layer; the segment feature acquisition unit is used for processing the video feature information through the fully connected layer to acquire fully connected features corresponding to the video feature information, and performing feature extraction processing on the fully connected features through the RoI Pooling layer to obtain the segment feature information.
Optionally, the first loss determining unit is configured to obtain, in the segment information, a first segment feature corresponding to the video sample and a second segment feature corresponding to the positive sample; determining a first sample feature based on the first segment feature and the second segment feature; acquiring a third segment feature corresponding to the negative sample in the segment information; determining a second sample feature based on the first, second, and third segment features; generating the first loss function from the first and second sample characteristics.
Optionally, the decoding processing module includes: a classification loss determination unit for determining a first classification loss function corresponding to the first classification confidence information and determining a second classification loss function corresponding to the second classification confidence information; a second loss determination unit configured to generate the second loss function based on the first classification loss function and the second classification loss function.
Optionally, the first classification loss function comprises: a first cross entropy loss function; the second classification loss function includes: a second cross entropy loss function; the second loss determining unit is specifically configured to use a sum of the first cross-entropy loss function and the second cross-entropy loss function as the second loss function.
Optionally, the sample generation module includes: a positive sample acquiring unit, configured to determine a first action type label of the video sample; obtaining the positive sample corresponding to the video sample from other videos based on the first action type label; wherein the second action type label of the positive sample is the same as the first action type label.
Optionally, the sample generation module includes: a negative sample acquiring unit, configured to acquire the negative sample corresponding to the video sample from other videos based on the first action type label; wherein the second action type label of the negative sample is different from the first action type label.
Optionally, the negative example obtaining unit is configured to determine a first video segment corresponding to the first action type tag; extracting a video segment from the video sample based on the first video segment as the negative sample; and the ratio of the superposition length and the length sum of the first video segment and the second video segment is smaller than a preset intersection ratio threshold.
Optionally, the feature information obtaining module is configured to process the training sample by using a preset backbone network to generate the video feature information; wherein the backbone network comprises: a neural network model.
Optionally, the encoder module comprises: an encoder based on a Transformer structure; the decoder module comprises: a decoder based on the Transformer structure.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for an object detection model, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the training method as described above based on instructions stored in the memory.
According to a fifth aspect of the present disclosure, there is provided an object detection apparatus comprising: the model acquisition module is used for acquiring a trained target detection model; wherein, the target detection model is obtained by training through the training method; and the detection processing module is used for generating classification confidence information corresponding to the video to be detected and regression information used for representing the target position by using the target detection model and based on the video to be detected.
According to a sixth aspect of the present disclosure, there is provided an object detection apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the object detection method as described above based on instructions stored in the memory.
According to a seventh aspect of the present disclosure, there is provided a computer readable storage medium storing computer instructions for execution by a processor to perform the method as above.
According to the training method of the target detection model, the target detection method and device, and the storage medium of the present disclosure, a training sample is generated based on the video sample and the enhancement sample, and a loss function corresponding to the enhancement sample is constructed, so that the distinguishability of the input video features can be enhanced, the feature similarity of actions of the same type can be improved, the feature differences of actions of different types can be increased, and the distinguishability of the features can be improved; classification confidence information corresponding to the video sample and the enhancement sample is generated based on the query feature information and a corresponding loss function is constructed, and the classification of the model is fully trained by increasing the number of training samples such as positive samples, so that prediction results are more accurate and the use experience of the user is improved.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive exercise.
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a method for training a target detection model according to the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating the construction of a first loss function in one embodiment of a training method for an object detection model according to the present disclosure;
FIG. 3 is a schematic flow chart illustrating the generation of a first loss function according to segment feature information in an embodiment of a training method for an object detection model according to the present disclosure;
FIG. 4 is a network framework diagram of an object detection model of the present disclosure;
FIG. 5 is a schematic diagram of a decoder in the object detection model of the present disclosure;
FIG. 6 is a schematic flow chart diagram illustrating the construction of a second loss function in one embodiment of a training method for an object detection model according to the present disclosure;
FIG. 7 is a schematic diagram of a decoder in the object detection model of the present disclosure;
FIG. 8 is a schematic flow chart diagram illustrating one embodiment of a target detection method according to the present disclosure;
FIG. 9 is a block diagram of one embodiment of a training apparatus for an object detection model according to the present disclosure;
FIG. 10 is a block diagram of another embodiment of a training apparatus for an object detection model according to the present disclosure;
FIG. 11 is a block diagram of an encoding processing module in an embodiment of a training apparatus for object detection models according to the present disclosure;
FIG. 12 is a block diagram of a decode processing module in one embodiment of a training apparatus for object detection models according to the present disclosure;
FIG. 13 is a block diagram of a sample generation module in an embodiment of a training apparatus for a target detection model according to the present disclosure;
FIG. 14 is a block diagram of yet another embodiment of a training apparatus for an object detection model according to the present disclosure;
FIG. 15 is a block schematic diagram of one embodiment of an object detection device according to the present disclosure;
FIG. 16 is a block schematic diagram of another embodiment of an object detection device according to the present disclosure.
Detailed Description
The present disclosure now will be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments of the disclosure are shown. The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure. The technical solution of the present disclosure is described in various aspects below with reference to various figures and embodiments.
The terms "first", "second", and the like are used hereinafter only for descriptive distinction and not for other specific meanings.
In the related art known to the inventors, in the field of temporal action detection, the DETR model is used to detect action content in a video. The DETR model comprises a backbone network and a Transformer-structure-based encoder and decoder, namely a Transformer encoder and a Transformer decoder. An original video sequence passes through the backbone network (such as a convolutional neural network) to extract a temporal and spatial feature map, position coding information is added, an embedded vector is synthesized, and the embedded vector is input into the Transformer encoder. The Transformer encoder extracts image coding features through a self-attention mechanism and inputs the image coding features together with query features into the Transformer decoder. The Transformer decoder outputs target query vectors; the target query vectors pass through a classification head and a regression head constructed from a fully connected layer and a multi-layer perceptron layer, and the position and category of the detection target are output. The detection target can be an action such as walking or running.
The Transformer structure has better performance in the aspect of characteristic representation, and the performance of the video motion detection method can be effectively improved by constructing a model through the Transformer. The Transformer encoder comprises a plurality of encoder layers, wherein each encoder layer consists of a multi-head self-attention layer, two layer normalization layers and a feedforward neural network layer. The Transformer decoder comprises a plurality of decoder layers, wherein each decoder layer consists of two multi-head self-attention layers, three normalization layers and a feedforward neural network layer.
The DETR method takes a fixed number N of learnable query features as input. Each query feature adaptively samples pixel points from the two-dimensional image through the network, information interaction among the query features is carried out through self-attention, and finally each query feature independently predicts the position and category of a detection box. In the field of temporal action detection, a fixed number of detection targets are predicted by means of the encoder-decoder, and time-segment features are extracted using a Transformer structure based on sparse sampling when detecting the targets.
For the Transformer decoder, K trainable query features are taken as input. A query feature is a learnable vector that can extract temporal features from specific moments according to the learned statistical information. Self-attention is used to realize information interaction among the query features; each query feature predicts, through a fully connected layer, the normalized coordinates of k sampling points along the N time steps, and features are extracted from the video features at the sampling points to update the query feature. For example, through another fully connected layer, the input query feature predicts k weights, and the k sampled features are summed with these weights. The updated query features then predict the position and type of the action through a regression head and a classification head, respectively. The regression head consists of three fully connected layers and the classification head of one fully connected layer; the regression head predicts the normalized coordinates of the start and end of the action, and the classification head predicts the action category and its confidence score.
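As an illustration only, the following minimal sketch (assuming PyTorch; the module name `QueryUpdate` and all tensor shapes are assumptions, not the patent's implementation) shows the query-update step described above: each query predicts k normalized sampling locations and k weights through fully connected layers, gathers features at those locations, and takes a weighted sum.

```python
import torch
import torch.nn as nn


class QueryUpdate(nn.Module):
    """Sketch of one query-update step: sample k temporal points per query and mix them."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.point_head = nn.Linear(dim, num_points)   # predicts k normalized time coordinates
        self.weight_head = nn.Linear(dim, num_points)  # predicts k mixing weights

    def forward(self, queries, video_feats):
        # queries: (K, dim) learnable query features; video_feats: (T, dim) encoded frame features
        T = video_feats.shape[0]
        points = self.point_head(queries).sigmoid()           # (K, k), normalized to [0, 1]
        weights = self.weight_head(queries).softmax(dim=-1)   # (K, k)
        idx = (points * (T - 1)).round().long()               # nearest-frame sampling for simplicity
        sampled = video_feats[idx]                            # (K, k, dim)
        return queries + (weights.unsqueeze(-1) * sampled).sum(dim=1)
```

In practice the sampling would normally use interpolation rather than nearest-frame indexing; the sketch only illustrates the data flow from query features to updated query features.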
In the existing DETR model, each video segment is matched to obtain positive samples whose number equals the number of labels, so the number of positive samples is smaller than the number of negative samples. Because the number of positive samples participating in training is insufficient, the classification training of the DETR model is insufficient and the classification accuracy is low.
Fig. 1 is a flowchart illustrating an embodiment of a training method of a target detection model according to the present disclosure, where the target detection model includes an encoder module, a decoder module, and the like, as shown in fig. 1:
step 101, obtaining an enhancement sample corresponding to the video sample, and generating a training sample based on the video sample and the enhancement sample.
In one embodiment, the video samples are video segments with annotation information, the enhancement samples include positive and negative samples corresponding to the video samples, and the training samples can be generated based on the video samples and the enhancement samples using a variety of existing methods. For example, if a video sample is a video clip containing a running action feature, another video that also contains the running action feature may be used as a positive sample, and another video that contains other action features, such as long jump or singing, may be used as a negative sample.
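For illustration only, a minimal sketch (plain Python; the record field `action_label` and the helper name `build_training_sample` are hypothetical) of drawing positive and negative enhancement samples from a library of annotated segments, as described above:

```python
import random


def build_training_sample(video_sample, video_library):
    """Pair a labeled video sample with one positive and one negative enhancement sample.

    video_sample: dict with an 'action_label' key (e.g. 'running').
    video_library: list of dicts describing other annotated video segments.
    """
    label = video_sample["action_label"]
    positives = [v for v in video_library if v["action_label"] == label]
    negatives = [v for v in video_library if v["action_label"] != label]
    return {
        "video": video_sample,
        "positive": random.choice(positives) if positives else None,
        "negative": random.choice(negatives) if negatives else None,
    }
```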
Step 102, generating query feature information using the encoder module and based on the video feature information corresponding to the training samples, and constructing a first loss function corresponding to the enhancement samples.
In one embodiment, as shown in fig. 4, the training samples are processed using a preset backbone network, which includes a neural network model or the like, to generate the video feature information. The target detection model is a DETR model, and the encoder module comprises an encoder based on a Transformer structure. Query feature information, such as query vectors, is generated through the Transformer encoder, and the first loss function corresponding to the enhancement samples is constructed.
Step 103, generating, using the decoder module and based on the query feature information, first classification confidence information corresponding to the video samples, regression information used for characterizing the target position, and second classification confidence information corresponding to the enhancement samples, and constructing a second loss function corresponding to the training samples.
In one embodiment, the decoder module comprises a decoder based on a Transformer structure, and may also comprise a classification head, a regression head, and the like. The decoder is a Transformer decoder; the first classification confidence information, the regression information, and the second classification confidence information are generated through the Transformer decoder, the classification head, the regression head, and the like, and the second loss function corresponding to the training samples is constructed.
Step 104, adjusting the target detection model by using the first loss function and the second loss function.
In an embodiment, the target detection model may be adjusted according to the first loss function and the second loss function by using a plurality of existing model adjustment methods, so that the function values of the first loss function and the second loss function are within the allowable value ranges respectively.
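As a hedged sketch only (assuming PyTorch; `model`, the output keys, and the relative weighting of the two losses are assumptions rather than details given by this step), one adjustment step could look like:

```python
def training_step(model, optimizer, batch, second_loss_weight=1.0):
    """One adjustment step using the first (encoder) and second (decoder) loss functions."""
    outputs = model(batch)  # assumed to return both losses for the training sample
    loss = outputs["first_loss"] + second_loss_weight * outputs["second_loss"]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```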
In one embodiment, a number of methods may be employed to obtain positive and negative samples corresponding to video samples. The method comprises the steps of determining a first action type label of a video sample, and acquiring a positive sample corresponding to the video sample from other videos based on the first action type label, wherein a second action type label of the positive sample is the same as the first action type label. Other videos may be annotated video segments stored in a video library.
A negative sample corresponding to the video sample is acquired from other videos based on the first action type label, wherein the second action type label of the negative sample is different from the first action type label; or, a first video segment corresponding to the first action type label is determined, and a video segment is extracted from the video sample as the negative sample based on the first video segment (the extracted segment serving as a second video segment), wherein the ratio of the overlap length of the first video segment and the second video segment to the sum of their lengths is smaller than a preset intersection ratio threshold.
By adding the enhancement samples corresponding to the video samples in training, the feature differences between different types of actions can be increased, and the feature distinguishability is improved. For example, for a video sample v_i, a ground-truth action segment s_g is given. The ground-truth action segment s_g is manually annotated, and the annotation information includes the start and end times of the action in the ground-truth action segment, the corresponding action type, and the corresponding label c_g (the first action type label). Ground-truth action segments sampled from other videos whose label (the second action type label) is the same as the label c_g are used as positive samples of the ground-truth action segment s_g.
Two methods can be used to obtain negative samples: (1) sampling, from other videos, ground-truth action segments whose label (the second action type label) is different from the label c_g as negative samples; (2) randomly sampling segments from the interior of the ground-truth action segment s_g (the first video segment corresponding to the first action type label) and taking the segments whose IoU is smaller than a certain threshold as negative samples. IoU is the intersection ratio of two segments, i.e. IoU = (the length over which the two segments coincide) / (the sum of the lengths of the two segments); the closer IoU is to 1, the more the two segments coincide.
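As an illustrative sketch only (plain Python; the helper names and the IoU threshold value are assumptions), the segment IoU defined above and the second negative-sampling strategy could be written as:

```python
import random


def temporal_iou(seg_a, seg_b):
    """IoU of two time segments (start, end) as defined in the text:
    overlap length divided by the sum of the two segment lengths."""
    overlap = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    total = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0])
    return overlap / total if total > 0 else 0.0


def sample_internal_negative(gt_segment, iou_threshold=0.3, max_tries=50):
    """Randomly sample a sub-segment inside the ground-truth segment whose IoU
    with the ground-truth segment is below the (assumed) threshold."""
    start, end = gt_segment
    for _ in range(max_tries):
        a, b = sorted(random.uniform(start, end) for _ in range(2))
        if b > a and temporal_iou((a, b), gt_segment) < iou_threshold:
            return (a, b)
    return None
```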
In one embodiment, a number of methods may be employed to construct the first loss function corresponding to the enhancement samples. Fig. 2 is a schematic flowchart of constructing a first loss function in an embodiment of a training method of an object detection model according to the present disclosure, as shown in fig. 2:
in step 201, segment feature information corresponding to the positive and negative examples is obtained based on the video feature information.
In one embodiment, as shown in fig. 4, the encoder module includes a fully connected layer and a region-of-interest Pooling (RoI Pooling) layer. The video feature information is processed through the fully connected layer to obtain fully connected features corresponding to the video feature information, and feature extraction processing is performed on the fully connected features through the RoI Pooling layer to obtain the segment feature information.
The fully-connected layer and the RoI Pooling layer can be various existing fully-connected layers and RoI Pooling layers. As shown in fig. 5, the Transformer decoder includes a self-attention module, a cross-attention module, two normalization layers and a feed-forward network, and the Transformer decoder can use various existing implementations.
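As a rough sketch only (assuming PyTorch; the function name `temporal_roi_pool`, the feature dimensions, and the time-to-frame mapping are assumptions), the fully connected projection followed by temporal RoI Pooling described above could be implemented as:

```python
import torch
import torch.nn as nn


def temporal_roi_pool(frame_feats, segment, duration):
    """Average the projected frame features that fall inside a time segment.

    frame_feats: (T, D) features after the fully connected projection.
    segment: (start, end) in the same time units as `duration`.
    duration: total length of the video, used to map times to frame indices.
    """
    T = frame_feats.shape[0]
    start = int(segment[0] / duration * T)
    end = max(start + 1, int(segment[1] / duration * T))
    return frame_feats[start:end].mean(dim=0)  # time dimension collapses to 1


# Example: project D' = 2048 backbone features to D = 256 dimensions, then pool a segment.
proj = nn.Linear(2048, 256)
frame_feats = proj(torch.randn(128, 2048))  # a clip with T = 128 frames
segment_feat = temporal_roi_pool(frame_feats, segment=(3.0, 7.5), duration=30.0)
```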
Step 202, generating a first loss function according to the segment feature information.
Generating the first loss function from the segment characteristic information may use various methods. Fig. 3 is a schematic flowchart of generating a first loss function according to the segment feature information in an embodiment of the training method of the target detection model according to the present disclosure, as shown in fig. 3:
in step 301, a first segment feature corresponding to a video sample and a second segment feature corresponding to a positive sample are obtained from segment information.
Step 302, determining a first sample feature based on the first segment feature and the second segment feature.
Step 303, obtaining a third segment feature corresponding to the negative example in the segment information.
Step 304, determining a second sample feature based on the first segment feature, the second segment feature, and the third segment feature.
Step 305 generates a first loss function based on the first sample characteristic and the second sample characteristic.
In one embodiment, based on the network structure shown in FIG. 4, for a video segment s, let x ∈ R^{T×D'} denote the video features extracted by the pre-trained network and x̃ ∈ R^{T×D} the features further projected through a single fully connected layer; here x is the feature extracted from the video segment by the pre-trained network, T indicates that the video segment has T frames, D' is the feature dimension of each frame, and each frame feature can be projected into D dimensions (using a fully connected layer). Features within a specific time segment are cropped using the RoI Pooling layer and averaged over the time dimension (the time dimension becomes 1) to serve as the feature of the video segment s, i.e. the segment feature information is extracted.
The features input to the decoder are enhanced using a contrastive learning method, and the generated loss is denoted ACE-enc. The first loss function is generated as:

L_{ACE-enc} = -\log \frac{\exp(f^{\top} f_p)}{\sum_{f_j \in D} \exp(f^{\top} f_j)}

where f is the segment feature obtained from the video segment through the fully connected layer and the RoI Pooling layer, i.e. the first segment feature corresponding to the video sample s; f_p is the segment feature of a positive sample obtained in the same way, i.e. the second segment feature corresponding to the positive sample; f^{\top} f_p is the first sample feature; D is the set of positive and negative samples; f_j is the second segment feature corresponding to a positive sample, or the third segment feature corresponding to a negative sample; and f^{\top} f_j is the second sample feature.
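A minimal sketch of the ACE-enc contrastive loss written above (assuming PyTorch; the variable names and the omission of a temperature factor follow the formula as reconstructed here and are assumptions):

```python
import torch
import torch.nn.functional as F


def ace_enc_loss(f, f_pos, f_negs):
    """Contrastive encoder loss: pull the video-sample segment feature toward the positive
    segment feature and push it away from the negative segment features.

    f: (D,) segment feature of the video sample.
    f_pos: (D,) segment feature of the positive sample.
    f_negs: (N, D) segment features of the negative samples.
    """
    candidates = torch.cat([f_pos.unsqueeze(0), f_negs], dim=0)  # the set D
    logits = candidates @ f                                      # f^T f_j for every j in D
    # -log( exp(f^T f_p) / sum_j exp(f^T f_j) ) equals cross-entropy with the positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```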
In one embodiment, constructing the second loss function corresponding to the training samples may use a variety of methods. Fig. 6 is a schematic flowchart of constructing a second loss function in an embodiment of the training method of the target detection model according to the present disclosure, as shown in fig. 6:
step 601, determining a first classification loss function corresponding to the first classification confidence information.
Step 602, a second classification loss function corresponding to the second classification confidence information is determined.
Step 603, generating a second loss function based on the first classification loss function and the second classification loss function.
In one embodiment, the first classification loss function may be a first cross-entropy loss function, the second classification loss function may be a second cross-entropy loss function, and the first cross-entropy loss function and the second cross-entropy loss function may be a plurality of existing cross-entropy loss functions. And taking the sum of the first cross entropy loss function and the second cross entropy loss function as a second loss function.
For example, as shown in FIG. 7, classification performance is improved by increasing the number of training samples for the classification head, and the ACE-dec loss arises from this increase in the number of positive samples. To increase the amount of training, training samples containing both video samples and enhancement samples are used in the training stage. The newly added enhancement samples pass through the backbone network that extracts video features, the encoder module that further encodes the video features, and the decoder network that detects actions, and the resulting features are sampled to additionally train the classification head (classifier); meanwhile, the classification head and the regression head are trained with the features of the video samples. By using additional enhancement sample segments, the number of positive samples can be increased, thereby improving training performance.
The second loss function is constructed from the ACE-dec losses of each layer of the Transformer decoder:

L_{ACE-dec} = \sum_{l} \left( \mathcal{L}_{cls}(\hat{y}_l, y) + \mathbb{1}_{\{y \neq \varnothing\}} \, \mathcal{L}_{cls}^{aug}(\hat{y}_l^{aug}, y) \right)

where \mathcal{L}_{cls} is the first cross-entropy loss function of the first classification confidence information corresponding to the video sample and represents the classification loss of each video sample; \mathcal{L}_{cls}^{aug} is the second cross-entropy loss function of the second classification confidence information corresponding to the enhancement sample and represents the classification loss of the successfully matched enhancement samples; y represents a label, and the indicator \mathbb{1}_{\{y \neq \varnothing\}} means that only a successfully matched query feature (whose label y is not null) generates a corresponding action segment and incurs a loss, i.e. the extra loss is added only at the query locations of the successfully matched enhancement samples. Both \mathcal{L}_{cls} and \mathcal{L}_{cls}^{aug} can be implemented by various existing methods; for example, both can be the existing Sigmoid Focal Loss, which is an existing cross-entropy-style loss function. \mathcal{L}_{cls} is the loss function corresponding to the segments predicted for the video sample, and \mathcal{L}_{cls}^{aug} is the loss function corresponding to the segments predicted from the labeled segments of the enhanced sample.
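As an illustration only, a minimal per-layer sketch of the ACE-dec classification loss (assuming PyTorch and torchvision's `sigmoid_focal_loss`; the tensor layout, the matching mask, and the class encoding are assumptions):

```python
import torch
from torchvision.ops import sigmoid_focal_loss


def ace_dec_layer_loss(video_logits, video_targets, aug_logits, aug_targets, matched_mask):
    """Classification loss on the video sample plus the extra classification loss on
    enhancement samples, counted only at successfully matched query positions.

    video_logits, aug_logits: (K, C) classification logits per query.
    video_targets, aug_targets: (K, C) float one-hot / multi-hot class targets.
    matched_mask: (K,) bool, True where a query is matched to a non-null label.
    """
    loss_video = sigmoid_focal_loss(video_logits, video_targets, reduction="mean")
    if matched_mask.any():
        loss_aug = sigmoid_focal_loss(aug_logits[matched_mask],
                                      aug_targets[matched_mask], reduction="mean")
    else:
        loss_aug = video_logits.new_zeros(())
    return loss_video + loss_aug
```

In the full model this per-layer loss would be accumulated over every decoder layer, matching the summation in the formula above.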
According to the training method of the target detection model, the training samples are generated based on the video samples and the enhancement samples, so that the distinguishability of the characteristics of the input video is enhanced; by generating classification confidence information corresponding to the video sample and the enhancement sample, classification performance in the field of video motion detection is improved.
Fig. 8 is a schematic flow chart diagram of an embodiment of a target detection method according to the present disclosure, as shown in fig. 8:
step 801, acquiring a trained target detection model; the target detection model is obtained by training through the training method.
Step 802, using the target detection model and based on the video to be detected, generating classification confidence information corresponding to the video to be detected and regression information for representing the position of the target.
In one embodiment, a video to be detected is input into a trained target detection model, and the target detection model outputs classification confidence and regression information for representing the position of a target; the target is the action and the like in the video to be detected, the classification confidence information can be the score and the like of the action classification confidence, and the regression information can be the starting and ending information of the action.
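As a hedged illustration (assuming PyTorch; the output keys `scores` and `segments` and the thresholding step are assumptions about a possible model interface), inference on a video to be detected could look like:

```python
import torch


@torch.no_grad()
def detect_actions(model, video_tensor, score_threshold=0.5):
    """Return (class, confidence, start, end) tuples for queries above a score threshold."""
    model.eval()
    outputs = model(video_tensor)  # assumed keys: 'scores' (K, C) logits, 'segments' (K, 2)
    scores, classes = outputs["scores"].sigmoid().max(dim=-1)
    keep = scores > score_threshold
    return [
        (int(c), float(s), float(seg[0]), float(seg[1]))
        for c, s, seg in zip(classes[keep], scores[keep], outputs["segments"][keep])
    ]
```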
In one embodiment, the present disclosure provides a training apparatus 90 for an object detection model, which includes an encoder module, a decoder module, and the like. The training apparatus 90 includes a sample generation module 91, an encoding processing module 92, a decoding processing module 93, and a model adjustment module 94. The sample generation module 91 acquires an enhancement sample corresponding to the video sample, and generates a training sample based on the video sample and the enhancement sample.
The encoding processing module 92 generates query feature information based on video feature information corresponding to the training samples using the encoder module and constructs a first loss function corresponding to the enhancement samples. The decoding processing module 93 generates first classification confidence information corresponding to the video samples and regression information for characterizing the target position, second classification confidence information corresponding to the enhancement samples, and constructs a second loss function corresponding to the training samples, using the decoder module and based on the query feature information. The model adjustment module 94 performs an adjustment process on the target detection model using the first loss function and the second loss function.
As shown in fig. 10, the training apparatus 90 for a target detection model further includes a feature information obtaining module 95, where the feature information obtaining module 95 uses a preset backbone network to process a training sample to generate video feature information; the backbone network comprises a neural network model and the like.
In one embodiment, the enhancement samples include positive and negative samples corresponding to the video samples. As shown in fig. 11, the encoding processing module 92 includes a segment feature acquisition unit 921 and a first loss determination unit 922. The segment feature acquisition unit 921 acquires segment feature information corresponding to the positive and negative examples based on the video feature information. The first loss determination unit 922 generates a first loss function from the segment characteristic information.
The encoder module comprises a fully connected layer and a region-of-interest Pooling (RoI Pooling) layer. The segment feature acquisition unit 921 processes the video feature information through the fully connected layer to acquire fully connected features corresponding to the video feature information; the segment feature acquisition unit 921 then performs feature extraction processing on the fully connected features through the RoI Pooling layer to obtain the segment feature information.
The first loss determination unit 922 acquires a first segment feature corresponding to the video sample and a second segment feature corresponding to the positive sample from the segment information; the first loss determination unit 922 determines a first sample feature based on the first segment feature and the second segment feature.
The first loss determining unit 922 acquires a third segment feature corresponding to the negative sample from the segment information, and determines a second sample feature based on the first segment feature, the second segment feature, and the third segment feature; the first loss determining unit 922 generates a first loss function from the first sample characteristic and the second sample characteristic.
In one embodiment, as shown in fig. 12, the decoding processing module 93 includes a classification loss determining unit 931 and a second loss determining unit 932. The classification loss determination unit 931 determines a first classification loss function corresponding to the first classification confidence information, and determines a second classification loss function corresponding to the second classification confidence information. The second loss determination unit 932 generates a second loss function based on the first classification loss function and the second classification loss function.
The first classification loss function comprises a first cross-entropy loss function; the second classification loss function includes a second cross-entropy loss function. The second loss determination unit 932 takes the sum of the first cross-entropy loss function and the second cross-entropy loss function as the second loss function.
In one embodiment, as shown in fig. 13, the sample generation module 91 includes a positive sample acquisition unit 911 and a negative sample acquisition unit 912. The positive sample acquiring unit 911 determines a first action type tag of the video sample; the positive sample acquiring unit 911 acquires a positive sample corresponding to the video sample from the other video based on the first action type tag; wherein the second action type label of the positive sample is the same as the first action type label.
The negative sample acquiring unit 912 acquires a negative sample corresponding to the video sample from other videos based on the first action type label; wherein the second action type label of the negative sample is different from the first action type label.
The negative sample acquiring unit 912 determines a first video segment corresponding to the first action type tag, and extracts a video segment from the video sample as a negative sample based on the first video segment; and the ratio of the superposition length and the length sum of the first video segment and the second video segment is smaller than a preset intersection ratio threshold value.
In one embodiment, as shown in fig. 14, the present disclosure provides a training apparatus of an object detection model, which may include a memory 141, a processor 142, a communication interface 143, and a bus 144. The memory 141 is used for storing instructions, the processor 142 is coupled to the memory 141, and the processor 142 is configured to execute a training method for implementing the above-mentioned target detection model based on the instructions stored in the memory 141.
The memory 141 may be a high-speed RAM memory, a non-volatile memory, or the like, and the memory 141 may be a memory array. The memory 141 may also be partitioned, and the blocks may be combined into virtual volumes according to certain rules. The processor 142 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the training method of the target detection model of the present disclosure.
In one embodiment, the present disclosure provides an object detection apparatus 15, including a model acquisition module 151 and a detection processing module 152. The model obtaining module 151 obtains a trained target detection model; the target detection model is obtained by training through the training method; the detection processing module 152 generates classification confidence information corresponding to the video to be detected and regression information for characterizing the target position based on the video to be detected using the target detection model.
In one embodiment, as shown in fig. 16, the present disclosure provides an object detection apparatus that may include a memory 161, a processor 162, a communication interface 163, and a bus 164. The memory 161 is used to store instructions, the processor 162 is coupled to the memory 161, and the processor 162 is configured to execute the target detection method described above based on the instructions stored in the memory 161.
The memory 161 may be a high-speed RAM memory, a non-volatile memory, or the like, and the memory 161 may be a memory array. The memory 161 may also be partitioned, and the blocks may be combined into virtual volumes according to certain rules. The processor 162 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the object detection method of the present disclosure.
In one embodiment, the present disclosure provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a method as in any one of the above embodiments.
In the training method of the target detection model, the target detection method, the apparatus, and the storage medium of the above embodiments, a training sample is generated based on the video sample and the enhancement sample, and a loss function corresponding to the enhancement sample is constructed, so that the distinguishability of the input video features can be enhanced, the feature similarity of actions of the same type can be improved, the feature differences of different actions can be increased, and the distinguishability of the features can be improved; classification confidence information corresponding to the video sample and the enhancement sample is generated based on the query feature information and a corresponding loss function is constructed, and the classification of the model is fully trained by increasing the number of training samples such as positive samples, which improves classification performance and training performance so that prediction results are more accurate; the use experience of the user is also improved.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (17)

1. A training method of an object detection model, wherein the object detection model comprises the following steps: an encoder module and a decoder module; the training method comprises the following steps:
acquiring an enhancement sample corresponding to a video sample, and generating a training sample based on the video sample and the enhancement sample;
generating query feature information and constructing a first loss function corresponding to the enhancement sample using an encoder module and based on video feature information corresponding to the training sample;
generating, using a decoder module and based on the query feature information, first classification confidence information corresponding to the video samples and regression information for characterizing a target location, second classification confidence information corresponding to the enhancement samples, and constructing a second loss function corresponding to the training samples;
and adjusting the target detection model by using the first loss function and the second loss function.
2. The method of claim 1, wherein the enhancement sample comprises: positive and negative samples corresponding to the video sample; said constructing a first loss function corresponding to said enhancement samples comprises:
acquiring segment feature information corresponding to the positive sample and the negative sample based on the video feature information;
and generating a first loss function according to the segment characteristic information.
3. The method of claim 2, wherein the encoder module comprises: a fully connected layer and a region-of-interest Pooling (RoI Pooling) layer; the obtaining of segment feature information corresponding to the positive samples and the negative samples based on the video feature information includes:
processing the video feature information through the fully connected layer to acquire fully connected features corresponding to the video feature information;
and performing feature extraction processing on the fully connected features through the RoI Pooling layer to acquire the segment feature information.
4. The method of claim 2, the generating a first loss function from the segment characteristic information comprising:
acquiring a first segment feature corresponding to the video sample and a second segment feature corresponding to the positive sample from the segment information;
determining a first sample feature based on the first segment feature and the second segment feature;
acquiring a third segment feature corresponding to the negative sample in the segment information;
determining a second sample feature based on the first, second, and third segment features;
generating the first loss function from the first and second sample characteristics.
5. The method of claim 2, the constructing a second loss function corresponding to the training samples comprising:
determining a first classification loss function corresponding to the first classification confidence information;
determining a second classification loss function corresponding to the second classification confidence information;
generating the second loss function based on the first classification loss function and the second classification loss function.
6. The method of claim 5, wherein the first classification loss function comprises: a first cross entropy loss function; the second classification loss function comprises: a second cross entropy loss function; and the generating of the second loss function based on the first classification loss function and the second classification loss function comprises:
taking the sum of the first cross-entropy loss function and the second cross-entropy loss function as the second loss function.
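Claim 6 combines the two classification terms by simple addition; a minimal sketch, with tensor shapes assumed, is:

import torch.nn.functional as F

def second_loss(video_logits, video_labels, enhancement_logits, enhancement_labels):
    # First cross entropy loss on the video sample, second on the enhancement sample.
    first_ce = F.cross_entropy(video_logits, video_labels)
    second_ce = F.cross_entropy(enhancement_logits, enhancement_labels)
    return first_ce + second_ce     # the second loss function of claim 6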
7. The method of claim 2, further comprising:
determining a first action type label for the video sample;
obtaining the positive sample corresponding to the video sample from other videos based on the first action type label; wherein the second action type label of the positive sample is the same as the first action type label.
8. The method of claim 7, further comprising:
obtaining the negative sample corresponding to the video sample from other videos based on the first action type label; wherein an action type label of the negative sample is different from the first action type label.
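Claims 7 and 8 pick the positive and negative samples from other videos by comparing action type labels; the label-indexed sample bank below is only an illustrative assumption about how such a lookup could be organised.

import random

def sample_positive_negative(first_label, label_to_segments):
    # label_to_segments: assumed mapping {action label: [segments from other videos]}
    positive = random.choice(label_to_segments[first_label])                    # same label as the video sample (claim 7)
    other_labels = [lbl for lbl in label_to_segments if lbl != first_label]
    negative = random.choice(label_to_segments[random.choice(other_labels)])   # different label (claim 8)
    return positive, negative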
9. The method of claim 7, further comprising:
determining a first video segment corresponding to the first action type label;
and extracting, from the video sample, a second video segment as the negative sample based on the first video segment; wherein a ratio of the overlap length of the first video segment and the second video segment to the sum of their lengths is smaller than a preset intersection-over-union threshold.
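Claim 9 accepts a segment mined from the same video as a negative only when it barely overlaps the ground-truth action segment; the temporal intersection-over-union below is one way to read the claimed ratio, and the threshold value is an assumption.

def temporal_iou(seg_a, seg_b):
    # seg = (start, end) in frames or seconds
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_valid_negative(gt_segment, candidate, iou_threshold=0.3):
    # Keep the candidate only if its overlap with the ground-truth segment is small.
    return temporal_iou(gt_segment, candidate) < iou_threshold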
10. The method of claim 1, further comprising:
processing the training sample by using a preset backbone network to generate the video feature information; wherein the backbone network comprises: a neural network model.
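Claim 10 leaves the backbone open ("a neural network model"); a frozen image backbone applied frame by frame is one common, assumed choice for producing the video feature information.

import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # assumed backbone choice
backbone.fc = torch.nn.Identity()                                  # drop the classifier head
backbone.eval()

@torch.no_grad()
def extract_video_features(frames):
    # frames: (T, 3, 224, 224) tensor of sampled video frames
    return backbone(frames)                                         # (T, 2048) frame-level features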
11. The method of any one of claims 1 to 10,
the encoder module includes: an encoder based on a Transformer structure;
the decoder module comprises: a decoder based on the Transformer structure.
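Claim 11 only fixes the encoder and decoder as Transformer-based; the DETR-style skeleton below, built from PyTorch's stock layers, is a structural sketch in which the dimensions, query count and head design are assumptions.

import torch
import torch.nn as nn

class DetectionTransformer(nn.Module):
    # Sketch: encoder over frame features, decoder over learned queries,
    # heads for classification confidence and segment regression.
    def __init__(self, dim=256, num_queries=30, num_classes=20):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.queries = nn.Embedding(num_queries, dim)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # classification confidence (incl. background)
        self.reg_head = nn.Linear(dim, 2)                 # segment centre / length regression

    def forward(self, video_feat):
        # video_feat: (B, T, dim) video feature information
        memory = self.encoder(video_feat)
        q = self.queries.weight.unsqueeze(0).expand(video_feat.size(0), -1, -1)
        hs = self.decoder(q, memory)                      # query feature information
        return self.cls_head(hs), self.reg_head(hs).sigmoid()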
12. A method of target detection, comprising:
acquiring a trained target detection model; wherein the target detection model is trained by the training method of any one of claims 1 to 11;
and generating, using the target detection model and based on the video to be detected, classification confidence information corresponding to the video to be detected and regression information for representing a target position.
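Claim 12 runs the trained model on a new video; assuming the DetectionTransformer sketch above, an inference pass could be as simple as:

import torch

@torch.no_grad()
def detect(model, video_feat, score_threshold=0.5):
    # video_feat: (1, T, dim) features of the video to be detected
    cls_logits, segments = model(video_feat)
    scores = cls_logits.softmax(-1)[0, :, :-1]        # drop the background class
    confidence, labels = scores.max(-1)
    keep = confidence > score_threshold
    # classification confidence information and regression information for the target position
    return labels[keep], confidence[keep], segments[0][keep]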
13. An apparatus for training a target detection model, wherein the target detection model comprises: an encoder module and a decoder module; and the training apparatus comprises:
a sample generation module for acquiring an enhancement sample corresponding to a video sample and generating a training sample based on the video sample and the enhancement sample;
an encoding processing module for generating query feature information using the encoder module and based on video feature information corresponding to the training sample, and constructing a first loss function corresponding to the enhancement sample;
a decoding processing module for generating, using the decoder module and based on the query feature information, first classification confidence information and regression information for characterizing a target position corresponding to the video sample, and second classification confidence information corresponding to the enhancement sample, and constructing a second loss function corresponding to the training sample;
and a model adjusting module for adjusting the target detection model using the first loss function and the second loss function.
14. A training apparatus for a target detection model, comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the method of any one of claims 1 to 11 based on instructions stored in the memory.
15. A target detection device, comprising:
a model acquisition module for acquiring a trained target detection model; wherein the target detection model is trained by the training method of any one of claims 1 to 11;
and a detection processing module for generating, using the target detection model and based on the video to be detected, classification confidence information corresponding to the video to be detected and regression information for representing a target position.
16. A target detection device, comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the method of claim 12 based on instructions stored in the memory.
17. A non-transitory computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, perform the method of any one of claims 1 to 12.
CN202210788464.8A 2022-07-06 2022-07-06 Training of target detection model, target detection method, device and medium Pending CN115082830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210788464.8A CN115082830A (en) 2022-07-06 2022-07-06 Training of target detection model, target detection method, device and medium

Publications (1)

Publication Number Publication Date
CN115082830A true CN115082830A (en) 2022-09-20

Family

ID=83258022

Country Status (1)

Country Link
CN (1) CN115082830A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination