CN113065422A - Training method of video target detection model and video target detection method and device - Google Patents

Training method of video target detection model and video target detection method and device

Info

Publication number
CN113065422A
CN113065422A (application CN202110294961.8A)
Authority
CN
China
Prior art keywords
video
training sample
training
pipeline
sample video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110294961.8A
Other languages
Chinese (zh)
Inventor
范琦
戴宇荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110294961.8A priority Critical patent/CN113065422A/en
Publication of CN113065422A publication Critical patent/CN113065422A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The disclosure relates to a training method of a video target detection model and a video target detection method and device. The training method of the video target detection model comprises the following steps: acquiring a training sample video, a support picture of the training sample video and a real target detection result; respectively obtaining the target object characteristics and the support picture characteristics of the training sample video from the training sample video and the support pictures of the training sample video; matching the target object characteristics with the support picture characteristics to obtain a matching result; determining a first loss of the video target detection model based on the matching result and the real target detection result; training the video target detection model based on the determined first loss. According to the training method and device for the video target detection model, the video target detection effect can be improved, the detection flow is simplified, and the detection time is saved.

Description

Training method of video target detection model and video target detection method and device
Technical Field
The present disclosure relates to the field of video technology. More specifically, the present disclosure relates to a training method and apparatus for a video target detection model, and a target detection method and apparatus.
Background
With the continuously increasing requirements of users on image processing, target detection technology is applied more and more widely. Especially in the current period of explosive growth of streaming media, the application scenarios of video target detection are becoming more and more numerous. Different from target detection in pictures, video target detection places higher requirements on algorithms because of problems specific to video such as jitter, defocus, illumination change and occlusion. A user needs to train a model using a large number of high-quality target detection training samples and then use the model in the target detection task. However, in an actual application scenario, labeling high-quality target detection training samples requires a large amount of manpower and material resources, and such samples often cannot be obtained quickly, so the detection model cannot be rapidly deployed to detect new categories of samples. Video data is particularly difficult to label because each video contains a large number of pictures to be labeled. Small sample (few-shot) target detection methods can alleviate this problem, but the existing small sample target detection models are all algorithms for pictures; there is no algorithm dedicated to small sample target detection in video, and the particularities of video data greatly limit the applicability of picture-oriented algorithms.
Disclosure of Invention
An exemplary embodiment of the present disclosure provides a training method and apparatus for a video target detection model, and a target detection method and apparatus, so as to at least solve the problem of video target detection in the related art; the embodiments are, however, not required to solve any of the problems described above.
According to an exemplary embodiment of the present disclosure, there is provided a training method of a video target detection model, including: acquiring a training sample video, a support picture of the training sample video and a real target detection result; respectively obtaining the target object characteristics and the support picture characteristics of the training sample video from the training sample video and the support pictures of the training sample video; matching the target object characteristics with the support picture characteristics to obtain a matching result; determining a first loss of the video target detection model based on the matching result and the real target detection result; training the video target detection model based on the determined first loss.
Optionally, the step of obtaining the target object feature of the training sample video from the training sample video may include: generating a video pipeline of the objects in the training sample video, wherein the video pipeline contains the same object in at least one frame of the training sample video; and intercepting the features in the video pipeline as the target object features of the training sample video.
Optionally, the step of generating a video pipeline of objects in the training sample video may comprise: performing feature extraction on the training sample video to obtain features of the training sample video; generating a video pipeline for each object in the training sample video based on features of the training sample video.
Optionally, the step of generating a video pipeline for each object in the training sample video based on the features of the training sample video may include: tracking each object in the training sample video based on the characteristics of the training sample video to obtain the location of each object in each frame; a video pipeline for each object in the training sample video is generated based on the location of each object in each frame.
Optionally, the step of generating a video pipeline of each object in the training sample video based on the location of each object in each frame may comprise: and forming a track pipeline of each object in the whole training sample video as a video pipeline based on the positioning of each object in each frame.
Optionally, the step of obtaining support picture features from the support pictures of the training sample video may include: performing feature extraction on the support pictures of the training sample video, wherein the number of support pictures of the training sample video is greater than or equal to 1; intercepting feature regions from the extracted features; and performing an averaging operation on the intercepted features to obtain the support picture features.
Optionally, the training method may further include: classifying the training sample support pictures based on the support picture characteristics; determining a second loss of the video target detection model based on the classification result and the real category of the support picture of the training sample video, wherein the step of training the video target detection model based on the determined first loss comprises: adjusting parameters of the video object detection model based on the determined first loss and second loss.
Optionally, the step of obtaining the target object feature of the training sample video from the training sample video may include: generating a video pipeline of objects in the training sample video; intercepting the features in the video pipeline as video pipeline features; acquiring a timing alignment feature from a training sample; and fusing the video pipeline characteristics and the time sequence alignment characteristics, and taking the fused characteristics as the target object characteristics of the training sample video.
Optionally, the step of obtaining the timing alignment feature from the training sample may include: extracting the characteristics of different frame images of the same object in the training sample video; intercepting the features in the region of the target object from the extracted features; and averaging the features in the target object regions of the different frame images to obtain a time sequence alignment feature.
Optionally, the first loss may include at least one of a matching loss and a position loss.
According to an exemplary embodiment of the present disclosure, there is provided a video object detection method including: acquiring a video to be detected; acquiring the support picture characteristics of the video to be detected, and acquiring the target object characteristics of the video to be detected from the video to be detected; and matching the characteristics of the target object with the characteristics of the support picture to obtain the target in the video to be detected.
Optionally, the step of obtaining the target object feature of the video to be detected from the video to be detected may include: generating a video pipeline of the object in the video to be detected, wherein the video pipeline comprises the same object in at least one frame of the video to be detected; and intercepting the characteristics in the video pipeline as the characteristics of the target object of the video to be detected.
Optionally, the step of generating a video pipeline of the object in the video to be detected may include: extracting the characteristics of the video to be detected to obtain the characteristics of the video to be detected; and generating a video pipeline of each object in the video to be detected based on the characteristics of the video to be detected.
Optionally, the step of generating a video pipeline of each object in the video to be detected based on the features of the video to be detected may include: tracking each object in the video to be detected based on the characteristics of the video to be detected to obtain the positioning of each object in each frame; and generating a video pipeline of each object in the video to be detected based on the positioning of each object in each frame.
Optionally, the step of generating a video pipeline of each object in the video to be detected based on the location of each object in each frame may include: and forming a track pipeline of each object in the whole video to be detected as a video pipeline based on the positioning of each object in each frame.
According to an exemplary embodiment of the present disclosure, there is provided a training apparatus of a video target detection model, including: the training data acquisition unit is configured to acquire a training sample video, a support picture of the training sample video and a real target detection result; a feature acquisition unit configured to acquire a target object feature and a support picture feature of the training sample video from support pictures of the training sample video and the training sample video, respectively; the characteristic matching unit is configured to match the target object characteristic with the support picture characteristic to obtain a matching result; a first loss determination unit configured to determine a first loss of the video target detection model based on the matching result and the real target detection result; and a model training unit configured to train the video target detection model based on the determined first loss.
Optionally, the feature acquisition unit may be configured to: generating a video pipeline of the object in the training sample video, wherein the video pipeline contains the same object in at least one frame of the training sample video; and intercepting the features in the video pipeline as the target object features of the training sample video.
Optionally, the feature acquisition unit may be configured to: performing feature extraction on the training sample video to obtain features of the training sample video; generating a video pipeline for each object in the training sample video based on features of the training sample video.
Optionally, the feature acquisition unit may be configured to: tracking each object in the training sample video based on the characteristics of the training sample video to obtain the location of each object in each frame; a video pipeline for each object in the training sample video is generated based on the location of each object in each frame.
Optionally, the feature acquisition unit may be configured to: and forming a track pipeline of each object in the whole training sample video as a video pipeline based on the positioning of each object in each frame.
Optionally, the feature acquisition unit may be configured to: performing feature extraction on the support pictures of the training sample video, wherein the number of the support pictures of the training sample video is more than or equal to 1; intercepting the feature area of the extracted features; and carrying out averaging operation on the intercepted features to obtain the features of the support picture.
Optionally, the training device may further include: a support picture classification unit configured to classify the training sample support picture based on the support picture features; and a second loss determination unit configured to determine a second loss of the video target detection model based on the classification result and the real category of the support picture of the training sample video, wherein the model training unit is configured to: adjusting parameters of the video object detection model based on the determined first loss and second loss.
Optionally, the feature acquisition unit may be configured to: generating a video pipeline of objects in the training sample video; intercepting the features in the video pipeline as video pipeline features; acquiring a timing alignment feature from a training sample; and fusing the video pipeline characteristics and the time sequence alignment characteristics, and taking the fused characteristics as the target object characteristics of the training sample video.
Optionally, the feature acquisition unit may be configured to: extracting the characteristics of different frame images of the same object in the training sample video; intercepting the features in the region of the target object from the extracted features; and averaging the features in the target object regions of the different frame images to obtain a time sequence alignment feature.
Optionally, the first loss may include at least one of a matching loss and a position loss.
According to an exemplary embodiment of the present disclosure, there is provided a video object detecting apparatus including: a video acquisition unit configured to acquire a video to be detected; a feature acquisition unit configured to acquire support picture features of the video to be detected and acquire target object features of the video to be detected from the video to be detected; and a feature matching unit configured to match the target object features with the support picture features to obtain the target in the video to be detected acquired by the video acquisition unit.
Optionally, the feature acquisition unit may be configured to: generating a video pipeline of the object in the video to be detected, wherein the video pipeline comprises the same object in at least one frame of the video to be detected; and intercepting the characteristics in the video pipeline as the characteristics of the target object of the video to be detected.
Optionally, the feature acquisition unit may be configured to: extracting the characteristics of the video to be detected to obtain the characteristics of the video to be detected; and generating a video pipeline of each object in the video to be detected based on the characteristics of the video to be detected.
Optionally, the feature acquisition unit may be configured to: tracking each object in the video to be detected based on the characteristics of the video to be detected to obtain the positioning of each object in each frame; and generating a video pipeline of each object in the video to be detected based on the positioning of each object in each frame.
Optionally, the feature acquisition unit may be configured to: and forming a track pipeline of each object in the whole video to be detected as a video pipeline based on the positioning of each object in each frame.
According to an exemplary embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a video object detection method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform a training method of a video object detection model or a video object detection method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement a video object detection method or a training method of a video object detection model according to an exemplary embodiment of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the detection effect of the small sample target detection model in the video data is improved, and the performance of the model is improved, so that the model can be effectively applied to different video service scenes;
target detection can be performed on the new category by relying on a small number of support pictures, a large number of training samples are not needed, labeling time and cost are greatly saved, and model deployment and a new category detection process are simplified.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 shows an overall system diagram of the training and application of a video target detection model according to an exemplary embodiment of the present disclosure.
Fig. 2 shows a flow chart of a training method of a video target detection model according to an example embodiment of the present disclosure.
Fig. 3 shows a flow chart of a training method of a video target detection model according to another exemplary embodiment of the present disclosure.
Fig. 4 shows a flowchart of a video object detection method according to an exemplary embodiment of the present disclosure.
Fig. 5 illustrates a flowchart of a video object detection method according to another exemplary embodiment of the present disclosure.
FIG. 6 shows a block diagram of a training apparatus for a video target detection model according to an example embodiment of the present disclosure.
Fig. 7 shows a block diagram of a training apparatus of a video target detection model according to another exemplary embodiment of the present disclosure.
Fig. 8 illustrates a block diagram of a video object detection apparatus according to an exemplary embodiment of the present disclosure.
Fig. 9 illustrates a block diagram of a video object detection apparatus according to another exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram of an electronic device 1000 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Herein, the expression "at least one of the items" in the present disclosure covers the following three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In the related art, a small sample detection network is trained using training triplets each including a target sample, a positive sample, and a negative sample; an attention-based Region Proposal Network (RPN) is used to generate detection frames that may contain the target object, these detection frames are then scored by a multi-relationship matching network, and the rectangular frames containing the target object are screened out as the detection result. The small sample detection model obtained by such training can detect corresponding new classes by relying on only a small number of template pictures, without a large number of training samples, which greatly saves labeling time and cost. However, the related art is a technology for detecting objects in pictures, and the jitter, defocus, illumination variation, occlusion and other problems in video make it unable to obtain satisfactory detection results on video. The technology detects each picture independently and does not utilize the temporal relationship and information in consecutive video frames, so it performs poorly in the task of small sample target detection in video.
Hereinafter, a training method and apparatus of a video target detection model, and a video target detection method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 9.
Fig. 1 shows an overall system diagram of the training and application of a video target detection model according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, the video target detection model may include a timing feature alignment module, a support picture classification module, a video pipeline generation module, a matching module, and a plurality of backbone networks. The input of the video target detection model may be a video, and the output may be the detected targets in the video. The backbone network is used to extract features and may be, for example, but not limited to, ResNet50, ResNet101, ResNet200, ResNeXt101, ResNeSt101, ResNet18, MobileNet, SqueezeNet, etc. As shown in fig. 1, the timing feature alignment module may randomly select different frames of the same object in the video as input, extract features using a backbone network, intercept the features in the region of the target object using a rectangular frame, then average the intercepted features, and finally fuse the averaged result with the features from the video pipeline generation module to serve as the features of the target object. The video pipeline generation module is used to generate video pipelines and may be a multi-target tracking network, which may be, for example, but not limited to, CTracker. The matching module may be a matching module based on video pipeline features, such as, but not limited to, a multi-relationship matching module, a convolution-based matching module, a cosine-distance-based matching module, or a dot-product-based matching module. Unlike the picture-based matching module in FSOD, the matching module in the present disclosure is based on multi-frame picture features, and the target object features that it takes as input are multi-frame fusion features. In addition, the matching module in the present disclosure may use a label smoothing technique to reduce overfitting and improve generalization capability. After the features of the support picture are extracted using a backbone network, and the feature regions are intercepted and averaged, the resulting support picture features are processed by the support picture classification module. The support picture classification module may consist of a global mean pooling, a fully-connected layer with 2048 input dimensions and 512 output dimensions, and a fully-connected layer with 512 input dimensions and 300 output dimensions (the training set contains 300 classes). The output 300-dimensional vector is supervised during training using a cross-entropy loss function.
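By way of illustration only, the following Python (PyTorch) sketch shows one possible implementation of the support picture classification module described above: global mean pooling, a fully-connected layer from 2048 to 512 dimensions, and a fully-connected layer from 512 to 300 dimensions, supervised with a cross-entropy loss. The class name SupportClassifier and the ReLU activation between the two fully-connected layers are illustrative assumptions and are not specified by the disclosure.

```python
import torch.nn as nn

class SupportClassifier(nn.Module):
    """Illustrative support picture classification head."""

    def __init__(self, in_channels=2048, hidden_dim=512, num_classes=300):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global mean pooling
        self.fc1 = nn.Linear(in_channels, hidden_dim)  # 2048 -> 512
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # 512 -> 300
        self.relu = nn.ReLU(inplace=True)              # activation is an assumption

    def forward(self, support_feat):                   # support_feat: (N, 2048, H, W)
        x = self.pool(support_feat).flatten(1)         # (N, 2048)
        x = self.relu(self.fc1(x))                     # (N, 512)
        return self.fc2(x)                             # (N, 300) class logits

# The 300-dimensional output is supervised with a cross-entropy loss.
classification_criterion = nn.CrossEntropyLoss()
```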
The video target detection model needs to be trained before application. In an exemplary embodiment of the present disclosure, the video used for training is also referred to as a training sample video. The video target detection model may be trained based on the training sample video and the true target detection results of the training sample video. After the trained video target detection model is obtained, the video to be detected can be input, and the target in the video to be detected can be obtained.
In the training process, from the multiple frames of consecutive pictures in an input video, for example, but not limited to, two frames of pictures are first randomly selected to train a backbone network and a video pipeline generation module (the video pipeline generation module here may use an existing multi-target tracking algorithm, for example, but not limited to, CTracker). Features within each video pipeline are then intercepted (using, for example, but not limited to, the RoIAlign technique from Mask R-CNN) and averaged to obtain a feature map of size (1xCxHxW) as the features of this video pipeline (one video pipeline contains the same object across multiple frames). These features are then fused with the features from the timing feature alignment module to obtain the features of the target object. Meanwhile, a plurality of support pictures are passed through the backbone network and subjected to region interception and averaging operations to obtain the support features. The target object features are input into the matching module together with the support features for matching, and the network is trained using a matching loss and a position loss (the loss functions may be, respectively, a cross-entropy loss function and a smooth L1 loss function). In addition, in the present disclosure, a support picture classification module is introduced to classify the support pictures and supervise them using a cross-entropy loss function.
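As a non-limiting sketch of the feature interception and fusion described above, the following Python (PyTorch) code crops the per-frame features inside one video pipeline with RoIAlign, averages them into a (1xCxHxW) feature map, and fuses the result with the timing-alignment feature. The function names, the RoIAlign output size and spatial scale, and the element-wise averaging used for fusion are illustrative assumptions; the disclosure does not fix a particular fusion operator.

```python
import torch
from torchvision.ops import roi_align

def video_pipeline_feature(frame_feats, tube_boxes, output_size=7, spatial_scale=1 / 16):
    """Crop the per-frame features inside one video pipeline (tube) with RoIAlign
    and average them into a single (1, C, H, W) feature map.

    frame_feats: list of (1, C, h, w) backbone feature maps, one per frame
    tube_boxes:  list of (x1, y1, x2, y2) boxes of the same object, one per frame
    """
    crops = []
    for feat, box in zip(frame_feats, tube_boxes):
        rois = torch.tensor([[0.0, *box]])               # batch index 0 followed by the box
        crops.append(roi_align(feat, rois, output_size, spatial_scale))
    return torch.stack(crops).mean(dim=0)                # (1, C, out, out)

def fuse(tube_feat, align_feat):
    """Fuse the video pipeline feature with the timing-alignment feature;
    element-wise averaging is only one possible choice of fusion."""
    return (tube_feat + align_feat) / 2
```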
During the training process, parameters of a ResNet50 pre-trained on the ImageNet dataset and the MS COCO dataset can be used, and the newly added layers are initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0. For example, a gradient descent method based on SGD (Stochastic Gradient Descent) can be adopted to solve the convolution template parameters w and the bias parameters b of the neural network model: in each iteration, the prediction error is calculated and back-propagated through the convolutional neural network model, the gradients are computed, and the parameters of the convolutional neural network model are updated. For example, the learning rate of training is 0.002, 45000 rounds of training are performed, and the learning rate is decreased by a factor of ten at the 30000th and the 40000th round, respectively.
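For example only, the optimization schedule described above could be set up in PyTorch as in the sketch below. The momentum value, and the names model, new_layers, compute_losses and data_iter, are illustrative placeholders rather than details given by the disclosure.

```python
import torch

# Newly added layers: Gaussian initialization with mean 0 and variance 0.01
for layer in new_layers:                        # placeholder list of the added layers
    torch.nn.init.normal_(layer.weight, mean=0.0, std=0.01 ** 0.5)
    torch.nn.init.zeros_(layer.bias)

# SGD with learning rate 0.002; momentum 0.9 is an assumed, not a disclosed, value
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
# Decrease the learning rate by a factor of ten at the 30000th and 40000th rounds
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30000, 40000], gamma=0.1)

for step in range(45000):                       # 45000 rounds of training
    loss = compute_losses(next(data_iter))      # matching + position (+ classification) losses
    optimizer.zero_grad()
    loss.backward()                             # back-propagate the prediction error
    optimizer.step()                            # update convolution weights w and biases b
    scheduler.step()
```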
In the testing process, the features of an input video are extracted using a backbone network, the video pipeline generation module is used to generate a video pipeline for each object, the features in the video pipelines are subjected to region interception and averaging, and finally these features and the support picture features are input into the matching module together to obtain the matching score of each video pipeline. These video pipelines with matching scores serve as the final detection results.
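The matching step in the test process can be illustrated with the following Python (PyTorch) sketch, which scores each video pipeline against the support picture features using a cosine-distance-based matching module, one of the matching modules mentioned above; the function name and the flattening of the feature maps are illustrative assumptions.

```python
import torch.nn.functional as F

def match_scores(tube_feats, support_feat):
    """Compute a matching score for every video pipeline.

    tube_feats:   (N, C, H, W) averaged features, one per video pipeline
    support_feat: (1, C, H, W) averaged support picture feature
    Returns an (N,) tensor of cosine-similarity matching scores.
    """
    t = tube_feats.flatten(1)        # (N, C*H*W)
    s = support_feat.flatten(1)      # (1, C*H*W)
    return F.cosine_similarity(t, s, dim=1)
```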
According to the video target detection model of the exemplary embodiment of the disclosure, the consistency of training and testing features can be enhanced by introducing the feature alignment module, and the small sample target detection performance in a video is improved by utilizing the time sequence relation and information of adjacent frames of the video, so that the problem of poor performance of the existing small sample target detection model on video data is solved; and the target detection can be carried out on the new category only by depending on a small number of supporting pictures, a large number of training samples are not needed, the labeling time and cost are greatly saved, and the model deployment and the detection process of the new category are simplified.
Fig. 2 shows a flow chart of a training method of a video target detection model according to an example embodiment of the present disclosure.
Referring to fig. 2, in step S201, a training sample video, a support picture of the training sample video, and a real target detection result are obtained.
The detection result obtained by detecting the training sample video in advance through a manual detection method or other methods can be used as a real target detection result. The supporting pictures include most or all of the features of the video.
In step S202, target object features and support picture features of the training sample video are obtained from the training sample video and the support pictures of the training sample video, respectively. Therefore, target detection can be carried out on the new category only by relying on a small number of support pictures, a large number of training samples are not needed, the labeling time and cost are greatly saved, and the model deployment and the detection process of the new category are simplified.
In an exemplary embodiment of the present disclosure, when obtaining the target object feature of the training sample video from the training sample video, a video pipeline of an object in the training sample video may be first generated, and then a feature in the video pipeline is intercepted as the target object feature of the training sample video, so as to improve the accuracy of the target object feature. Here, one video pipeline contains the same object in at least one frame of the training sample video (e.g., without limitation, all frames of the video).
In an exemplary embodiment of the present disclosure, when generating a video pipeline of each object in the training sample video, feature extraction may be performed on the training sample video to obtain features of the training sample video, and then a video pipeline of each object in the training sample video may be generated based on the features of the training sample video.
In an exemplary embodiment of the present disclosure, when generating a video pipeline of each object in the training sample video based on the features of the training sample video, each object in the training sample video may be first tracked based on the features of the training sample video, a location of each object in each frame is obtained, and then a video pipeline of each object in the training sample video is generated based on the location of each object in each frame.
In an exemplary embodiment of the present disclosure, in generating a video pipeline of each object in the training sample video based on the location of each object at each frame, a trajectory pipeline of each object in the entire training sample video may be formed as a video pipeline based on the location of each object at each frame.
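As a purely illustrative example, the following Python sketch assembles a trajectory pipeline for each object from the per-frame locations produced by a multi-target tracker (such as, but not limited to, CTracker); the data structure and names are assumptions made for illustration only.

```python
from collections import defaultdict

def build_video_pipelines(tracked_frames):
    """Form a trajectory pipeline per object over the whole video.

    tracked_frames: iterable of (frame_idx, detections), where each detection
                    is a pair (object_id, (x1, y1, x2, y2)) from the tracker.
    Returns: {object_id: [(frame_idx, box), ...]} -- one video pipeline per object.
    """
    pipelines = defaultdict(list)
    for frame_idx, detections in tracked_frames:
        for object_id, box in detections:
            pipelines[object_id].append((frame_idx, box))
    return dict(pipelines)
```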
In an exemplary embodiment of the present disclosure, when obtaining a support picture feature from a support picture of the training sample video, feature extraction may be performed on the support picture of the training sample video, and feature region clipping may be performed on the extracted feature, and then an averaging operation may be performed on the clipped feature to obtain the support picture feature. Here, the number of support pictures of the training sample video is 1 or more.
In an exemplary embodiment of the present disclosure, when obtaining a target object feature of the training sample video from the training sample video, a video pipeline of an object in the training sample video may be first generated, and a feature in the video pipeline is intercepted as a video pipeline feature, then a timing alignment feature is obtained from the training sample, and finally the video pipeline feature and the timing alignment feature are fused, and the fused feature is used as the target object feature of the training sample video. Therefore, the consistency of training and testing features is enhanced through the time sequence alignment features, and the time sequence relation and information of adjacent frames of the video are utilized, so that the small sample target detection performance in the video is improved.
In the exemplary embodiment of the present disclosure, when obtaining a timing alignment feature from a training sample, feature extraction may be performed on different frame images in which the same object appears in a training sample video, and features in a target object region are extracted from the extracted features, and then an average value is obtained for the features in the target object region of the different frame images, so as to obtain the timing alignment feature.
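A minimal Python (PyTorch) sketch of the timing alignment feature described above is given below: randomly chosen frames containing the same object are passed through a backbone, the target object region is intercepted from each feature map, and the intercepted features are averaged. The backbone, the number of sampled frames, and the RoIAlign parameters are illustrative assumptions.

```python
import random
import torch
from torchvision.ops import roi_align

def timing_alignment_feature(backbone, frames, boxes, num_frames=2,
                             output_size=7, spatial_scale=1 / 16):
    """Average the target-region features of randomly selected frames
    in which the same object appears.

    frames: list of (1, 3, H, W) image tensors containing the same object
    boxes:  list of (x1, y1, x2, y2) target object boxes, one per frame
    """
    indices = random.sample(range(len(frames)), k=min(num_frames, len(frames)))
    crops = []
    for i in indices:
        feat = backbone(frames[i])                       # (1, C, h, w) features
        rois = torch.tensor([[0.0, *boxes[i]]])          # batch index 0 followed by the box
        crops.append(roi_align(feat, rois, output_size, spatial_scale))
    return torch.stack(crops).mean(dim=0)                # (1, C, out, out)
```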
In step S203, the target object feature is matched with the support image feature to obtain a matching result. Here, in the matching process, a label smoothing technique can be used to reduce overfitting and improve generalization capability.
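The label smoothing mentioned in this step can be illustrated, for example, by the following PyTorch snippet; the smoothing factor of 0.1 is an example value and is not specified by the disclosure.

```python
import torch.nn as nn

# Label-smoothed cross entropy for the matching classification;
# the smoothing factor 0.1 is only an illustrative value.
matching_criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```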
In step S204, a first loss of the video object detection model is determined based on the matching result and the real object detection result.
In an exemplary embodiment of the present disclosure, the first loss may include at least one of a matching loss and a position loss.
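For illustration, the first loss combining a matching loss and a position loss could be computed as in the sketch below, using a cross-entropy matching loss and a smooth L1 position loss consistent with the loss functions mentioned in the description of fig. 1; the 1:1 weighting is an assumption.

```python
import torch.nn.functional as F

def first_loss(match_logits, match_labels, pred_boxes, gt_boxes, position_weight=1.0):
    """First loss = matching loss (cross entropy) + position loss (smooth L1).
    The equal weighting is an illustrative choice, not fixed by the disclosure."""
    matching_loss = F.cross_entropy(match_logits, match_labels)
    position_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    return matching_loss + position_weight * position_loss
```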
In step S205, the video target detection model is trained based on the determined first loss.
According to the video target detection method of the exemplary embodiment of the disclosure, the consistency of training and testing features is enhanced by introducing the feature alignment module, and the small sample target detection performance in a video is improved by utilizing the time sequence relation and information of adjacent frames of the video, so that the problem of poor performance of the existing small sample target detection model on video data can be solved; and the target detection can be carried out on the new category only by depending on a small number of supporting pictures, a large number of training samples are not needed, the labeling time and cost are greatly saved, and the model deployment and the detection process of the new category are simplified.
Fig. 3 shows a flow chart of a training method of a video target detection model according to another exemplary embodiment of the present disclosure.
Referring to fig. 3, in step S301, a training sample video, a support picture of the training sample video, and a real target detection result are obtained.
The detection result obtained by detecting the training sample video in advance can be used as the real target detection result. The support picture comprises part of or all of the features of the video.
In step S302, target object features and support picture features of the training sample video are obtained from the training sample video and the support pictures of the training sample video, respectively. Therefore, target detection can be carried out on the new category only by relying on a small number of support pictures, a large number of training samples are not needed, the labeling time and cost are greatly saved, and the model deployment and the detection process of the new category are simplified.
In an exemplary embodiment of the present disclosure, when obtaining the target object features of the training sample video from the training sample video, a video pipeline of objects in the training sample video may be first generated, and then features in the video pipeline may be intercepted as the target object features of the training sample video.
In an exemplary embodiment of the present disclosure, when generating a video pipeline of each object in the training sample video, feature extraction may be performed on the training sample video to obtain features of the training sample video, and then a video pipeline of each object in the training sample video may be generated based on the features of the training sample video.
In an exemplary embodiment of the present disclosure, when generating a video pipeline of each object in the training sample video based on the features of the training sample video, each object in the training sample video may be first tracked based on the features of the training sample video, a location of each object in each frame is obtained, and then a video pipeline of each object in the training sample video is generated based on the location of each object in each frame.
In an exemplary embodiment of the present disclosure, in generating a video pipeline of each object in the training sample video based on the location of each object at each frame, a trajectory pipeline of each object in the entire training sample video may be formed as a video pipeline based on the location of each object at each frame.
In an exemplary embodiment of the present disclosure, when obtaining a support picture feature from a support picture of the training sample video, feature extraction may be performed on the support picture of the training sample video, and feature region clipping may be performed on the extracted feature, and then an averaging operation may be performed on the clipped feature to obtain the support picture feature. Here, the number of support pictures of the training sample video is 1 or more.
In an exemplary embodiment of the present disclosure, when obtaining a target object feature of the training sample video from the training sample video, a video pipeline of an object in the training sample video may be first generated, and a feature in the video pipeline is intercepted as a video pipeline feature, then a timing alignment feature is obtained from the training sample, and finally the video pipeline feature and the timing alignment feature are fused, and the fused feature is used as the target object feature of the training sample video. Therefore, the consistency of training and testing features is enhanced through the time sequence alignment features, and the time sequence relation and information of adjacent frames of the video are utilized, so that the small sample target detection performance in the video is improved.
In the exemplary embodiment of the present disclosure, when obtaining a timing alignment feature from a training sample, feature extraction may be performed on different frame images in which the same object appears in a training sample video, and features in a target object region are extracted from the extracted features, and then an average value is obtained for the features in the target object region of the different frame images, so as to obtain the timing alignment feature.
In step S303, the target object feature is matched with the support image feature to obtain a matching result.
In step S304, a first loss of the video object detection model is determined based on the matching result and the real object detection result.
In an exemplary embodiment of the present disclosure, the first loss may include at least one of a matching loss and a position loss.
In step S305, the training sample support picture is classified based on the support picture features.
In step S306, a second loss of the video target detection model is determined based on the classification result and the real category of the support picture of the training sample video.
In step S307, parameters of the video object detection model are adjusted based on the determined first loss and second loss. After the training of the video target detection model is completed, the video target detection model can be put into practical application scenes for use. The target can be detected from the video to be detected by operating the video target detection model.
Fig. 4 shows a flowchart of a video object detection method according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, in step S401, a video to be detected is acquired.
In step S402, the support picture feature of the video to be detected is obtained, and the target object feature of the video to be detected is obtained from the video to be detected. Specifically, after the video to be detected is acquired (or determined, received) in step S401, the support picture feature corresponding to the video to be detected may be determined in step S402, and the target object feature of the video to be detected may be acquired from the video to be detected.
In an exemplary embodiment of the present disclosure, when obtaining the target object feature of the video to be detected from the video to be detected, a video pipeline of the object in the video to be detected may be first generated, and then the feature in the video pipeline is intercepted as the target object feature of the video to be detected, so as to represent the object and the feature in the video in the form of the video pipeline. Here, the video pipeline includes the same object in at least one frame of the video to be detected (e.g., without limitation, all frames of the video).
In the exemplary embodiment of the disclosure, when generating the video pipeline of the object in the video to be detected, feature extraction may be performed on the video to be detected first to obtain features of the video to be detected, and then the video pipeline of each object in the video to be detected is generated based on the features of the video to be detected, thereby improving the accuracy of the video pipeline.
In an exemplary embodiment of the present disclosure, when generating a video pipeline of each object in a video to be detected based on characteristics of the video to be detected, each object in the video to be detected may be tracked based on the characteristics of the video to be detected, so as to obtain a location of each object in each frame, and then generate a video pipeline of each object in the video to be detected based on the location of each object in each frame, thereby improving accuracy of the video pipeline.
In an exemplary embodiment of the present disclosure, when generating a video pipeline for each object in a video to be detected based on the location of each object in each frame, a track pipeline for each object in the entire video to be detected may be first formed as a video pipeline based on the location of each object in each frame, thereby improving the accuracy of the video pipeline.
In step S403, the target object characteristics are matched with the support image characteristics to obtain a target in the video to be detected. Specifically, for example, the target object features and the support picture features may be matched to obtain a similarity score of each target object feature with respect to the support picture features, and the target in the video to be detected may be determined according to the similarity score. For example, if the target object feature similarity score of an object exceeds a threshold, it is a target in the video to be detected.
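As a simple illustration of the thresholding described above (the threshold value 0.5 is only an example and is not part of the disclosure):

```python
def select_targets(pipelines, scores, threshold=0.5):
    """Keep the video pipelines whose similarity score with respect to the
    support picture features exceeds the threshold."""
    return [(pipeline, float(score))
            for pipeline, score in zip(pipelines, scores)
            if score > threshold]
```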
Fig. 5 illustrates a flowchart of a video object detection method according to another exemplary embodiment of the present disclosure.
Referring to fig. 5, in step S501, a video to be detected is acquired.
In step S502, a target in the video to be detected is detected based on the support picture feature of the video to be detected by using a video target detection model trained by using a training method according to the present disclosure (such as the training method described with reference to fig. 2 or fig. 3).
In the trained video target detection model, the features of an input video are extracted using a backbone network, the video pipeline generation module is used to generate a video pipeline for each object, and the features in the video pipelines are subjected to region interception and averaging; finally, the support picture features of the video to be detected are obtained, and the averaged features and the support picture features are input into the matching module together to obtain the matching score of each video pipeline. These video pipelines with matching scores serve as the final detection results. The backbone network is used to extract features and may be, for example, but not limited to, ResNet50, ResNet101, ResNet200, ResNeXt101, ResNeSt101, ResNet18, MobileNet, SqueezeNet, etc. The video pipeline generation module is used to generate video pipelines and may be a multi-target tracking network, which may be, for example, but not limited to, CTracker. The matching module may be a matching module based on video pipeline features, such as, but not limited to, a multi-relationship matching module, a convolution-based matching module, a cosine-distance-based matching module, or a dot-product-based matching module.

The training method of the video target detection model and the video target detection method according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 5. Hereinafter, a training apparatus of a video target detection model, a video object detection apparatus, and units thereof according to exemplary embodiments of the present disclosure will be described with reference to fig. 6 to 9.
FIG. 6 shows a block diagram of a training apparatus for a video target detection model according to an example embodiment of the present disclosure.
Referring to fig. 6, the training apparatus of the video object detection model includes a training data acquisition unit 61, a feature acquisition unit 62, a feature matching unit 63, a first loss determination unit 64, and a model training unit 65.
The training data acquisition unit 61 is configured to acquire a training sample video, a support picture of the training sample video, and a real target detection result.
The feature acquisition unit 62 is configured to acquire a target object feature and a support picture feature of the training sample video from the support pictures of the training sample video and the training sample video, respectively.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: generating a video pipeline of objects in the training sample video; and intercepting the features in the video pipeline as the target object features of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: performing feature extraction on the training sample video to obtain features of the training sample video; generating a video pipeline for each object in the training sample video based on features of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: tracking each object in the training sample video based on the characteristics of the training sample video to obtain the location of each object in each frame; a video pipeline for each object in the training sample video is generated based on the location of each object in each frame.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: and forming a track pipeline of each object in the whole training sample video as a video pipeline based on the positioning of each object in each frame.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: performing feature extraction on the support pictures of the training sample video, wherein the number of the support pictures of the training sample video is more than or equal to 1; intercepting the feature area of the extracted features; and carrying out averaging operation on the intercepted features to obtain the features of the support picture.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: generating a video pipeline of objects in the training sample video; intercepting the features in the video pipeline as video pipeline features; acquiring a timing alignment feature from a training sample; and fusing the video pipeline characteristics and the time sequence alignment characteristics, and taking the fused characteristics as the target object characteristics of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 62 may be configured to: extracting the characteristics of different frame images of the same object in the training sample video; intercepting the features in the region of the target object from the extracted features; and averaging the features in the target object regions of the different frame images to obtain a time sequence alignment feature.
The feature matching unit 63 is configured to match the target object feature with the support picture feature, obtaining a matching result.
The first loss determination unit 64 is configured to determine a first loss of the video object detection model based on the matching result and the true object detection result.
In an exemplary embodiment of the present disclosure, the first loss may include at least one of a matching loss and a position loss.
The model training unit 65 is configured to train the video object detection model based on the determined first loss.
Fig. 7 shows a block diagram of a training apparatus of a video target detection model according to another exemplary embodiment of the present disclosure.
Referring to fig. 7, the training apparatus of the video object detection model includes a training data acquisition unit 71, a feature acquisition unit 72, a feature matching unit 73, a first loss determination unit 74, a support picture classification unit 75, a second loss determination unit 76, and a model training unit 77.
The training data acquisition unit 71 is configured to acquire a training sample video, a support picture of the training sample video, and a real target detection result.
The feature acquisition unit 72 is configured to acquire a target object feature and a support picture feature of the training sample video from the support pictures of the training sample video and the training sample video, respectively.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: generating a video pipeline of the object in the training sample video, wherein the video pipeline contains the same object in at least one frame of the training sample video; and intercepting the features in the video pipeline as the target object features of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: performing feature extraction on the training sample video to obtain features of the training sample video; generating a video pipeline for each object in the training sample video based on features of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: tracking each object in the training sample video based on the characteristics of the training sample video to obtain the location of each object in each frame; a video pipeline for each object in the training sample video is generated based on the location of each object in each frame.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: and forming a track pipeline of each object in the whole training sample video as a video pipeline based on the positioning of each object in each frame.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: carrying out feature extraction on the support picture of the training sample video; intercepting the feature area of the extracted features; and carrying out averaging operation on the intercepted features to obtain the features of the support picture. Here, the number of the support pictures of the training sample video is 1 or more.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: generating a video pipeline of objects in the training sample video; intercepting the features in the video pipeline as video pipeline features; acquiring a timing alignment feature from a training sample; and fusing the video pipeline characteristics and the time sequence alignment characteristics, and taking the fused characteristics as the target object characteristics of the training sample video.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 72 may be configured to: extracting the features of different frame images of the same object in the training sample video; intercepting the features in the region of the target object from the extracted features; and averaging the features in the target object regions of the different frame images to obtain the timing alignment feature.
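The sketch below illustrates one way the timing alignment feature could be formed and fused with the video pipeline feature. Averaging over frames follows the description above, while concatenation is an assumed fusion method, since the disclosure only states that the two features are fused.

```python
import torch

def timing_alignment_feature(object_region_feats: torch.Tensor) -> torch.Tensor:
    # object_region_feats: (T, C) features intercepted from the target object region in T frames
    return object_region_feats.mean(dim=0)               # average over the frames -> (C,)

def fuse_target_object_feature(video_pipeline_feat: torch.Tensor,
                               align_feat: torch.Tensor) -> torch.Tensor:
    # concatenation is one possible fusion; the fused feature serves as the target object feature
    return torch.cat([video_pipeline_feat, align_feat], dim=-1)
```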
The feature matching unit 73 is configured to match the target object feature with the support picture feature, obtaining a matching result.
The first loss determination unit 74 is configured to determine a first loss of the video object detection model based on the matching result and the real object detection result.
In an exemplary embodiment of the present disclosure, the first loss may include at least one of a matching loss and a position loss.
The support picture classification unit 75 is configured to classify the support picture of the training sample video based on the support picture features.
The second loss determination unit 76 is configured to determine a second loss of the video target detection model based on the classification result and the real class of the support picture of the training sample video.
The model training unit 77 is configured to adjust parameters of the video object detection model based on the determined first loss and second loss.
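For illustration, a minimal sketch of combining the two losses during training is shown below. The concrete loss functions (cross-entropy for matching and for support picture classification, smooth L1 for position) and the equal weighting are assumptions; the disclosure only requires a first loss based on the matching result and the real target detection result, and a second loss based on the support picture classification result and its real class.

```python
import torch.nn.functional as F

def training_loss(match_logits, match_targets,          # matching result vs. real target classes
                  pred_boxes, gt_boxes,                  # predicted vs. real target boxes
                  support_cls_logits, support_labels):   # support picture classification
    matching_loss = F.cross_entropy(match_logits, match_targets)
    position_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    first_loss = matching_loss + position_loss           # first loss: matching loss and position loss
    second_loss = F.cross_entropy(support_cls_logits, support_labels)
    return first_loss + second_loss                      # parameters are adjusted with both losses
```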
Fig. 8 illustrates a block diagram of a video object detection apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 8, the video object detection apparatus includes a video acquisition unit 81, a feature acquisition unit 82, and a feature matching unit 83.
The video acquisition unit 81 is configured to acquire a video to be detected.
The feature acquisition unit 82 is configured to acquire the support picture feature of the video to be detected, and to acquire the target object feature of the video to be detected from the video to be detected.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 82 may be configured to: generating a video pipeline of the object in the video to be detected, wherein the video pipeline comprises the same object in at least one frame of the video to be detected; and intercepting the characteristics in the video pipeline as the characteristics of the target object of the video to be detected.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 82 may be configured to: extracting the characteristics of the video to be detected to obtain the characteristics of the video to be detected; and generating a video pipeline of each object in the video to be detected based on the characteristics of the video to be detected.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 82 may be configured to: tracking each object in the video to be detected based on the characteristics of the video to be detected to obtain the positioning of each object in each frame; and generating a video pipeline of each object in the video to be detected based on the positioning of each object in each frame.
In an exemplary embodiment of the present disclosure, the feature acquisition unit 82 may be configured to: forming a track pipeline of each object over the whole video to be detected as the video pipeline, based on the positioning of each object in each frame.
The feature matching unit 83 is configured to match the features of the target object with the features of the support picture to obtain the target in the video to be detected.
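A minimal sketch of this matching step is given below: each candidate target object feature (one per video pipeline) is compared with the support picture feature, and candidates whose similarity exceeds a threshold are kept as detected targets. Cosine similarity and the threshold value are illustrative assumptions; the disclosure does not fix a particular matching function.

```python
import torch
import torch.nn.functional as F

def match_targets(target_feats: torch.Tensor,   # (M, C): one feature per video pipeline
                  support_feat: torch.Tensor,   # (C,): support picture feature
                  threshold: float = 0.5):
    sims = F.cosine_similarity(target_feats, support_feat.unsqueeze(0), dim=1)  # (M,)
    keep = sims > threshold                     # pipelines matching the supported category
    return keep.nonzero(as_tuple=True)[0], sims[keep]  # matched pipeline indices and scores
```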
Fig. 9 illustrates a block diagram of a video object detection apparatus according to another exemplary embodiment of the present disclosure.
Referring to fig. 9, the video object detection apparatus includes a video acquisition unit 91 and an object detection unit 92.
The video acquisition unit 91 is configured to acquire a video to be detected.
The target detection unit 92 is configured to detect a target in the video to be detected based on the support picture features of the video to be detected, using a video target detection model trained with a training method according to the present disclosure (such as the training method described with reference to fig. 2 or fig. 3).
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The training apparatus for the video target detection model and the video target detection apparatus according to the exemplary embodiment of the present disclosure have been described above with reference to fig. 6 to 9. Next, an electronic apparatus according to an exemplary embodiment of the present disclosure is described with reference to fig. 10.
Fig. 10 is a block diagram of an electronic device 1000 according to an example embodiment of the present disclosure.
Referring to fig. 10, the electronic device 1000 comprises at least one memory 1001 and at least one processor 1002, the at least one memory 1001 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 1002, perform a method of video object detection according to an exemplary embodiment of the present disclosure.
For example, the electronic device 1000 may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. The electronic device 1000 need not be a single electronic device, but may be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 1000 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote system (e.g., via wireless transmission).
In the electronic device 1000, the processor 1002 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 1002 may execute instructions or code stored in the memory 1001, wherein the memory 1001 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 1001 may be integrated with the processor 1002, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 1001 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1001 and the processor 1002 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., so that the processor 1002 can read files stored in the memory.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.
There is also provided, in accordance with an exemplary embodiment of the present disclosure, a computer-readable storage medium, such as the memory 1001, including instructions executable by the processor 1002 of the electronic device 1000 to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, which comprises computer programs/instructions, which when executed by a processor, implement the method of video object detection according to an exemplary embodiment of the present disclosure.
The training method and apparatus of the video target detection model and the video target detection method and apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 10. However, it should be understood that the training apparatus of the video object detection model, the video object detection apparatus, and the units thereof shown in fig. 6 to 9 may each be configured as software, hardware, firmware, or any combination thereof to perform a specific function; the electronic device shown in fig. 10 is not limited to the components illustrated above, as components may be added or removed as needed, and the above components may also be combined.
The training method and apparatus for the video target detection model according to the present disclosure can improve the detection effect of a small-sample (few-shot) target detection model on video data and improve the performance of the model, so that the model can be effectively applied to different video service scenarios, such as video target detection data labeling (particularly suitable for cold start and labeling of new samples of new categories).
In addition, with the video target detection method and apparatus according to the present disclosure, a user only needs to provide a small number of support pictures to detect all objects of the same category in a video, which greatly saves labeling time and cost, simplifies model deployment and the detection flow for new categories, and improves user experience.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a video target detection model is characterized by comprising the following steps:
acquiring a training sample video, a support picture of the training sample video and a real target detection result;
respectively obtaining the target object characteristics and the support picture characteristics of the training sample video from the training sample video and the support pictures of the training sample video;
matching the target object characteristics with the support picture characteristics to obtain a matching result;
determining a first loss of the video target detection model based on the matching result and the real target detection result;
training the video target detection model based on the determined first loss.
2. The training method according to claim 1, wherein the step of obtaining the target object feature of the training sample video from the training sample video comprises:
generating a video pipeline of the objects in the training sample video, wherein the video pipeline contains the same object in at least one frame of the training sample video;
and intercepting the features in the video pipeline as the target object features of the training sample video.
3. The training method of claim 2, wherein the step of generating a video pipeline of objects in the training sample video comprises:
performing feature extraction on the training sample video to obtain features of the training sample video;
generating a video pipeline for each object in the training sample video based on features of the training sample video.
4. The training method of claim 3, wherein the step of generating a video pipeline for each object in the training sample video based on the features of the training sample video comprises:
tracking each object in the training sample video based on the characteristics of the training sample video to obtain the location of each object in each frame;
a video pipeline for each object in the training sample video is generated based on the location of each object in each frame.
5. The training method of claim 4, wherein the step of generating a video pipeline for each object in the training sample video based on the location of each object in each frame comprises:
forming a track pipeline of each object in the whole training sample video as a video pipeline based on the positioning of each object in each frame.
6. A method for video object detection, comprising:
acquiring a video to be detected;
acquiring the support picture characteristics of the video to be detected, and acquiring the target object characteristics of the video to be detected from the video to be detected;
and matching the characteristics of the target object with the characteristics of the support picture to obtain the target in the video to be detected.
7. An apparatus for training a video object detection model, comprising:
the training data acquisition unit is configured to acquire a training sample video, a support picture of the training sample video and a real target detection result;
a feature acquisition unit configured to acquire a target object feature and a support picture feature of the training sample video from support pictures of the training sample video and the training sample video, respectively;
the characteristic matching unit is configured to match the target object characteristic with the support picture characteristic to obtain a matching result;
a first loss determination unit configured to determine a first loss of the video target detection model based on the matching result and the real target detection result; and
a model training unit configured to train the video target detection model based on the determined first loss.
8. A video object detection apparatus, comprising:
a video acquisition unit configured to acquire a video to be detected;
the feature acquisition unit is configured to acquire the support picture features of the video to be detected and acquire the target object features of the video to be detected from the video to be detected; and
the characteristic matching unit is configured to match the characteristics of the target object with the characteristics of the support picture to obtain the target in the video to be detected.
9. An electronic device/server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
10. A computer-readable storage medium, storing a computer program, which, when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 6.
CN202110294961.8A 2021-03-19 2021-03-19 Training method of video target detection model and video target detection method and device Pending CN113065422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294961.8A CN113065422A (en) 2021-03-19 2021-03-19 Training method of video target detection model and video target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110294961.8A CN113065422A (en) 2021-03-19 2021-03-19 Training method of video target detection model and video target detection method and device

Publications (1)

Publication Number Publication Date
CN113065422A true CN113065422A (en) 2021-07-02

Family

ID=76562324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294961.8A Pending CN113065422A (en) 2021-03-19 2021-03-19 Training method of video target detection model and video target detection method and device

Country Status (1)

Country Link
CN (1) CN113065422A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389030A (en) * 2018-08-23 2019-02-26 平安科技(深圳)有限公司 Facial feature points detection method, apparatus, computer equipment and storage medium
CN109919975A (en) * 2019-02-20 2019-06-21 中国人民解放军陆军工程大学 A kind of wide area monitoring moving target correlating method based on coordinate calibration
CN110176024A (en) * 2019-05-21 2019-08-27 腾讯科技(深圳)有限公司 Method, apparatus, equipment and the storage medium that target is detected in video
WO2020233397A1 (en) * 2019-05-21 2020-11-26 腾讯科技(深圳)有限公司 Method and apparatus for detecting target in video, and computing device and storage medium
CN110807437A (en) * 2019-11-08 2020-02-18 腾讯科技(深圳)有限公司 Video granularity characteristic determination method and device and computer-readable storage medium
CN112101114A (en) * 2020-08-14 2020-12-18 中国科学院深圳先进技术研究院 Video target detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10776970B2 (en) Method and apparatus for processing video image and computer readable medium
US10885365B2 (en) Method and apparatus for detecting object keypoint, and electronic device
CN108227912B (en) Device control method and apparatus, electronic device, computer storage medium
TWI773189B (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
US10936911B2 (en) Logo detection
WO2018177379A1 (en) Gesture recognition, gesture control and neural network training methods and apparatuses, and electronic device
CN110853033B (en) Video detection method and device based on inter-frame similarity
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
Wang et al. A novel image thresholding method based on Parzen window estimate
CN105426356B (en) A kind of target information recognition methods and device
CN109446889B (en) Object tracking method and device based on twin matching network
US9679380B2 (en) Emotion modification for image and video content
Fang et al. Efficient and robust fragments-based multiple kernels tracking
CN111723815B (en) Model training method, image processing device, computer system and medium
CN110363220B (en) Behavior class detection method and device, electronic equipment and computer readable medium
CN111814655B (en) Target re-identification method, network training method thereof and related device
US11501110B2 (en) Descriptor learning method for the detection and location of objects in a video
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN112348828A (en) Example segmentation method and device based on neural network and storage medium
CN111178146A (en) Method and device for identifying anchor based on face features
CN108921138B (en) Method and apparatus for generating information
Chen et al. Learning to rank retargeted images
CN111292333A (en) Method and apparatus for segmenting an image
CN109034085B (en) Method and apparatus for generating information
TWI803243B (en) Method for expanding images, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination