CN112949352A - Training method and device of video detection model, storage medium and electronic equipment - Google Patents

Training method and device of video detection model, storage medium and electronic equipment

Info

Publication number
CN112949352A
Authority
CN
China
Prior art keywords
training
detection model
video detection
video
key frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911256542.4A
Other languages
Chinese (zh)
Other versions
CN112949352B (en)
Inventor
蒋正锴
王国利
张骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201911256542.4A priority Critical patent/CN112949352B/en
Publication of CN112949352A publication Critical patent/CN112949352A/en
Application granted granted Critical
Publication of CN112949352B publication Critical patent/CN112949352B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A training method and device for a video detection model, a storage medium and an electronic device are disclosed. The method comprises the following steps: determining a preset relationship between key frames and non-key frames in a plurality of training videos; acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and training the video detection model according to the plurality of training samples. With this technical scheme, the trained video detection model can accurately identify each frame of image in the video, so that video detection precision can be effectively improved.

Description

Training method and device of video detection model, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method and device of a video detection model, a storage medium and electronic equipment.
Background
Video detection is an important application technology with significant application prospects in fields such as automatic driving and security. How to achieve fast and accurate detection is a key research target of video detection.
In prior-art video detection schemes, objects in a video are mainly tracked and labeled by using optical flow to align features between different frames. Specifically, existing video detection schemes train an end-to-end video detection model in which a detection network model and an optical flow network model are trained together. In use, the video to be detected is input into the video detection model, and the video detection model outputs the information of the objects detected in each frame of the video together with their corresponding labels.
However, the optical flow network module trained inside such a video detection model no longer computes optical flow in the traditional sense. This not only affects the speed of video detection; the optical flow network is also often not accurate enough, resulting in low video detection precision.
Disclosure of Invention
In order to solve the technical problem, the invention provides a training method and device of a video detection model, a storage medium and an electronic device.
According to an aspect of the present application, there is provided a training method of a video detection model, including:
determining a preset relationship between key frames and non-key frames in a plurality of training videos;
acquiring a plurality of training samples from the plurality of training videos based on a preset relation between the key frames and the non-key frames;
and training the video detection model according to the plurality of training samples.
According to another aspect of the present application, there is provided a training apparatus for a video detection model, including:
the determining module is used for determining a preset relation between key frames and non-key frames in a plurality of training videos;
the acquisition module is used for acquiring a plurality of training samples from the plurality of training videos based on a preset relation between the key frames and the non-key frames;
and the training module is used for training the video detection model according to the plurality of training samples.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program for executing the method of any of the above.
According to another aspect of the present application, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to perform any of the methods described above.
The training method of the video detection model provided by the embodiment of the application determines a preset relationship between key frames and non-key frames in a plurality of training videos, acquires a plurality of training samples from the plurality of training videos based on the preset relationship, and trains the video detection model according to the training samples. A video detection model trained by this scheme can identify each frame of image in the video more accurately, so that video detection precision can be effectively improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart of a first embodiment of a method for training a video detection model according to the present invention.
Fig. 2 is a schematic structural diagram of a video detection model according to this embodiment.
Fig. 3 is a flowchart of a second embodiment of the training method of the video detection model of the present invention.
Fig. 4 is a flowchart of an embodiment of a video detection method according to the present embodiment.
FIG. 5 is a block diagram of a first embodiment of a training apparatus for video detection models according to the present invention.
FIG. 6 is a block diagram of a second embodiment of the training apparatus for video detection models according to the present invention.
Fig. 7 is a block diagram of an embodiment of a video detection apparatus according to the present invention.
FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
The video detection scheme of the present application adopts a novel video detection model that may include a motion prior learning module and a detection network module trained end to end. With the motion prior learning module, the spatio-temporal features of different frames can be learned in a targeted manner, so that video detection is performed more accurately. The video detection scheme of the application can be applied to fields such as automatic driving and security to label objects in each frame image of a video. For example, in the technical scheme of the application, a bounding box of each detected object can be labeled in each frame image, where the bounding box can be understood as frame information of the region where the object is located; the predicted label information of the object in each image also needs to be labeled, so that objects in the video can be tracked, analyzed and so on based on the video detection results.
Exemplary System
The training scheme of the video detection model can be deployed in a training apparatus for the video detection model. This training apparatus can be a physical electronic device such as a large computer, or can be implemented as an integrated software application. In use, the video detection model is trained in this training apparatus to obtain a video detection model with high detection precision. The trained video detection model can then be deployed in a specific usage scenario. For example, in the field of automatic driving, the video detection model can be installed in an unmanned vehicle to detect objects in the video captured by the vehicle. In the security field, the video detection model can be installed in a security system to detect the surveillance video captured by a camera. Similarly, the trained video detection model can be deployed in other fields so that the acquired video can be detected to meet the application of the corresponding scene.
Exemplary method
Fig. 1 is a flowchart of a first embodiment of a method for training a video detection model according to the present invention. As shown in fig. 1, the training method of the video detection model of this embodiment may specifically include the following steps:
S101, determining a preset relationship between key frames and non-key frames in a plurality of training videos.
The training apparatus for the video detection model of this embodiment needs to determine the preset relationship between key frames and non-key frames before training. For example, the preset relationship may be defined by a preset frame-interval number: the 0th frame of the video is set as a key frame and, with a preset interval of 5, every 5th frame thereafter is also set as a key frame, i.e., frames 0, 5, 10, 15, and so on, while the remaining frames are non-key frames, as sketched below.
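As a minimal illustration of such a preset relationship (the interval value of 5 and the helper name below are example assumptions for illustration only, not part of the claimed method), the key-frame and non-key-frame indices of a video could be enumerated as follows:

    def split_key_and_non_key_frames(num_frames, interval=5):
        """Split frame indices into key frames and non-key frames.

        Frame 0 is a key frame, and every interval-th frame thereafter
        (0, 5, 10, 15, ...) is also a key frame; all remaining frames
        are non-key frames. The interval of 5 is an assumed example.
        """
        key_frames = list(range(0, num_frames, interval))
        non_key_frames = [i for i in range(num_frames) if i % interval != 0]
        return key_frames, non_key_frames

    key, non_key = split_key_and_non_key_frames(20)
    print(key)      # [0, 5, 10, 15]
    print(non_key)  # [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19]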
S102, collecting a plurality of training samples from a plurality of training videos based on a preset relation between the key frames and the non-key frames.
The training samples of this embodiment are selected based on the preset relationship between the key frames and the non-key frames, and some key frames and some non-key frames are selected to form the training samples together.
It should be noted that, in order to cover the various situations that may occur in a video, the training samples collected in this embodiment may cover all possible combinations of two key frames and one non-key frame in the video.
For example, based on the preset relationship, two key frame images arranged in sequence and a non-key frame image positioned after them can be extracted from the plurality of training videos; the region information of the target object labeled in the non-key frame image and the label information corresponding to the target object are then acquired. Each training sample therefore comprises two key frame images and one non-key frame image, with the two key frame images earlier in the time sequence and the non-key frame image later. Since the training in this embodiment is supervised, each training sample further includes the region information of the training object and the corresponding label information labeled in the image data of the non-key frame, which are used later to calculate the detection loss. In the training sample, it is therefore also necessary to determine the region information of the target object and the corresponding label (Label) annotated in the non-key frame image; this information may be labeled manually by a worker and is obtained directly when the training sample is collected. The region information may be a frame marked in the non-key frame image to enclose the target object, and this frame may be a rectangle, a square, a circle, or any other regular or irregular shape, i.e., any shape capable of enclosing the target object. If the non-key frame image includes a plurality of target objects, each target object has its own region information, and a plurality of frames are correspondingly required to enclose the corresponding target objects, so that one non-key frame image may include a plurality of frames. In order to track moving objects when the video detection model is applied, in this embodiment a label also needs to be attached to the region information of each target object, i.e., the label is used to uniquely identify the target object in the frame. For example, the label may be represented by at least one of letters, numbers, and other symbols.
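The following sketch shows one way such a training sample could be assembled as a data structure, assuming the frames are available as decoded image arrays and the region information and labels have already been annotated manually; all class and field names here are hypothetical and used only for illustration:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    import numpy as np

    @dataclass
    class LabeledObject:
        box: Tuple[int, int, int, int]   # region information, e.g. (x1, y1, x2, y2)
        label: str                       # unique identifier of the target object

    @dataclass
    class TrainingSample:
        key_frame_1: np.ndarray          # earlier key frame image
        key_frame_2: np.ndarray          # later key frame image
        non_key_frame: np.ndarray        # non-key frame positioned after both key frames
        annotations: List[LabeledObject] = field(default_factory=list)

    # Usage with an assumed key-frame interval of 5 (frames would come from a decoded video):
    # sample = TrainingSample(frames[0], frames[5], frames[7],
    #                         [LabeledObject((34, 50, 120, 200), "car_01")])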
S103, training a video detection model according to a plurality of training samples.
In this embodiment, when the plurality of training samples are collected from the plurality of training videos, one, two, or more training samples may be collected from each training video. The number of collected training samples can reach the million level or more; the more training samples there are, the more accurate the trained video detection model is.
Fig. 2 is a schematic structural diagram of a video detection model according to this embodiment. As shown in fig. 2, the video detection model of this embodiment includes two parts, namely a motion prior learning module and a detection network module. The motion prior learning module is configured to obtain the feature information of the current frame to be predicted based on the feature information of different frames in the video, combining the motion knowledge contained in consecutive frames; the detection network module is configured to detect the information of an object and its corresponding label based on the feature information of the current frame. Both modules are modeled with convolutional neural networks to realize their respective functions. During training, the motion prior learning module and the detection network module in the video detection model are trained together end to end.
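The patent does not give the network architecture in code. The PyTorch-style sketch below only illustrates the described two-module structure (a convolutional motion prior learning module followed by a convolutional detection network module, trained together end to end); the layer configuration, channel counts, output parameterization and all names are assumptions, and for brevity the sketch fuses only the two key-frame features and omits the offset inputs discussed later:

    import torch
    import torch.nn as nn

    class MotionPriorLearningModule(nn.Module):
        """Fuses the features of two key frames into feature information
        for the frame to be predicted (illustrative structure only)."""
        def __init__(self, in_channels=3, feat_channels=64):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
            )
            # fusion over the concatenated features of the two key frames
            self.fuse = nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1)

        def forward(self, key1, key2):
            f1, f2 = self.backbone(key1), self.backbone(key2)
            return self.fuse(torch.cat([f1, f2], dim=1))

    class DetectionNetworkModule(nn.Module):
        """Predicts region information and label scores from the fused features."""
        def __init__(self, feat_channels=64, num_labels=10):
            super().__init__()
            self.box_head = nn.Conv2d(feat_channels, 4, 1)           # box regression
            self.cls_head = nn.Conv2d(feat_channels, num_labels, 1)  # label prediction

        def forward(self, feats):
            return self.box_head(feats), self.cls_head(feats)

    class VideoDetectionModel(nn.Module):
        """End-to-end model: both sub-modules are trained together."""
        def __init__(self):
            super().__init__()
            self.motion_prior = MotionPriorLearningModule()
            self.detector = DetectionNetworkModule()

        def forward(self, key1, key2):
            feats = self.motion_prior(key1, key2)
            return self.detector(feats)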
Fig. 3 is a flowchart of a second embodiment of the training method of the video detection model of the present invention. As shown in fig. 3, this embodiment describes in detail the training process of S103 in the embodiment shown in fig. 1, in which the video detection model is trained according to a plurality of training samples. The training method of the video detection model of this embodiment may specifically include the following steps:
S201, selecting a training sample from the plurality of training samples, and inputting the image data of the two key frames and the image data of the non-key frame in the corresponding training sample into the motion prior learning module.
It should be noted that, before training, it is necessary to perform random initialization on parameters in the motion prior learning module and the detection network module. Then, these parameters are trained according to the training mode of this embodiment.
S202, acquiring the feature information of the non-key frame obtained by the motion prior learning module performing feature fusion on the image data of the two key frames.
In the present embodiment, one training sample may be selected for each training step. Specifically, the training sample is first input into the motion prior learning module, so that the motion prior learning module learns to fuse the features of the two key frames toward the non-key frame. Correspondingly, the feature information of the non-key frame, obtained by the motion prior learning module performing feature fusion on the image data of the two key frames, can then be acquired.
In addition, it should be noted that because the two key frames and the non-key frame in the training sample of this embodiment are obtained according to the preset relationship described above, the preset relationship includes the offset between the two key frames and the offsets between the non-key frame and each of the key frames. These offsets may be obtained from the corresponding frame images, or may be labeled in advance based on the preset relationship, and during training the motion prior learning module can acquire them; a small illustration of such offsets follows this paragraph. For example, the motion prior learning module may learn the feature information of the second key frame based on the image data of the first key frame and the offset between the two key frames, and correct the learned feature information of the second key frame based on the image data of the second key frame. By learning this function, the motion prior learning module learns to track from the image of one key frame to the next key frame. Moreover, based on the image data of the first key frame, the image data of the second key frame, and the acquired offsets between the non-key frame and the two key frames, feature fusion can be performed on the image data of the two key frames to obtain the feature information of the non-key frame. Because the feature information of the later non-key frame is obtained by fusing the image data of the two preceding key frames, the accuracy of the feature information of the non-key frame can be fully guaranteed. By learning this function, the motion prior learning module learns to track the image of the following non-key frame according to the images of the two preceding key frames.
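For instance, with the assumed key-frame interval of 5 used earlier, a sample built from key frames 0 and 5 and non-key frame 7 would carry the following frame-index offsets; this is only a hypothetical encoding of the offsets, since the patent does not fix their representation:

    def sample_offsets(key1_idx, key2_idx, non_key_idx):
        """Frame-index offsets provided to the motion prior learning module
        (an assumed encoding, shown for illustration only)."""
        return {
            "key_to_key": key2_idx - key1_idx,          # offset between the two key frames
            "non_key_to_key1": non_key_idx - key1_idx,  # offset of the non-key frame from key frame 1
            "non_key_to_key2": non_key_idx - key2_idx,  # offset of the non-key frame from key frame 2
        }

    print(sample_offsets(0, 5, 7))
    # {'key_to_key': 5, 'non_key_to_key1': 7, 'non_key_to_key2': 2}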
S203, inputting the feature information of the non-key frame obtained by fusion into the detection network module.
S204, acquiring the area information and the label information of the training object in the image data of the non-key frame predicted by the detection network module.
The feature information of the non-key frame, obtained by fusion in the motion prior learning module, is input into the detection network module, so that the detection network module can predict and output the region information and the label information of the training object in the image data of the non-key frame based on the input feature information.
S205, calculating the detection loss according to the predicted area information and label information of the training object, and the labeled area information and the corresponding label information of the training object.
Since the sample data includes the region information and the corresponding label information of the training object labeled in the image data of the non-key frame, the detection loss is calculated based on the predicted region information and label information of the training object on the one hand and the labeled region information and corresponding label information in the sample data on the other hand.
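The patent does not specify the form of the detection loss. A common supervised formulation, shown here purely as an illustrative assumption, combines a box-regression term over the region information with a classification term over the labels:

    import torch
    import torch.nn.functional as F

    def detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels, box_weight=1.0):
        """Illustrative detection loss (assumed formulation): smooth-L1 on the
        predicted region information plus cross-entropy on the predicted labels."""
        box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
        cls_loss = F.cross_entropy(pred_logits, gt_labels)
        return box_weight * box_loss + cls_loss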
S206, judging whether the detection loss is converged; if not, go to step S207; otherwise, step S208 is performed.
S207, parameters in the motion prior learning module and the detection network module are adjusted by adopting a gradient descent method; returning to step S201, the next training sample is selected, and training is continued.
S208, judging whether the detection loss has remained converged over a consecutive preset number of training rounds; if so, determining the parameters in the motion prior learning module and the detection network module, thereby determining the motion prior learning module and the detection network module, and further determining the video detection model; otherwise, returning to step S201 to select the next training sample and continue training.
In this embodiment, when the video detection model is trained for the first time with one training sample, the detection loss is calculated for the first time, so it cannot yet be determined whether the detection loss has converged; it is therefore treated as not converged, the next training sample is selected directly, and training continues according to the above steps. For subsequent training steps, the detection loss has already been calculated before, so its convergence can be judged in combination with the previous results. To prevent small fluctuations from affecting the training result, in this embodiment the detection loss may be considered converged when its value has remained at its minimum over a consecutive preset number of rounds, such as 100, 80, or some other number, and no longer decreases toward 0. At this point, the parameters in the motion prior learning module and the detection network module after the last adjustment are taken as the parameters of the trained motion prior learning module and detection network module, so that the two modules, and thus the video detection model, are determined.
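One simple way to implement this convergence criterion, under the assumption that "converged" means the loss has not improved on its running minimum for a preset number of consecutive steps (100 in this sketch), is the following:

    class ConvergenceChecker:
        """Declares convergence when the detection loss has not improved on its
        best value for `patience` consecutive steps (an illustrative criterion)."""
        def __init__(self, patience=100):
            self.patience = patience
            self.best = float("inf")
            self.steps_since_improvement = 0

        def update(self, loss_value):
            if loss_value < self.best:
                self.best = loss_value
                self.steps_since_improvement = 0
            else:
                self.steps_since_improvement += 1
            return self.steps_since_improvement >= self.patience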
The training of this embodiment converges end to end: whenever the detection loss has not converged, the parameters in the motion prior learning module and the detection network module are adjusted simultaneously.
In the training of this embodiment, if enough training samples are collected, the detection loss may converge within a single round of training; if the collected training samples are not enough, the same training samples may be used for two or more rounds of training until the detection loss converges.
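Tying the hypothetical pieces above together, an end-to-end training loop in this spirit could look as follows; the optimizer choice, the learning rate, and the assumption that the data source yields tensors already shaped to match the model outputs are all illustrative, not taken from the patent:

    import torch

    def train(model, samples, lr=0.01, patience=100):
        """Illustrative end-to-end training loop reusing the sketched
        VideoDetectionModel, detection_loss and ConvergenceChecker."""
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
        checker = ConvergenceChecker(patience=patience)
        while True:  # keep running rounds over the samples until convergence
            for key1, key2, non_key, gt_boxes, gt_labels in samples:
                pred_boxes, pred_logits = model(key1, key2)
                loss = detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels)
                optimizer.zero_grad()
                loss.backward()      # adjusts both modules' parameters together
                optimizer.step()
                if checker.update(loss.item()):
                    return model     # detection loss has converged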
According to the training method of the video detection model of this embodiment, the motion prior learning module and the detection network module in the video detection model are trained with the above scheme, so that the motion prior learning module learns to perform feature fusion based on two key frames and thus to predict the features of any subsequent non-key frame, allowing each frame image in the video to be detected more accurately. With the technical scheme of this embodiment, the trained video detection model detects videos more accurately, and video detection precision can be effectively improved.
In addition, the video detection model of this embodiment adopts the motion prior learning module, which has fewer parameters than the existing optical flow network module, so the training speed of the video detection model can be further increased and the precision of the trained video detection model improved.
Moreover, the video detection model of this embodiment is trained in an end-to-end manner, yielding an end-to-end video detection model: the motion prior learning module and the detection network module it contains are trained together. In use, neither module outputs a result on its own; the whole video detection model outputs only the final result for a given input, i.e., one problem is solved in a single step. This end-to-end implementation introduces no accumulated error, so the precision of video detection can be effectively improved.
Fig. 4 is a flowchart of an embodiment of a video detection method according to the present embodiment. As shown in fig. 4, the video detection method of this embodiment may specifically include the following steps:
s301, acquiring a video to be detected.
S302, according to the video and a pre-trained video detection model, acquiring area information and corresponding labels of objects detected from each frame of image of the video; the video detection model is formed by end-to-end training based on a detection network module and a motion prior learning module.
The video detection method of this embodiment may specifically be a method of using the video detection model trained in the above embodiments.
The video to be detected in this embodiment can be a video to be detected in fields such as unmanned vehicles and security. The video detection model in this embodiment is formed by end-to-end training of a detection network module and a motion prior learning module. The motion prior learning module can learn feature fusion among different frames, and can therefore accurately identify the image of the current frame based on the images of the key frames preceding it. As a result, even when the object image is not clear enough because the object moves too fast in the video, the object can still be accurately identified.
In use, the video to be detected is directly input into the pre-trained video detection model, and the motion prior learning module and the detection network module in the video detection model identify the object in each frame of image of the video and output the region information of the object in that frame and its corresponding label. The region information of the object may be a bounding box of the object, and the label of the object may be a unique identifier of the object; it may specifically be represented by any one of numbers, letters, special symbols, and Chinese characters, or by a combination of at least two of them.
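A sketch of how the trained model could be applied at inference time, reusing the hypothetical VideoDetectionModel above; the key-frame caching strategy and the fixed interval are assumptions made for illustration:

    import torch

    @torch.no_grad()
    def detect_video(model, frames, interval=5):
        """Run the hypothetical VideoDetectionModel over a list of frame tensors.

        Key frames are taken every `interval` frames; each later frame is detected
        from the features of the two most recent key frames, mirroring the training
        setup (an illustrative inference loop, not the patent's own code)."""
        model.eval()
        results = []
        key_frames = []
        for idx, frame in enumerate(frames):
            if idx % interval == 0:
                key_frames.append(frame)
                key_frames = key_frames[-2:]      # keep the two most recent key frames
            if len(key_frames) < 2:
                continue                          # not enough key-frame context yet
            boxes, labels = model(key_frames[0].unsqueeze(0), key_frames[1].unsqueeze(0))
            results.append((idx, boxes, labels))
        return results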
According to the video detection method of this embodiment, a video detection model formed by end-to-end training of a detection network module and a motion prior learning module is adopted, so that each frame of image in the video can be identified more accurately and the precision of video detection can be effectively improved.
Exemplary devices
Fig. 5 is a structural diagram of a first embodiment of a training apparatus for video detection models provided in the present invention. As shown in fig. 5, the training apparatus for video detection model of the present embodiment includes:
the determining module 11 is configured to determine a preset relationship between a key frame and a non-key frame in a plurality of training videos;
the acquisition module 12 is configured to acquire a plurality of training samples from a plurality of training videos based on a preset relationship between a key frame and a non-key frame;
and the training module 13 is used for training the video detection model according to a plurality of training samples.
In the training apparatus for a video detection model of this embodiment, the implementation principle and technical effect of implementing training of a video detection model by using the above modules are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not repeated herein.
Fig. 6 is a structural diagram of a second embodiment of the training apparatus for video detection models provided in the present invention. As shown in fig. 6, the training apparatus for video detection model of the present embodiment further introduces the technical solution of the present invention in more detail on the basis of the technical solution of the embodiment shown in fig. 5.
As shown in fig. 6, the acquisition module 12 of the present embodiment includes:
the image obtaining unit 121 is configured to extract, based on the preset relationship determined by the determining module 11, two key frame images that are arranged in sequence and a non-key frame image that is located behind the two key frame images from the plurality of training videos;
the object information acquiring unit 122 is configured to acquire region information of a target object labeled in a non-key frame image and tag information corresponding to the target object.
Further optionally, the training module 13 of this embodiment specifically includes:
the input unit 131 is configured to input, for each training sample, image data of two key frames and image data of a non-key frame in the corresponding training sample to a motion prior learning module in the video detection model;
the obtaining unit 132 is configured to obtain feature information of a non-key frame, which is obtained by performing feature fusion on the image data of two key frames by the motion prior learning module;
the input unit 131 is further configured to input the feature information of the non-key frame obtained by fusion to a detection network module in the video detection model;
the obtaining unit 132 is further configured to obtain region information and label information of a training object in the image data of the non-key frame predicted by the detection network module;
the calculating unit 133 is configured to calculate a detection loss according to the predicted region information and label information of the training object, and the region information and corresponding label information of the labeled training object;
the adjusting unit 134 is configured to adjust parameters of the video detection model based on the detection loss.
Further optionally, the training module 13 of this embodiment further includes a determining unit 135 and a determining unit 136:
the judgment unit 135 is configured to judge whether the detection loss calculated by the calculation unit 133 converges;
if the judging unit 135 judges and determines that the detection loss has not converged, the adjusting unit 134 adjusts the parameters in the motion prior learning module and the detection network module by adopting a gradient descent method;
the determining unit 136 is configured to determine parameters in the motion prior learning module and the detection network module, determine the motion prior learning module and the detection network module, and further determine the video detection model when determining the detection loss convergence.
Further optionally, the training module 13 of this embodiment further includes:
the initialization unit 137 is used for randomly initializing parameters in the motion prior learning module and the detection network module.
Correspondingly, the processing of the other units in the training module 13 is performed after the initialization by the initialization unit 137.
In the training apparatus for a video detection model of this embodiment, the implementation principle and technical effect of implementing training of a video detection model by using the above modules are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not repeated herein.
Fig. 7 is a block diagram of an embodiment of a video detection apparatus according to the present invention. As shown in fig. 7, the video detection apparatus of this embodiment may specifically include:
the obtaining module 21 is configured to obtain a video to be detected;
the detection module 22 is configured to obtain region information of an object detected from each frame of image of the video and a corresponding label according to the video obtained by the obtaining module 21 and a pre-trained video detection model; the video detection model is formed by end-to-end training based on a detection network module and a motion prior learning module.
The video detection apparatus of this embodiment implements the same principle and technical effect as the related method embodiment by using the modules, and reference may be made to the description of the related method embodiment in detail, which is not repeated herein.
Exemplary electronic device
FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 8, the electronic device 11 includes one or more processors 111 and memory 112.
The processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.
Memory 112 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 111 to implement the video detection model training methods, the video detection methods, and/or other desired functions of the various embodiments of the present application described above. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 11 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, the input device 113 may be a camera, a microphone, a microphone array, or the like as described above, for capturing an input signal of an image or a sound source. When the electronic device is a stand-alone device, the input device 113 may be a communication network connector for receiving the acquired input signals from the neural network processor.
The input device 113 may also include, for example, a keyboard, a mouse, and the like.
The output device 114 may output various information to the outside, including the determined output voltage, output current information, and the like. The output devices 114 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for the sake of simplicity, only some of the components of the electronic device 11 relevant to the present application are shown in fig. 8, and components such as a bus, an input/output interface, and the like are omitted. In addition, the electronic device 11 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of training a video detection model according to various embodiments of the present application described in the "exemplary methods" section of this specification above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method for training a video detection model according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or", unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A training method of a video detection model comprises the following steps:
determining a preset relationship between key frames and non-key frames in a plurality of training videos;
acquiring a plurality of training samples from the plurality of training videos based on a preset relation between the key frames and the non-key frames;
and training the video detection model according to the plurality of training samples.
2. The method of claim 1, wherein collecting training samples from the plurality of training videos based on the preset relationship between the key frames and non-key frames comprises:
extracting two key frame images which are sequentially arranged and a non-key frame image positioned behind the two key frame images from the plurality of training videos based on the preset relationship;
and acquiring the region information of the target object marked in the non-key frame image and the label information corresponding to the target object.
3. The method of claim 1, wherein training the video detection model based on a number of the training samples comprises:
for each training sample, inputting the image data of the two key frames and the image data of the non-key frames in the corresponding training sample to a motion prior learning module in a video detection model;
acquiring the feature information of the non-key frame obtained by the motion prior learning module performing feature fusion according to the image data of the two key frames;
inputting the feature information of the non-key frames obtained by fusion into a detection network module in the video detection model;
acquiring the area information and the label information of the training object in the image data of the non-key frame predicted by the detection network module;
calculating detection loss according to the predicted region information and the label information of the training object, and the region information and the corresponding label information of the labeled training object;
adjusting parameters of the video detection model based on the detection loss.
4. The method of claim 3, wherein adjusting parameters of the video detection model based on the detection loss comprises:
judging whether the detection loss is converged;
if not, adjusting parameters in the motion prior learning module and the detection network module by adopting a gradient descent method;
and when the convergence of the detection loss is determined, determining parameters in the motion prior learning module and the detection network module, determining the motion prior learning module and the detection network module, and further determining the video detection model.
5. The method of claim 4, wherein for each of the training samples, prior to inputting the image data of the two key frames and the image data of the non-key frame in the corresponding training sample into the motion prior learning module, the method further comprises:
and randomly initializing parameters in the motion prior learning module and the detection network module.
6. An apparatus for training a video detection model, comprising:
the determining module is used for determining a preset relation between key frames and non-key frames in a plurality of training videos;
the acquisition module is used for acquiring a plurality of training samples from the plurality of training videos based on a preset relation between the key frames and the non-key frames;
and the training module is used for training the video detection model according to the plurality of training samples.
7. The apparatus of claim 6, wherein the acquisition module comprises:
the image acquisition unit is used for extracting two key frame images which are sequentially arranged and a non-key frame image positioned behind the two key frame images from the plurality of training videos based on the preset relationship;
and the object information acquisition unit is used for acquiring the area information of the target object marked in the non-key frame image and the label information corresponding to the target object.
8. The apparatus of claim 6, wherein the training module comprises:
an input unit, configured to input, for each training sample, image data of the two key frames and image data of the non-key frame in the corresponding training sample to a motion prior learning module in the video detection model;
the acquisition unit is used for acquiring the feature information of the non-key frame obtained by the motion prior learning module performing feature fusion according to the image data of the two key frames;
the input unit is further configured to input the feature information of the non-key frame obtained through fusion into a detection network module in the video detection model;
the obtaining unit is further configured to obtain region information and the label information of the training object in the image data of the non-key frame predicted by the detection network module;
a calculation unit, configured to calculate a detection loss according to the predicted region information and the tag information of the training object, and the region information and the corresponding tag information of the labeled training object;
and the adjusting unit is used for adjusting the parameters of the video detection model based on the detection loss.
9. A computer-readable storage medium, storing a computer program for executing the method for training a video detection model according to any one of claims 1 to 5.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to perform the training method of the video detection model according to any one of the claims 1 to 5.
CN201911256542.4A 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment Active CN112949352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256542.4A CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256542.4A CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112949352A true CN112949352A (en) 2021-06-11
CN112949352B CN112949352B (en) 2024-05-24

Family

ID=76225785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256542.4A Active CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112949352B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609956A (en) * 2021-07-30 2021-11-05 北京百度网讯科技有限公司 Training method, recognition method, device, electronic equipment and storage medium
CN116778376A (en) * 2023-05-11 2023-09-19 中国科学院自动化研究所 Content security detection model training method, detection method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790097A (en) * 2010-03-05 2010-07-28 天津大学 Method for detecting multiple times of compression and coding of digital video
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
CN104159112A (en) * 2014-08-08 2014-11-19 哈尔滨工业大学深圳研究生院 Compressed sensing video transmission method and system based on dual sparse model decoding
CN107197297A (en) * 2017-06-14 2017-09-22 中国科学院信息工程研究所 A kind of video steganalysis method of the detection based on DCT coefficient steganography
CN108289248A (en) * 2018-01-18 2018-07-17 福州瑞芯微电子股份有限公司 A kind of deep learning video encoding/decoding method and device based on content forecast
CN108615241A (en) * 2018-04-28 2018-10-02 四川大学 A kind of quick estimation method of human posture based on light stream
CN110008793A (en) * 2018-01-05 2019-07-12 中国移动通信有限公司研究院 Face identification method, device and equipment
CN110443173A (en) * 2019-07-26 2019-11-12 华中科技大学 A kind of instance of video dividing method and system based on inter-frame relation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790097A (en) * 2010-03-05 2010-07-28 天津大学 Method for detecting multiple times of compression and coding of digital video
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
CN104159112A (en) * 2014-08-08 2014-11-19 哈尔滨工业大学深圳研究生院 Compressed sensing video transmission method and system based on dual sparse model decoding
CN107197297A (en) * 2017-06-14 2017-09-22 中国科学院信息工程研究所 A kind of video steganalysis method of the detection based on DCT coefficient steganography
CN110008793A (en) * 2018-01-05 2019-07-12 中国移动通信有限公司研究院 Face identification method, device and equipment
CN108289248A (en) * 2018-01-18 2018-07-17 福州瑞芯微电子股份有限公司 A kind of deep learning video encoding/decoding method and device based on content forecast
CN108615241A (en) * 2018-04-28 2018-10-02 四川大学 A kind of quick estimation method of human posture based on light stream
CN110443173A (en) * 2019-07-26 2019-11-12 华中科技大学 A kind of instance of video dividing method and system based on inter-frame relation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU PEI: "Research on Compressed Sensing Video Sequence Reconstruction Algorithm Based on AR Model", China Masters' Theses Full-text Database, Information Science and Technology, No. 06, page 4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609956A (en) * 2021-07-30 2021-11-05 北京百度网讯科技有限公司 Training method, recognition method, device, electronic equipment and storage medium
CN113609956B (en) * 2021-07-30 2024-05-17 北京百度网讯科技有限公司 Training method, recognition device, electronic equipment and storage medium
CN116778376A (en) * 2023-05-11 2023-09-19 中国科学院自动化研究所 Content security detection model training method, detection method and device
CN116778376B (en) * 2023-05-11 2024-03-22 中国科学院自动化研究所 Content security detection model training method, detection method and device

Also Published As

Publication number Publication date
CN112949352B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
JP6425856B1 (en) Video recording method, server, system and storage medium
US11030755B2 (en) Multi-spatial scale analytics
US20190294881A1 (en) Behavior recognition
KR101879735B1 (en) Method and apparatus for automatic generating training data and self-learning apparatus and method using the same
CN109086873B (en) Training method, recognition method and device of recurrent neural network and processing equipment
US11321966B2 (en) Method and apparatus for human behavior recognition, and storage medium
US20180114071A1 (en) Method for analysing media content
KR101910542B1 (en) Image Analysis Method and Server Apparatus for Detecting Object
CN108875465B (en) Multi-target tracking method, multi-target tracking device and non-volatile storage medium
CN109727275B (en) Object detection method, device, system and computer readable storage medium
KR102002812B1 (en) Image Analysis Method and Server Apparatus for Detecting Object
US9811735B2 (en) Generic object detection on fixed surveillance video
CN110998606B (en) Generating marker data for depth object tracking
US9489582B2 (en) Video anomaly detection based upon a sparsity model
EP2672396A1 (en) Method for annotating images
US11113563B2 (en) Apparatus for detecting object and method thereof
CN112949352B (en) Training method and device of video detection model, storage medium and electronic equipment
KR20220075273A (en) Method of tracking multiple objects and apparatus for the same
KR101821242B1 (en) Method for counting vehicles based on image recognition and apparatus using the same
CN110147724B (en) Method, apparatus, device, and medium for detecting text region in video
CN109145991B (en) Image group generation method, image group generation device and electronic equipment
WO2022228325A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN112784691B (en) Target detection model training method, target detection method and device
US11410443B2 (en) Labelling training method and system for implementing the same
KR20210126313A (en) Image analysis system based on route prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant