CN115082752A - Target detection model training method, device, equipment and medium based on weak supervision - Google Patents
Target detection model training method, device, equipment and medium based on weak supervision
- Publication number
- CN115082752A (Application No. CN202210596349.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- video
- network
- target detection
- frame
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The application relates to the technical field of computer vision, and provides a target detection model training method, device, equipment and medium based on weak supervision, which are used to enhance the target detection effect. The method comprises: obtaining the features of each sample image and of each frame of video image through a feature extraction network; inputting the features of each sample image into a target detection network to obtain the predicted position of a target in each sample image, and adjusting the parameters of the feature extraction network and the target detection network according to the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label; inputting the features of each frame of video image into a video frame prediction network to obtain the predicted video image of each frame of video image, and adjusting the parameters of the feature extraction network and the video frame prediction network according to the error between the predicted video image of each frame of video image and the next frame of video image; and taking the trained feature extraction network and the trained target detection network as the trained target detection model.
Description
Technical Field
The application relates to the technical field of computer vision, in particular to a target detection model training method, device, equipment and medium based on weak supervision.
Background
With the growth of massive data and the continuous improvement of computing power, deep learning has been widely applied in various fields in recent years, particularly in computer vision. Target detection technology based on deep learning is also widely applied in fields such as security monitoring and automatic driving. However, deep-learning-based target detection relies on the quantity and quality of labeled data; otherwise, the detection effect of the trained target detection model is poor.
Because manual labeling is inefficient and costly, existing target detection training methods usually generate pseudo labels from existing labels, or use a teacher model to obtain predictions on unlabeled data as pseudo labels, so as to expand the training samples. However, the quality of the generated pseudo labels is difficult to guarantee, so the detection effect of the trained target detection model is still poor.
Disclosure of Invention
The embodiment of the application provides a target detection model training method, device, equipment and medium based on weak supervision, which are used to enhance the target detection effect.
In a first aspect, the present application provides a method for training a target detection model based on weak supervision, where the target detection model includes a feature extraction network, a target detection network, and a video frame prediction network, and the method includes:
acquiring a sample image set and a sample video set comprising at least one video; wherein the sample image set comprises a plurality of sample images with labels determined based on actual positions of objects in the corresponding sample images;
performing feature extraction on each sample image through the feature extraction network to obtain image features of each sample image, performing target detection on the image features of each sample image through the target detection network to obtain a predicted position of a target in each sample image, and adjusting parameters of the feature extraction network and the target detection network according to an error between the predicted position of the target in each sample image and an actual position indicated by a corresponding label;
performing feature extraction on each frame of video image in the at least one section of video through the feature extraction network to obtain image features of each frame of video image, inputting the image features of each frame of video image into the video frame prediction network, predicting a next frame of image of each frame of video image through the video frame prediction network to obtain a predicted video image of each frame of video image, and adjusting parameters of the feature extraction network and the video frame prediction network according to an error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image;
and taking the trained feature extraction network and the trained target detection network as the trained target detection model until the target detection model meets the preset condition.
In the embodiment of the application, the training samples are expanded through the unlabeled sample video set, and the labeled sample image set and the unlabeled sample video set are used together to train the target detection model; that is, the model is trained in a weakly supervised manner without generating additional pseudo labels or relying on other models to assist training. The method performs weakly supervised learning in a multi-task manner: the target detection task and the video frame prediction task are combined, the feature extraction network shares its weights across the two learning tasks, and training the network to predict the next video frame assists in training the feature extraction network, which indirectly strengthens the feature extraction capability for the target detection task and enhances the detection effect of the trained target detection model.
In one possible embodiment, the label further comprises the actual category of the target in each sample image; after the target detection is performed on the image features of each sample image through the target detection network to obtain the predicted position of the target in each sample image, the method further includes:
obtaining the prediction category of the target in each sample image;
after adjusting parameters of the feature extraction network and the object detection network according to an error between a predicted position of an object in each sample image and an actual position indicated by a corresponding label, the method further comprises:
and adjusting parameters of the feature extraction network and the target detection network according to the error between the predicted category of the target in each sample image and the actual category indicated by the corresponding label.
In the embodiment of the application, parameters of the feature extraction network and the target detection network are adjusted according to the error between the prediction category and the actual category of the target, and the feature extraction network is further trained, so that the feature extraction capability of the feature extraction network is further improved.
In one possible embodiment, adjusting the parameters of the feature extraction network and the object detection network according to the error between the predicted position of the object in each sample image and the actual position indicated by the corresponding label comprises:
if the actual position comprises an actual rectangular frame where the target in each sample image is located, and the predicted position comprises a predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to an error between the predicted rectangular frame where the target in each sample image is located and the actual rectangular frame indicated by the corresponding label; and/or,
if the actual position comprises the position information of the actual rectangular frame where the target in each sample image is located, and the predicted position comprises the position information of the predicted rectangular frame where the target in each sample image is located, adjusting the parameters of the feature extraction network and the target detection network according to the error between the position information of the predicted rectangular frame where the target in each sample image is located and the position information of the actual rectangular frame indicated by the corresponding label.
In the embodiment of the application, various schemes for calculating the error between the predicted position and the actual position are provided, so that the mode for adjusting the network parameters is more flexible.
In one possible embodiment, the sample image set and the sample video set are respectively acquired from two scenes having background references of the same category.
In one possible embodiment, after using the trained feature extraction network and the trained target detection network as the trained target detection model, the method further includes:
taking the trained feature extraction network and the trained video frame prediction network as trained video frame prediction models; wherein the trained video frame prediction model is used for predicting the next frame image of the continuous frame images.
In the embodiment of the application, through a multi-task training mode, not only the trained target detection model but also the trained video frame prediction model can be obtained for predicting the next frame image of the continuous frame images.
In one possible embodiment, the video frame prediction network is a temporal memory network.
In the embodiment of the application, video is temporally ordered; learning the temporal order of video frames with a temporal network therefore provides good training guidance and improves the accuracy of predicting the next frame of video image.
In a second aspect, the present application provides a target detection method based on weak supervision, including:
acquiring an image to be detected;
inputting the image to be detected into a trained target detection model to obtain the position of a target in the image to be detected; wherein the trained target detection model is trained by the method according to any one of the first aspect.
In a third aspect, the present application provides a weak supervision-based target detection model training apparatus, where the target detection model includes a feature extraction network, a target detection network, and a video frame prediction network, the apparatus includes:
the acquisition module is used for acquiring a sample image set and a sample video set comprising at least one section of video; wherein the sample image set comprises a plurality of sample images with labels determined based on actual positions of objects in the corresponding sample images;
the adjusting module is used for extracting the features of each sample image through the feature extraction network to obtain the image features of each sample image, performing target detection on the image features of each sample image through the target detection network to obtain the predicted position of the target in each sample image, and adjusting the parameters of the feature extraction network and the target detection network according to the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label;
the adjusting module is further configured to perform feature extraction on each frame of video image in the at least one segment of video through the feature extraction network to obtain image features of each frame of video image, input the image features of each frame of video image into the video frame prediction network, predict a next frame of image of each frame of video image through the video frame prediction network to obtain a predicted video image of each frame of video image, and adjust parameters of the feature extraction network and the video frame prediction network according to an error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image;
and the obtaining module is used for taking the trained feature extraction network and the trained target detection network as the trained target detection model until the target detection model meets the preset condition.
In one possible embodiment, the label is determined based on the actual location and actual class of the object in the corresponding sample image; the adjustment module is further configured to:
after the target detection is carried out on the image features of each sample image through the target detection network to obtain the predicted position of the target in each sample image, the predicted category of the target in each sample image is obtained;
after adjusting the parameters of the feature extraction network and the object detection network according to the error between the predicted position of the object in each sample image and the actual position indicated by the corresponding label, the parameters of the feature extraction network and the object detection network are adjusted according to the error between the predicted category of the object in each sample image and the actual category indicated by the corresponding label.
In a possible embodiment, the adjusting module is specifically configured to:
if the actual position comprises an actual rectangular frame where the target in each sample image is located, and the predicted position comprises a predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to an error between the predicted rectangular frame where the target in each sample image is located and the actual rectangular frame indicated by the corresponding label; and/or,
if the actual position comprises the position information of the actual rectangular frame where the target in each sample image is located, and the predicted position comprises the position information of the predicted rectangular frame where the target in each sample image is located, adjusting the parameters of the feature extraction network and the target detection network according to the error between the position information of the predicted rectangular frame where the target in each sample image is located and the position information of the actual rectangular frame indicated by the corresponding label.
In one possible embodiment, the sample image set and the sample video set are respectively acquired from two scenes having background references of the same category.
In a possible embodiment, the obtaining module is further configured to:
and after the trained feature extraction network and the trained target detection network are used as the trained target detection model, the trained feature extraction network and the trained video frame prediction network are used as the trained video frame prediction model; wherein the trained video frame prediction model is used for predicting the next frame image of the continuous frame images.
In one possible embodiment, the video frame prediction network is a temporal memory network.
In a fourth aspect, the present application provides a target detection apparatus based on weak supervision, including:
the acquisition module is used for acquiring an image to be detected;
the acquisition module is used for inputting the image to be detected into the trained target detection model to acquire the position of a target in the image to be detected; wherein the trained target detection model is obtained by training according to the method of any one of the first aspect.
In a fifth aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method according to any one of the first aspect or the second aspect according to the obtained program instructions.
In a sixth aspect, the present application provides a computer readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first or second aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a target detection model training method based on weak supervision according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a target detection model training method based on weak supervision according to an embodiment of the present application;
fig. 3 is a schematic diagram of a sample image with a label provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a target detection model provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of a target detection model training apparatus based on weak supervision according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a target detection apparatus based on weak supervision according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be combined with each other arbitrarily provided there is no conflict. Also, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The terms "first" and "second" in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements, but may also include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiments of the present application, "a plurality" may mean at least two, for example, two, three, or more, and the embodiments of the present application are not limited.
In order to enhance the effect of target detection, the embodiment of the present application provides a target detection model training method based on weak supervision, and some brief descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Fig. 1 is a schematic view of an application scenario of a target detection model training method based on weak supervision according to an embodiment of the present application. The application scenario may include a photographing device 101 and a training device 102. The training device 102 may communicate with the photographing device 101.
The shooting device 101 is, for example, a terminal device or a video camera, and the terminal device includes, for example, a camera, a mobile phone, a tablet computer, and the like, and may further include other devices with shooting functions. The training device 102 may be implemented by a terminal, such as a mobile terminal, a fixed terminal, or a portable terminal, such as a smart camera, a mobile handset, a multimedia computer, a multimedia tablet, a desktop computer, a notebook computer, a tablet computer, or the like, or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
It should be noted that fig. 1 illustrates the shooting device 101 and the training device 102 as two independent devices, but in practice, the shooting device 101 may be coupled to the training device 102, or the shooting device 101 and the training device 102 may be the same device, such as a smart camera as the shooting device 101 and the training device 102.
Specifically, the shooting device 101 sends the sample image set and the sample video set to the training device 102, the training device 102 has a pre-constructed target detection model, and after receiving the sample image set and the sample video set, the training device 102 trains the target detection model by using the sample image set and the sample video set to obtain the trained target detection model. The process of how to train the target detection model will be described in detail below.
The application scenario is introduced as above, and the following description is given by taking the training device 102 in fig. 1 as an example to execute the weak supervision-based target detection model training method in conjunction with the application scenario shown in fig. 1. Fig. 2 is a schematic flow chart of a target detection model training method based on weak supervision according to an embodiment of the present application.
S201, acquiring a sample image set and a sample video set comprising at least one video.
Wherein the sample video set comprises at least one video segment, and each video segment comprises a plurality of continuous video images. The sample image set comprises a plurality of sample images with labels, and the labels are determined based on information of targets, such as position information and category information; anything to be detected, such as a pedestrian or a vehicle, may be called a target. There may be more than one target in a sample image, for example a plurality of pedestrians in one image; when a sample image contains multiple targets, each target has a corresponding label.
There are various ways to determine the tag, and each is described below as an example.
1. The label is determined based on the actual position of the object in the corresponding sample image.
The actual position may include an actual rectangular frame in which the target is located in each sample image, and may further include position information of the actual rectangular frame in which the target is located in each sample image. For example, a rectangular coordinate system is established with the center point of the image as the origin, and the position information of the rectangular frame is (x_i, y_i, w_i, h_i), where x_i represents the abscissa of the center point of the rectangular frame in which the i-th target is located, y_i represents the ordinate of that center point, w_i represents the width of the rectangular frame in which the i-th target is located, and h_i represents the height of the rectangular frame in which the i-th target is located.
2. The label is determined based on the actual location and actual class of the object in the corresponding sample image.
For the meaning of the actual position, please refer to the discussion above, which is not repeated here. The actual category of the target is, for example, person, vehicle, animal, etc.; in particular, person may be further classified as child, adult, etc., and vehicle may be further classified as motor vehicle and non-motor vehicle.
For example, in pedestrian detection, the label in the sample image includes the rectangular frame where the human body is located, and the label may further include a circumscribed rectangular frame that displays the position information of the rectangular frame where the human body is located and the category of the human body. Or, for example, in vehicle detection, the label in the sample image includes the actual rectangular frame where the vehicle body is located, and the label may further include a circumscribed rectangular frame that displays the position information of the rectangular frame where the vehicle body is located and the category of the vehicle body.
Fig. 3 is a schematic diagram of a sample image with a label provided by an embodiment of the present application. The label of the sample image includes the actual rectangular frame 301 in which the vehicle body is located, and a circumscribed rectangular frame 302 connected to the actual rectangular frame 301. Here, "car" in the circumscribed rectangular frame 302 indicates that the actual category of the target is vehicle, and (x, y, w, h) indicates the position information of the actual rectangular frame 301 in the sample image. Fig. 3 is an example of a sample image with only one target; in practice, the number of targets in each sample image is not limited.
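For illustration only, one hypothetical in-memory representation of such a labeled sample is sketched below; the field names, the file path, and the use of the center-based (x, y, w, h) convention described above are assumptions and not part of the embodiment.

```python
# A hypothetical representation of one labeled sample image (all names assumed).
labeled_sample = {
    "image_path": "samples/road_scene_000001.jpg",  # hypothetical path
    "objects": [
        # (x, y, w, h): center abscissa, center ordinate, width, height of the actual rectangular frame
        {"category": "car", "box": (12.0, -35.5, 120.0, 64.0)},
        {"category": "person", "box": (-80.0, 10.0, 30.0, 90.0)},
    ],
}
```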
The sample image set and the sample video set may be the same or different in acquisition scene, and are described as examples below.
Case one, the sample image set and the sample video set are acquired from the same scene.
The same scene refers to a scene of the same place, for example, the sample image set and the sample video set are both collected from a road traffic scene of the a place.
In the embodiment of the application, the characteristics of the sample image set and the sample video set acquired from the same scene are more similar, and the training of the characteristic extraction network is more facilitated.
Case two, the sample image set and the sample video set are acquired from two scenes, respectively.
Two scenes with the same category of background reference may be referred to as similar scenes. For example, the sample image set is collected from a road traffic scene at location a, and the sample video set is collected from a road traffic scene at location B.
In the embodiment of the application, compared with a sample video set collected from the same scene, a sample video set from a similar scene is easier to obtain.
In either case one or case two, the sample image set and the sample video set may be acquired at the same time period or at different time periods.
The sample image set and the sample video set are introduced as above, wherein how the training device acquires the sample image set and the sample video set is related, and the acquisition manners are various, which are described below.
In the first mode, the shooting device sends them to the training device.
In the second mode, the training device downloads them from an online database.
In the third mode, the training device obtains them in response to an input operation of the user.
The manner in which the training device acquires the sample image set and the sample video set may be the same. For example, the training devices all acquire the sample image set and the sample video set in the first mode, or for example, the training devices all acquire the sample image set and the sample video set in the second mode, or for example, the training devices all acquire the sample image set and the sample video set in the third mode.
The way in which the training device acquires the sample image set and the sample video set may be different. For example, the training apparatus acquires the sample image set in the first mode, acquires the sample video set in the second mode or the third mode, or for example, the training apparatus acquires the sample image set in the second mode, acquires the sample video set in the first mode or the third mode, or for example, the training apparatus acquires the sample image set in the third mode, and acquires the sample video set in the first mode or the second mode. Specifically, for example, the labeled sample image set is downloaded from an online database by the training device, the unlabeled sample video set is sent to the training device by the shooting device, or the unlabeled sample video set is copied from the shooting device to the training device by the user.
In one possible embodiment, the number of video images in the sample video set may be greater than the number of sample images in the sample image set, considering that the process of manually marking images is time consuming and labor intensive, and the number of marked sample images downloaded from the web is limited, but the unmarked sample video set is more readily available.
In the embodiment of the application, a target detection model can be weakly supervised and trained by using a small amount of labeled sample image sets and a large amount of unlabeled sample video sets, and the training effect of the model can be improved under the condition that the labeled sample image sets are limited.
S202, extracting the features of each sample image through a feature extraction network to obtain the image features of each sample image, performing target detection on the image features of each sample image through a target detection network to obtain the predicted position of the target in each sample image, and adjusting the parameters of the feature extraction network and the target detection network according to the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label.
A pre-established target detection model is provided in the training device, and please refer to fig. 4, which is a schematic structural diagram of the target detection model provided in the embodiment of the present application, and the target detection model includes a feature extraction network 401, a target detection network 402, and a video frame prediction network 403. The feature extraction network 401 is used to extract image features. The target detection network 402 is configured to acquire the image features output by the feature extraction network 401, and output a target detection result. The video frame prediction network 403 is configured to obtain the image features output by the feature extraction network 401, and output a predicted video image of a next frame.
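For illustration, a minimal PyTorch-style sketch of the structure in Fig. 4 is given below; the layer sizes, the simplified single-box detection head, and the convolutional frame-prediction head are assumptions rather than the concrete networks of the embodiment, and in practice the detection head would be a detector such as Faster R-CNN or YOLO, as noted in S1.2.

```python
import torch.nn as nn

class FeatureExtractionNetwork(nn.Module):
    """Shared backbone (401): any CNN mapping an image to a feature map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.body(x)

class TargetDetectionNetwork(nn.Module):
    """Detection head (402), simplified here to one box and one set of class logits per image."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.box_head = nn.Linear(64, 4)            # predicted (x, y, w, h)
        self.cls_head = nn.Linear(64, num_classes)  # predicted category logits

    def forward(self, feats):
        v = self.pool(feats).flatten(1)
        return self.box_head(v), self.cls_head(v)

class VideoFramePredictionNetwork(nn.Module):
    """Frame-prediction head (403): decodes the shared feature map back to an image."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.decoder(feats)
```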
After the training device acquires the sample image set and the sample video set, the process of training the target detection model by using the sample image set is S1.1-S1.3.
S1.1, extracting the characteristics of each sample image through a characteristic extraction network to obtain the image characteristics of each sample image.
Specifically, after the training device acquires the sample image set, each sample image is input to a feature extraction network, which may be various neural networks, and feature extraction processing may be performed on the input image, so as to obtain image features of each sample image.
S1.2, carrying out target detection on the image characteristics of each sample image through a target detection network to obtain the predicted position of the target in each sample image.
Specifically, the training apparatus inputs the image features of each sample image into a target detection network, which may be any network for detecting targets, such as Faster R-CNN, YOLO, and the like, so as to obtain the predicted position of the target in each sample image, or the predicted position and the predicted category of the target in each sample image.
It should be noted that, when a plurality of objects exist in a sample image, the predicted positions of the plurality of objects in the sample image, or the predicted positions and the predicted categories of the plurality of objects in the sample image, may be obtained by performing object detection on image features of the sample image through an object detection network.
S1.3, adjusting parameters of the feature extraction network and the target detection network according to errors between the predicted positions of the targets in the sample images and the actual positions indicated by the corresponding labels.
After the training device obtains the predicted position of the target, there are various ways to adjust the network parameters according to the position error, which is the error between the predicted position and the actual position of the target, and the following description is given in each case.
In the first case, the actual position includes an actual rectangular frame in which the target is located in each sample image, and the predicted position includes a predicted rectangular frame in which the target is located in each sample image.
For the first case, the training device may adjust parameters of the feature extraction network and the target detection network according to an error between the predicted rectangular frame in which the target is located in each sample image and the actual rectangular frame indicated by the corresponding label, that is, a first position error. Wherein the first position error may be calculated using a regression loss function, such as a CIOU loss function, a DIOU loss function, and the like.
In the second case, the actual position includes position information of an actual rectangular frame in which the target is located in each sample image, and the predicted position includes position information of a predicted rectangular frame in which the target is located in each sample image.
For the second case, the training device may adjust parameters of the feature extraction network and the target detection network according to an error between the position information of the predicted rectangular frame in which the target is located in each sample image and the position information of the actual rectangular frame indicated by the corresponding label, that is, a second position error. Wherein the second position error may be calculated using a classification loss function, such as a cross-entropy loss function.
In the third case, the actual position includes an actual rectangular frame in which the target is located in each sample image and position information of the actual rectangular frame, and the predicted position includes a predicted rectangular frame in which the target is located in each sample image and position information of the predicted rectangular frame.
For a third case, the training device may adjust parameters of the feature extraction network and the target detection network based on the first position error and the second position error. For the meaning of the first position error and the second position error, please refer to the content discussed above, and the description thereof is omitted. The specific adjustment methods are various and will be described below.
In the first mode, the training device may adjust the parameters of the feature extraction network and the target detection network according to the first position error, and then continuously adjust the parameters of the feature extraction network and the target detection network according to the second position error.
And in the second mode, the training equipment can adjust the parameters of the feature extraction network and the target detection network according to the second position error, and then continuously adjust the parameters of the feature extraction network and the target detection network according to the first position error.
And thirdly, the training equipment can adjust the parameters of the feature extraction network and the target detection network according to the weighted sum result of the first position error and the second position error.
It is described above how the parameters of the feature extraction network and the object detection network are adjusted according to the error between the predicted position and the actual position of the object. When the sample image has a plurality of targets, the training device may calculate an error between the predicted position of each target and the actual position indicated by the corresponding label, respectively, to obtain a plurality of position errors, and adjust parameters of the feature extraction network and the target detection network according to the plurality of position errors, respectively, or adjust parameters of the feature extraction network and the target detection network according to a weighted sum result of the plurality of position errors.
And S1.4, adjusting parameters of the feature extraction network and the target detection network according to errors between the prediction categories of the targets in the sample images and the actual categories indicated by the corresponding labels.
After the training device obtains the prediction category of the target, parameters of the feature extraction network and the target detection network can be adjusted according to an error between the prediction category and the actual category, namely a category error. Wherein calculating the error between the predicted class and the actual class may employ a classification loss function, such as a cross-entropy loss function.
When the sample image has a plurality of targets, the training device may calculate an error between the predicted category of each target and the actual category indicated by the corresponding label, respectively, to obtain a plurality of category errors, and adjust parameters of the feature extraction network and the target detection network according to the plurality of category errors, respectively, or adjust parameters of the feature extraction network and the target detection network according to a weighted sum result of the plurality of category errors.
It should be noted that when S1.2 is executed, if only the predicted position of the target in each sample image is obtained through the target detection network and the prediction type of the target is not obtained, only S1.3 is executed and S1.4 is not executed. And if the predicted position and the predicted category of the target in each sample image are obtained through the target detection network, executing S1.3 and S1.4.
Further, the execution sequence of S1.3 and S1.4 is arbitrary: the training device may execute S1.3 first and then S1.4, may execute S1.4 first and then S1.3, or may execute S1.3 and S1.4 simultaneously. For example, the position error in S1.3 and the category error in S1.4 are weighted and summed, and the parameters of the feature extraction network and the target detection network are adjusted according to the weighted sum of the position error and the category error.
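As a hedged sketch of how the weighted sum of the position error and the category error might be computed (the smooth L1 box loss here is only a stand-in for the regression losses such as CIOU or DIOU named above, and the weights and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_box, pred_logits, gt_box, gt_class, w_box=1.0, w_cls=1.0):
    """Weighted sum of the position error (S1.3) and the category error (S1.4); weights assumed."""
    box_loss = F.smooth_l1_loss(pred_box, gt_box)      # position error (stand-in for CIOU/DIOU)
    cls_loss = F.cross_entropy(pred_logits, gt_class)  # category error (cross-entropy)
    return w_box * box_loss + w_cls * cls_loss

# Hypothetical tensors for one sample image with a single target:
pred_box, pred_logits = torch.randn(1, 4), torch.randn(1, 3)
gt_box, gt_class = torch.tensor([[0.1, 0.2, 0.5, 0.4]]), torch.tensor([1])
loss = detection_loss(pred_box, pred_logits, gt_box, gt_class)
```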
S203, extracting the features of each frame of video image in at least one section of video through a feature extraction network to obtain the image features of each frame of video image, inputting the image features of each frame of video image into a video frame prediction network, predicting the next frame of image of each frame of video image through the video frame prediction network to obtain the predicted video image of each frame of video image, and adjusting the parameters of the feature extraction network and the video frame prediction network according to the error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image.
After the training device acquires the sample image set and the sample video set, the process of training the target detection model using the sample video set is as described in S2.1-S2.3.
And S2.1, performing feature extraction on each frame of video image through a feature extraction network to obtain the image features of each frame of video image.
Specifically, after the training device obtains the sample video set, each frame of video image is input to a feature extraction network, which may be various neural networks, and feature extraction processing may be performed on the input image, so as to obtain image features of each frame of video image.
And S2.2, inputting the image characteristics of each frame of video image into a video frame prediction network, and predicting the next frame of image of each frame of video image through the video frame prediction network to obtain the predicted video image of each frame of video image.
Specifically, the training device inputs the image features of each frame of video image into a video frame prediction network, which may be any of a variety of neural networks. Preferably, the video frame prediction network is a temporal network, for example a temporal memory network using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells, which can predict the next frame image of each frame of video image, so as to obtain the predicted video image of each frame of video image.
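One hedged sketch of such a temporal memory predictor is given below; it assumes the per-frame backbone features are flattened and fed through a GRU whose hidden state carries memory across frames, with a linear layer decoding the next (flattened) frame. The feature and frame shapes are assumptions for illustration, not the concrete network of the embodiment.

```python
import torch.nn as nn

class GRUFramePredictor(nn.Module):
    """Sketch of a temporal-memory video frame prediction network (403)."""
    def __init__(self, c=64, h=16, w=16, out_pixels=3 * 64 * 64, hidden=256):
        super().__init__()
        self.gru = nn.GRU(c * h * w, hidden, batch_first=True)  # temporal memory over frames
        self.decode = nn.Linear(hidden, out_pixels)              # next-frame decoder

    def forward(self, frame_feats):
        # frame_feats: (batch, time, C, H, W) backbone features of consecutive video frames
        b, t = frame_feats.shape[:2]
        out, _ = self.gru(frame_feats.reshape(b, t, -1))
        return self.decode(out)  # one predicted (flattened) next frame per input frame
```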
And S2.3, adjusting parameters of the feature extraction network and the video frame prediction network according to the error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image.
Specifically, because the video frames are temporally ordered, the next frame of video image corresponding to each frame of video image can be determined from the sample video set, and the training device adjusts the parameters of the feature extraction network and the video frame prediction network according to the error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image. Wherein the error between the two images may be calculated using an image reconstruction loss function, such as a similarity loss function (Dice loss).
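A minimal sketch of this frame-reconstruction error follows; applying a Dice-style overlap directly to pixel intensities (rather than to masks), and adding an L1 term, are assumptions made only for illustration.

```python
import torch.nn.functional as F

def soft_dice_loss(pred, target, eps=1e-6):
    """Dice-style similarity loss between the predicted frame and the real next frame.
    pred/target: tensors with values in [0, 1] and shape (batch, C, H, W)."""
    inter = (pred * target).flatten(1).sum(dim=1)
    denom = pred.flatten(1).sum(dim=1) + target.flatten(1).sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def frame_prediction_loss(pred_frame, next_frame):
    # An L1 term is a common complement for frame reconstruction (an assumption, not from the source).
    return soft_dice_loss(pred_frame, next_frame) + F.l1_loss(pred_frame, next_frame)
```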
It should be noted that the training order of the sample image and the video image is arbitrary. The training device can alternately input the sample images and the video images into the feature extraction network, and alternately update the parameters of the feature extraction network according to the prediction results output by the target detection network and the video frame prediction network.
And S204, taking the trained feature extraction network and the trained target detection network as the trained target detection model until the target detection model meets the preset condition.
The training device may continuously iterate and update parameters of each network by using a gradient descent algorithm until the target detection model satisfies a preset condition, and the training device may use the trained feature extraction network and the trained target detection network as the trained target detection model after the training is finished.
The preset condition may refer to a preset iteration number or a preset test index. For example, when the total number of iterations of the network is equal to the preset number of iterations, the training is ended. Or for example, the sample image set may be divided into a training set and a verification set, the training set is used for training, the verification set is used for verification during training, when the training set has finished one iteration (epoch), that is, all sample images in the training set have been input for one iteration, the sample images in the verification set are used for verification once, and when the test indexes of the sample images in the verification set reach the preset test indexes, the training is finished. The test indexes are accuracy, recall rate and the like.
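A small sketch of such a preset-condition check is given below; the iteration budget and the target test-index value are assumptions.

```python
def training_finished(iteration, val_metric, max_iterations=100_000, target_metric=0.90):
    """Stop when the preset iteration count is reached, or when the test index
    (e.g. accuracy or recall on the verification set) reaches the preset value.
    Both thresholds are assumed for illustration."""
    return iteration >= max_iterations or val_metric >= target_metric
```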
When the labels are determined based on the actual positions of the targets in the corresponding sample images, the trained target detection model may be used to detect the positions of the targets in any of the images. When the labels are determined based on the actual positions and actual classes of objects in the corresponding sample images, the trained object detection model may be used to detect the positions and classes of objects in any of the images.
In one possible embodiment, the training device may use the trained feature extraction network and the trained video frame prediction network as the trained video frame prediction model after using the trained feature extraction network and the trained target detection network as the trained target detection model. The trained video frame prediction model is used for predicting the next frame image of the continuous frame images.
The training device can obtain a to-be-predicted image after obtaining the trained video frame prediction model, input the to-be-predicted image into the trained video frame prediction model, perform feature extraction processing through the trained feature extraction network to obtain the image feature of the to-be-predicted image, input the image feature of the to-be-predicted image into the trained video frame prediction network, predict the next frame image of the to-be-predicted image, and obtain a predicted video image of the to-be-predicted image, namely the predicted next frame image of the to-be-predicted image.
The training framework provided by the embodiment of the application is simple and universal, and can be migrated to more tasks in a multi-task-based mode, for example, an image restoration task is adopted to replace a video prediction task, and the image restoration task and a target detection task are jointly trained.
Specifically, the training device may use a noise image set instead of the sample video set, and an image restoration network instead of the video frame prediction network. The noise image set includes a plurality of original images and a noise image corresponding to each original image, and the noise image refers to an image to which noise is added to the original image. The image restoration network is used to remove noise in the image. The method comprises the steps of extracting the features of each noise image through a feature extraction network to obtain the image features of each noise image, removing the noise of the image features of each noise image through an image restoration network to obtain the de-noised image of each noise image, and adjusting the parameters of the feature extraction network and the image restoration network according to the error between the de-noised image of each noise image and the original image corresponding to each noise image.
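A hedged sketch of this image-restoration variant follows; the restoration head that decodes shared backbone features, the additive Gaussian noise model, and the L1 restoration error are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageRestorationNetwork(nn.Module):
    """Sketch of a restoration head that decodes shared backbone features into a denoised image."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.decoder(feats)

def make_noisy(original, sigma=0.1):
    # Additive Gaussian noise; the noise model is an assumption.
    return (original + sigma * torch.randn_like(original)).clamp(0.0, 1.0)

def restoration_loss(denoised, original):
    # Error between the denoised image and the corresponding original image.
    return F.l1_loss(denoised, original)
```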
In order to more clearly illustrate the method for training the target detection model based on weak supervision provided by the embodiment of the present application, a training process of the target detection model based on weak supervision is further described below with reference to the schematic structural diagram of the target detection model shown in fig. 4.
And S3.1, inputting the marked sample image into a feature extraction network to extract features.
The training equipment inputs each sample image in the sample image set into the feature extraction network and outputs the image features of each sample image. For each sample image with a label, please refer to the above discussion for the meaning of the label and the obtaining manner of the sample image set, which are not described herein again.
And S3.2, inputting the unmarked video image into a feature extraction network to extract features.
And the training equipment inputs each frame of video image in the sample video set into the characteristic extraction network and outputs the image characteristic of each frame of video image. Please refer to the above discussion for the manner of obtaining the sample video set, which is not described herein again.
And S3.3, inputting the features extracted in the S3.1 into a target detection network to obtain the predicted position of the target.
The training device inputs the image features of each sample image into the target detection network, and outputs the predicted position of the target in each sample image, where the meaning of the predicted position refers to the content discussed above and is not described here again.
And S3.4, inputting the features extracted in the S3.2 into a video frame prediction network to obtain a prediction video image.
And the training equipment inputs the image characteristics of each frame of video image into a video frame prediction network, predicts the next frame of image of each frame of video image and outputs the predicted video image of each frame of video image.
And S3.5, performing loss calculation on the predicted position and the actual position obtained in the S3.3.
The training apparatus calculates the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label. For how to calculate the error, please refer to the above discussion, and the details are not repeated herein.
And S3.6, performing loss calculation on the predicted video image obtained in the S3.4 and the next frame video image.
The training apparatus calculates an error between a predicted video image of each frame of video image and a next frame of video image of each frame of video image. For how to calculate the error, please refer to the above discussion, and the details are not repeated here.
And S3.7, carrying out reverse gradient propagation by the network, updating network parameters, and entering the next round of training iteration.
And the training equipment updates parameters of the feature extraction network and the target detection network according to the error obtained in the step S3.5, updates parameters of the feature extraction network and the video frame prediction network according to the error obtained in the step S3.6, enters the next round of training iteration until preset conditions are met, finishes training, and takes the trained feature extraction network and the trained target detection network as trained target detection models. For the meaning of the predetermined condition, please refer to the content discussed above, and the details are not repeated herein.
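Putting S3.1 to S3.7 together, a hedged end-to-end sketch of one training round is given below. It assumes the module and loss sketches above, data loaders that yield (image, box, class) and (frame, next frame) batches respectively, and an optimizer built over the parameters of all three networks; updating the two branches alternately per batch is only one of the orderings the description allows.

```python
def train_one_round(backbone, det_head, frame_head, det_loader, video_loader,
                    det_loss_fn, frame_loss_fn, optimizer):
    """One round of alternating weakly supervised training (steps S3.1 to S3.7)."""
    for (image, gt_box, gt_class), (frame, next_frame) in zip(det_loader, video_loader):
        # S3.1 / S3.3 / S3.5: labeled branch updates the backbone and the target detection network.
        pred_box, pred_logits = det_head(backbone(image))
        loss_det = det_loss_fn(pred_box, pred_logits, gt_box, gt_class)
        optimizer.zero_grad()
        loss_det.backward()
        optimizer.step()

        # S3.2 / S3.4 / S3.6: unlabeled branch updates the backbone and the video frame prediction network.
        pred_frame = frame_head(backbone(frame))
        loss_vid = frame_loss_fn(pred_frame, next_frame)
        optimizer.zero_grad()
        loss_vid.backward()
        optimizer.step()
        # S3.7: gradients have been propagated backwards and parameters updated;
        # the round repeats until the preset condition is met.
```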
Based on the same inventive concept, the embodiment of the application provides a target detection method based on weak supervision, which comprises the following steps:
and the training equipment acquires an image to be detected, inputs the image to be detected into the trained target detection model, and acquires the position of a target in the image to be detected. The trained target detection model is obtained by training equipment through the target detection model training method based on weak supervision.
Specifically, after the shooting device shoots the image to be detected in real time, the image to be detected can be sent to the training device, or the training device can shoot the image to be detected by itself. After obtaining the image to be detected, the training equipment inputs the image to be detected into the trained target detection model, performs characteristic extraction processing through the trained characteristic extraction network to obtain the image characteristics of the image to be detected, and performs target detection on the image characteristics of the image to be detected through the trained target detection network to obtain the position of the target in the image to be detected or obtain the position and the category of the target in the image to be detected.
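An inference-time sketch under the same assumptions as the module sketches above (in particular the simplified single-box detection head) is shown below; it is not the concrete detection pipeline of the embodiment.

```python
import torch

@torch.no_grad()
def detect(backbone, det_head, image):
    """Run the trained feature extraction and target detection networks on an image to be detected.
    image: tensor of shape (1, 3, H, W); returns the predicted box and the predicted class index."""
    backbone.eval()
    det_head.eval()
    pred_box, pred_logits = det_head(backbone(image))
    return pred_box[0].tolist(), int(pred_logits.argmax(dim=1)[0])
```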
In a possible embodiment, after obtaining the trained target detection model by the above method, the training device may send the trained target detection model to another detection device, which then executes the above target detection method. The detection device may be implemented as a terminal or a server.
As an example, the shooting device in the embodiment illustrated in fig. 2 is exemplified by the shooting device 101 in fig. 1, and the training device is exemplified by the training device 102 in fig. 1.
Based on the same inventive concept, an embodiment of the application provides a weak-supervision-based target detection model training device, wherein the target detection model comprises a feature extraction network, a target detection network and a video frame prediction network, and the device is deployed in the training device discussed above. Referring to fig. 5, the apparatus includes:
an obtaining module 501, configured to obtain a sample image set and a sample video set including at least one video; wherein the sample image set comprises a plurality of sample images with labels, and the labels are determined based on the actual positions of the objects in the corresponding sample images;
the adjusting module 502 is configured to perform feature extraction on each sample image through a feature extraction network to obtain image features of each sample image, perform target detection on the image features of each sample image through a target detection network to obtain a predicted position of a target in each sample image, and adjust parameters of the feature extraction network and the target detection network according to an error between the predicted position of the target in each sample image and an actual position indicated by a corresponding label;
the adjusting module 502 is further configured to perform feature extraction on each frame of video image in at least one segment of video through a feature extraction network to obtain image features of each frame of video image, input the image features of each frame of video image into a video frame prediction network, predict a next frame of image of each frame of video image through the video frame prediction network to obtain a predicted video image of each frame of video image, and adjust parameters of the feature extraction network and the video frame prediction network according to an error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image;
an obtaining module 503, configured to take the trained feature extraction network and the trained target detection network as the trained target detection model once the target detection model meets a preset condition.
In one possible embodiment, the label is determined based on the actual location and actual class of the object in the corresponding sample image; the adjustment module 502 is further configured to:
after performing target detection on the first image features of each sample image through the target detection network to obtain the predicted position of the target in each sample image, obtain the predicted category of the target in each sample image;
after adjusting parameters of the feature extraction network and the target detection network according to the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label, adjust parameters of the feature extraction network and the target detection network according to the error between the predicted category of the target in each sample image and the actual category indicated by the corresponding label.
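Where categories are predicted as well, the category error may, for example, be a cross-entropy term over class scores; the sketch below is illustrative only and assumes the detection head also outputs one score per class.

```python
import torch
import torch.nn.functional as F

class_logits = torch.randn(4, 10)            # predicted category scores for 4 sample images, 10 classes
actual_classes = torch.tensor([1, 3, 3, 7])  # actual categories indicated by the labels
category_loss = F.cross_entropy(class_logits, actual_classes)  # error used for the extra adjustment
print(category_loss.item())
```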
In a possible embodiment, the adjusting module 502 is specifically configured to:
if the actual position comprises an actual rectangular frame where the target in each sample image is located and the predicted position comprises a predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to the error between the predicted rectangular frame where the target in each sample image is located and the actual rectangular frame indicated by the corresponding label; and/or,
if the actual position comprises the position information of the actual rectangular frame where the target in each sample image is located, and the predicted position comprises the position information of the predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to the error between the position information of the predicted rectangular frame where the target in each sample image is located and the position information of the actual rectangular frame indicated by the corresponding label.
In one possible embodiment, the sample image set and the sample video set are respectively acquired from two scenes having background references of the same category.
In a possible embodiment, the obtaining module 503 is further configured to:
after the trained feature extraction network and the trained target detection network are taken as the trained target detection model, take the trained feature extraction network and the trained video frame prediction network as a trained video frame prediction model. The trained video frame prediction model is used for predicting the next frame image following consecutive frame images.
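Under the same hypothetical module names as in the earlier sketches, reusing the trained backbone for next-frame prediction could look like the following; this is an illustrative assumption, not the embodiment's actual interface.

```python
import torch

def predict_next_frame(latest_frame, backbone, pred_head):
    """latest_frame: (1, 3, H, W) most recent frame of a sequence of consecutive frames."""
    backbone.eval()
    pred_head.eval()
    with torch.no_grad():
        return pred_head(backbone(latest_frame))  # predicted next frame image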
In one possible embodiment, the video frame prediction network is a temporal memory network.
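One hypothetical form of such a temporal memory network is a recurrent (LSTM-style) module over per-frame feature vectors followed by a decoder; the sizes and the flat-vector decoding below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class TemporalMemoryPredictor(nn.Module):
    """Illustrative temporal memory network: LSTM over per-frame features, then a frame decoder."""
    def __init__(self, feat_dim=256, hidden_dim=256, frame_shape=(3, 64, 64)):
        super().__init__()
        self.frame_shape = frame_shape
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, frame_shape[0] * frame_shape[1] * frame_shape[2])

    def forward(self, frame_features):
        # frame_features: (batch, time, feat_dim) features of consecutive video frames
        out, _ = self.lstm(frame_features)
        frames = self.decoder(out)  # one predicted next frame per time step
        return frames.view(out.shape[0], out.shape[1], *self.frame_shape)

model = TemporalMemoryPredictor()
predicted = model(torch.randn(2, 5, 256))  # shape (2, 5, 3, 64, 64)
```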
It should be noted that the apparatus in fig. 5 may also be used to implement the weak supervision-based target detection model training method discussed above, and will not be described herein again.
Based on the same inventive concept, the application provides a target detection device based on weak supervision, which comprises:
an obtaining module 601, configured to obtain an image to be detected;
an obtaining module 602, configured to input the image to be detected into a trained target detection model, and obtain a position of a target in the image to be detected; wherein the trained target detection model is obtained by training through the weak supervision-based target detection model training method discussed above.
It should be noted that the apparatus in fig. 6 can also be used to implement the weak supervision-based target detection method discussed above, and will not be described herein again.
Based on the same inventive concept, an electronic device is provided in the embodiments of the present application, please refer to fig. 7, which includes a processor 701 and a memory 702.
A memory 702 for storing program instructions;
the processor 701 is configured to call the program instructions stored in the memory 702 and, according to the obtained program instructions, execute the target detection model training method and/or the target detection method. The processor 701 may also implement the functions of the modules in the apparatus shown in fig. 5 and/or fig. 6.
In the embodiment of the present application, the specific connection medium between the processor 701 and the memory 702 is not limited; fig. 7 shows the processor 701 and the memory 702 connected by a bus 700. The bus 700 is drawn with a thick line in fig. 7; the way other components are connected is merely illustrative and is not limiting. The bus 700 may be divided into an address bus, a data bus, a control bus and the like, and is drawn with a single thick line in fig. 7 only for ease of illustration, which does not mean that there is only one bus or one type of bus. Optionally, the processor 701 may also be referred to as a controller; the name is not limiting.
The processor 701 may be a general-purpose processor, such as a central processing unit (CPU), or a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the target detection model training method and/or the target detection method disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform an object detection model training method and/or an object detection method as discussed in the foregoing. Since the principle of solving the problem of the computer-readable storage medium is similar to that of the method, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repeated details are not repeated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (11)
1. A target detection model training method based on weak supervision is characterized in that the target detection model comprises a feature extraction network, a target detection network and a video frame prediction network, and the method comprises the following steps:
acquiring a sample image set and a sample video set comprising at least one video; wherein the sample image set comprises a plurality of sample images with labels determined based on actual positions of objects in the corresponding sample images;
performing feature extraction on each sample image through the feature extraction network to obtain image features of each sample image, performing target detection on the image features of each sample image through the target detection network to obtain a predicted position of a target in each sample image, and adjusting parameters of the feature extraction network and the target detection network according to an error between the predicted position of the target in each sample image and an actual position indicated by a corresponding label;
extracting the features of each frame of video image in the at least one section of video through the feature extraction network to obtain the image features of each frame of video image, inputting the image features of each frame of video image into the video frame prediction network, predicting the next frame of image of each frame of video image through the video frame prediction network to obtain the predicted video image of each frame of video image, and adjusting the parameters of the feature extraction network and the video frame prediction network according to the error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image;
and taking the trained feature extraction network and the trained target detection network as the trained target detection model until the target detection model meets the preset conditions.
2. The method of claim 1, wherein the label is determined based on an actual location and an actual category of the object in the corresponding sample image; after the target detection is performed on the first image feature of each sample image through the target detection network to obtain the predicted position of the target in each sample image, the method further includes:
obtaining the prediction category of the target in each sample image;
after adjusting parameters of the feature extraction network and the object detection network according to an error between a predicted position of an object in each sample image and an actual position indicated by a corresponding label, the method further comprises:
and adjusting parameters of the feature extraction network and the target detection network according to the error between the predicted category of the target in each sample image and the actual category indicated by the corresponding label.
3. The method of claim 1, wherein adjusting parameters of the feature extraction network and the object detection network based on an error between a predicted location of an object in each sample image and an actual location indicated by a corresponding label comprises:
if the actual position comprises an actual rectangular frame where the target in each sample image is located, and the predicted position comprises a predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to an error between the predicted rectangular frame where the target in each sample image is located and the actual rectangular frame indicated by the corresponding label; and/or,
if the actual position comprises the position information of the actual rectangular frame where the target in each sample image is located, and the predicted position comprises the position information of the predicted rectangular frame where the target in each sample image is located, adjusting parameters of the feature extraction network and the target detection network according to an error between the position information of the predicted rectangular frame where the target in each sample image is located and the position information of the actual rectangular frame indicated by the corresponding label.
4. The method of any of claims 1-3, wherein the sample image set and the sample video set are respectively acquired from two scenes having background references of the same category.
5. The method of any one of claims 1-3, wherein after using the trained feature extraction network and the trained target detection network as trained target detection models, the method further comprises:
taking the trained feature extraction network and the trained video frame prediction network as trained video frame prediction models; wherein the trained video frame prediction model is used for predicting the next frame image of the continuous frame images.
6. A method according to any one of claims 1 to 3, wherein the video frame prediction network is a temporal memory network.
7. A target detection method based on weak supervision is characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a trained target detection model to obtain the position of a target in the image to be detected; wherein the trained target detection model is trained by the method of any one of claims 1-6.
8. An object detection model training device based on weak supervision, wherein the object detection model comprises a feature extraction network, an object detection network and a video frame prediction network, the device comprises:
the acquisition module is used for acquiring a sample image set and a sample video set comprising at least one video; wherein the sample image set comprises a plurality of sample images with labels determined based on actual positions of objects in the corresponding sample images;
the adjusting module is used for extracting the features of each sample image through the feature extraction network to obtain the image features of each sample image, performing target detection on the image features of each sample image through the target detection network to obtain the predicted position of the target in each sample image, and adjusting the parameters of the feature extraction network and the target detection network according to the error between the predicted position of the target in each sample image and the actual position indicated by the corresponding label;
the adjusting module is further configured to perform feature extraction on each frame of video image in the at least one segment of video through the feature extraction network to obtain image features of each frame of video image, input the image features of each frame of video image into the video frame prediction network, predict a next frame of image of each frame of video image through the video frame prediction network to obtain a predicted video image of each frame of video image, and adjust parameters of the feature extraction network and the video frame prediction network according to an error between the predicted video image of each frame of video image and the next frame of video image of each frame of video image;
and the obtaining module is used for taking the trained feature extraction network and the trained target detection network as the trained target detection model until the target detection model meets the preset condition.
9. An object detection device based on weak supervision, comprising:
the acquisition module is used for acquiring an image to be detected;
the acquisition module is used for inputting the image to be detected into the trained target detection model to acquire the position of a target in the image to be detected; wherein the trained target detection model is trained by the method of any one of claims 1-6.
10. An electronic device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in the memory and for executing the method according to any one of claims 1-6 or 7 in accordance with the obtained program instructions.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any of claims 1-6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210596349.0A CN115082752A (en) | 2022-05-30 | 2022-05-30 | Target detection model training method, device, equipment and medium based on weak supervision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210596349.0A CN115082752A (en) | 2022-05-30 | 2022-05-30 | Target detection model training method, device, equipment and medium based on weak supervision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115082752A true CN115082752A (en) | 2022-09-20 |
Family
ID=83250228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210596349.0A Pending CN115082752A (en) | 2022-05-30 | 2022-05-30 | Target detection model training method, device, equipment and medium based on weak supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082752A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018019124A1 (en) * | 2016-07-29 | 2018-02-01 | Nubia Technology Co., Ltd. | Image processing method and electronic device and storage medium |
CN110796093A (en) * | 2019-10-30 | 2020-02-14 | 上海眼控科技股份有限公司 | Target tracking method and device, computer equipment and storage medium |
CN111950419A (en) * | 2020-08-03 | 2020-11-17 | 中国民用航空华东地区空中交通管理局 | Image information prediction method, image information prediction device, computer equipment and storage medium |
CN112528801A (en) * | 2020-12-02 | 2021-03-19 | 上海高德威智能交通系统有限公司 | Abnormal event detection method, model training method and device |
Non-Patent Citations (1)
Title |
---|
CHAI, Zehua: "Weakly Supervised Object Recognition for Images of Specific Scenes", China Master's Theses Full-text Database (Information Science and Technology), no. 2020, 15 February 2020 (2020-02-15), pages 138 - 1609 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024169656A1 (en) * | 2023-02-14 | 2024-08-22 | 中兴通讯股份有限公司 | Self-monitoring-based video stream feature generation method, device, and storage medium |
CN116468985A (en) * | 2023-03-22 | 2023-07-21 | 北京百度网讯科技有限公司 | Model training method, quality detection device, electronic equipment and medium |
CN116468985B (en) * | 2023-03-22 | 2024-03-19 | 北京百度网讯科技有限公司 | Model training method, quality detection device, electronic equipment and medium |
CN116628177A (en) * | 2023-05-22 | 2023-08-22 | 福建省网络与信息安全测评中心 | Interactive data processing method and system for network security platform |
CN116628177B (en) * | 2023-05-22 | 2023-11-14 | 福建省网络与信息安全测评中心 | Interactive data processing method and system for network security platform |
CN117152692A (en) * | 2023-10-30 | 2023-12-01 | 中国市政工程西南设计研究总院有限公司 | Traffic target detection method and system based on video monitoring |
CN117152692B (en) * | 2023-10-30 | 2024-02-23 | 中国市政工程西南设计研究总院有限公司 | Traffic target detection method and system based on video monitoring |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109978893B (en) | Training method, device, equipment and storage medium of image semantic segmentation network | |
CN110807385B (en) | Target detection method, target detection device, electronic equipment and storage medium | |
CN115082752A (en) | Target detection model training method, device, equipment and medium based on weak supervision | |
CN111178183B (en) | Face detection method and related device | |
CN108960189B (en) | Image re-identification method and device and electronic equipment | |
CN110245579B (en) | People flow density prediction method and device, computer equipment and readable medium | |
CN110096938B (en) | Method and device for processing action behaviors in video | |
CN109800682B (en) | Driver attribute identification method and related product | |
CN113095346A (en) | Data labeling method and data labeling device | |
CN109086811A (en) | Multi-tag image classification method, device and electronic equipment | |
CN111914908B (en) | Image recognition model training method, image recognition method and related equipment | |
CN114331829A (en) | Countermeasure sample generation method, device, equipment and readable storage medium | |
CN112132130B (en) | Real-time license plate detection method and system for whole scene | |
CN111783712A (en) | Video processing method, device, equipment and medium | |
CN114550053A (en) | Traffic accident responsibility determination method, device, computer equipment and storage medium | |
CN113313215B (en) | Image data processing method, image data processing device, computer equipment and storage medium | |
CN110427998A (en) | Model training, object detection method and device, electronic equipment, storage medium | |
CN117036843A (en) | Target detection model training method, target detection method and device | |
CN114064974A (en) | Information processing method, information processing apparatus, electronic device, storage medium, and program product | |
CN115187772A (en) | Training method, device and equipment of target detection network and target detection method, device and equipment | |
CN112036381A (en) | Visual tracking method, video monitoring method and terminal equipment | |
CN115953643A (en) | Knowledge distillation-based model training method and device and electronic equipment | |
CN110490058B (en) | Training method, device and system of pedestrian detection model and computer readable medium | |
CN114170484B (en) | Picture attribute prediction method and device, electronic equipment and storage medium | |
CN114495006A (en) | Detection method and device for left-behind object and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||