CN113989721A

CN113989721A - Target detection method and training method and device of target detection model

Info

Publication number: CN113989721A
Application number: CN202111279535.3A
Authority: CN
Inventors: 康帅
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-01-28

Abstract

The disclosure provides a target detection method, a training method and device of a target detection model, electronic equipment and a storage medium, and relates to the field of artificial intelligence, in particular to the technical field of computer vision and deep learning. The target detection method may include: detecting a video frame to be processed in a video frame sequence to obtain predicted position information of a target object included in the video frame to be processed; in response to detecting that a target object included in the video frame to be processed is incomplete, acquiring target position information determined based on a previous video frame of the video frame to be processed in the video frame sequence; and in response to acquiring the target position information, correcting the predicted position information according to the target position information.

Description

Target detection method and training method and device of target detection model

Technical Field

The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and more specifically to a target detection method, a training method and apparatus for a target detection model, an electronic device, and a storage medium.

Background

With the continuous development of deep learning technology, computer vision tasks have advanced at an unprecedented pace. Computer vision tasks may include image classification, object detection, video understanding, and the like. As these tasks are refined, computer vision faces new challenges. For example, in the field of object detection, one such challenge is improving the position detection accuracy for an incomplete object in an image.

Disclosure of Invention

Provided are a target detection method, a training method and apparatus for a target detection model, an electronic device, and a storage medium, which improve detection accuracy and detection convenience.

One aspect of the present disclosure provides a target detection method, including: detecting a video frame to be processed in a video frame sequence to obtain predicted position information of a target object included in the video frame to be processed; in response to detecting that a target object included in the video frame to be processed is incomplete, acquiring target position information determined based on a previous video frame of the video frame to be processed in the video frame sequence; and in response to acquiring the target position information, correcting the predicted position information according to the target position information.

Another aspect of the present disclosure provides a training method for a target detection model, where the target detection model includes a feature extraction network, a position prediction network, and a classification network, and the training method includes: inputting a sample image including a target object into the feature extraction network to obtain a second image feature of the sample image, wherein the sample image has actual position information of the target object and actual classification information representing whether the target object is complete; inputting the second image feature into the position prediction network to obtain predicted position information of the target object; inputting the second image feature into the classification network to obtain predicted classification information representing whether the target object is complete; and training the target detection model based on the actual position information, the predicted position information, the actual classification information, and the predicted classification information.

Another aspect of the present disclosure provides an object detecting apparatus including: the first position prediction module is used for detecting a video frame to be processed in the video frame sequence to obtain the predicted position information of a target object included in the video frame to be processed; the target position acquisition module is used for responding to the fact that a target object included in the video frame to be processed is not complete, and acquiring target position information determined based on a previous video frame of the video frame to be processed in the video frame sequence; and the first position correction module is used for responding to the acquired target position information and correcting the predicted position information according to the target position information.

Another aspect of the present disclosure provides a training apparatus for a target detection model, wherein the target detection model includes a feature extraction network, a position prediction network, and a classification network. The training apparatus includes: a feature extraction module for inputting a sample image including a target object into the feature extraction network to obtain a second image feature of the sample image, wherein the sample image has actual position information of the target object and actual classification information representing whether the target object is complete; a second position prediction module for inputting the second image feature into the position prediction network to obtain predicted position information of the target object; a second classification module for inputting the second image feature into the classification network to obtain predicted classification information representing whether the target object is complete; and a model training module for training the target detection model based on the actual position information, the predicted position information, the actual classification information, and the predicted classification information.

Another aspect of the present disclosure provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection methods and/or the training methods of the object detection models provided by the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the object detection method and/or the training method of the object detection model provided by the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the object detection method and/or the training method of the object detection model provided by the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of an application scenario of a target detection method and a training method and apparatus for a target detection model according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow diagram of a target detection method according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the principle of updating target location information according to an embodiment of the present disclosure;

FIG. 4 is a schematic flow chart diagram of a target detection method according to another embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the principle of a target detection method according to an embodiment of the present disclosure;

FIG. 6 is a flow chart diagram of a method of training a target detection model according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a target detection apparatus according to an embodiment of the present disclosure;

FIG. 8 is a block diagram of a training apparatus for a target detection model according to an embodiment of the present disclosure; and

FIG. 9 is a block diagram of an electronic device for implementing the target detection method and/or the training method of the target detection model according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The present disclosure provides a target detection method, which includes a position prediction stage, a target position acquisition stage, and a position correction stage. In the position prediction stage, a video frame to be processed in the video frame sequence is detected, and the predicted position information of a target object included in the video frame to be processed is obtained. In the target position acquiring stage, in response to detecting that a target object included in the video frame to be processed is incomplete, target position information determined based on a previous video frame of the video frame to be processed in the video frame sequence is acquired. In the position correction phase, in response to the acquisition of the target position information, the predicted position information is corrected according to the target position information.

An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.

Fig. 1 is a schematic view of an application scenario of a target detection method and a training method and device of a target detection model according to an embodiment of the present disclosure.

As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be any electronic device with processing functionality, including but not limited to a smartphone, a tablet, a laptop, a desktop computer, a server, and so on.

The electronic device 110 may, for example, perform target detection on an input video frame 120 to obtain the category of a target object and the position information 130 of the target object in the video frame 120. For example, features of the video frame 120 may first be extracted; the outline of the target object and its position in the video frame 120 may then be identified from the extracted features; and the target object may finally be classified according to the identified outline.

For example, the classification result for classifying the target object may include a probability that the target object belongs to each of a plurality of predetermined classes in the video frame. The target object may include, for example, a vehicle, a water cup, a backpack, etc., which may have a variety of shape outlines.

In one embodiment, the video frame 120 may be the most recently captured video frame in a sequence of video frames captured in real time, and the target object may be any object that moves relative to the video capture device that captured the video frame. An object detection model may be employed to perform object detection on the video frame 120. During target detection, a partial region of the target object may have moved out of the field of view of the video capture device; that is, the target object in the video frame 120 may be incomplete. To improve the accuracy of the position information obtained in this case, the embodiment may improve the accuracy with which the target detection model detects an incomplete target object by improving the structure of the target detection model, optimizing the loss function used in training the target detection model, expanding the training samples used in training the target detection model, and the like.

According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further include a server 140. The electronic device 110 may be communicatively coupled to the server 140 via a network, which may include wireless or wired communication links.

Illustratively, the server 140 may be configured to train an object detection model, and send the trained object detection model 150 to the electronic device 110 in response to a model acquisition request sent by the electronic device 110, so as to facilitate object detection of the video frame by the electronic device 110. In an embodiment, the electronic device 110 may further transmit the video frames 120 to the server 140 through the network, and the server performs object detection on the obtained video frames according to the trained object detection model.

According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further include a database 160. The database 160 may maintain a large number of images, which may carry information indicating the actual positions of target objects in the images. The server 140 may access the database 160 and extract a portion of the images from the database 160 to train the object detection model.

It should be noted that the object detection method provided by the present disclosure may be executed by the electronic device 110 or the server 140. Accordingly, the object detection apparatus provided by the present disclosure may be disposed in the electronic device 110 or the server 140. The training method of the target detection model provided by the present disclosure may be performed by the server 140 or other servers communicatively connected to the server 140. Accordingly, the training apparatus of the target detection model provided by the present disclosure may be disposed in the server 140 or other servers communicatively connected to the server 140.

It should be understood that the number and type of electronic devices, servers, and databases in FIG. 1 are merely illustrative. There may be any number and type of electronic devices, servers, and databases, as the implementation requires.

The object detection method provided by the present disclosure will be described in detail below with reference to fig. 2 to 5.

Fig. 2 is a schematic flow diagram of a target detection method according to an embodiment of the present disclosure.

As shown in fig. 2, the object detection method 200 of this embodiment may include operations S210 to S230.

In operation S210, a video frame to be processed in a video frame sequence is detected, and predicted position information of a target object included in the video frame to be processed is obtained.

According to an embodiment of the present disclosure, the video frame sequence may be composed of a plurality of video frames sequentially arranged in the order of the capture time in a previously captured video segment. Or the video frames can be collected in real time, and the collected video frames are stored according to the collection sequence, so that the video frame sequence is obtained. The video frames in the video frame sequence can be acquired continuously or acquired by frame skipping. The embodiment can perform target detection on the video frames in the video frame sequence according to the arrangement order, so as to obtain the position information of the target object in each video frame.

According to the embodiment of the disclosure, the video frame to be processed may be a video frame newly stored in a video frame sequence, or may be a video frame with the earliest collection time among undetected video frames when a plurality of video frames are sequentially detected according to the arrangement order.

According to the embodiment of the present disclosure, a video frame to be processed may be input to a target detection model, which outputs a prediction detection frame of a target object. The center point position, the height, and the width of the prediction detection frame may be used as the predicted position information. The target detection model may be any one of multiple models, such as a Fast R-CNN model, a Single Shot MultiBox Detector (SSD) model, and a You Only Look Once (YOLO) detection model. It is to be understood that the above method of obtaining the predicted position information is only an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
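As a non-limiting illustration, the conversion of a prediction detection frame into predicted position information may be sketched as follows. The sketch assumes that the detector reports the frame by its corner coordinates in pixels; the names `Box` and `box_from_corners` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Position information: center point, height, and width (in pixels)."""
    cx: float
    cy: float
    h: float
    w: float


def box_from_corners(x_min: float, y_min: float,
                     x_max: float, y_max: float) -> Box:
    """Convert a corner-format detection frame into center/height/width form."""
    return Box(
        cx=(x_min + x_max) / 2.0,
        cy=(y_min + y_max) / 2.0,
        h=y_max - y_min,
        w=x_max - x_min,
    )


# Assumed example: a frame spanning x in [100, 300] and y in [50, 150].
predicted = box_from_corners(100.0, 50.0, 300.0, 150.0)
```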

In operation S220, in response to detecting that a target object included in the to-be-processed video frame is incomplete, target position information determined based on a previous video frame of the to-be-processed video frame in the video frame sequence is acquired.

According to an embodiment of the present disclosure, whether the target object is complete may be determined from the difference between the height-to-width ratio of the prediction detection frame in the predicted position information and a predetermined height-to-width ratio set for the target object. If this difference is within a predetermined range, the target object may be determined to be complete; otherwise, the target object is determined to be incomplete. The predetermined ratio and the predetermined range may be set according to actual requirements, which is not limited in this disclosure.
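The ratio-based completeness check described above may be sketched as follows. This is a minimal sketch; the function name `is_complete`, the symmetric tolerance used as the predetermined range, and the handling of degenerate frames are assumptions made for illustration.

```python
def is_complete(height: float, width: float,
                expected_ratio: float, tolerance: float) -> bool:
    """Return True if height/width is within `tolerance` of `expected_ratio`.

    `expected_ratio` plays the role of the predetermined ratio and
    `tolerance` the role of the predetermined range (both assumed values
    chosen by the implementer).
    """
    if width <= 0:
        return False  # assumption: a degenerate frame is treated as incomplete
    return abs(height / width - expected_ratio) <= tolerance
```

For instance, with an assumed predetermined ratio of 0.5 and tolerance of 0.1, a frame of height 100 and width 200 would be judged complete, whereas a frame of height 100 and width 120 (the object being cut off at the image border) would be judged incomplete.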

According to an embodiment of the present disclosure, the target position information may be the position information of the target object in a target video frame among the previous video frames. The target video frame may be the previous video frame that includes the complete target object and whose capture time differs least from that of the video frame to be processed. Alternatively, the target position information may be the average of the position information of the target object over all previous video frames that include the complete target object.

It is understood that when none of the previous video frames of the video frame to be processed includes the complete target object, the target position information is null; that is, the target position information cannot be acquired. So that the target position information can be acquired, in this embodiment, when target detection is performed on each video frame in the video frame sequence, the position information obtained and the information on whether the target object is complete are stored in a predetermined storage space.
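The two ways of determining the target position information from stored detection results may be sketched as follows. The sketch assumes that the predetermined storage space is an in-memory list ordered by capture time; the names `Box`, `Record`, `latest_complete`, and `mean_complete` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    cx: float
    cy: float
    h: float
    w: float


@dataclass(frozen=True)
class Record:
    """Stored result for one previous video frame."""
    box: Box
    complete: bool


def latest_complete(history: List[Record]) -> Optional[Box]:
    """Position from the most recent previous frame with a complete object."""
    for record in reversed(history):
        if record.complete:
            return record.box
    return None  # target position information is null


def mean_complete(history: List[Record]) -> Optional[Box]:
    """Average position over all previous frames with a complete object."""
    boxes = [r.box for r in history if r.complete]
    if not boxes:
        return None
    n = len(boxes)
    return Box(
        cx=sum(b.cx for b in boxes) / n,
        cy=sum(b.cy for b in boxes) / n,
        h=sum(b.h for b in boxes) / n,
        w=sum(b.w for b in boxes) / n,
    )
```

Both functions return `None` when no previous frame contains the complete target object, which corresponds to the case in which the target position information cannot be acquired.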

In operation S230, in response to acquiring the target position information, the predicted position information is corrected according to the target position information.

According to an embodiment of the present disclosure, after the target position information is acquired, the embodiment may replace the height and the width in the predicted position information with the height and the width of the target object in the target position information. The position of the center point in the predicted position information may then be adjusted based on the difference between the height in the target position information and the height in the predicted position information, and the difference between the width in the target position information and the width in the predicted position information.

For example, let the coordinates of the center point in the predicted position information be (x, y), the height and the width in the predicted position information be h1 and W1, and the height and the width in the target position information be h2 and W2. The coordinates of the center point obtained by correcting the predicted position information are then:

$$\left(x \pm \frac{|W_2 - W_1|}{2},\; y \pm \frac{|h_2 - h_1|}{2}\right)$$

Whether the sum or the difference of x and $|W_2 - W_1|/2$ is taken depends on the magnitude relation between the x value of the center point in the target position information and the x value of the center point in the predicted position information. If the x value of the center point in the target position information is smaller than the x value of the center point in the predicted position information, the sum of x and $|W_2 - W_1|/2$ is taken; otherwise, the difference between x and $|W_2 - W_1|/2$ is taken. Similarly, whether the sum or the difference of y and $|h_2 - h_1|/2$ is taken depends on the magnitude relation between the y value of the center point in the target position information and the y value of the center point in the predicted position information.

In an embodiment, when target detection is performed on the video frames in a sequence of video frames, a moving track of the target object may be drawn based on the predicted position information of the target object obtained since the target object was first detected, and a rectangular coordinate system may be established with an arbitrary position of the video frame as its origin, where the X axis of the rectangular coordinate system is parallel to the width direction of the video frame and the Y axis is parallel to the height direction of the video frame. If the target object moves along the positive direction of the X axis in the rectangular coordinate system, the sum of x and $|W_2 - W_1|/2$ is taken; if the target object moves along the positive direction of the Y axis in the rectangular coordinate system, the sum of y and $|h_2 - h_1|/2$ is taken.
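The correction of the predicted position information may be sketched as follows. This is a sketch under stated assumptions: the center point is shifted by half of the absolute width difference and half of the absolute height difference, which is the shift that keeps the visible edge of a truncated frame in place, and the sign is chosen by comparing center coordinates as described above. The names `Box` and `correct_box` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    cx: float
    cy: float
    h: float
    w: float


def correct_box(pred: Box, target: Box) -> Box:
    """Correct a truncated prediction using the target position information."""
    # Assumed offsets: half of the width/height difference.
    dx = abs(target.w - pred.w) / 2.0
    dy = abs(target.h - pred.h) / 2.0
    # Target center smaller than predicted center: take the sum,
    # otherwise take the difference.
    cx = pred.cx + dx if target.cx < pred.cx else pred.cx - dx
    cy = pred.cy + dy if target.cy < pred.cy else pred.cy - dy
    # Height and width are replaced by those of the target position information.
    return Box(cx=cx, cy=cy, h=target.h, w=target.w)


# Assumed example: a 640-pixel-wide frame; the object was last seen complete
# at x in [500, 620] and is now cut off at the right border, x in [580, 640].
target = Box(cx=560.0, cy=200.0, h=100.0, w=120.0)
pred = Box(cx=610.0, cy=200.0, h=100.0, w=60.0)
corrected = correct_box(pred, target)
```

In this assumed example the corrected frame is centered at x = 640 with width 120, that is, it spans x from 580 to 700, so its left edge coincides with the visible left edge of the truncated detection.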

According to the embodiment of the disclosure, before the moving track is drawn, it may be determined whether the target object has been detected in each of n consecutive video frames, and the moving track is drawn only if so. In this way, drawing a moving track for a target object that has never appeared completely in the video frame sequence can be avoided, which saves computing resources to a certain extent. This is because no target position information can be determined for a target object that has never appeared completely in the video frame sequence, so the predicted position information cannot be corrected in operation S230. Here, n may be an integer greater than 1, such as 3, which is not limited in this disclosure.
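The check performed before drawing the moving track may be sketched as follows. The function name `seen_in_last_n` and the representation of detection results as a list of per-frame Boolean flags are assumptions for illustration.

```python
from typing import Sequence


def seen_in_last_n(detected_flags: Sequence[bool], n: int = 3) -> bool:
    """Return True if the object was detected in each of the last n frames."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    return len(detected_flags) >= n and all(detected_flags[-n:])
```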

In summary, in the target detection method according to the embodiment of the present disclosure, when the target object is incomplete (i.e., truncated), the predicted position information is corrected according to the target position information determined from the previous video frames, so that the accuracy of the detected position of the target object can be improved. Compared with technical solutions in the related art that improve detection accuracy by improving the structure of the target detection model, optimizing the loss function used in training the target detection model, expanding the training samples used in training the target detection model, and the like, this method can improve detection accuracy to a certain extent while reducing the effort and time spent on improving it.

Fig. 3 is a schematic diagram of the principle of updating target location information according to an embodiment of the present disclosure.

According to the embodiment of the present disclosure, when it is detected that the target object included in the video frame to be processed is complete, the target position information may be updated based on the predicted position information, for example, so as to improve the accuracy of the target position information.

For example, the predicted location information may be used in place of the target location information to obtain updated target location information.

For example, as shown in FIG. 3, in this embodiment 300, a predetermined number of target video frames that include the complete target object may be selected from the previous video frames 311 to 314 of the video frame 315 to be processed in the video frame sequence 310. The selected target video frames are the video frames whose capture times are closest to that of the video frame 315 to be processed. The predetermined number may be, for example, 2, 5, 8, 10, or another positive integer set according to actual requirements. In an embodiment, the selected target video frames may include video frame 313 and video frame 314. The position information of the target object included in a target video frame may be the predicted position information of the target object obtained when target detection was performed on that target video frame. For video frame 313 and video frame 314, position information 321 and position information 322 may be obtained, respectively.

After the predetermined number of pieces of position information are obtained, the target position information may be updated based on the predetermined number of pieces of position information and the predicted position information. For example, the average of the predetermined number of pieces of position information and the predicted position information 323 may replace the target position information. As shown in fig. 3, the average of the height in the position information 321, the height in the position information 322, and the height in the predicted position information 323 may be calculated as the height in the target position information 330. Similarly, the average of the width in the position information 321, the width in the position information 322, and the width in the predicted position information 323 may be calculated as the width in the target position information 330. The coordinate values of the center point in the position information 321, in the position information 322, and in the predicted position information 323 may be averaged to obtain the coordinate value of the center point in the target position information 330.
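The averaging update of FIG. 3 may be sketched as follows. The sketch assumes that the position information of previous frames with a complete target object is available as a list ordered by capture time; the names `Box` and `update_target` and the parameter `k` (standing for the predetermined number) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    cx: float
    cy: float
    h: float
    w: float


def update_target(recent_complete: List[Box], predicted: Box, k: int) -> Box:
    """Average the last k complete-frame positions with the new prediction."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    window = recent_complete[-k:] + [predicted]
    n = len(window)
    return Box(
        cx=sum(b.cx for b in window) / n,
        cy=sum(b.cy for b in window) / n,
        h=sum(b.h for b in window) / n,
        w=sum(b.w for b in window) / n,
    )
```

With `k = 2`, the two most recent stored positions (corresponding to position information 321 and 322) are averaged with the new prediction (corresponding to predicted position information 323).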

According to the embodiment of the disclosure, when the video frame to be processed includes the complete target object, the target position information is updated by jointly considering the predicted position information and the position information of the target object in the previous video frames. This avoids the target position information becoming inaccurate because the position information predicted for a single video frame is inaccurate, and can therefore improve the precision of the target position information to a certain extent.

In an embodiment, the height and width of the prediction detection frame obtained by detecting a video frame in which the target object is complete may be fed into a mean filter. The mean filter processes them to obtain new target position information, with which the stored target position information is updated.

Fig. 4 is a schematic flow chart diagram of a target detection method according to another embodiment of the present disclosure.

According to the embodiment of the disclosure, the motion of a moving target object captured by the video capture device generally comprises a phase of gradually moving into the capture range of the video capture device, a phase of being completely within the capture range, and a phase of gradually moving out of the capture range. For video frames captured in the phase of gradually moving out of the capture range, the target position information may be used to correct the predicted position information. For video frames captured in the phase of gradually moving into the capture range, no target position information exists yet, so the predicted position information may be corrected by template matching.

Illustratively, as shown in fig. 4, in this embodiment, the object detection method 400 may include operations S410 to S460.

In operation S410, a video frame to be processed in a sequence of video frames is detected, and predicted position information of a target object included in the video frame to be processed is obtained. The method for implementing operation S410 is similar to the method for obtaining the predicted location information described above, and is not described herein again.

In operation S420, in response to detecting that a target object included in the to-be-processed video frame is incomplete, target position information determined based on a previous video frame of the to-be-processed video frame in the video frame sequence is acquired. The method for implementing operation S420 is similar to the method for acquiring the target location information described above, and is not described herein again.

In operation S430, it is determined whether target location information is acquired. If not, the operation S440 to the operation S450 are executed, otherwise, the operation S460 is executed.

In operation S440, a target template matching a target object included in the video frame to be processed among the predetermined object templates is determined.

According to the embodiment of the disclosure, after the predicted position information of the target object is obtained, the image in the area indicated by the predicted position information can be cropped from the video frame to be processed. The cropped image is then matched against the predetermined object templates, and the template with the highest matching degree with the cropped image is selected as the target template matching the target object. The matching degree between a predetermined object template and the cropped image can be determined by any one of a squared difference matching method, a correlation coefficient matching method, and a normalized correlation coefficient matching method.
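The selection of the target template may be sketched as follows, using the normalized correlation coefficient as the matching degree. This is a simplified sketch: it assumes grayscale patches given as nested lists and assumes that the cropped image has already been resized to the size of each template, so that no sliding-window search is needed. The names `ncc` and `best_template` are hypothetical.

```python
from math import sqrt
from typing import Dict, List

Patch = List[List[float]]  # grayscale pixel rows


def ncc(a: Patch, b: Patch) -> float:
    """Normalized correlation coefficient of two equally sized patches."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("patches must be non-empty and of equal size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs)
               * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den > 0 else 0.0  # flat patches carry no structure


def best_template(crop: Patch, templates: Dict[str, Patch]) -> str:
    """Name of the template with the highest matching degree."""
    return max(templates, key=lambda name: ncc(crop, templates[name]))
```

In practice an image-processing library would typically be used instead; for example, OpenCV's `cv2.matchTemplate` supports squared difference, correlation coefficient, and normalized correlation coefficient methods.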

In operation S450, the predicted position information is corrected based on the size of the target template.

According to the embodiments of the present disclosure, the height and width of the target template may be taken as the height and width in the corrected predicted position information, and the coordinate value of the center point in the predicted position information may be corrected according to the formula described above. It is to be understood that this implementation of operation S450 is merely an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.

In operation S460, the predicted position information is corrected according to the target position information. The method for implementing operation S460 is similar to the method for correcting the predicted location information according to the target location information, and is not described herein again.
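The branching of operations S430 to S460 may be sketched as follows. The two correction strategies are passed in as callables so that the sketch stays independent of how each is implemented; the function name `correct_prediction` and its parameters are hypothetical.

```python
from typing import Callable, Optional, TypeVar

B = TypeVar("B")  # type of the position information


def correct_prediction(
    pred: B,
    complete: bool,
    target: Optional[B],
    by_target: Callable[[B, B], B],
    by_template: Callable[[B], B],
) -> B:
    """Choose the correction branch for one video frame."""
    if complete:
        return pred  # complete object: no correction is needed
    if target is not None:
        return by_target(pred, target)  # operation S460
    return by_template(pred)  # operations S440 to S450
```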

By adopting the target detection method of the embodiment of the disclosure, the predicted position information can be corrected in the various situations in which the target object is incomplete. Compared with the related art, the target detection precision can thereby be improved.
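The branching among operations S430 to S460 can be sketched as follows. The box layout `(center_x, center_y, width, height)`, the function names, and the two callables passed in are assumptions made for illustration. In particular, the sketch leaves the center point unchanged in operation S450, because the center-point correction formula is given elsewhere in the disclosure and is not reproduced here.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)
Size = Tuple[float, float]               # (width, height)


def correct_with_template_size(predicted: Box, template_size: Size) -> Box:
    """Operation S450 (partial): adopt the width and height of the template.

    Assumption: the center point is kept as predicted in this sketch.
    """
    cx, cy, _, _ = predicted
    width, height = template_size
    return (cx, cy, width, height)


def correct_prediction(
    predicted: Box,
    target_position: Optional[Box],
    match_template: Callable[[Box], Size],
    correct_with_target: Callable[[Box, Box], Box],
) -> Box:
    """Correct the predicted position of an incomplete target object."""
    if target_position is None:
        # S430 answered "not acquired": fall back to template matching.
        template_size = match_template(predicted)                    # S440
        return correct_with_template_size(predicted, template_size)  # S450
    # S430 answered "acquired": use the target position information.
    return correct_with_target(predicted, target_position)           # S460
```

Passing the matching and correction steps as callables keeps the sketch independent of how the template is matched and of the specific correction formula.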

FIG. 5 is a schematic diagram of the principle of an object detection method according to an embodiment of the present disclosure.

According to the embodiment of the disclosure, the structure of the target detection model may be adjusted slightly, so that the adjusted target detection model outputs not only the predicted position information but also classification information indicating whether the target object is complete. In this way, the target detection efficiency can be improved, and the accuracy of correcting the predicted position information can be improved as well.

As shown in fig. 5, the target detection method of this embodiment 500 may employ a target detection model including a feature extraction network 510, a position prediction network 520, and a classification network 530. The feature extraction network 510 is used to extract a first image feature 502 from the video frame 501 to be processed. The position prediction network 520 is configured to predict the position of the target object to obtain predicted position information 503. The classification network 530 is used to classify whether the target object is complete, i.e., the classification network 530 outputs the classification information 504. In one embodiment, the target detection model may further include another classification network 540 for predicting the object category of the target object and outputting a class probability 505. The object category may be any one of a plurality of predetermined categories. In a traffic surveillance scenario, the plurality of predetermined categories may include pedestrians, non-motor vehicles, and the like.

Accordingly, when detecting the video frame 501 to be processed, the video frame 501 to be processed may be input into the feature extraction network 510, and the first image feature 502 of the video frame to be processed may be output by the feature extraction network 510. The first image feature 502 may then be input to the position prediction network 520, and the predicted position information 503 of the target object included in the video frame 501 to be processed may be output by the position prediction network 520. At the same time, the first image feature 502 may be input to the classification network 530, which outputs classification information 504 that characterizes whether the target object is complete. The classification information 504 may be, for example, a binary classification result, i.e., complete or incomplete.

In an embodiment, the first image feature 502 may also be input into the other classification network 540, and the class probability 505 may be output by the other classification network 540. The class probability 505 includes the probability that the target object belongs to each of the plurality of predetermined categories. The predetermined category corresponding to the maximum probability in the class probability 505 is determined to be the object category of the target object.

According to an embodiment of the present disclosure, the feature extraction network 510 may be, for example, a Feature Pyramid Network (FPN), a convolutional neural network, or the like. The position prediction network 520 may be composed of, for example, a fully connected layer and a regression layer. The classification network 530 may implement the binary classification using, for example, a logistic regression function. The other classification network 540 may be composed of, for example, a fully connected layer and a normalization layer. It is to be understood that the above-described structures of the respective networks are merely examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
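How the outputs of the two classification networks may be interpreted can be sketched as follows. The sketch assumes that the classification network 530 produces a single raw score passed through a logistic function, that the normalization layer of the other classification network 540 is a softmax, and that a threshold of 0.5 separates complete from incomplete. These choices, as well as the function names, are assumptions made for illustration.

```python
import math


def logistic(z):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpret_outputs(completeness_score, category_scores, categories, threshold=0.5):
    """Return (is_complete, object_category, class_probabilities)."""
    is_complete = logistic(completeness_score) >= threshold
    probs = softmax(category_scores)
    # The predetermined category with the maximum probability is selected.
    best = max(range(len(probs)), key=probs.__getitem__)
    return is_complete, categories[best], probs


result = interpret_outputs(-2.0, [0.1, 2.0], ["pedestrian", "non-motor vehicle"])
```

Here `result[0]` indicates whether the target object is treated as complete, and `result[1]` is the object category with the maximum class probability.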

The present disclosure also provides a training method of the target detection model, so that the trained model can be used to perform the target detection method described above. The training method will be described in detail below with reference to fig. 6.

Fig. 6 is a flow chart diagram of a training method of an object detection model according to an embodiment of the present disclosure.

As shown in fig. 6, the training method 600 of the target detection model of this embodiment may include operations S610 to S640. The target detection model comprises the feature extraction network, the position prediction network and the classification network.

In operation S610, a sample image including a target object is input to a feature extraction network, resulting in a second image feature of the sample image.

In operation S620, the second image feature is input to the position prediction network, resulting in predicted position information of the target object.

The sample image comprises actual position information of the target object and actual classification information representing whether the target object is complete or not. The implementation of operations S610 to S620 is similar to the implementation of operation S210 described above, and is not described herein again.

In operation S630, the second image feature is input into the classification network, and predicted classification information indicating whether the target object is complete is obtained. The predicted classification information is similar to the classification information described above, and the classification network is similar to the classification network 530 described above; neither is described in detail herein.

In operation S640, the target detection model is trained based on the actual location information, the predicted location information, the actual classification information, and the predicted classification information.

According to an embodiment of the present disclosure, a position loss may be determined based on the actual position information and the predicted position information, and a classification loss may be determined based on the actual classification information and the predicted classification information. The loss of the target detection model is then determined as a weighted sum of the position loss and the classification loss. Finally, the network weights in the target detection model are adjusted through a back propagation algorithm to train the target detection model. The position loss may be represented by the value of a loss function such as an L1 loss function, an L2 loss function, or a smooth L1 loss function. The classification loss may be represented by the value of a binary classification loss function such as a binary cross entropy loss function or a squared loss function.
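The weighted loss can be illustrated with a minimal sketch that uses a smooth L1 loss for the position and a binary cross entropy loss for the completeness classification. The function names, the `beta` parameter, and the default weights of 1.0 are assumptions made for illustration. The back propagation step is not shown.

```python
import math


def smooth_l1(predicted, actual, beta=1.0):
    """Mean smooth L1 loss over the coordinates of a box."""
    total = 0.0
    for p, a in zip(predicted, actual):
        d = abs(p - a)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(predicted)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def detection_loss(pred_box, actual_box, pred_complete_prob, actual_complete,
                   w_position=1.0, w_classification=1.0):
    """Weighted sum of the position loss and the classification loss."""
    position_loss = smooth_l1(pred_box, actual_box)
    classification_loss = binary_cross_entropy(pred_complete_prob, actual_complete)
    return w_position * position_loss + w_classification * classification_loss
```

The two weights control how strongly the completeness classification influences training relative to the position regression; their values would be chosen according to actual requirements.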

In one embodiment, the target detection model may further include another classification network similar to the other classification network 540 described above. In this case, the sample image should also include information indicating the actual object category of the target object. When training the target detection model, a category classification loss may also be determined according to the predicted class probability and the information on the actual object category. The weighted sum of the category classification loss, the position loss, and the classification loss is then taken as the loss of the target detection model. The category classification loss may be represented by the value of a multi-class loss function such as a cross entropy loss function.
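The three-term loss can be sketched as follows, assuming that the position loss and the classification loss have already been computed as scalars. The function names and the weight tuple are assumptions made for illustration.

```python
import math


def cross_entropy(class_probs, actual_index, eps=1e-7):
    """Cross entropy for one sample, given the index of the actual category."""
    return -math.log(max(class_probs[actual_index], eps))


def total_loss(position_loss, classification_loss, category_loss,
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the position, completeness, and category losses."""
    w_position, w_classification, w_category = weights
    return (w_position * position_loss
            + w_classification * classification_loss
            + w_category * category_loss)


# Example: the second predetermined category is the actual one.
category_loss = cross_entropy([0.25, 0.75], actual_index=1)
loss = total_loss(0.5, 0.2, category_loss, weights=(1.0, 2.0, 0.5))
```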

Based on the target detection method provided by the disclosure, the disclosure also provides a target detection device. The apparatus will be described in detail below with reference to fig. 7.

Fig. 7 is a block diagram of a structure of an object detection apparatus according to an embodiment of the present disclosure.

As shown in fig. 7, the object detection apparatus 700 of this embodiment includes a first position prediction module 710, an object position acquisition module 720, and a first position correction module 730.

The first position prediction module 710 is configured to detect a video frame to be processed in a sequence of video frames, and obtain predicted position information of a target object included in the video frame to be processed. In an embodiment, the first position prediction module 710 may be configured to perform the operation S210 described above, which is not described herein again.

The target position obtaining module 720 is configured to, in response to detecting that the target object included in the to-be-processed video frame is incomplete, obtain target position information determined based on a previous video frame of the to-be-processed video frame in the video frame sequence. In an embodiment, the target position obtaining module 720 may be configured to perform the operation S220 described above, which is not described herein again.

The first position correction module 730 is configured to, in response to obtaining the target position information, correct the predicted position information according to the target position information. In an embodiment, the first position correcting module 730 can be configured to perform the operation S230 described above, which is not described herein again.

According to an embodiment of the present disclosure, the target detection apparatus 700 may further include a target location updating module, configured to update the target location information based on the predicted location information in response to detecting that the target object included in the video frame to be processed is complete.

According to an embodiment of the present disclosure, the target position updating module may include a position obtaining sub-module and a position updating sub-module. The position obtaining sub-module is configured to obtain, for each of a predetermined number of target video frames that are among the previous video frames and include a complete target object, the position information of the target object included in that target video frame, thereby obtaining a predetermined number of pieces of position information. The position updating sub-module is configured to update the target position information based on the predetermined number of pieces of position information and the predicted position information.
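One way the two sub-modules could cooperate is sketched below. The sketch assumes that the target position information is the element-wise mean of the stored positions and the current predicted position, and that positions are `(center_x, center_y, width, height)` tuples. The class name, the averaging rule, and the box layout are assumptions made for illustration.

```python
from collections import deque


class TargetPositionTracker:
    """Keeps the positions from the most recent frames with a complete target."""

    def __init__(self, history_size):
        # history_size plays the role of the predetermined number.
        self._history = deque(maxlen=history_size)
        self.target_position = None

    def update(self, predicted_box):
        """Call when the target object in the current frame is complete."""
        samples = list(self._history) + [predicted_box]
        count = len(samples)
        # Assumed rule: element-wise mean over stored and current positions.
        self.target_position = tuple(
            sum(box[i] for box in samples) / count for i in range(4)
        )
        self._history.append(predicted_box)
        return self.target_position
```

Because the deque has a fixed maximum length, the oldest position is discarded automatically once the predetermined number of positions has been stored.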

According to an embodiment of the present disclosure, the target detection apparatus 700 may further include a template matching module and a second position correction module. The template matching module is configured to, in response to the target position information not being acquired, determine, from among the predetermined object templates, a target template that matches the target object included in the video frame to be processed. The second position correction module is configured to correct the predicted position information based on the size of the target template.

According to an embodiment of the present disclosure, the first position prediction module 710 may include a feature extraction sub-module and a position prediction sub-module. The feature extraction sub-module is configured to input the video frame to be processed into the feature extraction network included in the target detection model to obtain a first image feature of the video frame to be processed. The position prediction sub-module is configured to input the first image feature into the position prediction network included in the target detection model to obtain the predicted position information of the target object included in the video frame to be processed. In this embodiment, the target detection apparatus 700 may further include a first classification module, configured to input the first image feature into the classification network included in the target detection model to obtain classification information indicating whether the target object is complete.

Based on the training method of the target detection model provided by the disclosure, the disclosure also provides a training device of the target detection model. The apparatus will be described in detail below with reference to fig. 8.

Fig. 8 is a block diagram of a structure of a training apparatus of an object detection model according to an embodiment of the present disclosure.

As shown in fig. 8, the training apparatus 800 of the target detection model of this embodiment may include a feature extraction module 810, a second position prediction module 820, a second classification module 830, and a model training module 840. The target detection model may include a feature extraction network, a location prediction network, and a classification network, among others.

The feature extraction module 810 is configured to input a sample image including a target object into a feature extraction network, so as to obtain a second image feature of the sample image. The sample image comprises actual position information of the target object and actual classification information representing whether the target object is complete or not. In an embodiment, the feature extraction module 810 may be configured to perform the operation S610 described above, which is not described herein again.

The second location prediction module 820 is configured to input the second image feature into a location prediction network to obtain predicted location information of the target object. In an embodiment, the second location prediction module 820 may be configured to perform the operation S620 described above, which is not described herein again.

The second classification module 830 is configured to input the second image feature into a classification network, so as to obtain prediction classification information indicating whether the target object is complete. In an embodiment, the second classification module 830 may be configured to perform the operation S630 described above, which is not described herein again.

The model training module 840 is configured to train the target detection model based on the actual location information, the predicted location information, the actual classification information, and the predicted classification information. In an embodiment, the model training module 840 may be configured to perform the operation S640 described above, which is not described herein again.

In the technical solution of the present disclosure, the acquisition, collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the target detection method and/or the training method of the target detection model of the embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the target detection method and/or the training method of the target detection model. For example, in some embodiments, the target detection method and/or the training method of the target detection model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the target detection method and/or the training method of the target detection model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the target detection method and/or the training method of the target detection model.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of target detection, comprising:

detecting a video frame to be processed in a video frame sequence to obtain predicted position information of a target object included in the video frame to be processed;

in response to detecting that a target object included in the video frame to be processed is incomplete, acquiring target position information determined based on a previous video frame of the video frame to be processed in the video frame sequence; and

in response to acquiring the target position information, correcting the predicted position information according to the target position information.

2. The method of claim 1, further comprising:

in response to detecting that a target object included in the video frame to be processed is complete, updating the target position information based on the predicted position information.

3. The method of claim 2, wherein updating the target location information based on the predicted location information comprises:

acquiring, for each of a predetermined number of target video frames that are among the previous video frames and include a complete target object, position information of the target object included in the target video frame, to obtain a predetermined number of pieces of position information; and

updating the target location information based on the predetermined number of location information and the predicted location information.

4. The method of claim 1, further comprising:

in response to the target position information not being acquired, determining, from among predetermined object templates, a target template that matches a target object included in the video frame to be processed; and

correcting the predicted position information based on the size of the target template.

5. The method of claim 1, wherein the detecting a video frame to be processed in a sequence of video frames and obtaining predicted position information of a target object included in the video frame to be processed comprises:

inputting the video frame to be processed into a feature extraction network included in a target detection model to obtain a first image feature of the video frame to be processed; and

inputting the first image characteristics into a position prediction network included in the target detection model to obtain predicted position information of a target object included in the video frame to be processed;

wherein the method further comprises: inputting the first image characteristics into a classification network included in the target detection model to obtain classification information representing whether the target object is complete or not.

6. A training method of a target detection model, wherein the target detection model comprises a feature extraction network, a position prediction network and a classification network; the method comprising:

inputting a sample image comprising a target object into a feature extraction network to obtain a second image feature of the sample image, wherein the sample image comprises actual position information of the target object and actual classification information representing whether the target object is complete;

inputting the second image characteristics into the position prediction network to obtain predicted position information of the target object;

inputting the second image characteristics into the classification network to obtain prediction classification information representing whether the target object is complete or not; and

training the target detection model based on the actual position information, the predicted position information, the actual classification information, and the predicted classification information.

7. An object detection device comprising:

a first position prediction module, configured to detect a video frame to be processed in a video frame sequence to obtain predicted position information of a target object included in the video frame to be processed;

a target position obtaining module, configured to, in response to detecting that a target object included in the to-be-processed video frame is incomplete, obtain target position information determined based on a previous video frame of the to-be-processed video frame in the sequence of video frames; and

a first position correction module, configured to, in response to acquiring the target position information, correct the predicted position information according to the target position information.

8. The apparatus of claim 7, further comprising:

a target position updating module, configured to, in response to detecting that a target object included in the video frame to be processed is complete, update the target position information based on the predicted position information.

9. The apparatus of claim 8, wherein the target location update module comprises:

a position obtaining sub-module, configured to obtain, for a predetermined number of target video frames including a complete target object in the previous video frame, position information of the target object included in the target video frame, to obtain a predetermined number of position information; and

a location update submodule for updating the target location information based on the predetermined number of location information and the predicted location information.

10. The apparatus of claim 7, further comprising:

a template matching module, configured to, in response to the target position information not being acquired, determine, from among predetermined object templates, a target template that matches a target object included in the video frame to be processed; and

a second position correction module for correcting the predicted position information based on the size of the target template.

11. The apparatus of claim 7, wherein the first location prediction module comprises:

a feature extraction sub-module, configured to input the video frame to be processed into a feature extraction network included in a target detection model to obtain a first image feature of the video frame to be processed; and

a position prediction sub-module, configured to input the first image feature into a position prediction network included in the target detection model to obtain predicted position information of a target object included in the video frame to be processed;

wherein the apparatus further comprises a first classification module, configured to input the first image feature into a classification network included in the target detection model to obtain classification information representing whether the target object is complete or not.

12. A training device of a target detection model, wherein the target detection model comprises a feature extraction network, a position prediction network and a classification network; the device comprising:

a feature extraction module, configured to input a sample image comprising a target object into the feature extraction network to obtain a second image feature of the sample image, wherein the sample image comprises actual position information of the target object and actual classification information representing whether the target object is complete;

a second position prediction module, configured to input the second image feature into the position prediction network to obtain predicted position information of the target object;

a second classification module, configured to input the second image feature into the classification network to obtain predicted classification information representing whether the target object is complete or not; and

a model training module, configured to train the target detection model based on the actual position information, the predicted position information, the actual classification information and the predicted classification information.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.

15. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 6.