CN112749590B - Object detection method, device, computer equipment and computer readable storage medium - Google Patents

Object detection method, device, computer equipment and computer readable storage medium

Info

Publication number
CN112749590B
Authority
CN
China
Prior art keywords
frame, detection, candidate, frames, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911047482.5A
Other languages
Chinese (zh)
Other versions
CN112749590A (en)
Inventor
范钊宣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN201911047482.5A
Publication of CN112749590A
Application granted
Publication of CN112749590B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Abstract

The application discloses a target detection method, a target detection device, computer equipment and a computer readable storage medium, and belongs to the technical field of computer vision. The method comprises the following steps: determining a plurality of detection frames in a detection image in a video according to a target detection model; determining a plurality of prediction frames in the detection image according to a target frame in each of the first n images adjacent to the detection image in the video, wherein the target frame indicates an area where a target exists, and n is an integer greater than or equal to 2; and selecting a target frame from the plurality of detection frames in the detection image according to the plurality of prediction frames in the detection image. The method and device can avoid erroneous filtering of detection frames caused by target occlusion, target crowding, and the like, and thereby improve the target detection rate.

Description

Object detection method, device, computer equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method, an apparatus, a computer device, and a computer-readable storage medium.
Background
Target detection is one of the important research subjects in the field of computer vision and has wide application value in intelligent traffic systems, intelligent monitoring systems, and the like. Target detection refers to detecting a target in a complex scene image so that subsequent processing such as classification or tracking can be performed on the target. In target detection, a target frame is generally extracted from the image to be detected, and the target frame indicates an area of the image in which a target exists.
At present, when a target is detected, an image to be detected is generally input into a target detection model, and the target detection model outputs the size, position, and confidence of each of a plurality of detection frames in the image, where the confidence of each detection frame indicates the probability that a target exists in the area corresponding to that detection frame. The detection frames are then filtered with a Non-maximum suppression (NMS) algorithm according to the degree of coincidence between the detection frames and the confidence of each detection frame, and the retained detection frames are determined as target frames.
However, in a complex scene image with target occlusion, target crowding, and the like, the detection frames of two different targets may overlap heavily. If the detection frames are filtered directly according to the degree of coincidence between them, the detection frame of one of the two targets may be mistakenly filtered out, which seriously affects the target detection rate.
Disclosure of Invention
The application provides a target detection method, a target detection device, computer equipment and a computer readable storage medium, which can solve the problem of a low target detection rate in the related art. The technical scheme is as follows:
in one aspect, a method for detecting an object is provided, where the method includes:
determining a plurality of detection frames in a detection image in a video according to a target detection model;
determining a plurality of prediction frames in the detected image according to a target frame in each image of the first n images adjacent to the detected image in the video, wherein the target frame is used for indicating an area where a target exists, and n is an integer greater than or equal to 2;
and selecting a target frame from the plurality of detection frames in the detection image according to the plurality of prediction frames in the detection image.
Optionally, the determining, according to the target detection model, a plurality of detection frames in a detection image in a video includes:
inputting a detection image in a video to a target detection model, and outputting the size, the position and the confidence of each detection frame in a plurality of detection frames in the detection image by the target detection model, wherein the confidence of each detection frame is used for indicating the probability of the target existing in the region corresponding to each detection frame.
Optionally, the determining, according to a target frame in each of the first n images adjacent to the detected image in the video, a plurality of prediction frames in the detected image includes:
acquiring a first candidate frame from a target frame in each image in the first n images to obtain n first candidate frames, wherein the same target exists in a region corresponding to each first candidate frame in the n first candidate frames;
determining the position of a prediction frame in the detection image according to the position of each first candidate frame in the n first candidate frames;
determining the size of the one prediction frame in the detected image according to the size of each of the n first candidate frames.
Optionally, the determining a position of a prediction frame in the detected image according to a position of each of the n first candidate frames includes:
for any one first candidate frame in the n first candidate frames, constructing a first coordinate point corresponding to the one first candidate frame by taking the playing time of the image to which the one first candidate frame belongs as the abscissa of the first coordinate point corresponding to the one first candidate frame and taking the abscissa of the position of the one first candidate frame as the ordinate of the first coordinate point corresponding to the one first candidate frame; taking the playing time of the image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame, and taking the ordinate of the position of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame, so as to construct the second coordinate point corresponding to the first candidate frame;
performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; performing curve fitting on n second coordinate points corresponding to the n first candidate frames to obtain a second curve;
and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the first curve as the abscissa of the position of the prediction frame in the detection image, and the ordinate of the coordinate point corresponding to the playing time of the detection image in the second curve as the ordinate of the position of the prediction frame in the detection image, so as to obtain the position of the prediction frame in the detection image.
Optionally, the selecting a target frame from a plurality of detection frames in the detection image according to a plurality of prediction frames in the detection image includes:
acquiring a first candidate frame from a plurality of detection frames in the detection image as a target frame according to the coincidence degree between the plurality of prediction frames and the plurality of detection frames in the detection image and the confidence coefficient of each detection frame, wherein the confidence coefficient of each detection frame is used for indicating the probability that a target exists in a region corresponding to each detection frame;
according to the coincidence degree between a plurality of first candidate frames and a plurality of second candidate frames in the detection image, filtering the plurality of second candidate frames in the detection image, and taking the reserved second candidate frames as third candidate frames, wherein the second candidate frames are detection frames except the first candidate frames in the plurality of detection frames in the detection image;
and filtering the plurality of third candidate frames in the detection image according to the coincidence degree of the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
Optionally, the acquiring a first candidate frame from a plurality of detection frames in the detection image according to the degree of coincidence between the plurality of prediction frames and the plurality of detection frames in the detection image and the confidence of each detection frame includes:
selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until the following operations are executed on each prediction frame in the plurality of prediction frames in the detection image:
for any one of the detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the detection frame by the confidence coefficient of the detection frame to obtain a target value of the detection frame;
and determining, among the detection frames in the detection image whose target values are greater than the reference value, the detection frame with the largest target value as the first candidate frame corresponding to the selected prediction frame.
Optionally, the filtering the plurality of second candidate frames in the detection image according to the degree of coincidence between the plurality of first candidate frames and the plurality of second candidate frames in the detection image includes:
and deleting a second candidate frame of which the coincidence degree with any one of the plurality of first candidate frames in the detection image is greater than or equal to the first coincidence degree from the plurality of second candidate frames in the detection image.
Optionally, the filtering, according to a degree of coincidence between a plurality of third candidate frames in the detected image and a confidence of each third candidate frame, the plurality of third candidate frames in the detected image includes:
selecting a third candidate frame with the maximum confidence degree from a plurality of third candidate frames in the detection image as a target candidate frame;
determining a third candidate frame except the target candidate frame in a plurality of third candidate frames in the detection image as a fourth candidate frame;
deleting a fourth candidate frame, of the fourth candidate frames in the detection image, of which the degree of coincidence with the target candidate frame is greater than or equal to the second degree of coincidence;
judging whether the number of fourth candidate frames in the detection image is 1 or not;
when the number of the fourth candidate frames in the detection image is not 1, selecting the fourth candidate frame with the highest confidence degree from the fourth candidate frames in the detection image as a target candidate frame, and returning to the step of determining the third candidate frames except the target candidate frame in the plurality of third candidate frames in the detection image as the fourth candidate frames; and when the number of the fourth candidate frames in the detection image is 1, ending the operation.
In one aspect, an object detection apparatus is provided, the apparatus comprising:
the first determining module is used for determining a plurality of detection frames in a detection image in a video according to the target detection model;
a second determining module, configured to determine a plurality of prediction frames in the detected image according to a target frame in each of n previous images adjacent to the detected image in the video, where the target frame is used to indicate a region where a target exists, and n is an integer greater than or equal to 2;
and the selecting module is used for selecting a target frame from the plurality of detection frames in the detection image according to the plurality of prediction frames in the detection image.
Optionally, the first determining module includes:
the output unit is used for inputting a detection image in a video to a target detection model, and outputting the size, the position and the confidence coefficient of each detection frame in a plurality of detection frames in the detection image by the target detection model, wherein the confidence coefficient of each detection frame is used for indicating the probability of the target existing in the corresponding area of each detection frame.
Optionally, the second determining module includes:
a first obtaining unit, configured to obtain first candidate frames from a target frame in each of the first n images to obtain n first candidate frames, where a same target exists in a region corresponding to each of the n first candidate frames;
a first determining unit configured to determine a position of a prediction frame in the detection image according to a position of each of the n first candidate frames;
a second determining unit, configured to determine the size of the one prediction frame in the detected image according to the size of each of the n first candidate frames.
Optionally, the first determining unit is configured to:
for any one first candidate frame in the n first candidate frames, constructing a first coordinate point corresponding to the one first candidate frame by taking the playing time of the image to which the one first candidate frame belongs as the abscissa of the first coordinate point corresponding to the one first candidate frame and taking the abscissa of the position of the one first candidate frame as the ordinate of the first coordinate point corresponding to the one first candidate frame; constructing a second coordinate point corresponding to the first candidate frame by taking the playing time of the image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame and taking the ordinate of the position of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame;
performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; performing curve fitting on n second coordinate points corresponding to the n first candidate frames to obtain a second curve;
and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the first curve as the abscissa of the position of the prediction frame in the detection image, and the ordinate of the coordinate point corresponding to the playing time of the detection image in the second curve as the ordinate of the position of the prediction frame in the detection image, so as to obtain the position of the prediction frame in the detection image.
Optionally, the selecting module includes:
a second obtaining unit, configured to obtain, as a target frame, a first candidate frame from the multiple detection frames in the detection image according to coincidence degrees between the multiple prediction frames and the multiple detection frames in the detection image and a confidence level of each detection frame, where the confidence level of each detection frame is used to indicate a probability that a target exists in an area corresponding to each detection frame;
a first filtering unit, configured to filter a plurality of second candidate frames in the detection image according to degrees of coincidence between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, and use a remaining second candidate frame as a third candidate frame, where the second candidate frame is a detection frame other than the first candidate frame in the plurality of detection frames in the detection image;
and the second filtering unit is used for filtering the plurality of third candidate frames in the detection image according to the coincidence degree among the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
Optionally, the second obtaining unit is configured to:
selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until executing the following operations on each prediction frame in the plurality of prediction frames in the detection image:
for any one of a plurality of detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the one detection frame by the confidence coefficient of the one detection frame to obtain a target value of the one detection frame;
and determining, among the detection frames in the detection image whose target values are greater than the reference value, the detection frame with the largest target value as the first candidate frame corresponding to the selected prediction frame.
Optionally, the first filtering unit is configured to:
and deleting second candidate frames, of the plurality of second candidate frames in the detection image, of which the degree of coincidence with any first candidate frame in the plurality of first candidate frames in the detection image is greater than or equal to the first degree of coincidence.
Optionally, the second filtering unit is configured to:
selecting a third candidate frame with the highest confidence degree from a plurality of third candidate frames in the detection image as a target candidate frame;
determining a third candidate frame except the target candidate frame in a plurality of third candidate frames in the detection image as a fourth candidate frame;
deleting a fourth candidate frame with the degree of coincidence with the target candidate frame being greater than or equal to the second degree of coincidence from among fourth candidate frames in the detection image;
judging whether the number of fourth candidate frames in the detection image is 1 or not;
when the number of the fourth candidate frames in the detection image is not 1, selecting a fourth candidate frame with the highest confidence level from the fourth candidate frames in the detection image as a target candidate frame, and returning to the step of determining third candidate frames except the target candidate frame from the plurality of third candidate frames in the detection image as fourth candidate frames; and when the number of the fourth candidate frames in the detection image is 1, ending the operation.
In one aspect, a computer device is provided, where the computer device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus, the memory is used to store a computer program, and the processor is used to execute the program stored in the memory to implement the steps of the object detection method.
In one aspect, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the object detection method described above.
In one aspect, a computer program product comprising instructions is provided, which when run on a computer, causes the computer to perform the steps of the object detection method described above.
The technical scheme provided by the application can bring at least the following beneficial effects:
after a plurality of detection frames in a detection image in a video are determined according to a target detection model, a plurality of prediction frames in the detection image are determined according to a target frame in each of the first n images adjacent to the detection image in the video. Since the plurality of prediction frames in the detection image are determined according to the historical tracking information of the targets appearing in the previous n images, targets are likely to exist near the areas indicated by the plurality of prediction frames in the detection image. A target frame is then selected from the plurality of detection frames in the detection image according to the plurality of prediction frames, so that erroneous filtering of detection frames caused by target occlusion, target crowding, and the like can be avoided, and the target detection rate can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of a target detection method provided in an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating the position of a first candidate frame according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a first curve provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of a second curve provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of a target detection process provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of a tracking prediction module according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a non-maximum suppression module according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of another object detection apparatus provided in an embodiment of the present application;
Fig. 10 is a schematic structural diagram of another object detection apparatus provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario provided by the embodiments of the present application will be explained.
The target detection method provided by the embodiment of the application can be applied to a scene of video analysis. Of course, the target detection method provided in the embodiment of the present application may also be applied to other scenarios, which is not limited in the embodiment of the present application.
A video often includes a plurality of targets, and these targets are usually in motion. When analyzing a video, people often want to locate these targets, that is, to perform target detection on the video. At present, a target detection model is commonly used for target detection. Specifically, a video image of the video is input into the target detection model, and the target detection model outputs the size, position, and confidence of each of a plurality of detection frames in the video image. Since the plurality of detection frames output by the target detection model often contain redundant detection frames, an NMS algorithm is usually used to filter out the redundant detection frames.
In the related art, the NMS algorithm is specifically as follows: (1) Select the detection frame with the highest confidence from the plurality of detection frames in the video image as the first reference frame. (2) Determine the detection frames other than the first reference frame among the plurality of detection frames in the video image as second reference frames. (3) Delete the second reference frames whose degree of coincidence with the first reference frame is greater than or equal to the reference degree of coincidence. (4) Determine whether the number of second reference frames remaining in the video image is 1. (5) When the number of second reference frames in the video image is not 1, select the second reference frame with the highest confidence as the new first reference frame and return to step (2); when the number of second reference frames in the video image is 1, end the operation.
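For concreteness, a minimal sketch of this related-art filtering loop is given below, assuming frames are represented as (center_x, center_y, width, height) tuples and that reference_overlap stands for the preset reference degree of coincidence; the representation and the 0.7 threshold are illustrative assumptions, not values fixed by the text.

```python
def iou(box_a, box_b):
    """Degree of coincidence (Intersection over Union) of two (cx, cy, w, h) frames."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, reference_overlap=0.7):
    """Related-art NMS: repeatedly keep the most confident remaining frame and
    delete the frames whose coincidence with it reaches reference_overlap."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                      # step (1): highest confidence
        kept.append(best)
        remaining = [i for i in remaining            # steps (2)-(3): delete heavy overlaps
                     if iou(boxes[best], boxes[i]) < reference_overlap]
        if len(remaining) == 1:                      # steps (4)-(5): one frame left, keep it
            kept.append(remaining.pop(0))
    return kept                                      # indices of retained frames
```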
However, in a complex scene image with target occlusion, target crowding, and the like, the detection frames of two different targets are very likely to overlap. If the above NMS algorithm is directly used to filter the plurality of detection frames in the video image, the detection frame of one of the two targets is very likely to be mistakenly filtered out, which seriously affects the target detection rate.
Therefore, the embodiment of the present application provides a target detection method, which may determine a prediction frame in a detected image through a target frame in the first n images adjacent to the detected image in a video, and filter a plurality of detection frames in the detected image according to the prediction frame in the detected image, so as to reduce the situation of false filtering and improve the target detection rate.
Next, a detailed explanation will be given of the target detection method provided in the embodiments of the present application.
Fig. 1 is a flowchart of a target detection method according to an embodiment of the present application. Referring to fig. 1, the method includes the following steps.
Step 101: and determining a plurality of detection frames in the detection image in the video according to the target detection model.
It should be noted that the video may be a file storing a dynamic image, and the video may be a video captured by a terminal, such as a video captured by a camera, which is not limited in this embodiment of the present application. The detection image may be an image that needs to be subjected to object detection in the video, for example, the detection image may be an image currently playing in the video, and various objects such as people, animals, vehicles, and the like may appear in the detection image, which is not limited in this embodiment of the application.
In addition, the object detection model may be a pre-trained model capable of determining a detection frame in which an object appearing in the image is located, for example, the object detection model may be CNN (Convolutional Neural Network), or the like.
Furthermore, the detection frame indicates an area where a target may exist, and this area is generally a rectangular area. The plurality of detection frames determined in the detection image according to the target detection model constitute the original detection result of the detection image.
Specifically, the operation of step 101 may be: inputting a detection image in a video into a target detection model, and outputting the size, the position and the confidence coefficient of each detection frame in a plurality of detection frames in the detection image by the target detection model, wherein the confidence coefficient of each detection frame is used for indicating the probability of the target existing in the region corresponding to each detection frame.
It should be noted that, after the detection image is input into the target detection model, the target detection model may predict the size and the position of the detection frame corresponding to the region in which the target may exist in the detection image, determine the probability that the target exists in the region corresponding to each detection frame, and then output the size, the position, and the confidence of each detection frame. The size of the detection frame may be the length and width of the detection frame, and the position of the detection frame may be the coordinates of the center point of the detection frame.
In addition, the area corresponding to the detection frame is the area indicated by the detection frame where the target may exist. The higher the confidence of the detection frame, the higher the probability that the target exists in the region corresponding to the detection frame. The smaller the confidence of the detection frame is, the smaller the probability that the target exists in the region corresponding to the detection frame is. The confidence may be a numerical value greater than or equal to 0 and less than or equal to 1.
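By way of illustration only, one possible way to represent such a per-frame output is sketched below; the field names and values are assumptions for illustration, since the text does not prescribe a data format or a particular detection model.

```python
from dataclasses import dataclass

@dataclass
class DetectionFrame:
    cx: float     # abscissa of the frame's center point (position)
    cy: float     # ordinate of the frame's center point (position)
    w: float      # frame width (size)
    h: float      # frame height (size)
    score: float  # confidence in [0, 1] that a target exists in the frame's region

# hypothetical output of the target detection model for one detection image
detections = [
    DetectionFrame(cx=120.0, cy=80.0, w=40.0, h=90.0, score=0.92),
    DetectionFrame(cx=126.0, cy=83.0, w=38.0, h=88.0, score=0.85),
]
```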
Step 102: and determining a plurality of prediction frames in the detected image according to the target frame in each image of the first n images adjacent to the detected image in the video, wherein n is an integer greater than or equal to 2.
It should be noted that the target frame is used to indicate an area where a target exists, and the area may be a rectangular area. The first n images are images that have already been subjected to target detection; for example, they may be images that were played before the detection image, and thus a target frame has already been determined in each of the first n images.
In addition, for an object appearing in the first n images, the first n images all contain an object frame where the object is located. For the target, the target frame in which the target is located, which is included in the first n images, is the historical tracking information of the target.
Further, the plurality of prediction frames in the detection image are frames, determined according to the historical tracking information of the targets appearing in the first n images, that indicate areas where targets are very likely to exist; each such area may be a rectangular area.
Specifically, the operation of step 102 may be: acquiring a first candidate frame from the target frame in each image in the first n images to obtain n first candidate frames; determining the position of a prediction frame in the detection image according to the position of each first candidate frame in the n first candidate frames; the size of the one prediction frame in the detected image is determined based on the size of each of the n first candidate frames.
It should be noted that the n first candidate frames are target frames in which an object appears in the previous n images, the object corresponds to the n first candidate frames, and the n first candidate frames are history tracking information of the object. For example, n is 5, and the target 1 appears in the first 5 images adjacent to the detected image in the video, the target frame in which the target 1 is located in each of the first 5 images may be determined as a first candidate frame, so as to obtain 5 first candidate frames corresponding to the target 1, where the 5 first candidate frames are historical tracking information of the target 1.
In addition, since the n first candidate frames are history tracking information of the target, the position of the prediction frame in which the target is located in the detected image can be predicted according to the position of each first candidate frame in the n first candidate frames, and the size of the prediction frame in which the target is located in the detected image can be predicted according to the size of each first candidate frame in the n first candidate frames, so that the prediction frame in which the target is located in the detected image can be determined.
It is noted that, for the plurality of targets appearing in the first n images, each of these targets may correspond to n first candidate frames. In this case, for each of the plurality of targets, the prediction frame in which the target is located in the detection image may be predicted according to the n first candidate frames corresponding to that target. In this way, the prediction frame in which each of the plurality of targets is located in the detection image can be predicted, so that a plurality of prediction frames in the detection image are obtained.
When the size of the one prediction frame in the detected image is determined according to the size of each first candidate frame in the n first candidate frames, the average size of the n first candidate frames may be determined as the size of the one prediction frame in the detected image; alternatively, the size of the first candidate frame in the previous image adjacent to the detected image among the n first candidate frames may be determined as the size of the one prediction frame in the detected image. Of course, the size of the one prediction frame in the detected image may also be determined by other ways according to the size of each first candidate frame in the n first candidate frames, which is not limited in this embodiment of the application.
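Either option can be written in a few lines; the sketch below assumes each first candidate frame's size is stored as a (width, height) pair, ordered from the oldest image to the image adjacent to the detection image.

```python
def predict_frame_size(candidate_sizes, use_average=True):
    """candidate_sizes: (w, h) of each of the n first candidate frames,
    ordered from oldest to most recent."""
    if use_average:
        n = len(candidate_sizes)
        return (sum(w for w, _ in candidate_sizes) / n,
                sum(h for _, h in candidate_sizes) / n)
    # otherwise reuse the size from the previous image adjacent to the detection image
    return candidate_sizes[-1]
```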
Wherein, the operation of determining the position of a prediction frame in the detected image according to the position of each of the n first candidate frames may include the following steps (1) to (3):
(1) For any one first candidate frame in the n first candidate frames, taking the playing time of the image to which the one first candidate frame belongs as the abscissa of the first coordinate point corresponding to the one first candidate frame, and taking the abscissa of the position of the one first candidate frame as the ordinate of the first coordinate point corresponding to the one first candidate frame, so as to construct the first coordinate point corresponding to the one first candidate frame; and constructing a second coordinate point corresponding to the first candidate frame by taking the playing time of the image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame and taking the ordinate of the position of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame.
It should be noted that the playing time of the image to which the first candidate box belongs is predetermined before the video is played, and the playing time of the image is used to indicate when the image is played during the playing of the video. In the embodiment of the present application, the playing time of the image is taken as the abscissa of the first coordinate point and the second coordinate point corresponding to the first candidate frame.
In addition, the position of the one first candidate frame refers to coordinates of a center point of the one first candidate frame. In the embodiment of the present application, the abscissa of the center point of the one first candidate frame is taken as the ordinate of the first coordinate point corresponding to the one first candidate frame, and the ordinate of the center point of the one first candidate frame is taken as the ordinate of the second coordinate point corresponding to the one first candidate frame.
(2) Performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; and performing curve fitting on the n second coordinate points corresponding to the n first candidate frames to obtain a second curve.
It should be noted that each first candidate frame corresponds to one first coordinate point, and thus the n first candidate frames correspond to n first coordinate points in total. Each first candidate frame corresponds to one second coordinate point, and thus the n first candidate frames correspond to n second coordinate points in total.
In addition, since the n first candidate frames are n target frames where the same target is located, a first curve obtained by curve fitting n first coordinate points corresponding to the n first candidate frames is a first curve corresponding to the target, and the first curve represents a lateral movement trend of the target in the first n images. Similarly, a second curve obtained by curve-fitting the n second coordinate points corresponding to the n first candidate frames is a second curve corresponding to the target, and the second curve represents a longitudinal movement trend of the target in the first n images. In this way, the moving trend of the target in the first n images can be reflected by the first curve and the second curve.
(3) Taking the playing time of the detection image as the abscissa, determine the ordinate of the corresponding coordinate point on the first curve as the abscissa of the position of the one prediction frame in the detection image, and determine the ordinate of the corresponding coordinate point on the second curve as the ordinate of the position of the one prediction frame in the detection image, so as to obtain the position of the one prediction frame in the detection image.
Since the n first candidate frames are n target frames in which the same target is located, the first curve represents a lateral movement trend of the target in the first n images, and the second curve represents a longitudinal movement trend of the target in the first n images, an abscissa of a position of a prediction frame in which the target is located in the detection image can be obtained from the first curve through the playing time of the detection image, and an ordinate of a position of a prediction frame in which the target is located in the detection image can be obtained from the second curve, so as to obtain a position of the prediction frame in which the target is located in the detection image.
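In code, steps (1) to (3) can be sketched as follows; the sketch assumes numpy is available and that a low-order polynomial is an acceptable fitting curve, since the text does not fix the curve family or the fitting method.

```python
import numpy as np

def predict_frame_position(play_times, centers, detect_time, degree=2):
    """play_times: playing time of each of the first n images.
    centers: (cx, cy) center of the first candidate frame in each of those images.
    Returns the predicted (cx, cy) of the prediction frame at detect_time."""
    times = np.asarray(play_times, dtype=float)
    xs = np.asarray([c[0] for c in centers], dtype=float)  # ordinates of the first coordinate points
    ys = np.asarray([c[1] for c in centers], dtype=float)  # ordinates of the second coordinate points
    first_curve = np.polyfit(times, xs, degree)    # lateral movement trend
    second_curve = np.polyfit(times, ys, degree)   # longitudinal movement trend
    return (float(np.polyval(first_curve, detect_time)),
            float(np.polyval(second_curve, detect_time)))
```

The polynomial degree here is an assumption; any curve fitted over the n coordinate points, as described above, serves the same purpose.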
For ease of understanding, the operation of determining the position of a predicted frame in the detected image based on the position of each of the n first candidate frames will be exemplified below with reference to fig. 2 to 4.
Referring to fig. 2, fig. 2 shows the coordinates of the positions of 5 first candidate frames, and the coordinates of the positions of the 5 first candidate frames are (0, 8), (5, 10), (10, 13), (15, 18), (20, 24), respectively, and the operation of determining the position of a predicted frame according to the positions of the 5 first candidate frames is as follows:
(1) Referring to fig. 3, the playing time of the image to which each first candidate frame belongs is taken as the abscissa of the first coordinate point corresponding to each first candidate frame, the abscissa of the position of each first candidate frame is taken as the ordinate of the first coordinate point corresponding to each first candidate frame, 5 corresponding first coordinate points of the 5 first candidate frames are drawn, and a first curve is fitted according to the 5 first coordinate points.
The first curve reflects the rule that the abscissa of the position of the first candidate frame changes with time, i.e., the first curve represents the lateral movement tendency of the target in the first n images. In this way, the abscissa of the position of the prediction frame in the current detection image can be obtained from the first curve. The ordinate of the position of the coordinate point within "∘" in fig. 3 is the abscissa of the position of the prediction frame, i.e. the abscissa of the position of the prediction frame is 25.
(2) Referring to fig. 4, the playing time of the image to which each first candidate frame belongs is taken as the abscissa of the second coordinate point corresponding to each first candidate frame, the ordinate of the position of each first candidate frame is taken as the ordinate of the second coordinate point corresponding to each first candidate frame, the 5 second coordinate points corresponding to the 5 first candidate frames are drawn, and a second curve is fitted according to the 5 second coordinate points.
The second curve reflects the time-varying rule of the ordinate of the position of the first candidate frame, i.e. the second curve represents the longitudinal movement trend of the target in the first n images. In this way, the ordinate of the position of the prediction frame in the current detection image can be obtained from the second curve. The ordinate of the position of the coordinate point within "∘" in fig. 4 is the ordinate of the position of the prediction frame, i.e. the ordinate of the position of the prediction frame is 30.
After the abscissa and the ordinate of the position of the prediction frame are obtained in the first two steps, the position of the prediction frame is determined in the detection image. The position of the coordinate point within "∘" in fig. 2 is the position of the prediction frame, i.e. the coordinates of the position of the prediction frame are (25, 30).
Step 103: and selecting a target frame from the plurality of detection frames in the detection image according to the plurality of prediction frames in the detection image.
Since the plurality of prediction frames in the detection image are determined according to the historical tracking information of the targets appearing in the first n images, targets are very likely to exist near the areas indicated by the plurality of prediction frames. Therefore, selecting target frames from the plurality of detection frames in the detection image according to the plurality of prediction frames can avoid erroneous filtering of detection frames caused by target occlusion, target crowding, and the like, and improves the target detection rate.
Specifically, the operation of step 103 may be: acquiring a first candidate frame from a plurality of detection frames in the detection image as a target frame according to the coincidence degree between a plurality of prediction frames and a plurality of detection frames in the detection image and the confidence degree of each detection frame; filtering a plurality of second candidate frames in the detection image according to the coincidence degrees between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, and taking the reserved second candidate frames as third candidate frames; and filtering the plurality of third candidate frames in the detection image according to the coincidence degree among the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
It should be noted that the degree of coincidence between two frames refers to the IoU (Intersection over Union) between the two frames, which is obtained by dividing the area of their intersection by the area of their union. The degree of coincidence between two frames may indicate the probability that the targets existing in the areas corresponding to the two frames are the same target. That is, the larger the degree of coincidence between the two frames, the higher the probability that the targets existing in the corresponding areas are the same target; the smaller the degree of coincidence, the lower that probability.
In addition, the second candidate frame is a detection frame other than the first candidate frame among the plurality of detection frames in the detection image. For example, the plurality of detection frames in the detection image are detection frame 1, detection frame 2, detection frame 3, and detection frame 4, the plurality of first candidate frames in the detection image are detection frame 1 and detection frame 3, and the plurality of second candidate frames in the detection image are detection frame 2 and detection frame 4.
The operation of acquiring the first candidate frame from the multiple detection frames in the detection image according to the overlap ratios between the multiple prediction frames and the multiple detection frames in the detection image and the confidence of each detection frame may be: selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until executing the following operations on each prediction frame in the plurality of prediction frames in the detection image: for any one of a plurality of detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the one detection frame by the confidence coefficient of the one detection frame to obtain a target value of the one detection frame; and determining a detection frame with the maximum target value and larger than the reference value in the plurality of detection frames in the detection image as a first candidate frame corresponding to the selected prediction frame.
The degree of coincidence between the one detection frame and the selected prediction frame may reflect, to some extent, the probability that the target present in the region corresponding to the one detection frame and the target present in the region corresponding to the selected prediction frame are the same target. The confidence of the one detection frame indicates the probability that a target exists in the region corresponding to the one detection frame. The target value of the one detection frame, obtained by multiplying the degree of coincidence between the selected prediction frame and the one detection frame by the confidence of the one detection frame, may therefore indicate the probability that the target existing in the region corresponding to the selected prediction frame exists in the region corresponding to the one detection frame. That is, the larger the target value of the one detection frame, the more likely it is that the target existing in the region corresponding to the selected prediction frame exists in the region corresponding to the one detection frame; the smaller the target value, the less likely this is.
In addition, the reference value may be set in advance and may be set to a relatively large value, for example, 0.9. When the target value of a detection frame is greater than the reference value, it indicates that the target existing in the region corresponding to the selected prediction frame is very likely to exist in the region corresponding to that detection frame. Therefore, among the detection frames in the detection image whose target values are greater than the reference value, the detection frame with the largest target value may be selected; this detection frame is the one most likely to contain the target existing in the region corresponding to the selected prediction frame, and it is thus determined as the first candidate frame corresponding to the selected prediction frame and used as a target frame.
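A minimal sketch of this matching step is shown below, reusing the iou helper from the background sketch; reference_value stands for the preset reference value, and 0.9 is only an assumed example.

```python
def match_first_candidates(pred_frames, det_frames, det_scores, reference_value=0.9):
    """For each prediction frame, pick the detection frame whose
    coincidence-times-confidence target value is largest and greater than
    reference_value; those detection frames become first candidate frames."""
    first_candidates = set()
    for pred in pred_frames:
        best_index, best_value = None, reference_value
        for i, det in enumerate(det_frames):
            target_value = iou(pred, det) * det_scores[i]
            if target_value > best_value:
                best_index, best_value = i, target_value
        if best_index is not None:
            first_candidates.add(best_index)
    return first_candidates   # indices into det_frames
```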
According to the coincidence degree between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, the operation of filtering the plurality of second candidate frames in the detection image may be: and deleting second candidate frames, of the plurality of second candidate frames in the detection image, of which the coincidence degree with any one of the plurality of first candidate frames in the detection image is greater than or equal to the first coincidence degree.
It should be noted that the first degree of coincidence may be set in advance and may be set to a relatively large value, for example, 0.8 or 0.9.
Furthermore, when the degree of coincidence between a second candidate frame and a first candidate frame in the detection image is greater than or equal to the first degree of coincidence, the two frames coincide to a large extent; in that case, the target existing in the region corresponding to the second candidate frame is very likely to be the same target as the one existing in the region corresponding to the first candidate frame, so the second candidate frame can be deleted.
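A sketch of this deletion rule, again reusing the iou helper from the background sketch; first_overlap stands for the first degree of coincidence, and 0.8 is an assumed example.

```python
def filter_second_candidates(first_frames, second_frames, first_overlap=0.8):
    """Keep only second candidate frames whose coincidence with every first
    candidate frame is below first_overlap; the survivors are the third candidates."""
    return [frame for frame in second_frames
            if all(iou(frame, first) < first_overlap for first in first_frames)]
```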
When the third candidate frames in the detection image are filtered according to the degree of coincidence between the third candidate frames in the detection image and the confidence of each third candidate frame, the third candidate frames in the detection image may be filtered by using an NMS algorithm according to the degree of coincidence between the third candidate frames in the detection image and the confidence of each third candidate frame. Specifically, the following steps (1) to (5) may be included:
(1) And selecting the third candidate frame with the highest confidence degree from the plurality of third candidate frames in the detection image as the target candidate frame.
It should be noted that the target candidate frame selected from the third candidate frames in the detection image is the third candidate frame corresponding to the area where the target is most likely to exist in the third candidate frames. In the embodiment of the application, the target candidate frames in the detection image are all reserved.
(2) And determining a third candidate frame except the target candidate frame in the plurality of third candidate frames in the detected image as a fourth candidate frame.
A fourth candidate frame determined from the plurality of third candidate frames in the detection image is the third candidate frame to be filtered. In the embodiment of the present application, the fourth candidate frame in the detection image is filtered.
(3) And deleting a fourth candidate frame of which the degree of coincidence with the target candidate frame is greater than or equal to the second degree of coincidence from among the fourth candidate frames in the detection image.
It should be noted that the second degree of coincidence may be set in advance and may be set to a relatively large value. In practical applications, the second degree of coincidence may be less than the first degree of coincidence; for example, the second degree of coincidence may be 0.7 or 0.75.
In addition, when the degree of coincidence between a fourth candidate frame in the detection image and the target candidate frame is greater than or equal to the second degree of coincidence, the two frames coincide to a large extent; in that case, the target existing in the region corresponding to the fourth candidate frame is very likely to be the same target as the one existing in the region corresponding to the target candidate frame, so the fourth candidate frame can be deleted.
(4) And judging whether the number of the fourth candidate frames in the detection image is 1 or not.
(5) When the number of the fourth candidate frames in the detection image is not 1, selecting the fourth candidate frame with the highest confidence degree from the fourth candidate frames in the detection image as the target candidate frame, and returning to the step (2); when the number of the fourth frame candidates in the detection image is 1, the operation is ended.
It should be noted that, when the number of the fourth candidate frames in the detected image is 1, it indicates that the degree of overlap between the target candidate frame already retained in the detected image and the only remaining fourth candidate frame in the detected image is less than the second degree of overlap, so that the fourth candidate frame may also be retained at this time, and the operation is ended.
When the number of the fourth candidate frames in the detection image is not 1, that is, when the number of the fourth candidate frames in the detection image is at least two, the fourth candidate frame with the highest confidence level may be continuously selected from the fourth candidate frames in the detection image as the target candidate frame, and then the step (2) is returned to re-determine the fourth candidate frame in the detection image, and the filtering of the fourth candidate frame in the detection image is continuously performed.
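Steps (1) to (5) are the same confidence-sorted loop as the related-art NMS sketched in the background section, applied only to the third candidate frames and using the second degree of coincidence as the threshold. A sketch reusing the nms helper from that section is shown below (0.7 is an assumed second degree of coincidence); the retained first candidate frames together with the surviving third candidate frames then form the target frames of the detection image.

```python
def filter_third_candidates(third_frames, third_scores, second_overlap=0.7):
    """Steps (1)-(5): keep the most confident third candidate as a target
    candidate, delete fourth candidates that coincide with it, and repeat."""
    kept = nms(third_frames, third_scores, reference_overlap=second_overlap)
    return [third_frames[i] for i in kept]
```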
For ease of understanding, the target detection method provided in the embodiments of the present application is illustrated below with reference to fig. 5 to 7.
Specifically, referring to fig. 5, the target detection method provided in the embodiment of the present application may be applied to a target detection system, and the target detection system may include a tracking prediction module and a non-maximum suppression module.
The method comprises the following steps. First, the sizes, positions, and confidences of the plurality of detection frames in a detection image in the video, output by the target detection model, are obtained. Second, the size and position of the target frame in each of the first n images adjacent to the detection image in the video are acquired. Third, the size and position of the target frame in each of the first n images are input to the tracking prediction module for processing; specifically, for each of the targets appearing in the first n images, the tracking prediction module determines the size and position of the prediction frame in the detection image according to the size and position of the first candidate frame in which that target is located among the target frames of the first n images. Fourth, the sizes, positions, and confidences of the plurality of detection frames in the detection image and the sizes and positions of the plurality of prediction frames in the detection image are input to the non-maximum suppression module for processing; specifically, the non-maximum suppression module can select a target frame from the plurality of detection frames in the detection image according to the sizes and positions of the plurality of prediction frames and the sizes, positions, and confidences of the plurality of detection frames.
Referring to fig. 6, the tracking prediction module includes a target tracking unit and a position prediction unit.
The processing comprises the following steps. First, the size and position of the target frame in each of the first n images are input into the target tracking unit for processing; specifically, for each of the plurality of targets appearing in the first n images, the target tracking unit acquires, from the target frames of each of the first n images, the target frame in which that target is located as a first candidate frame, thereby obtaining n first candidate frames corresponding to the target. Second, the sizes and positions of the n first candidate frames are input into the position prediction unit for processing; specifically, the position prediction unit may determine the size of the prediction frame in which the target is located in the detection image according to the size of each of the n first candidate frames, and determine the position of that prediction frame according to the position of each of the n first candidate frames, as in the sketch below.
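A minimal sketch of the position prediction unit follows. It assumes the curve fitting is a least-squares polynomial fit over the playing times of the first n images (numpy.polyfit, degree 1 by default) and that the size of the prediction frame is taken as the mean of the n candidate sizes; both choices are illustrative assumptions, since the embodiment only states that the size and position of the prediction frame are derived from those of the n first candidate frames.

import numpy as np

def predict_frame(first_candidates, detection_time, degree=1):
    """first_candidates: list of (play_time, center_x, center_y, width, height)
    for the same target in the first n images; detection_time: playing time of
    the detection image. Returns (center_x, center_y, width, height)."""
    times = np.array([c[0] for c in first_candidates], dtype=float)
    xs = np.array([c[1] for c in first_candidates], dtype=float)
    ys = np.array([c[2] for c in first_candidates], dtype=float)

    # First curve: fitted over (playing time, abscissa of the centre point).
    x_curve = np.polyfit(times, xs, degree)
    # Second curve: fitted over (playing time, ordinate of the centre point).
    y_curve = np.polyfit(times, ys, degree)

    # Evaluate both curves at the playing time of the detection image to obtain
    # the position of the prediction frame.
    pred_x = float(np.polyval(x_curve, detection_time))
    pred_y = float(np.polyval(y_curve, detection_time))

    # Size of the prediction frame, here simply the mean of the n candidate sizes
    # (an assumption made for this sketch).
    pred_w = float(np.mean([c[3] for c in first_candidates]))
    pred_h = float(np.mean([c[4] for c in first_candidates]))
    return pred_x, pred_y, pred_w, pred_h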
Referring to fig. 7, the non-maximum suppression module includes a tracking-target-based processing unit and a remaining-detection-target-based processing unit.
The processing comprises the following steps. First, the sizes, positions and confidences of the plurality of detection frames in the detection image and the sizes and positions of the plurality of prediction frames in the detection image are input into the tracking-target-based processing unit for processing; specifically, this unit may obtain first candidate frames from the plurality of detection frames in the detection image as target frames according to the degrees of coincidence between the plurality of prediction frames and the plurality of detection frames in the detection image and the confidence of each detection frame, filter the plurality of second candidate frames according to the degrees of coincidence between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, and take the remaining second candidate frames as third candidate frames. Second, the sizes, positions and confidences of the plurality of third candidate frames in the detection image are input into the remaining-detection-target-based processing unit for processing; specifically, this unit may filter the plurality of third candidate frames in the detection image according to the degrees of coincidence among them and the confidence of each third candidate frame, and take the retained third candidate frames as target frames. A sketch of the first of these two steps is given below.
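A sketch of the tracking-target-based processing unit follows. It reuses the overlap() helper from the earlier sketch; the reference value and the first degree of coincidence are illustrative assumptions.

def match_and_filter(detections, predictions,
                     reference_value=0.3, first_overlap_threshold=0.5):
    """detections: list of (frame, confidence); predictions: list of frames.
    Returns (first_candidates, third_candidates)."""
    if not detections:
        return [], []
    first_candidates = []
    for pred in predictions:
        # Target value of each detection frame for the selected prediction frame.
        scored = [(overlap(pred, frame) * conf, (frame, conf)) for frame, conf in detections]
        best_value, best_det = max(scored, key=lambda s: s[0])
        # Keep the detection frame only if its target value exceeds the reference value.
        if best_value > reference_value and best_det not in first_candidates:
            first_candidates.append(best_det)

    # Second candidate frames: detection frames that were not selected as first candidates.
    second_candidates = [d for d in detections if d not in first_candidates]

    # Retain only those second candidates whose degree of coincidence with every first
    # candidate is below the first degree of coincidence; the survivors are the third candidates.
    third_candidates = [d for d in second_candidates
                        if all(overlap(d[0], f[0]) < first_overlap_threshold
                               for f in first_candidates)]
    return first_candidates, third_candidates

The third candidate frames returned here would then be passed to the filter_third_candidates() sketch given earlier, which plays the role of the remaining-detection-target-based processing unit.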
In the embodiment of the application, after a plurality of detection frames in a detection image in a video are determined according to a target detection model, a plurality of prediction frames in the detection image are determined according to the target frame in each of the first n images adjacent to the detection image in the video. Since the plurality of prediction frames in the detection image are determined based on the historical tracking information of the targets appearing in the previous n images, it is highly probable that targets exist near the regions indicated by the plurality of prediction frames in the detection image. Then, target frames are selected from the plurality of detection frames in the detection image according to the plurality of prediction frames, so that erroneous filtering of detection frames caused by conditions such as target occlusion and target crowding can be avoided, and the target detection rate can be improved.
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. Referring to fig. 8, the apparatus includes: a first determination module 801, a second determination module 802, and a selection module 803.
A first determining module 801, configured to determine multiple detection frames in a detection image in a video according to a target detection model;
a second determining module 802, configured to determine multiple prediction frames in the detected image according to a target frame in each of n previous images in the video adjacent to the detected image, where the target frame is used to indicate an area where a target exists, and n is an integer greater than or equal to 2;
a selecting module 803, configured to select a target frame from the multiple detection frames in the detection image according to the multiple prediction frames in the detection image.
Optionally, the first determining module 801 includes:
an output unit, configured to input the detection image in the video into the target detection model, the target detection model outputting the size, position and confidence of each of a plurality of detection frames in the detection image, where the confidence of each detection frame is used to indicate the probability that a target exists in the region corresponding to that detection frame.
Optionally, the second determining module 802 includes:
the first acquisition unit is used for acquiring a first candidate frame from a target frame in each image in the previous n images to obtain n first candidate frames, and the same target exists in an area corresponding to each first candidate frame in the n first candidate frames;
a first determination unit configured to determine a position of a prediction frame in the detection image based on a position of each of the n first candidate frames;
a second determining unit configured to determine a size of a prediction frame in the detected image according to a size of each of the n first candidate frames.
Optionally, the first determining unit is configured to:
for any one first candidate frame in the n first candidate frames, constructing a first coordinate point corresponding to the first candidate frame by taking the playing time of the image to which the first candidate frame belongs as the abscissa of the first coordinate point corresponding to the first candidate frame and taking the abscissa of the position of the first candidate frame as the ordinate of the first coordinate point corresponding to the first candidate frame; constructing a second coordinate point corresponding to a first candidate frame by taking the playing time of an image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame and taking the ordinate of the position of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame;
performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; performing curve fitting on n second coordinate points corresponding to the n first candidate frames to obtain a second curve;
and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the first curve as the abscissa of the position of the prediction frame in the detection image, and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the second curve as the ordinate of the position of the prediction frame in the detection image, so as to obtain the position of the prediction frame in the detection image.
Optionally, the selecting module 803 includes:
a second obtaining unit, configured to obtain a first candidate frame from the plurality of detection frames in the detection image as a target frame according to coincidence degrees between the plurality of prediction frames and the plurality of detection frames in the detection image and a confidence of each detection frame, where the confidence of each detection frame is used to indicate a probability that a target exists in a region corresponding to each detection frame;
a first filtering unit, configured to filter a plurality of second candidate frames in the detection image according to degrees of coincidence between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, and use a remaining second candidate frame as a third candidate frame, where the second candidate frame is a detection frame other than the first candidate frame, in the plurality of detection frames in the detection image;
and the second filtering unit is used for filtering the plurality of third candidate frames in the detection image according to the coincidence degree among the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
Optionally, the second obtaining unit is configured to:
selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until the following operations are executed on each prediction frame in the plurality of prediction frames in the detection image:
for any one of a plurality of detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the one detection frame by the confidence coefficient of the one detection frame to obtain a target value of the one detection frame;
and determining, among the plurality of detection frames in the detection image, the detection frame whose target value is the largest and greater than the reference value as the first candidate frame corresponding to the selected prediction frame.
Optionally, the first filtering unit is configured to:
and deleting second candidate frames, of the plurality of second candidate frames in the detection image, of which the coincidence degree with any one of the plurality of first candidate frames in the detection image is greater than or equal to the first coincidence degree.
Optionally, the second filtering unit is configured to:
selecting a third candidate frame with the highest confidence degree from a plurality of third candidate frames in the detection image as a target candidate frame;
determining a third candidate frame other than the target candidate frame among a plurality of third candidate frames in the detected image as a fourth candidate frame;
deleting a fourth candidate frame, of the fourth candidate frames in the detection image, of which the degree of coincidence with the target candidate frame is greater than or equal to the second degree of coincidence;
judging whether the number of the fourth candidate frames in the detection image is 1 or not;
when the number of the fourth candidate frames in the detection image is not 1, selecting the fourth candidate frame with the highest confidence from the fourth candidate frames in the detection image as the target candidate frame, and returning to the step of determining, as fourth candidate frames, the third candidate frames other than the target candidate frame among the plurality of third candidate frames in the detection image; when the number of the fourth candidate frames in the detection image is 1, ending the operation.
In the embodiment of the application, after a plurality of detection frames in a detection image in a video are determined according to a target detection model, a plurality of prediction frames in the detection image are determined according to the target frame in each of the first n images adjacent to the detection image in the video. Since the plurality of prediction frames in the detection image are determined based on the historical tracking information of the targets appearing in the previous n images, it is highly probable that targets exist near the regions indicated by the plurality of prediction frames in the detection image. Then, target frames are selected from the plurality of detection frames in the detection image according to the plurality of prediction frames, so that erroneous filtering of detection frames caused by conditions such as target occlusion and target crowding can be avoided, and the target detection rate can be improved.
Fig. 9 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. Referring to fig. 9, the apparatus may be a terminal 900, and the terminal 900 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, and the like.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. Memory 902 may also include high speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement a target detection method provided by method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 903 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or both of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited in this application.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, provided on the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in still other embodiments, the display 905 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 900. Even more, the display 905 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of a terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When power source 909 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery can also be used to support fast charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the touch display 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 913 may be disposed on the side bezel of terminal 900 and/or underneath touch display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the holding signal of the user to the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the touch display 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the touch display 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 905 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 905 is turned down. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
Proximity sensor 916, also known as a distance sensor, is typically disposed on a front panel of terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the touch display 905 to switch from the bright-screen state to the off-screen state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the touch display 905 to switch from the off-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 10 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. Referring to fig. 10, the apparatus may be a server 1000, and the server 1000 may be a server in a background server cluster. Specifically, the method comprises the following steps:
the server 1000 includes a CPU (Central Processing Unit) 1001, a system Memory 1004 including a RAM (Random Access Memory) 1002 and a ROM (Read-Only Memory) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The server 1000 also includes a basic I/O (Input/Output) system 1006 that facilitates the transfer of information between devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009 such as a mouse, keyboard, etc. for a user to input information. Wherein a display 1008 and an input device 1009 are connected to the central processing unit 1001 via an input/output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, an input/output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid state memory technology, as well as CD-ROM, DVD (Digital Versatile Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1004 and the mass storage device 1007 may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also be operated by being connected, through a network such as the Internet, to a remote computer on the network. That is, the server 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to another type of network or a remote computer system (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs contain instructions for performing the object detection method provided by the embodiment of fig. 1.
In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which when executed by a processor implements the steps of the object detection method in the above embodiments. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is noted that the computer-readable storage medium referred to herein may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer readable storage medium described above.
That is, in some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the object detection method described above.
The above-mentioned embodiments are provided by way of example and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A method of object detection, the method comprising:
determining a plurality of detection frames in a detection image in a video according to a target detection model;
determining a plurality of prediction frames in the detection image according to a target frame in each of the first n images adjacent to the detection image in the video, wherein the target frame is used for indicating an area with a target, and n is an integer greater than or equal to 2;
selecting a target frame from a plurality of detection frames in the detection image according to the plurality of prediction frames in the detection image;
the determining a plurality of prediction frames in the detected image according to the target frame in each image of the first n images adjacent to the detected image in the video comprises:
acquiring a first candidate frame from a target frame in each image in the first n images to obtain n first candidate frames, wherein the same target exists in a region corresponding to each first candidate frame in the n first candidate frames; determining the size of a prediction frame in the detection image according to the size of each first candidate frame in the n first candidate frames; for any one first candidate frame in the n first candidate frames, constructing a first coordinate point corresponding to the one first candidate frame by taking the playing time of the image to which the one first candidate frame belongs as the abscissa of the first coordinate point corresponding to the one first candidate frame and taking the abscissa of the center point of the one first candidate frame as the ordinate of the first coordinate point corresponding to the one first candidate frame; constructing a second coordinate point corresponding to the first candidate frame by taking the playing time of the image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame and taking the ordinate of the central point of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame;
performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; performing curve fitting on n second coordinate points corresponding to the n first candidate frames to obtain a second curve;
and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the first curve as the abscissa of the position of the prediction frame in the detection image, and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the second curve as the ordinate of the position of the prediction frame in the detection image, so as to obtain the position of the prediction frame in the detection image.
2. The method of claim 1, wherein determining a plurality of detection boxes in a detected image in a video according to an object detection model comprises:
inputting a detection image in a video to a target detection model, and outputting the size, the position and the confidence of each detection frame in a plurality of detection frames in the detection image by the target detection model, wherein the confidence of each detection frame is used for indicating the probability of the target existing in the region corresponding to each detection frame.
3. The method according to any one of claims 1-2, wherein said selecting a target frame from the plurality of detection frames in the detection image based on the plurality of prediction frames in the detection image comprises:
acquiring a first candidate frame from a plurality of detection frames in the detection image as a target frame according to the coincidence degree between the plurality of prediction frames and the plurality of detection frames in the detection image and the confidence coefficient of each detection frame, wherein the confidence coefficient of each detection frame is used for indicating the probability that a target exists in a region corresponding to each detection frame;
according to the coincidence degree between a plurality of first candidate frames and a plurality of second candidate frames in the detection image, filtering the plurality of second candidate frames in the detection image, and taking the reserved second candidate frames as third candidate frames, wherein the second candidate frames are detection frames except the first candidate frames in the plurality of detection frames in the detection image;
and filtering the plurality of third candidate frames in the detection image according to the coincidence degree of the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
4. The method of claim 3, wherein the obtaining a first candidate frame from the plurality of detection frames in the detection image according to the degree of coincidence between the plurality of prediction frames and the plurality of detection frames in the detection image and the confidence of each detection frame comprises:
selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until the following operations are executed on each prediction frame in the plurality of prediction frames in the detection image:
for any one of a plurality of detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the one detection frame by the confidence coefficient of the one detection frame to obtain a target value of the one detection frame;
and determining a detection frame with the maximum target value and larger than the reference value in a plurality of detection frames in the detection image as a first candidate frame corresponding to the selected prediction frame.
5. The method of claim 3, wherein the filtering the plurality of third candidate frames in the detected image according to a degree of overlap between the plurality of third candidate frames in the detected image and a confidence of each third candidate frame comprises:
selecting a third candidate frame with the highest confidence degree from a plurality of third candidate frames in the detection image as a target candidate frame;
determining a third candidate frame, except for the target candidate frame, of a plurality of third candidate frames in the detected image as a fourth candidate frame;
deleting a fourth candidate frame, of the fourth candidate frames in the detection image, of which the degree of coincidence with the target candidate frame is greater than or equal to the second degree of coincidence;
judging whether the number of fourth candidate frames in the detection image is 1 or not;
when the number of the fourth candidate frames in the detection image is not 1, selecting the fourth candidate frame with the highest confidence degree from the fourth candidate frames in the detection image as a target candidate frame, and returning to the step of determining the third candidate frames except the target candidate frame in the plurality of third candidate frames in the detection image as the fourth candidate frames; and when the number of the fourth candidate frames in the detection image is 1, ending the operation.
6. An object detection apparatus, characterized in that the apparatus comprises:
the first determining module is used for determining a plurality of detection frames in a detection image in a video according to the target detection model;
a second determining module, configured to determine multiple prediction frames in the detected image according to a target frame in each of first n images adjacent to the detected image in the video, where the target frame is used to indicate an area where a target exists, and n is an integer greater than or equal to 2;
a selecting module, configured to select a target frame from the multiple detection frames in the detection image according to the multiple prediction frames in the detection image;
the second determining module includes:
a first obtaining unit, configured to obtain n first candidate frames from a target frame in each of the first n images, where a same target exists in an area corresponding to each of the n first candidate frames;
a first determining unit configured to determine a position of a prediction frame in the detection image according to a position of each of the n first candidate frames;
a second determining unit configured to determine a size of the one prediction frame in the detection image according to a size of each of the n first candidate frames;
the first determination unit is configured to:
for any one first candidate frame in the n first candidate frames, constructing a first coordinate point corresponding to the one first candidate frame by taking the playing time of the image to which the one first candidate frame belongs as the abscissa of the first coordinate point corresponding to the one first candidate frame and taking the abscissa of the center point of the one first candidate frame as the ordinate of the first coordinate point corresponding to the one first candidate frame; constructing a second coordinate point corresponding to the first candidate frame by taking the playing time of the image to which the first candidate frame belongs as the abscissa of the second coordinate point corresponding to the first candidate frame and taking the ordinate of the central point of the first candidate frame as the ordinate of the second coordinate point corresponding to the first candidate frame;
performing curve fitting on n first coordinate points corresponding to the n first candidate frames to obtain a first curve; performing curve fitting on n second coordinate points corresponding to the n first candidate frames to obtain a second curve;
and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the first curve as the abscissa of the position of the prediction frame in the detection image, and determining the ordinate of the coordinate point corresponding to the playing time of the detection image in the second curve as the ordinate of the position of the prediction frame in the detection image, so as to obtain the position of the prediction frame in the detection image.
7. The apparatus of claim 6, wherein the selection module comprises:
a second obtaining unit, configured to obtain a first candidate frame from the plurality of detection frames in the detection image as a target frame according to coincidence degrees between the plurality of prediction frames and the plurality of detection frames in the detection image and a confidence of each detection frame, where the confidence of each detection frame is used to indicate a probability that a target exists in a region corresponding to each detection frame;
a first filtering unit, configured to filter a plurality of second candidate frames in the detection image according to degrees of coincidence between the plurality of first candidate frames and the plurality of second candidate frames in the detection image, and use a remaining second candidate frame as a third candidate frame, where the second candidate frame is a detection frame other than the first candidate frame in the plurality of detection frames in the detection image;
and the second filtering unit is used for filtering the plurality of third candidate frames in the detection image according to the coincidence degrees among the plurality of third candidate frames in the detection image and the confidence coefficient of each third candidate frame, and taking the reserved third candidate frames as target frames.
8. The apparatus of claim 7, wherein the second obtaining unit is to:
selecting one prediction frame from a plurality of prediction frames in the detection image, and executing the following operations on the selected prediction frame until the following operations are executed on each prediction frame in the plurality of prediction frames in the detection image:
for any one of the detection frames in the detection image, multiplying the coincidence degree between the selected prediction frame and the detection frame by the confidence coefficient of the detection frame to obtain a target value of the detection frame;
and determining a detection frame with the maximum target value and larger than the reference value in a plurality of detection frames in the detection image as a first candidate frame corresponding to the selected prediction frame.
9. The apparatus of claim 7, wherein the second filtration unit is to:
selecting a third candidate frame with the highest confidence degree from a plurality of third candidate frames in the detection image as a target candidate frame;
determining a third candidate frame except the target candidate frame in a plurality of third candidate frames in the detection image as a fourth candidate frame;
deleting a fourth candidate frame with the degree of coincidence with the target candidate frame being greater than or equal to the second degree of coincidence from among fourth candidate frames in the detection image;
judging whether the number of fourth candidate frames in the detection image is 1 or not;
when the number of the fourth candidate frames in the detection image is not 1, selecting the fourth candidate frame with the highest confidence degree from the fourth candidate frames in the detection image as a target candidate frame, and returning to the step of determining the third candidate frames except the target candidate frame in the plurality of third candidate frames in the detection image as the fourth candidate frames; and when the number of the fourth candidate frames in the detection image is 1, ending the operation.
10. A computer device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus, the memory stores a computer program, and the processor executes the program stored in the memory to implement the steps of the method according to any one of claims 1-5.
11. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1-5.
CN201911047482.5A 2019-10-30 2019-10-30 Object detection method, device, computer equipment and computer readable storage medium Active CN112749590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911047482.5A CN112749590B (en) 2019-10-30 2019-10-30 Object detection method, device, computer equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112749590A CN112749590A (en) 2021-05-04
CN112749590B true CN112749590B (en) 2023-02-07

Family

ID=75640818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911047482.5A Active CN112749590B (en) 2019-10-30 2019-10-30 Object detection method, device, computer equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112749590B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949785B (en) * 2021-05-14 2021-08-20 长沙智能驾驶研究院有限公司 Object detection method, device, equipment and computer storage medium
CN113705643B (en) * 2021-08-17 2022-10-28 荣耀终端有限公司 Target detection method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268869A (en) * 2018-02-13 2018-07-10 北京旷视科技有限公司 Object detection method, apparatus and system
CN108875577A (en) * 2018-05-11 2018-11-23 深圳市易成自动驾驶技术有限公司 Object detection method, device and computer readable storage medium
CN109903310A (en) * 2019-01-23 2019-06-18 平安科技(深圳)有限公司 Method for tracking target, device, computer installation and computer storage medium
CN110047095A (en) * 2019-03-06 2019-07-23 平安科技(深圳)有限公司 Tracking, device and terminal device based on target detection
CN110210304A (en) * 2019-04-29 2019-09-06 北京百度网讯科技有限公司 Method and system for target detection and tracking
CN110222787A (en) * 2019-06-14 2019-09-10 合肥工业大学 Multiscale target detection method, device, computer equipment and storage medium
CN110363790A (en) * 2018-04-11 2019-10-22 北京京东尚科信息技术有限公司 Target tracking method, device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast Deep Neural Networks With Knowledge Guided Training and Predicted Regions of Interests for Real-Time Video Object Detection; W. Cao, J. Yuan, Z. He, Z. Zhang and Z. He; IEEE Access; 2018-12-31; pp. 8990-8999 *
Video Object Detection Combining Association Features and Convolutional Neural Networks; Liu Yujie, Cao Xianzhi, Li Zongmin, Li Hua; Journal of South China University of Technology (Natural Science Edition); 2018-12-31; pp. 26-33 *

Also Published As

Publication number Publication date
CN112749590A (en) 2021-05-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant