CN111986229A - Video target detection method, device and computer system - Google Patents


Info

Publication number
CN111986229A
CN111986229A (application CN201910430828.3A)
Authority
CN
China
Prior art keywords
image frame
target
target detection
detection result
candidate
Prior art date
Legal status
Pending
Application number
CN201910430828.3A
Other languages
Chinese (zh)
Inventor
诸小熊
吕承飞
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority application: CN201910430828.3A
Publication: CN111986229A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Abstract

Embodiments of the present application disclose a video target detection method, apparatus, and computer system. The method includes: after video target detection is started, performing target detection on captured video image frames in a first mode, where, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame and the process switches to a second mode; in the second mode, performing target detection on newly captured image frames, where, if the confidence of a candidate target in the current image frame is greater than a second threshold and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition, the candidate target is taken as the target detection result of the current image frame. The first threshold is greater than the second threshold. Through the embodiments of the present application, the missed detection rate and false detection rate of video target detection are reduced.

Description

Video target detection method, device and computer system
Technical Field
The present application relates to the field of video target detection, and in particular to a video target detection method, apparatus, and computer system.
Background
Video target detection is a technology that analyzes and processes the images in a captured video stream using a model or algorithm to determine whether a detection target is present. It requires not only accurate detection in each image frame, but also continuous and stable detection of the target; that is, neither missed detection nor false detection should occur during the detection process. Missed detection is the case where a target in the video is recognized as a non-target and therefore cannot be detected. False detection is the case where a non-target object in the video is recognized as the detection target and output. Both missed detection and false detection degrade the accuracy of video target detection.
Take, as an example, an application (App) with a rain-control special effect on a mobile terminal such as a mobile phone. The App applies video target detection technology with a specific gesture as the detection target: when the specific gesture is detected in the video stream, the rain-control special effect is triggered, for example suspending or stopping raindrops, or controlling the movement of water drops; when the specific gesture can no longer be detected, the rain-control special effect is turned off. If the continuous detection capability of the model is low, missed detection or false detection occurs frequently, and the rain-control special effect is triggered or turned off by mistake.
Other applications, such as intelligent robot tracking, likewise require continuous and stable detection of the target; otherwise the tracking process is prone to drift and interruption. For example, once missed detection occurs, the robot cannot detect the target and the tracking process is interrupted; once false detection occurs, another object is identified as the target, the wrong target is tracked, and the tracking process drifts.
Therefore, in video target detection, ensuring continuous and stable detection capability and reducing missed detection and false detection are of great significance.
However, current video detection technology suffers from high rates of missed detection and false detection. Current techniques generally detect the position of candidate targets with a tracker or detector and compute a score for each. The position is generally the bounding rectangle of the candidate target, i.e., the candidate box, and the score represents the likelihood that the object in the candidate box belongs to a certain target class. Fig. 1 shows the detection flow of a gesture detector whose detection target is a fist gesture: after preliminary detection, candidate targets that may be fist gestures are obtained together with their positions; a rectangular box marks the position of each gesture, "fist" denotes the gesture category, and the numbers beside the rectangular boxes indicate that the scores for the gesture being a fist are 0.95 and 0.97, respectively. The score is compared with a preset threshold, and a candidate target is confirmed as the detection target and output only when its score is greater than the threshold.
In actual detection, however, because the detection capability of the model is limited, different objects with similar shapes receive similar scores, while the shape and pose of the same object vary; if part of an object is occluded, the same object receives different scores depending on its pose. Since the prior art performs detection according to scores alone, different objects with similar shapes may be identified as the same target, resulting in false detection; or the same object may sometimes be confirmed as the target and sometimes not, resulting in missed detection.
Therefore, how to effectively reduce the missed detection rate and false detection rate of video target detection has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present application provides a video target detection method, apparatus, and computer system that reduce the missed detection rate and false detection rate.
The application provides the following scheme:
a video object detection method, comprising:
after video target detection is started, performing target detection on captured video image frames in a first mode, wherein, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame, and the process switches to a second mode;
in the second mode, performing target detection on newly captured image frames, wherein, if the confidence of a candidate target in the current image frame is greater than a second threshold, and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition, the candidate target is taken as the target detection result of the current image frame; the first threshold is greater than the second threshold.
A video object detection method, comprising:
performing target detection on captured video image frames to obtain a target detection result and the position of the target detection result in the image frame;
determining the degree of change between the position of the target detection result in the current image frame and its position in the previous image frame;
and, if the degree of change is lower than a third threshold, outputting the position of the target detection result in the previous image frame.
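The three steps above can be sketched roughly as follows. The (x, y, w, h) box format, the change-degree measure (centre displacement normalised by the previous box's diagonal), and the threshold value are illustrative assumptions; the application does not fix a particular measure at this point.

```python
import math

def position_change(prev_box, curr_box):
    # Boxes are (x, y, w, h). As an illustrative change-degree measure,
    # use the centre displacement normalised by the previous box's diagonal.
    px, py = prev_box[0] + prev_box[2] / 2, prev_box[1] + prev_box[3] / 2
    cx, cy = curr_box[0] + curr_box[2] / 2, curr_box[1] + curr_box[3] / 2
    diagonal = math.hypot(prev_box[2], prev_box[3])
    return math.hypot(cx - px, cy - py) / diagonal

def stabilized_output(prev_box, curr_box, third_threshold=0.05):
    # If the position has not changed noticeably, keep outputting the
    # previous position so the reported box does not jitter on screen.
    if position_change(prev_box, curr_box) < third_threshold:
        return prev_box
    return curr_box
```

A one-pixel shift of a 50x50 box falls below the example threshold and reuses the previous box, while a large displacement passes the new box through unchanged.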
A video object detection apparatus comprising:
a first detection unit configured to, after video target detection is started, perform target detection on captured video image frames in a first mode, wherein, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame and the apparatus switches to a second mode;
a second detection unit configured to, in the second mode, perform target detection on newly captured image frames, wherein, if the confidence of a candidate target in the current image frame is greater than a second threshold, and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition, the candidate target is taken as the target detection result of the current image frame; the first threshold is greater than the second threshold.
A video object detection apparatus comprising:
a target detection unit configured to perform target detection on captured video image frames to obtain a target detection result and the position of the target detection result in the image frame;
a position change determining unit configured to determine the degree of change between the position of the target detection result in the current image frame and its position in the previous image frame;
and an output unit configured to output the position of the target detection result in the previous image frame if the degree of change is lower than a third threshold.
A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors and storing program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform operations comprising:
after video target detection is started, performing target detection on captured video image frames in a first mode, wherein, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame, and the process switches to a second mode;
in the second mode, performing target detection on newly captured image frames, wherein, if the confidence of a candidate target in the current image frame is greater than a second threshold, and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition, the candidate target is taken as the target detection result of the current image frame; the first threshold is greater than the second threshold.
A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors and storing program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform operations comprising:
performing target detection on captured video image frames to obtain a target detection result and the position of the target detection result in the image frame;
determining the degree of change between the position of the target detection result in the current image frame and its position in the previous image frame;
and, if the degree of change is lower than a third threshold, outputting the position of the target detection result in the previous image frame.
A video object detection method, comprising:
in a first mode, performing target detection on a captured first image frame;
when a candidate target satisfying a first preset requirement is detected, determining the candidate target as the target detection result of the first image frame, and switching to a second mode;
in the second mode, performing target detection on a newly captured second image frame;
and, when it is detected that the relationship between the position of a candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame satisfies a preset condition, taking the candidate target as the target detection result of the second image frame.
A video object detection method, comprising:
performing target detection on a captured first image frame;
when a candidate target satisfying a first preset requirement is detected, determining the candidate target as the target detection result of the first image frame;
performing target detection on a newly captured second image frame;
and, when it is detected that the relationship between the position of a candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame satisfies a preset condition, taking the candidate target as the target detection result of the second image frame.
A video object detection apparatus comprising:
a target identification unit configured to perform target detection on a captured first image frame in a first mode;
a recognition result determining unit configured to, when a candidate target satisfying a first preset requirement is detected, determine the candidate target as the target detection result of the first image frame and switch to a second mode;
a tracking identification unit configured to perform target detection on a newly captured second image frame in the second mode;
and a result output unit configured to take the candidate target as the target detection result of the second image frame when it is detected that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame satisfies the preset condition.
A video object detection apparatus comprising:
a first object detection unit configured to perform target detection on a captured first image frame;
a target determining unit configured to determine a candidate target as the target detection result of the first image frame when the candidate target satisfies a first preset requirement;
a second object detection unit configured to perform target detection on a newly captured second image frame;
and a target output unit configured to take the candidate target as the target detection result of the second image frame when it is detected that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame satisfies a preset condition.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
According to the embodiments of the present application, the process of target detection in a video is divided into two modes. In the first mode, a relatively high first threshold is set, so that the detected target detection result has high accuracy and reliability; the position of this result then serves as prior information in the subsequent detection process. After a high-accuracy target detection result is obtained in the first mode, the process switches to the second mode, in which a lower second threshold can be set and the judgment additionally takes into account the relationship between the position of a candidate target in the current image frame and the position of the target detection result in the previous image frame. This judgment reduces the possibility that a non-target is determined to be the target, i.e., it reduces the false detection rate. At the same time, because the threshold used in the second mode can be lower, the probability of missing a target for which the algorithm computes a low confidence is also reduced. Detection accuracy is thus improved while the detection rate also rises; that is, the false detection rate and the missed detection rate are reduced simultaneously.
Furthermore, by checking whether the target position detected in consecutive frames has moved noticeably, the reported target position is stabilized: when there is no obvious movement, the same position as the previous output is emitted, which avoids the problem of target jitter.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of prior art target detection;
FIG. 2 is a flow chart of a first method provided by an embodiment of the present application;
FIGS. 3A-3C are schematic diagrams of an embodiment of the present application;
FIGS. 4A-4C are schematic diagrams of another embodiment of the present application;
FIG. 5 is a flowchart of a first specific implementation of the present application;
FIG. 6 is a flowchart of another specific implementation of the first embodiment of the present application;
FIG. 7 is a flow chart of a second method provided by embodiments of the present application;
FIG. 8 is a schematic diagram of a first apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a second apparatus provided by an embodiment of the present application;
FIG. 10 is a block diagram of a computer system according to an embodiment of the present application;
FIG. 11 is a flow chart of a third method provided by an embodiment of the application;
FIG. 12 is a flow chart of a fourth method provided by an embodiment of the application;
FIG. 13 is a schematic view of a third apparatus provided by an embodiment of the present application;
fig. 14 is a schematic diagram of a fourth apparatus provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments derived from the embodiments given herein by a person of ordinary skill in the art without creative effort fall within the scope of the present application.
To improve the accuracy of video target detection and reduce the missed detection rate and false detection rate, the embodiments of the present application propose judging whether a candidate target is the detection target by combining the candidate target's confidence with the positional relationship between the candidate target and a prior target.
A candidate target in the embodiments of the present application is a region identified in a video image frame by some algorithm as possibly being the detection target, together with a corresponding confidence. Multiple candidate targets may be detected in the same frame, each with its own confidence, where the confidence is the probability that the candidate target belongs to the detection target. For example, if the detection target is a "palm", several regions that may be palms can be detected in a given frame, each associated with different probability information. It should be noted that determining candidate targets and their confidences relies mainly on the characteristics of the pixels in the image, including color, the shape of connected regions composed of similar pixels, and so on.
A prior target in the embodiments of the present application is a target detected in the frame preceding the current frame (the prior frame). In a video, if the same object appears in consecutive frames, it will be located at relatively close positions in the two frames. On one hand, an object moves continuously, so its trajectory is a straight line or a curve; meanwhile, because the frame rate of video is usually high (online video typically runs at 30 frames per second and usually no lower than 25 frames per second), the time interval between capturing consecutive frames is very small, so the positions of the same object in consecutive frames are usually close or even coincident (when the object is not stationary), with essentially no position jump. On the other hand, from the user's perspective, while using a video-target-detection function provided by a client, the user tends to hold a target such as a gesture as steady as possible, precisely to avoid missed or false detection. Combining these two factors, the positions of the same target object in consecutive frames are relatively close during video capture, and the embodiments of the present application use this characteristic as an important basis for target detection.
Of course, relying on this position information alone is not enough; the confidence of the candidate target in the current frame must also be considered, and the prior target in the prior frame must be highly reliable (i.e., its probability of being a false detection must be low). When the prior target is highly reliable, however, and a candidate target is close to it in position, the confidence requirement on the candidate target can be relaxed appropriately. That is, with a reliable prior target and the closeness of positions across consecutive frames as evidence, even a candidate target whose computed confidence is relatively low is in fact very likely to be the detection target, so the threshold can be lowered appropriately to ensure a certain detection rate. As for the prior frame: because each frame uses the preceding frame as its prior frame, by frame-by-frame recursion the first frame becomes the source of reliability for all subsequent detection results, so high reliability of the target detection result in the first frame is the key. In other words, if the first prior target detected in the first frame is highly reliable, then in subsequent detection the combination of position and confidence keeps the results reliable even with a low threshold, ensuring a low false detection rate for each subsequent frame, while the relatively low threshold ensures a certain detection rate and reduces the missed detection rate.
An effective way to obtain a reliable prior target and a low false detection rate is to set a high threshold initially; once a highly reliable prior target is obtained, the threshold can be lowered in the subsequent detection of each frame in combination with position information, which maintains the detection rate while keeping the false detection rate low.
For this reason, the embodiments of the present application divide the target detection process into two modes: a detection mode and a tracking mode, with the same detection process passing through both. After detection starts, the process first enters the detection mode, which uses a relatively high first threshold: after candidate targets and their confidences are obtained from an image frame, each confidence is compared with the first threshold, and a candidate target is regarded as a detected target only when its confidence exceeds the first threshold, so that the prior target is obtained with high reliability. After the first prior target (or the first few) is detected, the process enters the tracking mode, which uses a relatively low second threshold: candidate targets and their confidences are first determined in an image frame and compared with the second threshold; if the confidence of a candidate target is greater than the second threshold and its position is very close to that of the prior target in the prior frame, it can be determined as the target detected in that frame.
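The detection-mode/tracking-mode scheme described above can be sketched roughly as follows. The threshold values, the (box, confidence) candidate format, and the position check (simplified here to a plain centre-distance test) are illustrative assumptions, not values fixed by the application.

```python
import math

class TwoModeDetector:
    """Sketch of the two-mode scheme: a high first threshold in detection
    mode, then a lower second threshold plus a position check in tracking
    mode. All parameter values are illustrative."""

    def __init__(self, first_threshold=0.9, second_threshold=0.7,
                 max_center_shift=40.0):
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold
        self.max_center_shift = max_center_shift  # pixels
        self.mode = "detect"
        self.prior_box = None  # position of the last target detection result

    @staticmethod
    def _center(box):
        x, y, w, h = box
        return (x + w / 2, y + h / 2)

    def step(self, candidates):
        """candidates: list of (box, confidence) for one image frame.
        Returns the box of the target detection result, or None."""
        for box, conf in candidates:
            if self.mode == "detect":
                # Detection mode: accept only high-confidence candidates,
                # then switch to tracking mode with this box as the prior.
                if conf > self.first_threshold:
                    self.mode = "track"
                    self.prior_box = box
                    return box
            else:
                # Tracking mode: a lower threshold suffices if the candidate
                # stays close to the prior target's position.
                (px, py) = self._center(self.prior_box)
                (cx, cy) = self._center(box)
                close = math.hypot(cx - px, cy - py) <= self.max_center_shift
                if conf > self.second_threshold and close:
                    self.prior_box = box
                    return box
        return None
```

Under these example values, a 0.75-confidence candidate is rejected in detection mode but accepted in tracking mode when it lies near the prior target, which is exactly the trade-off the two thresholds encode.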
More specific implementations provided by the embodiments of the present application are described in detail below.
It should be noted that products related to the embodiments of the present application may be applications providing video-target-detection functions, for example some live-streaming and smart-photography applications, or light applications such as live streaming and smart photography embedded in comprehensive applications. Specifically, a client of such an application may be installed on a user's terminal device; some applications may also have a corresponding server, but in the embodiments of the present application the relevant functions are mainly provided by the client.
Example one
First, this embodiment provides a video target detection method from the perspective of the above-mentioned client. Referring to fig. 2, the method includes:
S201: after video target detection is started, performing target detection on captured video image frames in a first mode, wherein, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame, and the process switches to a second mode;
In the embodiments of the present application, the target detection process is divided into two stages. To detect a target with high reliability in the first stage, a relatively high first threshold is set so as to provide an effective prior target for subsequent detection. Specifically, candidate targets may be determined in an image frame by a preset algorithm, each with a confidence; the confidence is then compared with the first threshold, and if the confidence of a candidate target in some image frame is greater than the first threshold, the target detection result is determined.
After a target detection result is obtained against the first threshold, the process can switch to the second mode, in which detection uses the lower second threshold but the judgment must also incorporate the prior target's position information. Specifically, the switch to the second mode may occur after a candidate target with confidence greater than the first threshold is detected in a single image frame; or, to further ensure the reliability of the first stage, the switch may occur only after candidate targets with confidence greater than the first threshold have been detected in several consecutive frames, and so on.
It should be noted that, in practice, in some image frames at the initial stage after video capture starts, the image quality may be poor, or the image may be unstable because the user is still adjusting the position or distance of the object; the first stage of detection may therefore last several seconds, and the first image frame in which a candidate target with confidence greater than the first threshold is detected is usually not the first frame actually captured.
It should also be noted that, in the embodiments of the present application, after the target detection result is determined in each image frame, its position in that frame may be recorded separately to provide a basis for subsequent detection.
S202: in the second mode, performing target detection on newly captured image frames, wherein, if the confidence of a candidate target in the current image frame is greater than a second threshold, and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition, the candidate target is taken as the target detection result of the current image frame; the first threshold is greater than the second threshold.
After switching to the second mode, for a newly captured image frame, candidate targets and their confidences are first determined by the preset algorithm. The confidence of each candidate target is then compared with the second threshold; if it is greater than the second threshold and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame satisfies a preset condition (for example, the distance between them is within a first distance range), the candidate target is taken as the target detection result of the current image frame. That is, once a reliable prior target has been obtained under the high threshold, the position of the prior target in the prior frame serves as a reference for target detection in subsequent image frames, so a lower second threshold can be set in the tracking mode: the probability that a candidate target is the detection target is determined jointly from the relative position information and the confidence. This keeps the detection result accurate and reliable and reduces the false detection rate, and the detected target can in turn serve as the prior for the next frame; in addition, because the second threshold is low, the detection rate of the target is high and the missed detection rate is reduced.
The relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame may be expressed in several ways. In one of them, a bounding box of the candidate target in the current image frame and a bounding box of the target detection result in the previous image frame are determined respectively, and the intersection-over-union (IOU) between the two is used as a reference value measuring the change in their positional relationship. The IOU is the ratio of the intersection to the union of the two bounding boxes; it equals 1 when the two boxes completely overlap, and the higher the IOU value, the closer the candidate target is to the prior target. When the IOU value is greater than a set threshold, the candidate target is close to the prior target in position, and this can be used as the determination condition that the positional relationship between the two meets the preset condition.
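As an illustrative sketch (the (x1, y1, x2, y2) corner representation of a bounding box is an assumption, not something fixed by this application), the IOU described above can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner tuples."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Completely overlapping boxes give an IOU of 1, as stated above.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0
```

A box half-overlapped by another of the same size yields an IOU of 1/3, illustrating how the value falls as the candidate target moves away from the prior target.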
The following describes the effects of the embodiment of the present application with a specific example:
as shown in fig. 3A, 3B and 3C, candidate target patterns in three frames of a video are shown, with the palm of a human hand as the detection target. Fig. 3A shows a palm in a first frame, fig. 3B shows the palm in a second frame, and fig. 3C shows a dog's paw resembling the palm in a third frame. Because part of the hand is outside the image in fig. 3B, the shape of the palm has changed from the previous frame. Through model detection and calculation, the score of the candidate target being a palm in fig. 3A is 0.95; due to the partial occlusion of the palm, the score in fig. 3B is 0.87; and the score of the candidate target being a palm in fig. 3C is also 0.87 (of course, an actual algorithm might not identify a dog's paw as a human palm; this is merely an assumption for ease of explanation). Judging by score alone, the palm in fig. 3B and the dog's paw in fig. 3C are therefore indistinguishable. If the threshold is set to 0.9, the confidences of the candidate targets in fig. 3B and fig. 3C are both below the threshold, so neither is determined as the detection target, and the candidate target in fig. 3B is missed. If the threshold is set to 0.86, the confidences of both are above the threshold, so both are determined as detection targets, and the candidate target in fig. 3C is falsely detected. By applying the method of the embodiment of the present application, however, a higher threshold of 0.9 is set in the detection mode; the target in fig. 3A is detected with high confidence and serves as the prior target for the next frame, fig. 3B. A lower threshold, for example 0.85, can then be set when detecting the next image frame, fig. 3B. Through the positional-relationship judgment, the candidate target position in fig. 3B is close to the target position in the previous frame (fig. 3A) and its confidence is above the set threshold, so the model outputs the candidate target in fig. 3B as the detection target. The detection target in fig. 3B then becomes the prior target for fig. 3C; although the confidence of the candidate target in fig. 3C is above the set threshold, its position is far from the position of the prior target in fig. 3B, so it is not output as a detection target. Compared with the prior art, the missed detection and false detection rates are reduced and the detection accuracy is improved.
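The two-threshold decision walked through in this example can be sketched as follows; the values 0.9 and 0.85 come from the example above, and the `near_prior` flag is a hypothetical stand-in for the positional-relationship judgment:

```python
def accept(score, has_prior, near_prior, t_high=0.9, t_low=0.85):
    """Two-threshold decision: a high bar for the first detection,
    a lower bar plus a position check once a prior target exists."""
    if not has_prior:
        return score > t_high            # detection mode: confidence only
    return score > t_low and near_prior  # tracking mode: confidence + position

# Fig. 3A: no prior target yet; 0.95 > 0.9, so it is detected.
assert accept(0.95, has_prior=False, near_prior=False)
# Fig. 3B: prior from 3A; 0.87 > 0.85 and near the prior, so detected.
assert accept(0.87, has_prior=True, near_prior=True)
# Fig. 3C: 0.87 > 0.85 but far from the prior, so rejected.
assert not accept(0.87, has_prior=True, near_prior=False)
```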
Whether a threshold is relatively high or low is set according to the actual situation. For example, when the target's morphological features are prominent and vary little, 0.9 may be a low threshold for that target; when the target's features are not prominent and its pose changes greatly, 0.8 may already be a high threshold. The embodiment of the present application does not limit the specific threshold values.
For the first prior target, once a false detection occurs, the subsequent results based on that prior target are all wrong. Take fig. 4A, 4B and 4C, three adjacent frames in a video, as an example; fig. 4A shows the first frame, with fig. 4B and 4C following in sequence, and the palm of a human hand is again the detection target. Fig. 4A shows a dog's paw resembling a palm, fig. 4B shows the same dog's paw, and fig. 4C shows a palm with a small portion outside the image. Through model detection and calculation, the score of the candidate target being a palm is 0.87 in fig. 4A and 0.87 in fig. 4B; due to the partial occlusion of the palm, the score in fig. 4C is also 0.87. If the first detection threshold is set low, for example 0.85, the dog's paw in fig. 4A is output as the palm target upon detection and becomes the prior target for fig. 4B. The threshold for subsequent determination is set to 0.84. In the next frame, after the position judgment, the candidate target in fig. 4B is close to the prior target position and its confidence is greater than the threshold, so the candidate target in fig. 4B is output as the detection target and becomes the prior target for fig. 4C. In fig. 4C, although the confidence of the real palm is above the threshold, its position is far from the prior target, so the palm cannot be detected and output. As can be seen, if a false detection occurs at the first detection, the detection accuracy drops greatly. Therefore, in the embodiment of the present application, a higher first threshold can be set for the condition of detecting the first prior target through the model; the first threshold is higher than the second threshold used in the tracking mode.
In the above embodiment, the first stage of detection uses a higher threshold: a candidate target is output as the detection target only when its confidence is above that threshold, which reduces the false detection rate and provides a more accurate prior target for subsequent detection. For the detection process with a prior target, the confidence is combined with the positional relationship between the candidate target position and the prior target position, so that even with a lower threshold that reduces the missed detection rate, the positional-relationship judgment keeps the false detection rate low.
In addition, besides the problems of false detection and missed detection, another problem may exist in the video target detection process:
due to the limited detection capability of the model, it is difficult to restore the position of the detected target exactly during video detection. For the same target, even if its position does not change between adjacent frames, the model may output slightly different positions, so the identified bounding box drifts slightly from frame to frame; if this happens across multiple frames, a "shaking" phenomenon occurs, in which the identified bounding box jumps up and down or left and right on the screen and never settles. In addition, the photographed object itself often shakes unintentionally; for example, in a "rain control" effect, the user may shake slightly even while trying to keep a gesture still. Either cause leads to slightly different output positions for the same target in adjacent frames, producing target jitter, which degrades the user's visual experience.
To this end, the embodiment of the present application provides a method for solving target jitter: after the target detection result is obtained in the current image frame, a judgment can be made in combination with the position of the target detection result in the previous image frame. If the position of the detected target in the current frame has moved only slightly (within a second distance range) relative to the position of the target detection result in the previous image frame, the position of the prior target can be determined as the position of the detected target and output. If the position has moved greatly (beyond the second distance range), the target has genuinely undergone an obvious displacement, so the output follows the position of the detected target in the current frame to keep tracking the actual target position. Through this process, when the detected target moves only slightly relative to the prior target, the output position is the same as the prior target position, the target stays still across consecutive frames of the video, and the shaking phenomenon is avoided.
The determination of the magnitude of the position shift (the second distance range) may also be performed in various ways; for example, it may use the intersection-over-union (IOU) between the bounding box of the target detection result in the current image frame and the bounding box of the target detection result in the previous image frame. When the IOU value is greater than a set threshold, the position of the target detection result in the current image frame has not moved obviously relative to that in the previous frame; conversely, when the IOU value is smaller than the set threshold, the position can be considered to have moved significantly.
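The anti-jitter output rule just described can be sketched as follows; the (x1, y1, x2, y2) box representation and the 0.8 value of the set threshold are assumptions for illustration:

```python
def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def output_position(detected_box, prior_box, second_set_threshold=0.8):
    """Suppress jitter: reuse the prior position for slight movement."""
    if prior_box is not None and iou(detected_box, prior_box) > second_set_threshold:
        return prior_box      # within the second distance range: stay put
    return detected_box       # obvious displacement: follow the target

# A 1-pixel drift of a 100x100 box keeps the prior position steady.
assert output_position((1, 0, 101, 100), (0, 0, 100, 100)) == (0, 0, 100, 100)
```

A box that moves far enough for the IOU to fall below the threshold is output at its newly detected position, so real motion is still tracked.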
The embodiment of the present application refers to both a first distance range and a second distance range. Since the second distance range determines whether the target's position has actually changed, it is preferably smaller, i.e. stricter, than the first distance range; that is, a candidate target and a prior target that satisfy the first distance range do not necessarily satisfy the second. Both distance ranges may be determined using an IOU value, in which case the IOU threshold used for the first-distance-range determination is preferably lower than the IOU threshold used for the second-distance-range determination.
In other embodiments of the present application, the first distance range may also be the same as the second distance range: when the confidence of the candidate target is greater than the set threshold and the positions of the candidate target and the prior target are within the first distance range, the candidate target is taken as the detection target and the prior target position is output as the position of the detection target.
As shown in fig. 5, a specific implementation flow of the first embodiment of the present application is as follows:
the client configures a plurality of related thresholds in advance, wherein the specific thresholds comprise a first set threshold of IOU (input output unit) for judging whether the candidate target and the prior target are in a first distance range, a first threshold related to confidence judgment and a second threshold. The first threshold is greater than the second threshold.
A candidate target (its candidate box and confidence) and a prior target (which may not exist) are obtained for the current frame. Whether a prior target exists for the current frame is then judged. If not, the judgment is made on confidence alone: if the confidence of the candidate target is greater than the first threshold, the candidate target is output as the detection target. This branch is typically taken at the first detection.
If a prior target exists, the judgment combines the relative positional relationship with the prior target and the confidence. Specifically, the confidence of the candidate target is compared with the second threshold. If the confidence is less than the second threshold, the candidate target is very unlikely to be the detected target regardless of its position and can be filtered out. For a candidate target whose confidence is greater than the second threshold, whether its position is close to the prior target position is judged next, for example, whether the IOU value of the bounding boxes of the candidate target and the prior target is greater than the first set IOU threshold; if so, the candidate target is near the prior target and is determined as the detection target.
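The fig. 5 flow described above can be sketched as follows; all threshold values and the (x1, y1, x2, y2) box representation are illustrative assumptions:

```python
def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def detect(candidates, prior_box, first_threshold=0.9,
           second_threshold=0.85, iou_first_set=0.3):
    """One pass of the fig. 5 flow over this frame's candidates, each
    given as a ((x1, y1, x2, y2), confidence) pair."""
    for box, conf in candidates:
        if prior_box is None:
            if conf > first_threshold:       # no prior: confidence only
                return box
        elif conf > second_threshold and iou(box, prior_box) > iou_first_set:
            return box                       # prior exists: confidence + position
    return None                              # no detection target this frame
```

Here `detect([((0, 0, 10, 10), 0.87)], None)` returns `None` (first detection, high bar), while the same candidate passes once a nearby prior target such as `(1, 1, 11, 11)` exists.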
Fig. 6 is another specific implementation flow of the first embodiment of the present application. Compared with the flow of fig. 5, a relocation process for the determined detected target position is added for the case where a prior target exists. Correspondingly, a second set IOU threshold is added to the input; in this flow, the second set threshold is greater than the first set threshold and is used to judge whether the current-frame candidate target has moved only slightly relative to the prior target. When the IOU of the determined detection target box and the prior target box is greater than the second set threshold, the movement is slight, and the prior target position is output as the detection target position, thereby avoiding the shaking phenomenon. The contents of the two dashed boxes in fig. 6 show this relocation procedure for the detected target position.
Still referring to fig. 6, the second set IOU threshold may also be the same as the first set IOU threshold, in which case only one set IOU threshold needs to be provided as input. If the IOU of the candidate target box and the prior target box is greater than this set threshold and the confidence is greater than the second threshold, the candidate target is directly determined to be the detection target, with the same position as the prior target.
Embodiment 2
The first embodiment mentioned a scheme for preventing target jitter, but this scheme can also stand on its own, independent of the scheme for solving false detection and missed detection described there; that is, even with the prior-art approach of performing target detection with a fixed threshold, it can be used to solve the target-jitter problem. Therefore, the second embodiment provides another video target detection method; referring to fig. 7, the method may include:
S701: performing target detection on a collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
S702: determining change-degree information between the position of the target detection result in the current image frame and the position of the target detection result in the previous image frame;
S703: if the change degree is lower than a third threshold, outputting the position of the target detection result in the previous image frame.
If the change degree is higher than the third threshold, the position of the target detection result in the current image frame may be output. The change-degree information may specifically be determined as follows: determine the intersection-over-union (IOU) between a first bounding box corresponding to the target detection result in the current image frame and a second bounding box corresponding to the target detection result in the previous image frame, i.e. the ratio of the intersection to the union of the first bounding box and the second bounding box, and determine the change-degree information from this IOU; note that a higher IOU corresponds to a lower degree of change.
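Steps S701–S703 can be sketched as a loop over per-frame detected boxes, reading a high IOU as a low degree of change; the (x1, y1, x2, y2) box representation and the 0.8 third-threshold value are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def smoothed_positions(detections, third_threshold=0.8):
    """S701-S703 over a sequence of per-frame detected boxes: when the
    overlap with the previous output is high (low change degree), the
    previous position is output again."""
    outputs, prev = [], None
    for box in detections:
        if prev is not None and iou(box, prev) > third_threshold:
            box = prev            # change degree below threshold: hold still
        outputs.append(box)
        prev = box
    return outputs
```

A 1-pixel drift is absorbed into the previous output, while a large jump passes through unchanged, matching the behavior described for S703.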
Corresponding to the video target detection method in the foregoing embodiment, an embodiment of the present application further provides a video target detection apparatus, and referring to fig. 8, the apparatus may specifically include:
the first detection unit 801 is configured to, after video target detection is started, perform target detection on captured video image frames in a first mode, determine a candidate target as the target detection result of the corresponding image frame when a candidate target with a confidence greater than a first threshold is detected, and switch to a second mode;
a second detection unit 802, configured to perform target detection on a newly acquired image frame in the second mode, and, when it is detected that the confidence of a candidate target in the current image frame is greater than a second threshold and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame meets a preset condition, take the candidate target as the target detection result in the current image frame; the first threshold is greater than the second threshold.
The first detection unit 801 is specifically configured to switch to the second mode when a candidate target with a confidence greater than the first threshold is detected in an image frame, so that the second detection unit 802 performs target detection in the second mode on new image frames acquired after that image frame.
In view of the aforementioned problem of jitter in the output detected target, the apparatus may preferably further include:
a position change determining unit, configured to determine, after the target detection result in the current image frame is obtained, the degree of change between the position of the target detection result in the current image frame and the position of the target detection result in the previous image frame; specifically, the position change determining unit determines the intersection-over-union (IOU) between a first bounding box corresponding to the target detection result in the current image frame and a second bounding box corresponding to the target detection result in the previous image frame, i.e. the ratio of the intersection to the union of the two bounding boxes, and determines the change degree from this IOU;
an output unit, configured to output the position of the target detection result in the previous image frame when the change degree is lower than a third threshold; the output unit is further configured to output the position of the target detection result in the current image frame when the change degree is higher than the third threshold.
Corresponding to the video object detection method for solving the object shake in the second embodiment, an embodiment of the present application further provides another video object detection apparatus, as shown in fig. 9, the apparatus includes:
a target detection unit 901, configured to perform target detection on a collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
a position change determining unit 902, configured to determine the degree of change between the position of the target detection result in the current image frame and the position of the target detection result in the previous image frame; specifically, the unit determines the intersection-over-union (IOU) between a first bounding box corresponding to the target detection result in the current image frame and a second bounding box corresponding to the target detection result in the previous image frame, i.e. the ratio of the intersection to the union of the two bounding boxes, and determines the change degree from this IOU;
An output unit 903, configured to output a position where the target detection result in the previous image frame is located if the change degree is lower than a third threshold.
The output unit 903 is further configured to output a position where the target detection result in the current image frame is located when the change degree is higher than the third threshold.
For the unrefined parts of the embodiments of the apparatus, reference may be made to the descriptions of the corresponding embodiments of the method, which are not repeated herein.
In addition, an embodiment of the present application further provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
after video target detection is started, performing target detection on acquired video image frames in a first mode, where, if a candidate target with a confidence greater than a first threshold is detected, the candidate target is determined as the target detection result of the corresponding image frame and a switch is made to a second mode;
in the second mode, performing target detection on a newly acquired image frame, where, if the confidence of a candidate target in the current image frame is detected to be greater than a second threshold and the relationship between the position of the candidate target in the current image frame and the position of the target detection result in the previous image frame meets a preset condition, the candidate target is taken as the target detection result in the current image frame; the first threshold is greater than the second threshold.
An embodiment of the present application provides another computer system, including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
carrying out target detection on the collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
determining the position of the target detection result in the current image frame and the change degree information relative to the position of the target detection result in the previous image frame;
and if the change degree is lower than a third threshold, outputting the position of the target detection result in the previous image frame.

Fig. 10 illustrates an architecture of such a computer system, which may specifically include a processor 1010, a video display adapter 1011, a disk drive 1012, an input/output interface 1013, a network interface 1014, and a memory 1020. The processor 1010, the video display adapter 1011, the disk drive 1012, the input/output interface 1013, the network interface 1014, and the memory 1020 may be communicatively connected by a communication bus 1030.
The processor 1010 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present application.
The memory 1020 may be implemented in the form of ROM (read-only memory), RAM (random access memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system 1021 for controlling the operation of the computer system 1000 and a basic input/output system (BIOS) for controlling low-level operations of the computer system 1000. In addition, a web browser 1023, a data storage management system 1024, a target recognition processing system 1025, and the like may also be stored. The target recognition processing system 1025 may be an application program that implements the operations of the foregoing steps in this embodiment of the present application. In summary, when the technical solution provided by the present application is implemented in software or firmware, the relevant program code is stored in the memory 1020 and called for execution by the processor 1010.
The input/output interface 1013 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1014 is used for connecting a communication module (not shown in the figure) to realize the communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1030 includes a path that transfers information between various components of the device, such as processor 1010, video display adapter 1011, disk drive 1012, input/output interface 1013, network interface 1014, and memory 1020.
In addition, the computer system 1000 may also obtain information of specific obtaining conditions from the virtual resource object obtaining condition information database 1041, so as to perform condition judgment, and the like.
It should be noted that although the above devices only show the processor 1010, the video display adapter 1011, the disk drive 1012, the input/output interface 1013, the network interface 1014, the memory 1020, the bus 1030, etc., in a specific implementation, the device may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
Embodiment 3
An embodiment of the present application provides another video target detection method, referring to fig. 11, which is a flowchart of the video target detection method, where the method may include the following steps:
S1101: in a first mode, performing target detection on a collected first image frame;
first, target detection can be performed on the collected first image frame in the first mode; after a candidate target meeting the requirement is detected, the first mode can be switched to the second mode, in which tracking-type detection and identification of the target continue. The first mode and the second mode may use different recognition methods or strategies. For example, as in the foregoing embodiment, when detecting a target object in a video frame, the first mode may apply a relatively strict threshold so that the target is detected more accurately and effectively at the start of detection, and a candidate target meeting the first preset requirement may be one whose confidence is greater than a first threshold; the second mode may then apply a relatively loose threshold combined with target position analysis, which helps reduce the missed detection rate and improves the efficiency of algorithm execution. Of course, in practice, other ways of differentiating the modes may be adopted for different applications; for example, the first mode may use a recognition method implemented with a more computation-intensive and capable algorithm, while the second mode uses one implemented with a relatively lightweight algorithm, achieving similar technical effects.
S1102: when a candidate target meeting a first preset requirement is detected, determining the candidate target as a target detection result of the first image frame, and switching to a second mode;
when a relevant index of a candidate target meets the first preset requirement, for example when the confidence that the candidate target is the target object (say, the confidence that a certain object is a certain human gesture) reaches or exceeds the first threshold, the candidate target may be determined as the target detection result of the first image frame. After the detection result is determined, a switch can be made to the second mode, in which target detection continues. The second mode may be a tracking-type detection mode, while the first mode may be an identification-type detection mode: target detection in the first mode is the initial detection of the target object, with no other prior information available for reference, so compared with the second mode the first mode is closer to simply recognizing a target object in a video frame. Of course, this only illustrates one difference between the first mode and the second mode, and the differences between the two modes in practical applications are not limited thereto.
S1103: under the second mode, carrying out target detection on a newly acquired second image frame;
in the second mode, target detection may be performed on the newly acquired second image frame; as described above, the second mode may employ a relatively loose threshold, or a target recognition method implemented with a relatively lightweight algorithm. There is no strict adjacency requirement on the image frames used by the method: the first image frame and the second image frame may be adjacent, but need not be two strictly adjacent frames. In practical applications, they may be samples from a sequence of image frames, for example one frame selected every 5 frames as a target detection frame, or the currently acquired frame may be determined as a target detection frame according to the real-time processing speed.
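The every-5-frames sampling mentioned above can be sketched as follows (the stride value is taken from the example in the text and is not fixed by this application):

```python
def sample_frames(frames, stride=5):
    """Select every stride-th frame of the sequence as a target detection frame."""
    return frames[::stride]

# With a stride of 5, frames 0, 5 and 10 of a 12-frame clip are detected on.
assert sample_frames(list(range(12))) == [0, 5, 10]
```

In a real pipeline the stride could instead be chosen adaptively from the real-time processing speed, as the text notes.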
S1104: and when detecting that the relation between the position of the candidate target in the second image frame and the position of the target detection result in the previous image frame in the second image frame meets a preset condition, taking the candidate target as the target detection result of the second image frame.
After target detection is performed on the newly acquired second image frame, the change in the position of the candidate target within the second image frame can serve as an auxiliary basis for judging whether the candidate target has undergone a large displacement. That is, when the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the previous image frame is detected to meet the preset condition, the candidate target can be taken as the target detection result of the second image frame. After the candidate target is determined as the target detection result of the second image frame, for "sticker"-type applications, such as tracking a specific object and decorating it during photographing, the same target position may be output directly, preventing the decorated content, i.e., the sticker, from shaking in the image.
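The anti-jitter behaviour described here, reusing the previous position when the target has barely moved, can be sketched with the intersection-over-union measure that the claims later define. The 0.8 threshold, the helper names, and the reading that a high IoU corresponds to a low degree of change are assumptions for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def output_position(prev_box, curr_box, iou_threshold=0.8):
    """Reuse the previous box when the change is small, so an overlaid
    sticker does not visibly shake between frames."""
    if iou(prev_box, curr_box) >= iou_threshold:
        return prev_box   # small displacement: keep the old position
    return curr_box       # target genuinely moved: update the position
```

In a sticker application, `output_position` would be called once per processed frame and the returned box used to place the overlay, so small detector noise does not move the sticker.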
The third embodiment of the present application provides a video target detection method, which may: perform target detection on a collected first image frame in a first mode; when a candidate target meeting a first preset requirement is detected, determine the candidate target as the target detection result of the first image frame and switch to a second mode; in the second mode, perform target detection on a newly acquired second image frame; and, when the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame is detected to meet a preset condition, take the candidate target as the target detection result of the second image frame. For further details of the third embodiment, reference may be made to the implementation of the first embodiment, which is not repeated here. The method performs target detection in the first mode during the detection stage and tracking detection in the second mode during the tracking stage, and can determine the moving range of the detected target during tracking according to the change in the target's relative position in the image frame, so that outputting the determined detection result can effectively prevent jitter during video target recognition and tracking.
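As a rough sketch, the detect-then-track flow of steps S1101 to S1104 can be organised as a loop over two states. The detector interface (a callable returning (box, confidence) pairs), the threshold values, and the centre-distance position check are all assumptions for illustration, not the embodiment's prescribed implementation:

```python
DETECT, TRACK = "detect", "track"

def centers_close(box_a, box_b, max_shift=40):
    """One possible 'position relation' check: box centres must lie
    within max_shift pixels of each other (an illustrative choice)."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return abs(cax - cbx) <= max_shift and abs(cay - cby) <= max_shift

def run_pipeline(frames, detect, t_detect=0.8, t_track=0.5):
    """Two-mode loop: a strict confidence threshold in the first
    (recognition) mode, a relaxed threshold plus a position check in
    the second (tracking) mode. `detect(frame)` is assumed to return
    a list of (box, confidence) pairs."""
    mode, prev_box, results = DETECT, None, []
    for frame in frames:
        candidates = detect(frame)
        if mode == DETECT:
            hits = [(b, c) for b, c in candidates if c >= t_detect]
            if hits:
                prev_box = max(hits, key=lambda bc: bc[1])[0]
                results.append(prev_box)
                mode = TRACK              # S1102: switch to tracking
        else:
            hits = [(b, c) for b, c in candidates
                    if c >= t_track and centers_close(prev_box, b)]
            if hits:
                prev_box = max(hits, key=lambda bc: bc[1])[0]
                results.append(prev_box)  # S1104: accept tracked target
            else:
                mode = DETECT             # target lost: re-detect
    return results
```

The switch back to the first mode on a tracking miss is a further assumption; the embodiment only describes the forward switch from the first mode to the second.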
Embodiment Four
The fourth embodiment of the present application provides another video target detection method. Referring to fig. 12, the method may include the following steps:
s1201: carrying out target detection on the collected first image frame;
s1202: when a candidate target meeting a first preset requirement is detected, determining the candidate target as a target detection result of the first image frame;
s1203: carrying out target detection on the newly acquired second image frame;
s1204: and when detecting that the relation between the position of the candidate target in the second image frame and the position of the target detection result in the previous image frame in the second image frame meets a preset condition, taking the candidate target as the target detection result of the second image frame.
In the video target detection method provided by the fourth embodiment of the present application, target detection may be performed on the first image frame. When a candidate target meeting a first preset requirement is detected (the first preset requirement may be the same as in the first or third embodiment, i.e., judging whether the confidence of the candidate target is greater than a preset threshold), the candidate target may be determined as the target detection result of the first image frame. Target detection is then performed on the next newly acquired image frame, such as the second image frame, and when the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame is detected to meet a preset condition, the candidate target is taken as the target detection result of the second image frame. During tracking detection, the moving range of the detected target can be determined in combination with the change in the target's relative position in the image frame, and outputting the determined detection result can effectively prevent jitter during video target recognition and tracking.
Corresponding to the third embodiment of the present application, a video object detection apparatus is also provided. As shown in fig. 13, which is a schematic diagram of the video object detection apparatus, the apparatus may include:
a target identification unit 1301, configured to perform target detection on the acquired first image frame in the first mode;
a recognition result determining unit 1302, configured to, when a candidate target satisfying a first preset requirement is detected, determine the candidate target as a target detection result of the first image frame, and switch to a second mode;
a tracking identification unit 1303, configured to perform target detection on the newly acquired second image frame in the second mode;
a result output unit 1304, configured to, when it is detected that a relationship between a position of the candidate target in the second image frame and a position of a target detection result in a previous image frame meets a preset condition, take the candidate target as a target detection result of the second image frame.
Wherein the candidate object satisfying the first preset requirement may include that the confidence of the candidate object is greater than the first threshold. The second mode may be a tracking detection mode.
Corresponding to the fourth embodiment of the present application, another video object detection apparatus is further provided. As shown in fig. 14, which is a schematic diagram of the video object detection apparatus, the apparatus may include:
A first object detection unit 1401 for performing object detection on the acquired first image frame;
a target determination unit 1402, configured to, when a candidate target satisfying a first preset requirement is detected, determine the candidate target as a target detection result of the first image frame;
a second object detection unit 1403, configured to perform object detection on the newly acquired second image frame; and
a target output unit 1404, configured to, when it is detected that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame meets a preset condition, take the candidate target as the target detection result of the second image frame.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are substantially similar to the method embodiments and are therefore described relatively simply; for related points, reference may be made to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The video target detection method, apparatus, and computer system provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementation of the present application, and the description of the above embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and application scope. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (18)

1. A method for video object detection, comprising:
after starting video target detection, carrying out target detection on the acquired video image frame in a first mode, wherein if a candidate target with a confidence coefficient greater than a first threshold value is detected, determining the candidate target as a target detection result of a corresponding image frame, and switching to a second mode;
in the second mode, carrying out target detection on a newly acquired image frame, wherein if the confidence coefficient of a candidate target in the current image frame is detected to be greater than a second threshold value, and the relation between the position of the candidate target in the current image frame and the position of a target detection result in the previous image frame meets a preset condition, the candidate target is used as a target detection result in the current image frame; the first threshold is greater than the second threshold.
2. The method of claim 1,
switching to a second mode if a candidate target with a confidence level greater than the first threshold is detected, comprising:
and if a candidate target with a confidence coefficient larger than the first threshold value is detected in one image frame, switching to a second mode so as to perform target detection on a new image frame acquired after the image frame in the second mode.
3. The method of claim 1, further comprising:
after the target detection result in the current image frame is obtained, determining the position of the target detection result in the current image frame and the change degree information relative to the position of the target detection result in the previous image frame;
and if the change degree is lower than a third threshold value, outputting the position of the target detection result in the previous image frame.
4. The method of claim 3, further comprising:
and if the change degree is higher than the third threshold value, outputting the position of the target detection result in the current image frame.
5. The method of claim 3,
the degree of change information is determined by:
determining an intersection-over-union (IoU) ratio between a first bounding box corresponding to the target detection result in the current image frame and a second bounding box corresponding to the target detection result in the previous image frame, and determining the IoU ratio as the change degree information; the IoU ratio is the ratio of the intersection to the union of the first bounding box and the second bounding box.
6. A method for video object detection, comprising:
Carrying out target detection on the collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
determining the position of the target detection result in the current image frame and the change degree information relative to the position of the target detection result in the previous image frame;
and if the change degree is lower than a third threshold value, outputting the position of the target detection result in the previous image frame.
7. The method of claim 6, further comprising:
and if the change degree is higher than the third threshold value, outputting the position of the target detection result in the current image frame.
8. The method of claim 6,
the degree of change information is determined by:
determining an intersection-over-union (IoU) ratio between a first bounding box corresponding to the target detection result in the current image frame and a second bounding box corresponding to the target detection result in the previous image frame, and determining the IoU ratio as the change degree information; the IoU ratio is the ratio of the intersection to the union of the first bounding box and the second bounding box.
9. A video object detection apparatus, comprising:
a first detection unit, used for carrying out target detection on the acquired video image frame in a first mode after starting video target detection, wherein if a candidate target with a confidence coefficient greater than a first threshold value is detected, the candidate target is determined as a target detection result of the corresponding image frame, and the mode is switched to a second mode;
the second detection unit is used for carrying out target detection on the newly acquired image frame in the second mode, wherein if the confidence coefficient of a candidate target in the current image frame is detected to be greater than a second threshold value, and the relation between the position of the candidate target in the current image frame and the position of a target detection result in the previous image frame meets a preset condition, the candidate target is used as the target detection result in the current image frame; the first threshold is greater than the second threshold.
10. A video object detection apparatus, comprising:
the target detection unit is used for carrying out target detection on the collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
The position change determining unit is used for determining the position of the target detection result in the current image frame and the change degree information relative to the position of the target detection result in the previous image frame;
and the output unit is used for outputting the position of the target detection result in the previous image frame if the change degree is lower than a third threshold value.
11. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
after starting video target detection, carrying out target detection on the acquired video image frame in a first mode, wherein if a candidate target with a confidence coefficient greater than a first threshold value is detected, determining the candidate target as a target detection result of a corresponding image frame, and switching to a second mode;
in the second mode, carrying out target detection on a newly acquired image frame, wherein if the confidence coefficient of a candidate target in the current image frame is detected to be greater than a second threshold value, and the relation between the position of the candidate target in the current image frame and the position of a target detection result in the previous image frame meets a preset condition, the candidate target is used as a target detection result in the current image frame; the first threshold is greater than the second threshold.
12. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
carrying out target detection on the collected video image frame to obtain a target detection result and position information of the target detection result in the image frame;
determining the position of the target detection result in the current image frame and the change degree information relative to the position of the target detection result in the previous image frame;
and if the change degree is lower than a third threshold value, outputting the position of the target detection result in the previous image frame.
13. A method for video object detection, comprising:
under a first mode, carrying out target detection on the acquired first image frame;
when a candidate target meeting a first preset requirement is detected, determining the candidate target as a target detection result of the first image frame, and switching to a second mode;
under the second mode, carrying out target detection on a newly acquired second image frame;
and when detecting that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame meets a preset condition, taking the candidate target as the target detection result of the second image frame.
14. The method of claim 13, wherein the candidate targets meeting the first preset requirement comprise: the confidence of the candidate object is greater than a first threshold.
15. The method of claim 13, wherein the second mode is a tracking detection mode.
16. A method for video object detection, comprising:
carrying out target detection on the collected first image frame;
when a candidate target meeting a first preset requirement is detected, determining the candidate target as a target detection result of the first image frame;
carrying out target detection on the newly acquired second image frame;
and when detecting that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame meets a preset condition, taking the candidate target as the target detection result of the second image frame.
17. A video object detection apparatus, comprising:
the target identification unit is used for carrying out target detection on the acquired first image frame in a first mode;
a recognition result determining unit, configured to, when a candidate target satisfying a first preset requirement is detected, determine the candidate target as a target detection result of the first image frame, and switch to a second mode;
the tracking identification unit is used for carrying out target detection on the newly acquired second image frame in the second mode;
and a result output unit, used for taking the candidate target as the target detection result of the second image frame when detecting that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame meets a preset condition.
18. A video object detection apparatus, comprising:
the first object detection unit is used for carrying out object detection on the acquired first image frame;
the target determining unit is used for determining a candidate target as a target detection result of the first image frame when the candidate target meeting a first preset requirement is detected;
The second object detection unit is used for carrying out target detection on the newly acquired second image frame;
and a target output unit, used for taking the candidate target as the target detection result of the second image frame when detecting that the relationship between the position of the candidate target in the second image frame and the position of the target detection result in the image frame preceding the second image frame meets a preset condition.
CN201910430828.3A 2019-05-22 2019-05-22 Video target detection method, device and computer system Pending CN111986229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910430828.3A CN111986229A (en) 2019-05-22 2019-05-22 Video target detection method, device and computer system

Publications (1)

Publication Number Publication Date
CN111986229A true CN111986229A (en) 2020-11-24

Family ID: 73436049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910430828.3A Pending CN111986229A (en) 2019-05-22 2019-05-22 Video target detection method, device and computer system

Country Status (1)

Country Link
CN (1) CN111986229A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066990A (en) * 2017-05-04 2017-08-18 厦门美图之家科技有限公司 A kind of method for tracking target and mobile device
CN107491715A (en) * 2016-06-13 2017-12-19 北京文安智能技术股份有限公司 A kind of subway carriage passenger flow statistical method, apparatus and system based on video analysis
CN108765452A (en) * 2018-05-11 2018-11-06 西安天和防务技术股份有限公司 A kind of detection of mobile target in complex background and tracking
CN109118523A (en) * 2018-09-20 2019-01-01 电子科技大学 A kind of tracking image target method based on YOLO

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518256A (en) * 2021-07-23 2021-10-19 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN113518256B (en) * 2021-07-23 2023-08-08 腾讯科技(深圳)有限公司 Video processing method, video processing device, electronic equipment and computer readable storage medium
CN113611387A (en) * 2021-07-30 2021-11-05 清华大学深圳国际研究生院 Motion quality assessment method based on human body pose estimation and terminal equipment
CN113759314A (en) * 2021-09-01 2021-12-07 浙江讯飞智能科技有限公司 Sound source visualization method, device and system and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination