Disclosure of Invention
Embodiments of the invention provide a method and a device for assisting target detection with video coding information, which are used to assist target detection and to improve its efficiency.
In a first aspect, an embodiment of the present invention provides a method for assisting target detection using video coding information, including: determining a target reference frame for a video frame to be detected, wherein the target reference frame comprises a plurality of sub-window blocks; for each sub-window block, determining the reference position of the sub-window block in the video frame to be detected and determining a reference value for that reference position; determining an overall reference value based on the reference values of the respective sub-window blocks; searching for a target window in the video frame to be detected based on the overall reference value, within the constraint range of a target deformation function; and performing target detection based on the target window.
In some embodiments, determining the target reference frame for the video frame to be detected comprises:
selecting a target reference frame preceding the video frame to be detected, based on a preset time range and the video frame to be detected.
In some embodiments, the preset time range is determined by a specified time interval, or is determined based on a time-reliability function, wherein the closer a frame is to the video frame to be detected, the higher its reliability as a target reference frame.
In some embodiments, determining the reference value for the reference location comprises:
determining the reference value of the reference position based on the size of the block corresponding to the reference position and on the reliability of the reference position, wherein the reference value increases with the size of the corresponding block and/or with the reliability of the reference position.
In some embodiments, finding a target window in the video frame to be detected comprises:
searching for a target window in the video frame to be detected within the constraint range of the target deformation function, such that the ratio of the overall reference value of the target window to the overall reference value of the video frame to be detected exceeds a first threshold, and the ratio of the sum of the sizes of the reference positions contained in the target window to the size of the target window exceeds a second threshold.
In some embodiments, the method further comprises: pre-configuring a target detection count value, decrementing the target detection count value by 1 whenever a target window is found, and performing target detection based on each target window when the target detection count value reaches 0.
In some embodiments, the method further comprises: performing target detection directly when no target window can be found.
In some embodiments, the first threshold and/or the second threshold are adjusted to stabilize the searched target window.
In a second aspect, an embodiment of the present invention further provides a target detection device for video coding, including a processor configured to carry out the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
According to embodiments of the invention, a target window is searched for in the video frame to be detected, within the constraint range of the target deformation function and based on the determined overall reference value, and target detection is performed based on that target window. This effectively reduces the amount of computation and improves target detection efficiency, thereby saving hardware resources and reducing overall power consumption and latency.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages of the present invention may become more readily apparent, embodiments of the present invention are described below.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Almost all video coding protocols require motion estimation during encoding and decoding. The basic idea is to divide each frame of the image sequence into a number of non-overlapping blocks and to treat the displacement of all pixels within a block as identical; then, for each macroblock, the most similar block, i.e. the matching block, is found in the reference frame according to a block-matching criterion within a certain search range, and the relative displacement between the matching block and the current block is the motion vector.
Generally, the displacement of a moving object between two adjacent frames is small, so a matching block can be found by searching a small area around the macroblock at the same position in the historical frame. Obviously, the larger the search area, the higher the computational cost, so practical products need some techniques to accelerate the search and reduce the amount of computation. For example, the H.264 protocol allows a matching block to be sought in any reconstructed frame in the DPB buffer in order to find the optimal match. Clearly the cost of searching is proportional to the number of reference frames: the time (chip area) and power consumption required to search for a matching block in two reference frames are essentially twice those of the single-reference-frame case. From the viewpoint of cost saving, the simplest scheme is therefore to refer only to the previous frame, which saves both reference-frame buffering and encoder computation, at the price of sacrificing some rate performance.
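For illustration only, the following minimal Python sketch (an assumption of this description, not part of any encoder standard) shows full-search block matching with the sum of absolute differences (SAD) criterion over luma samples held in numpy arrays; real encoders add early termination, sub-pixel refinement, and rate-distortion weighting.

```python
import numpy as np

def full_search_sad(cur, ref, bx, by, bsize=16, radius=8):
    """Exhaustive SAD search for the block at (bx, by) of frame `cur`
    over a (2*radius+1)^2 neighbourhood in reference frame `ref`."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            # Skip candidate blocks that fall outside the reference frame.
            if x < 0 or y < 0 or x + bsize > ref.shape[1] or y + bsize > ref.shape[0]:
                continue
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```

The cost of the inner loop grows with the square of the search radius and linearly with the number of reference frames searched, which is exactly the trade-off described above.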
After finding the best matching block, motion estimation outputs a motion vector (MV), i.e. the position of the reference block relative to the current block. If no suitable matching block is found, the macroblock may be intra-predicted. Thus, as shown in fig. 1, both P-frames and B-frames may contain intra-coded macroblocks in H.264 and H.265.
Reducing the frequency of target detection is workable in certain application scenarios, but not in scenarios that require detection at the full video frame rate. For example, in a region-enhancement application, the video encoder needs to adjust the QP of a specific object to improve image quality, which requires target recognition on every frame. Even in parking-lot license plate recognition, a fast-moving vehicle may be missed between timed snapshots.
Conventional video target detection that uses motion information requires additional computing resources to estimate that motion information, which wastes resources and increases hardware and energy costs.
In order to locate and track targets by means of motion vectors, an embodiment of the present invention provides a method for assisting target detection using video coding information, comprising:
Step S201: determining a target reference frame for a video frame to be detected, wherein the target reference frame comprises a plurality of sub-window blocks. In particular, the sub-window blocks in this embodiment may lie at target locations found through deep learning or other methods. Because video is in many cases temporally continuous, a previous video frame, e.g. the immediately preceding frame, can be selected as the target reference frame.

Step S202: for each sub-window block, determining the reference position of the sub-window block in the video frame to be detected, and determining a reference value for that reference position. That is, for each block in the target window of the target reference frame, a reference position may be determined in the video frame to be detected, namely the position in the video frame to be detected that corresponds to the sub-window block. From this reference position a reference value may be calculated, which in embodiments of the present disclosure describes the temporal confidence of the sub-window block.

Step S203: determining an overall reference value based on the reference values of the respective sub-window blocks. The overall reference value may be determined, for example, by summing, averaging, or weighted averaging, and is not limited here.
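As a concrete illustration of steps S201 to S203, the sketch below follows the encoder's motion vectors from the blocks of the frame to be detected back into the old target window; the data model, the names, and the sign convention of the motion vectors are assumptions made for this sketch, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class CodedBlock:
    # Simplified, assumed data model for one coded block of the frame
    # to be detected, as produced by the encoder's motion estimation.
    x: int; y: int          # top-left corner in the current frame
    w: int; h: int          # block size
    mvx: int; mvy: int      # motion vector into the target reference frame
    intra: bool = False     # intra blocks carry no motion information

def reference_positions(blocks, window, reliability=1.0):
    """Step S202 sketch: collect reference positions, i.e. blocks of the
    frame to be detected whose motion vectors point into the old target
    window (x0, y0, x1, y1) of the target reference frame. Each position
    is paired with a reference value of block size x reliability."""
    x0, y0, x1, y1 = window
    refs = []
    for b in blocks:
        if b.intra:
            continue                        # no motion vector to follow
        # Source location in the reference frame (sign convention assumed).
        sx, sy = b.x + b.mvx, b.y + b.mvy
        if x0 <= sx and y0 <= sy and sx + b.w <= x1 and sy + b.h <= y1:
            refs.append(((b.x, b.y, b.w, b.h), b.w * b.h * reliability))
    return refs

def overall_reference_value(refs):
    """Step S203, using the summation variant."""
    return sum(value for _pos, value in refs)
```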
Then, in step S204, a target window is searched for in the video frame to be detected based on the overall reference value, within the constraint range of the target deformation function. Finally, in step S205, target detection is performed based on the target window. In some embodiments, initially, the target detection result of the previous frame may be obtained by deep learning, so as to determine the position of each sub-window block containing a moving target. The positions, in the frame to be detected, of the sub-window blocks lying in the detection-result region of the previous frame can then be found through the motion vectors. These positions may be highly concentrated or widely dispersed. In the dispersed case, a new target window covering all the positions would be large. To control this, the present application proposes to limit the range of deformation by a target deformation function: the target window must remain similar to that of the previous frame, while a certain amount of variation is still allowed.
The method of the present disclosure is thus able to find the new position of the target in the next frame (the video frame to be detected) by motion estimation, without invoking deep learning again. Since motion estimation is a necessary step of the video coding process anyway, target detection efficiency is effectively improved; and because the target detection result of each frame is derived from motion vectors and the detection results of other frames, computing resources are effectively saved. Searching for the target window under the target deformation function further reduces the computation of the search.
In some embodiments, determining the target reference frame for the video frame to be detected comprises:
selecting a target reference frame preceding the video frame to be detected, based on a preset time range and the video frame to be detected. In some embodiments, the preset time range is determined by a specified time interval, or is determined based on a time-reliability function, wherein the closer a frame is to the video frame to be detected, the higher its reliability as a target reference frame. For example, the video frame one specified time interval earlier may be selected as the target reference frame; the previous frame may be selected, or the frame before that. The time-reliability function f(t) takes as input the time interval to the current frame and outputs a reliability: the shorter the interval, i.e. the closer the target reference frame is to the current frame, the higher its reliability. It can be implemented with a one-dimensional LUT, with a maximum output of 1 and a minimum of 0. For example, if only the previous frame is to be referenced (the previous frame as the sole target reference frame), then f(t = -1) = 1 and f(t ≠ -1) = 0. If two reference frames are desired, then for example f(t = -1) = 1, f(t = -2) = 0.8, and f(t) = 0 for t ∉ {-1, -2}.
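A one-dimensional LUT realisation of f(t) can be as simple as the following sketch, using the example values above (the dictionary-based LUT is an implementation assumption):

```python
# One-dimensional LUT for the time-reliability function f(t):
# key = frames back from the current frame, value in [0, 1].
F_LUT = {-1: 1.0, -2: 0.8}   # two reference frames; all other intervals -> 0

def f(t: int) -> float:
    """Reliability of a reference frame t frames before the current one;
    frames outside the preset range contribute 0."""
    return F_LUT.get(t, 0.0)
```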
In some embodiments, determining the reference value for the reference position comprises: determining the reference value based on the size of the block corresponding to the reference position and on the reliability of the reference position, wherein the reference value increases with the size of the corresponding block and/or with the reliability of the reference position. As an example, the product of the size of the block corresponding to the reference position and its reliability value may be used. In step S203, the overall reference value is then determined from the reference values of the respective sub-window blocks, for example by summation.
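As a hypothetical numeric illustration of the size × reliability rule with the summation variant:

```python
# Hypothetical numbers: two 16x16 blocks, one referenced from the previous
# frame (reliability f(-1) = 1.0) and one from two frames back (f(-2) = 0.8).
v1 = 16 * 16 * 1.0          # size x reliability -> 256.0
v2 = 16 * 16 * 0.8          # size x reliability -> 204.8
overall = v1 + v2           # summation variant  -> 460.8
```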
In some embodiments, finding a target window in the video frame to be detected comprises: searching for a target window in the video frame to be detected within the constraint range of the target deformation function, such that the ratio of the sum of the reference values of the reference positions contained in the target window to the overall reference value exceeds a first threshold, and the ratio of the sum of the sizes of those reference positions to the size of the target window exceeds a second threshold. A concrete search may use the target deformation function Q(X, Y), which defines the deformation range allowed for the target window in the current frame relative to the size and shape of the target window in the previous frame. For example, Q(0.1, 0.1) may allow a deformation (increase or decrease) of 10% in each of the width and height of the target window. It is also possible to allow only target windows of fixed sizes [1.1W, 1.1H], [1.1W, 1.0H], [1.1W, 0.9H], [1.0W, 1.1H], [1.0W, 1.0H], [1.0W, 0.9H], [0.9W, 1.1H], [0.9W, 1.0H], [0.9W, 0.9H]; Q(0, 0) indicates that the target window size must match the previous frame. In some embodiments, the target deformation function may be configured to tolerate a small range of scene deformation. Its purpose in this example is to limit the scope of the search and thereby reduce the computation of the search, while tolerating a certain amount of deformation of the target in the video; since the time between frames is short, the deformation tolerance is set to a small range in most scenarios in this example. A first threshold Pc and a second threshold Pd may be preset, so that a new target window is searched for in the current frame (the frame to be detected) within the range allowed by the target deformation function, requiring that the ratio of the sum of the reliable reference values of the contained reference positions to the overall reliable reference value exceeds Pc. If a matching target window is found, it is further checked that the ratio of the sum of the sizes of the corresponding reference positions to the size of the target window exceeds Pd.
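The following sketch puts the pieces together: it enumerates candidate windows within Q(0.1, 0.1) of the previous window and keeps the best window passing both thresholds. The positional search pattern and the tie-breaking by largest contained reference value are assumptions of this sketch; the text above fixes only the deformation range and the two ratio tests.

```python
from itertools import product

def search_target_window(refs, prev_win, overall, Pc, Pd, qx=0.1, qy=0.1, step=4):
    """Search, within the deformation range Q(qx, qy) of the previous window
    (px, py, pw, ph), for a new window satisfying both thresholds:
      contained reference values / overall reference value > Pc
      contained reference-position area / window area      > Pd
    `refs` is a list of ((x, y, w, h), reference_value) pairs."""
    px, py, pw, ph = prev_win
    widths  = [max(1, int(pw * s)) for s in (1 - qx, 1.0, 1 + qx)]
    heights = [max(1, int(ph * s)) for s in (1 - qy, 1.0, 1 + qy)]
    best_val, best_win = 0.0, None
    for w, h in product(widths, heights):
        # Small positional search around the previous window (an assumption;
        # the text does not fix the positional search strategy).
        for dx in range(-step, step + 1, step):
            for dy in range(-step, step + 1, step):
                x, y = px + dx, py + dy
                inside = [(pos, v) for pos, v in refs
                          if x <= pos[0] and y <= pos[1]
                          and pos[0] + pos[2] <= x + w
                          and pos[1] + pos[3] <= y + h]
                val = sum(v for _p, v in inside)
                area = sum(p[2] * p[3] for p, _v in inside)
                if val > Pc * overall and area > Pd * w * h and val > best_val:
                    best_val, best_win = val, (x, y, w, h)
    return best_win   # None if no window meets the tracking condition
```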
In some embodiments, the first threshold and/or the second threshold may be adjusted to stabilize the searched target window. In this example, a larger target window more easily achieves a large reference-position coverage, i.e. a ratio above Pc, while a smaller target window more easily achieves a high coverage density, i.e. a ratio above Pd. By adjusting Pc and Pd, the search can be steered towards a stable final target window.
In some embodiments, a new target window may be found in the current frame within the range allowed by the target deformation function, where the ratio of the sum of the trusted reference values of the reference positions it contains to the overall trusted reference value needs to exceed Pc.
The new window must contain a certain proportion of the blocks of the old window (blocks that have moved into the new window); this constraint pushes the new window to enlarge. At the same time, the ratio of the sum of the sizes of the contained reference positions to the size of the found target window needs to exceed Pd, so the new window must not contain too many unknown blocks (blocks that do not come from the old window); this constraint pushes the new window to shrink. The target window is obtained by matching under both constraints: the two factors Pd and Pc together control the position of the target in the current frame and the size of the target window, so that the target window fits the actual situation.
In some embodiments, the method further comprises: pre-configuring a target detection count value, decrementing the count value by 1 whenever a target window is found, and performing target detection based on each target window when the count value reaches 0. In some embodiments, the method further comprises: performing target detection directly when no target window can be found. For example, a target detection count value N = k may be configured in advance, and the count value may be held in a forced-target-detection counter used to schedule forced target detection.
When a new frame arrives, if intra prediction is forced because the frame is an I-frame (key frame) or by some other mechanism, the number of reference frames is 0 and no motion vector can be determined; target detection is then performed on that frame by a preset method, such as deep learning, and the counter is reset: N = k.
If a target window satisfying both Pc and Pd is found, that target window is output and the counter is decremented: N = N - 1.
If no target window satisfies both Pc and Pd (the target tracking condition is not met), target detection is run directly (e.g. target localization or target detection for the frame is completed through deep learning) and the counter is reset: N = k.
If N reaches 0, target detection is run and the counter is reset: N = k.
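The counter logic of the four cases above can be sketched as follows; `detect` and `track` are injected placeholders standing for the deep-learning detector and the Pc/Pd window search, and k = 30 is an assumed example period.

```python
class ForcedDetectionCounter:
    """Counter-driven scheduling of the cases above: N is reset to k whenever
    a full detection runs, decremented on each successful track, and forces a
    detection when it reaches 0 to stop error accumulation."""
    def __init__(self, detect, track, k=30):
        self.detect = detect   # e.g. a deep-learning detector (injected)
        self.track = track     # the Pc/Pd window search sketched above
        self.k = k
        self.n = k

    def on_frame(self, frame, is_intra):
        if is_intra:                 # I-frame: no motion vectors available
            self.n = self.k
            return self.detect(frame)
        window = self.track(frame)
        if window is None:           # target tracking condition not met
            self.n = self.k
            return self.detect(frame)
        self.n -= 1
        if self.n == 0:              # forced target detection
            self.n = self.k
            return self.detect(frame)
        return window
```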
In summary, the method of the present invention can track a target using motion vectors, reducing the use of computing resources. Since the target detection result of each frame is derived from motion vectors and the detection results of other frames, errors can accumulate continuously in this process; the forced-target-detection counter is therefore employed in this example to prevent accumulated errors, so that newly appearing targets are not missed while target tracking is maintained. The disclosed method cannot by itself find a new target; in a specific implementation it is combined with deep learning, which detects the target to be tracked, after which the disclosed method performs the tracking. Because deep-learning target localization is performed only at frame intervals, the algorithmic complexity while tracking the same target is 1/30 or even lower, so the disclosed method effectively saves computing resources. The target deformation function limits the range of the search and greatly reduces its computation; by adjusting Pc and Pd, a stable final target window can be obtained; and the time-reliability function adjusts the weight of each reference frame to obtain a more accurate result. During decoding, the decoder can likewise obtain the motion vector information, so the disclosed method can also be used to accelerate target detection and save computation on the decoding side.
In a second aspect, an embodiment of the present invention further provides a target detection device for video coding, including a processor configured to carry out the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.