Disclosure of Invention
Embodiments of the invention provide a method and a device for assisting target detection with video coding information, which are used to assist target detection and to improve its efficiency.
In a first aspect, an embodiment of the present invention provides a method for assisting target detection using video coding information, including: determining a target reference frame for a video frame to be detected, wherein the target reference frame comprises a plurality of sub-window blocks; for each sub-window block, determining the reference position of the sub-window block in the video frame to be detected and determining a reference value for that reference position; determining an overall reference value based on the reference values of the respective sub-window blocks; searching for a target window in the video frame to be detected based on the overall reference value, within the constraint range of a target deformation function; and performing target detection based on the target window.
In some embodiments, determining the target reference frame for the video frame to be detected comprises:
selecting a target reference frame preceding the video frame to be detected, based on a preset time range and the video frame to be detected.
In some embodiments, the preset time range is determined by a specified time interval, or is determined based on a time-reliability function, wherein the closer a frame is to the video frame to be detected, the higher its reliability as a target reference frame.
In some embodiments, determining the reference value for the reference location comprises:
determining the reference value of the reference position based on the size of the block corresponding to the reference position and on the reliability of the reference position, wherein the reference value increases with the size of the corresponding block and/or with the reliability of the reference position.
In some embodiments, finding a target window in the video frame to be detected comprises:
searching for a target window in the video frame to be detected within the constraint range of the target deformation function, such that the ratio of the overall reference value of the target window to the overall reference value of the video frame to be detected exceeds a first threshold, and the ratio of the sum of the sizes of the reference positions contained in the target window to the size of the target window exceeds a second threshold.
In some embodiments, the method further comprises: pre-configuring a target detection count value, decrementing the target detection count value by 1 whenever a target window is found, and performing target detection based on each target window when the target detection count value reaches 0.
In some embodiments, the method further comprises: performing target detection directly when no target window can be found.
In some embodiments, the first threshold and/or the second threshold are adjusted to stabilize the searched target window.
In a second aspect, an embodiment of the present invention further provides a target detection device for video coding, including a processor configured to carry out the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
According to embodiments of the invention, a target window is searched for in the video frame to be detected, within the constraint range of the target deformation function and based on the determined overall reference value, and target detection is performed based on that target window. This effectively reduces the amount of computation and improves target detection efficiency, thereby saving hardware resources and reducing overall power consumption and latency.
The foregoing is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages of the present invention may become more readily apparent, embodiments of the present invention are described below.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Almost all video coding protocols require motion estimation during encoding and decoding. The basic idea is to divide each frame of the image sequence into a number of non-overlapping blocks and to treat the displacement of all pixels within a block as identical; then, for each macroblock, the most similar block, i.e. the matching block, is found in the reference frame according to a block-matching criterion within a certain search range, and the relative displacement between the matching block and the current block is the motion vector.
Generally, the displacement of a moving object between two adjacent frames is small, so a matching block can be found by searching a small area around the macroblock at the same position in the historical frame. Obviously, the larger the search area, the higher the computational cost, so practical products need some techniques to accelerate the search and reduce the amount of computation. For example, the H.264 protocol allows a matching block to be sought in any reconstructed frame in the DPB buffer in order to find the optimal match. Clearly the cost of searching is proportional to the number of reference frames: the time (chip area) and power consumption required to search for a matching block in two reference frames are essentially twice those of the single-reference-frame case. From the viewpoint of cost saving, the simplest scheme is therefore to refer only to the previous frame, which saves both reference-frame buffering and encoder computation, at the price of sacrificing some rate performance.
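For illustration only, the following minimal Python sketch (an assumption of this description, not part of any encoder standard) shows full-search block matching with the sum of absolute differences (SAD) criterion over luma samples held in numpy arrays; real encoders add early termination, sub-pixel refinement, and rate-distortion weighting.

```python
import numpy as np

def full_search_sad(cur, ref, bx, by, bsize=16, radius=8):
    """Exhaustive SAD search for the block at (bx, by) of frame `cur`
    over a (2*radius+1)^2 neighbourhood in reference frame `ref`."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            # Skip candidate blocks that fall outside the reference frame.
            if x < 0 or y < 0 or x + bsize > ref.shape[1] or y + bsize > ref.shape[0]:
                continue
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```

The cost of the inner loop grows with the square of the search radius and linearly with the number of reference frames searched, which is exactly the trade-off described above.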
After finding the best matching block, motion estimation outputs a motion vector (MV), i.e. the position of the reference block relative to the current block. If no suitable matching block is found, the macroblock may be intra-predicted. Thus, as shown in fig. 1, both P-frames and B-frames may contain intra-coded macroblocks in H.264 and H.265.
Reducing the frequency of target detection is workable in certain application scenarios, but not in scenarios that require detection at the full video frame rate. For example, in a region-enhancement application, the video encoder needs to adjust the QP of a specific object to improve image quality, which requires target recognition on every frame. Even in parking-lot license plate recognition, a fast-moving vehicle may be missed between timed snapshots.
Conventional video target detection that uses motion information requires additional computing resources to estimate that motion information, which wastes resources and increases hardware and energy costs.
In order to locate and track targets by means of motion vectors, an embodiment of the present invention provides a method for assisting target detection using video coding information, comprising:
Step S201: determining a target reference frame for a video frame to be detected, wherein the target reference frame comprises a plurality of sub-window blocks. In particular, the sub-window blocks in this embodiment may lie at target locations found through deep learning or other methods. Because video is in many cases temporally continuous, a previous video frame, e.g. the immediately preceding frame, can be selected as the target reference frame.

Step S202: for each sub-window block, determining the reference position of the sub-window block in the video frame to be detected, and determining a reference value for that reference position. That is, for each block in the target window of the target reference frame, a reference position may be determined in the video frame to be detected, namely the position in the video frame to be detected that corresponds to the sub-window block. From this reference position a reference value may be calculated, which in embodiments of the present disclosure describes the temporal confidence of the sub-window block.

Step S203: determining an overall reference value based on the reference values of the respective sub-window blocks. The overall reference value may be determined, for example, by summing, averaging, or weighted averaging, and is not limited here.
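As a concrete illustration of steps S201 to S203, the sketch below follows the encoder's motion vectors from the blocks of the frame to be detected back into the old target window; the data model, the names, and the sign convention of the motion vectors are assumptions made for this sketch, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class CodedBlock:
    # Simplified, assumed data model for one coded block of the frame
    # to be detected, as produced by the encoder's motion estimation.
    x: int; y: int          # top-left corner in the current frame
    w: int; h: int          # block size
    mvx: int; mvy: int      # motion vector into the target reference frame
    intra: bool = False     # intra blocks carry no motion information

def reference_positions(blocks, window, reliability=1.0):
    """Step S202 sketch: collect reference positions, i.e. blocks of the
    frame to be detected whose motion vectors point into the old target
    window (x0, y0, x1, y1) of the target reference frame. Each position
    is paired with a reference value of block size x reliability."""
    x0, y0, x1, y1 = window
    refs = []
    for b in blocks:
        if b.intra:
            continue                        # no motion vector to follow
        # Source location in the reference frame (sign convention assumed).
        sx, sy = b.x + b.mvx, b.y + b.mvy
        if x0 <= sx and y0 <= sy and sx + b.w <= x1 and sy + b.h <= y1:
            refs.append(((b.x, b.y, b.w, b.h), b.w * b.h * reliability))
    return refs

def overall_reference_value(refs):
    """Step S203, using the summation variant."""
    return sum(value for _pos, value in refs)
```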
Then, in step S204, a target window is searched for in the video frame to be detected based on the overall reference value, within the constraint range of the target deformation function. Finally, in step S205, target detection is performed based on the target window. In some embodiments, initially, the target detection result of the previous frame may be obtained by deep learning, so as to determine the position of each sub-window block containing a moving target. The positions, in the frame to be detected, of the sub-window blocks lying in the detection-result region of the previous frame can then be found through the motion vectors. These positions may be highly concentrated or widely dispersed. In the dispersed case, a new target window covering all the positions would be large. To control this, the present application proposes to limit the range of deformation by a target deformation function: the target window must remain similar to that of the previous frame, while a certain amount of variation is still allowed.
The method of the present disclosure is thus able to find the new position of the target in the next frame (the video frame to be detected) by motion estimation, without invoking deep learning again. Since motion estimation is a necessary step of the video coding process anyway, target detection efficiency is effectively improved; and because the target detection result of each frame is derived from motion vectors and the detection results of other frames, computing resources are effectively saved. Searching for the target window under the target deformation function further reduces the computation of the search.
In some embodiments, determining the target reference frame for the video frame to be detected comprises:
selecting a target reference frame preceding the video frame to be detected, based on a preset time range and the video frame to be detected. In some embodiments, the preset time range is determined by a specified time interval, or is determined based on a time-reliability function, wherein the closer a frame is to the video frame to be detected, the higher its reliability as a target reference frame. For example, the video frame one specified time interval earlier may be selected as the target reference frame; the previous frame may be selected, or the frame before that. The time-reliability function f(t) takes as input the time interval to the current frame and outputs a reliability: the shorter the interval, i.e. the closer the target reference frame is to the current frame, the higher its reliability. It can be implemented with a one-dimensional LUT, with a maximum output of 1 and a minimum of 0. For example, if only the previous frame is to be referenced (the previous frame as the sole target reference frame), then f(t = -1) = 1 and f(t ≠ -1) = 0. If two reference frames are desired, then for example f(t = -1) = 1, f(t = -2) = 0.8, and f(t) = 0 for t ∉ {-1, -2}.
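A one-dimensional LUT realisation of f(t) can be as simple as the following sketch, using the example values above (the dictionary-based LUT is an implementation assumption):

```python
# One-dimensional LUT for the time-reliability function f(t):
# key = frames back from the current frame, value in [0, 1].
F_LUT = {-1: 1.0, -2: 0.8}   # two reference frames; all other intervals -> 0

def f(t: int) -> float:
    """Reliability of a reference frame t frames before the current one;
    frames outside the preset range contribute 0."""
    return F_LUT.get(t, 0.0)
```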
In some embodiments, determining the reference value for the reference position comprises: determining the reference value based on the size of the block corresponding to the reference position and on the reliability of the reference position, wherein the reference value increases with the size of the corresponding block and/or with the reliability of the reference position. As an example, the product of the size of the block corresponding to the reference position and its reliability value may be used. In step S203, the overall reference value is then determined from the reference values of the respective sub-window blocks, for example by summation.
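As a hypothetical numeric illustration of the size × reliability rule with the summation variant:

```python
# Hypothetical numbers: two 16x16 blocks, one referenced from the previous
# frame (reliability f(-1) = 1.0) and one from two frames back (f(-2) = 0.8).
v1 = 16 * 16 * 1.0          # size x reliability -> 256.0
v2 = 16 * 16 * 0.8          # size x reliability -> 204.8
overall = v1 + v2           # summation variant  -> 460.8
```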
In some embodiments, finding a target window in the video frame to be detected comprises: searching for a target window in the video frame to be detected within the constraint range of the target deformation function, such that the ratio of the sum of the reference values of the reference positions contained in the target window to the overall reference value exceeds a first threshold, and the ratio of the sum of the sizes of those reference positions to the size of the target window exceeds a second threshold. A concrete search may use the target deformation function Q(X, Y), which defines the deformation range allowed for the target window in the current frame relative to the size and shape of the target window in the previous frame. For example, Q(0.1, 0.1) may allow a deformation (increase or decrease) of 10% in each of the width and height of the target window. It is also possible to allow only target windows of fixed sizes [1.1W, 1.1H], [1.1W, 1.0H], [1.1W, 0.9H], [1.0W, 1.1H], [1.0W, 1.0H], [1.0W, 0.9H], [0.9W, 1.1H], [0.9W, 1.0H], [0.9W, 0.9H]; Q(0, 0) indicates that the target window size must match the previous frame. In some embodiments, the target deformation function may be configured to tolerate a small range of scene deformation. Its purpose in this example is to limit the scope of the search and thereby reduce the computation of the search, while tolerating a certain amount of deformation of the target in the video; since the time between frames is short, the deformation tolerance is set to a small range in most scenarios in this example. A first threshold Pc and a second threshold Pd may be preset, so that a new target window is searched for in the current frame (the frame to be detected) within the range allowed by the target deformation function, requiring that the ratio of the sum of the reliable reference values of the contained reference positions to the overall reliable reference value exceeds Pc. If a matching target window is found, it is further checked that the ratio of the sum of the sizes of the corresponding reference positions to the size of the target window exceeds Pd.
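The following sketch puts the pieces together: it enumerates candidate windows within Q(0.1, 0.1) of the previous window and keeps the best window passing both thresholds. The positional search pattern and the tie-breaking by largest contained reference value are assumptions of this sketch; the text above fixes only the deformation range and the two ratio tests.

```python
from itertools import product

def search_target_window(refs, prev_win, overall, Pc, Pd, qx=0.1, qy=0.1, step=4):
    """Search, within the deformation range Q(qx, qy) of the previous window
    (px, py, pw, ph), for a new window satisfying both thresholds:
      contained reference values / overall reference value > Pc
      contained reference-position area / window area      > Pd
    `refs` is a list of ((x, y, w, h), reference_value) pairs."""
    px, py, pw, ph = prev_win
    widths  = [max(1, int(pw * s)) for s in (1 - qx, 1.0, 1 + qx)]
    heights = [max(1, int(ph * s)) for s in (1 - qy, 1.0, 1 + qy)]
    best_val, best_win = 0.0, None
    for w, h in product(widths, heights):
        # Small positional search around the previous window (an assumption;
        # the text does not fix the positional search strategy).
        for dx in range(-step, step + 1, step):
            for dy in range(-step, step + 1, step):
                x, y = px + dx, py + dy
                inside = [(pos, v) for pos, v in refs
                          if x <= pos[0] and y <= pos[1]
                          and pos[0] + pos[2] <= x + w
                          and pos[1] + pos[3] <= y + h]
                val = sum(v for _p, v in inside)
                area = sum(p[2] * p[3] for p, _v in inside)
                if val > Pc * overall and area > Pd * w * h and val > best_val:
                    best_val, best_win = val, (x, y, w, h)
    return best_win   # None if no window meets the tracking condition
```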
In some embodiments, the first threshold and/or the second threshold may be adjusted to stabilize the searched target window. In this example, a larger target window more easily achieves a large reference-position coverage, i.e. a ratio above Pc, while a smaller target window more easily achieves a high coverage density, i.e. a ratio above Pd. By adjusting Pc and Pd, the search can be steered towards a stable final target window.
In some embodiments, a new target window may be found in the current frame within the range allowed by the target deformation function, where the ratio of the sum of the trusted reference values of the reference positions it contains to the overall trusted reference value needs to exceed Pc.
The new window must contain a certain proportion of the blocks of the old window (blocks that have moved into the new window); this constraint pushes the new window to enlarge. At the same time, the ratio of the sum of the sizes of the contained reference positions to the size of the found target window needs to exceed Pd, so the new window must not contain too many unknown blocks (blocks that do not come from the old window); this constraint pushes the new window to shrink. The target window is obtained by matching under both constraints: the two factors Pd and Pc together control the position of the target in the current frame and the size of the target window, so that the target window fits the actual situation.
In some embodiments, the method further comprises: pre-configuring a target detection count value, decrementing the count value by 1 whenever a target window is found, and performing target detection based on each target window when the count value reaches 0. In some embodiments, the method further comprises: performing target detection directly when no target window can be found. For example, a target detection count value N = k may be configured in advance, and the count value may be held in a forced-target-detection counter used to schedule forced target detection.
When a new frame arrives, if intra prediction is forced because the frame is an I-frame (key frame) or by some other mechanism, the number of reference frames is 0 and no motion vector can be determined; target detection is then performed on that frame by a preset method, such as deep learning, and the counter is reset: N = k.
If a target window satisfying both Pc and Pd is found, that target window is output and the counter is decremented: N = N - 1.
If no target window satisfies both Pc and Pd (the target tracking condition is not met), target detection is run directly (e.g. target localization or target detection for the frame is completed through deep learning) and the counter is reset: N = k.
If N reaches 0, target detection is run and the counter is reset: N = k.
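The counter logic of the four cases above can be sketched as follows; `detect` and `track` are injected placeholders standing for the deep-learning detector and the Pc/Pd window search, and k = 30 is an assumed example period.

```python
class ForcedDetectionCounter:
    """Counter-driven scheduling of the cases above: N is reset to k whenever
    a full detection runs, decremented on each successful track, and forces a
    detection when it reaches 0 to stop error accumulation."""
    def __init__(self, detect, track, k=30):
        self.detect = detect   # e.g. a deep-learning detector (injected)
        self.track = track     # the Pc/Pd window search sketched above
        self.k = k
        self.n = k

    def on_frame(self, frame, is_intra):
        if is_intra:                 # I-frame: no motion vectors available
            self.n = self.k
            return self.detect(frame)
        window = self.track(frame)
        if window is None:           # target tracking condition not met
            self.n = self.k
            return self.detect(frame)
        self.n -= 1
        if self.n == 0:              # forced target detection
            self.n = self.k
            return self.detect(frame)
        return window
```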
In summary, the method of the present invention can track a target using motion vectors, reducing the use of computing resources. Since the target detection result of each frame is derived from motion vectors and the detection results of other frames, errors can accumulate continuously in this process; the forced-target-detection counter is therefore employed in this example to prevent accumulated errors, so that newly appearing targets are not missed while target tracking is maintained. The disclosed method cannot by itself find a new target; in a specific implementation it is combined with deep learning, which detects the target to be tracked, after which the disclosed method performs the tracking. Because deep-learning target localization is performed only at frame intervals, the algorithmic complexity while tracking the same target is 1/30 or even lower, so the disclosed method effectively saves computing resources. The target deformation function limits the range of the search and greatly reduces its computation; by adjusting Pc and Pd, a stable final target window can be obtained; and the time-reliability function adjusts the weight of each reference frame to obtain a more accurate result. During decoding, the decoder can likewise obtain the motion vector information, so the disclosed method can also be used to accelerate target detection and save computation on the decoding side.
In a second aspect, an embodiment of the present invention further provides a target detection device for video coding, including a processor configured to carry out the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for assisting target detection using video coding information according to embodiments of the present disclosure.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.