CN114219938A - Region-of-interest acquisition method - Google Patents

Region-of-interest acquisition method

Info

Publication number
CN114219938A
CN114219938A
Authority
CN
China
Prior art keywords
image frame
current image
visual target
region
target tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111443193.4A
Other languages
Chinese (zh)
Inventor
李奕霖
李晓雯
王豪
李凯
陈颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111443193.4A
Publication of CN114219938A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

One or more embodiments of the present specification provide a region-of-interest acquisition method, including: determining whether a scene change occurs between a current image frame and a previous image frame; if yes, performing visual target detection on the current image frame, and determining a region of interest contained in the current image frame; if not, performing visual target tracking on the current image frame, and determining a region of interest contained in the current image frame. The present specification can reduce time complexity while maintaining effective acquisition of regions of interest.

Description

Region-of-interest acquisition method
Technical Field
This specification relates to the field of computer vision, and in particular to a region-of-interest acquisition method.
Background
Region of Interest (ROI) acquisition is an important technique in video surveillance, telemedicine, video call and other scenarios. It rests on an important prior assumption about the human visual system: a person's attention is usually focused on a limited area of an image frame, while sensitivity to the image quality of other areas is relatively low. Taking video coding as an example, if the regions of interest and the regions of non-interest in an image frame can be encoded and transmitted separately, coding performance can be optimized without losing image quality in the regions to which the human eye is sensitive.
Disclosure of Invention
In view of the above technical problems, a first aspect of the embodiments of the present specification provides a region-of-interest acquisition method, wherein:
determining whether a scene change occurs between a current image frame and a previous image frame;
if yes, performing visual target detection on the current image frame, and determining a region of interest contained in the current image frame;
if not, performing visual target tracking on the current image frame, and determining a region of interest contained in the current image frame.
In a second aspect of the embodiments of the present specification, there is provided a region-of-interest acquisition method, with the following technical solution:
determining whether a scene change occurs between a current image frame and a previous image frame;
if yes, performing visual target detection on the current image frame, and determining a region of interest contained in the current image frame;
if not, acquiring the parameters used for performing visual target tracking on the previous image frame of the current image frame, the parameters being those adopted when visual target tracking on the previous image frame yielded a region of interest whose confidence is higher than a threshold;
and taking the acquired parameters as initial parameters, performing visual target tracking on the current image frame to obtain a region of interest contained in the current image frame.
In a third aspect of the embodiments herein, there is provided a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the following method:
determining whether a scene change occurs between a current image frame and a previous image frame;
if yes, calling a visual target detection algorithm to determine the region of interest contained in the current image frame;
if not, calling a visual target tracking algorithm to determine the region of interest contained in the current image frame.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
Because the region of interest is determined from image frames by combining visual target detection with visual target tracking, compared with an implementation that runs visual target detection on every image frame, the processing efficiency of determining the region of interest can be improved as much as possible while the accuracy of the determined region of interest is ensured.
Drawings
Fig. 1 is an architecture diagram of a region of interest acquisition method shown in an embodiment of the present specification;
FIG. 2 is a flow chart illustrating a region of interest acquisition method according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a division of a current image frame into a plurality of image partitions according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating dividing a frame previous to a current image frame into a plurality of image partitions according to an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating another region of interest acquisition method in one embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a region-of-interest acquiring apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of another region of interest acquisition apparatus shown in an embodiment of the present specification;
fig. 8 is a diagram illustrating a hardware architecture of a computing device, according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present specification; rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
The principle of Region of Interest (ROI) acquisition rests on an important prior assumption about the human visual system: human attention is usually focused on a limited area of an image frame, while sensitivity to the image quality of other areas is relatively low. Taking intelligent video coding as an example, if an image frame can be encoded and transmitted region by region, coding performance can be optimized without losing image quality in the areas to which the human eye is sensitive.
Because region-of-interest acquisition can remarkably improve the encoding efficiency of image frames, it is currently an important technical means in many fields.
For example, in practical applications, the region-of-interest acquisition technology may be applied to the fields of security (e.g., road traffic scene, etc.), monitoring (e.g., face recognition scene), inspection (e.g., unmanned aerial vehicle tracking scene), online streaming media video (e.g., online education, online live broadcast), and the like.
Taking a road traffic scene in the security field as an example: in this scene, the vehicle targets contained in the image frames captured by monitoring cameras deployed outdoors are usually the targets of interest, and a camera carrying a related algorithm can acquire the vehicle targets contained in the image frames for subsequent processing.

For another example, in a face recognition scene in the monitoring field, the face targets contained in the image frames captured by deployed monitoring cameras are usually the objects of interest in the scene, and a camera carrying a related algorithm can obtain the face targets contained in the image frames for subsequent processing.
In practical applications, the most common way to acquire a region of interest from an image frame is visual target detection. Visual target detection generally refers to the process of performing image detection on an image frame and separating the targets of interest from it. Besides isolating a target of interest, visual target detection can generally also determine the category and location of such a target. From the early Viola-Jones algorithm, through more than 20 years of development, to the various currently popular deep-learning-based visual target detection algorithms (such as Fast R-CNN, YOLO, SSD, and their many improved versions), detection accuracy has advanced dramatically, but time and computational complexity have risen accordingly, and the dependence on hardware performance has grown higher and higher. To address this high complexity, academia and industry have optimized detection algorithms extensively, so that many excellent algorithms can be applied in real scenarios. However, from the perspective of multimedia production and processing, if the region of interest were acquired by a visual target detection algorithm on every frame, the added time cost would greatly reduce the commercial value of region-of-interest acquisition.
In contrast to visual target detection, visual target tracking provides a solution that can balance the accuracy and complexity of region-of-interest acquisition. Compared with a visual target detection algorithm, a visual target tracking algorithm focuses on finding the target of interest across consecutive image frames; its time complexity is greatly reduced and its processing efficiency is high. However, although the visual target tracking algorithm has an advantage in time complexity over the visual target detection algorithm, its accuracy is affected by changes in the appearance of the tracked target; for example, when the tracked target is heavily deformed or partially occluded, the tracking result drifts and produces errors.
Currently, region-of-interest acquisition is evaluated mainly on performance and complexity: the former generally refers to the accuracy and precision of the acquisition, and the latter to the time and computational cost it requires. Still taking intelligent video coding as an example: on one hand, accurately acquiring the region of interest is an important prerequisite for intelligent video coding, since only accurate acquisition can improve the image quality of the region of interest; on the other hand, how efficiently the region of interest is acquired determines the cost of intelligent coding. Therefore, the accuracy and efficiency of region-of-interest acquisition directly influence its overall value.
As mentioned above, although the visual target tracking algorithm has an advantage in time complexity compared to the visual target detection algorithm, its accuracy is affected by appearance changes of the tracked target; for example, when the tracked target is heavily deformed or partially occluded, the tracking result of the visual target tracking algorithm drifts and produces errors.
In view of this, the present specification provides a technical solution that combines a visual target detection algorithm and a visual target tracking algorithm to determine the region of interest from image frames, by determining whether a scene change occurs between an image frame and its previous image frame: if a scene change occurs, the higher-complexity visual target detection algorithm can be invoked to determine the region of interest contained in the image frame; if not, the lower-complexity visual target tracking algorithm can be invoked instead. In this way, because the region of interest is determined from image frames by combining the two algorithms, compared with an implementation that runs the visual target detection algorithm on every image frame, the processing efficiency of determining the region of interest can be improved as much as possible while the accuracy of the determined region of interest is ensured.
The technical solution of the present specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a region-of-interest acquisition system provided in an embodiment of the present specification. As shown in fig. 1, the system may include a network 10, a server 11, and a number of electronic devices, such as a cell phone 12, a cell phone 13, a cell phone 14, and so on.

The server 11 may be a physical server comprising an independent host, or a virtual server, cloud server, etc. carried by a host cluster. The handsets 12-14 are just one type of electronic device that a user may use; the user may obviously also use electronic devices such as tablet devices, notebook computers, personal digital assistants (PDAs), wearable devices (e.g., smart glasses, smart watches, etc.), cameras, and camcorders, which may capture or store video; one or more embodiments of the present disclosure are not limited in this respect. The network 10 may include various types of wired or wireless networks.
In one embodiment, the server 11 may cooperate with handsets 12-14; the mobile phones 12 to 14 may capture images or videos, upload the captured images or videos to the server 11 through the network 10, and then the server 11 processes the received images or videos based on the region of interest acquisition scheme of the present specification to determine a region of interest included in an image frame corresponding to the images or videos. In another embodiment, the handsets 12-14 may independently implement the region of interest acquisition scheme of the present description; the mobile phones 12 to 14 capture images or videos, and process the captured images or videos based on the region of interest acquisition scheme of the present specification to determine a region of interest included in an image frame corresponding to the images or videos.
Fig. 2 is a flowchart illustrating a region-of-interest acquisition method according to an embodiment of the present disclosure, where the method includes the following steps:

Step 202: determine whether a scene change occurs between the current image frame and its previous image frame.

The current image frame may be one or more frames of a video file, or one or more individual pictures.

A scene change generally means that factors such as the color and content of the current image frame have changed relative to the previous image frame. Whether a scene change occurs may be determined based on factors such as the overlapping area and the similarity between the image frames, singly or in combination, and is not particularly limited in this specification.
In one embodiment, whether a scene change occurs may be determined based on the overlapping area of the regions of interest contained in the current image frame and its previous image frame.

In this case, the overlapping area of the regions of interest contained in the current image frame and the previous image frame may be calculated, and it is then determined whether the overlapping area is smaller than a threshold; if yes, it is determined that a scene change occurs between the current image frame and its previous image frame; if not, it is determined that no scene change has occurred between them.
It should be noted that, in practical applications, when determining whether a scene change occurs between a current image frame and a previous image frame, the previous image frame may be any frame before the current image frame.
For example, in one case, when an image frame previous to the current image frame exists (i.e., the current image frame is not the first frame), the size and position of the region of interest in the current image frame and the size and position of the region of interest in the previous image frame may be obtained first, and the area of the overlapping portion of the two calculated. It is then further judged whether the overlapping area is smaller than a threshold; if the area of the overlapping portion is smaller than the threshold, it can be determined that a scene change occurs between the current image frame and the previous image frame.
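This overlap test can be sketched as follows. This is a minimal illustration assuming rectangular regions of interest given as (x, y, w, h) tuples; the function names and the threshold are illustrative assumptions rather than values from the patent text:

```python
def roi_overlap_area(roi_a, roi_b):
    """Area of the intersection of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = roi_a
    bx, by, bw, bh = roi_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    return inter_w * inter_h

def scene_change_by_overlap(prev_roi, curr_roi, area_threshold):
    # A scene change is assumed when the two regions of interest
    # barely overlap between consecutive frames.
    return roi_overlap_area(prev_roi, curr_roi) < area_threshold
```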
In one embodiment, whether a scene change occurs may also be determined based on the similarity between the current image frame and its previous image frame.

In this case, the similarity between the current image frame and the previous image frame may be calculated, and it is determined whether the similarity is less than a threshold; if yes, it is determined that a scene change occurs between the current image frame and its previous image frame; if not, it is determined that no scene change has occurred between them.

Note that the specific way of calculating the similarity between the current image frame and the previous image frame is not particularly limited in this specification.
For example, in one embodiment, the similarity between the current image frame and its previous image frame may be determined by dividing both frames into image blocks, calculating the similarity of each pair of corresponding blocks, and counting the similar or dissimilar blocks.

In this case, the current image frame and the previous image frame may each be divided into a plurality of image blocks of the same size, and the similarity between each image block in the current frame and the image block of the corresponding region in the previous frame calculated.
The specific manner of calculating the similarity between the image blocks is not particularly limited in this specification.
For example, in one example, the similarity may be calculated with the SAD (Sum of Absolute Differences) algorithm, which computes the absolute values of the pixel-value differences between two image blocks and takes their sum as the similarity measure between the corresponding blocks.
Note that the pixel value may include a luminance value or a color value.
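A minimal sketch of the SAD computation between two corresponding blocks, assuming the blocks are NumPy arrays of luminance values (or a single color channel); note that the SAD grows as blocks become less similar, so a larger value means lower similarity:

```python
import numpy as np

def block_sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks.

    Cast to a wider integer type first so that per-pixel differences
    of 8-bit pixel values do not wrap around.
    """
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())
```

A pair of corresponding blocks would then be judged dissimilar when this value exceeds a threshold, as described below.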
In one example, whether the image frames are similar may be determined by calculating the similarity of the current image frame and its previous image frame in the luminance-value dimension;

in another example, whether the image frames are similar may be determined by calculating the similarity of the current image frame and its previous image frame in the color-value dimension;

in addition to judging whether image frames are similar from color values or luminance values alone, the two may also be considered together.

For example, the similarity of the current image frame and its previous image frame in the luminance dimension may be calculated first; if the frames are judged dissimilar there, their similarity in the color dimension may then be calculated, and only if the similarity in the color dimension is also below a certain threshold is it determined that a scene change of the image frame has occurred.
Judging whether a scene change has occurred in the image frame from the two dimensions of color and luminance can avoid misrecognition in some scenes; for example, a scene in which the image frame undergoes a significant luminance change while the image content is basically unchanged would be mistakenly determined to contain a scene change if judged from the luminance dimension alone.
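The two-dimension check can be sketched as follows; the YUV frame layout, the helper name, and both thresholds are assumptions for illustration only:

```python
import numpy as np

def mean_abs_diff(a, b):
    return float(np.mean(np.abs(a.astype(np.int32) - b.astype(np.int32))))

def scene_change_two_stage(curr_yuv, prev_yuv, luma_thresh=20.0, chroma_thresh=15.0):
    """Scene-change test over both luminance and color.

    curr_yuv / prev_yuv: H x W x 3 YUV frames; thresholds are illustrative.
    """
    if mean_abs_diff(curr_yuv[..., 0], prev_yuv[..., 0]) < luma_thresh:
        return False  # brightness barely changed: no scene change
    # Brightness changed substantially; confirm with the chroma channels
    # so that a lighting change alone is not misread as a scene change.
    return mean_abs_diff(curr_yuv[..., 1:], prev_yuv[..., 1:]) >= chroma_thresh
```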
After the similarity between corresponding image blocks has been calculated, the number of blocks judged similar (or the number judged dissimilar) may further be counted as the similarity of the image frames, and whether the image frames are similar may be determined on that basis.
In one embodiment, a first number of image blocks whose similarity is smaller than a threshold may be counted in the current image frame as the similarity of the current image frame to its previous image frame, and it is determined whether the ratio of the first number to the total number of image blocks is larger than a threshold; if so, the current image frame is determined to be dissimilar to its previous image frame; otherwise, it is determined to be similar to its previous image frame.
FIG. 3 is a diagram illustrating the division of a current image frame into a plurality of image blocks according to an embodiment of the present disclosure; fig. 4 is a schematic diagram illustrating the division of the frame previous to the current image frame into a plurality of image blocks. As shown in fig. 3 and 4, the image block 410 in fig. 3 occupies the same region of the frame as the image block 420 in fig. 4, and the similarity between the two blocks may be calculated by a similarity calculation algorithm; whether they are similar is determined by judging whether that similarity is smaller than a threshold. Evidently, the corresponding blocks in fig. 3 and 4 are not similar, so counting the image blocks whose similarity is below the threshold gives a first number of 12. Assuming the threshold for the ratio of the first number to the total number of image blocks is 0.8, the proportion of dissimilar blocks among all blocks here is 1, which is greater than the threshold, so it can be determined that the image frame is not similar to its previous image frame.
In another embodiment shown, a second number of image blocks with similarity greater than a threshold in an image frame may be counted as the similarity of the image frame to its previous image frame, and it is determined whether a ratio of the second number to the total number of image blocks is less than the threshold; if so, determining that the image frame is not similar to the image frame before the image frame; otherwise, the current image frame is determined to be similar to its previous image frame.
Referring again to fig. 3 and 4, whether corresponding image blocks are similar can equivalently be determined by judging whether their similarity is greater than the threshold. Since the corresponding blocks in fig. 3 and 4 are not similar, counting the image blocks whose similarity exceeds the threshold gives a second number of 0. Assuming the threshold for the ratio of the second number to the total number of image blocks is 0.8, the proportion of similar blocks among all blocks is 0, which is less than the threshold, so it can be determined that the image frame is not similar to its previous image frame.
It should be noted that the number of image blocks into which the frames are divided may be adjusted according to the precision actually required, and is not particularly limited in this specification; the more blocks the frames are divided into, the higher the accuracy of the image similarity judgment.

For example, for a 1280x720 video with a 16:9 aspect ratio, an image frame may be divided into 16x9 image blocks.
It should be emphasized that, in practical applications, the above threshold value can be flexibly set based on practical requirements, and the specification does not limit this. In one example, the threshold for determining whether the corresponding image partitions are similar may be set according to the sizes of the image partitions.
For example, assuming that the threshold set for an image of a unit area is 5 and the size of an image patch is 80x80, the threshold may be set to 80x80x 5.
In addition, in practical applications, the threshold needs to be validated on a large number of image frames in order to determine a value applicable to most of them. If the threshold is set too loosely, more image frames will be judged to contain scene changes, increasing the complexity of the algorithm; if it is set too strictly, image frames with genuine scene changes will not be correctly identified, causing missed detections and reducing the accuracy of the algorithm.
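Putting the block division, per-block SAD, and dissimilar-block ratio together, the whole frame-level test might be sketched as follows; the 9x16 grid follows the 1280x720 example above, the per-pixel threshold of 5 follows the unit-area example, and the 0.8 ratio threshold follows the worked examples, but all are illustrative:

```python
import numpy as np

def scene_change_by_blocks(curr, prev, grid=(9, 16),
                           sad_thresh_per_px=5, ratio_thresh=0.8):
    """True if the fraction of dissimilar blocks exceeds ratio_thresh.

    curr / prev: equally sized H x W luminance arrays.
    """
    h, w = curr.shape
    bh, bw = h // grid[0], w // grid[1]
    sad_thresh = sad_thresh_per_px * bh * bw  # scale threshold by block area
    dissimilar = 0
    for r in range(grid[0]):
        for c in range(grid[1]):
            a = curr[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            b = prev[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            if np.abs(a.astype(np.int32) - b.astype(np.int32)).sum() > sad_thresh:
                dissimilar += 1
    return dissimilar / (grid[0] * grid[1]) > ratio_thresh
```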
Step 204: if yes, call a visual target detection algorithm to determine the region of interest contained in the current image frame.
Upon determining that a scene change occurs between the current image frame and its previous image frame, a visual target detection algorithm may be invoked to determine the region of interest contained in the image frame.

In one embodiment, visual target detection refers to the process of performing image detection on an image frame and separating the target of interest contained in it from the frame. Accordingly, a visual target detection algorithm generally refers to a class of algorithms that detect the region of interest contained in an image frame by performing image detection on it.

Besides isolating a target of interest, visual target detection can generally also determine the category and location of such a target. For example, with a visual target detection algorithm such as Fast R-CNN, YOLO, or SSD, information such as the category and position of the region of interest can be recognized in addition to acquiring the region of interest itself. It should be noted that the specific type of visual target detection algorithm used is not particularly limited in this specification and may be chosen flexibly based on actual requirements.
Step 206: if not, call a visual target tracking algorithm to determine the region of interest contained in the current image frame.
Upon determining that no scene change has occurred between the current image frame and its previous image frame, a visual target tracking algorithm may be invoked to determine the region of interest contained in the image frame.

Here, visual target tracking refers to the process of finding the target of interest in subsequent frames by means of target tracking. Accordingly, a visual target tracking algorithm generally refers to a class of algorithms that perform target tracking calculations on an image frame and its previous image frame to determine the target of interest in the current frame.

Based on such algorithms, besides determining the region of interest contained in the image frame, the position of the target can be predicted by matching candidate regions of the tracked target against a template. For example, in implementation, a template image may be taken from the target area of the previous image frame, its features extracted, and those features matched against the current image frame to obtain the position of the tracked target in the current frame.

It should be noted that the specific type of visual target tracking algorithm used is not particularly limited in this specification and may be chosen flexibly based on actual requirements. For example, the image frame may be subjected to visual target tracking using a Kernelized Correlation Filter (KCF), a Discriminative Scale Space Tracker (DSST), or the like.
In practical applications, after the features of the template image are extracted by the visual target tracking algorithm, the features generally need to be mapped to a high-dimensional space in order to improve their accuracy. Existing visual target tracking algorithms usually use a Gaussian kernel function to compute the inner product of the template image features and then map the features to high-dimensional features based on that inner product. Although the Gaussian kernel improves feature accuracy, it noticeably increases the computational complexity of the algorithm.

Based on this, in one embodiment, the existing visual target tracking algorithm may be optimized: in the process of performing the target tracking calculation on the current image frame and its previous image frame, the kernel function used by the algorithm for computing the inner product may be replaced with a linear kernel function. Since the complexity of the linear kernel is lower than that of the Gaussian kernel, the complexity of the algorithm can be reduced while the accuracy of the features mapped to the high-dimensional space is still ensured.

For example, taking the KCF algorithm mentioned above, KCF by default maps the extracted features to a high-dimensional space using a Gaussian kernel function, which improves feature accuracy but also increases the algorithm's complexity. The Gaussian kernel in the KCF algorithm can therefore be replaced with a lower-complexity linear kernel, reducing the algorithm's complexity while preserving the accuracy of the features mapped to the high-dimensional space.
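To show what the replacement changes, the kernel correlation step of KCF can be sketched for the single-channel case as below; `xf` and `yf` are the 2-D Fourier transforms of two feature patches. The Gaussian version follows the standard published KCF formulation, and this is a simplified sketch rather than the patent's exact implementation:

```python
import numpy as np

def gaussian_correlation(xf, yf, sigma=0.5):
    """Gaussian kernel correlation in the Fourier domain (KCF default)."""
    n = xf.size
    cross = np.real(np.fft.ifft2(xf * np.conj(yf)))  # circular cross-correlation
    xx = np.real(np.vdot(xf, xf)) / n                # ||x||^2 via Parseval
    yy = np.real(np.vdot(yf, yf)) / n                # ||y||^2 via Parseval
    d = np.maximum(xx + yy - 2.0 * cross, 0.0) / n   # per-shift squared distances
    return np.fft.fft2(np.exp(-d / (sigma ** 2)))

def linear_correlation(xf, yf):
    """Linear kernel correlation: only the cross term remains, with no
    exponential, which is the lower-complexity replacement described above."""
    return xf * np.conj(yf) / xf.size
```

The linear variant needs only an elementwise product, dropping the inverse FFT, the exponential, and the extra forward FFT of the Gaussian path, which is where the complexity saving comes from.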
It should be added that, in this specification, the optimization of the visual target tracking algorithm may include, besides the replacement of the Gaussian kernel with a linear kernel described above, the following measures. In one embodiment, when the visual target tracking algorithm is invoked to determine the region of interest, the target's moving direction may be derived from the position changes of the region of interest across the several image frames preceding the current frame, and the region of interest of the current frame may then be searched for along that direction.

For example, when a visual target tracking algorithm is invoked to determine the region of interest, the center point of the visual target in the previous image frame is usually taken as the starting point of the algorithm, and the region of interest is then searched for within a region centered on that point. On this basis, the target's moving direction may be obtained from the position changes of the region of interest over several preceding frames, and the visual target tracking algorithm may search for the region of interest along that moving direction.
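A minimal sketch of this direction-guided starting point, assuming a simple constant-velocity extrapolation from the last two ROI centers (a real tracker might use a longer history or a more robust motion model):

```python
def predicted_search_center(roi_centers):
    """roi_centers: (x, y) ROI centers of recent frames, oldest first;
    at least two entries are assumed."""
    (x0, y0), (x1, y1) = roi_centers[-2], roi_centers[-1]
    dx, dy = x1 - x0, y1 - y0   # most recent per-frame displacement
    return (x1 + dx, y1 + dy)   # extrapolate one frame ahead
```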
In one illustrated embodiment, the visual target tracking algorithm may also take the size of the visual target in the previous image frame as the initial size of the visual target in the current image frame.

For example, visual target tracking may be implemented with the correlation-filtering-based KCF algorithm, which matches a number of different sizes against the image frame, computes a confidence for each size, and selects the size with the highest confidence as the size of the visual target in the frame. Since video is continuous, the target size typically grows or shrinks gradually over the sequence of image frames, so the size of the visual target in successive frames is very close. Therefore, the size of the visual target in each image frame can be recorded and tried preferentially as the matching size for the next frame; its confidence is computed, and if that confidence exceeds a certain threshold the size is considered reasonable. Otherwise, a number of other sizes are matched against the frame, the confidence of each is computed, and the size with the highest confidence is selected as the size of the visual target in the next frame.
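The size-reuse strategy can be sketched as follows; `track_at_scale` stands in for one tracker evaluation at a given target size and is assumed to return a (roi, confidence) pair:

```python
def choose_scale(track_at_scale, prev_scale, candidate_scales, conf_thresh):
    """Try the previous frame's target size first; fall back to a full
    scale search only when its confidence is too low."""
    roi, conf = track_at_scale(prev_scale)
    if conf > conf_thresh:
        return roi, prev_scale   # previous size is still reasonable
    # Otherwise evaluate all candidate sizes and keep the most confident.
    results = {s: track_at_scale(s) for s in candidate_scales}
    best = max(results, key=lambda s: results[s][1])
    return results[best][0], best
```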
In one illustrated embodiment, the feature extraction process may also be optimized when the visual target tracking algorithm is invoked to determine the region of interest.

For example, the existing KCF algorithm uses the fHoG (Felzenszwalb Histogram of Oriented Gradients) feature. The image frame is first divided into sub-blocks of equal size, each called a cell; the gradient direction of each pixel is computed from the differences between adjacent pixels, and a histogram of gradient directions is then accumulated and mapped into a number of bins over 0-360 degrees to obtain the fHoG value of the cell. The cell size and the number of bins determine the accuracy of the feature: the smaller the cell and the larger the number of bins, the more accurate the feature, but the higher the complexity. Therefore, the cell size and bin count used for each image frame can be recorded, and when the visual target tracking algorithm is invoked for the next frame, a larger cell size and a smaller bin count are tried first. If the target confidence predicted with the larger cell size and smaller bin count is above the threshold, those values are considered reasonable and are used for the current image frame; otherwise, the cell size is correspondingly reduced and the bin count increased, and the chosen cell size and bin count are recorded for the next frame.
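A sketch of this coarse-first parameter adaptation; `track_with(cell_size, n_bins)` stands in for one tracking pass that returns the predicted target confidence, and the doubling/halving step sizes and lower bounds are illustrative assumptions rather than values from the text:

```python
def adapt_fhog_params(track_with, prev_cell, prev_bins, conf_thresh):
    """Prefer coarser (cheaper) fHoG parameters while confidence allows."""
    coarse_cell, coarse_bins = prev_cell * 2, max(prev_bins // 2, 4)
    if track_with(coarse_cell, coarse_bins) > conf_thresh:
        # Cheaper features are still reliable: keep them for this frame.
        return coarse_cell, coarse_bins
    # Otherwise fall back to finer, more accurate (and costlier) features.
    return max(prev_cell // 2, 1), prev_bins * 2
```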
In one example, it may also be determined whether the current time has reached a time at which periodic detection should be performed for the image frame; if yes, the visual target detection algorithm is invoked to determine the region of interest contained in the image frame; if not, the visual target tracking algorithm is invoked to determine the region of interest contained in the image frame.

For example, a period may be set on the frame count: starting from a certain frame of the video, the visual target detection algorithm is invoked every fixed number of frames to detect the region of interest, and the visual target tracking algorithm is used to obtain the region of interest the rest of the time. A period may likewise be set on the running time: starting from a certain frame of the video, the detection algorithm is invoked at fixed time intervals, and the tracking algorithm is used otherwise. There are various ways to implement this condition, and this specification does not limit it.
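Combining the scene-change trigger with this periodic trigger, the overall dispatch between detection and tracking can be sketched as follows; `detect`, `track`, and `scene_change` stand in for the algorithms discussed above, and period=30 is an illustrative value:

```python
def acquire_rois(frames, detect, track, scene_change, period=30):
    """Detect on scene changes and every `period` frames; track otherwise."""
    rois, prev = [], None
    for i, frame in enumerate(frames):
        if prev is None or i % period == 0 or scene_change(frame, prev):
            roi = detect(frame)            # high-cost, high-accuracy path
        else:
            roi = track(frame, rois[-1])   # low-cost path between cuts
        rois.append(roi)
        prev = frame
    return rois
```

The periodic detection acts as a safety net: even if the scene-change test misses a gradual content change, tracking drift is bounded by the next scheduled detection.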
Fig. 5 is a flowchart illustrating another region-of-interest acquisition method according to an embodiment of the present disclosure, where the method includes the following steps:

Step 502: determine whether a scene change occurs between the current image frame and its previous image frame.

Specifically, when determining whether a scene change occurs between the current image frame and its previous image frame, the determination may be made based on factors such as the overlapping area and the similarity between the image frames, singly or in combination, which is not particularly limited in this specification.

Step 504: if yes, perform visual target detection on the current image frame and determine the region of interest contained in the current image frame.

Upon determining that a scene change occurs between the current image frame and its previous image frame, a visual target detection algorithm may be invoked to determine the region of interest contained in the frame.

Step 506: if not, acquire the parameters used for performing visual target tracking on the previous image frame of the current image frame; take the acquired parameters as initial parameters and perform visual target tracking on the current image frame to obtain the region of interest contained in the current image frame.

When it is determined that no scene change has occurred between the current image frame and its previous image frame, the parameters used for performing visual target tracking on the previous image frame may first be obtained, and a visual target tracking algorithm is then invoked to determine the region of interest contained in the current image frame.

The parameters are those adopted when visual target tracking on the previous image frame yielded a region of interest whose confidence is higher than a threshold.
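A minimal sketch of this parameter reuse across frames; `tracker` is assumed to return a (roi, confidence, params) triple and to fall back to default parameters when no cached ones exist:

```python
def track_with_cached_params(frame, tracker, cache, conf_thresh):
    """Start tracking from the parameters that last produced a
    high-confidence region of interest."""
    roi, conf, params = tracker(frame, init_params=cache.get("last"))
    if conf > conf_thresh:
        cache["last"] = params  # reuse these as initial parameters next frame
    return roi
```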
In this specification, the specific implementation manner of steps 502 to 506 is similar to that of steps 202 to 206, and is not described herein again.
In an exemplary embodiment of the present specification, a region of interest acquisition apparatus is also provided. Referring to fig. 6, the apparatus may include:
a first detection trigger unit 610, configured to determine whether a scene change occurs between the current image frame and its previous image frame;
a first detection unit 620, configured to, when it is determined that a scene change occurs between the current image frame and its previous image frame, invoke a visual target detection algorithm to determine the region of interest contained in the current image frame;
a first tracking unit 630, configured to, when it is determined that no scene change occurs between the current image frame and its previous image frame, invoke a visual target tracking algorithm to determine the region of interest contained in the current image frame.
Optionally, the first detection triggering unit 610 is further configured to:
calculating the overlapping area of the regions of interest contained in the current image frame and the previous image frame;
determining whether the overlapping area is less than a threshold;
if yes, determining that a scene change occurs between the current image frame and the previous image frame;
if not, determining that no scene change occurs between the current image frame and the previous image frame.
Optionally, the first detection triggering unit 610 is further configured to:
determining whether a current time reaches a time for periodic detection of the current image frame;
if yes, calling the visual target detection algorithm to determine the region of interest contained in the current image frame;
if not, calling the visual target tracking algorithm to determine the region of interest contained in the current image frame.
Optionally, the first detection triggering unit 610 is further configured to:
calculating the similarity between the current image frame and the previous image frame, and determining, based on the similarity, whether the current image frame is similar to its previous image frame; if not, determining that a scene change occurs between the current image frame and the previous image frame;
if so, determining that no scene change occurs between the current image frame and the previous image frame.
Optionally, the first detection triggering unit 610 is further configured to:
calculating the similarity of the image blocks in the current image frame and the image blocks in the corresponding area in the previous image frame;
counting a first number of image blocks in the current image frame, of which the similarity is smaller than a threshold value, as the similarity of the current image frame and the previous image frame; or counting a second number of image blocks with the similarity greater than a threshold in the current image frame as the similarity between the current image frame and the previous image frame;
the determining whether the current image frame is similar to its previous image frame based on the similarity comprises:
determining whether a ratio of the first number to the total number of the image blocks is greater than a threshold; if yes, determining that the current image frame is not similar to its previous image frame; otherwise, determining that the current image frame is similar to its previous image frame; or,
determining whether a ratio of the second number to a total number of the image patches is less than a threshold; if so, determining that the current image frame is not similar to the previous image frame; otherwise, the current image frame is determined to be similar to its previous image frame.
Optionally, the first detection triggering unit 610 is further configured to:
calculating, for each pixel in an image block of the current image frame, the absolute value of the difference between its pixel value and the pixel value at the corresponding position in the image block of the corresponding region in the previous image frame;
and summing the absolute values over the pixels, taking the summation result as the similarity between the corresponding image blocks, wherein the pixel values comprise color values and/or brightness values.
Optionally, in the process of performing target tracking calculation on the current image frame and the image frame before the current image frame by using the visual target tracking algorithm, the kernel function used for calculating the inner product is a linear kernel function.
Optionally, the first tracking unit 630 is further configured to:
determining a target moving direction corresponding to the region of interest according to the position changes of the region of interest over several image frames preceding the current image frame;
and invoking the visual target tracking algorithm to perform the target tracking calculation on the region of interest based on the target moving direction, so as to determine the region of interest contained in the current image frame.

In an exemplary embodiment of the present specification, another region-of-interest acquisition apparatus is also provided. Referring to fig. 7, the apparatus may include:
a second detection trigger unit 710, configured to determine whether a scene change occurs between the current image frame and its previous image frame;
a second detection unit 720, configured to, when it is determined that a scene change occurs between the current image frame and its previous image frame, invoke a visual target detection algorithm to determine the region of interest contained in the current image frame;
a second tracking unit 730, configured to, when it is determined that no scene change occurs between the current image frame and its previous image frame, acquire the parameters used for performing visual target tracking on the previous image frame of the current image frame, the parameters being those adopted when visual target tracking on the previous image frame yielded a region of interest whose confidence is higher than a threshold;
and to take the acquired parameters as initial parameters and perform visual target tracking on the current image frame to obtain the region of interest contained in the current image frame.
In an exemplary embodiment of the present specification, there is also provided a computing device capable of implementing the above method.
FIG. 8 is a schematic block diagram of an apparatus provided in an exemplary embodiment. Referring to fig. 8, at the hardware level, the apparatus includes a processor 802, an internal bus 804, a network interface 806, a memory 808, and a non-volatile memory 810, and may also include hardware required for other functions. One or more embodiments of the present description can be implemented in software, for example by the processor 802 reading the corresponding computer program from the non-volatile memory 810 into the memory 808 and then running it. Of course, besides software implementations, the one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (13)

1. A region-of-interest acquisition method, comprising:
determining whether a scene change occurs between a current image frame and a previous image frame;
if yes, performing visual target detection on the current image frame, and determining a region of interest contained in the current image frame;
if not, performing visual target tracking on the current image frame, and determining a region of interest contained in the current image frame.
2. The method of claim 1, wherein determining whether a scene change occurs between the current image frame and its previous image frame comprises:
calculating the overlapping area of the regions of interest contained in the current image frame and the previous image frame;
determining whether the overlapping area is less than a threshold;
if yes, determining that a scene change occurs between the current image frame and the previous image frame;
if not, determining that no scene change occurs between the current image frame and the previous image frame.
3. The method of claim 1, further comprising:
determining whether a current time reaches a time for periodic detection of the current image frame;
if yes, performing visual target detection on the current image frame, and determining a region of interest contained in the current image frame;
if not, performing visual target tracking on the current image frame, and determining a region of interest contained in the current image frame.
4. The method of claim 1, wherein determining whether a scene change occurs between the current image frame and its previous image frame comprises:
calculating the similarity between the current image frame and the previous image frame, and determining, based on the similarity, whether the current image frame is similar to its previous image frame;
if not, determining that a scene change occurs between the current image frame and the previous image frame;
if so, determining that no scene change occurs between the current image frame and the previous image frame.
5. The method of claim 1, wherein performing visual target tracking on the current image frame and determining the region of interest contained in the current image frame comprises:
acquiring the parameters used for performing visual target tracking on the previous image frame of the current image frame, the parameters being those adopted when visual target tracking on the previous image frame yielded a region of interest whose confidence is higher than a threshold;
and taking the acquired parameters as initial parameters, performing visual target tracking on the current image frame to obtain the region of interest contained in the current image frame.
6. The method of claim 5, comprising:
determining whether the confidence of the region of interest contained in the current image frame is higher than a threshold; if the confidence of the region of interest is higher than the threshold, recording the parameters adopted as the initial parameters for performing visual target tracking on the next image frame of the current image frame.
7. The method of claim 5, wherein the parameters used for visual target tracking on the previous image frame comprise any one or a combination of:
a first size of the candidate region corresponding to the tracked visual target;
a second size of the feature extraction units into which the candidate region is divided;
the number of mapping sections used when feature mapping is performed on the image features extracted by the feature extraction units.
8. The method of claim 7, wherein performing visual target tracking on the current image frame with the acquired parameters as initial parameters comprises:
increasing the values of the parameters by a preset amplitude, and performing visual target tracking on the current image frame with the increased parameters as the initial parameters.
9. The method of claim 8, further comprising:
after visual target tracking is performed on the current image frame with the increased parameters as initial parameters to obtain the region of interest contained in the current image frame, if the confidence corresponding to the region of interest is not higher than the threshold, reducing the values of the parameters by the preset amplitude and performing visual target tracking on the current image frame again with the reduced parameters as the initial parameters.
10. The method of claim 1, wherein the kernel function used for calculating inner products during visual target tracking on the current image frame is a linear kernel function.
11. The method of claim 8, wherein performing visual target tracking on the current image frame comprises:
determining the target moving direction corresponding to the region of interest according to the position changes of the region of interest across a plurality of image frames preceding the current image frame;
performing visual target tracking on the current image frame along the target moving direction, based on the acquired parameters.
12. A region-of-interest acquisition method, comprising:
determining whether a scene change has occurred between a current image frame and the previous image frame;
if yes, performing visual target detection on the current image frame and determining the region of interest contained in the current image frame;
if not, acquiring the parameters used for visual target tracking on the previous image frame of the current image frame, the parameters being those with which visual target tracking on the previous image frame yielded a region of interest whose confidence is higher than a threshold;
performing visual target tracking on the current image frame with the acquired parameters as initial parameters, to obtain the region of interest contained in the current image frame.
13. A storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 12.
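The claims above describe the method only in prose. The following Python sketches are illustrative readings of individual claims, not implementations taken from the patent; every function, class, and parameter name below is a placeholder introduced for illustration. First, a minimal rendering of the detect-or-track branch of claims 1 and 12, assuming the detection, tracking, and scene-change routines are supplied as callables:

```python
# Hypothetical dispatch for claims 1 and 12: run full detection on a scene
# change, otherwise run the cheaper tracker seeded with cached parameters.
def acquire_roi(frame, prev_frame, detect, track, scene_changed, cached_params):
    if prev_frame is None or scene_changed(frame, prev_frame):
        return detect(frame)            # full visual target detection
    return track(frame, cached_params)  # visual target tracking
```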
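A sketch of the claim-2 test, assuming ROIs are (x, y, w, h) rectangles; the patent compares the overlap area against "a threshold", so expressing the threshold as a fraction of the previous ROI's area is an assumption made here:

```python
def roi_overlap_area(a, b):
    """Intersection area of two (x, y, w, h) rectangles."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return ix * iy

def scene_changed_by_overlap(cur_roi, prev_roi, fraction=0.3):
    # Scene change when the ROIs' overlap falls below an (assumed)
    # fraction of the previous ROI's area.
    return roi_overlap_area(cur_roi, prev_roi) < fraction * (prev_roi[2] * prev_roi[3])
```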
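A sketch of the claim-3 periodic refresh: even without a scene change, detection is re-run once a period elapses, which bounds how long tracking drift can accumulate. The two-second period and the monotonic clock are illustrative choices, not values from the patent:

```python
import time

class PeriodicDetector:
    """Decides whether the periodic-detection time has been reached."""

    def __init__(self, period_s=2.0):
        self.period_s = period_s
        self.last_detect = float("-inf")  # forces detection on the first frame

    def due(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_detect >= self.period_s:
            self.last_detect = now
            return True   # perform full visual target detection
        return False      # keep performing visual target tracking
```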
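A sketch of the claim-4 similarity test. The patent does not name a similarity metric, so grayscale-histogram correlation and the 0.9 cut-off are assumptions; similar frames mean no scene change:

```python
import numpy as np

def frames_similar(cur, prev, cutoff=0.9):
    """cur, prev: grayscale frames as uint8 arrays. Returns True when the
    frames are similar, i.e. no scene change between them."""
    h1, _ = np.histogram(cur, bins=64, range=(0, 255), density=True)
    h2, _ = np.histogram(prev, bins=64, range=(0, 255), density=True)
    # Pearson correlation of the two normalized intensity histograms
    return np.corrcoef(h1, h2)[0, 1] >= cutoff
```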
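A sketch of the claim-5/6/7 parameter hand-off: the three parameter kinds of claim 7 are bundled together, and a set of parameters is recorded only when it produced a high-confidence ROI, so the next frame's tracker starts from a known-good configuration. The field names and the 0.5 confidence threshold are assumptions:

```python
from dataclasses import dataclass

@dataclass
class TrackParams:
    region_size: float  # first size: candidate region for the tracked target
    cell_size: int      # second size: feature extraction unit
    num_bins: int       # number of mapping sections used in feature mapping

class ParamCache:
    def __init__(self, defaults: TrackParams, conf_threshold: float = 0.5):
        self.params = defaults
        self.conf_threshold = conf_threshold

    def update(self, used: TrackParams, confidence: float):
        # Record the parameters only when they yielded a confident ROI;
        # otherwise the previous known-good parameters are kept.
        if confidence > self.conf_threshold:
            self.params = used
```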
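A sketch of the claim-8/9 retry: the cached parameter values are first increased by a preset amplitude; if the resulting ROI's confidence is still not above the threshold, the values are reduced and the frame is tracked again. The multiplicative step of 1.1 and the threshold are illustrative:

```python
def track_with_adaptive_params(track, frame, params, step=1.1, conf_thr=0.5):
    """track(frame, params) -> (roi, confidence); params is a list of numbers."""
    # First attempt: parameter values increased by the preset amplitude.
    grown = [p * step for p in params]
    roi, conf = track(frame, grown)
    if conf > conf_thr:
        return roi, conf, grown
    # Low confidence: reduce the values and track the current frame again.
    shrunk = [p / step for p in params]
    roi, conf = track(frame, shrunk)
    return roi, conf, shrunk
```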
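A sketch of the claim-10 choice. In correlation-filter style trackers, the kernel correlation under a linear kernel reduces to a plain cross-correlation, which is cheap to evaluate in the Fourier domain; this mirrors the linear case of KCF-like trackers, though the patent names no specific tracker:

```python
import numpy as np

def linear_kernel_correlation(x, z):
    """Inner products <x, z> at all cyclic shifts of two 2-D patches,
    computed via the FFT (the linear-kernel case)."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))))
```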
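Finally, a sketch of the claim-11 idea: the target moving direction is estimated from the ROI positions of preceding frames and used to place the tracker's search window. Averaging the last few center displacements is an assumed estimator, not one specified by the patent:

```python
import numpy as np

def predict_search_center(roi_history):
    """roi_history: list of (x, y, w, h) from preceding frames, oldest
    first; assumes at least two entries."""
    centers = np.array([(x + w / 2, y + h / 2) for x, y, w, h in roi_history])
    velocity = np.diff(centers, axis=0).mean(axis=0)  # mean per-frame motion
    return centers[-1] + velocity  # expected ROI center in the current frame
```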
CN202111443193.4A 2021-11-30 2021-11-30 Region-of-interest acquisition method Pending CN114219938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111443193.4A CN114219938A (en) 2021-11-30 2021-11-30 Region-of-interest acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111443193.4A CN114219938A (en) 2021-11-30 2021-11-30 Region-of-interest acquisition method

Publications (1)

Publication Number Publication Date
CN114219938A (en) 2022-03-22

Family

ID=80699110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111443193.4A Pending CN114219938A (en) 2021-11-30 2021-11-30 Region-of-interest acquisition method

Country Status (1)

Country Link
CN (1) CN114219938A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638963A (en) * 2022-05-18 2022-06-17 青岛美迪康数字工程有限公司 Method and device for identifying and tracking suspicious tissues in endoscopy

Similar Documents

Publication Publication Date Title
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
CN107633526B (en) Image tracking point acquisition method and device and storage medium
US9569695B2 (en) Adaptive search window control for visual search
CN110853033B (en) Video detection method and device based on inter-frame similarity
CN111046752B (en) Indoor positioning method, computer equipment and storage medium
US9105101B2 (en) Image tracking device and image tracking method thereof
CN112989962B (en) Track generation method, track generation device, electronic equipment and storage medium
CN112329702B (en) Method and device for rapid face density prediction and face detection, electronic equipment and storage medium
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN112905824A (en) Target vehicle tracking method and device, computer equipment and storage medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
JP2022540101A (en) POSITIONING METHOD AND APPARATUS, ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM
WO2021169642A1 (en) Video-based eyeball turning determination method and system
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
CN114219938A (en) Region-of-interest acquisition method
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN113205079B (en) Face detection method and device, electronic equipment and storage medium
US11087121B2 (en) High accuracy and volume facial recognition on mobile platforms
CN112016609A (en) Image clustering method, device and equipment and computer storage medium
CN111476132A (en) Video scene recognition method and device, electronic equipment and storage medium
CN108399411B (en) A kind of multi-cam recognition methods and device
CN112069331A (en) Data processing method, data retrieval method, data processing device, data retrieval device, data processing equipment and storage medium
US9740947B2 (en) Hardware architecture for linear-time extraction of maximally stable extremal regions (MSERs)
Constantinou et al. Spatial keyframe extraction of mobile videos for efficient object detection at the edge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination