WO2020259264A1 - Subject tracking method, electronic apparatus, and computer-readable storage medium - Google Patents

Subject tracking method, electronic apparatus, and computer-readable storage medium

Info

Publication number
WO2020259264A1
Authority
WO
WIPO (PCT)
Prior art keywords
subject
image
area
reference image
frame
Prior art date
Application number
PCT/CN2020/094848
Other languages
French (fr)
Chinese (zh)
Inventor
康健
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2020259264A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Definitions

  • a subject tracking method includes:
  • FIG. 10 is a flowchart of a process of obtaining the subject area where the subject is located according to the subject area confidence map in one or more embodiments;
  • the subject area contains the feature information corresponding to the subject and the location information of the subject in the reference image.
  • the feature information includes the subject's color feature, texture feature, shape feature, and spatial relationship feature.
  • the position information can be represented by the coordinate position of the subject in the reference image.
  • sampling through the circulant matrix can increase the number of training samples and improve the accuracy of the classifier, thereby improving the accuracy of subject tracking.
  • the KCF tracking algorithm uses Fourier transform when sampling the circulant matrix, which can avoid matrix inversion operations and increase the speed of subject tracking.
  • when the electronic device uses the first frame of image in the video stream as the reference image and detects the subject area where the subject contained in the reference image is located, the electronic device can sequentially acquire and track each frame of image after the first frame until the number of tracked image frames is greater than or equal to the frame number threshold.
  • the electronic device can use the next acquired frame, that is, the sixth frame of image, as the reference image.
  • the electronic device can obtain the areas of the subject in the multiple frames of images before the previous frame, analyze the moving speed of the subject according to those areas, increase the preset size when the moving speed is greater than or equal to the preset speed, and reduce the preset size when the moving speed is less than the preset speed.
  • the moving speed of the subject can be calculated according to the position of the subject area in the multi-frame image and the frame rate of the video stream.
  • the range of increase and decrease of the preset size can be set according to actual application requirements, and is not limited here.
  • the greater the moving speed, the greater the increase of the preset size; the smaller the moving speed, the smaller the reduction of the preset size.
  • the preset size may be an optimal adjustment size determined when the moving speed of the subject equals the preset speed.
  • Operation 602: Acquire the subject area and category corresponding to each subject in the reference image.
  • the tracking order of each subject is the order of the subjects sorted from high to low according to their score values.
  • the object of interest is often imaged at the center of the image, or the distance between the camera and the object of interest is reduced so that the object of interest occupies a larger area in the image.
  • the electronic device determines the tracking order of each subject according to at least one of the priority level of the subject's corresponding category, the size of the subject area, and the position of the subject area, and tracking the images according to this order can improve the effect of subject tracking and satisfy users' individual needs.
  • the electronic device can input the reference image and the center weight map into the subject detection model, and perform the detection to obtain the subject area confidence map.
  • the subject area confidence map contains the confidence values of each pixel for different subject categories. For example, the confidence that a certain pixel belongs to a person is 0.8, the confidence of a flower is 0.1, and the confidence of a dog is 0.1.
  • the subject in the reference image is determined according to the subject region confidence map, and the subject region where the subject is located is obtained.
  • the electronic device may perform difference calculation or logical sum calculation between the highlight area in the reference image and the subject mask map to obtain the subject area corresponding to the subject whose highlight is eliminated in the reference image.
  • the electronic device performs difference processing on the highlight area in the reference image and the subject mask map, that is, the corresponding pixel values in the reference image and the subject mask map are subtracted to obtain the subject area where the subject in the reference image is located.
  • the area where the target object is located is taken as the subject area where the subject is located.
  • the second acquisition module 1106 is configured to sequentially acquire each frame of image after the reference image in the video stream;
  • the ISP processor 1240 processes the original image data pixel by pixel in multiple formats.
  • each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 1240 may perform one or more image processing operations on the original image data and collect statistical information about the image data. Among them, the image processing operations can be performed with the same or different bit depth accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

A subject tracking method comprises: obtaining, from a video stream, an image frame to serve as a reference image; performing subject detection on the reference image so as to obtain from the reference image a subject region in which a subject is present; sequentially obtaining each image frame following the reference image in the video stream; tracking, on the basis of the subject region, each image frame following the reference image by means of a tracking algorithm, so as to obtain a region of the subject in each image frame; and when the number of the tracked image frames is greater than or equal to a frame number threshold, using the next obtained image frame as a reference image, and returning to the operation of performing subject detection on the reference image so as to obtain from the reference image a subject region in which the subject is present.

Description

Subject tracking method, electronic device, and computer-readable storage medium
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 2019105724125, entitled "Subject tracking method, apparatus, electronic device and computer-readable storage medium", filed with the Chinese Patent Office on June 28, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of imaging technology, and in particular to a subject tracking method, an electronic device, and a computer-readable storage medium.
Background
With the development of imaging technology, subject tracking has become more and more widely used. At present, subject tracking usually relies on a user manually selecting the subject in an image, and subject tracking is then performed on subsequent images based on that subject. However, because the subject, the size of the subject, and the position of the subject in a video stream may all change while the video stream is being captured, traditional subject tracking methods often cannot track the subject accurately, so the accuracy of subject tracking is low.
Summary
According to various embodiments of the present application, a subject tracking method, an electronic device, and a computer-readable storage medium are provided.
A subject tracking method includes:
acquiring a frame of image from a video stream as a reference image;
performing subject detection on the reference image to obtain a subject area where a subject is located in the reference image;
sequentially acquiring each frame of image after the reference image in the video stream;
tracking, based on the subject area, each frame of image after the reference image through a tracking algorithm to obtain an area of the subject in each frame of image; and
when the number of tracked image frames is greater than or equal to a frame number threshold, using the next acquired frame of image as the reference image and returning to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image.
An electronic device includes a memory and a processor. The memory stores a computer program that, when executed by the processor, causes the processor to perform the following operations:
acquiring a frame of image from a video stream as a reference image;
performing subject detection on the reference image to obtain a subject area where a subject is located in the reference image;
sequentially acquiring each frame of image after the reference image in the video stream;
tracking, based on the subject area, each frame of image after the reference image through a tracking algorithm to obtain an area of the subject in each frame of image; and
when the number of tracked image frames is greater than or equal to a frame number threshold, using the next acquired frame of image as the reference image and returning to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image.
A computer-readable storage medium has a computer program stored thereon. When the computer program is executed by a processor, the following operations are implemented:
acquiring a frame of image from a video stream as a reference image;
performing subject detection on the reference image to obtain a subject area where a subject is located in the reference image;
sequentially acquiring each frame of image after the reference image in the video stream;
tracking, based on the subject area, each frame of image after the reference image through a tracking algorithm to obtain an area of the subject in each frame of image; and
when the number of tracked image frames is greater than or equal to a frame number threshold, using the next acquired frame of image as the reference image and returning to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image.
According to the above subject tracking method, electronic device, and computer-readable storage medium, the subject area where the subject is located is obtained by performing subject detection on a reference image in the video stream, and each frame of image after the reference image is acquired in turn for subject tracking to obtain the area of the subject in each frame of image. When the number of tracked image frames is greater than or equal to the frame number threshold, the next acquired frame of image is used as the reference image, and the operation of performing subject detection on the reference image is executed again. In this way, the subject area of the image can be updated, the problem of subject tracking failure caused by changes of the subject in the video stream can be avoided, and the accuracy of subject tracking can be improved.
The details of one or more embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present invention will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of the internal structure of an electronic device in one or more embodiments;
FIG. 2 is a flowchart of a subject tracking method in one or more embodiments;
FIG. 3 is a flowchart of subject tracking performed on images in one or more embodiments;
FIG. 4(a) is a schematic diagram of a previous frame of image in one or more embodiments;
FIG. 4(b) is a schematic diagram of the current frame of image corresponding to FIG. 4(a) in one or more embodiments;
FIG. 5 is a flowchart of setting a frame number threshold in one or more embodiments;
FIG. 6 is a flowchart of subject tracking performed on images in one or more embodiments;
FIG. 7 is a flowchart of subject detection performed on an image in one or more embodiments;
FIG. 8 is a flowchart of processing a subject area confidence map in one or more embodiments;
FIG. 9 is a schematic diagram of an image detection effect in one or more embodiments;
FIG. 10 is a flowchart of a process of obtaining the subject area where the subject is located according to the subject area confidence map in one or more embodiments;
FIG. 11 is a structural block diagram of a subject tracking apparatus in one or more embodiments;
FIG. 12 is a schematic diagram of an image processing circuit in one or more embodiments.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It can be understood that the terms "first", "second", and the like used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of the present application, a first acquisition module may be referred to as a second acquisition module, and similarly, the second acquisition module may be referred to as the first acquisition module. The first acquisition module and the second acquisition module are both acquisition modules, but they are not the same acquisition module.
FIG. 1 is a schematic diagram of the internal structure of an electronic device in an embodiment. As shown in FIG. 1, the electronic device includes a processor and a memory connected via a system bus. The processor is used to provide computing and control capabilities to support the operation of the entire electronic device. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the subject tracking method provided in the following embodiments. The internal memory provides a cached operating environment for the operating system and the computer program in the non-volatile storage medium. The electronic device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like. In some embodiments, the electronic device may also be a server, where the server may be an independent server or a server cluster composed of multiple servers.
FIG. 2 is a flowchart of a subject tracking method in an embodiment. The subject tracking method in this embodiment is described by taking its running on the electronic device in FIG. 1 as an example. As shown in FIG. 2, the subject tracking method includes operations 202 to 210.
Operation 202: acquire a frame of image from the video stream as a reference image.
A video stream is a video composed of multiple frames of images. The video stream may be a video recorded by the electronic device through a camera, a video stored locally on the electronic device, or a video downloaded from a network. The video stream may also be generated by the electronic device capturing pictures of the current scene in real time through a camera, that is, the electronic device collects multiple frames of preview images through the camera in real time, the preview images may be displayed on the display screen of the electronic device, and the video stream is composed of the multiple frames of preview images.
The reference image is a frame of image in the video stream. The electronic device may acquire a frame of image from the video stream as the reference image. Specifically, the electronic device may acquire the first frame of image in the video stream as the reference image. Optionally, the electronic device may acquire a frame of image selected by the user from the video stream as the reference image, or it may use the first frame of image acquired after receiving a subject tracking instruction as the reference image. Of course, the reference image may be any frame of image in the video stream, which is not limited here.
Operation 204: perform subject detection on the reference image to obtain the subject area where the subject is located in the reference image.
The electronic device performs subject detection on the reference image to obtain the subject area where the subject is located in the reference image. Specifically, the electronic device may train a subject detection model through a deep learning neural network algorithm to perform subject detection on the reference image. Images labeled with subject areas and categories are input into the neural network, and the parameters of the neural network are adjusted according to the detected predicted areas and predicted categories, so as to obtain a subject detection model that can accurately identify subject areas and categories. The electronic device may input the reference image into the subject detection model, perform subject detection on the reference image through the subject detection model, and segment the reference image according to the identified subject to obtain the subject area where the subject is located. The subject area where the subject is located is the smallest area in the reference image that contains the pixels corresponding to the subject. Specifically, when the subject detection model outputs the subject area as a rectangular frame, the pixels contained in the subject area have a higher degree of association with the pixels corresponding to the subject than the pixels contained in any other rectangular area of the reference image; when the subject detection model outputs the subject area in the form of a subject contour, the edge pixels of the subject area are the edge pixels of the subject's contour, and the pixels contained in the subject area have the highest degree of association with the pixels corresponding to the subject. Optionally, the subject recognition network may be implemented through deep learning algorithms such as a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), or an RNN (Recurrent Neural Network). Optionally, in some embodiments, the electronic device may also obtain a subject area selected by the user.
Operation 206: sequentially acquire each frame of image after the reference image in the video stream.
After the electronic device obtains the reference image and the area where the subject is located in the reference image, it may sequentially acquire each frame of image after the reference image in the video stream to perform subject tracking on the images in the video stream. It is understandable that subject tracking is usually carried out frame by frame, that is, subject tracking is performed on one frame of image and, once completed, is performed on the next frame of image.
Operation 208: based on the subject area, track each frame of image after the reference image through a tracking algorithm to obtain the area of the subject in each frame of image.
The subject area contains the feature information corresponding to the subject and the position information of the subject in the reference image. The feature information includes the subject's color features, texture features, shape features, spatial relationship features, and the like. The position information may be represented by the coordinate position of the subject in the reference image.
Based on the subject area, the electronic device may track each frame of image after the reference image through a tracking algorithm to obtain the area of the subject in each frame of image. Specifically, the electronic device may obtain the feature information of the subject contained in the subject area of the reference image, and then use the tracking algorithm to find, in each frame of image after the reference image, an area that matches the feature information of the subject, that is, the area where the subject is located in that image; the electronic device may also search for an area matching the feature information of the subject around the corresponding position in each frame of image after the reference image according to the position information of the subject in the reference image. The tracking algorithm adopted by the electronic device may be, but is not limited to, the frame difference method, the optical flow method, feature point matching, KCF (High-Speed Tracking with Kernelized Correlation Filters), and the like.
Optionally, in one embodiment, the electronic device uses the KCF tracking algorithm to track the subject in each frame of image after the reference image. Specifically, during tracking, the electronic device tracks the area where the subject is located in the current frame of image based on the area where the subject is located in the previous frame of image; in this embodiment, the previous frame of image is taken as the reference image for description. The electronic device may use a circulant matrix to sample around the subject area of the reference image, train a classifier from the samples using a kernelized correlation filter, and then sample the current frame of image with the trained classifier to obtain a correlation value for each sample area, taking the sample area with the largest correlation value as the area where the subject is located in the current frame of image. When KCF is used for image tracking, sampling through the circulant matrix can increase the number of training samples and improve the accuracy of the classifier, thereby improving the accuracy of subject tracking. In addition, the KCF tracking algorithm performs a Fourier transform when sampling with the circulant matrix, which avoids matrix inversion operations and increases the speed of subject tracking.
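As an illustrative sketch only (not part of the original disclosure), the KCF-based tracking of operation 208 may be realized with OpenCV's TrackerKCF implementation; the library choice and the (x, y, w, h) box format are assumptions.

```python
import cv2

def track_with_kcf(frames, subject_box):
    """Track a subject through a list of frames with OpenCV's KCF tracker.

    frames: iterable of BGR images; frames[0] is the reference image.
    subject_box: (x, y, w, h) of the subject area detected in the reference image.
    Returns the subject box found in each subsequent frame (None where tracking fails).
    """
    tracker = cv2.TrackerKCF_create()          # requires the opencv-contrib-python package
    tracker.init(frames[0], subject_box)       # train the correlation filter on the reference image

    boxes = []
    for frame in frames[1:]:
        ok, box = tracker.update(frame)        # evaluate the filter response on the new frame
        boxes.append(tuple(int(v) for v in box) if ok else None)
    return boxes
```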
Operation 210: when the number of tracked image frames is greater than or equal to the frame number threshold, use the next acquired frame of image as the reference image, and return to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image.
The frame number threshold may be set according to actual application requirements and is not limited here. For example, the frame number threshold may be 3 frames, 5 frames, 8 frames, 10 frames, and so on. When the electronic device sequentially acquires and tracks each frame of image after the reference image, it may count the number of tracked image frames, and when the number of tracked image frames is greater than or equal to the frame number threshold, it uses the next acquired frame of image as the reference image. For example, when the frame number threshold is 4 frames, if the electronic device uses the first frame of image in the video stream as the reference image and detects the subject area where the subject contained in the reference image is located, the electronic device may sequentially acquire and track each frame of image after the first frame until the number of tracked image frames is greater than or equal to the frame number threshold. In this example, after the fifth frame of image has been tracked, the number of tracked image frames equals the frame number threshold, so the electronic device may use the next acquired frame, that is, the sixth frame of image, as the reference image.
The electronic device may also, when the continuous tracking time is greater than or equal to a time threshold, use the next acquired frame of image as the reference image and return to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image. It is understandable that, in a video stream, the number of tracked image frames and the tracking time can be converted into each other. For example, when the frame rate of the video stream is 60 frames per second, a frame number threshold of 3 frames corresponds to a time threshold of 3 s, a frame number threshold of 5 frames corresponds to a time threshold of 5 s, a frame number threshold of 10 frames corresponds to a time threshold of 10 s, and so on. For example, when the frame rate of the video stream is 30 frames per second and the frame number threshold is 5 frames, the electronic device may use the next acquired frame of image as the reference image when the number of continuously tracked image frames is greater than or equal to 5 frames, which is equivalent to the electronic device using the next acquired frame of image as the reference image when the continuous tracking time is greater than or equal to 10 s. After the electronic device uses the next acquired frame of image as the reference image, it returns to the operation of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image; that is, in the subject tracking process of the video stream, after a threshold number of frames have been tracked, subject detection is performed again to update the subject area of the image before tracking continues.
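The overall flow of operations 202 to 210 may be summarized by the following sketch; detect_subject and track_frame are hypothetical placeholders for the subject detection model and the tracking algorithm described above, and the concrete frame number threshold is only illustrative.

```python
def subject_tracking(video_frames, detect_subject, track_frame, frame_threshold=5):
    """Sketch of operations 202-210: periodic re-detection plus per-frame tracking.

    detect_subject(image) -> subject area in the reference image (operation 204).
    track_frame(image, prev_area) -> subject area in the current frame (operation 208).
    """
    results = []
    reference = None
    prev_area = None
    tracked = 0

    for frame in video_frames:
        if reference is None or tracked >= frame_threshold:
            # Use this frame as the (new) reference image and run subject detection.
            reference = frame
            prev_area = detect_subject(reference)
            tracked = 0
        else:
            # Track the subject from the previous area into the current frame.
            prev_area = track_frame(frame, prev_area)
            tracked += 1
        results.append(prev_area)
    return results
```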
In the embodiments of the present application, the subject area where the subject is located is obtained by performing subject detection on the reference image in the video stream, and each frame of image after the reference image is acquired in turn for subject tracking to obtain the area of the subject in each frame of image. When the number of tracked image frames is greater than or equal to the frame number threshold, the next acquired frame of image is used as the reference image, and the operation of performing subject detection on the reference image is executed again; that is, the subject area of the image can be updated, the problem of subject tracking failure caused by changes of the subject in the video stream can be avoided, and the accuracy of subject tracking can be improved. In addition, in this application, deep learning is used to perform subject detection on the image while an image tracking algorithm is used for subject tracking, which avoids the high power consumption and poor real-time performance caused by using a neural network both to identify and to track the subject, and also avoids the poor tracking effect caused by detecting the subject with traditional image processing methods. That is, the technical solution provided by the embodiments of the present application can reduce power consumption while improving the real-time performance and accuracy of subject detection.
As shown in FIG. 3, in one embodiment, the process of tracking each frame of image after the reference image through the tracking algorithm based on the subject area to obtain the area of the subject in each frame of image includes:
Operation 302: acquire the area of the subject in the previous frame of image.
The previous frame of image is the frame immediately before the current frame of image to be tracked in the video stream. The current frame of image is the image about to be tracked. The electronic device may acquire the area where the subject is located in the previous frame of the current frame of image. Optionally, if the current frame of image is the first frame of image after the reference image, the previous frame of image is the reference image.
Operation 304: increase the area of the subject in the previous frame of image by a preset size to obtain a first prediction area.
The preset size may be set according to actual application requirements and is not limited here. The preset size includes sizes in different directions. For example, when the area of the subject in the previous frame of image is a circle, the preset size may be the amount by which the radius is to be increased; when the area of the subject in the previous frame of image is a quadrangle, the preset size may include the amounts by which the four side lengths are to be increased. Specifically, the preset size may be a fixed value, or different preset sizes may be adopted for different shooting scenes. For example, the electronic device may preset sizes corresponding to different subject categories, so as to obtain the corresponding preset size according to the subject recognition result of the reference image. It is understandable that the preset size may also be determined based on the size of the area of the subject in the previous frame of image. For example, the electronic device may preset the increase factor to 0.1, 0.2, 0.3, or the like of the original area size, so that the electronic device may determine the preset size according to the size of the area of the subject in the previous frame of image and the preset factor.
Operation 306: acquire, from the current frame of image, a second prediction area corresponding to the position of the first prediction area.
The first prediction area is an area in the previous frame of image. The position of the second prediction area in the current frame of image is the same as the position of the first prediction area in the previous frame of image. The electronic device may increase the area of the subject in the previous frame of image by the preset size to obtain the first prediction area, and then acquire the corresponding second prediction area from the current frame of image according to the position of the first prediction area in the previous frame of image. Specifically, the electronic device may map the first prediction area to the current frame of image according to its position in the previous frame of image to obtain the second prediction area; it may also obtain the coordinate position of the first prediction area in the previous frame of image and acquire the corresponding second prediction area from the current frame of image according to that coordinate position.
Operation 308: track the second prediction area to obtain the area of the subject in the current frame of image.
The electronic device may track the second prediction area of the current frame of image to obtain the area of the subject in the current frame of image. That is, when performing subject tracking on the current frame of image, the electronic device does not need to track the entire frame of image, which can reduce the amount of calculation during image tracking and improve the real-time performance and efficiency of subject tracking.
FIG. 4(a) is a schematic diagram of the previous frame of image in an embodiment. FIG. 4(b) is a schematic diagram of the current frame of image corresponding to FIG. 4(a) in an embodiment. As shown in FIGS. 4(a) and 4(b), for the area 404 where the subject is located in the previous frame of image 402, the electronic device increases the area 404 of the subject in the previous frame of image by the preset size to obtain the first prediction area 406; it then acquires, from the current frame of image 412, the second prediction area 416 corresponding to the position of the first prediction area 406, and performs subject tracking on the second prediction area 416 according to the area 404 of the subject in the previous frame of image to obtain the area 414 of the subject in the current frame of image.
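A minimal sketch of operations 302 to 308 is given below; the rectangular (x, y, w, h) box format, the fixed pixel margin, and the track_fn placeholder are assumptions used only for illustration.

```python
def expand_box(box, margin, img_w, img_h):
    """Enlarge (x, y, w, h) by `margin` pixels on each side, clipped to the image."""
    x, y, w, h = box
    x0 = max(x - margin, 0)
    y0 = max(y - margin, 0)
    x1 = min(x + w + margin, img_w)
    y1 = min(y + h + margin, img_h)
    return (x0, y0, x1 - x0, y1 - y0)

def track_in_prediction_region(current_frame, prev_box, margin, track_fn):
    """Track only inside the second prediction area instead of the full frame."""
    img_h, img_w = current_frame.shape[:2]
    px, py, pw, ph = expand_box(prev_box, margin, img_w, img_h)   # first/second prediction area
    crop = current_frame[py:py + ph, px:px + pw]
    bx, by, bw, bh = track_fn(crop)            # subject box in crop coordinates
    return (px + bx, py + by, bw, bh)          # map back to full-frame coordinates
```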
In one embodiment, before increasing the area of the subject in the previous frame of image by the preset size to obtain the first prediction area, the method further includes: acquiring the areas of the subject in multiple frames of images before the previous frame of image; analyzing the moving speed of the subject according to the areas of the subject in the multiple frames of images; increasing the preset size when the moving speed is greater than or equal to a preset speed; and reducing the preset size when the moving speed is less than the preset speed.
When tracking an image, the electronic device can obtain and output the area of the subject in the image. The multiple frames of images before the previous frame of image are usually the images between the reference image and the current frame of image in the video stream. Optionally, if the subjects in at least two reference images before the current frame of image are the same or similar, the number of the multiple frames of images acquired by the electronic device may be greater than the frame number threshold; if the subjects in at least two reference images before the current frame of image are not the same, the number of the acquired multiple frames of images may be less than or equal to the frame number threshold.
The electronic device may acquire the areas of the subject in the multiple frames of images before the previous frame of image, analyze the moving speed of the subject according to those areas, increase the preset size when the moving speed is greater than or equal to the preset speed, and reduce the preset size when the moving speed is less than the preset speed. The moving speed of the subject may be calculated according to the positions of the subject's area in the multiple frames of images and the frame rate of the video stream. The amounts by which the preset size is increased and reduced may be set according to actual application requirements and are not limited here. Optionally, the greater the moving speed, the greater the increase of the preset size may be; the smaller the moving speed, the smaller the reduction of the preset size may be. The preset size may be an optimal adjustment size determined when the moving speed of the subject equals the preset speed.
By analyzing the moving speed of the subject according to its areas in the multiple frames of images before the previous frame of image and adjusting the preset size according to that speed, the preset size is increased when the moving speed is high, which avoids tracking failure caused by the subject's area in the current frame of image exceeding the second prediction area set with the unadjusted preset size; when the moving speed is low, the preset size is reduced, which further reduces the amount of calculation during image tracking. That is, the efficiency of subject tracking can be improved while ensuring that the subject is tracked successfully.
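One possible realization of this speed-dependent adjustment is sketched below: the subject's approximate speed is estimated from the centers of its areas in the recent frames and the frame rate, and the preset size (treated here as a pixel margin) is scaled around a base value. The proportional step is an assumption; the disclosure only requires increasing the size above the preset speed and reducing it below.

```python
def estimate_speed(boxes, fps):
    """Approximate subject speed in pixels per second from its (x, y, w, h) boxes."""
    centers = [(x + w / 2.0, y + h / 2.0) for x, y, w, h in boxes]
    if len(centers) < 2:
        return 0.0
    steps = [((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) ** 0.5
             for (cx1, cy1), (cx2, cy2) in zip(centers, centers[1:])]
    return (sum(steps) / len(steps)) * fps

def adjust_margin(base_margin, speed, preset_speed, step=0.2):
    """Increase the preset size for fast subjects, reduce it for slow ones (assumed step)."""
    if speed >= preset_speed:
        return base_margin * (1.0 + step * speed / preset_speed)
    return base_margin * (1.0 - step)
```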
In one embodiment, before the number of tracked image frames is greater than or equal to the frame number threshold, the provided subject tracking method may further include:
Operation 502: acquire the areas of the subject in the multiple frames of images that have been tracked.
Generally, the number of tracked frames of images is less than or equal to the frame number threshold. In some embodiments, the tracked frames of images may include reference images; when the subjects in at least two reference images are the same or similar, the number of tracked frames of images may be greater than the frame number threshold. The electronic device may acquire the areas of the subject in the tracked frames of images.
Operation 504: analyze the position change amount of the subject based on its areas in the tracked frames of images, where the position change amount represents the magnitude of the change of the subject's position in the images.
The position change amount of the subject represents the magnitude of the change of the subject's position in the images. The position change amount of the subject may include at least one of the change in the area occupied by the subject in the video stream and the change caused by the movement of the subject. The electronic device analyzes the position change amount of the subject based on its areas in the tracked frames of images, that is, it analyzes the magnitude of the subject's change in the video stream. The greater the position change amount, the greater the change of the subject; conversely, the smaller the position change amount, the smaller the change of the subject.
Operation 506: when the position change amount is greater than or equal to a change amount threshold, set the frame number threshold to a first value.
Operation 508: when the position change amount is less than the change amount threshold, set the frame number threshold to a second value, where the second value is greater than the first value.
The change amount threshold may be set according to actual application requirements and is not limited here. Before the electronic device sets the frame number threshold according to the position change amount, the electronic device may determine whether the number of tracked image frames is greater than or equal to the frame number threshold according to a default frame number threshold. Optionally, the default frame number threshold may be the optimal frame number threshold for updating the reference image when the position change amount of the subject, determined from experimental data, equals the change amount threshold. The first value and the second value may be set according to actual application needs and are not limited here. Specifically, the second value is greater than the first value, and the default frame number threshold of the electronic device is greater than or equal to the first value and less than or equal to the second value. For example, the first value may be 3 and the second value 5; the first value 5 and the second value 10; the first value 4 and the second value 8; and so on, which is not limited here.
The electronic device may set the frame number threshold to the first value when the position change amount is greater than or equal to the change amount threshold, and set the frame number threshold to the second value, which is greater than the first value, when the position change amount is less than the change amount threshold. That is, when the subject changes substantially, the reference image can be updated in time to re-determine the area where the subject of the reference image is located; when the subject changes only slightly, the update of the reference image can be delayed, which reduces the high power consumption caused by frequently performing subject detection on reference images.
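Operations 502 to 508 may be sketched as follows; measuring the position change amount as the total displacement of the box center is one possible choice (the disclosure also allows using the change of the subject's area), and the concrete first and second values are illustrative.

```python
def position_change(boxes):
    """Total displacement of the subject's box center over the tracked frames."""
    centers = [(x + w / 2.0, y + h / 2.0) for x, y, w, h in boxes]
    return sum(((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) ** 0.5
               for (cx1, cy1), (cx2, cy2) in zip(centers, centers[1:]))

def select_frame_threshold(boxes, change_threshold, first_value=3, second_value=8):
    """Operations 506-508: small threshold for fast-changing subjects, larger one otherwise."""
    if position_change(boxes) >= change_threshold:
        return first_value          # re-detect the subject sooner
    return second_value             # second value > first value: re-detect less often
```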
In one embodiment, the electronic device includes a gyroscope, and the subject tracking method further includes: acquiring angular velocity data output by the gyroscope; analyzing shake information of the electronic device according to the angular velocity data; and adjusting the frame number threshold according to the shake information.
A gyroscope is an angular motion detection device used to detect angular velocity. The electronic device may acquire the angular velocity data output by the gyroscope while the video stream is being collected. The electronic device may analyze the shake amplitude of the electronic device according to the angular velocity data, and then adjust the frame number threshold according to the shake amplitude. The greater the shake amplitude of the electronic device, the higher the possibility that the subject in the video stream will change, so the electronic device may preset an amplitude threshold and lower the frame number threshold when the shake amplitude exceeds the amplitude threshold; the smaller the shake amplitude of the electronic device, the lower the possibility that the subject in the video stream will change, so the electronic device may raise the frame number threshold when the shake amplitude is less than the amplitude threshold. Optionally, the electronic device may also pre-divide multiple amplitude intervals and the number of frames corresponding to each amplitude interval, so that the shake amplitude can be analyzed according to the angular velocity data output by the gyroscope and the frame number threshold can be adjusted to the number of frames corresponding to the interval in which the shake amplitude falls.
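A possible realization of this gyroscope-based adjustment is sketched below, assuming the angular velocity samples are available as (x, y, z) tuples in rad/s; the amplitude measure and the interval-to-threshold mapping are illustrative assumptions.

```python
def shake_amplitude(gyro_samples):
    """Use the mean angular speed magnitude as a simple shake measure."""
    mags = [(wx * wx + wy * wy + wz * wz) ** 0.5 for wx, wy, wz in gyro_samples]
    return sum(mags) / len(mags) if mags else 0.0

def threshold_from_shake(gyro_samples, intervals=((0.1, 10), (0.5, 5), (float("inf"), 3))):
    """Map shake amplitude to a frame number threshold: more shake -> re-detect sooner."""
    amplitude = shake_amplitude(gyro_samples)
    for upper_bound, frames in intervals:
        if amplitude < upper_bound:
            return frames
    return intervals[-1][1]
```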
In one embodiment, in the provided subject tracking method, the process of tracking each frame of image after the reference image through the tracking algorithm based on the subject area to obtain the area of the subject in each frame of image includes:
Operation 602: acquire the subject area and category corresponding to each subject in the reference image.
The reference image may include one or more subjects. When performing subject detection on the reference image, the electronic device may output the subject area and category corresponding to each subject in the reference image. The categories of subject areas include people, animals, plants, books, furniture, and the like, which are not limited here.
Operation 604: determine the tracking order of each subject according to at least one of the priority level of the category corresponding to each subject, the size of the subject area, and the position of the subject area.
Specifically, the electronic device may also preset the priority levels of different categories, and the score values for different area sizes and for different positions of an area in the image, so that a score value for each subject can be calculated according to the priority level of its corresponding category, the size of its area, and the position of its area in the image, and the tracking order of each subject can be determined according to its score value. Generally, the higher the priority level of the subject's category, the larger the subject area, and the closer the subject area is to the center of the image, the earlier that subject comes in the tracking order. Taking as an example the case where a higher priority level, a larger subject area, and a subject area closer to the image center each give a larger score value, the tracking order of the subjects is the order of the subjects sorted from high to low according to their score values.
Operation 606: track each frame of image after the reference image based on the tracking order to obtain the area where each subject is located in each frame of image.
The electronic device tracks each frame of image based on the tracking order to obtain the area where each subject is located in each frame of image; that is, when tracking a frame of image, each subject in the image may be tracked in turn according to the tracking order, and the area where each subject is located in that image is output.
In image or video shooting, the object of interest is often imaged at the center of the image, or the distance between the camera and the object of interest is reduced so that the object of interest occupies a larger area in the image. The electronic device determines the tracking order of each subject according to at least one of the priority level of the subject's corresponding category, the size of the subject area, and the position of the subject area, and tracking the images according to this order can improve the effect of subject tracking and satisfy users' individual needs.
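The tracking order of operation 604 may be sketched as a weighted score over category priority, area size, and distance to the image center; the weights and the priority table below are assumptions for illustration only.

```python
def tracking_order(subjects, img_w, img_h, priority=None, weights=(1.0, 1.0, 1.0)):
    """Sort detected subjects into a tracking order (operation 604).

    subjects: list of dicts such as {"category": "person", "box": (x, y, w, h)}.
    A higher category priority, a larger area, and a position closer to the image
    center all raise a subject's score, so that subject is tracked earlier.
    """
    priority = priority or {"person": 3, "dog": 2, "cat": 2, "flower": 1}  # assumed table
    w_cat, w_area, w_center = weights
    max_dist = ((img_w / 2.0) ** 2 + (img_h / 2.0) ** 2) ** 0.5

    def score(subject):
        x, y, w, h = subject["box"]
        cx, cy = x + w / 2.0, y + h / 2.0
        dist = ((cx - img_w / 2.0) ** 2 + (cy - img_h / 2.0) ** 2) ** 0.5
        return (w_cat * priority.get(subject["category"], 0)
                + w_area * (w * h) / float(img_w * img_h)
                + w_center * (1.0 - dist / max_dist))

    return sorted(subjects, key=score, reverse=True)
```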
In one embodiment, in the provided subject tracking method, the process of performing subject detection on the reference image to obtain the subject area where the subject is located in the reference image includes:
Operation 702: generate a center weight map corresponding to the reference image, where the weight values represented by the center weight map gradually decrease from the center to the edges.
The center weight map is a map used to record the weight value of each pixel in the reference image. The weight values recorded in the center weight map gradually decrease from the center toward the four sides, that is, the weight is largest at the center and gradually decreases toward the four sides. The center weight map represents weight values that gradually decrease from the center pixels of the reference image to its edge pixels.
The electronic device may generate the corresponding center weight map according to the size of the reference image. The weight values represented by the center weight map gradually decrease from the center toward the four sides. The center weight map may be generated using a Gaussian function, a first-order equation, or a second-order equation. The Gaussian function may be a two-dimensional Gaussian function.
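For illustration, a center weight map based on a two-dimensional Gaussian function may be generated as follows; the sigma value that controls how quickly the weights decay toward the edges is an assumed choice.

```python
import numpy as np

def center_weight_map(height, width, sigma_scale=0.5):
    """2-D Gaussian weight map: maximum at the image center, decaying toward the edges."""
    ys = np.arange(height).reshape(-1, 1) - (height - 1) / 2.0
    xs = np.arange(width).reshape(1, -1) - (width - 1) / 2.0
    sigma_y, sigma_x = sigma_scale * height, sigma_scale * width
    weights = np.exp(-(ys ** 2 / (2 * sigma_y ** 2) + xs ** 2 / (2 * sigma_x ** 2)))
    return weights / weights.max()   # normalize so the center weight is 1
```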
Operation 704: input the reference image and the center weight map into a subject detection model to obtain a subject region confidence map.
The subject detection model is a model obtained in advance by training on sample images of the same scene, center weight maps, and corresponding labeled subject mask maps. Specifically, the electronic device may collect a large amount of training data in advance and input the training data into a subject detection model containing initial network weights for training, thereby obtaining the trained subject detection model. Each set of training data includes a sample image, a center weight map, and a labeled subject mask map corresponding to the same scene. The sample image and the center weight map serve as inputs of the subject detection model being trained, and the labeled subject mask map serves as the ground truth that the trained subject detection model is expected to output. The subject mask map is an image filter template used to identify the subject in an image: it can block out the other parts of the image and filter out the subject in the image. The subject detection model can be trained to recognize and detect various subjects, such as people, flowers, cats, and dogs.
Specifically, the electronic device can input the reference image and the center weight map into the subject detection model and perform detection to obtain the subject region confidence map. The subject region confidence map contains, for each pixel, confidence values for the different subject categories; for example, the confidence that a certain pixel belongs to a person is 0.8, to a flower is 0.1, and to a dog is 0.1.
Operation 706: determine the subject in the reference image according to the subject region confidence map, and obtain the subject region in which the subject is located.
The subject may be any of various objects, such as a person, flower, cat, dog, cow, or white cloud. The electronic device can determine each subject contained in the reference image and the subject region in which it is located according to the magnitudes of the per-pixel confidence values for the different subject categories in the subject region confidence map.
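One way of reading such a per-class confidence map into per-pixel labels and rough subject regions is sketched below; the (num_classes, H, W) array layout and the use of OpenCV connected components are assumptions of this example, not requirements of the disclosure.

```python
import numpy as np
import cv2

def subjects_from_confidence(conf, class_names, min_conf=0.5):
    """conf: float array of shape (num_classes, H, W) with per-pixel class confidences.

    Returns (class_name, (x, y, w, h)) for each connected region whose winning
    class confidence is at least min_conf.
    """
    labels = conf.argmax(axis=0)        # winning class per pixel
    best = conf.max(axis=0)             # confidence of the winning class
    results = []
    for cls_idx, name in enumerate(class_names):
        mask = ((labels == cls_idx) & (best >= min_conf)).astype(np.uint8)
        num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        for i in range(1, num):         # component 0 is the background
            x, y, w, h, area = stats[i]
            results.append((name, (int(x), int(y), int(w), int(h))))
    return results
```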
Specifically, the electronic device may perform adaptive threshold filtering on the subject region confidence map to remove pixels with low confidence values and/or scattered pixels from the confidence map; the electronic device may also apply one or more of filtering, dilation, and erosion to obtain a subject region confidence map with fine edges. The electronic device can then output, from the processed subject region confidence map, the subject regions in which the multiple subjects contained in the reference image are located, which improves the accuracy of subject detection.
By generating a center weight map corresponding to the reference image and inputting the reference image and the center weight map into the corresponding subject detection model, the subject region confidence map can be obtained, and the subject in the reference image and the subject region in which it is located can be determined from this confidence map. The center weight map makes an object at the center of the image easier to detect, so the subject in the reference image can be identified more accurately.
In one embodiment, the provided subject tracking method may further obtain a depth image corresponding to the reference image, perform registration on the reference image and the depth image to obtain a registered reference image and depth image, then input the registered reference image, the depth image, and the center weight map into the subject detection model to obtain the subject region confidence map, determine the subject in the reference image according to the subject region confidence map, and obtain the subject region in which the subject is located.
A depth image is an image containing depth information. The depth image may be a depth map computed by shooting the same scene with dual cameras, or a depth map collected by a structured-light camera or a TOF (time of flight) camera. Specifically, the electronic device can shoot the same scene with a camera to obtain the reference image and the corresponding depth image, and then register the reference image and the depth image using camera calibration parameters to obtain a registered visible-light image and depth image. Optionally, after registering the reference image and the depth image, the electronic device may also normalize the pixel values of the reference image and of the depth image separately. Specifically, the integer pixel values of the reference image in the range 0 to 255 are normalized to floating-point values in the range -1 to +1, and the pixel values of the depth image are normalized to floating-point values in the range 0 to 1. When a depth image cannot be captured, a simulated depth map with a preset depth value can be generated automatically, and the preset value may be a floating-point value between 0 and 1.
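A minimal sketch of the normalization step described above; the simulated depth value of 0.5 is an assumed example of the preset value.

```python
import numpy as np

def normalize_inputs(reference_img, depth_img=None, simulated_depth=0.5):
    """Normalize the registered inputs: reference image from [0, 255] to [-1, +1],
    depth image to [0, 1]; build a constant simulated depth map when none exists."""
    ref = reference_img.astype(np.float32) / 255.0 * 2.0 - 1.0           # -> [-1, +1]
    if depth_img is None:
        depth = np.full(reference_img.shape[:2], simulated_depth, dtype=np.float32)
    else:
        d = depth_img.astype(np.float32)
        depth = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)      # -> [0, 1]
    return ref, depth
```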
In this embodiment, the subject detection model is a model obtained in advance by training on visible-light images, depth maps, center weight maps, and corresponding labeled subject mask maps of the same scene. The subject detection model is obtained by collecting a large amount of training data in advance and inputting the training data into a subject detection model containing initial network weights for training. Each set of training data includes a visible-light image, a depth map, a center weight map, and a labeled subject mask map corresponding to the same scene.
In this embodiment, the depth image and the center weight map are used as inputs of the subject detection model. The depth information of the depth image makes objects closer to the camera easier to detect, and the center-attention mechanism of the center weight map, with large weights at the center and small weights at the four sides, makes an object at the center of the image easier to detect. Introducing the depth image enhances the depth features of the subject, and introducing the center weight map enhances the center-attention features of the subject. This not only identifies the target subject accurately in simple scenes but also greatly improves the accuracy of subject recognition in complex scenes, and introducing the depth image addresses the poor robustness of traditional object detection methods to the highly variable objects in natural images. A simple scene is a scene with a single subject and low contrast in the background region.
In one embodiment, in the provided subject tracking method, the process of determining the subject in the reference image according to the subject region confidence map and obtaining the subject region in which the subject is located includes:
Operation 802: process the subject region confidence map to obtain a subject mask map.
Specifically, there are some scattered points with low confidence in the subject region confidence map, and the electronic device can filter the subject region confidence map to obtain the subject mask map. The filtering may use a configured confidence threshold to filter out pixels whose confidence value in the subject region confidence map is lower than the confidence threshold. The confidence threshold may be an adaptive confidence threshold, a fixed threshold, or thresholds configured per region. The adaptive confidence threshold may be a locally adaptive confidence threshold, which determines the binarization confidence threshold at a pixel position from the distribution of pixel values in the neighborhood block of that pixel. The binarization confidence threshold is configured higher for image regions with higher brightness and lower for image regions with lower brightness.
Optionally, the electronic device may also perform adaptive-confidence-threshold filtering on the subject region confidence map to obtain a binarized mask map, and then perform morphological processing and guided filtering on the binarized mask map to obtain the subject mask map. Specifically, after filtering the subject region confidence map with the adaptive confidence threshold, the electronic device represents the confidence values of the retained pixels as 1 and the confidence values of the removed pixels as 0 to obtain the binarized mask map. The morphological processing may include erosion and dilation. The binarized mask map may first be eroded and then dilated to remove noise; guided filtering is then performed on the morphologically processed binarized mask map to carry out edge filtering, yielding a subject mask map with extracted edges. The morphological processing and guided filtering ensure that the resulting subject mask map has little or no noise and softer edges.
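A minimal OpenCV sketch of this post-processing chain. All parameter values are illustrative assumptions, and the guided filter relies on the opencv-contrib ximgproc module being available.

```python
import cv2
import numpy as np

def subject_mask(confidence, reference_gray, block_size=31, c=5,
                 kernel_size=3, gf_radius=8, gf_eps=500.0):
    """Confidence map (float in [0, 1]) -> binarized mask -> morphology -> guided filter."""
    conf8 = (confidence * 255).astype(np.uint8)
    # Locally adaptive threshold: retained pixels become 255, removed pixels 0.
    binary = cv2.adaptiveThreshold(conf8, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, block_size, c)
    # Erode first, then dilate, to remove isolated noise.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    binary = cv2.dilate(cv2.erode(binary, kernel), kernel)
    # Guided filtering with the reference image as guide softens the mask edges.
    return cv2.ximgproc.guidedFilter(reference_gray, binary, gf_radius, gf_eps)
```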
Operation 804: detect the reference image and determine the highlight region in the reference image.
The highlight region is a region whose brightness values are greater than a brightness threshold.
Specifically, the electronic device performs highlight detection on the reference image, filters out target pixels whose brightness value is greater than the brightness threshold, and applies connected-component processing to the target pixels to obtain the highlight region.
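A minimal sketch of this highlight detection; the brightness threshold and minimum component area are assumed values.

```python
import cv2
import numpy as np

def highlight_region(reference_bgr, brightness_threshold=220, min_area=25):
    """Binary map of highlight regions: bright pixels grouped by connected components."""
    gray = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
    bright = (gray > brightness_threshold).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(bright, connectivity=8)
    highlight = np.zeros_like(bright)
    for i in range(1, num):                         # component 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:  # keep only sizeable components
            highlight[labels == i] = 1
    return highlight
```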
Operation 806: determine the subject in the reference image according to the highlight region in the reference image and the subject mask map, and obtain the subject region in which the subject is located.
Specifically, the electronic device may perform a difference calculation or a logical AND calculation between the highlight region in the reference image and the subject mask map to obtain the subject region corresponding to the subject in the reference image with highlights removed. In the difference processing, the electronic device subtracts the corresponding pixel values of the highlight region of the reference image and of the subject mask map to obtain the subject region in which the subject in the reference image is located.
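For binary masks, the difference operation described above can be written as the short sketch below (a logical-AND with the inverted highlight map gives the same result); the function name is an assumption.

```python
import numpy as np

def subject_region_without_highlights(subject_mask, highlight_mask):
    """Clear subject-mask pixels that fall inside a highlight region (both masks binary)."""
    diff = subject_mask.astype(np.int16) - highlight_mask.astype(np.int16)
    return np.clip(diff, 0, 1).astype(np.uint8)
```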
Filtering the subject region confidence map to obtain the subject mask map improves the reliability of the subject region confidence map; detecting the reference image yields the highlight region, which is then processed together with the subject mask map to obtain the subject region of the subject with highlights removed. The highlight and high-brightness regions that affect the precision of subject recognition are processed separately with filters, which improves the precision and accuracy of subject recognition.
FIG. 9 is a schematic diagram of an image processing effect in one embodiment. As shown in FIG. 9, there is a butterfly in the reference image 902. After the reference image 902 is input into the subject detection model, the subject region confidence map 904 is obtained; the subject region confidence map 904 is then filtered and binarized to obtain the binarized mask map 906, and morphological processing and guided filtering are performed on the binarized mask map 906 for edge enhancement, yielding the subject mask map 908.
In one embodiment, in the provided subject tracking method, the process of determining the subject in the reference image according to the subject region confidence map and obtaining the subject region in which the subject is located includes:
Operation 1002: obtain, according to the subject region confidence map, the regions in which the multiple objects contained in the reference image are located and the corresponding categories.
Specifically, the electronic device may perform subject detection on the reference image through a subject recognition network to obtain the regions in which the multiple objects contained in the reference image are located and the corresponding categories.
Operation 1004: determine the target object serving as the subject based on at least one of the priority level of the category corresponding to each object, the size of the region, and the position of the region.
The electronic device may preset priority levels corresponding to different categories. For example, the priority levels of the categories may decrease in the order of person, flower, cat, dog, cow, and white cloud. The electronic device determines the target object serving as the subject based on at least one of the priority level of the category corresponding to each object, the size of the region, and the position of the region. Specifically, when the reference image contains multiple objects belonging to the same category, the electronic device may determine the object with the largest region as the target object according to the sizes of the regions corresponding to the objects, or may determine the object closest to the center of the image as the target object. When the reference image contains multiple objects belonging to different categories, the electronic device may take the object corresponding to the category with the highest priority level as the target object; if there are multiple objects with the highest priority level in the reference image, the target object may be further determined according to the sizes of the regions in which these objects are located. The electronic device may also determine the target object serving as the subject in combination with the position in the image of the region in which each object is located. For example, the electronic device may also preset priority levels for different categories, score values for different region sizes, and score values for different positions of regions in the image, compute a score value for each object from the priority level of its category, the size of its region, and the position of its region in the image, and take the object with the highest score value as the target object.
Operation 1006: take the region in which the target object is located as the subject region in which the subject is located.
After determining the target object serving as the subject, the electronic device takes the region in which the target object is located as the subject region in which the subject is located.
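The selection rule described above can be sketched as follows; the priority table and the exact tie-breaking order (priority level, then region size, then closeness to the image center) are assumptions made for this example.

```python
import math

# Assumed priorities matching the example order: person highest, white cloud lowest.
PRIORITY = {"person": 6, "flower": 5, "cat": 4, "dog": 3, "cow": 2, "white cloud": 1}

def select_target_object(objects, image_size):
    """Pick the target object from (category, (x, y, w, h)) detections."""
    img_w, img_h = image_size

    def key(obj):
        category, (x, y, w, h) = obj
        cx, cy = x + w / 2.0, y + h / 2.0
        dist = math.hypot(cx - img_w / 2.0, cy - img_h / 2.0)
        # Higher priority first, then larger area, then closer to the center.
        return (PRIORITY.get(category, 0), w * h, -dist)

    return max(objects, key=key) if objects else None
```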
By determining the target object serving as the subject based on at least one of the priority level of the category corresponding to each object, the size of the region, and the position of the region, and taking the region in which the target object is located as the subject region in which the subject is located, the accuracy of subject recognition can be improved.
It should be understood that, although the operations in the flowcharts of FIGS. 2, 3, and 5 to 7 are shown in sequence as indicated by the arrows, these operations are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these operations is not strictly limited in order, and the operations may be performed in other orders. Moreover, at least some of the operations in FIGS. 2, 3, and 5 to 7 may include multiple sub-operations or multiple stages; these sub-operations or stages are not necessarily completed at the same time but may be performed at different times, and their execution order is not necessarily sequential but may alternate or interleave with other operations or with at least some of the sub-operations or stages of other operations.
FIG. 11 is a structural block diagram of a subject tracking apparatus according to an embodiment. As shown in FIG. 11, the subject tracking apparatus includes a first acquisition module 1102, a subject detection module 1104, a second acquisition module 1106, a subject tracking module 1108, and an image determination module 1110, wherein:
the first acquisition module 1102 is configured to acquire one frame of image from a video stream as a reference image;
the subject detection module 1104 is configured to perform subject detection on the reference image to obtain the subject region in which the subject is located in the reference image;
the second acquisition module 1106 is configured to sequentially acquire each frame of image after the reference image in the video stream;
the subject tracking module 1108 is configured to track, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain the region of the subject in each frame of image; and
the image determination module 1110 is configured to, when the number of tracked image frames is greater than or equal to the frame number threshold, take the acquired next frame of image as the reference image and return to performing the operation of performing subject detection on the reference image to obtain the subject region in which the subject is located in the reference image.
The subject tracking apparatus provided in the embodiments of the present application performs subject detection on a reference image in a video stream to obtain the subject region in which the subject is located, sequentially acquires each frame of image after the reference image for subject tracking, and obtains the region of the subject in each frame of image. When the number of tracked image frames is greater than or equal to the frame number threshold, the acquired next frame of image is taken as the reference image and the operation of performing subject detection on the reference image is performed again; that is, the subject region of the image can be updated, which avoids subject tracking failures caused by changes of the subject in the video stream and improves the accuracy of subject tracking.
In one embodiment, the subject tracking module 1108 may be further configured to obtain the region of the subject in the previous frame of image; enlarge the region of the subject in the previous frame of image by a preset size to obtain a first prediction region; obtain, from the current frame of image, a second prediction region corresponding to the position of the first prediction region; and track the second prediction region to obtain the region of the subject in the current frame of image.
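A minimal sketch of building the two prediction regions; boxes are (x, y, w, h) and the preset size is treated here as a pixel margin, which is an assumption of this example.

```python
def prediction_regions(prev_box, preset_size, image_size):
    """Grow the previous-frame subject box by a preset size (first prediction region)
    and take the same coordinate position in the current frame (second prediction region)."""
    x, y, w, h = prev_box
    img_w, img_h = image_size
    x1 = max(0, x - preset_size)
    y1 = max(0, y - preset_size)
    x2 = min(img_w, x + w + preset_size)
    y2 = min(img_h, y + h + preset_size)
    first = (x1, y1, x2 - x1, y2 - y1)
    # Consecutive frames share the same pixel grid, so the second prediction region
    # is simply the first region's coordinates applied to the current frame.
    return first, first
```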
In one embodiment, the provided subject tracking apparatus further includes a size adjustment module 812, and the size adjustment module 812 is configured to obtain the regions of the subject in multiple frames of images before the previous frame of image; analyze the moving speed of the subject according to the regions of the subject in the multiple frames of images; increase the preset size when the moving speed is greater than or equal to a preset speed; and decrease the preset size when the moving speed is less than the preset speed.
In one embodiment, the provided subject tracking apparatus further includes a frame number threshold setting module 814, and the frame number threshold setting module 814 is configured to obtain the regions of the subject in the multiple frames of images that have been tracked; analyze the position change amount of the subject based on the regions of the subject in the tracked multiple frames of images; set the frame number threshold to a first value when the position change amount is greater than or equal to a change amount threshold; and set the frame number threshold to a second value when the position change amount is less than the change amount threshold, where the second value is greater than the first value.
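A minimal sketch of this threshold-setting rule; the two default values are assumptions, the disclosure only requiring that the second value be larger than the first.

```python
import math

def update_frame_threshold(tracked_boxes, change_threshold, first_value=5, second_value=15):
    """Choose the frame number threshold from how far the subject moved while tracked.

    tracked_boxes is a sequence of (x, y, w, h) boxes over the tracked frames.
    """
    centers = [(x + w / 2.0, y + h / 2.0) for x, y, w, h in tracked_boxes]
    change = sum(math.hypot(bx - ax, by - ay)
                 for (ax, ay), (bx, by) in zip(centers, centers[1:]))
    # Large position change -> re-run subject detection sooner (smaller threshold).
    return first_value if change >= change_threshold else second_value
```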
In one embodiment, the frame number threshold setting module 814 may be further configured to obtain angular velocity data output by a gyroscope, analyze the shake amplitude of the electronic device according to the angular velocity data, and adjust the frame number threshold according to the shake amplitude.
In one embodiment, the subject tracking module 808 may be further configured to obtain the subject region and category corresponding to each subject in the reference image; determine the tracking order of each subject according to at least one of the priority level of the category corresponding to each subject, the size of the subject region, and the position of the subject region; and track each frame of image after the reference image based on the tracking order to obtain the region in which each subject is located in each frame of image.
In one embodiment, the subject detection module 1104 may be further configured to generate a center weight map corresponding to the reference image, where the weight values represented by the center weight map decrease gradually from the center to the edges; input the reference image and the center weight map into the subject detection model to obtain the subject region confidence map; and determine the subject in the reference image according to the subject region confidence map and obtain the subject region in which the subject is located.
In one embodiment, the subject detection module 1104 may be further configured to process the subject region confidence map to obtain a subject mask map; detect the reference image to determine the highlight region in the reference image; and determine the subject in the reference image according to the highlight region in the reference image and the subject mask map and obtain the subject region in which the subject is located.
In one embodiment, the subject detection module 1104 may be further configured to obtain a depth image corresponding to the reference image; perform registration on the reference image and the depth image to obtain the registered reference image and depth image; input the registered reference image, the depth image, and the center weight map into the subject detection model to obtain the subject region confidence map; and determine the subject in the reference image according to the subject region confidence map and obtain the subject region in which the subject is located.
In one embodiment, the subject detection module 1104 may be further configured to obtain, according to the subject region confidence map, the regions in which the multiple objects contained in the reference image are located and the corresponding categories; determine the target object serving as the subject based on at least one of the priority level of the category corresponding to each object, the size of the region, and the position of the region; and take the region in which the target object is located as the subject region in which the subject is located.
The division of the modules in the subject tracking apparatus above is only for illustration; in other embodiments, the subject tracking apparatus may be divided into different modules as needed to complete all or part of the functions of the subject tracking apparatus.
Each module in the subject tracking apparatus provided in the embodiments of the present application may be implemented in the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the terminal or the server. When the computer program is executed by a processor, the operations of the methods described in the embodiments of the present application are implemented.
The embodiments of the present application also provide an electronic device. The electronic device includes an image processing circuit, which may be implemented with hardware and/or software components and may include various processing units defining an ISP (Image Signal Processing) pipeline. FIG. 12 is a schematic diagram of an image processing circuit in one embodiment. As shown in FIG. 12, for ease of description, only the aspects of the image processing technology related to the embodiments of the present application are shown.
As shown in FIG. 12, the image processing circuit includes an ISP processor 1240 and a control logic 1250. Image data captured by the imaging device 1210 is first processed by the ISP processor 1240, which analyzes the image data to capture image statistics that can be used to determine one or more control parameters of the imaging device 1210. The imaging device 1210 may include a camera with one or more lenses 1212 and an image sensor 1214. The image sensor 1214 may include a color filter array (such as a Bayer filter); the image sensor 1214 can obtain the light intensity and wavelength information captured by each of its imaging pixels and provide a set of raw image data that can be processed by the ISP processor 1240. The sensor 1220 (such as a gyroscope) may provide collected image processing parameters (such as anti-shake parameters) to the ISP processor 1240 based on the interface type of the sensor 1220. The sensor 1220 interface may be an SMIA (Standard Mobile Imaging Architecture) interface, another serial or parallel camera interface, or a combination of the above interfaces.
In addition, the image sensor 1214 may also send the raw image data to the sensor 1220; the sensor 1220 may provide the raw image data to the ISP processor 1240 based on the interface type of the sensor 1220, or the sensor 1220 may store the raw image data in the image memory 1230.
The ISP processor 1240 processes the raw image data pixel by pixel in multiple formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 1240 may perform one or more image processing operations on the raw image data and collect statistical information about the image data. The image processing operations may be performed with the same or different bit-depth precision.
The ISP processor 1240 may also receive image data from the image memory 1230. For example, the sensor 1220 interface sends the raw image data to the image memory 1230, and the raw image data in the image memory 1230 is then provided to the ISP processor 1240 for processing. The image memory 1230 may be part of a memory device, a storage device, or an independent dedicated memory within the electronic device, and may include DMA (Direct Memory Access) features.
When receiving raw image data from the image sensor 1214 interface, from the sensor 1220 interface, or from the image memory 1230, the ISP processor 1240 may perform one or more image processing operations, such as temporal filtering. The processed image data may be sent to the image memory 1230 for additional processing before being displayed. The ISP processor 1240 receives the processed data from the image memory 1230 and processes the image data in the raw domain and in the RGB and YCbCr color spaces. The image data processed by the ISP processor 1240 may be output to the display 1270 for viewing by the user and/or for further processing by a graphics engine or a GPU (Graphics Processing Unit). In addition, the output of the ISP processor 1240 may also be sent to the image memory 1230, and the display 1270 may read image data from the image memory 1230. In one embodiment, the image memory 1230 may be configured to implement one or more frame buffers. In addition, the output of the ISP processor 1240 may be sent to an encoder/decoder 1260 for encoding/decoding the image data. The encoded image data may be saved and decompressed before being displayed on the display 1270. The encoder/decoder 1260 may be implemented by a CPU, a GPU, or a coprocessor.
The statistical data determined by the ISP processor 1240 may be sent to the control logic 1250. For example, the statistical data may include image sensor 1214 statistics such as automatic exposure, automatic white balance, automatic focus, flicker detection, black level compensation, and lens 1212 shading correction. The control logic 1250 may include a processor and/or a microcontroller executing one or more routines (such as firmware), and the one or more routines may determine the control parameters of the imaging device 1210 and the control parameters of the ISP processor 1240 according to the received statistical data. For example, the control parameters of the imaging device 1210 may include sensor 1220 control parameters (such as gain, integration time for exposure control, and anti-shake parameters), camera flash control parameters, lens 1212 control parameters (such as focal length for focusing or zooming), or combinations of these parameters. The ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (for example, during RGB processing), as well as lens 1212 shading correction parameters.
In the embodiments provided in the present application, the imaging device 1210 may be used to capture each frame of image in the video stream; the image memory 1230 is used to store the images captured by the imaging device 1210; the ISP processor 1240 may perform subject detection on one frame of image in the video stream captured by the imaging device 1210 to obtain the subject region in which the subject is located in the reference image, and perform subject tracking on each frame of image after the reference image according to the subject region. When the number of tracked image frames is greater than or equal to the frame number threshold, the acquired next frame of image is taken as the reference image, and the operation of performing subject detection on the reference image to obtain the subject region in which the subject is located in the reference image is performed again, until tracking of the video stream is completed. The electronic device can implement the subject tracking method provided in the above embodiments through the above image processing circuit, which is not repeated here.
The embodiments of the present application also provide a computer-readable storage medium: one or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the operations of the subject tracking method.
A computer program product containing instructions which, when run on a computer, causes the computer to perform the subject tracking method.
Any reference to memory, storage, a database, or other media used in the embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which serves as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent of the present application. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the patent of the present application shall be subject to the appended claims.

Claims (20)

  1. A subject tracking method, comprising:
    acquiring one frame of image from a video stream as a reference image;
    performing subject detection on the reference image to obtain a subject region in which a subject is located in the reference image;
    sequentially acquiring each frame of image after the reference image in the video stream;
    tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain a region of the subject in each frame of image; and
    when the number of tracked image frames is greater than or equal to a frame number threshold, taking an acquired next frame of image as the reference image, and returning to performing the operation of performing subject detection on the reference image to obtain the subject region in which the subject is located in the reference image.
  2. The method according to claim 1, wherein the tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain the region of the subject in each frame of image comprises:
    acquiring a region of the subject in a previous frame of image;
    enlarging the region of the subject in the previous frame of image by a preset size to obtain a first prediction region;
    acquiring, from a current frame of image, a second prediction region corresponding to a position of the first prediction region; and
    tracking the second prediction region to obtain the region of the subject in the current frame of image.
  3. The method according to claim 2, wherein the acquiring, from the current frame of image, the second prediction region corresponding to the position of the first prediction region comprises:
    mapping the first prediction region to the current frame of image according to the position of the first prediction region in the previous frame of image to obtain the second prediction region; or
    acquiring a coordinate position of the first prediction region in the previous frame of image, and acquiring the corresponding second prediction region from the current frame of image according to the coordinate position.
  4. The method according to claim 2, wherein before the enlarging the region of the subject in the previous frame of image by the preset size to obtain the first prediction region, the method further comprises:
    acquiring regions of the subject in multiple frames of images before the previous frame of image;
    analyzing a moving speed of the subject according to the regions of the subject in the multiple frames of images;
    increasing the preset size when the moving speed is greater than or equal to a preset speed; and
    decreasing the preset size when the moving speed is less than the preset speed.
  5. The method according to claim 1, wherein before the taking the acquired next frame of image as the reference image when the number of tracked image frames is greater than or equal to the frame number threshold, the method further comprises:
    acquiring regions of the subject in multiple frames of images that have been tracked;
    analyzing a position change amount of the subject based on the regions of the subject in the tracked multiple frames of images, wherein the position change amount represents a magnitude of change of the position of the subject in the images;
    setting the frame number threshold to a first value when the position change amount is greater than or equal to a change amount threshold; and
    setting the frame number threshold to a second value when the position change amount is less than the change amount threshold, wherein the second value is greater than the first value.
  6. The method according to claim 1, further comprising:
    acquiring angular velocity data output by a gyroscope;
    analyzing a shake amplitude of an electronic device according to the angular velocity data; and
    adjusting the frame number threshold according to the shake amplitude.
  7. The method according to claim 1, wherein the tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain the region of the subject in each frame of image comprises:
    acquiring a subject region and a category corresponding to each subject in the reference image;
    determining a tracking order of each subject according to at least one of a priority level of the category corresponding to each subject, a size of the subject region, and a position of the subject region; and
    tracking each frame of image after the reference image based on the tracking order to obtain a region in which each subject is located in each frame of image.
  8. The method according to any one of claims 1 to 7, wherein the performing subject detection on the reference image to obtain the subject region in which the subject is located in the reference image comprises:
    generating a center weight map corresponding to the reference image, wherein weight values represented by the center weight map decrease gradually from a center to edges;
    inputting the reference image and the center weight map into a subject detection model to obtain a subject region confidence map; and
    determining the subject in the reference image according to the subject region confidence map, and acquiring the subject region in which the subject is located.
  9. The method according to claim 8, wherein the determining the subject in the reference image according to the subject region confidence map and acquiring the subject region in which the subject is located comprises:
    processing the subject region confidence map to obtain a subject mask map;
    detecting the reference image to determine a highlight region in the reference image; and
    determining the subject in the reference image according to the highlight region in the reference image and the subject mask map, and acquiring the subject region in which the subject is located.
  10. The method according to claim 9, wherein the processing the subject region confidence map to obtain the subject mask map comprises:
    performing adaptive-confidence-threshold filtering on the subject region confidence map to obtain a binarized mask map; and
    performing morphological processing and guided filtering on the binarized mask map to obtain the subject mask map.
  11. The method according to claim 8, further comprising:
    acquiring a depth image corresponding to the reference image; and
    performing registration on the reference image and the depth image to obtain a registered reference image and depth image;
    wherein the inputting the reference image and the center weight map into the subject detection model to obtain the subject region confidence map comprises:
    inputting the registered reference image, the depth image, and the center weight map into the subject detection model to obtain the subject region confidence map.
  12. The method according to claim 8, wherein the determining the subject in the reference image according to the subject region confidence map and acquiring the subject region in which the subject is located comprises:
    obtaining, according to the subject region confidence map, regions in which multiple objects contained in the reference image are located and corresponding categories;
    determining a target object serving as the subject based on at least one of a priority level of the category corresponding to each object, a size of the region, and a position of the region; and
    taking the region in which the target object is located as the subject region in which the subject is located.
  13. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the following operations:
    acquiring one frame of image from a video stream as a reference image;
    performing subject detection on the reference image to obtain a subject region in which a subject is located in the reference image;
    sequentially acquiring each frame of image after the reference image in the video stream;
    tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain a region of the subject in each frame of image; and
    when the number of tracked image frames is greater than or equal to a frame number threshold, taking an acquired next frame of image as the reference image, and returning to performing the operation of performing subject detection on the reference image to obtain the subject region in which the subject is located in the reference image.
  14. The electronic device according to claim 13, wherein, when performing the tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain the region of the subject in each frame of image, the processor further performs the following operations:
    acquiring a region of the subject in a previous frame of image;
    enlarging the region of the subject in the previous frame of image by a preset size to obtain a first prediction region;
    acquiring, from a current frame of image, a second prediction region corresponding to a position of the first prediction region; and
    tracking the second prediction region to obtain the region of the subject in the current frame of image.
  15. The electronic device according to claim 14, wherein, before performing the enlarging the region of the subject in the previous frame of image by the preset size to obtain the first prediction region, the processor further performs the following operations:
    acquiring regions of the subject in multiple frames of images before the previous frame of image;
    analyzing a moving speed of the subject according to the regions of the subject in the multiple frames of images;
    increasing the preset size when the moving speed is greater than or equal to a preset speed; and
    decreasing the preset size when the moving speed is less than the preset speed.
  16. The electronic device according to claim 13, wherein, before performing the taking the acquired next frame of image as the reference image when the number of tracked image frames is greater than or equal to the frame number threshold, the processor further performs the following operations:
    acquiring regions of the subject in multiple frames of images that have been tracked;
    analyzing a position change amount of the subject based on the regions of the subject in the tracked multiple frames of images, wherein the position change amount represents a magnitude of change of the position of the subject in the images;
    setting the frame number threshold to a first value when the position change amount is greater than or equal to a change amount threshold; and
    setting the frame number threshold to a second value when the position change amount is less than the change amount threshold, wherein the second value is greater than the first value.
  17. The electronic device according to claim 13, wherein, when performing the tracking, based on the subject region, each frame of image after the reference image through a tracking algorithm to obtain the region of the subject in each frame of image, the processor further performs the following operations:
    acquiring a subject region and a category corresponding to each subject in the reference image;
    determining a tracking order of each subject according to at least one of a priority level of the category corresponding to each subject, a size of the subject region, and a position of the subject region; and
    tracking each frame of image after the reference image based on the tracking order to obtain a region in which each subject is located in each frame of image.
  18. The electronic device according to any one of claims 13 to 17, wherein, when performing the subject detection on the reference image to obtain the subject region in which the subject is located in the reference image, the processor further performs the following operations:
    generating a center weight map corresponding to the reference image, wherein weight values represented by the center weight map decrease gradually from a center to edges;
    inputting the reference image and the center weight map into a subject detection model to obtain a subject region confidence map; and
    determining the subject in the reference image according to the subject region confidence map, and acquiring the subject region in which the subject is located.
  19. The electronic device according to claim 18, wherein, when performing the determining the subject in the reference image according to the subject region confidence map and acquiring the subject region in which the subject is located, the processor further performs the following operations:
    processing the subject region confidence map to obtain a subject mask map;
    detecting the reference image to determine a highlight region in the reference image; and
    determining the subject in the reference image according to the highlight region in the reference image and the subject mask map, and acquiring the subject region in which the subject is located.
  20. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the operations of the method according to any one of claims 1 to 12.
PCT/CN2020/094848 2019-06-28 2020-06-08 Subject tracking method, electronic apparatus, and computer-readable storage medium WO2020259264A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910572412.5 2019-06-28
CN201910572412.5A CN110334635B (en) 2019-06-28 2019-06-28 Subject tracking method, apparatus, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020259264A1 (en)

Family

ID=68143572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/094848 WO2020259264A1 (en) 2019-06-28 2020-06-08 Subject tracking method, electronic apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110334635B (en)
WO (1) WO2020259264A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334635B (en) * 2019-06-28 2021-08-31 Oppo广东移动通信有限公司 Subject tracking method, apparatus, electronic device and computer-readable storage medium
CN110650291B (en) * 2019-10-23 2021-06-08 Oppo广东移动通信有限公司 Target focus tracking method and device, electronic equipment and computer readable storage medium
CN112800811B (en) * 2019-11-13 2023-10-13 深圳市优必选科技股份有限公司 Color block tracking method and device and terminal equipment
CN111093077A (en) * 2019-12-31 2020-05-01 深圳云天励飞技术有限公司 Video coding method and device, electronic equipment and storage medium
CN111238829A (en) * 2020-02-12 2020-06-05 上海眼控科技股份有限公司 Method and device for determining moving state, computer equipment and storage medium
CN111263187B (en) * 2020-02-13 2021-07-13 腾讯科技(深圳)有限公司 Video clipping method and device, computer equipment and computer-readable storage medium
CN112528786B (en) * 2020-11-30 2023-10-31 北京百度网讯科技有限公司 Vehicle tracking method and device and electronic equipment
CN113139998A (en) * 2021-04-23 2021-07-20 北京华捷艾米科技有限公司 Depth image generation method and device, electronic equipment and computer storage medium
CN113438471A (en) * 2021-06-18 2021-09-24 京东科技控股股份有限公司 Video processing method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9361700B2 (en) * 2014-05-08 2016-06-07 Tandent Vision Science, Inc. Constraint relationship for use in an image segregation
CN106303044B (en) * 2016-08-18 2019-08-16 努比亚技术有限公司 A kind of mobile terminal and obtain the method to coke number
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of multi-target tracking system and implementation method based on deep learning
CN107066990B (en) * 2017-05-04 2019-10-11 厦门美图之家科技有限公司 A kind of method for tracking target and mobile device
CN108062525B (en) * 2017-12-14 2021-04-23 中国科学技术大学 Deep learning hand detection method based on hand region prediction
CN108229425A (en) * 2018-01-29 2018-06-29 浙江大学 A kind of identifying water boy method based on high-resolution remote sensing image
CN108347563B (en) * 2018-02-07 2020-12-22 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN109272016B (en) * 2018-08-08 2021-03-23 广州视源电子科技股份有限公司 Target detection method, device, terminal equipment and computer readable storage medium
CN109858436B (en) * 2019-01-29 2020-11-27 中国科学院自动化研究所 Target class correction method and detection method based on video dynamic foreground mask

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120070075A1 (en) * 2010-09-17 2012-03-22 Honeywell International Inc. Image processing based on visual attention and reduced search based generated regions of interest
CN108062761A (en) * 2017-12-25 2018-05-22 北京奇虎科技有限公司 Image partition method, device and computing device based on adaptive tracing frame
CN108960290A (en) * 2018-06-08 2018-12-07 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109118510A (en) * 2018-08-10 2019-01-01 平安科技(深圳)有限公司 A kind of monitor video processing method, device and computer-readable medium
CN109685826A (en) * 2018-11-27 2019-04-26 哈尔滨工业大学(深圳) Target tracking method, system and the storage medium of adaptive features select
CN109767467A (en) * 2019-01-22 2019-05-17 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and computer readable storage medium
CN110334635A (en) * 2019-06-28 2019-10-15 Oppo广东移动通信有限公司 Main body method for tracing, device, electronic equipment and computer readable storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643420A (en) * 2021-07-02 2021-11-12 北京三快在线科技有限公司 Three-dimensional reconstruction method and device
CN116058814A (en) * 2021-11-01 2023-05-05 北京荣耀终端有限公司 Heart rate detection method and electronic equipment
CN116543330A (en) * 2023-04-13 2023-08-04 北京京东乾石科技有限公司 Crop information storage method, device, electronic equipment and computer readable medium
CN116863249A (en) * 2023-09-01 2023-10-10 山东拓新电气有限公司 Coal mine conveyor belt deviation recognition method based on artificial intelligence
CN116863249B (en) * 2023-09-01 2023-11-21 山东拓新电气有限公司 Coal mine conveyor belt deviation recognition method based on artificial intelligence
CN117615255A (en) * 2024-01-19 2024-02-27 深圳市浩瀚卓越科技有限公司 Shooting tracking method, device, equipment and storage medium based on cradle head
CN117615255B (en) * 2024-01-19 2024-04-19 深圳市浩瀚卓越科技有限公司 Shooting tracking method, device, equipment and storage medium based on cradle head

Also Published As

Publication number Publication date
CN110334635A (en) 2019-10-15
CN110334635B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
WO2020259264A1 (en) Subject tracking method, electronic apparatus, and computer-readable storage medium
WO2020259179A1 (en) Focusing method, electronic device, and computer readable storage medium
US11457138B2 (en) Method and device for image processing, method for training object detection model
US11178324B2 (en) Focusing method and device, electronic device and computer-readable storage medium
WO2021022983A1 (en) Image processing method and apparatus, electronic device and computer-readable storage medium
WO2019233297A1 (en) Data set construction method, mobile terminal and readable storage medium
US9615039B2 (en) Systems and methods for reducing noise in video streams
WO2021057652A1 (en) Focusing method and apparatus, electronic device, and computer readable storage medium
WO2019233263A1 (en) Method for video processing, electronic device and computer-readable storage medium
WO2020259474A1 (en) Focus tracking method and apparatus, terminal device, and computer-readable storage medium
WO2019148978A1 (en) Image processing method and apparatus, storage medium and electronic device
US20220166930A1 (en) Method and device for focusing on target subject, and electronic device
US11538175B2 (en) Method and apparatus for detecting subject, electronic device, and computer readable storage medium
CN110650291B (en) Target focus tracking method and device, electronic equipment and computer readable storage medium
CN109712177B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2019233260A1 (en) Method and apparatus for pushing advertisement information, storage medium and electronic device
CN110276831B (en) Method and device for constructing three-dimensional model, equipment and computer-readable storage medium
TWI749364B (en) Motion detection method and motion detection system
CN110490196B (en) Subject detection method and apparatus, electronic device, and computer-readable storage medium
CN110378934B (en) Subject detection method, apparatus, electronic device, and computer-readable storage medium
CN110365897B (en) Image correction method and device, electronic equipment and computer readable storage medium
CN110399823B (en) Subject tracking method and apparatus, electronic device, and computer-readable storage medium
CN110688926B (en) Subject detection method and apparatus, electronic device, and computer-readable storage medium
WO2022227916A1 (en) Image processing method, image processor, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20832211

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20832211

Country of ref document: EP

Kind code of ref document: A1
