CN108776822B - Target area detection method, device, terminal and storage medium

Info

Publication number: CN108776822B
Application number: CN201810650498.4A
Authority: CN (China)
Other versions: CN108776822A (Chinese)
Prior art keywords: classification, image, region, node, target area
Inventor: 姜媚
Applicant and assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The embodiment of the invention discloses a target area detection method, device, terminal and storage medium, belonging to the field of computer technology. The method comprises the following steps: determining a plurality of sample regions and their classification results; obtaining a classifier comprising a plurality of classification nodes arranged in sequence; training the first classification node in the classifier according to the plurality of sample regions and their classification results, and, once that node is trained, continuing with the next classification node until all of the classification nodes are trained; and, when it is determined that the currently tracked image does not include the target area, applying the trained classifier to classify at least one region in a second image after the currently tracked image and determining the target area in the second image according to the classification result. Not every frame of image needs to be detected, which reduces unnecessary computation; the accuracy of the classifier is improved, and with it the accuracy of the detected target area.

Description

Target area detection method, device, terminal and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a target area detection method, device, terminal and storage medium.
Background
With the rapid development of the internet and the rise of video-based social networking, the dominant form of internet content has gradually shifted from text and pictures to video, and various video processing functions, such as video filters and video tagging, have appeared one after another. With these functions, certain target areas in a video can be processed in a personalized way, making videos more engaging.
In the related art, while a terminal plays a video, a user may manually determine a target area in the current image, and the terminal edits that target area, for example by adding a sticker to it or beautifying it. Taking the position of the target area in the current image as a reference, the terminal then performs forward tracking and backward tracking from the current image and determines the position of the target area in each frame of image before and after it, so that the target area in every frame receives the same editing and consistency between images is ensured.
However, if the terminal's position or posture changes substantially while the video is being shot, some images of the video may not include the target area. When tracking reaches an image that does not include the target area, tracking of the target area fails, and even if the target area reappears in subsequently tracked images, it is difficult to detect it again.
Disclosure of Invention
The embodiment of the invention provides a target area detection method, a target area detection device, a terminal and a storage medium, which can solve the problems in the related art. The technical scheme is as follows:
in one aspect, a target area detection method is provided, and the method includes:
according to a target region determined in a first image of a video by a user, determining a plurality of sample regions and classification results of the plurality of sample regions, wherein the classification results are used for indicating whether the sample regions belong to the target region or not;
obtaining a classifier to be trained, wherein the classifier comprises a plurality of classification nodes arranged in sequence;
training a first classification node in the classifier according to the plurality of sample regions and classification results of the plurality of sample regions, and continuing to train a next classification node after the training of the first classification node is finished until the training of all the classification nodes is finished;
the target area is tracked in the images of the video other than the first image, and, when it is determined that the currently tracked image does not include the target area, the trained classifier is applied to classify at least one region in a second image after the currently tracked image and the target area in the second image is determined according to the classification result.
In another aspect, there is provided a target area detecting apparatus, the apparatus including:
a sample determining module, configured to determine, according to a target region determined in a first image of a video by a user, a plurality of sample regions and classification results of the plurality of sample regions, where the classification results are used to indicate whether the sample regions belong to the target region;
the training device comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring a classifier to be trained, and the classifier comprises a plurality of classification nodes which are sequentially arranged according to the sequence;
the training module is used for training a first classification node in the classifier according to the plurality of sample regions and classification results of the plurality of sample regions, and continuing to train a next classification node after the training of the first classification node is finished until the training of all the plurality of classification nodes is finished;
the detection module is used for tracking the target area in other images except the first image in the video, applying the trained classifier to classify at least one area in a second image after the current tracked image when the current tracked image is determined not to include the target area, and determining the target area in the second image according to the classification result.
In yet another aspect, a terminal for detecting a target area is provided, where the terminal includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the target area detection method described above.
In yet another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored, which is loaded and executed by a processor to implement the target area detection method described above.
The method, device, terminal and storage medium provided by the embodiment of the invention determine a plurality of sample regions and their classification results according to a target region determined by a user in a first image of a video, where the classification results indicate whether the sample regions belong to the target region. A classifier to be trained is obtained, comprising a plurality of classification nodes arranged in sequence; the first classification node in the classifier is trained according to the plurality of sample regions and their classification results, and, once it is trained, the next classification node is trained, until all of the classification nodes are trained. The target region is tracked in the images of the video other than the first image, and when it is determined that the currently tracked image does not include the target region, the trained classifier is applied to classify at least one region in a second image after the currently tracked image, and the target region in the second image is determined according to the classification result. Not every frame of image needs to be detected, which reduces unnecessary computation. Moreover, a dynamic-programming-style training scheme is adopted: the next classification node is trained only after the previous classification node in the classifier has been trained, which improves the accuracy of the classifier. Because the classifier has already been trained, it can be applied to classify regions and re-detect the target region as soon as tracking of the target region fails, which improves the accuracy of the target region.
In addition, when the target area deforms significantly, the classifier can be updated according to the deformed target area, so that the new appearance of the target area is learned in time. This improves the robustness and reliability of the classifier, so the target area can still be detected accurately and promptly when the terminal shakes or rotates rapidly, when the target is occluded, and so on, giving a more satisfactory detection effect.
Moreover, a classifier with a linear structure is adopted, so that the classification space can be subdivided to the maximum extent for a fixed number of classification nodes, which improves classification accuracy.
Furthermore, image visual information is combined with sensor data: the position of the target area is estimated from the pose information provided by the configured sensor, and tracking or detection is performed only within the estimated target area, with no need to track or detect regions outside it. This avoids tracking failures caused by sensor errors, reduces unnecessary computation, and increases operation speed.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a TLD algorithm provided by the related art;
fig. 2 is a schematic diagram of a target area detection method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a classifier according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an image tracking system provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a feature point provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of another feature provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a coordinate system provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a cascaded classifier provided by an embodiment of the present invention;
FIG. 9 is a schematic flow chart of an operation provided by the embodiment of the invention;
FIG. 10 is a schematic diagram of a tracking speed provided by an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a target area detection apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Before describing the embodiments of the present invention in detail, the TLD (Tracking-Learning-Detection) algorithm is first described as follows:
the TLD algorithm is used for long-time tracking of a single object in a video, and, referring to fig. 1, the TLD algorithm includes three modules: the device comprises a tracking module, a detection module and a learning module.
Firstly, a tracking module:
the tracking module is used for tracking the motion change situation between any two adjacent images, and determining the position of the target area in the next frame image according to the position of the target area in the previous frame image and the motion change situation between the two frame images. The tracking module is only active when there is a target area in the next frame of image.
And the tracking module also provides the target area tracked in the next frame image as a positive sample area to the learning module, and the positive sample area is used for training the classifier by the learning module.
II, a detection module:
the detection module is used for scanning the image comprehensively, classifying the scanned area by using a classifier, finding out an area similar to the target area, generating a positive sample area and a negative sample area, and providing the areas to the learning module.
When the tracking module fails to track due to the fact that the target area does not exist in the tracked image, the detection module may provide the found target area to the tracking module, and the tracking module continues to track in the subsequent image.
Thirdly, a learning module:
the learning module is used for carrying out iterative training on the classifier of the detection module according to the sample areas provided by the tracking module and the detection module, and the classification accuracy of the classifier is improved.
In the related art, when a target area is tracked across the multiple frames of a video and tracking fails at an image that does not include the target area, the target area still needs to be detected again in later images if it reappears there, so that tracking can continue. However, the detection process requires a classifier, and that classifier is trained from the target areas tracked so far; when tracking fails, the classifier has not been fully trained, so its accuracy is poor and the target area is difficult to detect accurately.
The embodiment of the invention provides a target region detection method, which can train a classifier according to classification results of a plurality of sample regions and a plurality of sample regions in a first image after a user determines a target region in the first image. Even if the tracking fails, the trained classifier can be applied to accurately detect the target area.
The embodiment of the invention can be applied to a scene for editing a video, when a user manually determines a target area in one image of the video, the terminal can edit the target area, detect the target area in other images of the video and edit the target area in the other images in the same way.
For example, when a user takes a video and selects a head area, the terminal may add a sticker to the head area on each frame of image in the video, and as the position of the head area changes, the position of the sticker also changes accordingly.
Fig. 2 is a schematic diagram of a target area detection method according to an embodiment of the present invention. The execution subject of the embodiment of the present invention is a terminal, and referring to fig. 2, the method includes:
201. the terminal acquires a target area determined by a user in a first image of the video.
The terminal can be a mobile phone, a smart camera and other devices, and is provided with a camera, so that images or videos can be shot through the camera. The video comprises a plurality of frames of images, the first image is any one of the images in the video, and may be the first frame of image in the video, or may also be an image played by the video when the user triggers an editing instruction, and the like.
For example, in the process of playing a video by the terminal, when a play pause instruction is detected, a currently played first image is displayed, a user can select a target area in the first image to indicate that editing processing is to be performed on the target area, and when the terminal detects an operation of selecting the target area, the target area is acquired. The operation of selecting the target area may be a sliding operation or a clicking operation, and the target area may be determined according to a start position and an end position of the sliding operation or according to a clicking area of the clicking operation.
202. The terminal determines a plurality of sample regions and classification results of the plurality of sample regions according to the target region in the first image.
Wherein each sample region has a classification result indicating whether the sample region belongs to the target region, a positive sample region if the sample region belongs to the target region, and a negative sample region if the sample region does not belong to the target region.
In one possible implementation, the terminal performs region detection on the first image to obtain a plurality of sample regions, determines an overlap rate between each sample region and the target region according to the position of each sample region and the target region in the first image, and determines classification results of the plurality of sample regions according to the overlap rates between the plurality of sample regions and the target region.
Alternatively, when performing the region detection on the first image, the first image may be traversed by using a window with a fixed size, and a plurality of sample regions with corresponding sizes are obtained. The size of the window may be smaller than the size of the target area, so as to select a plurality of sample areas belonging to the target area, and the size of the window may be determined according to the size of the target area and the requirement for accuracy.
Alternatively, for each sample region, when the overlap ratio between the sample region and the target region is greater than a preset value, it is determined that the sample region belongs to the target region, and when the overlap ratio between the sample region and the target region is not greater than the preset value, it is determined that the sample region does not belong to the target region. The preset value may be 0 or 50%, etc., and is determined according to the requirement for accuracy.
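As a concrete illustration of this sampling step, the sketch below slides a fixed-size window over the first image and labels each window by its overlap rate with the user-selected target region; this is only a minimal sketch, and the window size, stride, the intersection-over-sample-area overlap measure, and the 50% preset value are assumptions chosen for the example, not values fixed by the embodiment.

```python
# Sketch of sample-region generation and labeling (assumed window/stride/threshold values).
def overlap_rate(box_a, box_b):
    """Intersection area divided by the sample-region area (one possible overlap measure)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (ix * iy) / float(aw * ah)

def collect_samples(image_shape, target_box, win=(32, 32), stride=8, preset=0.5):
    """Traverse the first image with a fixed-size window; label each window as a
    positive sample region (belongs to the target region) or a negative one."""
    h, w = image_shape[:2]
    samples = []
    for y in range(0, h - win[1] + 1, stride):
        for x in range(0, w - win[0] + 1, stride):
            box = (x, y, win[0], win[1])
            label = 1 if overlap_rate(box, target_box) > preset else 0
            samples.append((box, label))
    return samples
```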
Of course, the classification result of each sample region may also be determined in other manners, such as comparing the sample region with the target region, calculating the similarity between the sample region and the target region, determining the classification result of the sample region according to the similarity, and the like.
203. The terminal acquires a classifier to be trained, wherein the classifier comprises a plurality of classification nodes which are sequentially arranged according to the sequence.
In the embodiment of the invention, in order to ensure the classification accuracy, the classifier is not trained according to the tracked target region in the process of tracking the target region, but is trained according to the sample region in the first image before the target region is tracked, so that the more accurate classifier can be applied to detect the target region when the target region is not tracked.
And the classifier adopted by the terminal comprises a plurality of classification nodes which are sequentially arranged according to the sequence, the classification nodes form a linear structure, and each classification node can be used for region classification. The structure of the classifier can be as shown in fig. 3.
204. And the terminal trains a first classification node in the classifier according to the classification results of the plurality of sample regions and the plurality of sample regions, and continues to train a next classification node after the training of the first classification node is finished until the training of all the plurality of classification nodes is finished.
The embodiment of the invention provides a dynamic programming training method: starting from a classifier containing only one classification node, the first classification node is trained; when its training is finished, the first classification node is fixed and the second classification node is trained, and so on, until all the classification nodes in the classifier are trained. In this way the part of the classifier trained so far can be kept optimal during each round of training, an optimal classifier is obtained, and the accuracy of the classifier is improved.
In a possible implementation manner, the terminal initializes the node parameters of the plurality of classification nodes, and trains the node parameter of the first classification node in the classifier according to the plurality of sample regions and their classification results to obtain the trained node parameter of the first classification node. It then continues to train the node parameter of the next classification node according to the plurality of sample regions, their classification results, and the trained node parameters of the previous classification nodes, obtaining the trained node parameter of that node, until all of the classification nodes have been trained. At that point, the node parameters of all classification nodes are trained and the classification nodes can be used for classification.
Optionally, when any classification node in the classifier outputs a first classification numerical value, it indicates that the current region to be classified belongs to the target region, and when any classification node outputs a second classification numerical value, it indicates that the current region to be classified does not belong to the target region. The first classification value is different from the second classification value, for example, the second classification value is 0 when the first classification value is 1, or the second classification value is 1 when the first classification value is 0.
Alternatively, the node parameter of each classification node may include two pixel positions i and j and a threshold value x, i and j being positive integers. When a region of an image is input into a classification node, the region may be classified according to whether a difference between a gray level at a pixel position i and a gray level at a pixel position j is greater than a threshold x, when the difference is not greater than the threshold x, a classification result of the classification node is determined to be a first classification value, and when the difference is greater than the threshold x, a classification result of the classification node is determined to be a second classification value.
Referring to fig. 3, the classifier includes n classification nodes, which output n classification values. These values are combined, in node order, into a binary number; converted to decimal, its value ranges from 0 to 2^n - 1. Since each classification node has two classification outputs, 0 and 1, the classifier has 2^n classification spaces, so the classification space is subdivided to the maximum extent for a fixed number of classification nodes and classification accuracy is improved. Here n is a positive integer, such as 6 or 10.
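To make the classifier structure and the node-by-node training of step 204 concrete, the sketch below assumes each sample region has been resized to a common grayscale patch, each node holds the two pixel positions i, j and threshold x described above, and every node is fit greedily (by a simple partition-purity criterion over a random candidate search) while previously trained nodes stay fixed. The fitting criterion, the candidate search, and all numeric defaults are assumptions for illustration only, not the embodiment's actual optimization procedure.

```python
import random

def node_output(gray_region, node):
    """One classification node: compare the gray values at pixel positions i and j
    against threshold x (first/second classification values are taken as 1 and 0)."""
    i, j, x = node
    diff = int(gray_region.flat[i]) - int(gray_region.flat[j])
    return 1 if diff <= x else 0

def partition_purity(regions, labels, nodes):
    """Accuracy obtained if every code cell (the binary combination of node outputs)
    is assigned the majority label of the samples falling into it."""
    cells = {}
    for region, label in zip(regions, labels):
        code = tuple(node_output(region, nd) for nd in nodes)
        pos, neg = cells.get(code, (0, 0))
        cells[code] = (pos + label, neg + (1 - label))
    return sum(max(pos, neg) for pos, neg in cells.values()) / float(len(labels))

def train_classifier(regions, labels, n_nodes=6, n_candidates=200, seed=0):
    """Train the first node, fix it, then train the next, and so on; each node is
    chosen greedily to maximize the purity of the refined partition."""
    rng = random.Random(seed)
    n_pixels = regions[0].size
    nodes = []
    for _ in range(n_nodes):
        best, best_score = None, -1.0
        for _ in range(n_candidates):
            cand = (rng.randrange(n_pixels), rng.randrange(n_pixels), rng.randint(-32, 32))
            score = partition_purity(regions, labels, nodes + [cand])
            if score > best_score:
                best, best_score = cand, score
        nodes.append(best)  # previously trained nodes stay fixed
    return nodes
```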
205. The terminal selects a plurality of positive sample regions belonging to the target region from the plurality of sample regions, and determines a target classification result according to a classification result obtained by classifying the positive sample regions by the classifier.
In order to find the classification space in which the target area lies among the plurality of classification spaces, the terminal can take the plurality of positive sample areas and, for each positive sample area, apply the plurality of classification nodes to classify it, obtaining the classification value output by each node. The classification values output by the classification nodes are combined, in node order, into a binary number, and the decimal value corresponding to that binary number is taken as the classification result of the positive sample area. The classification result that occurs most often among the plurality of positive sample areas is then determined as the target classification result. Afterwards, a region is determined to belong to the target area only when its classification result equals the target classification result, and is determined not to belong to the target area when its classification result does not equal the target classification result.
For example, after a positive sample region is input into the classifier, the binary value formed by the combination of the classification values output by the classification nodes 1 to n is 100110, and the corresponding decimal value is 38.
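Continuing the sketch above, the bookkeeping of steps 204-205 — combining the node outputs into a binary number, converting to decimal, and taking the most frequent result over the positive sample regions as the target classification result — can be written as follows; helper names are illustrative, and node_output is the function sketched earlier.

```python
from collections import Counter

def classify_region(gray_region, nodes):
    """Combine the n node outputs, in node order, into a binary number and take
    the corresponding decimal value as the classification result of the region."""
    bits = [node_output(gray_region, nd) for nd in nodes]
    return int("".join(str(b) for b in bits), 2)

def target_classification_result(positive_regions, nodes):
    """The classification result occurring most often among the positive sample
    regions is taken as the target classification result."""
    counts = Counter(classify_region(r, nodes) for r in positive_regions)
    return counts.most_common(1)[0][0]
```

For the example above, int("100110", 2) indeed evaluates to 38; a region in a later image would then be judged to belong to the target region only when classify_region(...) equals this target classification result.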
206. The terminal tracks the target area in the other images in the video except the first image.
Referring to fig. 4, the terminal may perform forward tracking for images in the video that temporally precede the first image to determine the target region in the images, and backward tracking for images in the video that temporally follow the first image to determine the target region in the images.
Specifically, the terminal detects a target area in the first image to obtain a plurality of feature points, determines the positions of the feature points in other images by tracking the feature points in any two adjacent images, and determines the target area in other images according to the positions of the feature points in other images.
When extracting the feature points, referring to fig. 5, the terminal may adopt a uniform grid point extraction mode, set a plurality of uniform and equal grids in the first image, and select a point in each grid as a feature point, so as to quickly select a fixed number of feature points.
Alternatively, considering that the selected feature points need to effectively reflect the features of the image, algorithms such as FAST (Features from Accelerated Segment Test), Harris (a corner detection algorithm), SURF (Speeded-Up Robust Features), BRISK (Binary Robust Invariant Scalable Keypoints), and the like may be used to extract the feature points from the first image. The extracted feature points are as shown in fig. 6 and can reflect the image features of the target region.
In a possible implementation manner, the terminal may track the plurality of feature points in the next frame image from the first image, find a matching feature point in the next frame image, thereby obtaining motion information of the plurality of feature points, where the motion information may indicate a position change condition of the next frame image relative to the first image, and perform iterative computation according to the position of the target area in the first image and the motion information of the plurality of feature points, so as to determine the position of the target area in the next frame image, thereby tracking the target area. And performing iterative calculation on subsequent images in a similar tracking manner according to the position of the target area in the previous frame of image and the motion information of the plurality of characteristic points to determine the position of the target area in the next frame of image.
The terminal may acquire the motion information of the feature points by using an optical flow matching algorithm, or acquire the motion information of the feature points by using another algorithm.
After the terminal acquires the motion information of the plurality of feature points, it can determine from that motion information the positions of the feature points in the previous frame image and their positions in the next frame image, and thereby determine a rotation-translation matrix of the next frame image relative to the previous frame image. The displacement parameter in the rotation-translation matrix is the position change information of the next frame image relative to the previous frame image, and the position of the target area in the next frame image can be determined according to the displacement parameter.
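One possible realization of this feature-point tracking uses OpenCV's pyramidal Lucas-Kanade optical flow and a partial affine fit for the rotation-translation matrix; the specific functions, the minimum point count, and the box update are assumptions for illustration rather than the embodiment's exact procedure.

```python
import cv2
import numpy as np

def track_target(prev_gray, next_gray, prev_pts, prev_box):
    """Track feature points between two adjacent frames with pyramidal Lucas-Kanade
    optical flow and move the target region accordingly.
    prev_pts is a float32 array of shape (N, 1, 2), e.g. from cv2.goodFeaturesToTrack."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, prev_pts, None)
    good_prev = prev_pts[status.ravel() == 1]
    good_next = next_pts[status.ravel() == 1]
    if len(good_next) < 4:
        return None, good_next  # too few matches: treat as tracking failure
    # Rotation-translation (partial affine) matrix of the next frame relative to the previous one.
    M, _inliers = cv2.estimateAffinePartial2D(good_prev, good_next)
    if M is None:
        return None, good_next
    x, y, w, h = prev_box
    cx, cy = x + w / 2.0, y + h / 2.0
    ncx, ncy = M @ np.array([cx, cy, 1.0])  # displaced center of the target region
    return (ncx - w / 2.0, ncy - h / 2.0, w, h), good_next
```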
In a possible implementation manner, for a video shot in real time, the terminal may acquire, during the process of shooting the video, pose information of the camera when shooting each frame of image through a configured sensor, where the pose information may represent a current position and a current posture of the camera, and the sensor may include an acceleration sensor, a gyroscope sensor, and the like. And obtaining an estimated target area in the next frame of image according to the variation of the pose information between any two adjacent frames of images and the position of the target area in the previous frame of image. The feature point tracking is carried out in the estimation target area to determine the position of the target area in the next frame image without carrying out the feature point tracking on the area outside the estimation target area, thereby reducing the unnecessary calculation amount and improving the tracking speed.
It should be noted that if the position or posture of the camera is changed too much when the video is captured, which may result in some images not including the target area, the following step 207 may be performed to detect the target area again in the subsequent images.
Or, if parameters such as the position or the posture of the camera are changed too much when the video is shot, a target area in some images is deformed greatly, and the target area is difficult to track according to the originally extracted feature points. In order to ensure that the target region can be detected even in such a case, in a possible implementation manner, taking tracking to the third image as an example, the terminal may obtain a tracking error of the third image when tracking to the target region in the third image, and when the tracking error is greater than a first preset threshold, indicating that the target region is deformed greatly, collect the tracked target region in the third image as a sample region, and update the classifier according to the sample region, so as to obtain an updated classifier.
The tracking Error may be FB (Forward-Backward) Error, NCC (Normalized Cross Correlation) Error, SSD (Sum-of-squared differences) Error, or the like.
In another possible implementation manner, the terminal may set a first preset threshold and a second preset threshold, where the second preset threshold is greater than the first preset threshold, and when the tracking error of the third image is greater than the first preset threshold and is not greater than the second preset threshold, collect a target region tracked in the third image as a sample region, and update the classifier according to the sample region to obtain an updated classifier. When the tracking error is greater than the second preset threshold, it indicates that the tracking is failed, and the currently tracked area error is too large to be the target area, so it is determined that the third image does not include the target area, and at this time, the following step 207 still needs to be performed to re-detect the target area in the subsequent image.
In addition to the situation that the target area is greatly deformed, the terminal can also set preset time or preset number, after images are arranged at intervals of the preset time or the preset number, the target area tracked currently is used as a sample area, the classifier is updated according to the sample area, and the updated classifier is obtained.
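The resulting control flow — re-detect on a large tracking error, update the classifier on a moderate error or after a preset number of frames, and otherwise keep tracking — can be summarized as below; the threshold values and the frame interval are illustrative placeholders, not values specified by the embodiment.

```python
def decide_after_tracking(tracking_error, frames_since_update,
                          t1=0.5, t2=2.0, update_interval=30):
    """Decide, after tracking one frame, whether to keep the result, collect it as a
    sample region and update the classifier, or fall back to re-detection."""
    if tracking_error > t2:
        return "redetect"            # tracking failed; apply the trained classifier (step 207)
    if tracking_error > t1 or frames_since_update >= update_interval:
        return "update_classifier"   # large deformation, or periodic refresh of the classifier
    return "keep_tracking"
```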
207. When the target area is determined not to be included in the currently tracked image, the terminal applies the trained classifier to classify at least one area in a second image behind the currently tracked image, and the target area in the second image is determined according to the classification result.
If the target area cannot be tracked when a certain currently tracked image is reached by starting tracking from the first image, it can be determined that the target area is not included in the currently tracked image, and the tracking fails. At this time, the target area needs to be re-detected in an image subsequent to the currently tracked image, and the tracking can be continued.
When performing the forward tracking, the image after the currently tracked image is an image temporally located before the current image, and when performing the backward tracking, the image after the currently tracked image is an image temporally located after the current image.
Taking a second image after the currently tracked image as an example, the terminal may perform region detection on the second image to obtain at least one region in the second image, input the at least one region into a trained classifier, and classify the at least one region by using the classifier to obtain a classification result, that is, determine which regions in the at least one region belong to the target region and which regions do not belong to the target region, thereby determining the position of the target region in the second image according to the classification result, and implementing relocation of the target region.
In a possible implementation manner, for a video shot in real time, the terminal may acquire, during the process of shooting the video, pose information of the camera when shooting each frame of image through a configured sensor, where the pose information may represent a current position and a current posture of the camera, and the sensor may include an acceleration sensor, a gyroscope sensor, and the like. And obtaining an estimated target area in the next frame of image according to the variation of the pose information between any two adjacent frames of images and the position of the target area in the previous frame of image. And performing area detection on the estimated target area to obtain at least one area, determining the accurate position of the target area after classifying by using the classifier, and detecting other areas except the estimated target area without detection, so that unnecessary calculated amount can be reduced, and the detection speed is increased.
The coordinate system of the terminal can be as shown in fig. 7. The displacement of the terminal in three directions while any two adjacent images are shot can be obtained through the sensor, and from the position X_t of the target area in the previous frame image, its position X_{t+1} in the next frame image can be estimated with the following formula:

X_{t+1} = K * R * K^(-1) * X_t

where x and y denote the two-dimensional coordinates of a pixel point, X = (x, y, 1)^T denotes the homogeneous coordinates of the pixel point, K denotes the parameter matrix of the camera,

K = | fx  0  cx |
    |  0  fy cy |
    |  0  0   1 |

fx, fy, cx and cy are parameters of the camera, and R denotes the rotation-translation matrix between the two frames of images, which can be determined from the displacement of the terminal in three directions while the two adjacent images were shot.
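Applied directly, this estimate is a few lines of matrix algebra. In the sketch below, R is assumed to be the 3x3 rotation derived from the sensor readings between the two frames, and the fx, fy, cx, cy values in K are placeholders rather than calibrated parameters.

```python
import numpy as np

def estimate_position(K, R, x_t):
    """Estimate the pixel position of the target area in the next frame:
    X_{t+1} = K * R * K^{-1} * X_t, with X the homogeneous pixel coordinate."""
    X_t = np.array([x_t[0], x_t[1], 1.0])
    X_next = K @ R @ np.linalg.inv(K) @ X_t
    return X_next[:2] / X_next[2]

# Illustrative intrinsic matrix; fx, fy, cx, cy here are placeholder camera parameters.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
```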
Based on the possible implementation manner in step 204, for each region in the second image, the terminal may apply a plurality of classification nodes to classify the region, respectively, to obtain classification values output by the plurality of classification nodes, combine the classification values output by the plurality of classification nodes, respectively, according to the sequence of the plurality of classification nodes, to form a binary value, take a decimal value corresponding to the binary value as a classification result of the region, and determine whether the classification result of the region is equal to the target classification result, when the classification result is equal to the target classification result, determine that the region belongs to the target region, and when the classification result is not equal to the target classification result, determine that the region does not belong to the target region. By adopting the method, whether each area in the second image belongs to the target area or not can be determined, and then the position of the target area in the second image can be determined.
In a possible implementation manner, the classifier may be used to screen a region in the second image to obtain a plurality of regions that may belong to the target region, and then the nearest neighbor classifier may be used to continue screening the remaining regions, that is, the similarity between each region and the target region is calculated, when the similarity is greater than the preset similarity, it is determined that the region belongs to the target region, and when the similarity is not greater than the preset similarity, it is determined that the region does not belong to the target region, and the region is filtered. After the screening is completed, the region belonging to the target region can be determined, and then the position of the target region in the second image can be determined.
In another possible implementation manner, the classifier may be used to screen the region in the second image to obtain a plurality of regions that may belong to the target region. At this time, the descriptors of each feature point in the target region may be combined to form the feature of the target region, a feature matching classifier is applied, the feature point is extracted for each remaining region, the feature of the region is combined according to the descriptors of each feature point in the region, the distance between the feature of the region and the feature of the target region is calculated, when the distance is smaller than a preset distance, the region is determined to belong to the target region, when the distance is not smaller than the preset distance, the region is determined not to belong to the target region, and the region is filtered. The distance may be a euclidean distance or a hamming distance.
In another possible implementation manner, the linear classifier, the nearest neighbor classifier, and the feature matching classifier in step 207 may be combined to form a cascade classifier as shown in fig. 8, and the cascade classifier is applied to perform multiple screening to detect the target region in the second image. Referring to fig. 8, a region 1, a region 2, and a region 3 are input into a cascade classifier, a classifier of a linear structure determines that the region 1 does not belong to a target region, and the regions 2 and 3 belong to target regions, then the region 1 is filtered, the regions 2 and 3 are input into a nearest neighbor classifier, the nearest neighbor classifier determines that the region 2 does not belong to a target region, the region 3 belongs to a target region, then the region 2 is filtered, the region 3 is input into a feature point matching classifier, the feature point matching classifier determines that the region 3 belongs to a target region, then the output target region is the region 3.
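A schematic composition of the cascade in fig. 8 is sketched below: the linear-structure classifier (classify_region from the earlier sketch), a nearest-neighbor check, and a feature-matching check are applied in turn, and only regions surviving all three stages are kept. The nn_similarity and feature_distance callables and both preset values are assumed stand-ins for the embodiment's actual similarity and descriptor-distance computations.

```python
def cascade_detect(candidate_regions, nodes, target_result,
                   nn_similarity, feature_distance,
                   preset_similarity=0.6, preset_distance=64):
    """Pass candidate regions through three classifiers in cascade; regions
    surviving all three stages are reported as the target region."""
    survivors = []
    for region in candidate_regions:
        # Stage 1: linear-structure classifier keeps only the target classification result.
        if classify_region(region, nodes) != target_result:
            continue
        # Stage 2: nearest-neighbor classifier checks similarity with the target region.
        if nn_similarity(region) <= preset_similarity:
            continue
        # Stage 3: feature-matching classifier checks descriptor distance to the target features.
        if feature_distance(region) >= preset_distance:
            continue
        survivors.append(region)
    return survivors
```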
The tracking may be continued for the images subsequent to the second image, that is, a plurality of feature points are extracted from the target region in the second image, and the plurality of feature points are tracked in a manner similar to that in step 206 described above to find the target region.
In another embodiment, when each region in the second image does not belong to the target region, it indicates that the target region is not included in the second image, and at this time, the detection of the subsequent images may be continued until the target region is found in a certain image.
In the embodiment of the present invention, when a target area is tracked or detected in any frame of image of a video, the target area may be edited, for example, the target area is reduced or enlarged, a sticker or a light-emitting special effect is added to the target area, the target area is subjected to mosaic processing, and a specific processing manner may be set by default in a terminal or by a user. By editing the target area, the user can be helped to generate more abundant and vivid videos with individual characteristics, and entertainment and interestingness are enhanced.
The method provided by the embodiment of the invention determines a plurality of sample regions and their classification results according to a target region determined by a user in a first image of a video, where the classification results indicate whether the sample regions belong to the target region. A classifier to be trained is obtained, comprising a plurality of classification nodes arranged in sequence; the first classification node in the classifier is trained according to the plurality of sample regions and their classification results, and, once it is trained, the next classification node is trained, until all of the classification nodes are trained. The target region is tracked in the images of the video other than the first image, and when the currently tracked image does not include the target region, the trained classifier is applied to classify at least one region in a second image after the currently tracked image, and the target region in the second image is determined according to the classification result. Not every frame of image needs to be detected, which reduces unnecessary computation. Moreover, a dynamic-programming-style training scheme is adopted: the next classification node is trained only after the previous classification node in the classifier has been trained, which improves the accuracy of the classifier. Because the classifier has already been trained, it can be applied to classify regions and re-detect the target region when tracking of the target region fails, which improves the accuracy of the target region.
And when the target area is greatly deformed, the classifier can be updated according to the deformed target area, a new target area can be learned in time, the robustness and the reliability of the classifier are improved, the target area can be detected accurately in time under the conditions that the terminal is rapidly shaken and rotated, the target is shielded and the like, and the detection effect is ideal.
In addition, when the target area is tracked only by means of sensor data, once rapid and violent shaking occurs in the terminal, the sensor data fluctuation is large, the position of the target area deviates, and tracking failure is caused. The embodiment of the invention combines the image visual information with the sensor data, estimates the position of the target area through the pose information provided by the configured sensor, tracks or detects in the estimated target area, and does not need to track or detect other areas except the estimated target area, thereby avoiding the tracking failure caused by the sensor error, reducing the unnecessary calculated amount and improving the operation speed.
The operation flow chart of the embodiment of the present invention may be as shown in fig. 9, where the terminal may include a tracking module, a detection module, and a learning module, the tracking module is configured to execute the step 206, and provide the tracked target region as a positive sample region to the learning module, the detection module is configured to execute the step 201 and the step 205 to obtain the trained classifier, and when the tracking module fails to track, execute the step 207 to detect the target region again, and then the tracking module continues to track. And when the tracked target area is greatly deformed, the target area can be learned through the learning module, and the classifier is updated.
In the traditional TLD algorithm, the tracking module, detection module and learning module work together: for every frame of image, the results of the tracking module and the detection module are fused to determine the position of the target area, the determined target area is used as a positive sample area, and the classifier is trained through the learning module, which improves the robustness of the detection module. Because the traditional TLD algorithm is designed for tracking a single target and every frame of image has to be processed by all three parts, the amount of computation is large and the processing speed is slow.
In the method provided by the embodiment of the invention, each frame of image does not need to be detected and learned, the detection is only carried out when the tracking fails, and the learning is carried out when the target area is greatly deformed, so that unnecessary calculation amount is avoided.
In addition, the classifier applied by the detection module is trained in a dynamic programming mode before tracking, so that the accuracy of the classifier is improved, and the accuracy of the detected target area can be ensured by detecting when the tracking fails.
Because the traditional TLD algorithm uses a classifier with a binary tree structure, a classifier that contains 15 classification nodes in total requires 4 layers of classification and ultimately determines only 8 classification intervals; with fewer classification intervals, the classification accuracy is not high enough. In the embodiment of the invention, a classifier with a linear structure is adopted: with n classification nodes, the classifier has 2^n classification spaces, so the classification space is subdivided to the maximum extent for a fixed number of classification nodes and the classification accuracy is improved.
For 3 test videos, the tracking errors of the method adopted by the embodiment of the present invention and of the conventional TLD algorithm are shown in table 1 below; as table 1 shows, the embodiment of the present invention significantly reduces the tracking error and has higher accuracy.
TABLE 1

                  Test video 1    Test video 2    Test video 3
The invention          5.6            10.11            1.24
Legacy TLD             7.1            15.3             1.33
The tracking speed of the method adopted by the embodiment of the present invention, the CT (Compressive Tracking) algorithm, the conventional TLD algorithm and the ECO (Efficient Convolution Operators for Tracking) algorithm can be seen in fig. 10; as fig. 10 shows, the embodiment of the present invention significantly improves the tracking speed and can essentially achieve real-time tracking.
Fig. 11 is a schematic structural diagram of a target area detection apparatus according to an embodiment of the present invention. Referring to fig. 11, the apparatus includes:
a sample determination module 1101 configured to perform the step of determining the classification results of the plurality of sample regions and the plurality of sample regions in the above embodiment;
an obtaining module 1102, configured to perform the step of obtaining the classifier to be trained in the foregoing embodiment;
a training module 1103, configured to perform the steps of training a first classification node in the classifier according to the classification results of the multiple sample regions and the multiple sample regions in the foregoing embodiment, and continuing to train a next classification node after the training of the first classification node is completed until the training of all the multiple classification nodes is completed;
a detecting module 1104, configured to perform the steps of tracking a target region in other images except for the first image in the video in the foregoing embodiment, when it is determined that the target region is not included in the currently tracked image, applying a trained classifier, classifying at least one region in a second image after the currently tracked image, and determining the target region in the second image according to a classification result.
Optionally, the sample determining module 1101 includes:
a region detection unit configured to perform a step of performing region detection on the first image to obtain a plurality of sample regions in the above-described embodiment;
a determination unit configured to perform the step of determining the classification results of the plurality of sample regions according to the overlapping rates between the plurality of sample regions and the target region in the above-described embodiment.
Optionally, training module 1103 includes:
an initialization unit, configured to perform a step of initializing node parameters of a plurality of classification nodes in the above embodiment;
a training unit, configured to perform, according to the classification results of the plurality of sample regions and the plurality of sample regions in the above embodiment, a step of training a node parameter of a first classification node in the classifier to obtain a node parameter after the training of the first classification node;
and the training unit is further configured to continue to train the node parameter of the next classification node according to the multiple sample regions, the classification results of the multiple sample regions, and the node parameter after the last classification node is trained in the above embodiment, so as to obtain the node parameter after the next classification node is trained, until all the classification nodes are trained.
Optionally, when any classification node in the classifier outputs a first classification value, it indicates that the current region to be classified belongs to the target region, and when any classification node outputs a second classification value, it indicates that the current region to be classified does not belong to the target region;
the device still includes:
a selecting module, configured to perform the step of selecting, from the multiple sample regions, multiple positive sample regions belonging to the target region in the above embodiment;
the classification module is used for executing the step of applying a plurality of classification nodes to each positive sample region in the above embodiment, and classifying the positive sample regions respectively to obtain classification values output by the classification nodes respectively;
the combination module is used for combining the classification values respectively output by the classification nodes to form a binary number value according to the sequence of the classification nodes, and taking the decimal value corresponding to the binary number value as the classification result of the positive sample area;
and the target determination module is used for executing the step of determining the classification result with the largest occurrence number in the positive sample areas as the target classification result in the embodiment.
Optionally, the apparatus further comprises:
an error obtaining module, configured to perform the step of obtaining a tracking error when tracking the target region in a third image, except the first image, in the video in the embodiment;
a sample acquisition module, configured to perform the step of taking the tracked target region in the third image as a sample region when the tracking error is greater than the first preset threshold in the above embodiment;
and the updating module is used for updating the classifier according to the sample region to obtain the updated classifier in the embodiment.
Optionally, the detecting module 1104 is configured to apply a plurality of classification nodes to each region in the second image in the foregoing embodiment, and classify the region respectively to obtain classification values output by the plurality of classification nodes respectively; combining the classification values respectively output by the classification nodes to form binary values according to the sequence of the classification nodes, and taking decimal values corresponding to the binary values as the classification results of the regions; and a step of determining that the region belongs to the target region when the classification result is equal to the target classification result.
Optionally, the detecting module 1104 is configured to perform detection on the target area in the first image in the foregoing embodiment to obtain a plurality of feature points; determining the positions of a plurality of feature points in other images by tracking the feature points in any two adjacent images; and determining the target area in the other image according to the positions of the plurality of feature points in the other image.
Optionally, the apparatus further comprises:
an error obtaining module, configured to perform the step, in the above embodiment, of obtaining a tracking error when the target region is tracked in a third image, other than the first image, in the video;
a sample collection module, configured to perform the step of taking the target region tracked in the third image as a sample region when the tracking error is greater than a first preset threshold in the above embodiment;
and an updating module, configured to perform the step of updating the classifier according to the sample region to obtain the updated classifier in the above embodiment.
Optionally, the sample collection module is further configured to perform the step of taking the target region tracked in the third image as the sample region when the tracking error is greater than the first preset threshold and not greater than a second preset threshold, where the second preset threshold is greater than the first preset threshold;
the apparatus further comprises: a determining module, configured to perform the step of determining that the target area is not included in the third image when the tracking error is greater than the second preset threshold in the above embodiment. A minimal sketch of this two-threshold decision follows below.
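For illustration only (the threshold values and function names are hypothetical), the two-threshold handling of the tracking error could look like this:

    def handle_tracking_error(error, tracked_region, classifier,
                              first_threshold=0.5, second_threshold=2.0):
        # Error not above the first threshold: the tracked region is trusted as-is.
        if error <= first_threshold:
            return tracked_region
        # Error between the two thresholds: keep the tracked region, but also
        # use it as a sample region to update the classifier online.
        if error <= second_threshold:
            classifier.update(sample_region=tracked_region)
            return tracked_region
        # Error above the second threshold: the target area is considered
        # absent from this frame; classification-based detection takes over.
        return None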
It should be noted that: in the target area detection apparatus provided in the foregoing embodiment, when detecting the target area, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the terminal is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the target area detection apparatus provided in the above embodiments and the target area detection method embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
Fig. 12 is a block diagram illustrating a terminal 1200 according to an exemplary embodiment of the present invention. The terminal 1200 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, a desktop computer, a head-mounted device, or any other intelligent terminal. The terminal 1200 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.
In general, terminal 1200 includes: a processor 1201 and a memory 1202.
The processor 1201 may include one or more processing cores, such as a 4-core processor, a 5-core processor, or the like. The processor 1201 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1201 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, the processor 1201 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
The memory 1202 may include one or more computer-readable storage media, which may be non-transitory. The memory 1202 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1202 is used to store at least one instruction to be executed by the processor 1201 to implement the target area detection method provided by the method embodiments herein.
In some embodiments, the terminal 1200 may further optionally include: a peripheral interface 1203 and at least one peripheral. The processor 1201, the memory 1202, and the peripheral interface 1203 may be connected by a bus or signal line. Each peripheral may be connected to the peripheral interface 1203 via a bus, signal line, or circuit board. Specifically, the peripherals include: at least one of a radio frequency circuit 1204, a touch display 1205, a camera assembly 1206, an audio circuit 1207, a positioning component 1208, and a power supply 1209.
The peripheral interface 1203 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, memory 1202, and peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202 and the peripheral device interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1204 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 1204 converts an electric signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1204 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1204 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1204 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1205 is a touch display screen, the display screen 1205 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1201 as a control signal for processing. At this point, the display screen 1205 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1205, provided on the front panel of the terminal 1200; in other embodiments, there may be at least two display screens 1205, respectively disposed on different surfaces of the terminal 1200 or in a folded design; in still other embodiments, the display screen 1205 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1200. The display screen 1205 may even be arranged in a non-rectangular irregular shape, i.e., a shaped screen. The display screen 1205 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1206 is used to capture images or video. Optionally, camera assembly 1206 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1206 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1207 may include a microphone and a speaker. The microphone is used to capture sound waves of the user and the environment, convert the sound waves into electric signals, and input the electric signals to the processor 1201 for processing, or to the radio frequency circuit 1204 to implement voice communication. For stereo capture or noise reduction, there may be a plurality of microphones, provided at different portions of the terminal 1200. The microphone may also be an array microphone or an omnidirectional capture microphone. The speaker is used to convert electric signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The speaker may be a traditional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electric signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1207 may also include a headphone jack.
The positioning component 1208 is used to locate the current geographic location of the terminal 1200 to implement navigation or LBS (Location Based Service). The positioning component 1208 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1209 is used to provide power to the various components in the terminal 1200. The power supply 1209 may be an alternating-current supply, a direct-current supply, a disposable battery, or a rechargeable battery. When the power supply 1209 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charge technology.
In some embodiments, terminal 1200 also includes one or more sensors 1210. The one or more sensors 1210 include, but are not limited to: acceleration sensor 1211, gyro sensor 1212, pressure sensor 1213, fingerprint sensor 1214, optical sensor 1215, and proximity sensor 1216.
The acceleration sensor 1211 can detect the magnitudes of accelerations on the three coordinate axes of the coordinate system established with the terminal 1200. For example, the acceleration sensor 1211 may be used to detect the components of the gravitational acceleration on the three coordinate axes. The processor 1201 may control the touch display screen 1205 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal captured by the acceleration sensor 1211. The acceleration sensor 1211 may also be used to capture game or user motion data.
The gyro sensor 1212 may detect the body direction and rotation angle of the terminal 1200, and may cooperate with the acceleration sensor 1211 to capture a 3D motion of the user on the terminal 1200. The processor 1201 can implement the following functions according to the data captured by the gyro sensor 1212: motion sensing (such as changing the UI according to a tilting operation of the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1213 may be disposed on a side bezel of the terminal 1200 and/or an underlying layer of the touch display screen 1205. When the pressure sensor 1213 is disposed on the side bezel of the terminal 1200, it can detect the user's grip signal on the terminal 1200, and the processor 1201 can perform left-right hand recognition or shortcut operations according to the grip signal captured by the pressure sensor 1213. When the pressure sensor 1213 is disposed at the underlying layer of the touch display screen 1205, the processor 1201 controls the operability controls on the UI interface according to the user's pressure operation on the touch display screen 1205. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1214 is used to capture the user's fingerprint, and the processor 1201 identifies the user according to the fingerprint captured by the fingerprint sensor 1214, or the fingerprint sensor 1214 itself identifies the user according to the captured fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1201 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1214 may be provided on the front, back, or side of the terminal 1200. When a physical button or vendor logo is provided on the terminal 1200, the fingerprint sensor 1214 may be integrated with the physical button or vendor logo.
The optical sensor 1215 is used to capture the ambient light intensity. In one embodiment, the processor 1201 may control the display brightness of the touch display screen 1205 according to the ambient light intensity captured by the optical sensor 1215. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1205 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1205 is turned down. In another embodiment, the processor 1201 may also dynamically adjust the shooting parameters of the camera assembly 1206 according to the ambient light intensity captured by the optical sensor 1215.
The proximity sensor 1216, also known as a distance sensor, is typically disposed on the front panel of the terminal 1200. The proximity sensor 1216 is used to measure the distance between the user and the front surface of the terminal 1200. In one embodiment, when the proximity sensor 1216 detects that the distance between the user and the front surface of the terminal 1200 gradually decreases, the processor 1201 controls the touch display screen 1205 to switch from the screen-on state to the screen-off state; when the proximity sensor 1216 detects that the distance between the user and the front surface of the terminal 1200 gradually increases, the processor 1201 controls the touch display screen 1205 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 12 is not intended to be limiting of terminal 1200 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The embodiment of the present invention further provides a terminal for detecting a target area. The terminal includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the target area detection method of the foregoing embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored in the computer-readable storage medium, and the instruction, the program, the code set, or the instruction set is loaded and executed by a processor to implement the target area detection method of the foregoing embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only a preferred embodiment of the present invention, and should not be taken as limiting the invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A target area detection method, the method comprising:
according to a target region determined in a first image of a video by a user, determining a plurality of sample regions and classification results of the plurality of sample regions, wherein the classification results are used for indicating whether the sample regions belong to the target region or not;
obtaining a classifier to be trained, wherein the classifier comprises a plurality of classification nodes which are sequentially arranged according to a sequence, the classification nodes form a linear structure, and each classification node can be used for region classification;
training a first classification node in the classifier according to the plurality of sample regions and classification results of the plurality of sample regions, and continuing to train a next classification node after the training of the first classification node is finished until the training of all the classification nodes is finished;
tracking the target area in other images except the first image in the video, applying the trained classifier to classify at least one area in a second image after the currently tracked image when the currently tracked image is determined not to include the target area, and determining the target area in the second image according to the classification result;
the target area is tracked in the following manner: obtaining first pose information of the currently tracked image and second pose information of a previous frame image of the currently tracked image; determining an estimated target area in the currently tracked image according to the variation between the first pose information and the second pose information and the position of the target area in the previous frame image; and determining the position of the target area in the currently tracked image according to the estimated target area, wherein the pose information is used for representing the position and the pose of the shooting device when the image is shot.
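Purely as an illustration of the pose-based estimation recited in claim 1 (the concrete motion model is an assumption of this sketch; the claim only requires that the estimate follow from the pose variation and the previous position), the estimated target area could be obtained by shifting the previous frame's target area by the pose change:

    def estimate_target_area(prev_area, prev_pose, curr_pose):
        # prev_area: (x, y, w, h) of the target area in the previous frame image.
        # A pose is reduced here to an image-plane offset (x, y) for simplicity;
        # the variation between the two poses is applied to the previous position.
        dx = curr_pose[0] - prev_pose[0]
        dy = curr_pose[1] - prev_pose[1]
        x, y, w, h = prev_area
        return (x + dx, y + dy, w, h)   # estimated target area, refined afterwards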
2. The method of claim 1, wherein determining a plurality of sample regions and a classification of the plurality of sample regions based on a target region identified by a user in a first image of a video comprises:
carrying out region detection on the first image to obtain a plurality of sample regions;
determining classification results of the plurality of sample regions according to overlapping rates between the plurality of sample regions and the target region.
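As a sketch of claim 2 (the overlap measure and the 0.5 threshold are assumptions; the claim only requires that the classification result follow from the overlap rate), sample regions could be labeled positive or negative by their intersection-over-union with the user-selected target region:

    def overlap_rate(a, b):
        # a, b: rectangles (x, y, w, h); returns intersection area / union area.
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def label_sample_regions(sample_regions, target_region, threshold=0.5):
        # Classification result: 1 if the sample region belongs to the target
        # region (high overlap rate), 0 otherwise.
        return [1 if overlap_rate(r, target_region) >= threshold else 0
                for r in sample_regions]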
3. The method of claim 1, wherein the training a first classification node in the classifier according to the classification results of the plurality of sample regions and the plurality of sample regions, and continuing training a next classification node after the training of the first classification node is completed until all of the plurality of classification nodes are trained, comprises:
initializing node parameters of the plurality of classification nodes;
training the node parameter of a first classification node in the classifier according to the plurality of sample regions and the classification results of the plurality of sample regions to obtain the trained node parameter of the first classification node;
and continuing to train the node parameter of the next classification node according to the plurality of sample regions, the classification results of the plurality of sample regions and the node parameter trained by the last classification node to obtain the node parameter trained by the next classification node until the training of all the classification nodes is completed.
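A minimal sketch of the node-by-node training described in claim 3 (the node objects and their per-node training rule are placeholders; the claim only fixes the order of training and the hand-over of trained node parameters):

    def train_classifier(nodes, sample_regions, labels):
        # Initialize the node parameters of all classification nodes first.
        for node in nodes:
            node.init_params()
        prev_params = None
        # Train the nodes one after another, in the order in which they are
        # arranged; each node receives the trained parameters of the previous
        # node before its own training starts.
        for node in nodes:
            node.train(sample_regions, labels, prev_params)
            prev_params = node.params
        return nodes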
4. The method according to claim 3, wherein when any classification node in the classifier outputs a first classification value, it indicates that the region to be classified this time belongs to the target region, and when any classification node outputs a second classification value, it indicates that the region to be classified this time does not belong to the target region;
after the plurality of classification nodes are trained, the method further comprises:
selecting a plurality of positive sample regions belonging to the target region from the plurality of sample regions;
for each positive sample region, applying the classification nodes to classify the positive sample region respectively to obtain classification values output by the classification nodes respectively;
combining the classification values respectively output by the classification nodes to form a binary number value according to the sequence of the classification nodes, and taking a decimal value corresponding to the binary number value as a classification result of the positive sample area;
and determining the classification result with the largest occurrence number in the positive sample areas as a target classification result.
5. The method of claim 4, wherein the applying the trained classifier to classify at least one region in a second image subsequent to the currently tracked image, and determining the target region in the second image according to the classification result comprises:
for each region in at least one region in a second image after the currently tracked image, applying the plurality of classification nodes to classify the region respectively to obtain classification numerical values output by the plurality of classification nodes respectively;
combining the classification values respectively output by the classification nodes to form a binary number value according to the sequence of the classification nodes, and taking a decimal value corresponding to the binary number value as a classification result of the region;
determining that the region belongs to the target region when the classification result is equal to the target classification result.
6. The method of claim 1, wherein tracking the target region in an image other than the first image in the video comprises:
detecting the target area in the first image to obtain a plurality of feature points;
determining the positions of the plurality of feature points in the other images by tracking the plurality of feature points in any two adjacent images;
and determining the target area in the other image according to the positions of the plurality of characteristic points in the other image.
7. The method of claim 1, further comprising:
when the target area is tracked in a third image except the first image in the video, acquiring a tracking error;
when the tracking error is larger than a first preset threshold, taking the target region tracked in the third image as a sample region;
and updating the classifier according to the sample region to obtain the updated classifier.
8. The method according to claim 7, wherein the regarding the tracked target region in the third image as a sample region when the tracking error is greater than a first preset threshold comprises:
when the tracking error is greater than the first preset threshold and not greater than a second preset threshold, taking the tracked target region in the third image as a sample region, wherein the second preset threshold is greater than the first preset threshold;
when the tracking error is greater than the second preset threshold, determining that the target region is not included in the third image.
9. A target area detection apparatus, the apparatus comprising:
a sample determining module, configured to determine, according to a target region determined in a first image of a video by a user, a plurality of sample regions and classification results of the plurality of sample regions, where the classification results are used to indicate whether the sample regions belong to the target region;
an obtaining module, configured to obtain a classifier to be trained, wherein the classifier comprises a plurality of classification nodes arranged in sequence, the plurality of classification nodes form a linear structure, and each classification node can be used for region classification;
a training module, configured to train a first classification node in the classifier according to the plurality of sample regions and the classification results of the plurality of sample regions, and continue to train the next classification node after the training of the first classification node is finished, until the training of all of the plurality of classification nodes is finished;
a detection module, configured to track the target region in other images in the video except for the first image, apply the trained classifier to classify at least one region in a second image subsequent to the currently tracked image when it is determined that the target region is not included in the currently tracked image, and determine the target region in the second image according to a classification result;
the detection module is further configured to obtain first pose information of a currently tracked image and second pose information of a previous frame image of the currently tracked image, determine an estimated target area in the currently tracked image according to a variation between the first pose information and the second pose information and a position of the target area in the previous frame image, determine a position of the target area in the currently tracked image according to the estimated target area, and use the pose information to indicate a position and a pose of a shooting device when the image is shot.
10. The apparatus of claim 9, wherein the sample determination module comprises:
a region detection unit, configured to perform region detection on the first image to obtain a plurality of sample regions;
a determination unit, configured to determine classification results of the plurality of sample regions according to overlapping rates between the plurality of sample regions and the target region.
11. The apparatus of claim 9, wherein the training module comprises:
an initialization unit, configured to initialize node parameters of the plurality of classification nodes;
the training unit is used for training the node parameter of a first classification node in the classifier according to the plurality of sample regions and the classification results of the plurality of sample regions to obtain the trained node parameter of the first classification node;
the training unit is further configured to continue to train the node parameter of the next classification node according to the plurality of sample regions, the classification results of the plurality of sample regions, and the node parameter after the training of the last classification node, so as to obtain the node parameter after the training of the next classification node until all the classification nodes are trained.
12. The apparatus according to claim 11, wherein when any classification node in the classifier outputs a first classification value, it indicates that the region to be classified this time belongs to the target region, and when any classification node outputs a second classification value, it indicates that the region to be classified this time does not belong to the target region;
the device further comprises:
a selecting module for selecting a plurality of positive sample regions belonging to the target region from the plurality of sample regions;
a classification module, configured to apply, for each positive sample region, the plurality of classification nodes to classify the positive sample region respectively, so as to obtain the classification values output by the classification nodes respectively;
a combination module, configured to combine, in the order of the classification nodes, the classification values output by the classification nodes into a binary number, and take the decimal value corresponding to the binary number as the classification result of the positive sample region;
and a target determining module, configured to determine the classification result with the largest number of occurrences among the plurality of positive sample regions as a target classification result.
13. The apparatus of claim 9, further comprising:
an error obtaining module, configured to obtain a tracking error when the target region is tracked in a third image in the video, except for the first image;
a sample acquisition module, configured to take the target region tracked in the third image as a sample region when the tracking error is greater than a first preset threshold;
and the updating module is used for updating the classifier according to the sample region to obtain the updated classifier.
14. A terminal for detecting a target area, the terminal comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a code set, or an instruction set, the instruction, the program, the code set, or the instruction set being loaded and executed by the processor to implement the target area detection method according to any one of claims 1 to 8.
15. A computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, the instruction, the program, the code set, or the instruction set being loaded and executed by a processor to implement the target area detection method according to any one of claims 1 to 8.
CN201810650498.4A 2018-06-22 2018-06-22 Target area detection method, device, terminal and storage medium Active CN108776822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810650498.4A CN108776822B (en) 2018-06-22 2018-06-22 Target area detection method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810650498.4A CN108776822B (en) 2018-06-22 2018-06-22 Target area detection method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN108776822A CN108776822A (en) 2018-11-09
CN108776822B true CN108776822B (en) 2020-04-24

Family

ID=64025419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810650498.4A Active CN108776822B (en) 2018-06-22 2018-06-22 Target area detection method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN108776822B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241869B (en) * 2018-11-28 2024-04-02 杭州海康威视数字技术股份有限公司 Material checking method and device and computer readable storage medium
CN110084204B (en) * 2019-04-29 2020-11-24 北京字节跳动网络技术有限公司 Image processing method and device based on target object posture and electronic equipment
CN110245246B (en) * 2019-04-30 2021-11-16 维沃移动通信有限公司 Image display method and terminal equipment
CN112312203B (en) * 2020-08-25 2023-04-07 北京沃东天骏信息技术有限公司 Video playing method, device and storage medium
CN113743380B (en) * 2021-11-03 2022-02-15 江苏博子岛智能产业技术研究院有限公司 Active tracking method based on video image dynamic monitoring

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744924A (en) * 2013-12-26 2014-04-23 西安理工大学 Frequent pattern based selective ensemble classification method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831618B (en) * 2012-07-20 2014-11-12 西安电子科技大学 Hough forest-based video target tracking method
KR20160111151A (en) * 2015-03-16 2016-09-26 (주)이더블유비엠 image processing method and apparatus, and interface method and apparatus of gesture recognition using the same
CN106709932B (en) * 2015-11-12 2020-12-04 创新先进技术有限公司 Face position tracking method and device and electronic equipment
CN107066990B (en) * 2017-05-04 2019-10-11 厦门美图之家科技有限公司 A kind of method for tracking target and mobile device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744924A (en) * 2013-12-26 2014-04-23 西安理工大学 Frequent pattern based selective ensemble classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ROBUST HAND TRACKING BASED ON ONLINE LEARNING AND MULTI-CUE FLOCKS OF FEATURES; Hong Liu et al.; 2013 IEEE International Conference on Image Processing; 2014-02-13; pp. 3725-3729 *

Also Published As

Publication number Publication date
CN108776822A (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN109086709B (en) Feature extraction model training method and device and storage medium
CN110555883B (en) Repositioning method and device for camera attitude tracking process and storage medium
CN108629747B (en) Image enhancement method and device, electronic equipment and storage medium
CN110675420B (en) Image processing method and electronic equipment
CN110147805B (en) Image processing method, device, terminal and storage medium
CN108776822B (en) Target area detection method, device, terminal and storage medium
WO2019007258A1 (en) Method, apparatus and device for determining camera posture information, and storage medium
CN109815150B (en) Application testing method and device, electronic equipment and storage medium
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN110059652B (en) Face image processing method, device and storage medium
CN108288032B (en) Action characteristic acquisition method, device and storage medium
CN109360222B (en) Image segmentation method, device and storage medium
CN109522863B (en) Ear key point detection method and device and storage medium
CN110991457B (en) Two-dimensional code processing method and device, electronic equipment and storage medium
CN110290426B (en) Method, device and equipment for displaying resources and storage medium
US20230076109A1 (en) Method and electronic device for adding virtual item
CN114170349A (en) Image generation method, image generation device, electronic equipment and storage medium
CN112581358A (en) Training method of image processing model, image processing method and device
CN111754386A (en) Image area shielding method, device, equipment and storage medium
CN112508959B (en) Video object segmentation method and device, electronic equipment and storage medium
CN110135329B (en) Method, device, equipment and storage medium for extracting gestures from video
CN112235650A (en) Video processing method, device, terminal and storage medium
CN110163192B (en) Character recognition method, device and readable medium
CN110853124A (en) Method, device, electronic equipment and medium for generating GIF dynamic graph
CN113095163B (en) Video processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant