CN109934183B - Image processing method and device, detection equipment and storage medium


Info

Publication number: CN109934183B
Authority: CN (China)
Prior art keywords: features, target, residual
Legal status: Active (granted)
Application number: CN201910205458.3A
Other languages: Chinese (zh)
Other versions: CN109934183A
Inventors: 金晟, 刘文韬, 钱晨
Assignee: Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201910205458.3A
Publication of CN109934183A (application), CN109934183B (grant)

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an image processing method and apparatus, a detection device, and a storage medium. The image processing method includes: determining a target region of a target in an image; extracting a first class of features from the target region, where the first class of features includes image features of the target; obtaining a second class of features according to the distribution of the same target in two previous and subsequent frames of images; and tracking the target according to the first class of features and the second class of features.

Description

Image processing method and device, detection equipment and storage medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to an image processing method and apparatus, a detection device, and a storage medium.
Background
In fields such as security and motion analysis, keypoint detection needs to be performed on the human figures in an image, and spatial position information and/or human body feature information is obtained based on the detected keypoints. Various methods for detecting human body keypoints exist in the prior art, but their errors are found to be large; for example, one human figure may be recognized as several human figures.
Disclosure of Invention
The embodiment of the invention provides an image processing method and apparatus, a detection device, and a storage medium.
The technical solution of the invention is realized as follows: an image processing method, including:
determining a target region of a target in an image;
extracting a first class of features from the target region, where the first class of features includes image features of the target;
obtaining a second class of features according to the distribution of the same target in two previous and subsequent frames of images;
and tracking the target according to the first class of features and the second class of features.
Based on the above scheme, the second class of features includes:
a vector pointing from a keypoint of a target in the t-th frame image to the center point of the corresponding target in the (t+1)-th frame image, and/or
a vector pointing from a keypoint of a target in the (t+1)-th frame image to the center point of the corresponding target in the t-th frame image, where t is a natural number.
Based on the above scheme, the tracking the target according to the first class of features and the second class of features includes:
matching the first class of features of the (t+1)-th frame image with the first class of features of the t-th frame image to obtain first difference information;
matching the second class of features of the (t+1)-th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the (t-1)-th frame image to obtain second difference information;
and obtaining the correspondence between a target in the (t+1)-th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
Based on the above scheme, the obtaining the correspondence between the target in the (t+1)-th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information includes:
weighting and summing the first difference information and the second difference information of a first target in the (t+1)-th frame image to obtain a summation value;
and determining that the first target in the (t+1)-th frame image and the second target in the t-th frame image corresponding to the minimum summation value are the same target.
Based on the above scheme, the extracting the first class of features from the target region includes: performing residual processing on the target region to obtain residual features; and obtaining the first class of features based on the residual features.
Based on the above scheme, the extracting the first class of features from the target region includes:
performing residual processing on the target region by using a first residual layer to obtain a first residual feature;
performing residual processing on the first residual feature by using a second residual layer to obtain a second residual feature;
performing residual processing on the second residual feature by using a third residual layer to obtain a third residual feature;
performing residual processing on the third residual feature by using a fourth residual layer to obtain a fourth residual feature;
and the obtaining the first class of features based on the residual features includes: obtaining the image features based on the fourth residual feature.
Based on the above scheme, the performing residual processing on the target region by using the first residual layer to obtain the first residual feature includes:
performing residual processing on the target region by using a first residual sub-layer including N1 first residual modules to obtain a primary residual feature;
performing residual processing on the primary residual feature by using a second residual sub-layer including N2 second residual modules to obtain a secondary residual feature, where N1 and N2 are positive integers;
and combining the primary residual feature and the secondary residual feature to obtain the first residual feature.
Based on the above scheme, the obtaining the image features based on the fourth residual feature includes:
pooling the fourth residual feature to obtain a pooled feature;
and obtaining the image features based on the pooled feature.
Based on the above scheme, the obtaining the image features based on the pooled feature includes:
fully connecting a first pooled feature, obtained by performing first pooling on the fourth residual feature, with the second residual feature to obtain a first feature;
performing second pooling on the fourth residual feature to obtain a second feature;
and splicing the first feature and the second feature to obtain the image features.
Based on the above scheme, the obtaining the second class of features according to the distribution of the same target in two previous and subsequent frames of images includes:
obtaining two third-class features from the previous and subsequent frame images respectively, where the third class of features encodes the spatial position information of keypoints within the same target and is capable of distinguishing different targets;
and generating the second class of features based on the two third-class features.
Based on the above scheme, the generating the second class of features based on the two third-class features includes:
splicing the two third-class features according to a fourth class of features to obtain a spliced feature, where the fourth class of features includes a confidence level indicating whether the corresponding pixel is a keypoint of a target;
and generating the second class of features based on the spliced feature.
Based on the above scheme, the generating the second class of features based on the spliced feature includes:
performing convolution processing on the spliced feature by using a first convolution layer to obtain a first convolution feature;
converting the first convolution feature by using an hourglass-shaped conversion network to obtain a converted feature;
and performing convolution processing on the converted feature by using a second convolution layer to obtain the second class of features.
Based on the above scheme, the performing convolution processing on the converted feature by using the second convolution layer to obtain the second class of features includes:
performing primary convolution processing on the converted feature by using a first convolution sub-layer to obtain a primary convolution feature;
performing secondary convolution processing on the primary convolution feature by using a second convolution sub-layer to obtain a secondary convolution feature;
and performing tertiary convolution processing on the secondary convolution feature by using a third convolution sub-layer to obtain the second class of features.
An image processing apparatus, comprising:
a determining module, configured to determine a target region of a target in an image;
an extraction module, configured to extract a first class of features from the target region, where the first class of features includes image features of the target;
an obtaining module, configured to obtain a second class of features according to the distribution of the same target in two previous and subsequent frames of images;
and a tracking module, configured to track the target according to the first class of features and the second class of features.
Based on the above scheme, the second class of features includes:
a vector pointing from a keypoint of a target in the t-th frame image to the center point of the corresponding target in the (t+1)-th frame image, and/or
a vector pointing from a keypoint of a target in the (t+1)-th frame image to the center point of the corresponding target in the t-th frame image, where t is a natural number.
Based on the above scheme, the tracking module is specifically configured to match the first class of features of the (t+1)-th frame image with the first class of features of the t-th frame image to obtain first difference information; match the second class of features of the (t+1)-th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the (t-1)-th frame image to obtain second difference information; and obtain the correspondence between a target in the (t+1)-th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
Based on the above scheme, the tracking module is specifically configured to weight and sum the first difference information and the second difference information of a first target in the (t+1)-th frame image to obtain a summation value; and determine that the first target in the (t+1)-th frame image and the second target in the t-th frame image corresponding to the minimum summation value are the same target.
Based on the above scheme, the extraction module is specifically configured to perform residual processing on the target region to obtain residual features, and obtain the first class of features based on the residual features.
Based on the above scheme, the extraction module is specifically configured to perform residual processing on the target region by using a first residual layer to obtain a first residual feature; perform residual processing on the first residual feature by using a second residual layer to obtain a second residual feature; perform residual processing on the second residual feature by using a third residual layer to obtain a third residual feature; perform residual processing on the third residual feature by using a fourth residual layer to obtain a fourth residual feature; and obtain the image features based on the fourth residual feature.
Based on the above scheme, the extraction module is specifically configured to perform residual processing on the target region by using a first residual sub-layer including N1 first residual modules to obtain a primary residual feature; perform residual processing on the primary residual feature by using a second residual sub-layer including N2 second residual modules to obtain a secondary residual feature, where N1 and N2 are positive integers; and combine the primary residual feature and the secondary residual feature to obtain the first residual feature.
Based on the above scheme, the extraction module is specifically configured to pool the fourth residual feature to obtain a pooled feature, and obtain the image features based on the pooled feature.
Based on the above scheme, the extraction module is specifically configured to fully connect a first pooled feature, obtained by performing first pooling on the fourth residual feature, with the second residual feature to obtain a first feature; perform second pooling on the fourth residual feature to obtain a second feature; and splice the first feature and the second feature to obtain the image features.
Based on the above scheme, the obtaining module is specifically configured to obtain two third-class features from the previous and subsequent frame images respectively, where the third class of features encodes the spatial position information of keypoints within the same target and is capable of distinguishing different targets; and generate the second class of features based on the two third-class features.
Based on the above scheme, the obtaining module is specifically configured to splice the two third-class features according to a fourth class of features to obtain a spliced feature, where the fourth class of features includes a confidence level indicating whether the corresponding pixel is a keypoint of a target; and generate the second class of features based on the spliced feature.
Based on the above scheme, the obtaining module is specifically configured to perform convolution processing on the spliced feature by using a first convolution layer to obtain a first convolution feature; convert the first convolution feature by using an hourglass-shaped conversion network to obtain a converted feature; and perform convolution processing on the converted feature by using a second convolution layer to obtain the second class of features.
Based on the above scheme, the obtaining module is specifically configured to perform primary convolution processing on the converted feature by using a first convolution sub-layer to obtain a primary convolution feature; perform secondary convolution processing on the primary convolution feature by using a second convolution sub-layer to obtain a secondary convolution feature; and perform tertiary convolution processing on the secondary convolution feature by using a third convolution sub-layer to obtain the second class of features.
A detection device, comprising:
a memory, configured to store computer-executable instructions;
and a processor, connected to the memory and configured to implement the image processing method provided by any of the above technical solutions by executing the computer-executable instructions.
A computer storage medium having computer-executable instructions stored thereon, where the computer-executable instructions, when executed by a processor, implement the image processing method provided by any of the foregoing embodiments.
According to the technical solution provided by the embodiment of the invention, the first class of features and the second class of features are combined when keypoints are detected, so that the feature values of the keypoints are obtained after the two kinds of features are fused with each other. The obtained feature value of each keypoint therefore contains not only sufficient appearance information but also the internal spatial structure of the same target, and distinguishing targets or performing target detection by using keypoint feature values obtained in this way can improve accuracy.
Drawings
Fig. 1 is a schematic flowchart of a first image processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a process for obtaining a first class of features according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of key points of a human body according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a deep learning model for obtaining a first class of features according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep learning model for obtaining a second class of features according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a process of detecting a target area according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a deep learning model according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an image device according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail with reference to the drawings and the specific embodiments of the specification.
As shown in fig. 1, the present embodiment provides an image processing method, including:
step S110: determining a target area of a target in the image;
step S120: extracting a first class of features from the target region, wherein the first class of features comprises image features of the target;
step S130: obtaining a second class of features according to the distribution of the same target in two previous and subsequent frames of images;
step S140: and tracking the target according to the first class of characteristics and the second class of characteristics.
The target in this embodiment may be a graphical element of any movable object, such as a human, an animal, or a device.
In this embodiment, the step S110 may include: obtaining a bounding box based on the keypoints of the target, where the area enclosed by the bounding box is the target region. The image area contained by the bounding box may be the target region, also referred to as a region of interest.
In some embodiments, the image device performing the image processing obtains, while receiving an image from another device, the region coordinates and the like of a plurality of image regions contained in the image; as another example, the image regions may be output by other networks, such as a fully convolutional neural network.
In this embodiment, after the target region is obtained based on the keypoints of the same target, the target region is segmented from the image and used as the input for extracting the first class of features. The first class of features is the image features of the image region where the target is located, including but not limited to appearance features and/or structural features of the target. The structural features include the body proportions of the target and the like.
The appearance features include, but are not limited to, surface-observable color features and/or contour features of the target, and the like.
The structural features include, but are not limited to, spatial positional relationships between different parts within the target.
In order to improve the tracking accuracy of the target, in this embodiment the target is tracked not only according to the first class of features; a second class of features is also obtained according to the distribution of the same target in two previous and subsequent frames of images.
The first class of features and the second class of features are combined to obtain the tracking result of target tracking. On the one hand, the tracking result considers, based on the first class of features, the similarity of the appearance features of the same target in two adjacent frames of images; on the other hand, because the second class of features is introduced for target tracking and reflects the spatial transformation relationship of the same target, tracking comprehensively considers both the appearance similarity of the first class of features across the two adjacent frames and the spatial transformation relationship. Moreover, since the second class of features is obtained by combining features of the previous and subsequent frames, it provides a temporal constraint for target tracking based on appearance features. Even if the target moves rapidly, moves across a large span, or its appearance deforms in the image, the target can still be tracked accurately based on this temporal constraint, which improves the tracking accuracy, reduces tracking loss, and improves the tracking effect.
In some embodiments, the second class of features includes: a vector pointing from a keypoint of a target in the t-th frame image to the center point of the corresponding target in the (t+1)-th frame image, and/or a vector pointing from a keypoint of a target in the (t+1)-th frame image to the center point of the corresponding target in the t-th frame image, where t is a natural number.
Here, the (t+1)-th frame image is the frame image following the t-th frame image. Assuming that both the t-th frame image and the (t+1)-th frame image contain S targets, both images contain first-class features of the S targets, and a second-class feature map of the (t+1)-th frame image relative to the t-th frame image is obtained, where the pixel values of this temporal instance embedding feature map are the second class of features. The second-class feature map contains the second-class features of the S targets.
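By way of a non-limiting illustration of what such a second class of features encodes, the following Python/NumPy sketch computes, for one target, the vectors pointing from its keypoints in the t-th frame to its center point in the (t+1)-th frame. The array shapes and the function name are assumptions introduced here for illustration; in the embodiments the second-class feature map is produced by a deep learning model rather than computed from known correspondences.

```python
import numpy as np

def temporal_embedding_vectors(keypoints_t: np.ndarray, center_t1: np.ndarray) -> np.ndarray:
    """Second-class (temporal) features for one target: vectors pointing from each
    keypoint of the target in frame t to the center point of the same target in frame t+1.

    keypoints_t: (K, 2) array of (x, y) keypoint coordinates in frame t.
    center_t1:   (2,)  array, (x, y) center point of the corresponding target in frame t+1.
    Returns a (K, 2) array of offset vectors.
    """
    return center_t1[None, :] - keypoints_t

# Toy usage: three keypoints of one person in frame t, its center point in frame t+1.
kps_t = np.array([[100.0, 40.0], [98.0, 80.0], [102.0, 120.0]])
center_t1 = np.array([105.0, 82.0])
print(temporal_embedding_vectors(kps_t, center_t1))  # one offset vector per keypoint
```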
In some embodiments, the step S140 may include:
matching the first class of features of the (t+1)-th frame image with the first class of features of the t-th frame image to obtain first difference information;
matching the second class of features of the (t+1)-th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the (t-1)-th frame image to obtain second difference information;
and obtaining the correspondence between a target in the (t+1)-th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
The first difference information may be information such as the Euclidean distance between different first-class features in the two images; the Euclidean distance here is merely an example, and there are many specific implementations, which are not limited thereto.
Similarly, the second difference information may also be distance information between the corresponding second-class features of the two images, or other dissimilarity information.
The obtaining the correspondence between the target in the (t+1)-th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information includes:
weighting and summing the first difference information and the second difference information of a first target in the (t+1)-th frame image to obtain a summation value;
and determining that the first target in the (t+1)-th frame image and the second target in the t-th frame image corresponding to the minimum summation value are the same target.
Since the keypoints corresponding to the first class of features are known, the center points corresponding to these keypoints are also known. Because the center point of the target is used in the second class of features, which first-class features correspond to which second-class features within one frame of image can be determined by matching the center points; in this way, the mutually matched first difference information and second difference information can be weighted and summed to obtain final difference information. The pair of targets in the two adjacent frames whose matched final difference information is the smallest corresponds to the same target, thereby realizing target tracking.
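Purely as an illustrative sketch of the matching described above, the following Python/NumPy code weights and sums the first and second difference information (here taken to be Euclidean distances) and assigns each target of the (t+1)-th frame to the target of the t-th frame with the minimum summation value. The weights w1 and w2 and the greedy per-target argmin are assumptions; the embodiments only require a weighted summation followed by selection of the minimum, so a one-to-one global assignment could equally be used.

```python
import numpy as np

def match_targets(feat1_t1, feat1_t, feat2_t1, feat2_t, w1=0.5, w2=0.5):
    """Associate targets of frame t+1 with targets of frame t.

    feat1_t1, feat1_t: first-class (appearance) feature vectors, shapes (N, D1) and (M, D1).
    feat2_t1, feat2_t: second-class (temporal) feature vectors,  shapes (N, D2) and (M, D2).
    For each target i of frame t+1, the first and second difference information
    (Euclidean distances here) are weighted and summed, and the target j of frame t
    with the minimum summation value is taken as the same target.
    Returns a list of (i, j) index pairs.
    """
    matches = []
    for i in range(feat1_t1.shape[0]):
        d1 = np.linalg.norm(feat1_t - feat1_t1[i], axis=1)   # first difference information
        d2 = np.linalg.norm(feat2_t - feat2_t1[i], axis=1)   # second difference information
        cost = w1 * d1 + w2 * d2                             # weighted summation value
        matches.append((i, int(np.argmin(cost))))            # minimum summation value -> same target
    return matches
```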
In some embodiments, the step S120 may include:
performing residual processing on the target region to obtain residual features;
and obtaining the first class of features based on the residual features.
In this embodiment, the residual processing is processing performed by a residual module; after the residual processing, the original details can be retained on the one hand, and on the other hand the required information of the target can be highlighted through processing such as convolution inside the residual module.
In some embodiments, as shown in fig. 2, the step S120 may include:
step S121: performing residual processing on the target region by using a first residual layer to obtain a first residual feature;
step S122: performing residual processing on the first residual feature by using a second residual layer to obtain a second residual feature;
step S123: performing residual processing on the second residual feature by using a third residual layer to obtain a third residual feature;
step S124: performing residual processing on the third residual feature by using a fourth residual layer to obtain a fourth residual feature;
step S125: obtaining the image features based on the fourth residual feature.
In some embodiments, the first residual layer may be a residual layer formed by a single residual module, or may be formed by a plurality of residual sub-layers; in short, the residual features can be obtained through residual processing.
The first residual layer may be any one of the residual modules provided in a residual network (ResNet).
In this embodiment, the second residual layer is located after the first residual layer, and the second residual feature can be obtained by performing residual processing again on the first residual feature output by the first residual layer. The third residual layer is located after the second residual layer, and the fourth residual layer is located after the third residual layer. The specific structures of the residual modules included in the first, second, third and fourth residual layers may be the same or different, and these residual modules may come from residual networks of different versions or different structures.
The residual modules included in any two of the first, second, third and fourth residual layers may also be the same.
Specifically, the step S121 may include:
performing residual processing on the target region by using a first residual sub-layer including N1 first residual modules to obtain a primary residual feature;
performing residual processing on the primary residual feature by using a second residual sub-layer including N2 second residual modules to obtain a secondary residual feature, where N1 and N2 are positive integers.
In this embodiment, the first residual module and the second residual module may be residual modules with different network structures.
Optionally, the values of N1 and N2 are both positive integers not less than 2, for example N1 is 4 and N2 is 6. The amount of target information retained by the residual features obtained after processing by different residual modules differs; compared with using identical residual modules, this reduces the large tracking error that a single type of residual module can cause by losing a certain type of information. Therefore, selecting different residual modules can further improve the tracking accuracy.
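The following PyTorch sketch, offered only as an assumed concrete reading of the first residual layer, builds the two residual sub-layers from N1 = 4 and N2 = 6 plain residual modules and combines the primary and secondary residual features; the module structure, channel widths and the use of channel concatenation as the combination operation are not prescribed by the embodiments.

```python
import torch
from torch import nn

class ResidualModule(nn.Module):
    """A plain residual module with an identity shortcut; channel counts are illustrative."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

class FirstResidualLayer(nn.Module):
    """First residual layer built from two sub-layers of N1 and N2 residual modules.
    The primary and secondary residual features are combined by channel-wise
    concatenation here; the combination operator is not fixed by the embodiments."""
    def __init__(self, channels: int, n1: int = 4, n2: int = 6):
        super().__init__()
        self.sub1 = nn.Sequential(*[ResidualModule(channels) for _ in range(n1)])
        self.sub2 = nn.Sequential(*[ResidualModule(channels) for _ in range(n2)])

    def forward(self, x):
        primary = self.sub1(x)                          # primary residual feature
        secondary = self.sub2(primary)                  # secondary residual feature
        return torch.cat([primary, secondary], dim=1)   # combined first residual feature
```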
In some embodiments, the step S125 may include: pooling the fourth residual feature to obtain a pooled feature; and obtaining the image features based on the pooled feature.
In this embodiment, the required feature information has been obtained once the fourth residual feature is obtained, but redundant feature values can be filtered out through a pooling operation in order to reduce the amount of subsequent data processing. In this embodiment, the pooling operations include but are not limited to an average pooling operation and/or a max pooling operation, and the like.
Specifically, the step S125 may include:
fully connecting a first pooled feature, obtained by performing first pooling on the fourth residual feature, with the second residual feature to obtain a first feature;
performing second pooling on the fourth residual feature to obtain a second feature;
and splicing the first feature and the second feature to obtain the image features.
In this embodiment, the splicing the first feature and the second feature includes: directly splicing the first feature corresponding to the pixel in the i-th row and j-th column of the first feature map containing the first features with the second feature of the pixel in the i-th row and j-th column of the second feature map containing the second features; for example, splicing a first feature of dimension S1 with a second feature of dimension S2 yields an image feature of dimension S1+S2.
Specifically, the image features are obtained as follows:
performing residual processing on the target region by using a first residual sub-layer including N1 first residual modules to obtain a primary residual feature, and performing residual processing on the primary residual feature by using a second residual sub-layer including N2 second residual modules to obtain a secondary residual feature, which together form the first residual feature, where N1 and N2 are positive integers;
processing the first residual feature by using a second residual layer to obtain a second residual feature;
processing the second residual feature by using a third residual layer to obtain a third residual feature;
processing the third residual feature by using a fourth residual layer to obtain a fourth residual feature;
fully connecting a first pooled feature, obtained by performing first pooling on the fourth residual feature, with the third residual feature to obtain a first feature;
performing second pooling on the fourth residual feature to obtain a second feature;
and splicing the first feature and the second feature to obtain the image features.
As shown in fig. 4, the number of first residual modules is 4, namely res3a, res3b, res3c and res3d; the number of second residual modules is 6, namely res4a, res4b, res4c, res4d, res4e and res4f.
The third residual layer may include a residual module res5a; the fourth residual layer may include a residual module res5b; the fifth residual layer may include a residual module res5c.
The first pooling may be average pooling, and the intermediate-layer feature obtained after passing through the fully connected (fc) layer may be one of the first features described above.
The second pooling, corresponding to the fifth residual feature, may also be average pooling, yielding a top-level feature as one of the second features. The second feature may be a 2048-dimensional (D) feature.
The intermediate feature and the top-level feature are fused to obtain the first class of features.
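As an assumed rendering of the Fig. 4 pipeline, the following PyTorch sketch shows the res3/res4/res5a/res5b/res5c stages followed by the two pooling branches whose outputs are spliced into the first-class feature. The stage modules and all widths are placeholders; the wiring of the two branches follows the description of Fig. 4 above (pool the res5b output through a fully connected layer, pool the res5c output directly).

```python
import torch
from torch import nn

class FirstClassFeatureHead(nn.Module):
    """Sketch of the Fig. 4 pipeline: res3 (4 modules) -> res4 (6 modules) ->
    res5a -> res5b -> res5c, then two average-pooling branches whose outputs
    are concatenated into the first-class (appearance) feature.
    The stage modules and all widths are placeholders, not values from the patent."""
    def __init__(self, res3, res4, res5a, res5b, res5c, mid_in=2048, mid_out=512):
        super().__init__()
        self.res3, self.res4 = res3, res4
        self.res5a, self.res5b, self.res5c = res5a, res5b, res5c
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(mid_in, mid_out)        # produces the intermediate-layer feature

    def forward(self, region):
        x4 = self.res4(self.res3(region))
        x5a = self.res5a(x4)
        x5b = self.res5b(x5a)
        x5c = self.res5c(x5b)
        mid = self.fc(self.pool(x5b).flatten(1))    # first pooling + fc -> intermediate feature
        top = self.pool(x5c).flatten(1)             # second pooling -> top-level (e.g. 2048-D) feature
        return torch.cat([mid, top], dim=1)         # spliced first-class feature
```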
Fig. 4 is a network architecture diagram of the deep learning model for extracting the first class of features in this embodiment.
Fig. 5 is a network architecture diagram of the deep learning model for extracting the second class of features in this embodiment.
In this embodiment, after the first class of features and the second class of features are obtained respectively, target tracking is realized comprehensively, which can improve the target tracking result.
In some embodiments, the step S130 may include:
obtaining two third-class features from the previous and subsequent frame images respectively, where the third class of features encodes the spatial position information of keypoints within the same target and is capable of distinguishing different targets;
and generating the second class of features based on the two third-class features.
The third class of features in this embodiment consists of vectors of the keypoints of a target relative to the center point of that target, and characterizes the positional relationship between the keypoints within one target.
In some embodiments, the generating the second class of features based on the two third-class features includes:
splicing the two third-class features according to a fourth class of features to obtain a spliced feature, where the fourth class of features includes a confidence level indicating whether the corresponding pixel is a keypoint of a target;
and generating the second class of features based on the spliced feature.
For example, the fourth class of features may be the confidence in a Gaussian response map obtained based on a Gaussian algorithm. The confidence indicates the probability that the corresponding pixel is a keypoint of the target.
In some embodiments, the generating the second class of features based on the spliced feature includes:
performing convolution processing on the spliced feature by using a first convolution layer to obtain a first convolution feature;
converting the first convolution feature by using an hourglass-shaped conversion network to obtain a converted feature;
and performing convolution processing on the converted feature by using a second convolution layer to obtain the second class of features.
The hourglass-shaped network is a network whose architecture is symmetric about its midpoint.
In other embodiments, the performing convolution processing on the converted feature by using the second convolution layer to obtain the second class of features includes:
performing primary convolution processing on the converted feature by using a first convolution sub-layer to obtain a primary convolution feature;
performing secondary convolution processing on the primary convolution feature by using a second convolution sub-layer to obtain a secondary convolution feature;
and performing tertiary convolution processing on the secondary convolution feature by using a third convolution sub-layer to obtain the second class of features.
The second class of features can thus be obtained step by step through convolution by a plurality of convolution layers or convolution sub-layers.
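The following PyTorch sketch is one assumed reading of the Fig. 5 head: the two third-class feature maps are spliced along the channel axis, the fourth-class confidence is applied as a soft mask, and the result passes through a first convolution layer, an hourglass-shaped conversion network and three convolution sub-layers. How exactly the fourth class of features enters the splicing is not pinned down by the text; multiplying it in is only one possibility, and concatenating it as extra channels would be an equally plausible reading.

```python
import torch
from torch import nn

class SecondClassFeatureHead(nn.Module):
    """Sketch of the Fig. 5 head. `hourglass` stands in for the hourglass-shaped
    conversion network; channel widths are illustrative."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, hourglass: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)   # first convolution layer
        self.hourglass = hourglass
        self.conv_sub = nn.Sequential(                                    # primary / secondary / tertiary convolution
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1),
        )

    def forward(self, third_feat_t, third_feat_t1, confidence=None):
        x = torch.cat([third_feat_t, third_feat_t1], dim=1)   # spliced feature of the two frames
        if confidence is not None:
            x = x * confidence                                 # fourth-class confidence as an assumed soft mask
        return self.conv_sub(self.hourglass(self.conv1(x)))   # second-class feature map
```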
There are various ways to determine the image region in step S110; an optional way is provided below. As shown in fig. 6, the present embodiment provides an image processing method, including:
step S220: detecting a fourth class of features from the image, where the fourth class of features at least includes spatial position information of the target;
step S210: detecting a fifth class of features from the image, where the fifth class of features at least includes appearance information of the target;
step S230: fusing the fifth class of features and the fourth class of features to obtain the feature values of keypoints.
The fifth class of features (KE) is detected from the image; the KE includes but is not limited to appearance information of the target body surface, which may be various directly visually observable contour information, texture information, skin texture information, and the like.
Taking a human body as an example, the appearance information includes but is not limited to contour information of the five sense organs, distribution information of the five sense organs, and the like.
An image contains pixels belonging to the target and pixels belonging to the background other than the target. In this embodiment, the pixels belonging to the target and the pixels of the background are distinguished, and different pixel values (also called feature values) are used to represent them in the generated feature map containing the fifth class of features. For example, in the feature map the pixel value corresponding to a background pixel of the detected image is "0", while the pixel value corresponding to a target pixel is a value other than "0". There may be multiple targets in the detected image; in order to distinguish the multiple targets, different values are used as the pixel values of the pixels of different targets in the feature map. For example, the feature value corresponding to target A is represented by "1", the feature value corresponding to target B is represented by "2", and the feature value corresponding to the background in the image is "0". At this time, 1 differs from 2 and from 0, and 2 also differs from 0; thus, based on comparison of the values, it is known which parts of the feature map are background and which are targets. Meanwhile, since different targets use different feature values, which pixels belong to the same target can be identified according to the specific values of the feature values.
The fourth class of features includes spatial position information of the target. Optionally, the feature value of the fourth class of features indicates the relative positional relationship of each keypoint with respect to the center point of the target; specifically, the fourth class of features may be a vector pointing from a spatial keypoint to the center point of the target. The fourth class of features can characterize the relative positional relationships between the various parts within the target. For example, taking a human body as the target, the fourth class of features may include the relative positional relationship of the joint keypoints of different joints of the human body with respect to the center point of the human body, including but not limited to direction and/or distance, which may be represented by vectors pointing from the keypoints to the center point of the body. The human body center point may be the root node of the human body. Fig. 3 is a schematic diagram of human body keypoints, in which keypoint 0 is the root node and is obtained by calculation. In fig. 3, keypoint 10 is the head keypoint; keypoint 9 is the neck keypoint; keypoints 11 and 14 are shoulder keypoints; keypoint 8 is the keypoint connecting the shoulders and the neck; keypoint 7 is the waist keypoint; keypoints 12 and 15 are elbow keypoints; keypoints 13 and 16 are wrist keypoints; keypoints 1 and 4 are crotch keypoints; keypoints 5 and 20 are knee keypoints; keypoints 6 and 3 are ankle keypoints.
In other embodiments, the human body center point may also be obtained by averaging the spatial keypoints of the body to give the coordinate value of the human body center point. In this way, the distribution of each spatial keypoint of the target relative to the human body center point satisfies a specific distribution condition. When judging which spatial instance embedding features belong to one target, the embedding values belonging to the same target are determined according to whether the embedding values of the spatial instance embedding features satisfy this distribution condition.
Assuming that the target is a human body, the embedding value corresponding to a spatial instance embedding feature is an array of two elements: the first element represents the difference in the x direction and the second element represents the difference in the y direction, where the x direction and the y direction are perpendicular to each other. The x and y directions are relative to the image; for example, if a two-dimensional rectangular coordinate system with an x axis and a y axis is established in the plane of the image, the x direction may be the x-axis direction and the y direction the y-axis direction of the image coordinate system. For example, in the embedding value obtained by subtracting the coordinates of the human body center point from the coordinates of the left-face head keypoint, the first element is positive and the second element is positive; for the right-face head keypoint, the first element is negative and the second element is positive; for the left-foot keypoint, the first element is positive and the second element is negative; and for the right-foot keypoint, the first element is negative and the second element is negative. When determining which embedding values belong to one target, this can be done according to the body part corresponding to the keypoint feature value associated with the embedding value, that is, according to the characteristics of the embedding value.
In this embodiment, the fourth class of features is the vector of each spatial keypoint with respect to the center point, which essentially defines the relative positional relationship between the keypoints within one target.
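As a non-limiting sketch of how such spatial instance embeddings could be used to group keypoints, the following Python/NumPy code lets every detected keypoint vote for a target center by adding its predicted offset, and merges keypoints whose voted centers are close. The tolerance, the greedy merging rule and the sign convention (center = keypoint + offset; the subtraction order described above only flips the sign) are assumptions for illustration.

```python
import numpy as np

def group_by_spatial_embedding(keypoint_xy, embed_dx_dy, tol=15.0):
    """Group detected keypoints into targets using spatial instance embeddings.

    keypoint_xy: (P, 2) detected keypoint coordinates.
    embed_dx_dy: (P, 2) fourth-class embedding at each keypoint, i.e. the predicted
                 (dx, dy) offset from the keypoint to its target's center point.
    Each keypoint votes for a center (keypoint + offset); keypoints whose voted
    centers fall within `tol` pixels of an existing group center are merged.
    `tol` and the greedy merging rule are assumptions, not taken from the patent.
    """
    centers, groups = [], []
    voted = keypoint_xy + embed_dx_dy
    for idx, c in enumerate(voted):
        for g, gc in enumerate(centers):
            if np.linalg.norm(c - gc) < tol:
                groups[g].append(idx)
                centers[g] = np.mean(voted[groups[g]], axis=0)  # refine the group center
                break
        else:
            centers.append(c)
            groups.append([idx])
    return groups  # list of lists of keypoint indices, one list per target
```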
Since the fifth class of features focuses more on the appearance information of the target, in the absence of a spatial constraint, different keypoints of the same target may be attributed to different targets, resulting in inaccuracy.
Since the fourth class of features pays more attention to the different spatial keypoints within a target, it may ignore the relative positional relationship between different targets, and for points far from the center point of the target the accuracy is poor due to large encoding errors and the like.
In this embodiment, when the feature values of the keypoints are detected, the two kinds of features are combined so that they complement each other; for example, the fourth class of features serves as a spatial constraint for the fifth class of features, and the fifth class of features reinforces the deficiency of the fourth class of features. The two are fused, and the fused feature is used as the feature value of the keypoint; based on the keypoint feature values it can be determined which keypoints belong to the same target, while the appearance information of the target is obtained at the same time. Moreover, because the accuracy of the keypoint feature values is improved, the problem of low extraction efficiency of keypoint feature values caused by error correction and other reasons is alleviated, and the extraction efficiency of keypoint feature values is improved.
In some embodiments, the method further includes:
detecting a third-class feature map from the image, where the third-class feature map at least includes prediction information of the feature values of the keypoints;
the step S230 may include:
fusing the fifth class of features and the fourth class of features based on the third-class feature map to obtain the feature values of the keypoints.
In this embodiment, the third-class feature map may also be referred to as a heat map; a pixel in the third-class feature map may be prediction information such as a confidence or a probability value, which indicates the probability that the corresponding pixel in the image is a keypoint, or the confidence with which the pixel is predicted to be a keypoint, and the like.
In this embodiment, the detection position of a keypoint is determined by combining the third-class feature map.
When the fifth class of features and the fourth class of features are fused in step S230, the fifth-class feature map in which the fifth class of features is located and the spatial instance embedding map in which the fourth class of features is located are aligned with each other and with the third-class feature map, where alignment means that the maps contain the same number of pixels that correspond one-to-one in spatial position.
Therefore, when the feature value of a keypoint is obtained, the fifth-class feature and the fourth-class feature at the same detection position are fused to obtain the feature value of the keypoint.
In this embodiment, the fusion of the fifth class of features and the fourth class of features includes but is not limited to:
splicing the fifth class of features and the fourth class of features. For example, the fifth class of features is an m1-dimensional feature and the fourth class of features is an m2-dimensional feature; after the two are spliced, an (m1+m2)-dimensional feature is obtained.
In some embodiments, the fifth class of features may be a 1-dimensional feature and the fourth class of features may be a 2-dimensional feature; after fusion, the resulting spliced feature may be a 3-dimensional feature.
In this embodiment, through such direct splicing of features, the spliced feature simultaneously retains the feature values of the fifth class of features and of the fourth class of features, that is, the appearance information and the spatial position information are both retained; obtaining keypoint feature values by using the spliced feature formed in this way can obviously reduce the error rate and improve the accuracy.
In some embodiments, the step S230 may specifically include:
determining the detection position of the feature value of a keypoint according to the confidence of the predicted keypoint in the third-class feature map (the keypoint Gaussian response map);
and splicing the fifth-class feature at that detection position in the fifth-class feature map with the fourth-class feature at that detection position in the fourth-class feature map to obtain the feature value of the keypoint.
In this embodiment, the higher the confidence, the higher the probability that the corresponding pixel in the third-class feature map is the feature value of a keypoint. For example, taking the confidence of the head keypoint as an example, the pixel values (i.e., the confidences) of the pixels in the third-class feature map are traversed to find the local maximum confidences in different regions; if the coordinates of the pixel where a maximum confidence is located are (X1, Y1), then the fifth-class feature at (X1, Y1) of the fifth-class feature map and the fourth-class feature at (X1, Y1) of the fourth-class feature map are taken out, and the two are fused to obtain the feature value of one keypoint. The coordinates of this keypoint in the image are (X1, Y1), and its feature value consists of the m1-dimensional embedding value of the fifth-class feature and the m2-dimensional embedding value of the fourth-class feature.
For example, with a human body as the target, if the human body includes M keypoints, then after the fifth class of features and the fourth class of features are fused based on the third-class feature map, feature values of the M keypoints are finally obtained, and each feature value is formed by splicing the fifth-class feature and the fourth-class feature of the corresponding keypoint.
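The following Python/NumPy sketch illustrates the extraction just described: for each keypoint type, the position of maximum confidence in the third-class feature map is located, and the fifth-class and fourth-class features at that position are spliced into the keypoint feature value. Taking a single global maximum per channel is a simplification of the per-region local maxima described above, and the array shapes are assumptions.

```python
import numpy as np

def fuse_at_peaks(heatmap, ke_map, se_map):
    """For each keypoint type, take the location of maximum confidence in the
    third-class feature map and splice the fifth-class feature (KE) and
    fourth-class feature (spatial embedding) found at that location.

    heatmap: (J, H, W)  per-keypoint-type confidence (Gaussian response) maps.
    ke_map:  (m1, H, W) fifth-class feature map.
    se_map:  (m2, H, W) fourth-class feature map.
    Returns a list of (x, y, feature) tuples, one per keypoint type; a real
    multi-person system would instead take all local maxima per channel.
    """
    results = []
    for j in range(heatmap.shape[0]):
        y, x = np.unravel_index(np.argmax(heatmap[j]), heatmap[j].shape)
        feature = np.concatenate([ke_map[:, y, x], se_map[:, y, x]])  # (m1 + m2,) keypoint feature value
        results.append((int(x), int(y), feature))
    return results
```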
In some embodiments, the method may further include:
clustering the feature values of the keypoints to obtain a clustering result;
and determining the keypoints belonging to the same target according to the clustering result.
For example, the feature values of the keypoints are obtained after the splicing. Taking the human body as the target as an example, if the human body has S1 keypoints and there are S2 targets in the image, S1*S2 keypoints are obtained;
the S1*S2 keypoints are then clustered to obtain the clustering result.
For example, the clustering may proceed as follows:
clustering each type of human body keypoint along a predetermined direction, for example clustering based on distance;
obtaining locally optimal solutions for different types of keypoints based on the clustering;
and combining all the locally optimal solutions to obtain the clustering result.
For example, taking the human body as the target, clustering is performed along the predetermined direction from head to foot; the distance clustering of each type of human body keypoint along the predetermined direction includes:
performing distance clustering on each head keypoint and each neck keypoint to obtain the distance between each head keypoint and each neck keypoint;
performing distance clustering on each neck keypoint and each chest keypoint to obtain the distance between each neck keypoint and each chest keypoint;
and so on, until all local keypoints have been traversed.
The obtaining locally optimal solutions for different types of keypoints based on the clustering includes:
selecting the head keypoint and neck keypoint with the minimum distance as a locally optimal match;
selecting the neck keypoint and chest keypoint with the minimum distance as a locally optimal match;
and so on, until all locally optimal matches are completed.
The combining the locally optimal solutions to obtain the clustering result includes:
combining the locally optimal matches that involve the same keypoints to obtain a clustering result at the granularity of targets.
Finally, all the keypoints contained in the same target are deduced in reverse from the clustering result.
Of course, the above is only an example of dividing different keypoints among the same targets; there are various specific implementations, which are not enumerated here.
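Purely as one non-limiting illustration of the greedy, distance-based local matching described above, the following Python sketch matches adjacent keypoint types along a predetermined head-to-foot chain and then merges matches that share a keypoint into per-target groups. The shortened chain, the data layout and the greedy closest-pair rule are assumptions introduced for illustration.

```python
import numpy as np

# Adjacent keypoint types along the head-to-foot direction (a shortened, hypothetical
# chain; the full chain would follow the Fig. 3 skeleton).
CHAIN = ["head", "neck", "chest"]

def cluster_by_chain(detections):
    """Greedy locally optimal matching along a predetermined direction.

    detections: dict mapping keypoint type -> (N, 2) array of coordinates of that type.
    For each adjacent pair of types in CHAIN, the closest pairs are matched first
    (locally optimal match); chains sharing a keypoint are then merged into one target.
    """
    links = []
    for a, b in zip(CHAIN[:-1], CHAIN[1:]):
        pa, pb = detections[a], detections[b]
        dist = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)  # pairwise distances
        used_a, used_b = set(), set()
        for i, j in sorted(((i, j) for i in range(len(pa)) for j in range(len(pb))),
                           key=lambda ij: dist[ij]):
            if i not in used_a and j not in used_b:          # locally optimal match
                links.append(((a, i), (b, j)))
                used_a.add(i); used_b.add(j)
    # Merge links that share a keypoint into per-target groups.
    groups = []
    for link in links:
        for g in groups:
            if link[0] in g or link[1] in g:
                g.update(link)
                break
        else:
            groups.append(set(link))
    return groups  # each set holds the (type, index) keypoints of one target
```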
In this embodiment, the fifth class of features and/or the spatial instance features are obtained by using a deep learning model.
The deep learning model includes, but is not limited to, a neural network.
For example, referring to fig. 8, the deep learning model includes:
a feature extraction layer, configured to extract low-level features from the image to obtain a feature map;
a conversion layer, located after the feature extraction layer and configured to obtain, based on the feature map, the third-class feature map, a fifth-class feature map containing the fifth class of features, and a fourth-class feature map containing the fourth class of features;
and a feature fusion convolution layer, located after the last conversion layer and configured to fuse the fifth-class feature map and the fourth-class feature map based on the third-class feature map.
In this embodiment, the third-class, fifth-class and fourth-class feature maps have the same number of pixels, but the dimensions of a single pixel may differ.
For example, the third-class, fifth-class and fourth-class feature maps contain W*H pixels, where W and H are both positive integers. The dimension of one pixel of the third-class feature map may be J; the dimension of one pixel of the fifth-class feature map may be J; and the dimension of the fourth-class feature map may be 2. The feature fusion convolution layer may have J+2 channels, a 1x1 convolution kernel, and a convolution stride of 1.
In some embodiments, the conversion layer includes N hourglass-shaped coding sub-networks connected in series, where the network architecture of an hourglass-shaped coding sub-network is hourglass-shaped; the N hourglass-shaped coding sub-networks are configured to obtain, based on the feature map, the third-class feature map, the fifth-class feature map containing the fifth class of features, and the fourth-class feature map containing the fourth class of features; N is a positive integer, for example 2, 3 or 4.
For example, the conversion layer may include: an hourglass-shaped coding sub-network, at least two tail convolution sub-layers located after the hourglass-shaped coding sub-network, and a feature splicing node. The hourglass-shaped coding sub-network obtains the feature map from the feature extraction layer, processes it, and inputs the processed features into the at least two serially connected convolution sub-layers for convolution processing; the convolution features output by the last convolution sub-layer are spliced with the feature map obtained from the feature extraction layer to obtain a (J+J+2)-dimensional feature map, in which one J-dimensional part corresponds to the third-class feature map, another J-dimensional part corresponds to the fifth-class feature map, and the 2-dimensional part corresponds to the fourth-class feature map.
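By way of an assumed sketch of the Fig. 8 layout, the following PyTorch code shows a feature extraction layer followed by a conversion layer whose (J+J+2)-channel output is split into the third-class, fifth-class and fourth-class feature maps, and a 1x1 feature-fusion convolution with J+2 channels. The backbone, the conversion sub-network and the way the third-class map conditions the fusion are placeholders; here the third-class map is simply returned alongside so that peak-based fusion, as sketched earlier, can use it.

```python
import torch
from torch import nn

class KeypointModel(nn.Module):
    """Sketch of the Fig. 8 layout: feature extraction layer, conversion layer,
    channel-wise split into the three feature maps, and a 1x1 feature-fusion
    convolution with J+2 channels. `backbone` and `conversion` stand in for the
    actual sub-networks; J is the number of keypoint types."""
    def __init__(self, backbone: nn.Module, conversion: nn.Module, num_kpts: int):
        super().__init__()
        self.backbone = backbone
        self.conversion = conversion     # e.g. hourglass sub-network(s) plus tail convolutions
        self.J = num_kpts
        self.fusion = nn.Conv2d(num_kpts + 2, num_kpts + 2, kernel_size=1, stride=1)

    def forward(self, image):
        feat = self.backbone(image)                      # low-level feature map
        out = self.conversion(feat)                      # (B, J + J + 2, H, W)
        heatmap = out[:, : self.J]                       # third-class feature map
        ke = out[:, self.J : 2 * self.J]                 # fifth-class feature map
        se = out[:, 2 * self.J :]                        # fourth-class feature map
        fused = self.fusion(torch.cat([ke, se], dim=1))  # feature-fusion convolution (J+2 channels)
        return heatmap, ke, se, fused
```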
In this embodiment, the conversion layer employs hourglass-shaped coding sub-networks; in a specific implementation, a residual module may also be used to replace the hourglass-shaped coding sub-network. This is merely an example; the specific implementations are various and are not enumerated here.
As shown in fig. 7, the present embodiment provides an image processing apparatus including:
a determining module 110, configured to determine a target region of a target in the image;
an extracting module 120, configured to extract a first class of features from the target region, where the first class of features includes an image feature of the target;
an obtaining module 130, configured to obtain a second type of feature according to distribution of two previous and subsequent frames of images of the same target;
and a tracking module 140, configured to perform target tracking according to the first class of features and the second class of features.
The embodiment provides an image processing apparatus which can be applied to various electronic devices, such as mobile devices and fixed devices. The mobile device includes but is not limited to a mobile phone, a tablet computer, or various wearable devices. The fixed device includes but is not limited to a desktop computer, a notebook computer, a server, or the like.
In some embodiments, the determining module 110, the extracting module 120, the obtaining module 130, and the tracking module 140 may be program modules, which are executed by a processor and capable of detecting the first type of feature and the second type of feature and obtaining feature values of the key points.
In other embodiments, the determining module 110, the extracting module 120, the obtaining module 130, and the tracking module 140 may be combined hardware-software modules, which may include various programmable arrays; the programmable array includes, but is not limited to, a complex programmable logic device or a field programmable gate array.
In some embodiments, the second class of features includes:
a vector in which a key point of a target in the t frame image points to the center point of the target corresponding to the t +1 frame image, and/or,
and the key point of the target of the t +1 th frame image points to the vector of the central point of the target corresponding to the t-th frame image, and t is a natural number.
In some embodiments, the tracking module 140 is specifically configured to match the first class of features of the t+1 th frame image with the first class of features of the t-th frame image to obtain first difference information; match the second class of features of the t+1 th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the t-1 th frame image to obtain second difference information; and obtain the correspondence between the target in the t+1 th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
In some embodiments, the tracking module 140 is specifically configured to perform weighted summation on the first difference information of a first target in the t+1 th frame image and the second difference information of the first target to obtain a summation value; and determine that the first target of the t+1 th frame image and the second target of the t-th frame image corresponding to the minimum summation value are the same target.
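A minimal sketch of this matching rule, assuming the first and second difference information have already been arranged as cost matrices between the targets of the two frames; the weight alpha and the greedy assignment strategy are illustrative assumptions:

```python
import torch

def match_targets(first_diff, second_diff, alpha=0.5):
    """Greedy association between targets of frame t+1 (rows) and frame t (columns).

    first_diff, second_diff: (M, N) cost tensors built from the first-class and
    second-class feature comparisons; alpha is an assumed weight for the weighted
    summation. Returns a dict mapping row index -> matched column index (-1 if none left).
    """
    cost = alpha * first_diff + (1.0 - alpha) * second_diff   # weighted summation
    matches, used = {}, set()
    for i in range(cost.shape[0]):
        order = torch.argsort(cost[i])                        # smallest summation value first
        j = next((int(c) for c in order if int(c) not in used), -1)
        matches[i] = j                                        # frame-(t+1) target i <-> frame-t target j
        if j >= 0:
            used.add(j)
    return matches

# usage: match_targets(torch.rand(3, 4), torch.rand(3, 4))
```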
In some embodiments, the extracting module 120 is specifically configured to perform residual processing on the target region to obtain a residual feature; and obtaining the first type of features based on the residual features.
Further, the extracting module 120 may be specifically configured to perform residual processing on the target region by using a first residual layer to obtain a first residual feature; perform residual processing on the first residual feature by using a second residual layer to obtain a second residual feature; perform residual processing on the second residual feature by using a third residual layer to obtain a third residual feature; perform residual processing on the third residual feature by using a fourth residual layer to obtain a fourth residual feature; and obtain the image feature based on the fourth residual feature.
In some embodiments, the extracting module 120 is specifically configured to perform residual error processing on the target region by using a first residual error sub-layer including N1 first residual error modules, so as to obtain a primary residual error feature; performing residual error processing on the primary residual error characteristics by using a second residual error sub-layer comprising N2 second residual error modules to obtain secondary residual error characteristics, wherein N1 is a positive integer; n2 is a positive integer; and combining the primary residual error characteristic and the secondary residual error characteristic to obtain the first residual error characteristic.
In some embodiments, the extracting module 120 is specifically configured to perform pooling processing on the fourth residual features to obtain pooled features; and obtaining the image characteristics based on the pooled characteristics.
In some embodiments, the extracting module 120 is specifically configured to fully connect the first pooled feature obtained by performing the first pooling on the fourth residual feature with the second residual feature to obtain a first feature; performing second pooling on the fourth residual characteristics to obtain second characteristics; and splicing the first characteristic and the second characteristic to obtain the image characteristic.
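A minimal sketch, assuming PyTorch, of the residual pipeline described above; the channel width, the module counts N1/N2, the pooling operators and the way the pooled features are fully connected are illustrative assumptions:

```python
import torch
import torch.nn as nn

def residual_module(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c, c, 3, padding=1))

class ResidualLayer(nn.Module):
    def __init__(self, channels, n_modules):
        super().__init__()
        self.blocks = nn.ModuleList([residual_module(channels) for _ in range(n_modules)])

    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x) + x)        # residual processing
        return x

class FirstClassFeatureExtractor(nn.Module):
    def __init__(self, c=64, n1=2, n2=2):
        super().__init__()
        self.layer1 = nn.Sequential(ResidualLayer(c, n1),   # first residual sub-layer (N1 modules)
                                    ResidualLayer(c, n2))   # second residual sub-layer (N2 modules)
        self.layer2 = ResidualLayer(c, 2)                   # second residual layer
        self.layer3 = ResidualLayer(c, 2)                   # third residual layer
        self.layer4 = ResidualLayer(c, 2)                   # fourth residual layer
        self.pool_a = nn.AdaptiveAvgPool2d(1)               # first pooling
        self.pool_b = nn.AdaptiveMaxPool2d(1)               # second pooling
        self.fc = nn.Linear(c + c, c)                       # full connection with the second residual feature

    def forward(self, target_region):
        r1 = self.layer1(target_region)
        r2 = self.layer2(r1)
        r4 = self.layer4(self.layer3(r2))
        first = self.fc(torch.cat([self.pool_a(r4).flatten(1),    # first pooled feature
                                   self.pool_a(r2).flatten(1)], dim=1))
        second = self.pool_b(r4).flatten(1)                       # second pooling of the fourth residual feature
        return torch.cat([first, second], dim=1)                  # spliced image feature

# usage: feat = FirstClassFeatureExtractor()(torch.randn(1, 64, 64, 64))
```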
In some embodiments, the obtaining module 130 is specifically configured to obtain two third class features from the previous and subsequent frame images respectively, where the third class of features encodes the spatial position information of key points within the same target and is capable of distinguishing different targets; and generate the second class of features based on the two third class features.
In some embodiments, the obtaining module 130 is specifically configured to splice the two third class features based on the fourth class of features to obtain a splicing feature, where the fourth class of features includes a confidence level indicating the key point targeted by the corresponding pixel; and generate the second class of features based on the splicing feature.
In some embodiments, the obtaining module 130 is specifically configured to perform convolution processing on the splicing feature by using a first convolution layer to obtain a first convolution feature; converting the first convolution characteristic by using an hourglass-shaped conversion network to obtain a conversion characteristic; and carrying out convolution processing on the converted features by utilizing a second convolution layer to obtain the second class of features.
In some embodiments, the obtaining module 130 is specifically configured to perform primary convolution processing on the conversion feature by using a first convolution sub-layer to obtain a primary convolution feature; perform secondary convolution processing on the primary convolution feature by using a second convolution sub-layer to obtain a secondary convolution feature; and perform tertiary convolution processing on the secondary convolution feature by using a third convolution sub-layer to obtain the second class of features.
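A minimal sketch, assuming PyTorch, of this branch; the intermediate channel width (256), the stand-in hourglass network and the input widths in the usage line are assumptions:

```python
import torch
import torch.nn as nn

class SecondClassFeatureHead(nn.Module):
    """Splices the two third-class feature maps with the fourth-class (confidence)
    maps, then applies a 1x1 convolution, an hourglass-style conversion network,
    and a second convolution layer made of three sub-layers."""
    def __init__(self, in_ch, mid_ch=256, out_ch=2):
        super().__init__()
        self.first_conv = nn.Conv2d(in_ch, mid_ch, kernel_size=1)       # first convolution layer
        self.hourglass = nn.Sequential(                                  # stand-in conversion network
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),
        )
        self.second_conv = nn.Sequential(                                # three convolution sub-layers
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),          # primary convolution
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(),          # secondary convolution
            nn.Conv2d(mid_ch, out_ch, 1),                                # tertiary -> second class of features
        )

    def forward(self, third_prev, third_next, fourth):
        spliced = torch.cat([third_prev, third_next, fourth], dim=1)     # splice along the feature dimension
        return self.second_conv(self.hourglass(self.first_conv(spliced)))

# usage (assumed widths): head = SecondClassFeatureHead(in_ch=14 + 14 + 14)
```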
Several specific examples are provided below in connection with any of the embodiments described above:
example 1:
the human body key point detection is the basis of video analysis, and has important application prospects in the fields of security and protection and action analysis.
This example provides two human body key point detection techniques, one is a solution based on a first type of feature (KE), and the other is an image processing method based on a second type of feature (SIE).
The first class feature map and the second class feature map have the same dimensions and can likewise be represented by a series of two-dimensional matrices at the output resolution, where each key point category corresponds to one two-dimensional matrix and corresponds one-to-one with the key points in spatial position.
The first kind of feature KE draws the embedded values of the key points of the same person closer and draws the embedded values of the key points of different persons farther during the training process.
KE contains mainly the appearance information of pixels near the keypoint. KE mainly relates to apparent information, is insensitive to space positions and can model long-distance node relation; however, relying on KE alone may erroneously bring together keypoints of different people at a distance due to lack of spatial constraints.
In the training process, the SIE of the second type returns each pixel value to the vector of the human body center, so that the SIE contains the position information of the human body center.
The SIE mainly comprises spatial position information, encodes the central position of the human body, and can effectively utilize the spatial position for clustering. However, for points (e.g., the top of the head, the ankle) far from the center of the human body, the coding error of the SIE is large, and the same person may be erroneously divided into a plurality of parts.
As shown in fig. 6, the present example provides a multi-task multi-branch key point detection model which can extract the first class of features and the second class of features simultaneously. It is dedicated to organically integrating the two bottom-up key point detection schemes; by combining their respective advantages, more efficient and more accurate human key point detection is achieved. When the key point detection model shown in fig. 6 performs key point detection, a third class feature map is also detected, which facilitates subsequently obtaining the final feature values of the key points (i.e., the final detection result shown in fig. 6).
Specifically, the present example proposes a multitasking and multi-branching image processing method, including: and combining the first class of features and the second class of features to predict key points of the multi-person human body.
The detection method can be used for detecting human key points of multiple persons, and can also be extended to a human key point tracking task. As shown in fig. 7, for each frame of image, a Gaussian response map of the human body key points, a first class feature map and a second class feature map are directly output through a multi-task bottom-up human body key point model. The feature extraction layer shown in fig. 7 includes a plurality of convolution sub-layers and a pooling layer; the number of convolution sub-layers in fig. 7 is 5. The pooling layer is a maximum pooling layer, i.e. a down-sampling layer that keeps the maximum value. The number of channels of the 1st convolution sub-layer is 64, the convolution kernel size is 7 × 7, and the convolution step size is 2; the number of channels of the 2nd convolution sub-layer is 128, the convolution kernel size is 3 × 3, and the convolution step size is 1; the number of channels of the 3rd convolution sub-layer is 128, the convolution kernel size is 7 × 7, and the convolution step size is 1; the number of channels of the 4th convolution sub-layer is 128, the convolution kernel size is 3 × 3, and the convolution step size is 1; the number of channels of the 5th convolution sub-layer is 256, the convolution kernel size is 3 × 3, and the convolution step size is 1. The feature extraction layer outputs a 256-dimensional feature map whose pixel values are the low-level features.
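A minimal sketch, assuming PyTorch, of the five convolution sub-layers and the maximum pooling layer just described; the padding values, the position of the pooling layer and the 3-channel RGB input are assumptions not stated in the text:

```python
import torch.nn as nn

feature_extraction_layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),    # sub-layer 1: 64 ch, 7x7, stride 2
    nn.MaxPool2d(kernel_size=2, stride=2),                              # max-value down-sampling layer
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # sub-layer 2: 128 ch, 3x3, stride 1
    nn.Conv2d(128, 128, kernel_size=7, stride=1, padding=3), nn.ReLU(), # sub-layer 3: 128 ch, 7x7, stride 1
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(), # sub-layer 4: 128 ch, 3x3, stride 1
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # sub-layer 5: 256 ch, 3x3, stride 1
)
# output: a 256-channel feature map holding the low-level features
```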
The feature conversion layer is formed by S conversion modules; each conversion module comprises an hourglass sub-network and a plurality of convolution sub-layers. The value of S may be any positive integer of 2 or more, for example 4. As shown in fig. 4, there are two convolution sub-layers; the number of channels of the two convolution sub-layers is 256, the convolution kernel size is 3 × 3, and the convolution step size is 1. After the deep learning model passes through a feature conversion layer formed by 4 conversion modules, a J-dimensional third class feature map, a J-dimensional first class feature map and a 2-dimensional second class feature map are output through a convolution sub-layer.
After feature splicing is carried out in the fusion layer, a J-dimensional Gaussian response map, a J-dimensional first class feature map and a 2-dimensional second class feature map are respectively output through a convolution with J + J + 2 channels, a convolution kernel size of 1 × 1 and a convolution step size of 1. The two types of embedded feature maps are likewise represented by a series of two-dimensional matrices, where each key point category corresponds to one two-dimensional matrix and corresponds to the Gaussian response maps in spatial position. For the first class feature map KE, the key points of the same person are required to have similar embedded values, while the key points of different persons are required to have different embedded values. The value of J may be determined by the number of key points contained in one target; for example, if the target is a human body, the number of key points contained in the human body may be 14 or 17, so J is 14 or 17.
For the spatial instance embedding graph, each pixel point regresses a coordinate vector to the center of the human body. The spatial example is embedded in the figure SIE and naturally contains the coordinate information of the center position of the human body.
The Gaussian response, the first class of characteristics and the second class of characteristics of the human body key points can be obtained through a bottom-up key point model based on a convolutional neural network.
In the third class of feature images, the value of each position is the confidence with which the point is predicted as the corresponding keypoint. The coordinates of the pixel points with the highest confidence level in the graph are the detection positions of the corresponding key points.
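A minimal sketch of this read-out step, assuming the third class feature maps are stored as a (J, H, W) tensor:

```python
import torch

def detect_keypoints(third_class_map):
    """third_class_map: (J, H, W) confidence maps, one channel per key point type.
    Returns (J, 2) integer (x, y) positions of the highest-confidence pixel per type."""
    J, H, W = third_class_map.shape
    flat_idx = third_class_map.view(J, -1).argmax(dim=1)            # per-channel peak
    ys = torch.div(flat_idx, W, rounding_mode="floor")
    xs = flat_idx % W
    return torch.stack([xs, ys], dim=1)

# usage: peaks = detect_keypoints(torch.rand(14, 96, 96))
```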
And then splicing the first class of feature maps and the second class of feature maps together along feature dimensions, and carrying out clustering on joint points together, wherein the joint points form the whole human body posture.
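A minimal sketch of this grouping step: each detected key point is described by its first class embedding spliced with the body center derived from the second class map, and points are greedily grouped by distance in that joint space. The greedy nearest-representative rule and the distance threshold are illustrative assumptions, not the document's exact clustering algorithm:

```python
import torch

def group_keypoints(peaks, ke_map, sie_map, threshold=5.0):
    """peaks: list of (joint_type, x, y) detected key points.
    ke_map: (J, H, W) first-class embedding map; sie_map: (2, H, W) second-class map.
    Returns a list of groups (human poses), each a list of indices into `peaks`."""
    descs = []
    for j, x, y in peaks:
        emb = ke_map[j, y, x].view(1)                                   # appearance embedding (KE)
        center = torch.tensor([float(x), float(y)]) + sie_map[:, y, x]  # regressed body center (SIE)
        descs.append(torch.cat([emb, center]))                          # splice along feature dimension
    groups, reps = [], []
    for i, d in enumerate(descs):
        dists = [torch.norm(d - r).item() for r in reps]
        if dists and min(dists) < threshold:
            groups[dists.index(min(dists))].append(i)                   # join the nearest existing person
        else:
            groups.append([i])                                          # start a new person
            reps.append(d)
    return groups
```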
Training loss function:

L_1 = \frac{1}{JK} \sum_{k=1}^{K} \sum_{j=1}^{J} \left( m(p_{j,k}) - \bar{m}_k \right)^2

In the above formula, L_1 denotes the loss function of the first class of features; J is the number of key point types and K is the number of targets contained in one image; m(p_{j,k}) is the embedded value of the first class of features at p_{j,k}, the position of the j-th key point of the k-th target; and

\bar{m}_k = \frac{1}{J} \sum_{j=1}^{J} m(p_{j,k})

is the mean of the embedded values of the first class of features of the k-th target.
The functional relationship of the second loss term can be written as:

L_2 = \frac{1}{JK} \sum_{k=1}^{K} \sum_{j=1}^{J} \left\| p_{j,k} - \bar{c}_k \right\|^2

In the above formula, L_2 is the second loss term; p_{j,k} is the center-point vector regressed at the j-th key point of the k-th target, i.e. the vector that the j-th key point of the k-th target regresses towards the center point of the k-th target; \bar{c}_k is the coordinate of the center point of the k-th target; J is the total number of key points contained in one target; and K is the number of targets contained in one image.
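The following sketch shows how losses of this shape could be computed from ground-truth key point positions and target centers; since the exact formulas in the original specification are given as figures, the code only follows the symbol definitions above and should be read as an assumption, not the patented formulation:

```python
import torch

def first_class_loss(ke_map, keypoints):
    """Pull-style loss over the first-class embeddings.
    ke_map: (J, H, W) first-class feature (embedding) map.
    keypoints: (K, J, 2) integer (x, y) positions p_{j,k} for K targets."""
    K, J, _ = keypoints.shape
    loss = 0.0
    for k in range(K):
        m = torch.stack([ke_map[j, keypoints[k, j, 1], keypoints[k, j, 0]] for j in range(J)])
        loss = loss + ((m - m.mean()) ** 2).mean()       # pull one target's embeddings together
    return loss / K

def second_class_loss(sie_map, keypoints, centers):
    """Center-regression loss over the second-class vectors.
    sie_map: (2, H, W) map; each pixel regresses a vector towards the target center.
    centers: (K, 2) ground-truth target center coordinates c_k."""
    K, J, _ = keypoints.shape
    loss = 0.0
    for k in range(K):
        for j in range(J):
            x, y = keypoints[k, j]
            pred_center = keypoints[k, j].float() + sie_map[:, y, x]   # regressed center p_{j,k}
            loss = loss + ((pred_center - centers[k].float()) ** 2).sum()
    return loss / (J * K)
```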
If only the method based on the first class of features is used: KE mainly involves apparent information, is insensitive to spatial position, and can model long-distance node relationships; however, relying on KE alone may erroneously group together key points of different people that are far apart, due to the lack of spatial constraints.
If only the method based on the second class of features is used: the SIE mainly contains spatial position information, encodes the central position of the human body, and can effectively use the spatial position for clustering. However, for points far from the center of the human body (e.g., the top of the head, the ankles), the coding error of the SIE is large, and the same person may be erroneously divided into several parts.
In summary, the present example proposes a bottom-up multi-tasking keypoint prediction model with simultaneous first and second class feature extraction.
And combining the first class of features and the second class of features to predict key points of the multi-person human body.
The example combines the first type of feature and the second type of feature to perform multi-person human key point prediction. Apparent information contained in the first type of features is combined with spatial position information of the second type of features, and therefore the detection precision of the key points can be effectively improved.
The key point prediction model provided by this example can be used to accurately predict the positions of human body key points in internet videos. The predicted key points can be used to analyze human behaviour types, and to add real-time special effects to different parts of the human body after those parts have been accurately located. In some scenarios, it can thereby be determined whether a product simultaneously adopts the first class of features and the second class of features to perform the key point detection or tracking task.
Example 2:
the example provides a two-branch time sequence feature extraction deep learning model, and a fourth class feature and a fifth class feature of a human body are extracted for human body tracking. In this example, the fourth kind of feature of the human body is one of the aforementioned fourth kind of features, and is called a fourth kind of feature of the human body since the tracked target is the human body. However, in a specific implementation process, the tracking of the target is not limited to a human body, and may be other moving objects, such as vehicles and/or ground mobile robots or low-altitude flying robots.
The fourth class of features of the human body contains overall appearance information of key point regions, and the time sequence example embedding contains time consistency constraint.
The fourth class of characteristics of the human body contains integral apparent information, does not depend on the spatial position information of the human body, and has good robustness for the rapid movement of the human body, the motion and the scaling of a camera. The fifth type of characteristics contain constraint information of time consistency, so that the motion is smoother, and the posture change and the shielding are more robust.
The example provides that the fourth kind of features of the human body and the time sequence example can be embedded, and the fourth kind of features and the time sequence example are combined to jointly perform the tracking task of the key points of the human body. The tracking performance of the model is greatly improved.
The deep learning model is used for tracking tasks of key points of a multi-person human body. As shown in fig. 8, the present example employs bottom-up prediction of key points of a human body in a single frame image based on spatial instance embedding. For each frame of image, the third class feature map, the second class feature map and the final posture detection result of each frame are obtained firstly.
Then, for two continuous frames of images, inputting the images into a double-branch time sequence feature extraction network to obtain a fourth class feature and a fifth class feature of the human body. And combining the outputs of the two, jointly predicting a time sequence matching result (tracking result) with the detection result of the previous frame, and realizing the online tracking of the key points of the human body.
As shown in fig. 9, a schematic network structure of the dual-branch timing feature extraction network is shown. Fig. 8 shows the fourth class (human body) feature extraction branch: its input is the low-level feature representation of the neural network; it extracts region-of-interest alignment (ROI-Align) features of the human body region according to the human body posture predicted from the single frame, and extracts higher-level features through a series of residual convolution operations.
And fusing the features of all the layers to obtain the fourth class of features of the human body.
For each human body box (one box corresponds to one of the aforementioned target regions), a vector of a predetermined number of dimensions (e.g., 3072) is obtained as the fourth class of features of the human body.
For the same person, the fourth class feature vectors are similar; for different persons, the features differ. The training method is similar to a human body re-identification algorithm: the fourth class features of the same person are required to be similar, while the features of different persons are required to differ.
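A minimal sketch of comparing such per-box vectors across frames; the cosine-similarity measure and the argmax matching rule are assumptions (the text only requires that vectors of the same person be similar):

```python
import torch

def appearance_affinity(feats_t, feats_t1):
    """feats_t: (M, D), feats_t1: (N, D) fourth-class human-body feature vectors
    (e.g. D = 3072). Returns an (N, M) cosine-similarity matrix: rows are frame t+1
    boxes, columns are frame t boxes; higher means more likely the same person."""
    a = torch.nn.functional.normalize(feats_t1, dim=1)
    b = torch.nn.functional.normalize(feats_t, dim=1)
    return a @ b.t()

# usage: for each frame-(t+1) box, pick the frame-t box with the highest similarity
sim = appearance_affinity(torch.randn(5, 3072), torch.randn(4, 3072))
best_match = sim.argmax(dim=1)   # (4,) indices into the frame-t boxes
```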
Fig. 4 shows the time sequence instance embedding branch. The feature map of low-level features extracted from two consecutive frames of images, the third class feature map and the second class feature map are spliced as input; after a convolution with 256 channels, a convolution kernel size of 1 × 1 and a convolution step size of 1, the result is fed into an hourglass model for processing, and the time sequence instance embedding is output through three convolution layers. The number of channels of the first two of the three convolution layers is 256, the convolution kernel size is 3 × 3, and the convolution step size is 1; the number of channels of the 3rd convolution layer is 2 × 2, the convolution kernel size is 1 × 1, and the convolution step size is 1.
The time sequence instance embedding is a bidirectional feature map. For the forward time sequence instance embedding, each pixel point on the t-th frame image regresses the coordinate of the corresponding human body center point in the t+1 th frame image. Conversely, for the backward time sequence instance embedding, each pixel point on the t+1 th frame image regresses the coordinate of the corresponding human body center point in the t-th frame image.
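A minimal sketch of reading out one direction of this bidirectional map; the tensor layout, the (x, y) ordering and the nearest-center association are assumptions:

```python
import torch

def predicted_center(tie_map, keypoint):
    """tie_map: (2, H, W) one direction of the time-sequence-instance-embedding map,
    holding per-pixel vectors towards a human-body center in the other frame.
    keypoint: (x, y) integer position of a detected key point.
    Returns the predicted center coordinate in the other frame."""
    x, y = keypoint
    return torch.tensor([float(x), float(y)]) + tie_map[:, y, x]

# usage: associate a frame-t key point with the frame-(t+1) person whose detected
# center is nearest to the forward prediction
forward_map = torch.randn(2, 96, 96)
centers_t1 = torch.tensor([[40.0, 50.0], [70.0, 20.0]])
pred = predicted_center(forward_map, (30, 45))
match = torch.cdist(pred[None], centers_t1).argmin().item()
```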
The example provides a two-branch time sequence feature extraction network, which extracts the fourth kind of features and the fifth kind of features of human body for tracking. The fourth class of features of the human body contains overall appearance information of key point regions, and the time sequence example embedding contains time consistency constraint.
The fourth class of characteristics of the human body contains integral apparent information, does not depend on spatial position information, and has good robustness for the rapid movement of the human body, the motion and the scaling of a camera. The fifth type of characteristics contain constraint information of time consistency, so that the motion is smoother, and the posture change and the shielding are more robust.
This example proposes combining the fourth class of features of the human body with the time sequence instance embedding to jointly perform the human body key point tracking task, which greatly improves the tracking performance of the model.
As shown in fig. 9, an embodiment of the present application provides a detection apparatus, including:
a memory for storing information;
and the processor is connected with the display and the memory respectively, and is used for implementing the image processing method provided by one or more of the above technical solutions by executing the computer executable instructions stored in the memory, for example, at least one of the image processing methods shown in fig. 1, fig. 2, fig. 3, fig. 4, fig. 5 and fig. 7.
The memory can be various types of memories, such as random access memory, read only memory, flash memory, and the like. The memory may be used for information storage, e.g., storing computer-executable instructions, etc. The computer-executable instructions may be various program instructions, such as object program instructions and/or source program instructions, and the like.
The processor may be various types of processors, such as a central processing unit, a microprocessor, a digital signal processor, a programmable array, an application-specific integrated circuit, or an image processor.
The processor may be connected to the memory via a bus. The bus may be an integrated circuit bus or the like.
In some embodiments, the terminal device may further include: a communication interface, which may include: a network interface, e.g., a local area network interface, a transceiver antenna, etc. The communication interface is also connected with the processor and can be used for information transceiving.
In some embodiments, the terminal device further comprises a human-computer interaction interface, for example, the human-computer interaction interface may comprise various input and output devices, such as a keyboard, a touch screen, and the like.
In some embodiments, the detection apparatus further comprises: a display that can display various prompts, captured facial images, and/or various interfaces.
The embodiment of the application provides a computer storage medium, wherein computer executable codes are stored in the computer storage medium; the computer executable code, when executed, is capable of implementing an image processing method provided by one or more of the foregoing technical solutions, for example, at least one of the image processing methods shown in fig. 1, fig. 2, fig. 3, fig. 4, fig. 5, and fig. 7.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a specific embodiment of the present example, but the protection scope of the present example is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present example disclosure, and all the changes or substitutions should be covered within the protection scope of the present example. Therefore, the protection scope of the present example shall be subject to the protection scope of the claims.

Claims (26)

1. An image processing method, comprising:
determining a target area of a target in the image;
extracting a first class of features from the target region, wherein the first class of features comprises image features of the target;
obtaining a second type of characteristics according to the distribution of two frames of images of the same target;
matching the first class of features of the t+1 th frame image with the first class of features of the t-th frame image to obtain first difference information; matching the second class of features of the t+1 th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the t-1 th frame image to obtain second difference information; and obtaining the correspondence between the target in the t+1 th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
2. The method of claim 1,
the second class of features includes:
a vector in which a key point of a target in the t frame image points to the center point of the target corresponding to the t +1 frame image, and/or,
and the key point of the target of the t +1 th frame image points to the vector of the central point of the target corresponding to the t-th frame image, and t is a natural number.
3. The method of claim 1,
the obtaining of the correspondence between the target in the t+1 th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information includes:
weighting and summing first difference information of a first target and second difference information of the first target in the t +1 th frame image to obtain a summation value;
and determining that the first target of the t+1 th frame image and the second target of the t-th frame image corresponding to the minimum summation value are the same target.
4. The method according to any one of claims 1 to 3, wherein the extracting the first type of features from the target region comprises:
carrying out residual error processing on the target area to obtain residual error characteristics;
and obtaining the first type of features based on the residual features.
5. The method of claim 4, wherein the residual processing the target region to obtain residual features comprises:
carrying out residual error processing on the target area by utilizing a first residual error layer to obtain first residual error characteristics;
performing residual error processing on the first residual error characteristic by using a second residual error layer to obtain a second residual error characteristic;
performing residual error processing on the second residual error characteristic by using a third residual error layer to obtain a third residual error characteristic;
performing residual error processing on the third residual error characteristic by using a fourth residual error layer to obtain a fourth residual error characteristic;
obtaining the first class of features based on the residual features, including:
and obtaining the image feature based on the fourth residual feature.
6. The method of claim 5, wherein the residual processing the target region using the first residual layer to obtain a first residual feature comprises:
performing residual error processing on the target area by using a first residual error sub-layer comprising N1 first residual error modules to obtain a primary residual error characteristic;
performing residual error processing on the primary residual error characteristics by using a second residual error sub-layer comprising N2 second residual error modules to obtain secondary residual error characteristics, wherein N1 is a positive integer; n2 is a positive integer;
and combining the primary residual error characteristic and the secondary residual error characteristic to obtain the first residual error characteristic.
7. The method according to claim 5 or 6, wherein the deriving the image feature based on the fourth residual feature comprises:
pooling the fourth residual characteristics to obtain pooled characteristics;
and obtaining the image characteristics based on the pooled characteristics.
8. The method of claim 7, wherein deriving the image feature based on the pooled features comprises:
fully connecting a first pooling characteristic obtained by performing first pooling on the fourth residual characteristic with the second residual characteristic to obtain a first characteristic;
performing second pooling on the fourth residual characteristics to obtain second characteristics;
and splicing the first characteristic and the second characteristic to obtain the image characteristic.
9. The method according to any one of claims 1 to 3, wherein the obtaining of the second type of features according to the distribution of the same target in two frames of images comprises:
respectively obtaining two third class features from the previous and subsequent frame images, wherein the third class of features encodes the spatial position information of key points within the same target and is capable of distinguishing different targets;
and generating the second class of features based on the two third class of features.
10. The method of claim 9, wherein generating the second class of features based on the two third classes of features comprises:
splicing the two third features to obtain splicing features based on the fourth features; wherein the fourth class of features includes: a confidence level indicating a keypoint targeted by the corresponding pixel;
and generating the second class of features based on the splicing features.
11. The method of claim 10, wherein generating the second class of features based on the stitching features comprises:
performing convolution processing on the splicing features by using a first convolution layer to obtain first convolution features;
converting the first convolution characteristic by using an hourglass-shaped conversion network to obtain a conversion characteristic;
and carrying out convolution processing on the converted features by utilizing a second convolution layer to obtain the second class of features.
12. The method of claim 11, wherein said convolving the transformed features with the second convolutional layer to obtain the second class of features comprises:
performing primary convolution processing on the conversion characteristic by using a first convolution sublayer to obtain a primary convolution characteristic;
performing secondary convolution processing on the primary convolution characteristic by using a second convolution sublayer to obtain a secondary convolution characteristic;
and carrying out tertiary convolution processing on the secondary convolution characteristics by utilizing a third convolution sublayer to obtain the second type characteristics.
13. An image processing apparatus characterized by comprising:
a determining module for determining a target region of a target in the image;
the extraction module is used for extracting a first class of features from the target area, wherein the first class of features comprise image features of the target;
the obtaining module is used for obtaining a second type of characteristics according to the distribution of two frames of images of the same target in front and back;
the tracking module is used for matching the first class of features of the t+1 th frame image with the first class of features of the t-th frame image to obtain first difference information; matching the second class of features of the t+1 th frame image relative to the t-th frame image with the second class of features of the t-th frame image relative to the t-1 th frame image to obtain second difference information; and obtaining the correspondence between the target in the t+1 th frame image and the corresponding target in the t-th frame image according to the first difference information and the second difference information.
14. The apparatus of claim 13,
the second class of features includes:
a vector in which a key point of a target in the t frame image points to the center point of the target corresponding to the t +1 frame image, and/or,
and the key point of the target of the t +1 th frame image points to the vector of the central point of the target corresponding to the t-th frame image, and t is a natural number.
15. The apparatus of claim 13,
the tracking module is specifically configured to perform weighted summation on the first difference information of a first target in the t+1 th frame image and the second difference information of the first target to obtain a summation value; and determine that the first target of the t+1 th frame image and the second target of the t-th frame image corresponding to the minimum summation value are the same target.
16. The apparatus according to any one of claims 13 to 15, wherein the extracting module is specifically configured to perform residual processing on the target region to obtain residual features; and obtaining the first type of features based on the residual features.
17. The apparatus according to claim 16, wherein the extracting module is specifically configured to perform residual processing on the target region by using a first residual layer to obtain a first residual feature; performing residual error processing on the first residual error characteristic by using a second residual error layer to obtain a second residual error characteristic; performing residual error processing on the second residual error characteristic by using a third residual error layer to obtain a third residual error characteristic; performing residual error processing on the third residual error characteristic by using a fourth residual error layer to obtain a fourth residual error characteristic; and obtaining the image feature based on the fourth residual feature.
18. The apparatus according to claim 17, wherein the extracting module is specifically configured to perform residual processing on the target region by using a first residual sub-layer including N1 first residual modules, so as to obtain a primary residual feature; performing residual error processing on the primary residual error characteristics by using a second residual error sub-layer comprising N2 second residual error modules to obtain secondary residual error characteristics, wherein N1 is a positive integer; n2 is a positive integer; and combining the primary residual error characteristic and the secondary residual error characteristic to obtain the first residual error characteristic.
19. The apparatus according to claim 17 or 18, wherein the extraction module is specifically configured to pool the fourth residual features to obtain pooled features; and obtaining the image characteristics based on the pooled characteristics.
20. The apparatus according to claim 19, wherein the extracting module is specifically configured to fully connect a first pooled feature obtained by performing a first pooling process on the fourth residual feature with the second residual feature to obtain a first feature; performing second pooling on the fourth residual characteristics to obtain second characteristics; and splicing the first characteristic and the second characteristic to obtain the image characteristic.
21. The apparatus according to any one of claims 13 to 15, wherein the obtaining module is specifically configured to obtain two third class features from the previous and subsequent frame images respectively, wherein the third class of features encodes the spatial position information of key points within the same target and is capable of distinguishing different targets; and generate the second class of features based on the two third class features.
22. The apparatus according to claim 21, wherein the obtaining module is specifically configured to splice two features of the third type to obtain a spliced feature based on a fourth type of feature, where the fourth type of feature includes: a confidence level indicating a keypoint targeted by the corresponding pixel; and generating the second class of features based on the splicing features.
23. The apparatus according to claim 22, wherein the obtaining module is specifically configured to perform convolution processing on the stitched feature using a first convolution layer to obtain a first convolution feature; converting the first convolution characteristic by using an hourglass-shaped conversion network to obtain a conversion characteristic; and carrying out convolution processing on the converted features by utilizing a second convolution layer to obtain the second class of features.
24. The apparatus according to claim 23, wherein the obtaining module is specifically configured to perform primary convolution processing on the converted feature by using a first convolution sublayer to obtain a primary convolution feature; perform secondary convolution processing on the primary convolution feature by using a second convolution sublayer to obtain a secondary convolution feature; and perform tertiary convolution processing on the secondary convolution feature by using a third convolution sublayer to obtain the second class of features.
25. A detection apparatus, the detection apparatus comprising:
a memory for storing computer executable instructions;
a processor coupled to the memory for implementing the method provided by any of claims 1 to 12 by executing the computer-executable instructions.
26. A computer storage medium having stored thereon computer-executable instructions; the computer executable instructions, when executed by a processor, are capable of implementing the method of any one of claims 1 to 12.
CN201910205458.3A 2019-03-18 2019-03-18 Image processing method and device, detection equipment and storage medium Active CN109934183B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910205458.3A CN109934183B (en) 2019-03-18 2019-03-18 Image processing method and device, detection equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910205458.3A CN109934183B (en) 2019-03-18 2019-03-18 Image processing method and device, detection equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109934183A CN109934183A (en) 2019-06-25
CN109934183B true CN109934183B (en) 2021-09-14

Family

ID=66987538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910205458.3A Active CN109934183B (en) 2019-03-18 2019-03-18 Image processing method and device, detection equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109934183B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827314B (en) * 2019-09-27 2020-10-23 深圳云天励飞技术有限公司 Single-target tracking method and related equipment
CN113763415B (en) * 2020-06-04 2024-03-08 北京达佳互联信息技术有限公司 Target tracking method, device, electronic equipment and storage medium
CN113673569A (en) * 2021-07-21 2021-11-19 浙江大华技术股份有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113807222B (en) * 2021-09-07 2023-06-27 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104702961B (en) * 2015-02-17 2018-06-01 南京邮电大学 Bit rate control method in a kind of distributed video coding
CN106023242A (en) * 2015-04-09 2016-10-12 广东易富网络科技有限公司 Anti-shielding multi-moving-vehicle tracking method based on quantum mean value drift
CN105141903B (en) * 2015-08-13 2018-06-19 中国科学院自动化研究所 A kind of method for carrying out target retrieval in video based on colouring information
WO2017206005A1 (en) * 2016-05-30 2017-12-07 中国石油大学(华东) System for recognizing postures of multiple people employing optical flow detection and body part model
CN106228111A (en) * 2016-07-08 2016-12-14 天津大学 A kind of method based on skeleton sequential extraction procedures key frame
CN106611157B (en) * 2016-11-17 2019-11-29 中国石油大学(华东) A kind of more people's gesture recognition methods detected based on light stream positioning and sliding window
CN108229490B (en) * 2017-02-23 2021-01-05 北京市商汤科技开发有限公司 Key point detection method, neural network training method, device and electronic equipment
CN107194317B (en) * 2017-04-24 2020-07-31 广州大学 Violent behavior detection method based on grid clustering analysis
CN107609497B (en) * 2017-08-31 2019-12-31 武汉世纪金桥安全技术有限公司 Real-time video face recognition method and system based on visual tracking technology
CN109343701A (en) * 2018-09-03 2019-02-15 电子科技大学 A kind of intelligent human-machine interaction method based on dynamic hand gesture recognition
CN109451308B (en) * 2018-11-29 2021-03-09 北京市商汤科技开发有限公司 Video compression processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109934183A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109948526B (en) Image processing method and device, detection equipment and storage medium
CN109919245B (en) Deep learning model training method and device, training equipment and storage medium
CN109934183B (en) Image processing method and device, detection equipment and storage medium
Laskar et al. Camera relocalization by computing pairwise relative poses using convolutional neural network
Zhou et al. To learn or not to learn: Visual localization from essential matrices
Ma et al. Transfusion: Cross-view fusion with transformer for 3d human pose estimation
Zhang et al. Progressive hard-mining network for monocular depth estimation
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
Ma et al. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation
CN111160375A (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111179419A (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111414840A (en) Gait recognition method, device, equipment and computer readable storage medium
US20220351405A1 (en) Pose determination method and device and non-transitory storage medium
Ge et al. TCNet: Co-salient object detection via parallel interaction of Transformers and CNNs
CN113674400A (en) Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
CN114724185A (en) Light-weight multi-person posture tracking method
Huang et al. Confidence-based 6D object pose estimation
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
Hutchcroft et al. CoVisPose: Co-visibility pose transformer for wide-baseline relative pose estimation in 360∘ indoor panoramas
Krahnstöver et al. Automatic acquisition and initialization of articulated models
Fang et al. Hand pose estimation on hybrid CNN-AE model
Oh et al. Local selective vision transformer for depth estimation using a compound eye camera
Wang et al. Contact-conditioned hand-held object reconstruction from single-view images
Xue et al. Towards Handling Sudden Changes in Feature Maps during Depth Estimation
Wang et al. CylinderTag: An Accurate and Flexible Marker for Cylinder-Shape Objects Pose Estimation Based on Projective Invariants

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant