CN112949458A - Training method of target tracking segmentation model and target tracking segmentation method and device - Google Patents

Training method of target tracking segmentation model and target tracking segmentation method and device

Info

Publication number
CN112949458A
Authority
CN
China
Prior art keywords
target
image
resolution
tracking
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110219025.0A
Other languages
Chinese (zh)
Other versions
CN112949458B (en)
Inventor
王伟农
戴宇荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110219025.0A priority Critical patent/CN112949458B/en
Publication of CN112949458A publication Critical patent/CN112949458A/en
Application granted granted Critical
Publication of CN112949458B publication Critical patent/CN112949458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a training method for a target tracking segmentation model, and a target tracking segmentation method and device. The training method comprises the following steps: acquiring image sample data, wherein each image sample datum comprises a target image, a tracking image and a target mask image, the target image being an image that includes a target to be tracked, the tracking image being an image on which tracking of the target is to be performed, and the target mask image being a pre-marked ground-truth mask image of the target; inputting the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions; processing each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution; fusing the plurality of target foreground probability maps to generate a final target foreground probability map; determining a loss function based on each target foreground probability map and the target mask image to obtain a plurality of loss functions; and adjusting parameters of the target tracking segmentation model according to the plurality of loss functions.

Description

Training method of target tracking segmentation model and target tracking segmentation method and device
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a method and an apparatus for training a target tracking segmentation model, and a method and an apparatus for target tracking segmentation.
Background
Target tracking and segmentation is one of the important technologies in the field of image processing and is widely applied in picture/video editing, film and television production, automatic surveillance and other fields. In target tracking, the size and position of a target object are given in the initial frame of a video sequence, and the size and position of the target object are predicted in subsequent frames. Target tracking and segmentation builds on target tracking and additionally provides a pixel-level segmentation result for the target object in the predicted subsequent frames. Traditional target tracking algorithms, mainly based on correlation filtering, can only give the position and size of the target object in subsequent frames. With the development of deep learning, deep neural networks have been applied to target tracking and target tracking segmentation; the high-level semantic features extracted by a deep neural network can distinguish the target object from the background more accurately in complex scenes, greatly improving the tracking and segmentation effect, so that deep-learning-based target tracking and segmentation has become one of the mainstream technologies. However, although deep-learning-based target tracking and segmentation achieves a good tracking effect, the edges of the target segmentation result are not fine enough, and there may even be a large number of over-segmentation problems, so that the segmentation effect for the target object is poor.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a target tracking segmentation model, and a target tracking segmentation method and apparatus, so as to at least solve the above-mentioned problems in the related art; the disclosure is, however, not required to solve any of the problems described above.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a target tracking segmentation model, including: acquiring image sample data, wherein each image sample datum comprises a target image, a tracking image and a target mask image, the target image being an image that includes a target to be tracked, the tracking image being an image on which tracking of the target is to be performed, and the target mask image being a pre-marked ground-truth mask image of the target; inputting the target image and the tracking image into the target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model; processing each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution; fusing the plurality of target foreground probability maps to generate a final target foreground probability map; determining a loss function based on the target mask image and each of the plurality of target foreground probability maps and the final target foreground probability map, thereby obtaining a plurality of loss functions; and adjusting parameters of the target tracking segmentation model according to the plurality of loss functions to train the target tracking segmentation model.
Optionally, the target tracking segmentation model may include a ResNet-50 layer, a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to generate a plurality of candidate window response features; wherein the plurality of feature maps may be generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate stage feature maps generated by the ResNet-50 layer for the target image.
Optionally, the first candidate window response feature may be selected by: performing target background classification on each candidate window response feature in the plurality of candidate window response features to obtain a confidence score of whether each candidate window response feature is a target region feature; and selecting the candidate window response characteristic with the highest confidence score as the first candidate window response characteristic.
Optionally, the plurality of feature maps may include a first-resolution feature map, a second-resolution feature map, a third-resolution feature map and a fourth-resolution feature map; the first-resolution feature map is obtained by performing deconvolution on the first candidate window response feature; the second-resolution feature map is obtained by fusing the first-resolution feature map with a third-stage feature map generated by the ResNet-50 layer for the target image, the first-resolution feature map and the third-stage feature map having the same resolution; the third-resolution feature map is obtained by fusing the second-resolution feature map with a second-stage feature map generated by the ResNet-50 layer for the target image, the second-resolution feature map and the second-stage feature map having the same resolution; the fourth-resolution feature map is obtained by fusing the third-resolution feature map with a first-stage feature map generated by the ResNet-50 layer for the target image, the third-resolution feature map and the first-stage feature map having the same resolution; the resolutions of the first-resolution, second-resolution, third-resolution and fourth-resolution feature maps increase in that order, and the resolution of the fourth-resolution feature map is the same as that of the target mask image.
Optionally, the separately performing processing on each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution may include: and performing at least one of convolution, Sigmoid activation and upsampling on each feature map in the plurality of feature maps to obtain a target foreground probability map with the same resolution as the target mask image.
Optionally, the fusing the plurality of target foreground probability maps to generate a final target foreground probability map may include: and performing splicing, convolution and Sigmoid activation on the plurality of target foreground probability graphs to obtain the final target foreground probability graph.
Alternatively, the same type of loss function, which is a binary cross-entropy loss function or a binary logistic regression loss function, may be employed for each target foreground probability map.
Alternatively, the loss function may be expressed as:

L = \sum_{n} \frac{1 + y_n}{2wh} \sum_{i=1}^{w} \sum_{j=1}^{h} -\left[ c_n^{ij} \log m_n^{ij} + \left(1 - c_n^{ij}\right) \log\left(1 - m_n^{ij}\right) \right]

wherein n indexes the candidate window response features; y_n ∈ {±1} is the binary label of the n-th candidate window response feature among the plurality of candidate window response features, where +1 represents a target region and -1 represents not a target region; w and h represent the width and height of the target mask image, respectively; i and j represent a pixel location (i, j) within the target mask image; c_n^{ij} is the label corresponding to pixel (i, j) in the target mask image, with a value of 1 or 0, where 1 indicates a target pixel and 0 indicates a non-target pixel; and m_n^{ij} is the value of pixel (i, j) in the target foreground probability map predicted based on the n-th candidate window response feature, with a value range of [0, 1], representing the probability that the pixel is a target pixel.
According to a second aspect of the embodiments of the present disclosure, there is provided a target tracking segmentation method, including: acquiring a target image and a tracking image, wherein the target image is an image that includes a target to be tracked and the tracking image is an image on which tracking of the target is to be performed; inputting the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model; processing each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution; fusing the plurality of target foreground probability maps to generate a final target foreground probability map; and obtaining a target tracking segmentation result based on the final target foreground probability map and the tracking image.
Optionally, the target tracking segmentation model may include a ResNet-50 layer, a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to generate a plurality of candidate window response features; wherein the plurality of feature maps may be generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate stage feature maps generated by the ResNet-50 layer for the target image.
Optionally, the first candidate window response feature may be selected by: performing target background classification on each candidate window response feature in the plurality of candidate window response features to obtain a confidence score of whether each candidate window response feature is a target region feature; and selecting the candidate window response characteristic with the highest confidence score as the first candidate window response characteristic.
Optionally, the plurality of feature maps may include a first-resolution feature map, a second-resolution feature map, a third-resolution feature map and a fourth-resolution feature map; the first-resolution feature map is obtained by performing deconvolution on the first candidate window response feature; the second-resolution feature map is obtained by fusing the first-resolution feature map with a third-stage feature map generated by the ResNet-50 layer for the target image, the first-resolution feature map and the third-stage feature map having the same resolution; the third-resolution feature map is obtained by fusing the second-resolution feature map with a second-stage feature map generated by the ResNet-50 layer for the target image, the second-resolution feature map and the second-stage feature map having the same resolution; the fourth-resolution feature map is obtained by fusing the third-resolution feature map with a first-stage feature map generated by the ResNet-50 layer for the target image, the third-resolution feature map and the first-stage feature map having the same resolution; the resolutions of the first-resolution, second-resolution, third-resolution and fourth-resolution feature maps increase in that order, and the resolution of the fourth-resolution feature map is the same as that of the target mask image.
Optionally, the separately performing processing on each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution may include: and performing at least one of convolution, Sigmoid activation and upsampling on each feature map in the plurality of feature maps to obtain a target foreground probability map with the same resolution as the target mask image.
Optionally, the fusing the plurality of target foreground probability maps to generate a final target foreground probability map may include: and performing splicing, convolution and Sigmoid activation on the plurality of target foreground probability graphs to obtain the final target foreground probability graph.
Optionally, the obtaining a target tracking segmentation result based on the final target foreground probability map and the tracking image may include: generating an estimated target mask image based on the final target foreground probability map; and determining and displaying a target segmentation area in the tracking image as a target tracking segmentation result based on the estimated target mask image.
Optionally, the target tracking segmentation model may be trained according to the training method of any one of claims 1 to 8.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a target tracking segmentation model, including: a sample acquisition unit configured to acquire image sample data, wherein each image sample datum comprises a target image, a tracking image and a target mask image, the target image being an image that includes a target to be tracked, the tracking image being an image on which tracking of the target is to be performed, and the target mask image being a pre-marked ground-truth mask image of the target; a feature map acquisition unit configured to input the target image and the tracking image into the target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model; a processing unit configured to process each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution; a fusion unit configured to fuse the plurality of target foreground probability maps to generate a final target foreground probability map; a determination unit configured to determine a loss function based on the target mask image and each of the plurality of target foreground probability maps and the final target foreground probability map, thereby obtaining a plurality of loss functions; and a training unit configured to adjust parameters of the target tracking segmentation model according to the plurality of loss functions to train the target tracking segmentation model.
Optionally, the target tracking segmentation model may include a ResNet-50 layer, a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to generate a plurality of candidate window response features; wherein the plurality of feature maps are generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate stage feature maps generated by the ResNet-50 layer for the target image.
Optionally, the first candidate window response feature may be selected by: performing target background classification on each candidate window response feature in the plurality of candidate window response features to obtain a confidence score of whether each candidate window response feature is a target region feature; and selecting the candidate window response characteristic with the highest confidence score as the first candidate window response characteristic.
Optionally, the plurality of feature maps may include a first-resolution feature map, a second-resolution feature map, a third-resolution feature map and a fourth-resolution feature map; the first-resolution feature map is obtained by performing deconvolution on the first candidate window response feature; the second-resolution feature map is obtained by fusing the first-resolution feature map with a third-stage feature map generated by the ResNet-50 layer for the target image, the first-resolution feature map and the third-stage feature map having the same resolution; the third-resolution feature map is obtained by fusing the second-resolution feature map with a second-stage feature map generated by the ResNet-50 layer for the target image, the second-resolution feature map and the second-stage feature map having the same resolution; the fourth-resolution feature map is obtained by fusing the third-resolution feature map with a first-stage feature map generated by the ResNet-50 layer for the target image, the third-resolution feature map and the first-stage feature map having the same resolution; the resolutions of the first-resolution, second-resolution, third-resolution and fourth-resolution feature maps increase in that order, and the resolution of the fourth-resolution feature map is the same as that of the target mask image.
Optionally, the processing unit may be configured to: and performing at least one of convolution, Sigmoid activation and upsampling on each feature map in the plurality of feature maps to obtain a target foreground probability map with the same resolution as the target mask image.
Optionally, the fusion unit may be configured to: and performing splicing, convolution and Sigmoid activation on the plurality of target foreground probability graphs to obtain the final target foreground probability graph.
Alternatively, the same type of loss function, which is a binary cross-entropy loss function or a binary logistic regression loss function, may be employed for each target foreground probability map.
Alternatively, the loss function may be expressed as:

L = \sum_{n} \frac{1 + y_n}{2wh} \sum_{i=1}^{w} \sum_{j=1}^{h} -\left[ c_n^{ij} \log m_n^{ij} + \left(1 - c_n^{ij}\right) \log\left(1 - m_n^{ij}\right) \right]

wherein n indexes the candidate window response features; y_n ∈ {±1} is the binary label of the n-th candidate window response feature among the plurality of candidate window response features, where +1 represents a target region and -1 represents not a target region; w and h represent the width and height of the target mask image, respectively; i and j represent a pixel location (i, j) within the target mask image; c_n^{ij} is the label corresponding to pixel (i, j) in the target mask image, with a value of 1 or 0, where 1 indicates a target pixel and 0 indicates a non-target pixel; and m_n^{ij} is the value of pixel (i, j) in the target foreground probability map predicted based on the n-th candidate window response feature, with a value range of [0, 1], representing the probability that the pixel is a target pixel.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a target tracking segmentation apparatus, including: an image acquisition unit configured to acquire a target image and a tracking image, wherein the target image is an image that includes a target to be tracked and the tracking image is an image on which tracking of the target is to be performed; a feature map acquisition unit configured to input the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model; a processing unit configured to process each of the plurality of feature maps to generate a plurality of target foreground probability maps with the same resolution; a fusion unit configured to fuse the plurality of target foreground probability maps to generate a final target foreground probability map; and a result obtaining unit configured to obtain a target tracking segmentation result based on the final target foreground probability map and the tracking image.
Optionally, the target tracking segmentation model may include a ResNet-50 layer, a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to generate a plurality of candidate window response features; wherein the plurality of feature maps are generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate stage feature maps generated by the ResNet-50 layer for the target image.
Optionally, the first candidate window response feature may be selected by: performing target background classification on each candidate window response feature in the plurality of candidate window response features to obtain a confidence score of whether each candidate window response feature is a target region feature; and selecting the candidate window response characteristic with the highest confidence score as the first candidate window response characteristic.
Optionally, the plurality of feature maps may include a first-resolution feature map, a second-resolution feature map, a third-resolution feature map and a fourth-resolution feature map; the first-resolution feature map is obtained by performing deconvolution on the first candidate window response feature; the second-resolution feature map is obtained by fusing the first-resolution feature map with a third-stage feature map generated by the ResNet-50 layer for the target image, the first-resolution feature map and the third-stage feature map having the same resolution; the third-resolution feature map is obtained by fusing the second-resolution feature map with a second-stage feature map generated by the ResNet-50 layer for the target image, the second-resolution feature map and the second-stage feature map having the same resolution; the fourth-resolution feature map is obtained by fusing the third-resolution feature map with a first-stage feature map generated by the ResNet-50 layer for the target image, the third-resolution feature map and the first-stage feature map having the same resolution; the resolutions of the first-resolution, second-resolution, third-resolution and fourth-resolution feature maps increase in that order, and the resolution of the fourth-resolution feature map is the same as that of the target mask image.
Optionally, the processing unit may be configured to: and performing at least one of convolution, Sigmoid activation and upsampling on each feature map in the plurality of feature maps to obtain a target foreground probability map with the same resolution as the target mask image.
Optionally, the fusion unit may be configured to: and performing splicing, convolution and Sigmoid activation on the plurality of target foreground probability graphs to obtain the final target foreground probability graph.
Optionally, the result obtaining unit may be configured to: generating an estimated target mask image based on the final target foreground probability map; and determining and displaying a target segmentation area in the tracking image as a target tracking segmentation result based on the estimated target mask image.
Optionally, the target tracking segmentation model may be trained according to the training method of any one of claims 1 to 8.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method or a target tracking segmentation method of a target tracking segmentation model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by at least one processor, cause the at least one processor to perform a training method or a target tracking segmentation method of a target tracking segmentation model according to the present disclosure.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement a training method or a target tracking segmentation method of a target tracking segmentation model according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the training method and the training device for the target tracking segmentation model and the target tracking segmentation method and device, information with different resolution characteristics can be fused by using a multi-scale and multi-loss function scheme, meanwhile, the network is deeply supervised by using the multi-loss function, and useful information in data is fully mined, so that the segmented result is more accurate, and the edge is finer.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram showing a basic network structure of the SiamMask model.
FIG. 2 is a schematic diagram showing the improved mask-generation branch of the SiamMask model.
Fig. 3 is a schematic diagram illustrating a target tracking segmentation scheme according to an exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a training method of a target tracking segmentation model according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating feature map fusion according to an exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a target tracking segmentation method according to an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating the comparison of the effect of the target tracking segmentation method according to the present disclosure with the existing SiamMask algorithm.
Fig. 8 is a block diagram illustrating a training apparatus of a target tracking segmentation model according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram illustrating a target tracking segmentation apparatus according to an exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram of an electronic device 1000 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In this case, the expression "at least one of the items" in the present disclosure means a case where three types of parallel expressions "any one of the items", "a combination of any plural ones of the items", and "the entirety of the items" are included. For example, "include at least one of a and B" includes the following three cases in parallel: (1) comprises A; (2) comprises B; (3) including a and B. For another example, "at least one of the first step and the second step is performed", which means that the following three cases are juxtaposed: (1) executing the step one; (2) executing the step two; (3) and executing the step one and the step two.
At present, deep-learning-based tracking segmentation methods generally adopt a Siamese (twin) network structure. A Siamese network can be used to compare the similarity of two pictures, so that in a tracking task a template picture is passed through the network to output features, and the region with the highest similarity response is then searched for over the whole of each subsequent picture. With the Siamese network structure, the position of the object (i.e., where the score response is highest) can be obtained by predicting a score map of the candidate region; the scale of the object is usually obtained by an image pyramid, or a more accurate rectangular box is obtained by rectangular-box regression and then further adjusted using a network-predicted aspect ratio. Furthermore, the SiamMask algorithm unifies visual target tracking and video target segmentation into one framework based on the Siamese structure: the algorithm only requires a rectangular box of the tracked target as input in the initialization stage, and then automatically gives the rectangular box of the tracked target and a pixel-level segmentation result in subsequent frames.
Fig. 1 is a schematic diagram showing a basic network structure of the SiamMask model.
Referring to FIG. 1, the inputs to the SiamMask model may include a target image and a tracking image. The target image is an image including a target to be tracked, for example, 127 × 3, where 127 × 127 represents the size of the target image and 3 represents the number of channels. The tracking image refers to an image on which tracking of the target is to be performed, for example, 255 × 3, and similarly, 255 × 255 denotes the size of the tracking image and 3 denotes the number of channels.
The target image and the tracking image can be passed through the Siamese network (f_θ) (e.g., with a ResNet-50 backbone) to obtain two feature maps, e.g., 15 × 256 and 31 × 256, respectively, wherein a down-sampling (adjust) layer is applied after the fourth stage of ResNet-50, so that a depth of 256 is output.
Subsequently, a cross correlation operation (cross correlation) of depth-separable convolution (depth-wise) may be performed on the two feature maps to obtain a plurality of candidate window response features (response of candidate windows, RoW). That is, the two feature maps may be subjected to correlation calculation channel by channel to obtain a candidate window response feature map with a constant number of channels, for example, 17 × 256, where the candidate window response feature map includes a plurality of candidate window response features RoW, for example, 1 × 256, that is, the candidate window response feature map may include 17 × 17 RoW.
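For illustration, the depth-wise cross-correlation can be implemented by treating the target-branch feature map as a per-channel convolution kernel that slides over the tracking-branch feature map. Below is a minimal PyTorch-style sketch under that reading; the function name and the grouped-convolution formulation are illustrative assumptions, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(tracking_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel cross-correlation of two feature maps (illustrative sketch).

    tracking_feat: (B, C, 31, 31) features of the tracking image
    target_feat:   (B, C, 15, 15) features of the target image, used as kernels
    returns:       (B, C, 17, 17) candidate window response features (RoW map)
    """
    b, c, h, w = tracking_feat.shape
    # Fold the batch dimension into the channel dimension so that grouped
    # convolution correlates each sample with its own target kernel.
    x = tracking_feat.reshape(1, b * c, h, w)
    kernel = target_feat.reshape(b * c, 1, target_feat.size(2), target_feat.size(3))
    out = F.conv2d(x, kernel, groups=b * c)              # (1, B*C, 17, 17)
    return out.reshape(b, c, out.size(2), out.size(3))

# Example: 31x31x256 tracking features correlated with 15x15x256 target features
resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 15, 15))
print(resp.shape)  # torch.Size([2, 256, 17, 17])
```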
Subsequently, each RoW may go through three branches that respectively perform target segmentation, target position regression (b_σ) and target/background classification (s_ψ). The target segmentation branch outputs, for example, 17 × 17 × (63 × 63); each RoW corresponds to a 1 × 1 × (63 × 63) feature, and each RoW can generate a mask image (mask) of the target of 127 × 1. The target position regression branch (b_σ) yields a target location box (box), e.g., 17 × 4k, where k is the number of anchors and 4 corresponds to the box coordinates. The target/background classification branch (s_ψ) yields a softmax two-class score for whether the corresponding location is the target, which can be understood as a confidence score of whether each RoW is a target region feature. However, for target segmentation, the mask corresponding to each RoW is generated as a 1 × (63 × 63) vector and then reshaped, so the resulting mask image is very coarse.
FIG. 2 is a schematic diagram showing the improved mask-generation branch of the SiamMask model.
Referring to fig. 2, the improved branch adopts a top-down (top-down) structure, which gradually upsamples a deep low-resolution feature map, fuses with shallow features, and then further upsamples until a finer mask result is finally obtained through output.
Specifically, ResNet-50 is used as f_θ in FIG. 2. ResNet-50 includes four stages of convolutional layers (e.g., conv1, conv2, conv3 and conv4); for the target image branch, for example, the outputs of the stages may be 61 × 64, 31 × 256, 15 × 512 and 15 × 1024 in order. The target image and the tracking image may each be down-sampled (adjust) once after the fourth stage of ResNet-50 to obtain two feature maps, e.g., 15 × 256 and 31 × 256, respectively. The two feature maps are subjected to a depth-wise separable-convolution cross-correlation operation to yield a plurality of RoW, e.g., 17 × 256, each RoW being 1 × 256. From the resulting plurality of RoW, the RoW with the highest target/background classification (s_ψ) score may be selected to generate the mask.
The selected RoW may be deconvolved (e.g., deconv, 32) to yield a first-resolution feature map, e.g., 15 × 32. Subsequently, the first-resolution feature map may be fused (U_2) with the third-stage feature map generated by ResNet-50 for the target image to obtain a second-resolution feature map, e.g., 31 × 16. The second-resolution feature map may then be fused (U_3) with the second-stage feature map generated by ResNet-50 for the target image to obtain a third-resolution feature map, e.g., 61 × 8. The third-resolution feature map may then be fused (U_4) with the first-stage feature map generated by ResNet-50 for the target image to obtain a fourth-resolution feature map, e.g., 127 × 4. Finally, the fourth-resolution feature map may be passed through a convolution (e.g., conv, 3 × 3, 1) and a Sigmoid activation function to obtain the mask, e.g., 127 × 1. Here, each fusion (U_2, U_3, U_4) is an operation that fuses the two input feature maps into one feature map of increased (e.g., approximately doubled) resolution; for example, the two input feature maps may each be passed through convolution and ReLU activation, then added element-wise and upsampled.
It can be seen that although the improved branch in fig. 2 combines multi-scale information, this top-down approach simultaneously introduces noise at the shallow feature level, which leads to the problems of relatively rough final mask result, inaccurate edge, and the like.
To address the problems of the SiamMask framework, such as inaccurate target segmentation results and imprecise edges, the present disclosure provides a training method for a target tracking segmentation model and a target tracking segmentation method. By introducing deep supervision, hidden layers at different depths of the network are supervised separately, so that the features at different layers fully mine the information in the training data; this reduces the noise of the feature layers at different depths to a certain extent, alleviates gradient vanishing and accelerates convergence. Finally, a foreground-map fusion module is used to generate the final segmentation mask of the object to be tracked.
Hereinafter, a training method and apparatus of a target tracking segmentation model and a target tracking segmentation method and apparatus according to the present disclosure will be described in detail with reference to fig. 3 to 10.
Fig. 3 is a schematic diagram illustrating a target tracking segmentation scheme according to an exemplary embodiment of the present disclosure.
Referring to FIG. 3, a target tracking segmentation scheme according to an exemplary embodiment of the present disclosure adds a multi-scale, multi-loss-function module on top of the improved branch of the SiamMask model, as shown in the solid-line box portion of FIG. 3.
Specifically, according to the target tracking segmentation scheme of the exemplary embodiment of the present disclosure, four target foreground probability maps, each of size 127 × 1, may be obtained from the four feature maps of different resolutions (D1, D2, D3, D4) generated in the improved branch. Since the resolution of feature maps D1, D2 and D3 is lower than that of the mask, each of D1, D2 and D3 passes through a convolutional layer (Conv), a Sigmoid activation function and an upsampling module to obtain a target foreground probability map with the same resolution as the mask. Since the resolution of feature map D4 is the same as the mask resolution, its target foreground probability map, i.e., the output of the improved branch of the SiamMask model shown in FIG. 2, is obtained by only a convolutional layer (Conv) and a Sigmoid activation function.
Then, the four target foreground probability maps are fused. For example, these target foreground probability maps may be concatenated along the channel dimension and then passed through a 1 × 1 convolutional layer and a Sigmoid activation function to generate the final target foreground probability map, e.g., 127 × 1. During training, the same loss function (Loss) is applied to each of the five target foreground probability maps, realizing deep supervision. Therefore, the multi-scale, multi-loss-function module provided by the disclosure can better fuse information from features of different resolutions; at the same time, the multiple loss functions deeply supervise the network and fully mine the useful information in the data, so that the segmentation result is more accurate and the edges are finer.
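As a concrete illustration of this multi-scale, multi-loss-function module, the following PyTorch-style sketch shows how each feature map could be converted into a full-resolution foreground probability map and how the maps could be fused. The class names and the 3 × 3 kernel of the per-scale head are assumptions made for the example; the disclosure itself only specifies convolution, Sigmoid activation and upsampling per scale, and concatenation plus a 1 × 1 convolution and Sigmoid for the fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleHead(nn.Module):
    """Turns one intermediate feature map into a 127x127 foreground probability map."""
    def __init__(self, in_channels: int, out_size: int = 127):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.out_size = out_size

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(self.conv(feat))               # convolution + Sigmoid
        if p.size(-1) != self.out_size:                  # upsample only if needed (D1-D3)
            p = F.interpolate(p, size=(self.out_size, self.out_size),
                              mode='bilinear', align_corners=False)
        return p                                         # (B, 1, 127, 127)

class ProbabilityFusion(nn.Module):
    """Concatenates the per-scale probability maps and fuses them with a 1x1 conv."""
    def __init__(self, num_maps: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(num_maps, 1, kernel_size=1)

    def forward(self, prob_maps):
        stacked = torch.cat(prob_maps, dim=1)            # (B, num_maps, 127, 127)
        return torch.sigmoid(self.fuse(stacked))         # final target foreground probability map

# Usage sketch: one head per feature map D1..D4 (assumed channel widths from FIG. 3),
# then fuse the four resulting 127x127 probability maps into the final one.
heads = nn.ModuleList([ScaleHead(c) for c in (32, 16, 8, 4)])
fusion = ProbabilityFusion(num_maps=4)
```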
Fig. 4 is a flowchart illustrating a training method of a target tracking segmentation model according to an exemplary embodiment of the present disclosure. The training method shown in fig. 4 can be applied to the improved branch structure of the SiamMask model.
Referring to fig. 4, in step 401, image sample data may be acquired, where each image sample data includes a target image, a tracking image and a target mask image, where the target image refers to an image including a target to be tracked, the tracking image refers to an image on which tracking of the target is to be performed, and the target mask image refers to a mask image of a pre-marked target truth value. Here, the image sample data may be obtained from a target segmentation sample database.
In step 402, the target image and the tracking image may be input into a target tracking segmentation model (i.e., a SiamMask model), resulting in a plurality of feature maps of different resolutions.
According to an example embodiment of the present disclosure, the target tracking segmentation model may include a ResNet-50 layer and a depth-wise separable-convolution cross-correlation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the cross-correlation layer to produce a plurality of candidate window response features (RoW). The plurality of feature maps of different resolutions may be generated based on a first candidate window response feature among the plurality of candidate window response features and the different intermediate-stage feature maps generated by the ResNet-50 layer for the target image. For example, the first candidate window response feature may be selected based on the scores obtained from the target/background classification (s_ψ) branch shown in FIG. 1. Specifically, target/background classification may be performed on each candidate window response feature to obtain a confidence score of whether it is a target region feature, and the candidate window response feature with the highest confidence score is selected as the first candidate window response feature.
According to an exemplary embodiment of the disclosure, the plurality of feature maps of different resolutions may include a first-resolution feature map, a second-resolution feature map, a third-resolution feature map and a fourth-resolution feature map (e.g., D1, D2, D3, D4 in FIG. 3), whose resolutions increase in that order (e.g., approximately doubling each time), the resolution of the fourth-resolution feature map being the same as that of the target mask image (e.g., 127 × 127). Specifically, the first-resolution feature map is obtained by performing deconvolution (e.g., deconv in FIG. 3) on the first candidate window response feature; the second-resolution feature map is obtained by fusing (U_2 in FIG. 3) the first-resolution feature map with the third-stage feature map generated by the ResNet-50 layer for the target image, the first-resolution feature map and the third-stage feature map having the same resolution (e.g., 15 × 15 in FIG. 3); the third-resolution feature map is obtained by fusing (U_3 in FIG. 3) the second-resolution feature map with the second-stage feature map generated by the ResNet-50 layer for the target image, the second-resolution feature map and the second-stage feature map having the same resolution (e.g., 31 × 31); and the fourth-resolution feature map is obtained by fusing (U_4 in FIG. 3) the third-resolution feature map with the first-stage feature map generated by the ResNet-50 layer for the target image, the third-resolution feature map and the first-stage feature map having the same resolution (e.g., 61 × 61).
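A sketch of how the four feature maps could be assembled is shown below, assuming PyTorch and the sizes given in FIG. 3 (a selected RoW of 1 × 1 × 256, the target-branch stage features, and channel widths 32/16/8/4). The candidate-window selection by highest classification score is included at the top; the fusion modules U_2, U_3 and U_4 are passed in as black boxes (see the fusion sketch following the description of FIG. 5 below), and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Builds the four feature maps D1..D4 from the selected RoW and the
    target-branch ResNet-50 stage features (illustrative sketch)."""
    def __init__(self, fuse_u2: nn.Module, fuse_u3: nn.Module, fuse_u4: nn.Module):
        super().__init__()
        # deconv: 1x1x256 RoW -> 15x15x32 first-resolution feature map
        self.deconv = nn.ConvTranspose2d(256, 32, kernel_size=15)
        self.fuse_u2, self.fuse_u3, self.fuse_u4 = fuse_u2, fuse_u3, fuse_u4

    def forward(self, rows, scores, stage1, stage2, stage3):
        # rows:   (B, 256, 17, 17) candidate window response features
        # scores: (B, 17, 17) confidence that each RoW is a target region feature
        # stage1/2/3: target-branch conv1/conv2/conv3 outputs (61x61x64, 31x31x256, 15x15x512)
        b = rows.size(0)
        flat = scores.view(b, -1).argmax(dim=1)                  # highest-confidence RoW per sample
        idx_h = torch.div(flat, 17, rounding_mode='floor')
        idx_w = flat % 17
        selected = rows[torch.arange(b), :, idx_h, idx_w]        # (B, 256)
        d1 = self.deconv(selected.view(b, 256, 1, 1))            # (B, 32, 15, 15)
        d2 = self.fuse_u2(d1, stage3)                            # (B, 16, 31, 31)
        d3 = self.fuse_u3(d2, stage2)                            # (B, 8, 61, 61)
        d4 = self.fuse_u4(d3, stage1)                            # (B, 4, 127, 127)
        return d1, d2, d3, d4
```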
According to an exemplary embodiment of the present disclosure, each fusion operation (U_2, U_3, U_4) merges its two input feature maps into one feature map of increased (e.g., approximately doubled) resolution; for example, the two input feature maps may each be passed through convolution and ReLU activation, and then subjected to element-wise addition and upsampling.
Fig. 5 is a schematic diagram illustrating feature map fusion according to an exemplary embodiment of the present disclosure. FIG. 5 shows an exemplary operation of U_3; U_2 and U_4 may perform the same or similar operations.
Referring to FIG. 5, the second-resolution feature map (31 × 16) may be passed sequentially through a convolution layer (e.g., conv, 3 × 3, 16), a ReLU activation function and a convolution layer (e.g., conv, 3 × 3, 16) to obtain a first transformed feature map (31 × 16). The second-stage feature map of the target image (31 × 256, conv2) may be passed sequentially through a convolution layer (e.g., conv, 3 × 3, 64), a ReLU activation function, a convolution layer (e.g., conv, 3 × 3, 32), a ReLU activation function and a convolution layer (e.g., conv, 3 × 3, 16) to obtain a second transformed feature map (31 × 16). The first transformed feature map (31 × 16) and the second transformed feature map (31 × 16) may be added element-wise (element-wise sum) to obtain a merged feature map, and the merged feature map may be passed through a ReLU activation function and an upsampling module (e.g., double upsampling (2x, up)) to obtain the third-resolution feature map (61 × 8).
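A minimal sketch of such a fusion module (in the style of U_3 in FIG. 5) follows. The two transformation branches and the element-wise sum follow the description above; the final channel-reducing convolution and the use of size-based interpolation (31 → 61 is not an exact 2× factor) are assumptions added so that the stated output shape works out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Fusion module sketch: transforms the lower-resolution decoder feature and the
    same-resolution backbone stage feature, adds them element-wise, then upsamples."""
    def __init__(self, dec_ch=16, stage_ch=256, out_ch=8, out_size=61):
        super().__init__()
        # branch for the second-resolution (decoder) feature map: conv-ReLU-conv
        self.dec_branch = nn.Sequential(
            nn.Conv2d(dec_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1))
        # branch for the conv2 stage feature map: conv-ReLU-conv-ReLU-conv
        self.stage_branch = nn.Sequential(
            nn.Conv2d(stage_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, 3, padding=1))
        # assumed channel-reducing convolution applied after the upsampling step
        self.post = nn.Conv2d(16, out_ch, 3, padding=1)
        self.out_size = out_size

    def forward(self, dec_feat, stage_feat):
        fused = self.dec_branch(dec_feat) + self.stage_branch(stage_feat)  # element-wise sum
        fused = F.relu(fused)
        fused = F.interpolate(fused, size=(self.out_size, self.out_size),
                              mode='bilinear', align_corners=False)        # ~2x upsampling
        return self.post(fused)                                            # e.g. (B, 8, 61, 61)
```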
Referring back to fig. 4, in step 403, processing may be performed on each of the plurality of feature maps, respectively, to generate a plurality of target foreground probability maps with the same resolution.
According to the exemplary embodiment of the disclosure, at least one of convolution, Sigmoid activation and upsampling is performed on each of the plurality of feature maps to obtain a target foreground probability map having the same resolution as the target mask image. For example, for the feature maps whose resolution is lower than that of the target mask image (e.g., D1, D2 and D3), convolution, Sigmoid activation and upsampling may be performed to obtain target foreground probability maps having the same resolution as the target mask image. For the feature map whose resolution equals that of the target mask image (e.g., D4), convolution and Sigmoid activation may be performed to obtain a target foreground probability map having the same resolution as the target mask image, i.e., the output of the improved branch of the SiamMask model shown in FIG. 2.
At step 404, the multiple target foreground probability maps may be fused to generate a final target foreground probability map.
According to the exemplary embodiment of the disclosure, splicing, convolution and Sigmoid activation can be performed on a plurality of target foreground probability maps to obtain a final target foreground probability map. For example, a plurality of target foreground probability maps may be spliced in the channel dimension, and the spliced probability maps are passed through a 1 × 1 convolutional layer and a Sigmoid activation function to generate a final target foreground probability map. Here, the resolution of the final target foreground probability map is also the same as the target mask image.
In step 405, a loss function may be determined based on each of the plurality of target foreground probability maps and the target mask image, and the final target foreground probability map, thereby obtaining a plurality of loss functions.
According to an exemplary embodiment of the present disclosure, the same type of loss function may be employed for each target foreground probability map, such as, but not limited to, a binary cross entropy loss function (binary cross entropy loss) or a binary logistic regression loss function (binary logistic regression loss).
According to an exemplary embodiment of the present disclosure, a binary logistic regression loss function may be employed, formulated as follows:

L = \sum_{n} \frac{1 + y_n}{2wh} \sum_{i=1}^{w} \sum_{j=1}^{h} -\left[ c_n^{ij} \log m_n^{ij} + \left(1 - c_n^{ij}\right) \log\left(1 - m_n^{ij}\right) \right]

wherein n indexes the candidate window response features (RoW); y_n ∈ {±1} is the binary label of the n-th candidate window response feature among the plurality of candidate window response features, where +1 represents a target region and -1 represents not a target region; w and h represent the width and height of the target mask image, respectively; i and j represent a pixel location (i, j) within the target mask image; c_n^{ij} is the label corresponding to pixel (i, j) in the target mask image, with a value of 1 or 0, where 1 indicates a target pixel and 0 indicates a non-target pixel; and m_n^{ij} is the value of pixel (i, j) in the target foreground probability map predicted based on the n-th candidate window response feature, with a value range of [0, 1], representing the probability that the pixel is a target pixel.
At step 406, parameters of the target tracking segmentation model may be adjusted according to the plurality of loss functions to train the target tracking segmentation model. That is, the values of the plurality of loss functions may be back-propagated through the target tracking segmentation model to adjust its parameters, so as to train the model.
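As an illustration of this deeply supervised training step, the sketch below assumes the binary-cross-entropy form of the per-pixel loss defined above and an unweighted sum of the five losses (the disclosure states only that the same type of loss function is applied to each target foreground probability map); the function and parameter names are illustrative, and the model is assumed to return the four per-scale probability maps plus the fused one.

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(prob_map, gt_mask, row_labels):
    """Per-pixel binary cross-entropy, counted only for positive candidate windows.

    prob_map:   (N, 1, H, W) predicted foreground probabilities m_n^{ij}
    gt_mask:    (N, 1, H, W) ground-truth labels c_n^{ij} in {0, 1}
    row_labels: (N,) y_n in {+1, -1}, +1 when the n-th RoW is a target region
    """
    per_pixel = F.binary_cross_entropy(prob_map, gt_mask, reduction='none')
    per_row = per_pixel.mean(dim=(1, 2, 3))              # average over the w*h pixels
    weight = (1 + row_labels.float()) / 2                # keeps only positive RoWs
    return (weight * per_row).sum()

def training_step(model, optimizer, target_img, tracking_img, gt_mask, row_labels):
    p1, p2, p3, p4, p_final = model(target_img, tracking_img)
    losses = [masked_bce_loss(p, gt_mask, row_labels) for p in (p1, p2, p3, p4, p_final)]
    total = sum(losses)                                   # deep supervision: one loss per map
    optimizer.zero_grad()
    total.backward()                                      # back-propagate through all heads
    optimizer.step()
    return total.item()
```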
Fig. 6 is a flowchart illustrating a target tracking segmentation method according to an exemplary embodiment of the present disclosure. The target tracking segmentation method illustrated in fig. 6 is performed based on a target tracking segmentation model (i.e., SiamMask model), where the target tracking segmentation model may be trained according to the training method of the present disclosure.
Referring to fig. 6, in step 601, a target image, which refers to an image including the target to be tracked, and a tracking image, which refers to an image on which tracking of the target is to be performed, may be acquired. For example, the target image may be acquired from the video on which target tracking segmentation is to be performed, e.g., by selecting and cropping the first frame of the video to obtain an image that includes the target to be tracked. The tracking image may likewise be acquired from the video on which target tracking segmentation is to be performed; e.g., the target tracking segmentation method shown in fig. 6 is performed for each subsequent frame of the video that is input in real time.
In step 602, the target image and the tracking image may be input into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, where the target tracking segmentation model is a SiamMask model.
According to an example embodiment of the present disclosure, the target tracking segmentation model may include a ResNet-50 layer and a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to produce a plurality of candidate window response features (RoW). A plurality of feature maps of different resolutions may be generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate-stage feature maps generated by the ResNet-50 layer for the target image. For example, the first candidate window response feature may be selected according to the score obtained from the target/background classification branch (s_ψ) shown in FIG. 1. Specifically, target/background classification may be performed on each candidate window response feature of the plurality of candidate window response features to obtain a confidence score indicating whether that candidate window response feature is a target region feature, and the candidate window response feature with the highest confidence score is selected as the first candidate window response feature.
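A minimal sketch of this selection step, assuming PyTorch tensors with illustrative shapes (the function and variable names are not from the disclosure):

```python
import torch

def select_first_row(row_features: torch.Tensor, cls_scores: torch.Tensor) -> torch.Tensor:
    """Picks the candidate window response feature (RoW) with the highest
    target/background confidence score.

    row_features: (N, C, h, w), one entry per candidate window.
    cls_scores:   (N,), confidence that each candidate window is the target region.
    """
    best = torch.argmax(cls_scores)   # index of the most confident candidate window
    return row_features[best]         # the selected first candidate window response feature
```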
According to an exemplary embodiment of the disclosure, the plurality of feature maps of different resolutions may include a first resolution feature map, a second resolution feature map, a third resolution feature map, and a fourth resolution feature map (e.g., D1, D2, D3, D4 in fig. 3), whose resolutions increase in sequence (e.g., approximately doubling at each step), the fourth resolution feature map having the same resolution as the target mask image (e.g., 127 × 127). Specifically, the first resolution feature map is a feature map obtained by performing deconvolution (e.g., deconv in fig. 3) on the first candidate window response feature; the second resolution feature map is a feature map obtained by fusing (e.g., U_2 in fig. 3) the first resolution feature map with the third-stage feature map generated by the ResNet-50 layer for the target image, wherein the first resolution feature map and the third-stage feature map have the same resolution (e.g., 15 × 15 in fig. 3); the third resolution feature map is a feature map obtained by fusing (e.g., U_3 in fig. 3) the second resolution feature map with the second-stage feature map generated by the ResNet-50 layer for the target image, wherein the second resolution feature map and the second-stage feature map have the same resolution (e.g., 31 × 31); and the fourth resolution feature map is a feature map obtained by fusing (e.g., U_4 in fig. 3) the third resolution feature map with the first-stage feature map generated by the ResNet-50 layer for the target image, wherein the third resolution feature map and the first-stage feature map have the same resolution (e.g., 61 × 61).
According to an exemplary embodiment of the present disclosure, a fusion (U)2、U3、U4) The operation may be to merge the two input feature maps into one feature map with increased resolution (for example, approximately doubled), for example, the two input feature maps may be subjected to convolution and ReLU activation, and then to element addition and upsampling, respectively.
In step 603, each of the plurality of feature maps may be processed separately to generate a plurality of target foreground probability maps having the same resolution.
According to an exemplary embodiment of the disclosure, at least one of convolution, Sigmoid activation and upsampling is performed on each of the plurality of feature maps to obtain a target foreground probability map having the same resolution as the target mask image. For example, for a feature map of the plurality of feature maps having a lower resolution than the target mask image (e.g., D1, D2 and D3), convolution, Sigmoid activation and upsampling may be performed to obtain a target foreground probability map having the same resolution as the target mask image. For a feature map of the plurality of feature maps having the same resolution as the target mask image (e.g., D4), convolution and Sigmoid activation may be performed to obtain a target foreground probability map having the same resolution as the target mask image, i.e., the output of the improved branch of the SiamMask model shown in fig. 2.
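One way this per-feature-map processing could be sketched in PyTorch (the layer sizes, the 127 × 127 mask resolution and the class name are assumptions taken from the example values above, not a disclosed implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilityHead(nn.Module):
    """Per-resolution head: reduce the feature map to one channel, apply Sigmoid,
    and upsample to the target mask resolution when the input is smaller."""

    def __init__(self, in_channels: int, mask_size=(127, 127)):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.mask_size = mask_size

    def forward(self, feature_map):
        p = torch.sigmoid(self.conv(feature_map))        # (B, 1, h, w), values in [0, 1]
        if p.shape[-2:] != self.mask_size:               # D1/D2/D3 need upsampling, D4 does not
            p = F.interpolate(p, size=self.mask_size,
                              mode='bilinear', align_corners=False)
        return p
```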
At step 604, the multiple target foreground probability maps may be fused to generate a final target foreground probability map.
According to the exemplary embodiment of the disclosure, splicing, convolution and Sigmoid activation can be performed on a plurality of target foreground probability maps to obtain a final target foreground probability map. For example, a plurality of target foreground probability maps may be spliced in the channel dimension, and the spliced probability maps are passed through a 1 × 1 convolutional layer and a Sigmoid activation function to generate a final target foreground probability map. Here, the resolution of the final target foreground probability map is also the same as the target mask image.
In step 605, a target tracking segmentation result may be obtained based on the final target foreground probability map and the tracking image.
According to an exemplary embodiment of the present disclosure, an estimated target mask image may be generated based on the final target foreground probability map; and determining and displaying a target segmentation area in the tracking image as a target tracking segmentation result based on the estimated target mask image. Here, each pixel value in the target foreground probability map is a probability value of whether the pixel is a target, and has a value range of [0,1], and each pixel value in the target mask image is a classification value of whether the pixel is a target, and has a value of 0 or 1, for example, but not limited thereto, 0 may represent a non-target, and 1 may represent a target. Therefore, the target foreground probability map may be converted into the target mask image according to a predetermined rule.
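A minimal sketch of one such predetermined rule, assuming NumPy arrays, a 0.5 threshold, and a green tint for display (all three are illustrative choices, not specified by the disclosure):

```python
import numpy as np

def probability_to_mask(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Converts the final foreground probability map into an estimated binary
    target mask (1 = target pixel, 0 = non-target pixel)."""
    return (prob_map >= threshold).astype(np.uint8)

def overlay_mask(tracking_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Highlights the segmented target region in the tracking image by tinting
    masked pixels; assumes an H x W x 3 uint8 image and an H x W mask."""
    out = tracking_image.copy()
    out[mask == 1] = (0.5 * out[mask == 1] + 0.5 * np.array([0, 255, 0])).astype(np.uint8)
    return out
```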
Fig. 7 is a schematic diagram illustrating the comparison of the effect of the target tracking segmentation method according to the present disclosure with the existing SiamMask algorithm.
Referring to fig. 7, fig. 7(a) shows the result of tracking segmentation using the existing SiamMask algorithm alone, and fig. 7(b) shows the result of tracking segmentation using the target tracking segmentation method of the present disclosure. It can be seen that the target tracking segmentation result of fig. 7(b) is more accurate and has finer edges than the target tracking segmentation result of fig. 7(a).
Fig. 8 is a block diagram illustrating a training apparatus of a target tracking segmentation model according to an exemplary embodiment of the present disclosure. The training apparatus shown in fig. 8 can be applied to the improved branch structure of the SiamMask model.
Referring to fig. 8, a training apparatus 800 of a target tracking segmentation model according to an exemplary embodiment of the present disclosure may include a sample acquisition unit 801, a feature map acquisition unit 802, a processing unit 803, a fusion unit 804, a determination unit 805, and a training unit 806.
The sample acquiring unit 801 may acquire image sample data, where each item of image sample data includes a target image, a tracking image and a target mask image, the target image being an image including a target to be tracked, the tracking image being an image on which tracking of the target is to be performed, and the target mask image being a mask image of a pre-marked target true value. Here, the image sample data may be obtained from a target segmentation sample database.
The feature map obtaining unit 802 may input the target image and the tracking image into a target tracking segmentation model (i.e., a SiamMask model), so as to obtain a plurality of feature maps with different resolutions.
According to an example embodiment of the present disclosure, the target tracking segmentation model may include a ResNet-50 layer and a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to produce a plurality of candidate window response features (RoW). A plurality of feature maps of different resolutions may be generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate-stage feature maps generated by the ResNet-50 layer for the target image. For example, the first candidate window response feature may be selected according to the score obtained from the target/background classification branch (s_ψ) shown in FIG. 1. Specifically, target/background classification may be performed on each candidate window response feature of the plurality of candidate window response features to obtain a confidence score indicating whether that candidate window response feature is a target region feature, and the candidate window response feature with the highest confidence score is selected as the first candidate window response feature.
According to an exemplary embodiment of the present disclosure, the plurality of feature maps of different resolutions may include a first resolution feature map, a second resolution feature map, a third resolution feature map, and a fourth resolution feature map (e.g., D1, D2, D3, D4 in fig. 3), whose resolutions increase in sequence (e.g., approximately doubling at each step), the fourth resolution feature map having the same resolution as the target mask image (e.g., 127 × 127). Specifically, the first resolution feature map is a feature map obtained by performing deconvolution (e.g., deconv in fig. 3) on the first candidate window response feature; the second resolution feature map is a feature map obtained by fusing (e.g., U_2 in fig. 3) the first resolution feature map with the third-stage feature map generated by the ResNet-50 layer for the target image, wherein the first resolution feature map and the third-stage feature map have the same resolution (e.g., 15 × 15 in fig. 3); the third resolution feature map is a feature map obtained by fusing (e.g., U_3 in fig. 3) the second resolution feature map with the second-stage feature map generated by the ResNet-50 layer for the target image, wherein the second resolution feature map and the second-stage feature map have the same resolution (e.g., 31 × 31); and the fourth resolution feature map is a feature map obtained by fusing (e.g., U_4 in fig. 3) the third resolution feature map with the first-stage feature map generated by the ResNet-50 layer for the target image, wherein the third resolution feature map and the first-stage feature map have the same resolution (e.g., 61 × 61).
According to an exemplary embodiment of the present disclosure, the fusion operation (U_2, U_3, U_4) may merge the two input feature maps into one feature map of increased resolution (e.g., approximately doubled). For example, each of the two input feature maps may be subjected to convolution and ReLU activation, and the results are then added element-wise and upsampled.
The processing unit 803 may perform processing on each of the plurality of feature maps, respectively, to generate a plurality of target foreground probability maps with the same resolution.
According to an exemplary embodiment of the present disclosure, the processing unit 803 performs at least one of convolution, Sigmoid activation and upsampling on each of the plurality of feature maps to obtain a target foreground probability map having the same resolution as the target mask image. For example, for a feature map of the plurality of feature maps having a lower resolution than the target mask image (e.g., D1, D2 and D3), the processing unit 803 may perform convolution, Sigmoid activation and upsampling to obtain a target foreground probability map having the same resolution as the target mask image. For a feature map of the plurality of feature maps having the same resolution as the target mask image (e.g., D4), the processing unit 803 may perform convolution and Sigmoid activation to obtain a target foreground probability map having the same resolution as the target mask image, i.e., the output of the improved branch of the SiamMask model shown in fig. 2.
The fusion unit 804 may perform fusion on the multiple target foreground probability maps to generate a final target foreground probability map.
According to an exemplary embodiment of the present disclosure, the fusion unit 804 may perform splicing, convolution and Sigmoid activation on the plurality of target foreground probability maps to obtain a final target foreground probability map. For example, the fusion unit 804 may splice a plurality of target foreground probability maps in the channel dimension, and pass the spliced probability maps through the 1 × 1 convolutional layer and the Sigmoid activation function to generate a final target foreground probability map. Here, the resolution of the final target foreground probability map is also the same as the target mask image.
The determining unit 805 may determine a loss function for each target foreground probability map of the plurality of target foreground probability maps and for the final target foreground probability map, in each case based on the target mask image, thereby obtaining a plurality of loss functions.
According to an exemplary embodiment of the present disclosure, the determining unit 805 may employ the same type of loss function for each target foreground probability map, such as, but not limited to, a binary cross entropy loss function (binary cross entropy loss) or a binary logistic regression loss function (binary logistic regression loss).
According to an example embodiment of the present disclosure, the determining unit 805 may employ a binary logistic regression loss function, whose formula is as follows:
$$L = \sum_{n} \frac{1+y_n}{2wh} \sum_{i=1}^{w} \sum_{j=1}^{h} \left[ -\,c_n^{ij} \log p_n^{ij} - \left(1-c_n^{ij}\right) \log\left(1-p_n^{ij}\right) \right]$$

wherein n indexes the candidate window response features (RoW); y_n ∈ {±1} is the binary label of the n-th candidate window response feature of the plurality of candidate window response features, where +1 indicates a target region and -1 indicates a non-target region, so that only candidate windows labeled +1 contribute to the loss; w and h represent the width and height of the target mask image, respectively; i and j represent a pixel location (i, j) within the target mask image; c_n^{ij} is the label corresponding to the pixel (i, j) in the target mask image, with a value of 1 or 0, where 1 indicates a target pixel and 0 indicates a non-target pixel; and p_n^{ij} is the value of the pixel (i, j) in the target foreground probability map predicted from the n-th candidate window response feature, which lies in the range [0, 1] and represents the probability of being a target pixel.
The training unit 806 may adjust parameters of the target tracking segmentation model according to the plurality of loss functions to train the target tracking segmentation model. That is, the training unit 806 may back-propagate the values of the plurality of loss functions through the target tracking segmentation model to adjust its parameters, thereby training the target tracking segmentation model.
Fig. 9 is a block diagram illustrating a target tracking segmentation apparatus according to an exemplary embodiment of the present disclosure. The target tracking segmentation apparatus illustrated in fig. 9 performs operations based on a target tracking segmentation model (i.e., SiamMask model), where the target tracking segmentation model may be trained according to the training method of the present disclosure.
Referring to fig. 9, a target tracking segmentation apparatus 900 according to an exemplary embodiment of the present disclosure may include an image acquisition unit 901, a feature map acquisition unit 902, a processing unit 903, a fusion unit 904, and a result acquisition unit 905.
The image acquisition unit 901 may acquire a target image to be tracked and a tracking image, where the target image refers to an image including the target to be tracked, and the tracking image refers to an image on which tracking of the target is to be performed. For example, the image acquisition unit 901 may acquire the target image from a video on which target tracking segmentation is to be performed, e.g., by selecting a first frame of the video and cropping it to obtain an image including the target to be tracked. The image acquisition unit 901 may likewise acquire the tracking image from the video on which target tracking segmentation is to be performed, e.g., for each subsequent frame of the video input in real time, the target tracking segmentation method shown in fig. 6 is performed.
The feature map obtaining unit 902 may input the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, where the target tracking segmentation model is a SiamMask model.
According to an example embodiment of the present disclosure, the target tracking segmentation model may include a ResNet-50 layer and a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to produce a plurality of candidate window response features (RoW). A plurality of feature maps of different resolutions may be generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate-stage feature maps generated by the ResNet-50 layer for the target image. For example, the first candidate window response feature may be selected according to the score obtained from the target/background classification branch (s_ψ) shown in FIG. 1. Specifically, target/background classification may be performed on each candidate window response feature of the plurality of candidate window response features to obtain a confidence score indicating whether that candidate window response feature is a target region feature, and the candidate window response feature with the highest confidence score is selected as the first candidate window response feature.
According to an exemplary embodiment of the disclosure, the plurality of feature maps of different resolutions may include a first resolution feature map, a second resolution feature map, a third resolution feature map, and a fourth resolution feature map (e.g., D1, D2, D3, D4 in fig. 3), whose resolutions increase in sequence (e.g., approximately doubling at each step), the fourth resolution feature map having the same resolution as the target mask image (e.g., 127 × 127). Specifically, the first resolution feature map is a feature map obtained by performing deconvolution (e.g., deconv in fig. 3) on the first candidate window response feature; the second resolution feature map is a feature map obtained by fusing (e.g., U_2 in fig. 3) the first resolution feature map with the third-stage feature map generated by the ResNet-50 layer for the target image, wherein the first resolution feature map and the third-stage feature map have the same resolution (e.g., 15 × 15 in fig. 3); the third resolution feature map is a feature map obtained by fusing (e.g., U_3 in fig. 3) the second resolution feature map with the second-stage feature map generated by the ResNet-50 layer for the target image, wherein the second resolution feature map and the second-stage feature map have the same resolution (e.g., 31 × 31); and the fourth resolution feature map is a feature map obtained by fusing (e.g., U_4 in fig. 3) the third resolution feature map with the first-stage feature map generated by the ResNet-50 layer for the target image, wherein the third resolution feature map and the first-stage feature map have the same resolution (e.g., 61 × 61).
According to an exemplary embodiment of the present disclosure, the fusion operation (U_2, U_3, U_4) may merge the two input feature maps into one feature map of increased resolution (e.g., approximately doubled). For example, each of the two input feature maps may be subjected to convolution and ReLU activation, and the results are then added element-wise and upsampled.
The processing unit 903 may perform processing on each of the plurality of feature maps, respectively, to generate a plurality of target foreground probability maps with the same resolution.
According to an exemplary embodiment of the present disclosure, the processing unit 903 performs at least one of convolution, Sigmoid activation and upsampling on each of the plurality of feature maps to obtain a target foreground probability map having the same resolution as the target mask image. For example, for a feature map of the plurality of feature maps having a lower resolution than the target mask image (e.g., D1, D2 and D3), the processing unit 903 may perform convolution, Sigmoid activation and upsampling to obtain a target foreground probability map having the same resolution as the target mask image. For a feature map of the plurality of feature maps having the same resolution as the target mask image (e.g., D4), the processing unit 903 may perform convolution and Sigmoid activation to obtain a target foreground probability map having the same resolution as the target mask image, i.e., the output of the improved branch of the SiamMask model shown in fig. 2.
The fusion unit 904 may perform fusion of the plurality of target foreground probability maps to generate a final target foreground probability map.
According to an exemplary embodiment of the disclosure, the fusion unit 904 may perform stitching, convolution and Sigmoid activation on a plurality of target foreground probability maps to obtain a final target foreground probability map. For example, a plurality of target foreground probability maps may be spliced in the channel dimension, and the spliced probability maps are passed through a 1 × 1 convolutional layer and a Sigmoid activation function to generate a final target foreground probability map. Here, the resolution of the final target foreground probability map is also the same as the target mask image.
The result obtaining unit 905 may obtain a target tracking segmentation result based on the final target foreground probability map and the tracking image.
According to an exemplary embodiment of the present disclosure, the result obtaining unit 905 may generate an estimated target mask image based on the final target foreground probability map; and determining and displaying a target segmentation area in the tracking image as a target tracking segmentation result based on the estimated target mask image. Here, each pixel value in the target foreground probability map is a probability value of whether the pixel is a target, and has a value range of [0,1], and each pixel value in the target mask image is a classification value of whether the pixel is a target, and has a value of 0 or 1, for example, but not limited thereto, 0 may represent a non-target, and 1 may represent a target. Therefore, the target foreground probability map may be converted into the target mask image according to a predetermined rule.
Fig. 10 is a block diagram of an electronic device 1000 according to an example embodiment of the present disclosure.
Referring to fig. 10, the electronic device 1000 comprises at least one memory 1001 and at least one processor 1002, the at least one memory 1001 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 1002, perform the training method of the target tracking segmentation model or the target tracking segmentation method according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 1000 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 1000 need not be a single electronic device, and can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) individually or in combination. The electronic device 1000 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 1000, the processor 1002 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 1002 may execute instructions or code stored in the memory 1001, wherein the memory 1001 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 1001 may be integrated with the processor 1002, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 1001 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1001 and the processor 1002 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., so that the processor 1002 can read files stored in the memory.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of the target tracking segmentation model or the target tracking segmentation method according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc memory, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files and data structures in a non-transitory manner and to provide them to a processor or computer such that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like; further, in one example, the computer program and any associated data, data files and data structures are distributed across networked computer systems such that the computer program and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, in which instructions are executable by a processor of a computer device to perform a training method of a target tracking segmentation model or a target tracking segmentation method according to an exemplary embodiment of the present disclosure.
According to the training method and apparatus for the target tracking segmentation model and the target tracking segmentation method and apparatus of the present disclosure, a multi-scale, multi-loss-function scheme is used to fuse information from features of different resolutions; at the same time, the multiple loss functions provide deep supervision of the network and fully exploit the useful information in the data, so that the segmentation result is more accurate and its edges are finer.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a target tracking segmentation model is characterized by comprising the following steps:
acquiring image sample data, wherein each image sample data comprises a target image, a tracking image and a target mask image, the target image is an image comprising a target to be tracked, the tracking image is an image on which tracking of the target is to be executed, and the target mask image is a mask image of a pre-marked target true value;
inputting the target image and the tracking image into the target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model;
processing each feature map in the plurality of feature maps respectively to generate a plurality of target foreground probability maps with the same resolution;
performing fusion on the plurality of target foreground probability maps to generate a final target foreground probability map;
determining a loss function for each target foreground probability map of the plurality of target foreground probability maps and for the final target foreground probability map, in each case based on the target mask image, thereby obtaining a plurality of loss functions;
adjusting parameters of the target tracking segmentation model according to the plurality of loss functions to train the target tracking segmentation model.
2. The training method of claim 1, wherein the target tracking segmentation model comprises a ResNet-50 layer, a depth separable convolved cross-correlation operation layer, wherein the target image and the tracking image pass through the ResNet-50 layer and the depth separable convolved cross-correlation operation layer to produce a plurality of candidate window response features;
wherein the plurality of feature maps are generated based on a first candidate window response feature of the plurality of candidate window response features and different intermediate stage feature maps generated by the ResNet-50 layer for the target image.
3. The training method of claim 2, wherein the first candidate window response characteristic is selected by:
performing target background classification on each candidate window response feature in the plurality of candidate window response features to obtain a confidence score of whether each candidate window response feature is a target region feature;
and selecting the candidate window response characteristic with the highest confidence score as the first candidate window response characteristic.
4. The training method of claim 2, wherein the plurality of feature maps comprises a first resolution feature map, a second resolution feature map, a third resolution feature map, and a fourth resolution feature map;
the first resolution feature map is a feature map obtained by performing deconvolution on the first candidate window response feature;
the second resolution feature map is a feature map obtained by fusing the first resolution feature map and a third-stage feature map generated by the ResNet-50 layer for the target image, wherein the resolution of the first resolution feature map is the same as that of the third-stage feature map;
the third resolution feature map is a feature map obtained by fusing the second resolution feature map and a second-stage feature map generated by the ResNet-50 layer for the target image, wherein the resolution of the second resolution feature map is the same as that of the second-stage feature map;
the fourth resolution feature map is a feature map obtained by fusing the third resolution feature map and a first-stage feature map generated by the ResNet-50 layer for the target image, wherein the resolution of the third resolution feature map is the same as that of the first-stage feature map;
the resolutions of the first resolution feature map, the second resolution feature map, the third resolution feature map and the fourth resolution feature map increase in sequence, and the resolution of the fourth resolution feature map is the same as that of the target mask image.
5. A target tracking segmentation method is characterized by comprising the following steps:
acquiring a target image to be tracked and a tracking image, wherein the target image is an image including a target to be tracked, and the tracking image is an image on which tracking of the target is to be performed;
inputting the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model;
processing each feature map in the plurality of feature maps respectively to generate a plurality of target foreground probability maps with the same resolution;
performing fusion on the plurality of target foreground probability maps to generate a final target foreground probability map;
and obtaining a target tracking segmentation result based on the final target foreground probability map and the tracking image.
6. A training device for a target tracking segmentation model is characterized by comprising:
a sample acquisition unit configured to: acquiring image sample data, wherein each image sample data comprises a target image, a tracking image and a target mask image, the target image is an image comprising a target to be tracked, the tracking image is an image on which tracking of the target is to be executed, and the target mask image is a mask image of a pre-marked target true value;
a feature map acquisition unit configured to: inputting the target image and the tracking image into the target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model;
a processing unit configured to: processing each feature map in the plurality of feature maps respectively to generate a plurality of target foreground probability maps with the same resolution;
a fusion unit configured to: performing fusion on the plurality of target foreground probability maps to generate a final target foreground probability map;
a determination unit configured to: determining a loss function for each target foreground probability map of the plurality of target foreground probability maps and for the final target foreground probability map, in each case based on the target mask image, thereby obtaining a plurality of loss functions;
a training unit configured to: adjusting parameters of the target tracking segmentation model according to the plurality of loss functions to train the target tracking segmentation model.
7. An object tracking segmentation apparatus, comprising:
an image acquisition unit configured to: acquiring a target image to be tracked and a tracking image, wherein the target image is an image including a target to be tracked, and the tracking image is an image on which tracking of the target is to be performed;
a feature map acquisition unit configured to: inputting the target image and the tracking image into a target tracking segmentation model to obtain a plurality of feature maps with different resolutions, wherein the target tracking segmentation model is a SiamMask model;
a processing unit configured to: processing each feature map in the plurality of feature maps respectively to generate a plurality of target foreground probability maps with the same resolution;
a fusion unit configured to: performing fusion on the plurality of target foreground probability maps to generate a final target foreground probability map;
a result obtaining unit configured to: and obtaining a target tracking segmentation result based on the final target foreground probability map and the tracking image.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a method of training the target tracking segmentation model of any one of claims 1 to 4 or a method of target tracking segmentation of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform a method of training a target tracking segmentation model as claimed in any one of claims 1 to 4 or a method of target tracking segmentation as claimed in claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement a training method of a target tracking segmentation model according to any one of claims 1 to 4 or a target tracking segmentation method according to claim 5.
CN202110219025.0A 2021-02-26 2021-02-26 Training method of target tracking segmentation model, target tracking segmentation method and device Active CN112949458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110219025.0A CN112949458B (en) 2021-02-26 2021-02-26 Training method of target tracking segmentation model, target tracking segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110219025.0A CN112949458B (en) 2021-02-26 2021-02-26 Training method of target tracking segmentation model, target tracking segmentation method and device

Publications (2)

Publication Number Publication Date
CN112949458A true CN112949458A (en) 2021-06-11
CN112949458B CN112949458B (en) 2024-07-12

Family

ID=76246552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110219025.0A Active CN112949458B (en) 2021-02-26 2021-02-26 Training method of target tracking segmentation model, target tracking segmentation method and device

Country Status (1)

Country Link
CN (1) CN112949458B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109102025A (en) * 2018-08-15 2018-12-28 电子科技大学 Pedestrian based on deep learning combined optimization recognition methods again
CN110135500A (en) * 2019-05-17 2019-08-16 南京大学 Method for tracking target under a kind of more scenes based on adaptive depth characteristic filter
US20200372660A1 (en) * 2019-05-21 2020-11-26 Beihang University Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
CN110544268A (en) * 2019-07-29 2019-12-06 燕山大学 Multi-target tracking method based on structured light and SiamMask network
CN111415373A (en) * 2020-03-20 2020-07-14 北京以萨技术股份有限公司 Target tracking and segmenting method, system and medium based on twin convolutional network
CN112037254A (en) * 2020-08-11 2020-12-04 浙江大华技术股份有限公司 Target tracking method and related device
CN112258558A (en) * 2020-10-23 2021-01-22 复旦大学 Target tracking method based on multi-scale twin network, electronic device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIANG WANG et al.: "Fast Online Object Tracking and Segmentation: A Unifying Approach", arXiv:1812.05050v2 [cs.CV], pages 1-13 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188528A (en) * 2023-01-10 2023-05-30 中国人民解放军军事科学院国防科技创新研究院 RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism
CN116188528B (en) * 2023-01-10 2024-03-15 中国人民解放军军事科学院国防科技创新研究院 RGBT unmanned aerial vehicle target tracking method and system based on multi-stage attention mechanism

Also Published As

Publication number Publication date
CN112949458B (en) 2024-07-12

Similar Documents

Publication Publication Date Title
US11823392B2 (en) Segmenting generic foreground objects in images and videos
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
Fu et al. Deepside: A general deep framework for salient object detection
Li et al. Semantic-aware grad-gan for virtual-to-real urban scene adaption
US20200117906A1 (en) Space-time memory network for locating target object in video content
EP2984602B1 (en) Image labeling using geodesic features
Mac Aodha et al. Learning a confidence measure for optical flow
Kondapally et al. Towards a Transitional Weather Scene Recognition Approach for Autonomous Vehicles
EP3815043A1 (en) Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks
Shen et al. Pcw-net: Pyramid combination and warping cost volume for stereo matching
Lin et al. RefineU-Net: Improved U-Net with progressive global feedbacks and residual attention guided local refinement for medical image segmentation
KR102070956B1 (en) Apparatus and method for processing image
CN112241784A (en) Training generative model and discriminant model
Meunier et al. EM-driven unsupervised learning for efficient motion segmentation
WO2024060416A1 (en) End-to-end weakly supervised semantic segmentation and labeling method for pathological image
Wang et al. Context-aware spatio-recurrent curvilinear structure segmentation
Liu et al. Multi-scale iterative refinement network for RGB-D salient object detection
Qiu et al. A simple saliency detection approach via automatic top-down feature fusion
Wang et al. DHUnet: Dual-branch hierarchical global–local fusion network for whole slide image segmentation
US11995889B2 (en) Cognitive generation of HTML pages based on video content
CN112949458A (en) Training method of target tracking segmentation model and target tracking segmentation method and device
Zhang et al. Interactive spatio-temporal feature learning network for video foreground detection
Ding et al. A dual-stream framework guided by adaptive gaussian maps for interactive image segmentation
Fang et al. Developing a feature decoder network with low-to-high hierarchies to improve edge detection
CN113096104A (en) Training method and device of target segmentation model and target segmentation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant