CN116843897A - Training method of segmentation model, image segmentation method, device, equipment and medium - Google Patents

Training method of segmentation model, image segmentation method, device, equipment and medium

Info

Publication number
CN116843897A
CN116843897A (application number CN202310722042.5A)
Authority
CN
China
Prior art keywords
feature
segmentation model
segmentation
network
distillation loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310722042.5A
Other languages
Chinese (zh)
Inventor
沈智勇
赵一麟
陆勤
龚建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310722042.5A
Publication of CN116843897A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method for a segmentation model, relates to the field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to smart city scenarios. The specific implementation scheme is as follows: inputting a first sample image into a first segmentation model to obtain gradient information of a first intermediate network of the first segmentation model; obtaining a first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and a first intermediate feature output by the first intermediate network; determining a first distillation loss according to at least one second intermediate mask feature and at least one first intermediate mask feature, wherein the at least one second intermediate mask feature is obtained according to gradient information of a second intermediate network and at least one second intermediate feature output by the second intermediate network; and training the first segmentation model based on the first distillation loss. The disclosure also provides an image segmentation method, an image segmentation apparatus, an electronic device, and a storage medium.

Description

Training method of segmentation model, image segmentation method, device, equipment and medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the fields of computer vision and deep learning, and can be applied to smart city scenarios. More specifically, the present disclosure provides a training method for a segmentation model, an image segmentation method, an apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, the application scenarios of deep learning models continue to grow. The performance of an image segmentation model can be improved based on model distillation techniques.
Disclosure of Invention
The disclosure provides a training method of a segmentation model, an image segmentation method, an image segmentation device, equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of a segmentation model, the method including: inputting the first sample image into a first segmentation model to obtain gradient information of a first intermediate network of the first segmentation model; obtaining at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network; determining a first distillation loss according to at least one second intermediate mask feature and at least one first intermediate mask feature, wherein the at least one second intermediate mask feature corresponds to a second intermediate network of a second segmentation model, the at least one second intermediate mask feature is obtained according to gradient information of the second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into the second segmentation model, and the parameter quantity of the second segmentation model is larger than the parameter quantity of the first segmentation model; and training a first segmentation model based on the first distillation loss.
According to another aspect of the present disclosure, there is provided an image segmentation method including: inputting a target image into a first segmentation model to obtain a target segmentation result, wherein the target segmentation result comprises a target mask of a target instance in the target image and a category of the target instance, and the first segmentation model is trained by using the method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a training apparatus of a segmentation model, the apparatus including: the first acquisition module is used for inputting the first sample image into the first segmentation model to obtain gradient information of a first intermediate network of the first segmentation model; the second obtaining module is used for obtaining at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network; a first determining module, configured to determine a first distillation loss according to at least one second intermediate mask feature and at least one first intermediate mask feature, where the at least one second intermediate mask feature corresponds to a second intermediate network of a second segmentation model, the at least one second intermediate mask feature is obtained according to gradient information of the second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into the second segmentation model, and a parameter amount of the second segmentation model is greater than a parameter amount of the first segmentation model; and a training module for training the first segmentation model based on the first distillation loss.
According to another aspect of the present disclosure, there is provided an image segmentation apparatus including: and the third obtaining module is used for inputting the target image into the first segmentation model to obtain a target segmentation result, wherein the target segmentation result comprises a target mask of a target instance in the target image and a category of the target instance, and the first segmentation model is trained by using the device provided by the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic illustration of a segmentation model according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of training a segmentation model according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a first distillation loss according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of determining a second distillation loss and an attention distillation loss according to one embodiment of the disclosure;
FIG. 5 is a schematic diagram of an attention mechanism according to one embodiment of the present disclosure;
FIG. 6A is a schematic diagram of a first output network according to one embodiment of the present disclosure;
FIG. 6B is a schematic diagram of determining distillation loss according to one embodiment of the disclosure;
FIG. 7 is a flow chart of an image segmentation method according to another embodiment of the present disclosure;
FIG. 8 is a block diagram of a training apparatus for a segmentation model according to one embodiment of the present disclosure;
FIG. 9 is a block diagram of an image segmentation apparatus according to one embodiment of the present disclosure; and
fig. 10 is a block diagram of an electronic device to which an image segmentation method may be applied according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A large model may have a large number of parameters and the ability to retain prior knowledge about images. The application fields of large models keep expanding, and a large model can quickly improve the accuracy of image processing results. However, a large model has a large number of parameters and a slow response speed, and is therefore difficult to deploy in scenarios with high requirements on processing speed. Thus, based on knowledge distillation techniques, a small model can be trained from the knowledge learned by a large model, achieving model compression without losing too much model performance.
Image segmentation techniques may include semantic segmentation techniques and instance segmentation techniques. The purposes of instance segmentation include distinguishing all objects in an image and determining the instance to which each pixel in the image corresponds. The classes of instances may be preset. For example, when an image includes a plurality of human instances, an instance may correspond to a single person or to several people standing close to each other (i.e., a crowd). Compared with semantic segmentation, instance segmentation is more difficult to implement: the categories of objects in an image belong to a fixed set of semantic classes, but the number of instances is variable. Semantic segmentation may be achieved by dense, pixel-by-pixel classification techniques; instance segmentation, however, is difficult to achieve with this classification technique.
In some embodiments, instance segmentation may be performed in a top-down manner. For example, object detection may be performed on the image to obtain detection boxes of the objects in the image. The partial image within each detection box is then segmented to obtain a mask of the object. However, top-down instance segmentation can only produce accurate results when the object detection results themselves are accurate.
In some embodiments, instance segmentation may instead be performed in a bottom-up manner. For example, the pixels in an image may be mapped to embedding vectors such that the distance between pixels of the same instance decreases and the distance between pixels of different instances increases. After the embedding process, an aggregation process may be performed to obtain the instances in the image. However, the bottom-up approach places high accuracy requirements on both the embedding process and the aggregation process.
A deep learning model may be used to implement instance segmentation. For example, a multi-stage instance segmentation model that first detects and then segments may be trained based on the top-down pattern. For another example, a single-stage instance segmentation model may be trained to directly determine the instance corresponding to each pixel in the image. A single-stage instance segmentation model may have a faster inference speed. However, knowledge distillation techniques are difficult to apply to improving the performance of single-stage instance segmentation models.
A single-stage instance segmentation model with a larger number of parameters can be used as a teacher model, and a single-stage instance segmentation model with a smaller number of parameters can be used as a student model. A sample image is input into the teacher model and the student model respectively to obtain two segmentation results. A segmentation result may indicate whether a pixel corresponds to an instance and the class of the pixel. Next, the parameters of the student model may be adjusted to reduce the difference between the two segmentation results. However, this distillation scheme only aligns the final outputs of the two models; it is too simple to train the student model efficiently.
Further, in an instance segmentation scenario, the backbone network (Backbone) of the teacher model may include a plurality of Transformer modules (Transformer Blocks). In order to reduce the number of parameters, the backbone network of the student model may instead comprise a convolutional neural network (Convolutional Neural Network, CNN). Because the backbone networks of the teacher model and the student model differ in structure, it is difficult to improve the performance of the student model by using the teacher model.
In order to improve the performance of the instance segmentation model, the present disclosure provides a training method for the segmentation model, which will be described below.
FIG. 1 is a schematic illustration of a segmentation model according to one embodiment of the present disclosure.
In some embodiments, the segmentation model may include a backbone network (Backbone), an intermediate network (Neck), and an output network (Head). As shown in fig. 1, the first segmentation model 110 may include a first backbone network 111, a first intermediate network 112, and a first output network 113. The second segmentation model 120 may include a second backbone network 121, a second intermediate network 122, and a second output network 123. For example, the first backbone network 111 may be a convolutional neural network, while the second backbone network 121 may include a plurality of Transformer modules. It will be appreciated that the architecture of the first backbone network 111 is different from the architecture of the second backbone network 121. It is also understood that the first segmentation model 110 may be used as a student model and the second segmentation model 120 may serve as a teacher model. The number of parameters of the second segmentation model 120 may be greater than the number of parameters of the first segmentation model 110.
In the embodiment of the disclosure, the first sample image may be input into the first backbone network to obtain a first backbone feature. The first backbone feature is input into the first intermediate network to obtain at least one first intermediate feature. From the at least one first intermediate feature, a first input feature of the first output network may be determined. The first input feature is input into the first output network to obtain a first segmentation result. As shown in fig. 1, the first sample image 1001 is input into the first backbone network 111 to obtain the first backbone feature. The first backbone feature is input into the first intermediate network 112 to obtain at least one first intermediate feature. The first segmentation result may be obtained by inputting the at least one first intermediate feature as the first input feature into the first output network 113. From the first segmentation result and the first label of the first sample image, a first segmentation loss of the first segmentation model may be determined.
In the embodiment of the disclosure, the second sample image may be input into the second backbone network to obtain a second backbone feature. The second backbone feature is input into the second intermediate network to obtain at least one second intermediate feature. A second input feature of the second output network may be determined from the at least one second intermediate feature. The second input feature is input into the second output network to obtain a second segmentation result. As shown in fig. 1, the second sample image 1002 is input into the second backbone network 121 to obtain the second backbone feature. The second backbone feature is input into the second intermediate network 122 to obtain at least one second intermediate feature. The second segmentation result may be obtained by inputting the at least one second intermediate feature as the second input feature into the second output network 123. From the second segmentation result and the second label of the second sample image, a second segmentation loss of the second segmentation model may be determined.
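The data flow described above can be illustrated with a minimal, hypothetical PyTorch-style sketch; the class and argument names below are assumptions introduced for illustration and are not taken from the disclosure:

```python
import torch
import torch.nn as nn

class SegModel(nn.Module):
    """Backbone -> intermediate network (Neck, e.g. an FPN) -> output network (Head)."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # CNN for the student, Transformer-based for the teacher
        self.neck = neck          # returns a list of intermediate features (e.g. 4 levels)
        self.head = head          # maps the input features to a segmentation result

    def forward(self, image: torch.Tensor):
        backbone_feat = self.backbone(image)
        inter_feats = self.neck(backbone_feat)   # at least one intermediate feature
        seg_result = self.head(inter_feats)      # intermediate features used as input features
        return seg_result, inter_feats
```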
It will be appreciated that the structures of the first segmentation model and the second segmentation model of the present disclosure are described above; the training method of the segmentation model is further described below.
Fig. 2 is a flow chart of a method of training a segmentation model according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210 to S240.
In operation S210, a first sample image is input to a first segmentation model, and gradient information of a first intermediate network of the first segmentation model is obtained.
For example, the first sample image may be the first sample image 1001 described above. The first segmentation model may be the first segmentation model 110 described above. The first intermediate network may be the first intermediate network 112 described above. The first segmentation model may be used as a student model.
In the embodiment of the disclosure, a first segmentation loss of the first segmentation model may be determined according to the first segmentation result and the first label of the first sample image. From the first segmentation loss, gradient information of the first segmentation model may be determined. The gradient information of the first segmentation model may comprise gradient information of the first intermediate network. It will be appreciated that the gradient information of the first segmentation model may also comprise gradient information of the first backbone network and gradient information of the first output network.
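One possible way to obtain this gradient information is to back-propagate the first segmentation loss and read the gradients of the intermediate features. The sketch below assumes the hypothetical SegModel wrapper above and an arbitrary segmentation loss function; it is illustrative only:

```python
import torch

def intermediate_gradients(model, sample_image, label, seg_loss_fn):
    """Forward pass, segmentation loss, and gradients of the intermediate features."""
    seg_result, inter_feats = model(sample_image)
    for feat in inter_feats:
        feat.retain_grad()                     # keep gradients of these non-leaf tensors
    seg_loss = seg_loss_fn(seg_result, label)  # first segmentation loss
    seg_loss.backward(retain_graph=True)       # also populates backbone/head gradients
    return seg_loss, [feat.grad for feat in inter_feats]
```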
In operation S220, at least one first intermediate mask feature corresponding to the first intermediate network is obtained from the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network.
In an embodiment of the present disclosure, the first intermediate network may include a feature pyramid network (Feature Pyramid Networks, FPN). There may be at least one first intermediate feature. The gradient information of the first intermediate network may comprise the gradient information of the at least one first intermediate feature. For example, the first intermediate network may output 4 levels of features, each of which may be a first intermediate feature. The gradient information of the first intermediate network may include the gradient information of each of the 4 first intermediate features.
In operation S230, a first distillation loss is determined based on the at least one second intermediate mask feature and the at least one first intermediate mask feature.
In an embodiment of the present disclosure, the at least one second intermediate mask feature corresponds to a second intermediate network of the second segmentation model. For example, the second segmentation model may be the second segmentation model 120 described above. The second intermediate network may be the second intermediate network 122 described above. The second segmentation model may serve as a teacher model.
In an embodiment of the disclosure, the at least one second intermediate mask feature is obtained according to gradient information of a second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into a second segmentation model, and a parameter quantity of the second segmentation model is larger than a parameter quantity of the first segmentation model.
For example, the second sample image may be the second sample image 1002 described above. From the second segmentation result and the second label of the second sample image, a second segmentation loss of the second segmentation model may be determined. From the second segmentation loss, gradient information of the second segmentation model may be determined. The gradient information of the second segmentation model may comprise the gradient information of the second intermediate network. It will be appreciated that the gradient information of the second segmentation model may also comprise gradient information of the second backbone network and gradient information of the second output network. It will also be appreciated that the information relating to the second segmentation model may be acquired before, or simultaneously with, the information relating to the first segmentation model.
The second intermediate network may also include a feature pyramid network. There may be at least one second intermediate feature. The gradient information of the second intermediate network may comprise the gradient information of the at least one second intermediate feature. For example, the second intermediate network may output 4 levels of features, each of which may be a second intermediate feature. The gradient information of the second intermediate network may comprise the gradient information of each of the 4 second intermediate features.
In operation S240, a first segmentation model is trained based on the first distillation loss.
For example, based on the first distillation loss, parameters of the first segmentation model may be adjusted to converge the first distillation loss.
According to the embodiment of the disclosure, the first distillation loss is determined according to the first intermediate features output by the first segmentation model and the second intermediate features of the second segmentation model, so that the feature extraction capability of the second segmentation model can be transferred to the first segmentation model and the capability of the first segmentation model can be improved efficiently. Therefore, if the second segmentation model is a single-stage instance segmentation model, the first segmentation model trained with the first distillation loss can perform single-stage instance segmentation on images efficiently.
It will be appreciated that while the training process of the present disclosure is described above, the first distillation loss of the present disclosure will be further described below.
Fig. 3 is a schematic diagram of a first distillation loss according to one embodiment of the present disclosure.
As shown in fig. 3, the first intermediate network may output 4 first intermediate features, namely a first intermediate feature F311, a first intermediate feature F312, a first intermediate feature F313, and a first intermediate feature F314. The second intermediate network may output 4 second intermediate features, namely a second intermediate feature F321, a second intermediate feature F322, a second intermediate feature F323, and a second intermediate feature F324.
In an embodiment of the disclosure, the gradient information of a first intermediate feature comprises the gradient of each of a plurality of first intermediate feature values of the first intermediate feature. Obtaining at least one first intermediate mask feature corresponding to the first intermediate network from the gradient information of the first intermediate network and the at least one first intermediate feature output by the first intermediate network may include: obtaining a first intermediate mask feature according to a preset gradient threshold and the gradients of the plurality of first intermediate feature values of the first intermediate feature. For example, for the plurality of first intermediate feature values of the first intermediate feature, the first intermediate feature values whose gradients are greater than or equal to the preset gradient threshold are replaced with a first preset value, and the first intermediate feature values whose gradients are less than the preset gradient threshold are replaced with a second preset value, resulting in the first intermediate mask feature.
As shown in fig. 3, the first intermediate mask feature M311 may be obtained from the preset gradient threshold Th31 and the first intermediate feature F311. The first intermediate mask feature M312 may be derived from the preset gradient threshold Th31 and the first intermediate feature F312. From the preset gradient threshold Th31 and the first intermediate feature F313, the first intermediate mask feature M313 can be obtained. From the preset gradient threshold Th31 and the first intermediate feature F314, the first intermediate mask feature M314 may be obtained. Taking the first intermediate feature F311 as an example, for the plurality of first intermediate feature values of the first intermediate feature F311, if the gradient of a first intermediate feature value is smaller than the preset gradient threshold Th31, that feature value may be replaced with the second preset value (for example, 0); if the gradient of a first intermediate feature value is greater than or equal to the preset gradient threshold Th31, that feature value may be replaced with the first preset value (for example, 1). After replacing the plurality of first intermediate feature values with the first preset value and the second preset value, the first intermediate mask feature M311 is obtained. The preset gradient threshold Th31 may be, for example, 0.3. It is understood that the gradient of a first intermediate feature value may be a normalized gradient. It will also be appreciated that the manner of obtaining the first intermediate mask features M312, M313, and M314 from the first intermediate features F312, F313, and F314 is the same as or similar to the manner of obtaining the first intermediate mask feature M311 from the first intermediate feature F311, and is not repeated herein. According to the embodiment of the disclosure, determining the first intermediate mask feature from the gradients of the first intermediate feature values and the preset gradient threshold makes full use of the knowledge of the image features extracted by the segmentation model, improves the ability of the first intermediate network to extract image features, and thus improves the single-stage instance segmentation capability of the first segmentation model.
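A sketch of this thresholding step is shown below. It assumes normalized gradients, the preset values 1 and 0, and the example threshold 0.3; tensor shapes and the normalization scheme are assumptions:

```python
import torch

def intermediate_mask_feature(grad: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Build an intermediate mask feature from the gradient of an intermediate feature:
    positions whose normalized gradient is >= threshold become 1, the rest become 0."""
    grad_norm = grad.abs() / (grad.abs().max() + 1e-12)  # normalize gradients to [0, 1]
    return (grad_norm >= threshold).float()              # first / second preset values
```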
In an embodiment of the disclosure, the gradient information of the second intermediate network comprises the gradient information of the at least one second intermediate feature, and the gradient information of a second intermediate feature comprises the gradient of each of a plurality of second intermediate feature values of the second intermediate feature. The second intermediate mask feature is derived from the preset gradient threshold and the gradients of the plurality of second intermediate feature values of the second intermediate feature. For example, for the plurality of second intermediate feature values of the second intermediate feature, the second intermediate feature values whose gradients are greater than or equal to the preset gradient threshold are replaced with the first preset value, and the second intermediate feature values whose gradients are less than the preset gradient threshold are replaced with the second preset value, resulting in the second intermediate mask feature.
As shown in fig. 3, the second intermediate mask feature M321 may be obtained from the preset gradient threshold Th32 and the second intermediate feature F321. The second intermediate mask feature M322 may be derived from the preset gradient threshold Th32 and the second intermediate feature F322. The second intermediate mask feature M323 can be obtained from the preset gradient threshold Th32 and the second intermediate feature F323. From the preset gradient threshold Th32 and the second intermediate feature F324, the second intermediate mask feature M324 may be obtained. Taking the second intermediate feature F321 as an example, for the plurality of second intermediate feature values of the second intermediate feature F321, if the gradient of a second intermediate feature value is smaller than the preset gradient threshold Th32, that feature value may be replaced with the second preset value (for example, 0); if the gradient of a second intermediate feature value is greater than or equal to the preset gradient threshold Th32, that feature value may be replaced with the first preset value (for example, 1). After replacing the plurality of second intermediate feature values with the first preset value and the second preset value, the second intermediate mask feature M321 is obtained. The preset gradient threshold Th32 may be, for example, 0.3. It is understood that the gradient of a second intermediate feature value may be a normalized gradient. It will also be appreciated that the manner of obtaining the second intermediate mask features M322, M323, and M324 from the second intermediate features F322, F323, and F324 is the same as or similar to the manner of obtaining the second intermediate mask feature M321 from the second intermediate feature F321, and is not repeated herein.
Next, in an embodiment of the present disclosure, the first distillation loss may be determined from the at least one first intermediate mask feature and the at least one second intermediate mask feature. For example, at least one correlation loss may be determined from the at least one first intermediate mask feature and the at least one second intermediate mask feature, and the first distillation loss is determined based on the at least one correlation loss. As shown in fig. 3, a Pearson correlation coefficient (Pearson Correlation Coefficient, PCC) may be determined as a first correlation loss from the first intermediate mask feature M311 and the second intermediate mask feature M321. A Pearson correlation coefficient may be determined as a second correlation loss from the first intermediate mask feature M312 and the second intermediate mask feature M322. A Pearson correlation coefficient may be determined as a third correlation loss from the first intermediate mask feature M313 and the second intermediate mask feature M323. A Pearson correlation coefficient may be determined as a fourth correlation loss from the first intermediate mask feature M314 and the second intermediate mask feature M324. From the first correlation loss to the fourth correlation loss, the first distillation loss L301 may be determined. Through this embodiment of the present disclosure, a correlation loss between a first intermediate mask feature and a second intermediate mask feature is determined. Training the model with the first distillation loss obtained from these correlation losses can increase the correlation between the features output by the first intermediate network and the features output by the second intermediate network, further transfer the capability of the second segmentation model to the first segmentation model, and further improve the instance segmentation capability of the first segmentation model.
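A sketch of a correlation-based first distillation loss follows. The exact form of the loss is not fixed by the text; a common choice, assumed here, is 1 − PCC so that higher correlation yields a lower loss, with the per-level losses summed:

```python
import torch

def pearson_correlation_loss(m_student: torch.Tensor, m_teacher: torch.Tensor) -> torch.Tensor:
    """Correlation loss between a first (student) and a second (teacher) intermediate mask feature."""
    s = m_student.flatten().float() - m_student.float().mean()
    t = m_teacher.flatten().float() - m_teacher.float().mean()
    pcc = (s * t).sum() / (s.norm() * t.norm() + 1e-12)   # Pearson correlation coefficient
    return 1.0 - pcc                                       # smaller when the masks are more correlated

def first_distillation_loss(student_masks, teacher_masks):
    """Sum the correlation losses over the pyramid levels (e.g. M311/M321 ... M314/M324)."""
    return sum(pearson_correlation_loss(s, t) for s, t in zip(student_masks, teacher_masks))
```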
It will be appreciated that the first distillation loss of the present disclosure is described above. The distillation loss of the present disclosure is not limited thereto, and the second distillation loss and the attention distillation loss of the present disclosure will be described below.
FIG. 4 is a schematic diagram of determining a second distillation loss and an attention distillation loss according to one embodiment of the disclosure.
As shown in fig. 4, the first intermediate network may output 4 first intermediate features, namely a first intermediate feature F411, a first intermediate feature F412, a first intermediate feature F413, and a first intermediate feature F414. From the preset gradient threshold Th41 and the first intermediate features F411, F412, F413, and F414, the first intermediate mask feature M411, the first intermediate mask feature M412, the first intermediate mask feature M413, and the first intermediate mask feature M414 can be obtained. It will be appreciated that, for detailed descriptions of the first intermediate features F411 to F414, the preset gradient threshold Th41, and the first intermediate mask features M411 to M414, reference may be made to the first intermediate features F311 to F314, the preset gradient threshold Th31, and the first intermediate mask features M311 to M314, which are not repeated herein.
As shown in fig. 4, the second intermediate network may output 4 second intermediate features, namely a second intermediate feature F421, a second intermediate feature F422, a second intermediate feature F423, and a second intermediate feature F424. The second intermediate mask feature M421, the second intermediate mask feature M422, the second intermediate mask feature M423, and the second intermediate mask feature M424 may be obtained according to the preset gradient threshold Th42 and the second intermediate features F421, F422, F423, and F424. It will be appreciated that, for detailed descriptions of the second intermediate features F421 to F424, the preset gradient threshold Th42, and the second intermediate mask features M421 to M424, reference may be made to the second intermediate features F321 to F324, the preset gradient threshold Th32, and the second intermediate mask features M321 to M324, which are not repeated herein.
In an embodiment of the present disclosure, training the first segmentation model based on the first distillation loss comprises: and normalizing the at least one first intermediate mask feature to obtain at least one first intermediate normalized feature. A second distillation loss is determined based on the at least one second intermediate normalized feature and the at least one first intermediate normalized feature. The at least one second intermediate normalization feature is a normalization of the at least one second intermediate mask feature.
As shown in fig. 4, the first intermediate mask feature M411, the first intermediate mask feature M412, the first intermediate mask feature M413, and the first intermediate mask feature M414 may be input into the linear processing layer 414, respectively, to obtain 4 first intermediate normalized features. The second intermediate mask feature M421, the second intermediate mask feature M422, the second intermediate mask feature M423, and the second intermediate mask feature M424 may also be input into the linear processing layer 414, respectively, to obtain 4 second intermediate normalized features. Next, 4 binary cross entropy losses may be determined from the 4 first intermediate normalized features and the 4 second intermediate normalized features using binary cross entropy loss (Binary Cross Entropy Loss, BCE) functions. From the 4 binary cross entropy losses, the second distillation loss L402 can be determined in various ways (e.g., summation or weighted summation). For example, one binary cross entropy loss may be determined from the first intermediate normalized feature derived from the first intermediate mask feature M411 and the second intermediate normalized feature derived from the second intermediate mask feature M421.
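A sketch of the normalization and binary cross entropy step follows. The concrete form of the linear processing layer is not specified in the text, so an adaptive-pooling plus sigmoid stand-in is assumed here, and the per-level losses are simply summed:

```python
import torch
import torch.nn.functional as F

def normalize_mask_feature(mask_feat: torch.Tensor) -> torch.Tensor:
    """Stand-in for the linear processing layer: pool to a fixed size and map into [0, 1]."""
    pooled = F.adaptive_avg_pool2d(mask_feat, (32, 32))
    return torch.sigmoid(pooled)

def second_distillation_loss(student_masks, teacher_masks):
    """Binary cross entropy between the normalized student and teacher mask features,
    summed over the pyramid levels (e.g. M411/M421 ... M414/M424)."""
    loss = 0.0
    for s, t in zip(student_masks, teacher_masks):
        loss = loss + F.binary_cross_entropy(normalize_mask_feature(s),
                                             normalize_mask_feature(t).detach())
    return loss
```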
In embodiments of the present disclosure, the first segmentation model may be trained based on the first distillation loss and the second distillation loss. For example, the total loss may be determined from the first distillation loss and the second distillation loss in various ways (e.g., summation or weighted summation). Next, the parameters of the first segmentation model may be adjusted according to the total loss so that the total loss converges. Through this embodiment of the disclosure, the intermediate mask features are normalized, and the first segmentation model is trained with both the first distillation loss and the second distillation loss, so that the performance of the second segmentation model is transferred to the first segmentation model more comprehensively.
It will be appreciated that some ways of determining the second loss of distillation are described above and some ways of determining the loss of attention distillation will be described below.
In an embodiment of the present disclosure, training the first segmentation model based on the first distillation loss may include: processing the at least one first intermediate mask feature using an attention mechanism to obtain at least one first attention feature; and determining an attention distillation loss from at least one second attention feature and the at least one first attention feature, where the at least one second attention feature is obtained by processing the at least one second intermediate mask feature using the attention mechanism.
As shown in fig. 4, the first intermediate mask feature M411, the first intermediate mask feature M412, the first intermediate mask feature M413, and the first intermediate mask feature M414 may be input into the linear processing layer 414, respectively, to obtain 4 first intermediate normalized features. The second intermediate mask feature M421, the second intermediate mask feature M422, the second intermediate mask feature M423, and the second intermediate mask feature M424 may also be input into the linear processing layer 414, respectively, to obtain 4 second intermediate normalized features. Next, the 4 first intermediate normalized features may be processed separately using the attention mechanism to obtain 4 first attention features, and the 4 second intermediate normalized features may be processed separately using the attention mechanism to obtain 4 second attention features. Then, 4 attention distillation losses may be determined from the 4 first attention features and the 4 second attention features using a binary cross entropy loss function, and the overall attention distillation loss can be determined from these 4 losses. For example, one attention distillation loss may be determined from the first attention feature derived from the first intermediate mask feature M411 and the second attention feature derived from the second intermediate mask feature M421.
In embodiments of the present disclosure, the first segmentation model may be trained based on the first distillation loss and the attention distillation loss. For example, the total loss may be determined from the first distillation loss and the attention distillation loss in various ways (e.g., summation or weighted summation). Next, the parameters of the first segmentation model may be adjusted according to the total loss so that the total loss converges. According to this embodiment of the disclosure, training the first segmentation model with the attention distillation loss allows the first segmentation model to pay more attention to the information in the image that is relevant to instance segmentation, improving the instance segmentation capability of the first segmentation model.
In the disclosed embodiments, the attention mechanism may be various attention mechanisms. For example, the attention mechanism may be a self attention mechanism (self attention) or a multi-headed self attention mechanism (multi-head self attention). One mechanism of attention is described below in connection with fig. 5.
Fig. 5 is a schematic diagram of an attention mechanism according to one embodiment of the present disclosure.
As shown in fig. 5, the first intermediate normalized feature N51 may be input to the first attention unit 5151, resulting in the first attention feature a51. The second intermediate normalized feature may be input to the second attention unit 5152 resulting in a second attention feature a52. It will be appreciated that the first intermediate normalization feature N51 may be any of the 4 first intermediate normalization features described above. The second intermediate normalized feature may be any of the 4 second intermediate normalized features described above.
The first attention unit 5151 may include a convolution layer 51511, a normalization layer 51512, a convolution layer 51513, an excitation layer 51514, and a convolution layer 51515. The first intermediate normalized feature N51 may be input to the convolution layer 51511 to yield a first convolved feature. The first convolved feature is input to the normalization layer 51512, where a first normalized feature may be obtained. The first product feature may be obtained by multiplying the first normalized feature by the first intermediate normalized feature N51. The first product feature is input to the convolution layer 51513, which may result in a first intermediate output feature. The first intermediate output feature is input into the excitation layer 51514, which may result in a first excitation feature. The first excitation feature is input to the convolutional layer 51515, which may result in a first processed feature. The first post-processing feature and the first intermediate normalized feature N51 are added to obtain a first attention feature a51.
The second attention unit 5152 may include a convolution layer 51521, a normalization layer 51522, a convolution layer 51523, an excitation layer 51524, and a convolution layer 51525. The second intermediate normalized feature N52 may be input to the convolution layer 51521 to yield a second convolved feature. The second convolved feature is input to the normalization layer 51522, which can yield a second normalized feature. The second product feature may be obtained by multiplying the second normalized feature by the second intermediate normalized feature N52. The second product feature is input to the convolution layer 51523, which may result in a second intermediate output feature. The second intermediate output feature is input to the excitation layer 51524, which may result in a second excitation feature. The second excitation feature is input to the convolution layer 51525 to yield a second processed feature. The second post-processing feature and the second intermediate normalized feature N52 are added to obtain a second attention feature a52.
Next, from the first attention feature a51 and the second attention feature a52, the attention distillation loss Lattl can be determined.
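The attention unit of FIG. 5 can be sketched as follows. The channel count, kernel sizes, and the choice of sigmoid and ReLU for the normalization and excitation layers are assumptions, not taken from the disclosure:

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """Convolution -> normalization -> gating of the input -> convolution ->
    excitation -> convolution -> residual addition, as described for FIG. 5."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.Sigmoid()                           # normalization layer (assumed)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.excite = nn.ReLU(inplace=True)                # excitation layer (assumed)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normalized = self.norm(self.conv1(x))              # convolved then normalized feature
        product = x * normalized                           # product feature
        processed = self.conv3(self.excite(self.conv2(product)))
        return x + processed                               # attention feature
```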
It will be appreciated that the first and second attention features are derived from the first and second intermediate normalized features above. The present disclosure is not limited thereto and the first and second attention features may also be derived from the first and second intermediate mask features. For example, a first intermediate mask feature may be input to the first attention unit described above, resulting in a first attention feature. The second intermediate mask feature may be input to the second attention unit described above to obtain a second attention feature.
It will be appreciated that while the attention mechanism of the present disclosure is described above, some ways in which the present disclosure trains the first segmentation model will be further described below.
Fig. 6A is a schematic diagram of a first output network according to one embodiment of the present disclosure.
In an embodiment of the present disclosure, inputting the first sample image into the first segmentation model may further include: and inputting the first sample image into a first segmentation model to obtain a first segmentation feature output by a first output network. As shown in fig. 6A, the first input feature is input to the first output network 613 of the first segmentation model, and the first segmentation feature Seg61 can be obtained.
In an embodiment of the present disclosure, inputting the first sample image into the first segmentation model may further include: inputting the first sample image into the first segmentation model to obtain a first instance feature and a first class feature output by the first output network. As shown in fig. 6A, the first instance feature Ins61 and the first class feature label61 may also be obtained by inputting the first input feature described above into the first output network 613 of the first segmentation model.
In the disclosed embodiments, the first instance feature may indicate whether a first sample pixel in the first sample image belongs to a sample instance. For example, the first instance feature Ins61 may comprise a plurality of first instance feature values, each taking the value 0 or 1. In the case where a first instance feature value is 0, the first sample pixel corresponding to that feature value does not belong to a sample instance. In the case where a first instance feature value is 1, the first sample pixel corresponding to that feature value belongs to a sample instance.
In the disclosed embodiments, the first class feature may indicate the class of at least one sample instance in the first sample image. For example, the first class feature label61 includes a plurality of first class feature values, which may indicate the classes corresponding to the first sample pixels.
In the disclosed embodiments, a first sample mask may be determined from the first segmentation feature and the first instance feature, and a first classification result may be determined from the first class feature. The first classification result may indicate the class corresponding to a sample instance. For example, a dot product operation is performed based on the first segmentation feature Seg61 and the first instance feature Ins61 to obtain the first sample mask Mask61. The first sample mask Mask61 and the first classification result Rc61 may constitute the first segmentation result described above.
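A sketch of how the head outputs could be combined into the first segmentation result is given below. The tensor shapes, a per-image segmentation feature of shape (C, H, W) and N instance queries, are assumptions:

```python
import torch

def assemble_segmentation_result(seg_feat: torch.Tensor,
                                 inst_feat: torch.Tensor,
                                 cls_feat: torch.Tensor):
    """Dot product of the segmentation feature (C, H, W) with the instance feature (N, C)
    gives one sample mask per instance; the class feature (N, K) gives the classification."""
    assert seg_feat.shape[0] == inst_feat.shape[1]
    sample_masks = torch.einsum('nc,chw->nhw', inst_feat, seg_feat)  # e.g. Mask61
    classification = cls_feat.argmax(dim=-1)                         # e.g. Rc61
    return sample_masks, classification
```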
It will be appreciated that while the first output network of the present disclosure is described above, some ways in which the present disclosure trains the first segmentation model will be further described below.
Fig. 6B is a schematic diagram of determining distillation loss according to one embodiment of the disclosure.
In an embodiment of the present disclosure, training the first segmentation model based on the first distillation loss comprises: determining a third distillation loss based on the second segmentation feature and the first segmentation feature. The second segmentation feature corresponds to the second output network and is obtained by inputting the second sample image into the second segmentation model. For example, the second input feature described above is input into the second output network of the second segmentation model to obtain the second segmentation feature Seg62. From the first segmentation feature Seg61 and the second segmentation feature Seg62, the third distillation loss L603 may be determined using a pairwise distillation (pair-wise distillation) function.
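The pair-wise distillation function is not spelled out in the text. One common choice, assumed here, compares the pairwise similarities between spatial positions of the two segmentation features, which are assumed to share the same spatial size:

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of spatial positions of a (N, C, H, W) feature."""
    n, c, h, w = feat.shape
    flat = F.normalize(feat.view(n, c, h * w), dim=1)
    return torch.bmm(flat.transpose(1, 2), flat)          # (N, H*W, H*W)

def third_distillation_loss(student_seg: torch.Tensor, teacher_seg: torch.Tensor) -> torch.Tensor:
    """Match the pairwise similarity maps of the first and second segmentation features."""
    return F.mse_loss(pairwise_similarity(student_seg), pairwise_similarity(teacher_seg))
```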
In an embodiment of the present disclosure, a first segmentation model is trained based on the first distillation loss and the third distillation loss. For example, based on the first distillation loss and the third distillation loss L603, the total loss may be determined using various means (e.g., summation or weighted summation). Next, parameters of the first segmentation model may be adjusted according to the total loss to converge the total loss. According to the embodiment of the disclosure, the capability of extracting the local image features of the second segmentation model can be transferred to the first segmentation model, and the feature extraction capability of the first segmentation model can be improved.
It will be appreciated that the third distillation loss of the present disclosure is described above, but the present disclosure is not limited thereto, as will be further described below.
In an embodiment of the present disclosure, training the first segmentation model based on the first distillation loss comprises: a first channel dimension feature is determined from the first instance feature and the first class feature. For example, the first instance feature Ins61 and the first class feature label61 are fused, and the first channel dimension feature can be obtained.
In an embodiment of the present disclosure, a channel dimension distillation loss is determined from the second channel dimension feature and the first channel dimension feature. The second channel dimension feature is determined from a second instance feature and a second class feature obtained by inputting the second sample image into the second segmentation model. For example, the second instance feature Ins62 and the second class feature label62 may be fused to obtain the second channel dimension feature. From the first channel dimension feature and the second channel dimension feature, the channel dimension distillation loss Lchannel can be determined using a Jensen-Shannon divergence (Jensen-Shannon Divergence) function.
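A sketch of the channel dimension distillation loss follows. How the instance and class features are fused and how each channel is turned into a distribution are assumptions; a softmax over spatial positions is used here for illustration:

```python
import torch
import torch.nn.functional as F

def channel_dimension_distillation_loss(student_channel_feat: torch.Tensor,
                                        teacher_channel_feat: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between the per-channel distributions of the
    fused (instance + class) features of the student and the teacher."""
    n, c = student_channel_feat.shape[:2]
    p = F.softmax(student_channel_feat.reshape(n, c, -1), dim=-1)
    q = F.softmax(teacher_channel_feat.reshape(n, c, -1), dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    return (0.5 * kl_pm + 0.5 * kl_qm).mean()
```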
In embodiments of the present disclosure, a first segmentation model may be trained based on the first distillation loss and the channel dimension distillation loss. For example, based on the first distillation loss and the channel dimension distillation loss, the total loss may be determined using various means (e.g., summation or weighted summation). Next, parameters of the first segmentation model may be adjusted according to the total loss to converge the total loss. According to the embodiment of the disclosure, the features with different dimensions extracted by different models can be fully utilized for training, and the global feature extraction capability of the first segmentation model is improved.
It will be appreciated that the third distillation loss and the channel dimension distillation loss are described above; the manner in which the first segmentation model is trained will be further described below.
In an embodiment of the present disclosure, a segmentation loss is determined according to a first segmentation result output by the first segmentation model and a first label of the first sample image. The first segmentation model is trained based on the first distillation loss and the segmentation loss. For example, based on the first label and the first segmentation result, the segmentation loss may be determined using various loss functions. Based on the first distillation loss and the segmentation loss, the total loss may be determined in various ways (e.g., summation or weighted summation). Next, parameters of the first segmentation model may be adjusted according to the total loss to converge the total loss.
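The sketch below uses cross-entropy as one example of the "various loss functions" and plain summation as one example of determining the total loss; the tensor shapes and the placeholder value of the first distillation loss are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

# first_seg_result: raw logits of the first segmentation model, (B, num_classes, H, W);
# first_label: integer class map of the first sample image, (B, H, W).
first_seg_result = torch.randn(2, 5, 32, 32, requires_grad=True)
first_label = torch.randint(0, 5, (2, 32, 32))

segmentation_loss = F.cross_entropy(first_seg_result, first_label)
first_distillation_loss = torch.tensor(0.7)  # placeholder value for illustration

total_loss = segmentation_loss + first_distillation_loss  # plain summation variant
total_loss.backward()  # in training, an optimizer step over the model parameters follows
```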
It will be appreciated that, in the above description, the first segmentation model is trained based on the first distillation loss together with one of the segmentation loss, the second distillation loss, the third distillation loss, the attention distillation loss, and the channel dimension distillation loss. However, the present disclosure is not limited thereto, and the first segmentation model may also be trained based on the first distillation loss together with at least two of these losses. That is, the first segmentation model may be trained based on the first distillation loss and at least one of the segmentation loss, the second distillation loss, the third distillation loss, the attention distillation loss, and the channel dimension distillation loss.
It will be appreciated that while the model training method of the present disclosure is described above, the image segmentation method of the present disclosure will be described below.
Fig. 7 is a flowchart of an image segmentation method according to another embodiment of the present disclosure.
As shown in fig. 7, the method 700 may include operation S710.
In operation S710, a target image is input into a first segmentation model, resulting in a target segmentation result.
In the disclosed embodiment, the target segmentation result includes a target mask of the target instance in the target image and a class of the target instance.
In an embodiment of the present disclosure, the first segmentation model may be trained using the methods provided by the present disclosure. For example, the first segmentation model may be trained according to the method 200 described above.
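As a usage illustration only, the following hypothetical inference sketch assumes a trained first segmentation model whose output exposes per-instance masks and classes; the function name and the dictionary keys "masks" and "classes" are assumed names, not an interface prescribed by the present disclosure.

```python
import torch

def segment(trained_first_segmentation_model, target_image):
    # `trained_first_segmentation_model` stands in for a first segmentation model
    # trained as described above; the output keys are assumed for illustration.
    with torch.no_grad():
        output = trained_first_segmentation_model(target_image)
    target_masks = output["masks"]      # one mask per target instance
    target_classes = output["classes"]  # class of each target instance
    return target_masks, target_classes
```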
Fig. 8 is a block diagram of a training apparatus of a segmentation model according to one embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 may include a first obtaining module 810, a second obtaining module 820, a determining module 830, and a training module 840.
The first obtaining module 810 is configured to input the first sample image into the first segmentation model, and obtain gradient information of a first intermediate network of the first segmentation model.
The second obtaining module 820 is configured to obtain at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network.
A determining module 830 is configured to determine a first distillation loss based on the at least one second intermediate mask feature and the at least one first intermediate mask feature. The at least one second intermediate mask feature corresponds to a second intermediate network of the second segmentation model, the at least one second intermediate mask feature is obtained according to gradient information of the second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into the second segmentation model, and the parameter quantity of the second segmentation model is larger than that of the first segmentation model.
The training module 840 is configured to train the first segmentation model based on the first distillation loss.
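As a minimal sketch of how such a training module might operate, the snippet below computes a first distillation loss as the mean-squared error between corresponding intermediate mask features of the two models and performs one parameter update; the MSE formulation, the stand-in convolutional student, and the SGD optimizer with learning rate 0.01 are illustrative assumptions rather than the specific configuration of this embodiment.

```python
import torch
import torch.nn.functional as F

def first_distillation_loss(student_mask_feats, teacher_mask_feats):
    # One simple choice: mean-squared error between corresponding intermediate
    # mask features of the first (student) and second (teacher) models.
    losses = [F.mse_loss(s, t) for s, t in zip(student_mask_feats, teacher_mask_feats)]
    return torch.stack(losses).mean()

student = torch.nn.Conv2d(3, 8, 3, padding=1)  # stand-in for the first segmentation model
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)  # assumed optimizer settings

x = torch.randn(1, 3, 16, 16)                  # stand-in for the first sample image
student_feats = [student(x)]                   # stand-in for first intermediate mask features
teacher_feats = [torch.randn(1, 8, 16, 16)]    # stand-in for second intermediate mask features

loss = first_distillation_loss(student_feats, teacher_feats)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```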
In some embodiments, the gradient information of the first intermediate network comprises gradient information of at least one first intermediate feature, the gradient information of the first intermediate feature comprising gradients of respective first intermediate feature values of the first intermediate feature. The second obtaining module includes: the first obtaining submodule is used for obtaining a first intermediate mask feature according to a preset gradient threshold value and gradients of each of a plurality of first intermediate feature values of the first intermediate feature.
In some embodiments, the first obtaining submodule includes: a replacing unit configured to, for the plurality of first intermediate feature values of the first intermediate feature, replace first intermediate feature values whose gradients are greater than or equal to the preset gradient threshold with a first preset value, and replace first intermediate feature values whose gradients are smaller than the preset gradient threshold with a second preset value, so as to obtain the first intermediate mask feature.
In some embodiments, the gradient information of the second intermediate network comprises gradient information of at least one second intermediate feature, the gradient information of the second intermediate feature comprising gradients of respective second intermediate feature values of the second intermediate feature. The second intermediate mask feature is derived from a predetermined gradient threshold and a gradient of each of a plurality of second intermediate feature values of the second intermediate feature.
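The following sketch illustrates one possible implementation of the gradient-threshold replacement described above; the threshold value and the two preset values are assumed hyperparameters, and the same construction would apply to first and second intermediate features alike.

```python
import torch

def intermediate_mask_feature(feature, grad, threshold=0.1,
                              first_preset=1.0, second_preset=0.0):
    # Feature values whose gradient is greater than or equal to the preset threshold
    # are replaced with the first preset value; the remaining values are replaced
    # with the second preset value.
    keep = grad >= threshold
    return torch.where(keep,
                       torch.full_like(feature, first_preset),
                       torch.full_like(feature, second_preset))

feature = torch.randn(1, 8, 16, 16)  # a first (or second) intermediate feature
grad = torch.randn(1, 8, 16, 16)     # its element-wise gradient information
mask_feature = intermediate_mask_feature(feature, grad)
```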
In some embodiments, the training module comprises: the first attention processing sub-module is used for processing the at least one first intermediate mask feature by using an attention mechanism to obtain at least one first attention feature. A first determination sub-module for determining an attention distillation loss based on the at least one second attention characteristic and the at least one first attention characteristic. The at least one second attention feature is derived by processing the at least one second intermediate mask feature using an attention mechanism. A first training sub-module for training a first segmentation model based on the first distillation loss and the attention distillation loss.
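For illustration, the sketch below uses a sum-of-squared-activations spatial attention followed by L2 normalization as one possible attention mechanism for the mask features; this specific formulation is an assumption, not the mechanism prescribed by this embodiment.

```python
import torch
import torch.nn.functional as F

def spatial_attention(mask_feature):
    # Sum of squared activations over channels, L2-normalized over spatial positions.
    att = mask_feature.pow(2).sum(dim=1)       # (B, H, W)
    return F.normalize(att.flatten(1), dim=1)  # (B, H*W)

def attention_distillation_loss(student_masks, teacher_masks):
    losses = [F.mse_loss(spatial_attention(s), spatial_attention(t))
              for s, t in zip(student_masks, teacher_masks)]
    return torch.stack(losses).mean()

l_att = attention_distillation_loss([torch.randn(2, 8, 16, 16)],
                                    [torch.randn(2, 8, 16, 16)])
```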
In some embodiments, the training module comprises: a normalization processing sub-module for normalizing the at least one first intermediate mask feature to obtain at least one first intermediate normalized feature. A second determining sub-module is used for determining a second distillation loss based on the at least one second intermediate normalized feature and the at least one first intermediate normalized feature. The at least one second intermediate normalized feature is obtained by normalizing the at least one second intermediate mask feature. A second training sub-module is used for training the first segmentation model based on the first distillation loss and the second distillation loss.
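As an illustrative sketch only, the snippet below normalizes each intermediate mask feature to zero mean and unit variance per feature map and compares the normalized features of the two models with a mean-squared error; both the normalization scheme and the loss form are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def normalized_mask_feature(mask_feature, eps=1e-6):
    # Zero-mean, unit-variance normalization per feature map.
    mean = mask_feature.mean(dim=(2, 3), keepdim=True)
    std = mask_feature.std(dim=(2, 3), keepdim=True)
    return (mask_feature - mean) / (std + eps)

def second_distillation_loss(student_masks, teacher_masks):
    losses = [F.mse_loss(normalized_mask_feature(s), normalized_mask_feature(t))
              for s, t in zip(student_masks, teacher_masks)]
    return torch.stack(losses).mean()
```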
In some embodiments, the first segmentation model further comprises a first output network, and the first obtaining module is further configured to: input the first sample image into the first segmentation model to obtain a first segmentation feature output by the first output network.
In some embodiments, the second segmentation model further comprises a second output network, and the training module comprises: a third determination submodule for determining a third distillation loss according to the second segmentation feature and the first segmentation feature. The second segmentation feature corresponds to the second output network and is obtained by inputting the second sample image into the second segmentation model. A third training submodule is used for training the first segmentation model according to the first distillation loss and the third distillation loss.
In some embodiments, the first segmentation model further comprises a first output network, and the first obtaining module is further configured to: input the first sample image into the first segmentation model to obtain a first instance feature and a first class feature output by the first output network. The first instance feature is used to indicate whether a first sample pixel in the first sample image belongs to a sample instance, and the first class feature is used to indicate a class of at least one sample instance in the first sample image.
In some embodiments, the training module comprises: a fourth determination submodule for determining a first channel dimension feature from the first instance feature and the first class feature, and a fifth determining submodule for determining a channel dimension distillation loss according to the second channel dimension feature and the first channel dimension feature. The second channel dimension feature is determined from a second instance feature and a second class feature obtained by inputting a second sample image into the second segmentation model. A fourth training submodule is used for training the first segmentation model according to the first distillation loss and the channel dimension distillation loss.
In some embodiments, the training module comprises: a sixth determining submodule configured to determine a segmentation loss according to the first segmentation result output by the first segmentation model and the first label of the first sample image, and a fifth training sub-module for training the first segmentation model based on the first distillation loss and the segmentation loss.
In some embodiments, the first segmentation model is a first instance segmentation model and the second segmentation model is a second instance segmentation model.
Fig. 9 is a block diagram of an image segmentation apparatus according to another embodiment of the present disclosure.
As shown in fig. 9, the apparatus 900 may include a third obtaining module 910.
The third obtaining module 910 is configured to input the target image into the first segmentation model to obtain a target segmentation result. The target segmentation result includes a target mask of the target instance in the target image and a class of the target instance.
In an embodiment of the present disclosure, the first segmentation model is trained using the apparatus provided by the present disclosure. For example, the first segmentation model is trained using the apparatus 800.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, for example, a training method of a segmentation model and/or an image segmentation method. For example, in some embodiments, the training method of the segmentation model and/or the image segmentation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the segmentation model and/or the image segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training method of the segmentation model and/or the image segmentation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) display or an LCD (liquid crystal display)) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (29)

1. A method of training a segmentation model, comprising:
inputting a first sample image into a first segmentation model to obtain gradient information of a first intermediate network of the first segmentation model;
obtaining at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network;
determining a first distillation loss according to at least one second intermediate mask feature and at least one first intermediate mask feature, wherein at least one second intermediate mask feature corresponds to a second intermediate network of a second segmentation model, at least one second intermediate mask feature is obtained according to gradient information of the second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into the second segmentation model, and the parameter quantity of the second segmentation model is larger than that of the first segmentation model; and
training the first segmentation model based on the first distillation loss.
2. The method of claim 1, wherein the gradient information of the first intermediate network comprises gradient information of at least one of the first intermediate features, the gradient information of the first intermediate features comprising gradients of respective ones of a plurality of first intermediate feature values of the first intermediate features,
the obtaining at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network includes:
and obtaining the first intermediate mask feature according to a preset gradient threshold value and gradients of each of a plurality of first intermediate feature values of the first intermediate feature.
3. The method of claim 2, wherein the deriving a first intermediate mask feature from a preset gradient threshold and gradients of each of a plurality of first intermediate feature values of the first intermediate feature comprises:
for the plurality of first intermediate feature values of the first intermediate feature, replacing first intermediate feature values whose gradients are greater than or equal to the preset gradient threshold with a first preset value, and replacing first intermediate feature values whose gradients are smaller than the preset gradient threshold with a second preset value, so as to obtain the first intermediate mask feature.
4. The method of claim 1, wherein the gradient information of the second intermediate network comprises gradient information of at least one of the second intermediate features, the gradient information of the second intermediate features comprising gradients of respective ones of a plurality of second intermediate feature values of the second intermediate features,
the second intermediate mask feature is derived from a preset gradient threshold and gradients of each of a plurality of second intermediate feature values of the second intermediate feature.
5. The method of claim 1, wherein the training the first segmentation model based on the first distillation loss comprises:
processing at least one of the first intermediate mask features using an attention mechanism to obtain at least one first attention feature;
determining attention distillation loss based on at least one second attention feature and at least one said first attention feature, wherein at least one said second attention feature is derived by processing at least one second intermediate mask feature using an attention mechanism;
and training the first segmentation model based on the first distillation loss and the attention distillation loss.
6. The method of claim 1, wherein the training the first segmentation model based on the first distillation loss comprises:
normalizing at least one first intermediate mask feature to obtain at least one first intermediate normalized feature;
determining a second distillation loss according to at least one second intermediate normalization feature and at least one first intermediate normalization feature, wherein the at least one second intermediate normalization feature is obtained by normalizing the at least one second intermediate mask feature;
and training the first segmentation model based on the first distillation loss and the second distillation loss.
7. The method of claim 1, wherein the first segmentation model further comprises a first output network,
the inputting the first sample image into the first segmentation model further comprises:
and inputting the first sample image into the first segmentation model to obtain a first segmentation feature output by the first output network.
8. The method of claim 7, wherein the second segmentation model further comprises a second output network,
said training the first segmentation model based on the first distillation loss comprises:
determining a third distillation loss according to a second segmentation feature and the first segmentation feature, wherein the second segmentation feature corresponds to the second output network, and the second segmentation feature is obtained by inputting the second sample image into the second segmentation model;
and training the first segmentation model based on the first distillation loss and the third distillation loss.
9. The method of claim 1, wherein the first segmentation model further comprises a first output network,
the inputting the first sample image into the first segmentation model further comprises:
and inputting the first sample image into the first segmentation model to obtain a first instance feature and a first class feature output by the first output network, wherein the first instance feature is used for indicating whether a first sample pixel in the first sample image belongs to one sample instance, and the first class feature is used for indicating the class of at least one sample instance in the first sample image.
10. The method of claim 9, wherein the training the first segmentation model according to the first distillation loss comprises:
determining a first channel dimension feature from the first instance feature and the first class feature;
determining a channel dimension distillation loss according to a second channel dimension feature and the first channel dimension feature, wherein the second channel dimension feature is determined according to the second instance feature and the second class feature, and the second instance feature and the second class feature are obtained by inputting the second sample image into the second segmentation model; and
training the first segmentation model based on the first distillation loss and the channel dimension distillation loss.
11. The method of claim 1, wherein the training the first segmentation model according to the first distillation loss comprises:
determining segmentation loss according to a first segmentation result output by the first segmentation model and a first label of the first sample image;
and training the first segmentation model based on the first distillation loss and the segmentation loss.
12. The method of claim 1, wherein the first segmentation model is a first instance segmentation model and the second segmentation model is a second instance segmentation model.
13. An image segmentation method, comprising:
inputting a target image into a first segmentation model to obtain a target segmentation result, wherein the target segmentation result comprises a target mask of a target instance in the target image and a category of the target instance,
the first segmentation model is trained using the method of any one of claims 1 to 12.
14. A training apparatus for a segmentation model, comprising:
the first acquisition module is used for inputting the first sample image into a first segmentation model to obtain gradient information of a first intermediate network of the first segmentation model;
The second obtaining module is used for obtaining at least one first intermediate mask feature corresponding to the first intermediate network according to the gradient information of the first intermediate network and at least one first intermediate feature output by the first intermediate network;
a determining module, configured to determine a first distillation loss according to at least one second intermediate mask feature and at least one first intermediate mask feature, wherein at least one second intermediate mask feature corresponds to a second intermediate network of a second segmentation model, at least one second intermediate mask feature is obtained according to gradient information of the second intermediate network and at least one second intermediate feature output by the second intermediate network, the gradient information of the second intermediate network is obtained by inputting a second sample image into the second segmentation model, and the parameter quantity of the second segmentation model is larger than that of the first segmentation model; and
and the training module is used for training the first segmentation model according to the first distillation loss.
15. The apparatus of claim 14, wherein the gradient information of the first intermediate network comprises gradient information of at least one of the first intermediate features, the gradient information of the first intermediate features comprising gradients of respective ones of a plurality of first intermediate feature values of the first intermediate features,
The second obtaining module includes:
the first obtaining submodule is used for obtaining the first intermediate mask feature according to a preset gradient threshold value and gradients of each of a plurality of first intermediate feature values of the first intermediate feature.
16. The apparatus of claim 15, wherein the first obtaining submodule comprises:
a replacing unit configured to, for a plurality of first intermediate feature values of the first intermediate feature, replace first intermediate feature values whose gradients are greater than or equal to the preset gradient threshold with a first preset value, and replace first intermediate feature values whose gradients are smaller than the preset gradient threshold with a second preset value, so as to obtain the first intermediate mask feature.
17. The apparatus of claim 14, wherein the gradient information of the second intermediate network comprises gradient information of at least one of the second intermediate features, the gradient information of the second intermediate features comprising gradients of respective ones of a plurality of second intermediate feature values of the second intermediate features,
the second intermediate mask feature is derived from a preset gradient threshold and gradients of each of a plurality of second intermediate feature values of the second intermediate feature.
18. The apparatus of claim 14, wherein the training module comprises:
A first attention processing sub-module for processing at least one of the first intermediate mask features with an attention mechanism to obtain at least one first attention feature;
a first determining sub-module for determining an attention distillation loss based on at least one second attention feature and at least one of said first attention features, wherein at least one of said second attention features is derived by processing at least one second intermediate mask feature using an attention mechanism;
a first training sub-module for training the first segmentation model based on the first distillation loss and the attention distillation loss.
19. The apparatus of claim 14, wherein the training module comprises:
the normalization processing submodule is used for carrying out normalization processing on at least one first intermediate mask feature to obtain at least one first intermediate normalization feature;
a second determining submodule, configured to determine a second distillation loss according to at least one second intermediate normalization feature and at least one first intermediate normalization feature, where at least one second intermediate normalization feature is obtained by normalizing at least one second intermediate mask feature;
A second training sub-module for training the first segmentation model based on the first distillation loss and the second distillation loss.
20. The apparatus of claim 14, wherein the first segmentation model further comprises a first output network,
the first obtaining module is further configured to:
and inputting the first sample image into the first segmentation model to obtain a first segmentation feature output by the first output network.
21. The apparatus of claim 20, wherein the second segmentation model further comprises a second output network,
the training module comprises:
a third determining submodule, configured to determine a third distillation loss according to a second segmentation feature and the first segmentation feature, where the second segmentation feature corresponds to the second output network, and the second segmentation feature is obtained by inputting the second sample image into the second segmentation model;
and a third training sub-module for training the first segmentation model based on the first distillation loss and the third distillation loss.
22. The apparatus of claim 14, wherein the first segmentation model further comprises a first output network,
The first obtaining module is further configured to:
and inputting the first sample image into the first segmentation model to obtain a first instance feature and a first class feature output by the first output network, wherein the first instance feature is used for indicating whether a first sample pixel in the first sample image belongs to one sample instance, and the first class feature is used for indicating the class of at least one sample instance in the first sample image.
23. The apparatus of claim 22, wherein the training module comprises:
a fourth determination submodule configured to determine a first channel dimension feature from the first example feature and the first class feature;
a fifth determining submodule, configured to determine a channel dimension distillation loss according to a second channel dimension feature and the first channel dimension feature, wherein the second channel dimension feature is determined according to the second instance feature and the second class feature, and the second instance feature and the second class feature are obtained by inputting the second sample image into the second segmentation model; and
and a fourth training submodule for training the first segmentation model according to the first distillation loss and the channel dimension distillation loss.
24. The apparatus of claim 14, wherein the training module comprises:
a sixth determining submodule, configured to determine a segmentation loss according to a first segmentation result output by the first segmentation model and a first label of the first sample image;
and a fifth training sub-module for training the first segmentation model based on the first distillation loss and the segmentation loss.
25. The apparatus of claim 14, wherein the first segmentation model is a first instance segmentation model and the second segmentation model is a second instance segmentation model.
26. An image segmentation apparatus comprising:
a third obtaining module, configured to input a target image into a first segmentation model to obtain a target segmentation result, where the target segmentation result includes a target mask of a target instance in the target image and a class of the target instance,
the first segmentation model is trained using the apparatus of any one of claims 14 to 25.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 13.
28. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 13.
29. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 13.