CN112258564A

CN112258564A - Method and device for generating fusion feature set

Info

Publication number: CN112258564A
Application number: CN202011126125.0A
Authority: CN
Inventors: 唐雯; 张荣国; 李新阳; 陈宽; 王少康
Original assignee: Infervision Medical Technology Co Ltd
Current assignee: Infervision Medical Technology Co Ltd
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2021-01-22
Anticipated expiration: 2040-10-20
Also published as: CN112258564B

Abstract

The application provides a method and a device for generating a fusion feature set, wherein the method comprises the following steps: determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image; and inputting the first feature set and the second feature set into a feature processing model to obtain a first fusion feature set of a first scale. Compared with the prior art, the technical scheme provided by the application carries out integral feature transformation and fusion on the modal image through the feature processing model, and enhances the fusion and interpretability of the multi-modal features.

Description

Method and device for generating fusion feature set

Technical Field

The application relates to the technical field of deep learning, in particular to a method and a device for generating a fusion feature set.

Background

In recent years, various medical imaging apparatuses are generally used to acquire medical images of human organs or diseased tissues. Since each medical imaging device can only acquire a single-modality medical image, the single-modality medical image is often difficult to provide sufficient information. However, the existing image feature extraction and fusion method has the problems of incomplete and accurate registration, insufficient fusion, overfitting caused by different image registration results, and the like.

Therefore, how to fuse the single-modality medical images obtained by different medical imaging devices to obtain a multi-modality medical image so as to obtain more comprehensive information is an urgent technical problem to be solved.

Disclosure of Invention

In view of the above, embodiments of the present application aim to provide a method and an apparatus for generating a fusion feature set to solve the problems of low fusion and less interpretable feature information of a multi-modal image in the prior art.

In a first aspect, an embodiment of the present application provides a method for generating a fused feature set, where the method includes: determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image; and inputting the first feature set and the second feature set into a feature processing model to obtain a first fusion feature set of a first scale.

In some embodiments of the present application, the feature processing model includes a feature transformation module and a feature fusion module, and inputting the first feature set and the second feature set into the feature processing model to obtain a first fused feature set at a first scale includes: inputting the first feature set and the second feature set into a feature processing model, and determining a third feature set of a first scale corresponding to the second feature set based on the first feature set and the second feature set by using a feature transformation module, wherein a spatial matching relationship exists between the third feature set and the first feature set; and utilizing a feature fusion module to generate a first fusion feature set of a first scale based on fusion of the first feature set and the third feature set.

In some embodiments of the present application, determining, by the feature transformation module, a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set includes: and performing elastic registration operation on the first feature set and the second feature set by using a feature transformation module to determine a third feature set.

In some embodiments of the present application, the determining the third feature set by performing an elastic registration operation on the first feature set and the second feature set using a feature transformation module includes: respectively determining offsets between a plurality of second features with different dimensions and the corresponding first features by using a feature transformation module to generate an offset set corresponding to the second feature set; and performing characteristic dimension resampling operation on a plurality of second characteristics with different dimensions in the second characteristic set by utilizing a characteristic transformation module based on the offset set to determine a third characteristic set.

In some embodiments of the present application, determining, by a feature transformation module, offsets between a plurality of second features with different dimensions and respective corresponding first features, respectively, to generate an offset set corresponding to a second feature set, includes: performing feature splicing operation on the second features and first features corresponding to the second features aiming at each of the plurality of second features with different dimensions to generate first spliced features; performing convolution operation on the first splicing features to determine a first direction offset and a second direction offset corresponding to the second features; and generating an offset set based on the first direction offset and the second direction offset corresponding to the second features of the plurality of different dimensions.

In some embodiments of the present application, the third feature set includes a plurality of third features of different dimensions, and the third features of the plurality of different dimensions and the first features of the plurality of different dimensions have a one-to-one dimensional correspondence relationship, and the generating of the first fused feature set of the first scale based on the fusion of the first feature set and the third feature set includes: for each third feature in the plurality of third features with different dimensions, performing feature splicing operation on the third feature and the first feature corresponding to the third feature to generate a second spliced feature; performing convolution operation on the second splicing features to determine first fusion features of the first scale corresponding to the third features; and generating a first fused feature set based on the first fused features of the first scale corresponding to the third features of the plurality of different dimensions.

In certain embodiments of the present application, further comprising: determining a fourth feature set of the second scale corresponding to the first modality image and a fifth feature set of the second scale corresponding to the second modality image; inputting the fourth feature set and the fifth feature set into a feature processing model to obtain a second fusion feature set of a second scale; and obtaining a third fusion feature set corresponding to the first fusion feature set and the second fusion feature set by using the feature processing model.

In some embodiments of the present application, obtaining a third fused feature set corresponding to the first fused feature set and the second fused feature set by using a feature processing model includes: determining a seventh feature set of the second scale corresponding to the first fusion feature set by using the feature processing model; and performing fusion operation on the seventh feature set and the second fusion feature set by using the feature processing model to obtain a third fusion feature set.

In a second aspect, an embodiment of the present application provides an apparatus for generating a fused feature set, where the apparatus includes: the determining module is used for determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image; and the input module is used for inputting the first feature set and the second feature set into the feature processing model to obtain a first fusion feature set of a first scale.

In a third aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores a computer program for executing the method for generating a fused feature set according to the first aspect.

In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing processor executable instructions, wherein the processor is configured to perform the method of generating a fused feature set as described in the first aspect above.

The embodiment of the application provides a method and a device for generating a fusion feature set, which are used for performing overall feature transformation and fusion on modal images through a feature processing model, so that the fusion and interpretability of multi-modal features are enhanced. In addition, the embodiment of the application is an end-to-end scheme, additional manual operation is not needed, the complexity of the network is not changed greatly, and the method is easy to use and train. The performance is optimized, the speed is improved, and the occupied space of the network is reduced.

Drawings

Fig. 1 is a schematic flowchart of a network model training method according to an embodiment of the present disclosure.

Fig. 2 is a schematic diagram of feature transformation of a training method of a network model according to an embodiment of the present application.

Fig. 3 is a schematic diagram of generating fusion features in a network model training method according to an embodiment of the present application.

Fig. 4 is a schematic diagram of down-sampling of a training method of a network model according to an embodiment of the present application.

Fig. 5 is a flowchart illustrating a method for generating a fused feature set according to an embodiment of the present application.

Fig. 6 is a schematic flowchart of generating a first fused feature set according to a method for generating a fused feature set according to an embodiment of the present application.

Fig. 7 is a flowchart illustrating an elastic registration operation of a method for generating a fused feature set according to an embodiment of the present application.

Fig. 8 is a schematic diagram of an offset-based feature transformation of a method for generating a fused feature set according to an embodiment of the present application.

Fig. 9 is a flowchart illustrating an elastic registration operation of a method for generating a fused feature set according to another embodiment of the present application.

Fig. 10 is a schematic flowchart of a method for generating a fused feature set according to another embodiment of the present application.

Fig. 11 is a flowchart illustrating a method for generating a fused feature set according to another embodiment of the present application.

Fig. 12 is a schematic structural diagram of a training apparatus for a network model according to an embodiment of the present application.

Fig. 13 is a schematic structural diagram of an apparatus for generating a fused feature set according to an embodiment of the present application.

Fig. 14 is a block diagram of an electronic device for a training method of a network model or a method of generating a fused feature set according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the prior art, the task of deep learning to classify, segment or detect multiple modes generally adopts three methods: one is to merge across channels and send the whole as input to the network for classification, segmentation or detection. The main drawback of this approach is the lack of monitoring of modalities, leaving the network entirely free for feature extraction that may result in some modalities being disabled. Meanwhile, due to the limitation of receptive field, the multimode image which is not subjected to registration processing cannot be fused with the due multi-feature information of the multimode.

The second method is that firstly, the traditional or deep learning method is used for registration, and then the registered images are input into a classification, segmentation or detection network. Due to the fact that the registration is not accurate completely, and different image registration results are different, the efficiency of network overfitting, classification, segmentation and detection is reduced. Meanwhile, the two-step model training and testing also have a basic concept of end-to-end violation, and the extracted features cannot achieve global optimization.

And three types are that a classification, segmentation and detection model is trained for each mode, and finally, the results of a plurality of models are fused. This approach does not extract features of the fused multimodality in the deep network, and is only insufficient for fusion of results. Meanwhile, the two-step integral model training and testing also has a basic concept of end-to-end, and the extracted features cannot reach the global optimum and are time-consuming and labor-consuming.

Fig. 1 is a schematic flow chart illustrating a method for training a network model according to an embodiment of the present application. The method of fig. 1 is performed by a computing device. As shown in fig. 1, the training method of the network model includes the following steps.

110: an initial network model is determined.

Specifically, the initial network model may be obtained by iterative training based on the sample images of the label information input in step 120. The initial network model is not particularly limited in the embodiments of the present application.

120: an initial network model is trained based on sample images including label information to generate a feature processing model.

Specifically, the sample image may be a modality image (e.g., a brain CT image), and the embodiment of the present application does not specifically limit the sample image.

The tag information may be subjected to a feature transformation operation. The tag information may also be a feature fusion operation. The tag information may also be subjected to feature transformation and feature fusion operations simultaneously. The tag information may be manually calibrated. The tag information may also be automatically calibrated by other algorithms.

The feature processing model may include a feature transformation module. The feature processing model may also include a feature fusion module. The feature processing model can also comprise a feature transformation module and a feature fusion module. The feature processing model is not particularly limited in the embodiments of the present application.

In one example, a sample image with labeled information is input into a network model for initial training. And in the training process, the back propagation is carried out through the loss function, and the training is continuously carried out until the required model is reached. Wherein, the processing modules (such as the feature transformation module and the feature fusion module) in the feature processing model are connected and updated by the same loss function. The loss function may be a cross-entropy loss function or other suitable loss function, which is not specifically limited in this embodiment.

In an embodiment, the feature processing model is configured to generate a first fused feature set at a first scale based on a fusion of a first feature set at the first scale corresponding to the first modality image and a second feature set at the first scale corresponding to the second modality image.

The first modality image and the second modality image may be medical images, and may specifically be Computed Tomography (CT) images. Or images in other fields such as the driving field, and the type of the image is not particularly limited in the embodiment of the present application. And the first and second modality images may be different modality images of the same part of the same test subject, for example, images of cerebral blood vessels.

For example, the first modality image may be a CT image. The first modality image may also be a Magnetic Resonance Imaging (MRI) image. The first modality image may also be a Computed Radiography (CR) image. The first modality image may also be a Digital Radiography (DR) image.

For example, the second modality image may also be a CT image. The second modality image may also be an MRI image. The second modality image may also be a CR image. The second modality image may also be a DR image.

In an example, the first modality image and the second modality image are images of different modalities. For example, the first modality image is a CT image and the second modality image is an MRI image; or the first modality image is an MRI image, and the second modality image is a CT image; or the first modality image is a CT image and the second modality image is a CR image. The embodiment of the present application does not specifically limit what modality image the first modality image and the second modality image are.

The first scale may be a size at which the first modality image and the second modality image are input. The first scale may be any size of the first modality image and the second modality image after downsampling, which is not specifically limited in the embodiment of the present application.

The first feature set and the second feature set may be obtained by feature extraction based on the same basic network framework sharing parameter. The basic network may specifically be ResNext, ResNet, densnet, VGG, initiation, or another suitable convolutional neural network, and the type of the basic network is not specifically limited in the embodiments of the present application.

For details of the process of generating the first fused feature set, please refer to the following description of the embodiment of fig. 5, and details are not repeated herein in order to avoid repetition.

Therefore, the feature processing model is used for carrying out overall feature transformation and fusion on the modal images, and the fusion performance and the interpretability of the multi-modal features are enhanced. In addition, the embodiment of the application is an end-to-end scheme, additional manual operation is not needed, the complexity of the network is not changed greatly, and the method is easy to use and train. The performance is optimized, the speed is improved, and the occupied space of the network is reduced.

In an embodiment of the present application, the feature processing model includes a feature transformation module and a feature fusion module, the feature transformation module is configured to determine, based on the first feature set and the second feature set, a third feature set of the first scale corresponding to the second feature set, where a spatial matching relationship exists between the third feature set and the first feature set, and the feature fusion module is configured to generate a first fusion feature set based on the first feature set and the third feature set.

Specifically, the relationship of spatial matching will be described by taking the first feature set, the second feature set and the third feature set as an example.

In an example, the spatial matching relationship may be a one-to-one correspondence relationship between feature sets (e.g., a first feature set and a second feature set) in the same dimension, and may perform operations such as spatially stitching and convolving. For example, a plurality of third features in the third feature set and a plurality of first features in the first feature set are in a one-to-one correspondence relationship in the same feature dimension, and the purpose of splicing the features in the same dimension can be achieved.

In an example, the spatial matching relationship may be based on a plurality of different dimensions of the first feature set and a plurality of different dimensions of the second feature set, and the plurality of different dimensions of the first feature set and the plurality of different dimensions of the second feature set are elastically registered with an offset.

It should be understood that the method described in the embodiment of the present application may be mainly divided into two parts, namely, a feature transformation part (for example, a process of generating the third feature set 203, see fig. 2) and a feature fusion part (for example, a part of generating the first fusion feature set 313, see fig. 3). The two parts can be trained simultaneously, wherein simultaneous training means that the two parts are connected and updated by the same loss function. The loss function may be a cross-entropy loss function or other suitable loss function, which is not specifically limited in this embodiment.

Therefore, the two models are trained simultaneously through the same loss function, the extracted features are globally optimal while end-to-end requirements are met, and due multi-feature information of multiple modes is fused.

In an embodiment of the present application, the feature processing model is further configured to generate a second fused feature set at the second scale based on a fusion of a fourth feature set at the second scale corresponding to the first modality image and a fifth feature set at the second scale corresponding to the second modality image, and to generate a third fused feature set by stitching the second fused feature set and the first fusion.

Specifically, the fourth feature set at the second scale of the first modality image (or the fifth feature set at the second scale corresponding to the second modality image) may be a generated fourth feature set at the second scale (or a fifth feature set at the second scale) after down-sampling and size-reducing the first feature set at the first scale in order that the receptive field may be larger and the feature dimension may be increased under the self-setting of the base network.

In an example, the first and second modality images may be preprocessed to obtain a first feature set and a second feature set. The first and second feature sets may be the same size as the first and second modality images were input. The first feature set and the second feature set integrate position information and image information, that is, the position information and the pixel information of each pixel point are included. And carrying out downsampling operation by using a basic network, wherein the sizes of the first feature set and the second feature set are continuously reduced, and deep features of the image can be extracted. The features obtained by the down-sampling (e.g., the fourth feature set and the fifth feature set, see fig. 4) may also be up-sampled to obtain feature maps having a size consistent with the original first modality image and the second modality image. In addition, this step may be repeated until a threshold set by the underlying network (i.e., a reduced scale), such as 1/16 or 1/32, is reached.

It should be appreciated that the process shown in fig. 4 is a downsampled process, where the first through third feature sets 401 through 403 are one scale; the fourth feature set 404 through the sixth feature set 406 are one scale; the tenth feature set 407 to the twelfth feature set 409 are one scale; the thirteenth set of features 410 through the fifteenth set of features 412 are one scale. Wherein the reduction in size of the outer border in the figure represents a reduction in scale. The feature transformation modules 413 to 416 are shown as repeated operations of the same module.

Therefore, for the multi-modal deep learning problem, the features of different scales can be registered by automatically utilizing the feature transformation module, no additional label is required to be provided, the network learns by self, the features are registered and transformed in a mode of being beneficial to overall classification, segmentation or detection tasks, and the fusion and the interpretability of the multi-modal features are enhanced.

Fig. 5 is a flowchart illustrating a method for generating a fused feature set according to an embodiment of the present application. FIG. 5 is an example of the embodiment shown in FIG. 1, and the same parts are not repeated herein, and the differences are mainly described herein. As shown in fig. 5, the method for generating the fused feature set includes the following steps.

510: a first set of features at a first scale corresponding to a first modality image and a second set of features at the first scale corresponding to a second modality image are determined.

In particular, the first set of features may include a plurality of different dimensions of the first feature. The second set of features may include a plurality of second features of different dimensions. And the first features of the plurality of different dimensions and the second features of the plurality of different dimensions may exhibit a one-to-one correspondence in the spatial dimension. That is, the first feature and the second feature in the same dimension are in a corresponding relationship.

The first dimension is substantially the same as that described in fig. 1, and please refer to the description of fig. 1 for details, which are not described herein again.

520: and inputting the first feature set and the second feature set into a feature processing model to obtain a first fusion feature set of a first scale.

In an embodiment, the feature processing model is obtained based on the training method of the network model described in the embodiment of fig. 1.

Specifically, step 520 may specifically be: first, a third feature set is generated in the feature transformation module based on the first feature set and the second feature set. Wherein the third set of features may comprise a plurality of third features of different dimensions. And then, performing feature splicing on the third features and the first features corresponding to the third features to generate spliced features. And secondly, performing convolution operation on the splicing feature to determine a first fusion feature of the first scale corresponding to the third feature. And finally, generating first fusion features of the first scale corresponding to the third features of the plurality of different dimensions, and generating the first fusion feature set by combination. For details, please refer to the following description of the embodiments, which is not repeated herein.

This first fused feature set may be combined with subsequent other features (e.g., a seventh feature set mentioned in the following embodiments) for performing segmentation, classification, and detection operations (which may be understood as an intermediate stage in the downsampling process). Or the first fused feature set can be directly used for realizing operations such as segmentation, classification and detection (in this case, the last stage in the downsampling process can be understood).

It should also be understood that the present embodiment is exemplified by 2-modality (i.e., first-modality image and second-modality image) inputs. The same can be done if based on more modality images (e.g., 3, 4, or more). For example, based on the training method of the network model described in the above embodiment, the features corresponding to the multiple modality images are subjected to stitching and convolution operations, so as to generate the fusion features.

Also, the feature transformation portion of the embodiments of the present application can be applied to a variety of infrastructure network structures, the purpose of which is to fuse multimodal features. Therefore, the feature transformation part provided by the embodiment of the application can achieve the effect at different parts of other networks.

Therefore, the method for generating the fused feature set provided by the embodiment of the application enhances the fusion and interpretability of the multi-modal features by performing overall feature transformation on the features with different scales.

Fig. 6 is a schematic flowchart illustrating a method for generating a fused feature set according to an embodiment of the present application to generate a first fused feature set. Fig. 6 is an example of generating the first fused feature set in the embodiment related to fig. 5, and the same parts are not described again, and the differences are emphasized here. As shown in fig. 6, the operation of generating the first fused feature set includes the following.

610: and inputting the first feature set and the second feature set into a feature processing model, and determining a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set by using a feature transformation module.

In one embodiment, the feature processing model includes a feature transformation module and a feature fusion module.

In an embodiment, there is a spatial matching relationship between the third set of features and the first set of features.

Specifically, the spatial matching is substantially the same as that described in the embodiment of fig. 1, and please refer to the description of the embodiment for details, which is not repeated herein again to avoid redundancy.

The third feature set may be obtained by a feature transformation module. The manner of obtaining may be based on the principle of elastic registration, obtained by offset calculation. The obtaining manner may also be obtained by calculation of other algorithms, and this is not particularly limited in this embodiment of the application.

620: and utilizing a feature fusion module to generate a first fusion feature set of a first scale based on fusion of the first feature set and the third feature set.

In particular, generating the first fused feature set may be based on spatial matching. Generating the first fused feature set may also be based on a convolution operation. The generating of the first fused feature set may also be based on a pooling operation, which is not particularly limited in this embodiment of the present application.

Therefore, the embodiment of the application carries out feature transformation and fusion in the same feature processing model, and effectively solves the problems of inaccurate registration, difference in registration results, overfitting of a network and the like. Meanwhile, the end-to-end basic concept is followed, and the extracted features are optimized.

According to an embodiment of the present application, determining, by using a feature transformation module, a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set includes: and performing elastic registration operation on the first feature set and the second feature set by using a feature transformation module to determine a third feature set.

Specifically, to facilitate the description of the embodiments described below, the first modality image may be defined as a fixed modality image, and the second modality image may be defined as a moving modality image.

The manner of the elastic registration operation may be calculated by the offset between the second set of features and the first set of features. Namely, resampling the second features with different dimensions in the second feature set with the first scale corresponding to the second modal image one by one to obtain a third feature set. The elastic registration operation may also be performed by adding or subtracting offsets to the second features one by one to obtain a third feature set. The operation manner of the elastic registration is not particularly limited in the embodiments of the present application.

Therefore, the feature is subjected to elastic registration operation, no additional label is required to be provided, and the network learns the feature completely, so that the registration transformation is performed on the feature in a mode of being beneficial to overall classification, segmentation or detection tasks, and the fusion and interpretability of the multi-modal feature are enhanced.

Fig. 7 is a flowchart illustrating an elastic registration operation of a method for generating a fused feature set according to an embodiment of the present application. Fig. 7 is an example of the elastic registration operation in the embodiment related to fig. 6, and the same parts are not described again, and the differences are emphasized here. As shown in fig. 7, the elastic registration operation includes the following.

710: and respectively determining the offsets between the second features with different dimensions and the corresponding first features by using a feature transformation module to generate an offset set corresponding to the second feature set.

In an embodiment, the first feature set includes a plurality of first features of different dimensions, the second feature set includes a plurality of second features of different dimensions, and the plurality of first features of different dimensions and the plurality of second features of different dimensions are in a one-to-one dimensional correspondence relationship.

In particular, the dimensional correspondence may be a one-to-one correspondence in a spatial dimension. For example, the first feature and the second feature of the same dimension are in a corresponding relationship.

The offset may be one or two, and this is not particularly limited in the embodiment of the present application. When the offset amount is two, one may be an X-direction offset amount and the other may be a Y-direction offset amount. Since the second feature set has a plurality of second features with different dimensions, the offset set also integrates offsets between the plurality of second features with different dimensions and the corresponding first features. Wherein the size of the dimension is set by the base network.

720: and performing characteristic dimension resampling operation on a plurality of second characteristics with different dimensions in the second characteristic set by utilizing a characteristic transformation module based on the offset set to determine a third characteristic set.

Specifically, the resampling operation (see fig. 8) may be a set (i.e., a third feature set, e.g., 802 in fig. 8) of new feature points (e.g., point a inside 801 or 802 in fig. 8, it should be understood that point a inside 801 corresponds to point a inside 802) obtained by adding/subtracting the X offset and the Y offset position to the feature point (e.g., point B inside 801 in fig. 8) position for each feature point (e.g., point B inside 801 in fig. 8) in the second feature set (e.g., 801 in fig. 8) of the second features of a plurality of different dimensions. Illustratively, the point a and the point B mentioned above are pixel points at corresponding positions.

Therefore, the elastic registration is realized by resampling through the offset, and the problems of inaccurate registration and difference in registration are effectively avoided.

Fig. 9 is a flowchart illustrating an elastic registration operation of a method for generating a fused feature set according to another embodiment of the present application. Fig. 9 is an example of the elastic registration operation in the embodiment related to fig. 8, and the same parts are not described again, and the differences are emphasized here. As shown in fig. 9, the elastic registration operation includes the following.

910: and for each second feature in the plurality of second features with different dimensions, performing feature splicing operation on the second feature and the first feature corresponding to the second feature to generate a first spliced feature.

In particular, the splice may be a length splice, e.g., a 3 x 3 matrix of second features and first features, the resulting first splice feature being a 3 x 6 matrix.

920: and performing convolution operation on the first splicing features to determine a first direction offset and a second direction offset corresponding to the second features.

Specifically, the convolution operation may be to convolve the first stitching feature into an original size, for example, a 3 × 6 matrix as the first stitching feature, and to convolve into a 3 × 3 matrix again through the convolution operation.

Two channels are output according to the first splicing feature convolved into the original size, one channel represents a first direction offset (for example, an X direction offset or a Y direction offset), and the other channel represents a second direction offset (for example, an X direction offset or a Y direction offset).

930: and generating an offset set based on the first direction offset and the second direction offset corresponding to the second features of the plurality of different dimensions.

Specifically, the second feature of the plurality of different dimensions generates a first directional offset and a second directional offset of the plurality of dimensions, and the offset set is generated based on the first directional offset and the second directional offset of the plurality of dimensions.

Fig. 10 is a schematic flowchart illustrating a method for generating a fused feature set according to another embodiment of the present application to generate a first fused feature set. Fig. 10 is an example of generating the first fused feature set in the embodiment related to fig. 6, and the same parts are not described again, and the differences are emphasized here. As shown in fig. 10, the operation of generating the first fused feature set includes the following.

1010: and for each third feature in the plurality of third features with different dimensions, performing feature splicing operation on the third feature and the first feature corresponding to the third feature to generate a second spliced feature.

In an embodiment, the third feature set includes a plurality of third features of different dimensions, and the plurality of third features of different dimensions and the plurality of first features of different dimensions are in a one-to-one dimensional correspondence relationship.

Specifically, the splicing operation is substantially the same as step 910 in fig. 9, please refer to the related description in fig. 9 for details, which is not described herein again.

1020: and performing convolution operation on the second splicing features to determine first fusion features of the first scale corresponding to the third features.

Specifically, the convolution operation is substantially the same as step 920 in fig. 9, please refer to the related description in fig. 9 for details, which is not repeated herein.

1030: and generating a first fused feature set based on the first fused features of the first scale corresponding to the third features of the plurality of different dimensions.

Specifically, the third features of multiple different dimensions generate respective corresponding first fusion features, and the first fusion features are combined into a first fusion feature set.

Therefore, the embodiment of the application avoids the problem of insufficient feature fusion through the operation of splicing first and then convoluting, and improves the fusion efficiency.

Fig. 11 is a flowchart illustrating a method for generating a fused feature set according to another embodiment of the present application. Fig. 11 is an example of the embodiment of fig. 5, and the same parts are not described again, and the differences are emphasized here. As shown in fig. 11, the method of generating the fused feature set includes the following.

1110: and determining a fourth feature set of the second scale corresponding to the first modality image and a fifth feature set of the second scale corresponding to the second modality image.

Specifically, the second dimension is substantially the same as that described in the above embodiments, and for details, reference is made to the description of the above embodiments, and details are not repeated here.

1120: and inputting the fourth feature set and the fifth feature set into the feature processing model to obtain a second fusion feature set of a second scale.

In particular, the feature processing model may include a feature transformation module and a feature fusion module.

In an example, the manner of obtaining the second fused feature set may be: first, in the feature transformation module (see fig. 4), a sixth feature set 406 is obtained based on the fourth feature set 404 and the fifth feature set 405 (the sixth feature set is obtained in the same manner as the third feature set 403 is generated). Then, within the feature fusion module (see fig. 3), the fourth feature set 304 (equivalent to the fourth feature set 404 in fig. 4) and the sixth feature set 306 (equivalent to the sixth feature set 406 in fig. 4) generate a second fused feature set 315 (the operation of obtaining the second fused feature set is the same as the operation of generating the first fused feature set) by a splicing operation, a convolution operation, or the like.

It should be appreciated that the process of generating the second fused feature set is a repeated operation in the downsampling process of the first modality image and the second modality image. The process of operations of fig. 3 and 4 also repeats itself with the process of downsampling, generating a fourth fused feature set 317 and a fifth fused feature set 319.

1130: and obtaining a third fusion feature set corresponding to the first fusion feature set and the second fusion feature set by using the feature processing model.

In particular, the feature processing model may include a feature fusion module.

In an example, the manner of obtaining the third fused feature set may be: first, in the feature fusion module (see fig. 3), the seventh feature set 314 of the second scale corresponding to the first fusion feature set 313 is determined based on the first fusion feature set 313. Then, still within the feature fusion module (see fig. 3), a third fused feature set 320 is generated based on the fusion of the seventh feature set 313 and the second fused feature set 315. For details, please refer to the following description of the embodiments, which is not repeated herein.

It should be understood that the process of generating the seventh feature set and the process of generating the second fused feature set may be set by the base network itself. The seventh feature set may be generated first and then the second fused feature set may be generated, the second fused feature set may be generated first and then the seventh feature set may be generated, or both processes may be performed simultaneously, and the order of the processes is not specifically limited in the embodiments of the present application.

Therefore, for the multi-modal deep learning problem, the feature transformation module can be automatically utilized to register the features with different scales, no additional label is needed, the network learns the features by self, the registration transformation of the features is facilitated in a mode of integral classification, segmentation or detection tasks, and the fusion and the interpretability of the multi-modal features are enhanced.

According to an embodiment of the present application, obtaining a third fused feature set corresponding to the first fused feature set and the second fused feature set by using the feature processing model includes: determining a seventh feature set of the second scale corresponding to the first fusion feature set by using the feature processing model; and performing fusion operation on the seventh feature set and the second fusion feature set by using the feature processing model to obtain a third fusion feature set.

In particular, the seventh feature set (see seventh feature set 313 in fig. 3) may be obtained in such a way that a max pooling operation is performed. This maximum pooling operation may be understood as a corresponding scaling down of the scale per maximum pooling operation. The scale of the reduction may be self-set by the neural network, such as one reduction 1/2, one reduction 1/4, and so on. And automatically stops when scaling down to a scaling down threshold (e.g., 1/16 or 1/32) set by the base network itself.

In an example, the third fused feature set may be obtained by: firstly, for each seventh feature in the seventh features of multiple different dimensions, performing feature splicing operation on the seventh feature and a second fusion feature corresponding to the seventh feature to generate a third spliced feature. And then, performing convolution operation on the third spliced feature to determine a third fused feature corresponding to the seventh feature. And finally, generating a third fused feature set based on the third fused features corresponding to the seventh features of the plurality of different dimensions.

It should be understood that this third fused feature set is also a set of features having multiple dimensions. And when the third fused feature set is a downsampled last level size, the third fused feature set may be used for segmentation, classification, and detection tasks.

The operation step of obtaining the third fused feature set is basically the same as the operation process of generating the first fused feature set or the second fused feature set, and is not described herein again.

Therefore, the seventh feature set of the second scale is obtained through the maximum pooling operation, so that the receptive field can be larger, the feature dimension is improved, and the feature information is enriched. Meanwhile, different characteristics are fused, so that the multiple characteristic information of multiple modes can be fully reflected, and the characteristics of a single mode are not reflected.

For example, a series of operations performed by the feature transformation module and the feature fusion module in the foregoing embodiments of the present application may be executed by the corresponding modules themselves, or executed by a processor or a module controlled by a person, which is not described herein again.

Fig. 12 is a schematic structural diagram of a training apparatus for a network model according to an embodiment of the present application, where the training apparatus 1200 for a network model includes: a determination module 1210 and a fusion module 1220.

A determining module 1210 for determining an initial network model; a training module 1220, configured to train an initial network model based on a sample image including label information to generate a feature processing model, where the feature processing model is configured to generate a first fused feature set at a first scale based on a fusion of a first feature set at the first scale corresponding to a first modality image and a second feature set at the first scale corresponding to a second modality image.

According to the training device for the network model, the feature transformation and fusion of the modal images are integrally performed through the feature processing model, and the fusion and interpretability of multi-modal features are enhanced. In addition, the embodiment of the application is an end-to-end scheme, additional manual operation is not needed, the complexity of the network is not changed greatly, and the method is easy to use and train. The performance is optimized, the speed is improved, and the occupied space of the network is reduced.

According to an embodiment of the application, the feature processing model includes a feature transformation module and a feature fusion module, the feature transformation module is configured to determine a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set, a spatial matching relationship exists between the third feature set and the first feature set, and the feature fusion module is configured to generate a first fusion feature set based on the first feature set and the third feature set.

According to an embodiment of the present application, the feature processing model is further configured to generate a second fused feature set at the second scale based on fusion of a fourth feature set at the second scale corresponding to the first modality image and a fifth feature set at the second scale corresponding to the second modality image, and generate a third fused feature set by stitching the second fused feature set and the first fusion.

It should be understood that, for the specific working processes and functions of the determining module 1210 and the fusing module 1220 in the foregoing embodiments, reference may be made to the description in the network model training method provided in the foregoing embodiments of fig. 1 to 4, and in order to avoid repetition, details are not described here again.

Fig. 13 is a schematic structural diagram of an apparatus for generating a fused feature set according to an embodiment of the present application, where the apparatus 1300 for generating a fused feature set includes: a determination module 1310 and an input module 1320.

A determining module 1310 configured to determine a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image; an input module 1320, configured to input the first feature set and the second feature set into the feature processing model to obtain a first fused feature set of the first scale.

The embodiment of the application enhances the fusion and the interpretability of the multi-modal features by carrying out integral feature transformation on the features with different scales. In addition, the embodiment of the application is an end-to-end scheme, additional manual operation is not needed, the complexity of the network is not changed greatly, and the method is easy to use and train. The performance is optimized, the speed is improved, and the occupied space of the network is reduced.

According to an embodiment of the present application, the feature processing model includes a feature transformation module and a feature fusion module, the input module 1320 is configured to input the first feature set and the second feature set into the feature processing model, and determine, by using the feature transformation module, a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set, where a spatial matching relationship exists between the third feature set and the first feature set; and utilizing a feature fusion module to generate a first fusion feature set of a first scale based on fusion of the first feature set and the third feature set.

According to an embodiment of the present application, the input module 1320 is configured to perform an elastic registration operation on the first feature set and the second feature set by using the feature transformation module to determine a third feature set.

According to an embodiment of the present application, the first feature set includes a plurality of first features with different dimensions, the second feature set includes a plurality of second features with different dimensions, and the plurality of first features with different dimensions and the plurality of second features with different dimensions are in a one-to-one dimensional correspondence relationship, the input module 1320 is configured to determine, by using the feature transformation module, offsets between the plurality of second features with different dimensions and the respective corresponding first features, respectively, so as to generate an offset set corresponding to the second feature set; and performing characteristic dimension resampling operation on a plurality of second characteristics with different dimensions in the second characteristic set by utilizing a characteristic transformation module based on the offset set to determine a third characteristic set.

According to an embodiment of the present application, the input module 1320 is configured to, for each of a plurality of second features with different dimensions, perform a feature stitching operation on the second feature and a first feature corresponding to the second feature to generate a first stitched feature; performing convolution operation on the first splicing features to determine a first direction offset and a second direction offset corresponding to the second features; and generating an offset set based on the first direction offset and the second direction offset corresponding to the second features of the plurality of different dimensions.

According to an embodiment of the present application, the third feature set includes a plurality of third features with different dimensions, and the plurality of third features with different dimensions and the plurality of first features with different dimensions are in a one-to-one dimensional correspondence relationship, and the input module 1320 is configured to, for each of the plurality of third features with different dimensions, perform a feature stitching operation on the third feature and the first feature corresponding to the third feature to generate a second stitched feature; performing convolution operation on the second splicing features to determine first fusion features of the first scale corresponding to the third features; and generating a first fused feature set based on the first fused features of the first scale corresponding to the third features of the plurality of different dimensions.

According to an embodiment of the present application, the determining module 1310 is configured to determine a fourth feature set of the first modality image at the second scale and a fifth feature set of the second modality image at the second scale; an input module 1320, configured to input the fourth feature set and the fifth feature set into the feature processing model to obtain a second fused feature set at a second scale; and obtaining a third fusion feature set corresponding to the first fusion feature set and the second fusion feature set by using the feature processing model.

According to an embodiment of the present application, the input module 1320 is configured to determine, by using the feature processing model, a seventh feature set of the second scale corresponding to the first fused feature set; and performing fusion operation on the seventh feature set and the second fusion feature set by using the feature processing model to obtain a third fusion feature set.

It should be understood that, for the specific working processes and functions of the determining module 1110 and the input module 1320 in the foregoing embodiments, reference may be made to the description in the method for generating the fusion feature set provided in the foregoing embodiments of fig. 5 to 11, and in order to avoid repetition, details are not described here again.

Fig. 14 is a block diagram of an electronic device 1400 for a training method of a network model or a method of generating a fused feature set according to an exemplary embodiment of the present application.

Referring to fig. 14, electronic device 1400 includes a processing component 1410 that further includes one or more processors, and memory resources, represented by memory 1420, for storing instructions, such as application programs, that are executable by processing component 1410. The application programs stored in memory 1420 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1410 is configured to execute instructions to perform the above-described method of training of a network model or method of generating a fused feature set.

The electronic device 1400 may also include a power component configured to perform power management of the electronic device 1400, a wired or wireless network interface configured to connect the electronic device 1400 to a network, and an input-output (I/O) interface. The electronic device 1400 may be operated based on an operating system stored in the memory 1420, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.

A non-transitory computer readable storage medium, wherein instructions of the storage medium, when executed by a processor of the electronic device 1400, enable the electronic device 1400 to perform a network model training method, the method comprising: determining an initial network model; training an initial network model based on a sample image including label information to generate a feature processing model; or a method of generating a fused feature set, comprising: determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image; and inputting the first feature set and the second feature set into a feature processing model to obtain a first fused feature set of a first scale.

All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program check codes, such as a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in the description of the present application, the terms "first", "second", "third", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims

1. A method of generating a fused feature set, comprising:

determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image;

and inputting the first feature set and the second feature set into a feature processing model to obtain a first fused feature set of a first scale.

2. The method of generating a fused feature set according to claim 1, wherein the feature processing model comprises a feature transformation module and a feature fusion module, and the inputting the first feature set and the second feature set into the feature processing model to obtain the first fused feature set at the first scale comprises:

inputting the first feature set and the second feature set into a feature processing model, and determining a third feature set of a first scale corresponding to the second feature set based on the first feature set and the second feature set by using the feature transformation module, wherein a spatial matching relationship exists between the third feature set and the first feature set;

and generating a first fused feature set of a first scale based on the fusion of the first feature set and the third feature set by using the feature fusion module.

3. The method of generating a fused feature set according to claim 1, wherein the determining, by the feature transformation module, a third feature set of the first scale corresponding to the second feature set based on the first feature set and the second feature set comprises:

performing, by the feature transformation module, an elastic registration operation on the first feature set and the second feature set to determine the third feature set.

4. The method of generating a fused feature set according to claim 3, wherein the first feature set comprises a plurality of first features of different dimensions, the second feature set comprises a plurality of second features of different dimensions, and the plurality of first features of different dimensions and the plurality of second features of different dimensions are in a one-to-one dimensional correspondence, and performing an elastic registration operation on the first feature set and the second feature set by using the feature transformation module to determine the third feature set comprises:

respectively determining offsets between the second features of the plurality of different dimensions and the corresponding first features by using the feature transformation module to generate an offset set corresponding to the second feature set;

performing, by the feature transformation module, feature dimension resampling operation on the second features of the plurality of different dimensions included in the second feature set based on the offset set, so as to determine the third feature set.

5. The method for generating a fused feature set according to claim 4, wherein the determining, by the feature transformation module, offsets between the second features of the plurality of different dimensions and the corresponding first features respectively to generate the offset sets corresponding to the second feature sets comprises:

for each second feature in the plurality of second features with different dimensions, performing feature splicing operation on the second feature and a first feature corresponding to the second feature to generate a first spliced feature;

performing convolution operation on the first splicing features to determine a first direction offset and a second direction offset corresponding to the second features;

and generating the offset set based on the first direction offset and the second direction offset corresponding to the second features of the plurality of different dimensions respectively.

6. The method of generating a fused feature set according to any one of claims 1 to 5, wherein the third feature set comprises a plurality of third features of different dimensions, and the third features of different dimensions and the first features of different dimensions have a one-to-one dimensional correspondence, and the generating a first fused feature set of a first scale based on the fusion of the first feature set and the third feature set comprises:

for each third feature in the plurality of third features with different dimensions, performing feature splicing operation on the third feature and a first feature corresponding to the third feature to generate a second spliced feature;

performing convolution operation on the second splicing feature to determine a first fusion feature of a first scale corresponding to the third feature;

and generating the first fused feature set based on the first fused features of the first scale corresponding to the third features of the plurality of different dimensions.

7. The method of generating a fused feature set according to any one of claims 1 to 5, further comprising:

determining a fourth feature set of the first modality image corresponding to the second scale and a fifth feature set of the second modality image corresponding to the second scale;

inputting the fourth feature set and the fifth feature set into the feature processing model to obtain a second fused feature set of a second scale;

and obtaining a third fused feature set corresponding to the first fused feature set and the second fused feature set by using the feature processing model.

8. The method according to any one of claims 1 to 5, wherein the obtaining, by using the feature processing model, a third fused feature set corresponding to the first fused feature set and the second fused feature set includes:

determining a seventh feature set of a second scale corresponding to the first fusion feature set by using the feature processing model;

and performing fusion operation on the seventh feature set and the second fusion feature set by using the feature processing model to obtain the third fusion feature set.

9. An apparatus for generating a fused feature set, comprising:

the determining module is used for determining a first feature set of a first scale corresponding to the first modality image and a second feature set of the first scale corresponding to the second modality image;

and the input module is used for inputting the first feature set and the second feature set into a feature processing model to obtain a first fusion feature set of a first scale.

10. A computer-readable storage medium storing a computer program for executing the method of generating a fused feature set according to any one of claims 1 to 8.

11. An electronic device, the electronic device comprising:

a processor;

a memory for storing the processor-executable instructions,

wherein the processor is configured to perform the method of generating a fused feature set of any of the preceding claims 1 to 8.