CN111340059A - Image feature extraction method and device, electronic equipment and storage medium

Image feature extraction method and device, electronic equipment and storage medium

Info

Publication number
CN111340059A
Authority
CN
China
Prior art keywords
features
feature
level
layer
image
Prior art date
Legal status
Pending
Application number
CN201811561327.0A
Other languages
Chinese (zh)
Inventor
赵元
尹程翔
伍林
唐剑
沈海峰
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811561327.0A priority Critical patent/CN111340059A/en
Publication of CN111340059A publication Critical patent/CN111340059A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide an image feature extraction method and device, an electronic device, and a storage medium, and belong to the technical field of images. In the method, M features of different levels extracted from an image to be processed are processed twice to obtain M layers of first intermediate features and M layers of second intermediate features, and the M layers of first intermediate features and the M layers of second intermediate features are then fused to obtain M layers of image features. Each layer of image features obtained in this way contains the M different-level features in a balanced proportion; that is, each layer of image features carries balanced high-level and low-level information. Because low-level information is sensitive to certain details, it can provide information that is beneficial to localization and segmentation.

Description

Image feature extraction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image technologies, and in particular, to an image feature extraction method and apparatus, an electronic device, and a storage medium.
Background
Instance segmentation is a very important direction in the field of computer vision and has wide application in fields such as autonomous driving and household robots. The task combines the characteristics of semantic segmentation and object detection: for each object in an input image, an independent pixel-level mask is generated and the category of that mask is predicted. To better predict each object in the input image, the prior art first performs feature extraction on the input image and then fuses the extracted features using the Feature Pyramid Network (FPN) structure of the Mask Region Convolutional Neural Network (Mask R-CNN). The features finally obtained in this way contain mostly high-level information. Because high-level information recognizes large objects better while low-level information recognizes small objects better, the high-level and low-level information contained in the features obtained in the above manner is unbalanced, and the recognition quality for large and small objects differs considerably when those features are subsequently used for instance segmentation.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image feature extraction method, an image feature extraction device, an electronic device, and a storage medium, so that a high-level network can obtain low-level information more easily and more completely, making the high-level and low-level information more balanced and achieving a balanced segmentation effect for large and small objects.
In a first aspect, an embodiment of the present application provides an image feature extraction method, where the method includes: acquiring an image to be processed, and performing feature extraction on the image to be processed to acquire M different-level features, wherein M is an integer greater than or equal to 2; processing the M different level features according to a first level direction to obtain M layers of first intermediate features, and processing the M different level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features; and processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
In this implementation, the M different-level features extracted from the image to be processed are processed twice to obtain M layers of first intermediate features and M layers of second intermediate features, and the M layers of first intermediate features and the M layers of second intermediate features are then fused to obtain M layers of image features. Each layer of image features obtained in this way contains the M different-level features in a relatively balanced proportion; that is, each layer of image features carries relatively balanced high-level and low-level information. Because low-level information is sensitive to certain details, it can provide information that is beneficial to localization and segmentation. By processing the features multiple times in this way, a high-level network can obtain low-level information more easily and more completely, so that the high-level and low-level information is better balanced and a balanced segmentation effect for large and small objects is achieved.
Optionally, processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features includes: determining the layer-1 first intermediate feature as the layer-1 image feature of the M layers of image features; and taking i from 2 to M in sequence, fusing the layer-i first intermediate feature with the layer-(i-1) second intermediate feature to obtain the layer-i image feature, so that when i reaches M, a further M-1 layers of image features are obtained.
In this implementation, the M layers of image features obtained by processing the M layers of first intermediate features and the M layers of second intermediate features include the layer-1 first intermediate feature and the fused feature produced each time a first intermediate feature and a second intermediate feature are fused. Through this repeated processing of the features, a high-level network can obtain low-level information more easily and more completely, so that the high-level and low-level information is better balanced and a balanced segmentation effect for large and small objects is achieved.
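Purely as an illustration of this fusion step (not the application's specified implementation), the following Python sketch assumes PyTorch-style tensors, a shared channel count across layers, elementwise addition as the fusion operation, and nearest-neighbor resizing to reconcile spatial sizes:

```python
import torch.nn.functional as F

def fuse_intermediates(P, N):
    """P: list of M first intermediate features (index 0 = layer 1).
    N: list of M second intermediate features (index 0 = layer 1).
    Returns the M layers of image features described above."""
    O = [P[0]]  # layer-1 image feature = layer-1 first intermediate feature
    for i in range(1, len(P)):  # corresponds to layers 2..M
        n = F.interpolate(N[i - 1], size=P[i].shape[-2:], mode="nearest")
        O.append(P[i] + n)  # fuse layer-i first intermediate with layer-(i-1) second intermediate
    return O
```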
Optionally, performing feature extraction on the image to be processed to obtain M different level features, including: and performing feature extraction on the image to be processed through a neural network, and outputting M different-level features through M network layers with different network depths in the neural network.
In this implementation, the neural network is used to extract features from the image to be processed, so that M different-level features output by network layers at different network depths are obtained; that is, features carrying both high-level and low-level information of the image to be processed can be obtained, and when these features are subsequently used for instance segmentation, the segmentation result can be more accurate.
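As a toy, non-authoritative illustration (the application does not fix a particular backbone), the sketch below uses a small PyTorch network in which each of four stages sits at a different depth and emits one level of features, with resolution decreasing as depth increases:

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Hypothetical 4-stage backbone: each stage halves the spatial size,
    and each stage's output is one of the M = 4 different-level features."""
    def __init__(self, channels=64):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3 if i == 0 else channels, channels, 3, stride=2, padding=1),
                nn.ReLU(),
            )
            for i in range(4)
        ])

    def forward(self, x):
        levels = []
        for stage in self.stages:
            x = stage(x)
            levels.append(x)  # C1 (low level, high resolution) ... C4 (high level, low resolution)
        return levels

# Usage: four feature maps at decreasing resolution from one input image.
features = ToyBackbone()(torch.randn(1, 3, 256, 256))
print([f.shape for f in features])
```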
Optionally, processing the M different hierarchical features according to a first hierarchical direction to obtain M layers of first intermediate features, and processing the M different hierarchical features according to a second hierarchical direction opposite to the first hierarchical direction to obtain M layers of second intermediate features, including: processing the M different-level features according to the direction from the high-level features to the low-level features to obtain M layers of first intermediate features; and processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of second intermediate features.
In the implementation process, the M-level features are processed respectively according to two different level directions, so that M-level first intermediate features and M-level second intermediate features are obtained, and the M-level first intermediate features and the M-level second intermediate features both comprise the M-level features, namely after the M-level features are processed, a high-level network can more easily and more comprehensively obtain low-level information.
Optionally, the level of the i-th hierarchical feature among the M different hierarchical features is higher than that of the (i-1)-th hierarchical feature, where i is an integer greater than or equal to 2 and less than or equal to M, and processing the M different hierarchical features in the direction from higher-level features to lower-level features to obtain M layers of first intermediate features includes: determining the M-th hierarchical feature as the layer-M first intermediate feature of the M layers of first intermediate features; and taking i from M-1 down to 1 in sequence, fusing the M-i+1 hierarchical features from the M-th hierarchical feature down to the i-th hierarchical feature to obtain the layer-i first intermediate feature, so that when i reaches 1, a further M-1 layers of first intermediate features are obtained.
In this implementation, the M layers of first intermediate features obtained by processing the M hierarchical features in this way include the M-th hierarchical feature and the fused feature produced by each fusion, so that after the processing, a high-level network can obtain low-level information more easily and more completely.
Optionally, taking i from M-1 down to 1 in sequence, fusing the M-th to i-th hierarchical features to obtain the layer-i first intermediate feature, and obtaining a total of M-1 first intermediate features when i reaches 1, includes: in the neural network, taking i from M-1 down to 1 in sequence along the direction from deep to shallow network depth, upsampling the M-th hierarchical feature output by the M-th network layer and fusing it with the M-i+1 hierarchical features down to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i first intermediate feature; when i reaches 1, a total of M-1 first intermediate features are obtained.
In this implementation, the M-th hierarchical feature is upsampled in the neural network before being fused with the i-th hierarchical feature, so that it can be brought to the same size as the i-th hierarchical feature before fusion, which facilitates feature fusion.
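A hedged sketch of this top-down pass (assuming nearest-neighbor upsampling, elementwise addition as the fusion operation, and a shared channel count, none of which is mandated by the application) could be:

```python
import torch.nn.functional as F

def top_down_intermediates(C):
    """C: list of M level features, C[0] lowest level ... C[-1] highest.
    Returns the M layers of first intermediate features, built from high to low level."""
    M = len(C)
    P = [None] * M
    P[-1] = C[-1]  # layer-M first intermediate feature = M-th hierarchical feature
    for i in range(M - 2, -1, -1):  # i runs from M-1 down to 1 in the application's 1-based indexing
        up = F.interpolate(P[i + 1], size=C[i].shape[-2:], mode="nearest")
        P[i] = C[i] + up  # the upsampled higher levels are fused into level i
    return P
```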
Optionally, processing the M different hierarchical features in the direction from lower-level features to higher-level features to obtain M layers of second intermediate features includes: determining the 1st hierarchical feature as the layer-1 second intermediate feature of the M layers of second intermediate features; and taking i from 2 to M in sequence, fusing the i hierarchical features from the 1st hierarchical feature up to the i-th hierarchical feature to obtain the layer-i second intermediate feature, so that when i reaches M, a further M-1 layers of second intermediate features are obtained.
In this implementation, the M layers of second intermediate features obtained by processing the M hierarchical features in this way include the 1st hierarchical feature and the fused feature produced by each fusion, so that after the processing, a high-level network can obtain low-level information more easily and more completely.
Optionally, taking i from 2 to M in sequence, fusing the 1st to i-th hierarchical features to obtain the layer-i second intermediate feature, and obtaining a total of M-1 second intermediate features when i reaches M, includes: in the neural network, taking i from 2 to M in sequence along the direction from shallow to deep network depth, downsampling the 1st hierarchical feature output by the 1st network layer and fusing it with the i hierarchical features up to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i second intermediate feature; when i reaches M, a total of M-1 second intermediate features are obtained.
In this implementation, the 1st hierarchical feature is downsampled in the neural network before being fused with the i-th hierarchical feature, so that it can be brought to the same size as the i-th hierarchical feature before fusion, which facilitates feature fusion.
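Symmetrically, a sketch of the bottom-up pass under the same assumptions (with max pooling standing in for the unspecified downsampling operation) might be:

```python
import torch.nn.functional as F

def bottom_up_intermediates(C):
    """C: list of M level features, C[0] lowest level ... C[-1] highest.
    Returns the M layers of second intermediate features, built from low to high level."""
    M = len(C)
    N = [None] * M
    N[0] = C[0]  # layer-1 second intermediate feature = 1st hierarchical feature
    for i in range(1, M):  # i runs from 2 to M in the application's 1-based indexing
        down = F.adaptive_max_pool2d(N[i - 1], C[i].shape[-2:])
        N[i] = C[i] + down  # the downsampled lower levels are fused into level i
    return N
```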
Optionally, after the processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features, the method further includes: and segmenting at least partial region of the image to be processed based on the M layers of image features to obtain a segmentation result.
Optionally, segmenting at least a partial region of the image to be processed based on the M layers of image features to obtain a segmentation result, including: and performing semantic segmentation on at least part of the region of the image to be processed based on the M layers of image features to obtain a semantic segmentation result.
Optionally, segmenting at least a partial region of the image to be processed based on the M layers of image features to obtain a segmentation result, including: and performing example segmentation on at least partial region of the image to be processed based on the M layers of image features to obtain an example segmentation result.
In this implementation, instance segmentation or semantic segmentation is performed on the image to be processed based on the finally obtained M layers of image features. Because each layer of image features contains balanced high-level and low-level information, a balanced semantic segmentation or instance segmentation effect for large and small objects can be achieved.
In a second aspect, an embodiment of the present application provides an image feature extraction apparatus, including:
the image feature extraction module is used for acquiring an image to be processed, extracting features of the image to be processed and acquiring M different-level features, wherein M is an integer greater than or equal to 2;
the first feature processing module is used for processing the M different-level features according to a first level direction to obtain M layers of first intermediate features; and
the second feature processing module is used for processing the M different-level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features;
and the third feature processing module is used for processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
Optionally, the level of the i-th hierarchical feature among the M different hierarchical features is lower than that of the (i+1)-th hierarchical feature, and the third feature processing module is configured to determine the layer-1 first intermediate feature as the layer-1 image feature of the M layers of image features; and to take i from 2 to M in sequence, fusing the layer-i first intermediate feature with the layer-(i-1) second intermediate feature to obtain the layer-i image feature, so that when i reaches M, a further M-1 layers of image features are obtained.
Optionally, the image feature extraction module is configured to perform feature extraction on the image to be processed through a neural network, and output M different level features through M network layers with different network depths in the neural network.
Optionally, the first feature processing module is configured to process the M different-level features in a direction from a higher-level feature to a lower-level feature, so as to obtain M layers of first intermediate features;
and the second feature processing module is used for processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of second intermediate features.
Optionally, the level of the i-th hierarchical feature among the M different hierarchical features is higher than that of the (i-1)-th hierarchical feature, where i is an integer greater than or equal to 2 and less than or equal to M, and the first feature processing module is configured to determine the M-th hierarchical feature as the layer-M first intermediate feature of the M layers of first intermediate features; and to take i from M-1 down to 1 in sequence, fusing the M-i+1 hierarchical features from the M-th hierarchical feature down to the i-th hierarchical feature to obtain the layer-i first intermediate feature, so that when i reaches 1, a further M-1 layers of first intermediate features are obtained.
Optionally, the first feature processing module is further configured to take i from M-1 down to 1 in sequence along the direction from deep to shallow network depth in the neural network, upsample the M-th hierarchical feature output by the M-th network layer of the neural network, and fuse it with the M-i+1 hierarchical features down to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i first intermediate feature; when i reaches 1, a total of M-1 first intermediate features are obtained.
Optionally, the level of the i-th hierarchical feature among the M different hierarchical features is lower than that of the (i+1)-th hierarchical feature, where i is an integer greater than or equal to 1 and less than or equal to M, and the second feature processing module is configured to determine the 1st hierarchical feature as the layer-1 second intermediate feature of the M layers of second intermediate features; and to take i from 2 to M in sequence, fusing the i hierarchical features from the 1st hierarchical feature up to the i-th hierarchical feature to obtain the layer-i second intermediate feature, so that when i reaches M, a further M-1 layers of second intermediate features are obtained.
Optionally, the second feature processing module is further configured to take i from 2 to M in sequence along the direction from shallow to deep network depth in the neural network, downsample the 1st hierarchical feature output by the 1st network layer of the neural network, and fuse it with the i hierarchical features up to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i second intermediate feature; when i reaches M, a total of M-1 second intermediate features are obtained.
Optionally, the apparatus further comprises:
and the image segmentation module is used for segmenting at least part of the region of the image to be processed based on the M layers of image characteristics to obtain a segmentation result.
Optionally, the image segmentation module is specifically configured to perform semantic segmentation on at least a partial region of the image to be processed based on the M layers of image features to obtain a semantic segmentation result.
Optionally, the image segmentation module is specifically configured to perform example segmentation on at least a partial region of the image to be processed based on the M-layer image features, so as to obtain an example segmentation result.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of an image feature extraction method provided in an embodiment of the present application;
FIG. 2 is a schematic view of feature fusion shown in an embodiment of the present application;
FIG. 3 is a schematic diagram of an application of feature extraction in an embodiment of the present application;
FIG. 4 is a schematic diagram of a network structure for dual-path mask prediction in an embodiment of the present application;
FIG. 5 is a flowchart of an embodiment of an application of the image feature extraction method of the present application;
FIG. 6 is a process diagram of the embodiment of the application shown in FIG. 5;
fig. 7 is a block diagram of an image feature extraction apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, fig. 1 is a flowchart of an image feature extraction method according to an embodiment of the present application, where the method includes the following steps:
step S110: acquiring an image to be processed, and performing feature extraction on the image to be processed to obtain M different-level features.
Wherein M is an integer greater than or equal to 2, namely at least two different level features are obtained.
Representations of features in the embodiments of the present application may include, for example, but are not limited to: feature maps, feature vectors or feature matrices, etc.
The different levels correspond to network layers located at different depths of the neural network. These network layers can each extract features from the input image to be processed, so as the image to be processed passes through them, M different-level features are obtained; that is, each such network layer outputs one level of features.
The image to be processed includes, for example, but not limited to: still images, frame images in video, and the like.
Step S120: processing the M different level features according to a first level direction to obtain M layers of first intermediate features, and processing the M different level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features.
Step S130: and processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
The different hierarchical directions described above may include: the direction from a higher level feature to a lower level feature and the direction from a lower level feature to a higher level feature, e.g. the first level direction is the direction from a higher level feature to a lower level feature and the second level direction is the direction from a lower level feature to a higher level feature, or the first level direction is the direction from a lower level to a higher level and the second level direction is the direction from a higher level to a lower level.
When the M layers of first intermediate features and the M layers of second intermediate features are processed, the processing may be performed in the direction from lower levels to higher levels to obtain the M layers of image features, so that each obtained layer of image features contains the M different-level features in a balanced proportion; that is, each layer of image features contains relatively balanced high-level and low-level information.
Therefore, in this embodiment, after the three rounds of processing, the M different-level features contained in each layer of the obtained M layers of image features are relatively balanced; that is, each layer of image features contains relatively balanced high-level and low-level information. Because low-level information is sensitive to certain details, it can provide information that is beneficial to localization and segmentation.
In addition, in the above embodiment, performing feature extraction on the image to be processed to obtain M different level features may include: and performing feature extraction on the image to be processed through a neural network, and outputting the M different-level features through M network layers with different network depths in the neural network.
It should be understood that the neural network includes more than two network layers with different network depths. Among the network layers included in the neural network, a network layer used for feature extraction may be referred to as a feature layer. After the neural network receives the image to be processed, the first network layer extracts features from the input image and passes the extracted features to the second network layer; from the second network layer onwards, each network layer in turn extracts features from its input and passes the extracted features to the next network layer for further feature extraction. Following the order of input and output, that is, the order of feature extraction, the network depth of the network layers in the neural network goes from shallow to deep, the level of the features they successively extract and output goes from low to high, and the resolution of those features goes from high to low. Compared with a network layer at a shallow depth in the same neural network, a network layer at a deeper depth has a larger receptive field and focuses more on spatial structure information, so when the extracted features are used for instance segmentation, the segmentation result can be more accurate. In a neural network, a network layer may generally include: at least one convolutional layer for feature extraction, and a downsampling layer for downsampling the features (for example, feature maps) extracted by the convolutional layer; the size of the features extracted by the convolutional layer can be reduced by this downsampling.
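As a small illustrative check of the depth and resolution relationship described above (the sizes and the pooling operation are assumptions, not taken from the application), each downsampling step between feature layers halves the spatial size of the feature map:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)  # feature map from a shallow feature layer
for depth in range(1, 4):
    x = F.max_pool2d(x, kernel_size=2)  # downsampling step between feature layers
    print(depth, tuple(x.shape))  # spatial size halves at each deeper layer: 64, 32, 16
```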
In addition, as an implementation manner, in the above embodiment, the processing the M different level features according to the first level direction to obtain M layers of first intermediate features includes: processing the M different-level features according to the direction from the high-level features to the low-level features to obtain M layers of first intermediate features; processing the M different level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features, including: and processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of second intermediate features.
Or, processing M different level features according to a first level direction to obtain M layers of first intermediate features, including: processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of first intermediate features; processing the M different level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features, including: and processing the M different-level features according to the direction from the high-level features to the low-level features to obtain M layers of second intermediate features.
For convenience of description of the embodiments of the present application, the embodiments of the present application are described by taking a first hierarchy direction as a direction from a high-level feature to a low-level feature, and a second hierarchy direction as a direction from the low-level feature to the high-level feature.
The level of the i-th hierarchical feature among the M different hierarchical features is higher than that of the (i-1)-th hierarchical feature, where i is an integer greater than or equal to 2 and less than or equal to M. Processing the M different hierarchical features in the direction from higher-level features to lower-level features to obtain M layers of first intermediate features includes: determining the M-th hierarchical feature as the layer-M first intermediate feature of the M layers of first intermediate features; and taking i from M-1 down to 1 in sequence, fusing the M-i+1 hierarchical features from the M-th hierarchical feature down to the i-th hierarchical feature to obtain the layer-i first intermediate feature, so that when i reaches 1, a further M-1 layers of first intermediate features are obtained.
For example, if M is 4, the 4th hierarchical feature is determined as the layer-4 first intermediate feature among the 4 layers of first intermediate features. Then i is taken as 3 down to 1 in sequence, and the 4th through i-th hierarchical features are fused to obtain the layer-i first intermediate feature; when i reaches 1, 3 further first intermediate features have been obtained. Specifically, when i is 3, the 4th and 3rd hierarchical features are fused to obtain the layer-3 first intermediate feature; when i is 2, the 4th through 2nd hierarchical features are fused to obtain the layer-2 first intermediate feature; and when i is 1, the 4th through 1st hierarchical features are fused to obtain the layer-1 first intermediate feature. In this way, 4 layers of first intermediate features are obtained in total.
In the neural network, i is taken as M-1 down to 1 in sequence along the direction from deep to shallow network depth: the M-th hierarchical feature output by the M-th network layer is upsampled and fused with the M-i+1 hierarchical features down to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i first intermediate feature; when i reaches 1, a total of M-1 first intermediate features are obtained. For example, the M-th hierarchical feature is upsampled and added to the (M-1)-th hierarchical feature to obtain the layer-(M-1) first intermediate feature; the layer-(M-1) first intermediate feature may then be fused with the (M-2)-th hierarchical feature output by the next, shallower (M-2)-th network layer to obtain the layer-(M-2) first intermediate feature; and the layer-(M-2) first intermediate feature may in turn be fused with the (M-3)-th hierarchical feature to obtain the layer-(M-3) first intermediate feature. Continuing in this way, the layer-1 first intermediate feature is obtained, giving M-1 fused first intermediate features.
The M-th hierarchical feature here may be the feature output by the M-th network layer of the neural network, or a feature obtained by performing feature extraction at least once on the feature output by the M-th network layer. For example, among the features participating in the fusion, the highest-level feature may be the highest-level feature among the above M different-level features, that is, the M-th hierarchical feature, or a feature obtained by performing feature extraction one or more times on that highest-level feature; and the M layers of first intermediate features may include this highest-level feature and the fused feature obtained from each fusion.
Similarly, the level of the i-th hierarchical feature among the M different hierarchical features is lower than that of the (i+1)-th hierarchical feature, and processing the M different hierarchical features in the direction from lower-level features to higher-level features to obtain M layers of second intermediate features includes: determining the 1st hierarchical feature as the layer-1 second intermediate feature of the M layers of second intermediate features; and taking i from 2 to M in sequence, fusing the i hierarchical features from the 1st hierarchical feature up to the i-th hierarchical feature to obtain the layer-i second intermediate feature, so that when i reaches M, a further M-1 layers of second intermediate features are obtained.
For example, if M is 4, the 1st hierarchical feature is determined as the layer-1 second intermediate feature among the 4 layers of second intermediate features. Then i is taken as 2 to 4 in sequence, and the 1st through i-th hierarchical features are fused to obtain the layer-i second intermediate feature; when i reaches 4, 3 further second intermediate features have been obtained. For example, when i is 2, the 1st and 2nd hierarchical features are fused to obtain the layer-2 second intermediate feature; when i is 3, the 1st through 3rd hierarchical features are fused to obtain the layer-3 second intermediate feature; and when i is 4, the 1st through 4th hierarchical features are fused to obtain the layer-4 second intermediate feature. In this way, 4 layers of second intermediate features are obtained in total.
In the neural network, i is taken as 2 to M in sequence along the direction from shallow to deep network depth: the 1st hierarchical feature output by the 1st network layer is downsampled and fused with the i hierarchical features up to the i-th hierarchical feature output by the i-th network layer to obtain the layer-i second intermediate feature; when i reaches M, a total of M-1 second intermediate features are obtained.
For example, the 1st hierarchical feature is downsampled and fused with the 2nd hierarchical feature to obtain the layer-2 second intermediate feature; the layer-2 second intermediate feature can then be fused with the 3rd hierarchical feature output by the next, 3rd network layer to obtain the layer-3 second intermediate feature; and the layer-3 second intermediate feature can in turn be fused with the 4th hierarchical feature output by the next, 4th network layer to obtain the layer-4 second intermediate feature. In this way, the layer-4 second intermediate feature is obtained.
The 1st hierarchical feature here may be the feature output by the 1st network layer of the neural network, or a feature obtained by performing feature extraction at least once on the feature output by the 1st network layer. For example, among the features participating in the fusion, the lowest-level feature may be the lowest-level feature among the M different-level features, or a feature obtained by performing feature extraction one or more times on that lowest-level feature; and the M layers of second intermediate features may include this lowest-level feature and the fused feature obtained from each fusion.
FIG. 2 is a feature fusion diagram shown in the embodiment of the present application. FIG. 2 shows a lower-level feature N_i being downsampled and fused with the neighboring higher-level feature P_(i+1) to obtain the corresponding fused feature N_(i+1), where i is an integer greater than 0.
Based on the above embodiment, the high-level, low-resolution features are gradually fused with the low-level, high-resolution features in top-down order (that is, in order from deep to shallow network depth, or from high-level features to low-level features), yielding one batch of new features, namely the M layers of first intermediate features. Then, in bottom-up order (that is, from low-level features to high-level features), each low-level, high-resolution feature is downsampled and fused with the adjacent higher-level, lower-resolution feature in turn, so that the low-level, high-resolution features are gradually fused with the high-level, low-resolution features, yielding another batch of new features, namely the M layers of second intermediate features.
The level of the i-th hierarchical feature among the M different hierarchical features is lower than that of the (i+1)-th hierarchical feature, and processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features includes: determining the layer-1 first intermediate feature as the layer-1 image feature of the M layers of image features; and taking i from 2 to M in sequence, fusing the layer-i first intermediate feature with the layer-(i-1) second intermediate feature to obtain the layer-i image feature, so that when i reaches M, a further M-1 layers of image features are obtained.
For example, if M is 4, the layer-1 first intermediate feature is determined as the layer-1 image feature among the 4 layers of image features. When i is 2, the layer-2 first intermediate feature is fused with the layer-1 second intermediate feature to obtain the layer-2 image feature; when i is 3, the layer-3 first intermediate feature is fused with the layer-2 second intermediate feature to obtain the layer-3 image feature; and when i is 4, the layer-4 first intermediate feature is fused with the layer-3 second intermediate feature to obtain the layer-4 image feature. In this way, 4 layers of image features are obtained.
In the neural network, the M layers of first intermediate features and the M layers of second intermediate features are fused along the direction from shallow to deep network depth; for example, a low-level feature among the first intermediate features is fused with a low-level feature among the second intermediate features to obtain the first layer of image features. Each layer of image features includes the lowest-level first intermediate feature and the features produced each time a first intermediate feature and a second intermediate feature are fused; that is, each finally obtained layer of image features contains M different-level features, and the proportions of the different-level features are the same.
The three rounds of feature processing in this embodiment therefore help low-level information propagate more easily to the high-level network (that is, the network layers at deeper depths), so that the hierarchical features contained in the finally obtained image features carry the same weight; in other words, high-level features and low-level features are weighted equally. This reduces the loss during information transfer, lets information flow more smoothly through the neural network, and allows the high-level network to obtain low-level information more easily and more completely. Because low-level information is sensitive to certain details, it can provide information that is very beneficial for localization and segmentation, while high-level information is sensitive to the information of large objects. Since the low-level information and high-level information obtained in this scheme have the same proportion, the scheme achieves a balanced instance segmentation effect for large and small objects, and segments medium and large objects well.
To facilitate understanding of this embodiment, reference may be made to FIG. 3, which is a schematic diagram of an application of feature extraction in the embodiment of the present application. The embodiment of the application builds on the feature pyramid network (FPN) structure of Mask R-CNN: a pyramid structure that mirrors the FPN operation is added on the other side of the residual network (ResNet), and the resulting features of each layer (N1, N2, N3, N4) (that is, the M layers of second intermediate features) are fused with the corresponding-level FPN features (P1, P2, P3, P4) (that is, the M layers of first intermediate features) to finally obtain information-balanced features of each layer (O1, O2, O3, O4) (that is, the M layers of image features).
Specifically, (C1, C2, C3, C4) are the M different-level features obtained by feature extraction of the image to be processed through the neural network; in this embodiment, M is 4, that is, 4 different-level features are obtained. Processing (C1, C2, C3, C4) in the direction from high-level features to low-level features gives the 4 layers of first intermediate features, that is, the features (P1, P2, P3, P4); the fusion can be expressed simply as: P4 = C4, P3 = C3 + C4, P2 = C2 + C3 + C4, P1 = C1 + C2 + C3 + C4. Processing (C1, C2, C3, C4) in the direction from low-level features to high-level features gives the 4 layers of second intermediate features, that is, the features (N1, N2, N3, N4); the fusion can be expressed simply as: N1 = C1, N2 = N1 + C2 = C1 + C2, N3 = N2 + C3 = C1 + C2 + C3, N4 = N3 + C4 = C1 + C2 + C3 + C4. The features (P1, P2, P3, P4) and the features (N1, N2, N3, N4) are then processed in the direction from low-level features to high-level features to obtain the 4 layers of image features, that is, the features (O1, O2, O3, O4); the fusion can be expressed simply as: O1 = P1 = C1 + C2 + C3 + C4, O2 = N1 + P2 = C1 + C2 + C3 + C4, O3 = N2 + P3 = C1 + C2 + C3 + C4, O4 = N3 + P4 = C1 + C2 + C3 + C4. It can be seen that each layer of the obtained O features contains the C-level features in the same proportion; that is, the high-level feature information and the low-level feature information in each layer of image features carry the same weight. Because high-level feature information is sensitive to large objects and low-level feature information is sensitive to small objects, the obtained O features give a balanced instance segmentation effect for both large and small objects when objects are subsequently segmented.
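As a quick, purely symbolic check of the equal-proportion property stated above (fusion is treated as simple addition and each level feature as a symbol; this is an illustration, not part of the application), the following Python snippet counts how often each of C1-C4 contributes to each O layer:

```python
from collections import Counter

# Level features C1..C4 represented symbolically.
C = {i: Counter({f"C{i}": 1}) for i in range(1, 5)}

# Top-down pass: P4 = C4, Pi = Ci + P(i+1)
P = {4: C[4]}
for i in (3, 2, 1):
    P[i] = C[i] + P[i + 1]

# Bottom-up pass: N1 = C1, Ni = Ci + N(i-1)
N = {1: C[1]}
for i in (2, 3, 4):
    N[i] = C[i] + N[i - 1]

# Final fusion: O1 = P1, Oi = Pi + N(i-1)
O = {1: P[1]}
for i in (2, 3, 4):
    O[i] = P[i] + N[i - 1]

for i in sorted(O):
    print(f"O{i}:", dict(O[i]))  # every O layer contains C1..C4 exactly once
```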
After the M-layer image features are obtained, at least a partial region of the image to be processed may be segmented based on the M-layer image features to obtain a segmentation result. For example, in each embodiment of the present application, at least a partial region of an image to be processed may be a whole region or a local region (for example, a candidate region) of the image, that is, the whole image to be processed may be segmented to obtain a segmentation result of the image, or a local region (for example, a candidate region) of the image to be processed may be segmented to obtain a segmentation result of the local region.
In addition, segmenting the image to be processed may mean performing semantic segmentation or instance segmentation on it. For example, when the image to be processed is segmented, semantic segmentation may be performed on at least a partial region of the image to be processed based on the M layers of image features to obtain a semantic segmentation result, where the semantic segmentation result may include, for example: the category of each pixel in at least a partial region of the image to be processed.
For another example, instance segmentation may be performed on at least a partial region of the image to be processed based on the M layers of image features to obtain an instance segmentation result. The instance segmentation result may include: the pixels belonging to an instance in at least a partial region of the image to be processed and the category to which the instance belongs; for example, the pixels belonging to a boy in that region, with the category to which the boy belongs being person. Instance segmentation may employ the Mask R-CNN algorithm described above.
An instance may include, for example but without limitation, a particular object, such as a particular person or a particular thing. One or more instance candidate regions can be obtained by detecting the image to be processed through the neural network; an instance candidate region indicates a region of the image in which an instance is likely to appear.
In addition, in order to better perform instance segmentation on the image to be processed, image features of different levels among the M layers of image features may be fused at the pixel level to obtain a final fused feature, and at least a partial region of the image to be processed may then be segmented based on the final fused feature.
In one optional example, performing pixel-level fusion on the M layers of image features includes: taking the maximum of the M layers of image features at each pixel, that is, taking the maximum value of the features at each pixel position across the M layers of image features; or averaging the M layers of image features at each pixel, that is, averaging the features at each pixel position across the M layers of image features; or summing the M layers of image features at each pixel, that is, summing the features at each pixel position across the M layers of image features.
In the above embodiment, when the M layers of image features are fused by taking the per-pixel maximum, the obtained features are more distinctive than with the other modes, so the segmentation result is more accurate and the accuracy of the segmentation result is improved.
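A short sketch of these three per-pixel fusion options (assuming the M layers of image features have already been brought to a common shape, which the application does not detail here) could look like:

```python
import torch

def pixel_level_fusion(features, mode="max"):
    """features: list of M image feature maps with identical shapes.
    Fuses them per pixel by maximum, mean, or sum."""
    stacked = torch.stack(features, dim=0)  # shape: (M, B, C, H, W)
    if mode == "max":
        return stacked.max(dim=0).values  # per-pixel maximum across the M layers
    if mode == "mean":
        return stacked.mean(dim=0)  # per-pixel average across the M layers
    if mode == "sum":
        return stacked.sum(dim=0)  # per-pixel sum across the M layers
    raise ValueError(f"unknown fusion mode: {mode}")
```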
In the above embodiment, pixel-level instance class prediction may also be performed based on the fused features corresponding to at least a partial region of the image to be processed among the M layers of image features, so as to obtain an instance class prediction result for at least a partial region of the image to be processed; and pixel-level foreground and background prediction may be performed based on the fused features corresponding to at least a partial region of the image to be processed, so as to obtain a foreground and background prediction result for at least a partial region of the image to be processed.
An instance segmentation result of at least a partial region of the image to be processed is then obtained based on the instance class prediction result and the foreground and background prediction result, where the instance segmentation result includes: the pixels in the current instance candidate region that belong to an instance and the class information of the instance.
In this embodiment, pixel-level instance class prediction and foreground and background prediction are performed simultaneously based on the M layers of image features. The pixel-level instance class prediction performs a fine, multi-class classification of the M layers of image features, while the foreground and background prediction captures better global information and, because it does not need to attend to the details distinguishing the many instance classes, is faster. Obtaining the instance segmentation result of the instance candidate region from the instance class prediction result and the foreground and background prediction result therefore improves the instance segmentation result for the instance candidate region or the image to be processed.
In one optional example, performing pixel-level instance class prediction based on the fused features corresponding to at least a partial region of the image to be processed among the M layers of image features may include:
performing feature extraction on fusion features corresponding to at least partial region of the image through a first convolution network, wherein the first convolution network comprises at least one full convolution layer;
and performing object class prediction at a pixel level based on the characteristics output by the first convolution network through the first full convolution layer.
In one optional example, performing pixel-level foreground-background prediction based on fusion features corresponding to at least a partial region of an image to be processed includes:
and predicting pixels belonging to the foreground and/or pixels belonging to the background in at least partial areas of the image to be processed based on the corresponding fusion characteristics of the at least partial areas of the image to be processed.
The background and the foreground may be set as required. For example, the foreground may include the portions corresponding to all instance categories and the background may include the portions other than those, or, conversely, the background may include the portions corresponding to all instance categories and the foreground may include the remaining portions.
In another alternative example, performing pixel-level foreground-background prediction based on M-layer image features may include:
performing feature extraction, through a second convolution network, on the fused features corresponding to at least a partial region of the image to be processed, where the second convolution network includes at least one full convolutional layer;
and performing pixel-level foreground and background prediction based on the characteristics output by the second convolution network through a full connection layer.
In an implementation manner of each embodiment of the image feature extraction method, obtaining an example segmentation result of at least a partial region of an image to be processed based on an example type prediction result and a foreground and background prediction result may include:
and performing pixel-level addition of the instance class prediction result and the foreground and background prediction result of at least a partial region of the image to be processed to obtain the instance segmentation result of at least a partial region of the image to be processed.
In another embodiment, after the foreground and background prediction result of at least a partial region of the image is obtained, the method may further include: converting the foreground and background prediction result into a foreground and background prediction result whose dimensions are consistent with those of the instance class prediction result, for example converting it from a vector into a matrix consistent with the dimensions of the instance class prediction. Correspondingly, the pixel-level addition of the instance class prediction result and the foreground and background prediction result of at least a partial region of the image to be processed may include: performing pixel-level addition of the instance class prediction result of at least a partial region of the image to be processed and the converted foreground and background prediction result.
When the instance segmentation result is obtained based on the M layers of image features, pixel-level instance class prediction and foreground and background prediction are performed simultaneously, so this part of the scheme may be called dual-path mask prediction, as shown in FIG. 4, which is a schematic diagram of the network structure of the dual-path mask prediction in the embodiment of the present application.
In FIG. 4, the fused feature of the local region (ROI, region of interest) is passed through two branches for instance class prediction and foreground and background prediction, respectively. The first branch includes four full convolutional layers (conv1-conv4), that is, the first convolution network, and a deconvolution layer (deconv), that is, the first full convolution layer. The other branch includes the full convolutional layers (conv1-conv3) shared with the first branch, two further full convolutional layers (conv4-fc and conv5-fc), that is, the second convolution network, a fully connected layer (fc), and a conversion layer (reshape) used to convert the foreground and background prediction result into one whose dimensions match the instance class prediction result. The first branch performs pixel-level mask prediction for each potential instance class, while the fully connected layer performs mask prediction independent of instance class (that is, pixel-level foreground and background prediction); finally, the mask predictions of the two branches are added to obtain the final instance segmentation result.
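A rough, non-authoritative sketch of such a dual-path mask head follows; the layer widths, the 14x14 ROI size, and the number of classes are assumptions made for illustration, while the overall two-branch structure (shared convolutions, a per-class FCN branch, a class-agnostic fully connected branch with a reshape, and a final pixel-level addition) mirrors the description of FIG. 4:

```python
import torch
import torch.nn as nn

class DualPathMaskHead(nn.Module):
    """Hypothetical dual-path mask head: a per-class fully convolutional branch
    plus a class-agnostic fully connected branch, added at the pixel level."""
    def __init__(self, in_ch=256, num_classes=80, roi=14):
        super().__init__()
        def conv(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
        self.shared = nn.Sequential(conv(in_ch, 256), conv(256, 256), conv(256, 256))  # conv1-conv3
        # Branch 1: conv4 + deconv, one mask per instance class at doubled resolution.
        self.conv4 = conv(256, 256)
        self.deconv = nn.Sequential(nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU())
        self.mask_fcn = nn.Conv2d(256, num_classes, 1)
        # Branch 2: conv4-fc and conv5-fc, then a fully connected layer predicting a
        # class-agnostic foreground mask, reshaped to the same spatial size as branch 1.
        self.conv4_fc = conv(256, 256)
        self.conv5_fc = conv(256, 128)
        self.out_size = 2 * roi
        self.fc = nn.Linear(128 * roi * roi, self.out_size * self.out_size)

    def forward(self, roi_feat):  # roi_feat: (num_rois, in_ch, roi, roi)
        shared = self.shared(roi_feat)
        per_class = self.mask_fcn(self.deconv(self.conv4(shared)))  # (R, num_classes, 2*roi, 2*roi)
        fg = self.conv5_fc(self.conv4_fc(shared)).flatten(1)
        fg = self.fc(fg).view(-1, 1, self.out_size, self.out_size)  # the "reshape" step
        return per_class + fg  # pixel-level addition of the two mask predictions

# Usage: masks = DualPathMaskHead()(torch.randn(8, 256, 14, 14))
```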
Fig. 5 is a flowchart of an application embodiment of the image feature extraction method, which is explained by taking instance segmentation of a local region of the image to be processed as an example; when the whole image to be processed is subjected to instance segmentation or semantic segmentation, the corresponding instance segmentation or semantic segmentation is performed directly on the M layers of image features of the whole image to be processed. Fig. 6 is a process diagram of the application embodiment shown in fig. 5. Referring to fig. 5 and fig. 6, the image feature extraction method according to the embodiment of the application includes:
Step S210: performing feature extraction on the image to be processed through a neural network, and outputting M hierarchical features through M network layers at different network depths in the neural network.
Taking M = 4 as an example, 4 hierarchical features C1-C4 are output through network layers at 4 different network depths in the neural network.
Step S220: in order from the higher-level features to the lower-level features among the M hierarchical features, sequentially upsampling the higher-level feature and then processing it together with the lower-level feature, to obtain M layers of first intermediate features.
That is, in order from the high-level feature C4 to the low-level feature C1 among the M hierarchical features, the higher-level feature Ci is sequentially upsampled and processed with the lower-level feature Ci-1, obtaining 4 layers of first intermediate features P1-P4.
Here, i takes integer values from 4 to 1 in sequence. Among the features participating in the fusion, the first intermediate feature P4 of the highest level is the highest-level feature C4 among the four different-level features, or a feature obtained by performing feature extraction on C4 through a full convolution layer, i.e., P4 = C4; the first intermediate feature P3 is the sum of feature P4 and feature C3, or a feature obtained by convolving feature P4 and feature C3 through a convolution layer, i.e., P3 = P4 + C3 = C4 + C3; by analogy, P2 = P3 + C2 = C4 + C3 + C2, and P1 = P2 + C1 = C4 + C3 + C2 + C1.
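As a minimal sketch (not the embodiment's exact implementation), the top-down computation of the first intermediate features can be written as follows, assuming that all hierarchical features already share the same channel count and that nearest-neighbour interpolation is used for the upsampling; both are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def top_down_first_intermediate(C):
        """Step S220: from high level to low level, upsample the running feature and
        fuse (element-wise add) it with the next lower-level feature C_i."""
        M = len(C)
        P = [None] * M
        P[M - 1] = C[M - 1]                                   # P4 = C4
        for i in range(M - 2, -1, -1):                        # i = M-1 .. 1 (1-based)
            up = F.interpolate(P[i + 1], size=C[i].shape[-2:], mode="nearest")
            P[i] = up + C[i]                                  # P_i = upsample(P_{i+1}) + C_i
        return P

    # toy input: C1..C4 with 256 channels and halving spatial resolution
    C = [torch.randn(1, 256, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
    P = top_down_first_intermediate(C)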
Step S230: in order from the lower-level features to the higher-level features among the M hierarchical features, sequentially downsampling the lower-level feature and then processing it together with the higher-level feature, to obtain M layers of second intermediate features.
That is, among the four hierarchical features, in order from the low-level feature C1 to the high-level feature C4, the lower-level feature Ci is sequentially downsampled and processed with the higher-level feature Ci+1, obtaining 4 layers of second intermediate features N1-N4.
Here, i takes integer values from 1 to 4 in sequence. Among the features participating in the fusion, the second intermediate feature N1 of the lowest level is the lowest-level feature C1 among the four different-level features, or a feature obtained by performing feature extraction on C1 through a full convolution layer, i.e., N1 = C1; the second intermediate feature N2 is the sum of feature N1 and feature C2, or a feature obtained by convolving feature N1 and feature C2 through a convolution layer, i.e., N2 = N1 + C2 = C1 + C2; by analogy, N3 = N2 + C3 = C1 + C2 + C3, and N4 = N3 + C4 = C1 + C2 + C3 + C4.
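A corresponding sketch of the bottom-up computation of the second intermediate features, with adaptive max pooling standing in for the resolution reduction purely as an assumption:

    import torch
    import torch.nn.functional as F

    def bottom_up_second_intermediate(C):
        """Step S230: from low level to high level, reduce the running feature to the
        next level's resolution and fuse it with the higher-level feature C_i."""
        M = len(C)
        N = [None] * M
        N[0] = C[0]                                            # N1 = C1
        for i in range(1, M):
            down = F.adaptive_max_pool2d(N[i - 1], C[i].shape[-2:])
            N[i] = down + C[i]                                 # N_i = downsample(N_{i-1}) + C_i
        return N

    C = [torch.randn(1, 256, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
    N = bottom_up_second_intermediate(C)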
It should be noted that step S220 and step S230 are not restricted in execution order: they may be executed simultaneously or in any time sequence, and the order of the implementation steps listed in this embodiment does not limit the present application.
Step S240: processing the M layers of first intermediate features and the M layers of second intermediate features in order from the low-level features to the high-level features, to obtain M layers of image features.
That is, the 4 layers of first intermediate features P1-P4 and the 4 layers of second intermediate features N1-N4 are processed in order from the low-level features to the high-level features, to obtain 4 layers of image features O1-O4.
Among the features participating in the fusion, the lowest-level image feature O1 is the lowest-level first intermediate feature P1, or a feature obtained by performing feature extraction on P1 through a full convolution layer, i.e., O1 = P1 = C1 + C2 + C3 + C4; the image feature O2 is the sum of feature N1 and feature P2, or a feature obtained by convolving feature N1 and feature P2 through a convolution layer, i.e., O2 = N1 + P2 = C1 + C2 + C3 + C4; by analogy, O3 = N2 + P3 = C1 + C2 + C3 + C4, and O4 = N3 + P4 = C1 + C2 + C3 + C4.
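Finally, a sketch of how the two sets of intermediate features could be fused into the image features; the embodiment only states that the i-th layer first intermediate feature is fused with the (i-1)-th layer second intermediate feature, so the pooling used here to match spatial sizes is an assumption:

    import torch
    import torch.nn.functional as F

    def fuse_image_features(P, N):
        """Step S240: O1 = P1, and O_i = fuse(N_{i-1}, P_i) for i = 2..M."""
        O = [P[0]]
        for i in range(1, len(P)):
            n = F.adaptive_max_pool2d(N[i - 1], P[i].shape[-2:])  # match the spatial size of P_i
            O.append(n + P[i])
        return O

    # toy P and N at four levels (as produced by the two sketches above)
    P = [torch.randn(1, 256, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
    N = [torch.randn(1, 256, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
    O = fuse_image_features(P, N)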
In the embodiments of the present application, for example but not limited to, a region proposal network (RPN) may be used to generate regions of interest (ROIs) of the image, and ROI alignment (ROIAlign) may be used to extract, from the M layers of image features, the region features corresponding to the local region.
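As an illustration only, region features could be extracted from one of the image-feature layers with torchvision's roi_align operator; the box coordinates, output size and spatial scale below are made-up values, and the embodiment is not limited to this operator:

    import torch
    from torchvision.ops import roi_align

    O2 = torch.randn(1, 256, 32, 32)                  # one of the M layers of image features
    # one region of interest: (batch_index, x1, y1, x2, y2) in input-image coordinates
    boxes = torch.tensor([[0.0, 16.0, 16.0, 112.0, 112.0]])
    region_feat = roi_align(O2, boxes, output_size=(14, 14),
                            spatial_scale=32 / 256,   # feature-map size / assumed image size
                            sampling_ratio=2)         # -> (1, 256, 14, 14)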
Step S250: performing pixel-level fusion on the four region features corresponding to the local region that are extracted from the M layers of image features, to obtain a final fusion feature.
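A sketch of this pixel-level fusion; element-wise maximum is used only as one possible choice, since the embodiment does not fix the fusion operation:

    import torch

    def fuse_region_features(roi_feats):
        """Step S250: pixel-level fusion of the region features pooled from each layer."""
        fused = roi_feats[0]
        for f in roi_feats[1:]:
            fused = torch.maximum(fused, f)           # element-wise max as the fusion (assumption)
        return fused

    roi_feats = [torch.randn(1, 256, 14, 14) for _ in range(4)]  # one ROI pooled from 4 layers
    final_fused = fuse_region_features(roi_feats)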
Step S260: performing instance recognition based on the final fusion feature to obtain an instance recognition result.
The instance recognition result includes the object box (box) or position of each instance and the instance class (class) to which the instance belongs. After this step, the subsequent flow of this application embodiment need not be executed.
Step S270: performing pixel-level foreground-background prediction based on the final fusion feature to obtain a foreground-background prediction result.
Step S280: performing pixel-level addition on the instance class prediction result and the foreground-background prediction result to obtain an instance segmentation result of the local region.
The instance segmentation result includes: the pixels in the local region that belong to each instance, and the instance class to which each instance belongs, where the class may be the background or a specific instance class.
Step S260 and steps S270 to S280 are not restricted in execution order; they may be executed simultaneously or in any time sequence.
Referring to fig. 7, fig. 7 is a block diagram of an image feature extraction apparatus 200 according to an embodiment of the present application. The apparatus includes:
the image feature extraction module 210 is configured to obtain an image to be processed, perform feature extraction on the image to be processed, and obtain M different-level features, where M is an integer greater than or equal to 2;
a first feature processing module 220, configured to process the M different-level features according to a first level direction, so as to obtain M layers of first intermediate features; and
a second feature processing module 230, configured to process the M different level features according to a second level direction opposite to the first level direction, so as to obtain M layers of second intermediate features;
and the third feature processing module is used for processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
Optionally, the image feature extraction module 210 is configured to perform feature extraction on the image to be processed through a neural network, and output M different level features through M network layers with different network depths in the neural network.
Optionally, the first feature processing module 220 is configured to process the M different-level features in a direction from a higher-level feature to a lower-level feature, so as to obtain M layers of first intermediate features;
the second feature processing module 230 is configured to process the M different level features according to a direction from a lower level feature to a higher level feature, so as to obtain M layers of second intermediate features.
Optionally, a level of an ith hierarchical feature in the M different hierarchical features is higher than a level of an i-1 th hierarchical feature, i is an integer less than or equal to M and greater than or equal to 2, and the first feature processing module 220 is configured to determine the mth hierarchical feature as an mth layer first intermediate feature in the M layers of first intermediate features; and sequentially taking i as M-1 to 1, fusing M-i +1 hierarchical features between the M-th hierarchical feature and the i-th hierarchical feature to obtain the i-th layer first intermediate feature, and obtaining the M-1 layer first intermediate feature when i is 1.
Optionally, the first feature processing module 220 is further configured to take i as M-1 to 1 in sequence, and, along the direction from deep to shallow network depth in the neural network, sequentially upsample the M-th level feature output by the M-th network layer and fuse it with the M-i+1 hierarchical features down to the i-th level feature output by the i-th network layer, to obtain the i-th layer first intermediate feature; when i is 1, a total of M-1 layers of first intermediate features are obtained.
Optionally, a level of an ith hierarchical feature in the M different hierarchical features is less than a level of an i +1 th hierarchical feature, i is an integer less than or equal to M and greater than or equal to 1, and the second feature processing module 230 is configured to determine the 1 st hierarchical feature as a layer 1 second intermediate feature in the M layers of second intermediate features; and sequentially taking i as 2 to M, fusing the i hierarchical features between the 1 st hierarchical feature and the i hierarchical feature to obtain the second intermediate feature of the i layer, and obtaining the second intermediate feature of the M-1 layer when i is M.
Optionally, the second feature processing module 230 is further configured to take i as 2 to M in sequence, and, along the direction from shallow to deep network depth in the neural network, sequentially downsample the 1st-level feature output by the 1st network layer and fuse it with the i hierarchical features up to the i-th level feature output by the i-th network layer, to obtain the i-th layer second intermediate feature; when i is M, a total of M-1 layers of second intermediate features are obtained.
Optionally, a level of an ith hierarchical feature in the M different hierarchical features is less than a level of an i +1 th hierarchical feature, and the third feature processing module is configured to determine a 1 st-layer first intermediate feature as a 1 st-layer image feature in the M layers of image features; and sequentially taking i as 2 to M, fusing the first intermediate features of the ith layer with the second intermediate features of the (i-1) th layer to obtain image features of the ith layer, and obtaining image features of the (M-1) th layer when i is M.
Optionally, the apparatus further comprises:
and the image segmentation module is used for segmenting at least part of the region of the image to be processed based on the M layers of image characteristics to obtain a segmentation result.
Optionally, the image segmentation module is specifically configured to perform semantic segmentation on at least a partial region of the image to be processed based on the M layers of image features to obtain a semantic segmentation result.
Optionally, the image segmentation module is specifically configured to perform instance segmentation on at least a partial region of the image to be processed based on the M layers of image features, so as to obtain an instance segmentation result.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device may include: at least one processor 110, such as a CPU, at least one communication interface 120, at least one memory 130, and at least one communication bus 140. Wherein the communication bus 140 is used for realizing direct connection communication of these components. The communication interface 120 of the device in the embodiment of the present application is used for performing signaling or data communication with other node devices. The memory 130 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). Memory 130 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 130 stores computer readable instructions, which when executed by the processor 110, cause the electronic device to perform the method processes described above with reference to fig. 1.
Embodiments of the present application provide a readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the method processes performed by the electronic device in the method embodiment shown in fig. 1.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described again here.
To sum up, the embodiments of the present application provide an image feature extraction method and apparatus, an electronic device, and a storage medium. After the M different-level features extracted from the image to be processed are processed twice, M layers of first intermediate features and M layers of second intermediate features are obtained, and the M layers of first intermediate features and the M layers of second intermediate features are then fused to obtain M layers of image features. Each layer of image features obtained in this way may contain the M different-level features with balanced information, that is, each layer of image features includes high-level information and low-level information in a balanced proportion. Since the low-level information is sensitive to certain detailed information, it can provide information beneficial to positioning and segmentation; through the above multiple processing of the features, a high-level network can more easily and comprehensively obtain the low-level information, so that the high-level information and the low-level information are more balanced, achieving the effect of segmenting large and small objects in a balanced manner.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (24)

1. An image feature extraction method, characterized in that the method comprises:
acquiring an image to be processed, and performing feature extraction on the image to be processed to acquire M different-level features, wherein M is an integer greater than or equal to 2;
processing the M different level features according to a first level direction to obtain M layers of first intermediate features, and processing the M different level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features;
and processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
2. The method according to claim 1, wherein the level of the ith hierarchical feature in the M different hierarchical features is less than the level of the i +1 th hierarchical feature, and the M layers of first intermediate features and the M layers of second intermediate features are processed to obtain M layers of image features, including:
determining a layer 1 first intermediate feature as a layer 1 image feature of the M layers of image features;
and sequentially taking i as 2 to M, fusing the first intermediate features of the ith layer with the second intermediate features of the (i-1) th layer to obtain image features of the ith layer, and obtaining image features of the (M-1) th layer when i is M.
3. The method according to claim 1, wherein performing feature extraction on the image to be processed to obtain M different levels of features comprises:
and performing feature extraction on the image to be processed through a neural network, and outputting M different-level features through M network layers with different network depths in the neural network.
4. The method of claim 1, wherein processing the M different hierarchical features in a first hierarchical direction to obtain M layers of first intermediate features, and processing the M different hierarchical features in a second hierarchical direction opposite to the first hierarchical direction to obtain M layers of second intermediate features comprises:
processing the M different-level features according to the direction from the high-level features to the low-level features to obtain M layers of first intermediate features; and
and processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of second intermediate features.
5. The method according to claim 4, wherein the level of the ith hierarchical feature in the M different hierarchical features is higher than the level of the ith-1 hierarchical feature, i is an integer less than or equal to M and greater than or equal to 2, and the processing the M different hierarchical features in the direction from the higher hierarchical feature to the lower hierarchical feature to obtain the M layers of first intermediate features comprises:
determining an M-th level feature as an M-th level first intermediate feature of the M-level first intermediate features;
and sequentially taking i as M-1 to 1, fusing M-i +1 hierarchical features between the M-th hierarchical feature and the i-th hierarchical feature to obtain the i-th layer first intermediate feature, and obtaining the M-1 layer first intermediate feature when i is 1.
6. The method according to claim 5, wherein taking i as M-1 to 1 in sequence, fusing the M-th level features to the i-th level features to obtain the i-th level first intermediate features, and obtaining M-1 first intermediate features in total when i is 1, comprises:
and sequentially taking i as M-1 to 1, sequentially upsampling the M-th level feature output by the M-th network layer in the neural network along the direction from deep to shallow network depth in the neural network, and fusing it with the M-i +1 hierarchical features between it and the i-th level feature output by the i-th network layer to obtain the first intermediate feature of the i-th layer, and when i is 1, obtaining M-1 layers of first intermediate features in total.
7. The method according to claim 4, wherein the level of the ith hierarchical feature in the M different hierarchical features is less than the level of the i +1 th hierarchical feature, i is an integer less than or equal to M and greater than or equal to 1, and the M different hierarchical features are processed in a direction from the lower hierarchical feature to the higher hierarchical feature to obtain M layers of second intermediate features, and the method comprises the following steps:
determining a level 1 feature as a level 1 second intermediate feature of the M levels of second intermediate features;
and sequentially taking i as 2 to M, fusing the i hierarchical features between the 1 st hierarchical feature and the i hierarchical feature to obtain the second intermediate feature of the i layer, and obtaining the second intermediate feature of the M-1 layer when i is M.
8. The method according to claim 7, wherein taking i as 2 to M in sequence, fusing the level-1 features to the level-i features to obtain the level-i second intermediate features, and obtaining M-1 second intermediate features in total when i is M, comprises:
and sequentially taking i as 2 to M, sequentially downsampling the 1st-level feature output by the 1st network layer in the neural network along the direction from shallow to deep network depth in the neural network, and fusing it with the i hierarchical features up to the i-th level feature output by the i-th network layer to obtain the second intermediate feature of the i-th layer, and when i is M, obtaining M-1 layers of second intermediate features in total.
9. The method according to any one of claims 1 to 8, wherein after processing the M-layer first intermediate features and the M-layer second intermediate features to obtain M-layer image features, the method further comprises:
and segmenting at least partial region of the image to be processed based on the M layers of image features to obtain a segmentation result.
10. The method according to claim 9, wherein segmenting at least a partial region of the image to be processed based on the M-layer image features to obtain a segmentation result comprises:
and performing semantic segmentation on at least part of the region of the image to be processed based on the M layers of image features to obtain a semantic segmentation result.
11. The method according to claim 9, wherein segmenting at least a partial region of the image to be processed based on the M-layer image features to obtain a segmentation result comprises:
and performing example segmentation on at least partial region of the image to be processed based on the M layers of image features to obtain an example segmentation result.
12. An image feature extraction device characterized by comprising:
the image feature extraction module is used for acquiring an image to be processed, extracting features of the image to be processed and acquiring M different-level features, wherein M is an integer greater than or equal to 2;
the first feature processing module is used for processing the M different-level features according to a first level direction to obtain M layers of first intermediate features; and
the second feature processing module is used for processing the M different-level features according to a second level direction opposite to the first level direction to obtain M layers of second intermediate features;
and the third feature processing module is used for processing the M layers of first intermediate features and the M layers of second intermediate features to obtain M layers of image features.
13. The apparatus according to claim 12, wherein the level of the ith hierarchical feature of the M different hierarchical features is less than the level of the i +1 th hierarchical feature, the third feature processing module is configured to determine the 1 st layer first intermediate feature as the 1 st layer image feature of the M layers of image features; and sequentially taking i as 2 to M, fusing the first intermediate features of the ith layer with the second intermediate features of the (i-1) th layer to obtain image features of the ith layer, and obtaining image features of the (M-1) th layer when i is M.
14. The apparatus of claim 12, wherein the image feature extraction module is configured to perform feature extraction on the image to be processed through a neural network, and output M different hierarchical features through M network layers with different network depths in the neural network.
15. The apparatus according to claim 12, wherein the first feature processing module is configured to process the M different-level features in a direction from a higher-level feature to a lower-level feature to obtain M layers of first intermediate features;
and the second feature processing module is used for processing the M different-level features according to the direction from the low-level features to the high-level features to obtain M layers of second intermediate features.
16. The apparatus according to claim 15, wherein the level of an ith hierarchical feature of the M different hierarchical features is higher than the level of an i-1 th hierarchical feature, i being an integer less than or equal to M and greater than or equal to 2, the first feature processing module being configured to determine an mth hierarchical feature as an mth layer first intermediate feature of the M layer first intermediate features; and sequentially taking i as M-1 to 1, fusing M-i +1 hierarchical features between the M-th hierarchical feature and the i-th hierarchical feature to obtain the i-th layer first intermediate feature, and obtaining the M-1 layer first intermediate feature when i is 1.
17. The apparatus according to claim 16, wherein the first feature processing module is further configured to take i as M-1 to 1 in sequence, and sequentially up-sample an M-th hierarchical feature output by an M-th network layer in the neural network along a direction from a deep network depth to a shallow network depth in the neural network, and fuse the up-sampled M-th hierarchical feature with M-i +1 hierarchical features between the up-th hierarchical feature and an i-th hierarchical feature output by the i-th network layer in the neural network to obtain an i-th layer first intermediate feature, and when i is 1, obtain an M-1 layer first intermediate feature altogether.
18. The apparatus according to claim 15, wherein the level of the ith hierarchical feature of the M different hierarchical features is less than the level of the i +1 th hierarchical feature, i being an integer less than or equal to M and greater than or equal to 1, the second feature processing module being configured to determine the 1 st hierarchical feature as the 1 st layer second intermediate feature of the M layer second intermediate features; and sequentially taking i as 2 to M, fusing the i hierarchical features between the 1 st hierarchical feature and the i hierarchical feature to obtain the second intermediate feature of the i layer, and obtaining the second intermediate feature of the M-1 layer when i is M.
19. The apparatus according to claim 18, wherein the second feature processing module is further configured to take i as 2 to M in sequence, and sequentially down-sample the level 1 feature output by the layer 1 network layer in the neural network along the direction from shallow to deep of the depth of the network in the neural network, and fuse the down-sampled level 1 feature with i level features between the level i feature output by the layer i network layer to obtain a layer i second intermediate feature, and when i is M, obtain M-1 second intermediate features in total.
20. The apparatus of any of claims 12-19, further comprising:
and the image segmentation module is used for segmenting at least part of the region of the image to be processed based on the M layers of image characteristics to obtain a segmentation result.
21. The apparatus according to claim 20, wherein the image segmentation module is specifically configured to perform semantic segmentation on at least a partial region of the image to be processed based on the M-layer image features to obtain a semantic segmentation result.
22. The apparatus according to claim 20, wherein the image segmentation module is specifically configured to perform instance segmentation on at least a partial region of the image to be processed based on the M-layer image features to obtain an instance segmentation result.
23. An electronic device comprising a processor and a memory, said memory storing computer readable instructions which, when executed by said processor, perform the steps of the method of any of claims 1-11.
24. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN201811561327.0A 2018-12-19 2018-12-19 Image feature extraction method and device, electronic equipment and storage medium Pending CN111340059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811561327.0A CN111340059A (en) 2018-12-19 2018-12-19 Image feature extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811561327.0A CN111340059A (en) 2018-12-19 2018-12-19 Image feature extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111340059A true CN111340059A (en) 2020-06-26

Family

ID=71186881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811561327.0A Pending CN111340059A (en) 2018-12-19 2018-12-19 Image feature extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111340059A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231748A (en) * 2007-12-18 2008-07-30 西安电子科技大学 Image anastomosing method based on singular value decomposition
KR20160072768A (en) * 2014-12-15 2016-06-23 삼성전자주식회사 Method and apparatus for recognizing and verifying image, and method and apparatus for learning image recognizing and verifying
CN105740351A (en) * 2016-01-26 2016-07-06 云南电网有限责任公司电力科学研究院 Data fusion method and system of power transmission operation and maintenance equipment
CN107563412A (en) * 2017-08-09 2018-01-09 浙江大学 A kind of infrared image power equipment real-time detection method based on deep learning
CN108335305A (en) * 2018-02-09 2018-07-27 北京市商汤科技开发有限公司 Image partition method and device, electronic equipment, program and medium
CN108460411A (en) * 2018-02-09 2018-08-28 北京市商汤科技开发有限公司 Example dividing method and device, electronic equipment, program and medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139447A1 (en) * 2020-09-30 2021-07-15 平安科技(深圳)有限公司 Abnormal cervical cell detection apparatus and method
CN112580721A (en) * 2020-12-19 2021-03-30 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112580721B (en) * 2020-12-19 2023-10-24 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN115018059A (en) * 2022-08-09 2022-09-06 北京灵汐科技有限公司 Data processing method and device, neural network model, device and medium
CN115018059B (en) * 2022-08-09 2022-11-18 北京灵汐科技有限公司 Data processing method and device, neural network model, device and medium

Similar Documents

Publication Publication Date Title
CN108460411B (en) Instance division method and apparatus, electronic device, program, and medium
CN108335305B (en) Image segmentation method and apparatus, electronic device, program, and medium
KR102438095B1 (en) Instance partitioning method and apparatus, electronic device, program and medium
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
US10339421B2 (en) RGB-D scene labeling with multimodal recurrent neural networks
CN108805889B (en) Edge-guided segmentation method, system and equipment for refined salient objects
US20180114071A1 (en) Method for analysing media content
CN110909642A (en) Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111428726A (en) Panorama segmentation method, system, equipment and storage medium based on graph neural network
CN113936256A (en) Image target detection method, device, equipment and storage medium
CN111340059A (en) Image feature extraction method and device, electronic equipment and storage medium
CN113096140B (en) Instance partitioning method and device, electronic device and storage medium
CN109948533B (en) Text detection method, device and equipment and readable storage medium
JP2016062588A (en) Methods and systems for image matting and foreground estimation based on hierarchical graphs
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN113313810A (en) 6D attitude parameter calculation method for transparent object
CN112926667B (en) Method and device for detecting saliency target of depth fusion edge and high-level feature
Wang et al. Object counting in video surveillance using multi-scale density map regression
CN111340044A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111339808A (en) Vehicle collision probability prediction method and device, electronic equipment and storage medium
Chen et al. A video processing algorithm using temporal intuitionistic fuzzy sets
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN116721322A (en) Multi-mode-based character interaction relation detection method and detection system thereof
Saida et al. CNN-based segmentation frameworks for structural component and earthquake damage determinations using UAV images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination