CN117152575A - Image processing apparatus, electronic device, and computer-readable storage medium

Info

Publication number
CN117152575A
Authority
CN
China
Prior art keywords
foot
feature
local
image
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311395932.6A
Other languages
Chinese (zh)
Other versions
CN117152575B (en)
Inventor
刘军
刘倩倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202311395932.6A
Publication of CN117152575A
Application granted
Publication of CN117152575B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing device, an electronic device, and a computer-readable storage medium, and relates to the technical field of image processing. The device comprises: the image acquisition module is used for acquiring a foot complete image and a plurality of foot partial image blocks; the first foot complete feature acquisition module is used for carrying out global feature extraction on the foot complete image through the first global feature extraction network to acquire first foot complete features; the first foot local feature acquisition module is used for respectively carrying out local feature extraction on the plurality of foot partial image blocks through a first local feature extraction network to obtain a plurality of first foot local features; the cross fusion module is used for carrying out cross fusion processing on the first foot complete features and the plurality of first foot local features to obtain first foot fusion features; and the prediction module is used for performing prediction processing on the first foot fusion features through a prediction network. The device can accurately evaluate the degree of foot ulcer in a plantar image.

Description

Image processing apparatus, electronic device, and computer-readable storage medium
Technical Field
The present application relates to the field of image processing technology, and in particular, to an image processing apparatus, an electronic device, and a computer readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the application that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the continued development of Computer Vision (CV) technology, intelligent telemedicine systems are often considered one of the most economical solutions for the remote detection and prevention of diabetic foot ulcers (Diabetic Foot Ulcer, DFU). Combining telemedicine systems with current medical services can provide more cost-effective, efficient and superior care for the prevention and treatment of diabetic foot ulcers.
Therefore, the technical problem to be solved by the application is how to remotely determine, based on a user's foot images, whether the user has a foot ulcer and the grade of the foot ulcer, so that the user can be further examined and treated.
Disclosure of Invention
An object of the present application is to provide an image processing apparatus, an electronic device, and a computer-readable storage medium that can accurately determine whether a foot in a foot image has a foot ulcer and a level of the foot ulcer based on the foot image.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
The embodiment of the application provides an image processing method, which comprises the following steps: acquiring a foot complete image and a plurality of foot partial image blocks, wherein the foot complete image is segmented to obtain the plurality of foot partial image blocks; performing global feature extraction on the foot complete image through a first global feature extraction network to obtain first foot complete features; respectively extracting local features of the plurality of foot local image blocks through a first local feature extraction network to obtain a plurality of first foot local features; performing cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature; and carrying out prediction processing on the first foot fusion characteristic through a prediction network so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
In some embodiments, the first global feature extraction network comprises a channel attention sub-network comprising a first pooling layer and a second pooling layer; the global feature extraction is performed on the foot complete image through a first global feature extraction network to obtain a first foot complete feature, and the method comprises the following steps: performing space compression processing on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain first space context description features; performing spatial compression processing on the foot complete image features corresponding to the foot complete image through the second pooling layer to obtain second spatial context description features; summing the first spatial context description feature and the second spatial context description feature to obtain a target spatial context description feature; activating the context description feature of the target space to obtain a channel attention weight coefficient; and weighting the foot complete image features corresponding to the foot complete image through the channel attention weight coefficient to obtain the first foot complete features.
In some embodiments, the first global feature extraction network further comprises a multi-scale feature extraction network comprising a plurality of convolution sub-networks, wherein the receptive fields of the respective convolution sub-networks are different; the method further comprises the steps of before the first pooling layer performs spatial compression processing on the foot complete image features corresponding to the foot complete image to obtain a first spatial context description feature or before the second pooling layer performs spatial compression processing on the foot complete image features corresponding to the foot complete image to obtain a second spatial context description feature: performing feature extraction processing on the foot complete image to obtain a foot complete image feature vector; dividing the foot complete image feature vector according to channels to obtain a plurality of channel divided features, wherein channel information corresponding to each channel divided feature is different; the features after the channels are divided are respectively subjected to convolution operation through the convolution sub-networks with different receptive fields, so that a plurality of scale features with different receptive fields are obtained; and fusing the plurality of scale features to obtain foot complete image features corresponding to the foot complete image.
In some embodiments, the plurality of channel-segmented features includes a first channel-segmented feature, a second channel-segmented feature, a third channel-segmented feature, the different receptive field convolution sub-networks include a first convolution sub-network, a second convolution sub-network, wherein the convolution kernels of the first convolution sub-network and the second convolution sub-network are different, the different receptive field plurality of scale features includes a first receptive field feature, a second receptive field feature, and a third receptive field feature; the method comprises the steps of respectively carrying out convolution operation on the characteristics after the channel segmentation through the convolution sub-networks with different receptive fields, and obtaining a plurality of scale characteristics with different receptive fields, wherein the steps comprise: taking the first channel segmented feature as the first receptive field feature; carrying out convolution processing on the features after the second channel segmentation through the first convolution sub-network to obtain the second receptive field features; fusing the second receptive field features and the third channel segmented features to obtain receptive field fusion features; and carrying out convolution processing on the receptive field fusion features through the second convolution sub-network to obtain the third receptive field features.
In some embodiments, fusing the plurality of scale features to obtain a foot complete image feature corresponding to the foot complete image includes: splicing the first receptive field feature, the second receptive field feature and the third receptive field feature to obtain spliced features; carrying out pooling treatment on the spliced features in the space dimension through a global average pooling layer to obtain channel features; processing the channel characteristics through a full connection layer to fit correlation among channels of the channel characteristics; and determining the complete image characteristics of the foot according to the channel characteristics after the full connecting layer processing and the spliced characteristics.
In some embodiments, the first local feature extraction network comprises a spatial attention sub-network comprising a third pooling layer and a fourth pooling layer, and the plurality of foot local image blocks comprises a first foot local image block; the local feature extraction is performed on the plurality of foot local image blocks through a first local feature extraction network, so as to obtain a plurality of first foot local features, including: performing channel compression processing on the foot partial image features corresponding to the first foot partial image block through the third pooling layer to obtain a first channel feature map, wherein the foot partial image features corresponding to the first foot partial image block are obtained after feature extraction is performed on the first foot partial image block; performing channel compression processing on the foot partial image features corresponding to the first foot partial image block through the fourth pooling layer to obtain a second channel feature map; cascading the first channel feature map and the second channel feature map to obtain a target channel feature map; activating the target channel feature map to obtain a spatial attention weight coefficient; and weighting the foot partial image features corresponding to the first foot partial image block through the spatial attention weight coefficient to obtain the first foot partial features corresponding to the first foot partial image block.
In some embodiments, the first cross-fusion network includes a first parameter matrix, a second parameter matrix, and a third parameter matrix; the method comprises the steps of performing cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature, wherein the method comprises the following steps: performing projection processing on the first foot complete characteristics through the first parameter matrix to obtain first query characteristics; performing fusion processing on the plurality of first foot local features to obtain foot local fusion features; performing projection processing on the local fusion features of the foot through the second parameter matrix to obtain first key features; performing projection processing on the local fusion feature of the foot through the third parameter matrix to obtain a first value feature; and activating the first query feature, the first key feature and the first value feature to obtain the first foot fusion feature.
In some embodiments, the method further comprises: performing global feature extraction on the first foot complete features through a second global feature extraction network to obtain second foot complete features; performing local feature extraction processing on the first foot local feature through a second local feature extraction network to obtain a second foot local feature; performing cross fusion processing on the second foot complete characteristics and the second foot local characteristics through a second cross fusion network to obtain second foot fusion characteristics; the predicting network is used for predicting the first foot fusion characteristic to determine a foot ulcer evaluation result corresponding to the foot complete image, and the method comprises the following steps: and carrying out prediction processing on the first foot fusion characteristic and the second foot fusion characteristic through a prediction network so as to determine the foot ulcer evaluation result corresponding to the foot complete image.
In some embodiments, performing a prediction process on the first foot fusion feature and the second foot fusion feature through a prediction network to determine the foot ulcer assessment corresponding to the complete image of the foot comprises: performing downsampling processing on the first foot fusion feature to obtain a first downsampled fusion feature; performing downsampling processing on the second foot fusion feature to obtain a second downsampled fusion feature, wherein the feature dimension of the first downsampled fusion feature is the same as the feature dimension of the second downsampled fusion feature; splicing the first downsampling fusion feature and the second downsampling fusion feature to obtain a multi-level foot fusion feature; and determining the foot ulcer evaluation result corresponding to the foot complete image according to the multi-level foot fusion characteristics.
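As a rough illustration of this prediction step, the following PyTorch-style sketch down-samples the two fusion features to a common dimension, splices them, and maps the result to six ulcer grades. The linear down-sampling layers, the dimensions, and the class/variable names are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of the multi-level prediction over two foot fusion features."""
    def __init__(self, dim1: int, dim2: int, common_dim: int = 128, num_grades: int = 6):
        super().__init__()
        self.down1 = nn.Linear(dim1, common_dim)   # down-sample the first foot fusion feature
        self.down2 = nn.Linear(dim2, common_dim)   # down-sample the second foot fusion feature
        self.classifier = nn.Linear(2 * common_dim, num_grades)

    def forward(self, fusion1: torch.Tensor, fusion2: torch.Tensor) -> torch.Tensor:
        f1 = self.down1(fusion1)                   # first down-sampled fusion feature
        f2 = self.down2(fusion2)                   # second down-sampled fusion feature, same dimension as f1
        multi_level = torch.cat([f1, f2], dim=-1)  # spliced multi-level foot fusion feature
        return self.classifier(multi_level)        # foot ulcer evaluation logits
```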
An embodiment of the present application provides an image processing apparatus including: the device comprises an image acquisition module, a first foot complete characteristic acquisition module, a first foot local characteristic acquisition module, a cross fusion module and a prediction module.
The image acquisition module is used for acquiring a foot complete image and a plurality of foot partial image blocks, wherein the foot complete image is segmented to obtain the plurality of foot partial image blocks; the first foot complete feature acquisition module may be configured to perform global feature extraction on the foot complete image through a first global feature extraction network to obtain a first foot complete feature; the first foot local feature obtaining module may be configured to obtain a plurality of first foot local features by respectively performing local feature extraction on the plurality of foot local image blocks through a first local feature extraction network; the cross fusion module can be used for carrying out cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature; the prediction module may be configured to perform prediction processing on the first foot fusion feature through a prediction network, so as to determine a foot ulcer evaluation result corresponding to the complete foot image.
An embodiment of the present application provides an electronic device, including: a memory and a processor; the memory is used for storing computer program instructions; the processor invokes the computer program instructions stored by the memory for implementing the image processing method of any one of the above.
An embodiment of the present application proposes a computer-readable storage medium having stored thereon computer program instructions which, when executed, implement the image processing method of any one of the above.
Embodiments of the present application provide a computer program product or computer program comprising computer program instructions stored in a computer readable storage medium. The computer program instructions are read from a computer readable storage medium and executed by a processor to implement the image processing method described above.
The image processing device, the electronic device and the computer-readable storage medium provided by the embodiments of the application can, on the one hand, perform feature extraction on the foot complete image through the first global feature extraction network to obtain global features of the foot complete image (namely, first foot complete features); on the other hand, local feature extraction can be performed, through the first local feature extraction network, on a plurality of foot partial image blocks obtained by segmenting the foot complete image, so as to obtain local features corresponding to the foot complete image; then, the global features of the foot complete image and the corresponding local features are cross-fused; and finally, foot ulcer prediction is performed using the image features obtained after the cross fusion of the global and local features. The method combines the global features and the local features of the foot image, thereby improving the accuracy of foot ulcer evaluation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 shows a schematic view of a scene of an image processing method or an image processing apparatus which can be applied to an embodiment of the present application.
Fig. 2 is a schematic diagram of an image processing network architecture, according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating the structure of a first global feature extraction network according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating the structure of a channel attention sub-network according to an exemplary embodiment.
FIG. 7 is a schematic diagram of a multi-scale feature extraction network, according to an example embodiment.
Fig. 8 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 9 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 10 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 11 is a schematic diagram illustrating the structure of a first local feature extraction network according to an exemplary embodiment.
Fig. 12 is a schematic diagram illustrating the structure of a spatial attention sub-network according to an exemplary embodiment.
Fig. 13 is a flow chart illustrating a cross-fusion processing method according to an exemplary embodiment.
Fig. 14 is a schematic diagram of an image processing network architecture, according to an exemplary embodiment.
Fig. 15 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 16 is a block diagram of an image processing apparatus according to an exemplary embodiment.
Fig. 17 shows a schematic diagram of an electronic device suitable for implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
One skilled in the art will appreciate that embodiments of the present application may be a system, apparatus, device, method, or computer program product. Thus, the application may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
The described features, structures, or characteristics of the application may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The drawings are merely schematic illustrations of the present application, in which like reference numerals denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and not necessarily all of the elements or steps are included or performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the description of the present application, "/" means "or" unless otherwise indicated, for example, A/B may mean A or B. "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. Furthermore, "at least one" means one or more, and "a plurality" means two or more. The terms "first," "second," and the like do not limit the amount and order of execution, and the terms "first," "second," and the like do not necessarily differ; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc., in addition to the listed elements/components/etc.
In order that the above-recited objects, features and advantages of the present application can be more clearly understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings, it being understood that embodiments of the application and features of the embodiments may be combined with each other without departing from the scope of the appended claims.
In the technical scheme of the application, the aspects of the related personal information of the user, such as acquisition, collection, updating, analysis, processing, use, transmission, storage and the like, all conform to the rules of related laws and regulations, are used for legal purposes, and do not violate the popular public order. Necessary measures are taken for the personal information of the user, illegal access to the personal information data of the user is prevented, and the personal information security, network security and national security of the user are maintained.
Diabetic foot ulcers are one of the most serious chronic complications of diabetes; lesions may appear anywhere on the foot, and their size, color and contrast vary from lesion to lesion. More than 10 million diabetic patients undergo amputation or die every year because DFU is not correctly recognized and effectively treated. At present, DFU patients are evaluated mainly by visual inspection by podiatrists, who use manual measurement tools to determine the severity of DFU. This approach has great limitations, especially in less developed countries where podiatrists are scarce, so most patients cannot receive timely early warning and effective treatment, which undoubtedly greatly increases the risk of amputation and even death.
For lesion severity, the widely accepted grading method consists of six grades: the higher the grade, the more severe the diabetic foot ulcer. According to clinical manifestations, the grades are: grade 0 (intact skin), grade 1 (superficial ulcers), grade 2 (deep ulcers reaching bone, tendon, deep fascia or joint capsule), grade 3 (deep ulcers with abscess or osteomyelitis), grade 4 (forefoot gangrene), grade 5 (whole-foot gangrene).
The application provides an image processing apparatus, an electronic device and a computer-readable storage medium for analyzing abnormal foot regions of diabetic foot patients. By algorithmically mining and analyzing a patient's foot lesion region, fusion features that carry both fine-grained local representations and long-range global context are extracted and used for a preliminary assessment of the severity of the lesion region, providing technical support for remote monitoring of the condition of diabetic foot patients and helping clinicians improve their working efficiency.
The following describes example embodiments of the application in detail with reference to the accompanying drawings.
Fig. 1 shows a schematic view of a scene of an image processing method or an image processing apparatus which can be applied to an embodiment of the present application.
Referring to FIG. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, wearable devices, virtual reality devices, smart homes, etc.
The server 105 may be a server providing various services, such as a background management server providing support for the devices operated by the users of the terminal devices 101, 102, 103. The background management server can analyze and process received data such as requests and feed the processing results back to the terminal devices.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms; the application is not limited in this regard.
Server 105 may, for example, obtain a complete image of the foot and a plurality of partial image blocks of the foot, wherein the plurality of partial image blocks of the foot are obtained after segmentation of the complete image of the foot; server 105 may perform global feature extraction on the foot-complete image, for example, through a first global feature extraction network, obtaining first foot-complete features; the server 105 may perform local feature extraction on the plurality of foot local image blocks through a first local feature extraction network, for example, to obtain a plurality of first foot local features; server 105 may perform a cross-fusion process on the first foot complete feature and the plurality of first foot partial features, for example, over a first cross-fusion network, to obtain a first foot fusion feature; server 105 may perform a predictive process on the first foot fusion feature, for example, via a predictive network, to determine a foot ulcer assessment corresponding to the complete image of the foot.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative, and that the server 105 may be a server of one entity, or may be composed of a plurality of servers, and may have any number of terminal devices, networks and servers according to actual needs.
Fig. 2 is a diagram illustrating an image processing network architecture according to an exemplary embodiment.
Referring to fig. 2, the above image processing network structure may include: a first global feature extraction network 202, a first local feature extraction network 204, a first cross-fusion network 205, and a prediction network 206.
Referring to the image processing network structure shown in fig. 2, the present application proposes the image processing method shown in fig. 3.
Fig. 3 is a flowchart illustrating an image processing method according to an exemplary embodiment. The method provided in the embodiment of the present application may be performed by any electronic device having a computing processing capability, for example, the method may be performed by a server or a terminal device in the embodiment of fig. 1, or may be performed by both the server and the terminal device, and in the following embodiment, the server is taken as an execution body for illustration, but the present application is not limited thereto.
Referring to fig. 3, the image processing method provided by the embodiment of the present application may include the following steps.
In step S302, a foot complete image and a plurality of foot partial image blocks are acquired, wherein the foot complete image is segmented to obtain the plurality of foot partial image blocks.
In some embodiments, the complete image of the foot may refer to a complete image that includes the entire foot (e.g., 201 in FIG. 2). A foot partial image block may refer to one of a plurality of image blocks (e.g., 2, 3, 4 or more blocks) obtained by segmenting the complete image of the foot (e.g., 203 in FIG. 2).
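For illustration only, a complete foot image could be cut into local image blocks with a regular grid, as in the following PyTorch sketch. The patent does not prescribe how the segmentation is performed; the grid split, the tensor layout and the function name are assumptions.

```python
import torch

def split_into_blocks(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Split a (C, H, W) foot image into grid*grid local image blocks.

    Assumes H and W are divisible by `grid`; returns (grid*grid, C, H//grid, W//grid).
    """
    c, h, w = image.shape
    # Unfold height then width into non-overlapping tiles.
    blocks = image.unfold(1, h // grid, h // grid).unfold(2, w // grid, w // grid)
    # blocks: (C, grid, grid, H//grid, W//grid) -> (grid*grid, C, H//grid, W//grid)
    return blocks.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, h // grid, w // grid)
```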
Step S304, global feature extraction is carried out on the complete foot image through a first global feature extraction network, and the complete first foot feature is obtained.
In some embodiments, the first foot-complete feature may be obtained by global feature extraction of the foot-complete image (e.g., 201 in fig. 2) through a first global feature extraction network (e.g., 202 in fig. 2). Wherein the first foot integral feature is a global feature extracted from the foot integral image that can describe the integral foot.
Step S306, local feature extraction is performed on the plurality of foot local image blocks through the first local feature extraction network, so as to obtain a plurality of first foot local features.
In some embodiments, the plurality of first foot partial features may be obtained by performing partial feature extraction on a plurality of foot partial image blocks (e.g., 203 in fig. 2) through a first partial feature extraction network (e.g., 204 in fig. 2).
In some embodiments, since a foot partial image block is only part of the complete image of the foot, the features extracted from a foot partial image block describe only the corresponding local region of the foot.
Step S308, performing cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature.
As shown in fig. 2, the first foot fusion feature may be obtained by performing a cross-fusion process on the first foot complete feature and the plurality of first foot partial features through a first cross-fusion network (e.g., 205 in fig. 2).
And step S310, performing prediction processing on the first foot fusion characteristic through a prediction network to determine a foot ulcer evaluation result corresponding to the foot complete image.
As shown in fig. 2, the first foot fusion feature may be predicted by a prediction network (e.g., 206 in fig. 2) to determine a foot ulcer assessment corresponding to the complete image of the foot.
The prediction network (e.g., 206 in fig. 2) may include a splicing sub-network 2061, a pooling sub-network 2062 and a fully connected sub-network 2063.
In the above embodiment, on the one hand, feature extraction may be performed on the foot complete image through the first global feature extraction network to obtain global features of the foot complete image (i.e., the first foot complete features); on the other hand, local feature extraction can be carried out on a plurality of foot local image blocks obtained through foot complete image segmentation through a first local feature extraction module, so as to obtain local features corresponding to foot complete images; then, the global features of the foot complete image and the local features corresponding to the foot complete image are crossed and fused; and finally, predicting the foot ulcer by using the image features after the global features and the local features are crossed and fused. The method combines the global features of the foot image and the local features of the foot image, thereby improving the accuracy of foot ulcer evaluation.
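To make the data flow of steps S302-S310 concrete, the following PyTorch-style sketch composes placeholder backbones, a cross-attention fusion and a classification head. Every module here is a stand-in (the patent's networks are detailed in the later embodiments); the class name, backbone layers and the use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class FootUlcerPipeline(nn.Module):
    """Minimal sketch of the global/local/cross-fusion/prediction flow."""
    def __init__(self, feat_dim: int = 256, num_grades: int = 6):
        super().__init__()
        # Placeholder global and local feature extractors (stand-ins for the
        # first global/local feature extraction networks of figs. 4 and 11).
        self.global_net = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.local_net = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Cross fusion: global feature queries the local features.
        self.cross_fusion = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, num_grades)  # Wagner grades 0-5

    def forward(self, whole_image: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # whole_image: (B, 3, H, W); patches: (B, P, 3, h, w)
        b, p = patches.shape[:2]
        g = self.global_net(whole_image)                           # first foot complete feature (B, D)
        l = self.local_net(patches.flatten(0, 1)).view(b, p, -1)   # first foot local features (B, P, D)
        fused, _ = self.cross_fusion(g.unsqueeze(1), l, l)         # query = global, key/value = local
        return self.head(fused.squeeze(1))                         # foot ulcer evaluation logits
```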
Fig. 4 is a schematic diagram illustrating the structure of a first global feature extraction network according to an exemplary embodiment.
In some embodiments, the first global feature extraction network may include a multi-scale feature extraction network (e.g., 401 in fig. 4), an adaptive pooling layer (e.g., 402 in fig. 4), and a channel attention sub-network (e.g., 403 in fig. 4), and a weighting network (e.g., 404 in fig. 4).
In some embodiments, global feature extraction of the foot complete image through the first global feature extraction network may include the following steps: the foot complete image features corresponding to the foot complete image are processed through the multi-scale feature extraction network 401 and the adaptive pooling layer 402 in the first global feature extraction network (for the specific process, reference may be made to the embodiment shown in fig. 9); the channel attention weight coefficient is then determined through the channel attention sub-network 403; finally, the channel attention weight coefficient and the foot complete image features corresponding to the foot complete image are weighted through the weighting network 404 to obtain the first foot complete features.
Fig. 5 is a flowchart illustrating an image processing method according to an exemplary embodiment.
In some embodiments, the first global feature extraction network may include a channel attention sub-network, which may include a first pooling layer 602 and a second pooling layer 603, as shown in fig. 6. Wherein the first pooling layer may be a maximum pooling layer or an average pooling layer; the second pooling layer may be a maximum pooling layer or an average pooling layer.
Step S502, performing spatial compression processing on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain first spatial context description features.
As shown in fig. 6, the first spatial context description feature may be obtained by spatially compressing the foot complete image features corresponding to the foot complete image (e.g., feature map 601 in fig. 6) through the first pooling layer 602. For example, a feature in $\mathbb{R}^{w \times h \times d}$ may be compressed to $\mathbb{R}^{1 \times 1 \times d}$, where $w$ represents the width, $h$ the height, and $d$ the number of channels.
Wherein the first spatial context description feature is used to describe global context information of a complete image of the foot through multiple channels.
In some embodiments, the first spatial context description feature may also be extracted by a neural network 604 (e.g., the MLP (Multilayer Perceptron) network in fig. 6).
Step S504, performing spatial compression processing on the foot complete image features corresponding to the foot complete image through the second pooling layer to obtain second spatial context description features.
As shown in fig. 6, the second spatial context description feature may be obtained by performing spatial compression processing on the foot complete image feature 601 corresponding to the foot complete image through the second pooling layer 603. For example, a feature in $\mathbb{R}^{w \times h \times d}$ may be compressed to $\mathbb{R}^{1 \times 1 \times d}$, where $w$ represents the width, $h$ the height, and $d$ the number of channels.
Wherein the second spatial context description feature is also used to describe global context information of the complete image of the foot through multiple channels.
In some embodiments, the second spatial context description feature may also be extracted by a neural network 605 (e.g., the MLP network in fig. 6).
In step S506, the first spatial context description feature and the second spatial context description feature are summed to obtain the target spatial context description feature.
As shown in fig. 6, the first spatial context description feature and the second spatial context description feature may be summed by a summing structure 606 to obtain the target spatial context description feature.
Step S508, activating the target space context description feature to obtain the channel attention weight coefficient.
As shown in fig. 6, the channel attention weighting coefficients may be obtained by activating the target space context description feature by the activation structure 607.
In step S510, the weighting process is performed on the foot complete image features corresponding to the foot complete image through the channel attention weight coefficient, so as to obtain the first foot complete feature.
As shown in fig. 4, the foot complete image features corresponding to the foot complete image may be weighted by the channel attention weight coefficient to obtain the first foot complete features; alternatively, the features processed by the adaptive pooling layer may be weighted by the channel attention weight coefficient to obtain the first foot complete features.
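A minimal PyTorch-style sketch of the channel attention sub-network of fig. 6 is given below, assuming the first and second pooling layers are global max and average pooling, the networks 604/605 form a shared MLP bottleneck, and the activation 607 is a sigmoid (a CBAM-style reading). The class name and reduction ratio are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch: two spatial compressions -> shared MLP -> sum -> sigmoid -> re-weight."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # first pooling layer (w x h x d -> 1 x 1 x d)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # second pooling layer
        self.mlp = nn.Sequential(                 # shared MLP over the compressed descriptors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W) foot complete image features
        a = self.mlp(self.max_pool(x))            # first spatial context description feature
        b = self.mlp(self.avg_pool(x))            # second spatial context description feature
        w = torch.sigmoid(a + b)                  # channel attention weight coefficients
        return x * w                              # weighted features -> first foot complete features
```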
FIG. 7 is a schematic diagram of a multi-scale feature extraction network, according to an example embodiment.
The application also proposes an image processing method shown in fig. 8 according to the structural schematic diagram of the multi-scale feature extraction network shown in fig. 7.
Referring to fig. 8, the above-described image processing method may include the following steps.
Step S802, feature extraction processing is carried out on the foot complete image, and foot complete image feature vectors are obtained.
Step S804, segmenting the feature vector of the foot complete image according to the channels to obtain a plurality of channel segmented features, wherein the channel information corresponding to each channel segmented feature is different.
In some embodiments, the foot complete image feature vector may be segmented per channel to obtain a plurality of channel-segmented features (e.g., $x_1$, $x_2$, $x_3$ and $x_4$ in FIG. 7), wherein the channel information corresponding to each channel-segmented feature is different.
For example, given a foot complete image feature vector $X \in \mathbb{R}^{w \times h \times C}$, where $w \times h$ is the spatial size of the feature matrix and $C$ is the number of feature channels, the feature matrix may be divided into 4 parts along the channel dimension to obtain $x_1$, $x_2$, $x_3$ and $x_4$, i.e. $x_i \in \mathbb{R}^{w \times h \times (C/4)}$. Each feature subset $x_i$ has the same spatial size, where $i$ is an integer greater than or equal to 1 and $\mathbb{R}$ denotes the set of real numbers.
Step S806, the convolution sub-networks with different receptive fields are used for carrying out convolution operation on the segmented features of the channels respectively, so as to obtain a plurality of scale features with different receptive fields.
As shown in fig. 7, the channel-segmented features may be convolved with their corresponding convolution kernels $K_i$, as represented by the following formula (1):

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases} \tag{1}$$

where $K_i$ denotes the convolution operation, $y_i$ denotes the output of $x_i$ after passing through the convolution layer $K_i$, $i$ is an integer greater than or equal to 1, and $s$ is the number of channel-segmented features.
Step S808, fusing the plurality of scale features to obtain foot complete image features corresponding to the foot complete image.
As shown in fig. 7, the plurality of scale features may be fused through a convolution layer to obtain the foot complete image features corresponding to the foot complete image.
As shown in fig. 4, after the foot complete image features corresponding to the foot complete image are obtained through the multi-scale feature extraction network, the self-adaptive pooling layer 402 may be used to pool the foot complete image features, and then the attention sub-network 403 may be used to process the pooled features to obtain the channel attention weight coefficient.
The processing the pooled features through the attention sub-network 403 to obtain the channel attention weight coefficient may include step S810 to step S816.
Step S810, performing spatial compression processing on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain first spatial context description features.
In step S812, spatial compression processing is performed on the foot complete image features corresponding to the foot complete image through the second pooling layer, so as to obtain second spatial context description features.
Step S814, summing the first spatial context description feature and the second spatial context description feature to obtain the target spatial context description feature.
Step S816, performing activation processing on the target space context description feature to obtain a channel attention weight coefficient.
After obtaining the channel attention weighting coefficients, step S818 may continue to be performed.
In step S818, the weighting process is performed on the foot complete image features corresponding to the foot complete image through the channel attention weight coefficient to obtain the first foot complete feature.
After the channel attention weight coefficients are obtained, the channel attention weight coefficients and the foot complete image features may also be weighted by the weighting network 404 to obtain the first foot complete features, as shown in fig. 4.
Fig. 9 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Referring to fig. 9, the above-described image processing method may include the following steps.
In some embodiments, the plurality of channel-segmented features may include a first channel-segmented feature (e.g., $x_1$ in FIG. 7), a second channel-segmented feature (e.g., $x_2$ in FIG. 7) and a third channel-segmented feature (e.g., $x_3$ in FIG. 7); the convolution sub-networks with different receptive fields include a first convolution sub-network (e.g., $K_2$, 701 in FIG. 7) and a second convolution sub-network (e.g., $K_3$, 702 in FIG. 7), where the convolution kernels of the first convolution sub-network and the second convolution sub-network may be the same or different; and the plurality of scale features with different receptive fields include a first receptive field feature, a second receptive field feature and a third receptive field feature.
In step S902, the first channel segmented feature is used as a first receptive field feature.
As shown in fig. 7, the first channel-segmented feature $x_1$ may be taken as the first receptive field feature (e.g., $y_1$ in FIG. 7).
And step S904, carrying out convolution processing on the features after the second channel segmentation through the first convolution sub-network to obtain second receptive field features.
As shown in fig. 7, the second channel-segmented feature $x_2$ may be convolved by the first convolution sub-network (e.g., $K_2$ in FIG. 7) to obtain the second receptive field feature (e.g., $y_2$ in FIG. 7).
Step S906, fusing the second receptive field features and the third channel segmented features to obtain receptive field fusion features.
As shown in FIG. 7, the second receptive field feature (e.g., $y_2$ in FIG. 7) may be fused with the third channel-segmented feature (e.g., $x_3$ in FIG. 7) to obtain the receptive field fusion feature.
Step S908, performing convolution processing on the receptive field fusion features through a second convolution sub-network to obtain third receptive field features.
As shown in fig. 7, the receptive field fusion feature may be convolved by the second convolution sub-network (e.g., $K_3$, 702 in FIG. 7) to obtain the third receptive field feature (e.g., $y_3$ in FIG. 7).
In some embodiments, the third receptive field feature (e.g., $y_3$ in FIG. 7) may also be fused with the fourth channel-segmented feature (e.g., $x_4$ in FIG. 7) to obtain a second receptive field fusion feature; the second receptive field fusion feature is then convolved by a third convolution sub-network (e.g., $K_4$, 703 in FIG. 7) to obtain a fourth receptive field feature.
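The channel splitting and hierarchical convolutions of steps S902-S908 can be sketched in PyTorch as follows; using four splits and 3x3 convolutions is an assumption consistent with the example above, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Sketch of formula (1): y1 = x1, y2 = K2(x2), y_i = K_i(x_i + y_{i-1}) for i > 2."""
    def __init__(self, channels: int, splits: int = 4):
        super().__init__()
        assert channels % splits == 0
        self.splits = splits
        width = channels // splits
        # One convolution per split except the first; each deepens the receptive field.
        self.convs = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in range(splits - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        xs = torch.chunk(x, self.splits, dim=1)            # channel-segmented features x1..xs
        ys = [xs[0]]                                        # first receptive field feature y1 = x1
        for i, conv in enumerate(self.convs, start=1):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]       # fuse the previous output before convolving
            ys.append(conv(inp))                            # next receptive field feature
        return torch.cat(ys, dim=1)                         # spliced multi-scale features
```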
Step S910, performing a stitching process on the first receptive field feature, the second receptive field feature, and the third receptive field feature, to obtain a stitched feature.
As shown in fig. 7, the first receptive field feature, the second receptive field feature, and the third receptive field feature (and the fourth receptive field feature) may be subjected to a stitching process to obtain stitched features.
Step S912, the spliced features are subjected to pooling processing in the space dimension through the global average pooling layer, so that channel features are obtained.
In step S914, the channel features are processed through the full connection layer to fit the correlation between channels of the channel features.
And step S916, determining the complete image characteristics of the foot according to the channel characteristics and the spliced characteristics after the full connection layer processing.
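Steps S910-S916 (splicing, global average pooling in the spatial dimension, fully connected fitting of the inter-channel correlation, and combining the result with the spliced features) admit an SE-style sketch such as the one below. Re-weighting the spliced features with a sigmoid output is an assumption about the final combination, which the description leaves implicit.

```python
import torch
import torch.nn as nn

class ScaleFusion(nn.Module):
    """Sketch: spliced scale features -> GAP -> FC bottleneck -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # pooling in the spatial dimension
        self.fc = nn.Sequential(                      # fully connected layers fitting channel correlation
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:   # spliced: (B, C, H, W)
        b, c, _, _ = spliced.shape
        channel_feat = self.gap(spliced).view(b, c)              # channel features
        weights = self.fc(channel_feat).view(b, c, 1, 1)          # fitted inter-channel correlation
        return spliced * weights                                  # foot complete image features
```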
Fig. 10 is a flowchart illustrating an image processing method according to an exemplary embodiment.
In some embodiments, the plurality of foot partial image blocks includes a first foot partial image block. The present embodiment is described by taking the first foot partial image block as an example; the other foot partial image blocks can be processed with reference to the embodiment shown in fig. 10.
Fig. 11 is a schematic diagram illustrating the structure of a first local feature extraction network according to an exemplary embodiment.
As shown in fig. 11, the first local feature extraction network may include a multi-scale feature extraction network 1101, an adaptive pooling layer 1102, a spatial attention sub-network 1103, and a weighting network 1104.
In some embodiments, performing local feature extraction on the first foot partial image block through the first local feature extraction network to obtain the first foot partial feature corresponding to the first foot partial image block may include the following steps: processing the first foot partial image block through the multi-scale feature extraction network 1101 and the adaptive pooling layer 1102 in the first local feature extraction network (for the specific process, reference may be made to the embodiment shown in fig. 9); then processing the output of the adaptive pooling layer 1102 through the spatial attention sub-network 1103 to determine a spatial attention weight coefficient; and finally weighting the spatial attention weight coefficient and the output of the adaptive pooling layer 1102 through the weighting network 1104.
Fig. 12 is a schematic diagram illustrating the structure of a spatial attention sub-network according to an exemplary embodiment.
As shown in fig. 12, the spatial attention sub-network may include a third pooling layer (e.g., max pooling layer 1202 in fig. 12) and a fourth pooling layer (e.g., average pooling layer 1203 in fig. 12), a feature concatenation module (e.g., 1204 in fig. 12), a convolution layer (e.g., 1205 in fig. 12), and an activation layer (e.g., 1206 in fig. 12).
The process of acquiring a first foot partial feature according to the present application will be described in detail with reference to fig. 11 and 12.
Referring to fig. 10, the above-described image processing method may include the following steps.
Step S1002, performing channel compression processing on the local image features of the foot corresponding to the first local image block of the foot through the third pooling layer, to obtain a first channel feature map, where the local image features of the foot corresponding to the first local image block of the foot are obtained after feature extraction is performed on the local image block of the foot.
In some embodiments, the first channel feature map may be obtained by performing channel compression processing on the local image features of the foot (e.g., feature map 1201 in fig. 12) corresponding to the first local image block of the foot through a third pooling layer (e.g., maximum pooling layer 1202 in fig. 12).
Step S1004, performing channel compression processing on the partial image features of the foot corresponding to the partial image block of the first foot through the fourth pooling layer to obtain a second channel feature map.
In some embodiments, the second channel feature map may be obtained by performing channel compression processing on the local image features of the foot (e.g., 1201 in fig. 12) corresponding to the first local image block of the foot through the fourth pooling layer (e.g., 1203 in fig. 12).
Step S1006, cascading the first channel characteristic diagram and the second channel characteristic diagram to obtain a target channel characteristic diagram.
As shown in fig. 12, the first channel feature map and the second channel feature map may be cascaded by a feature cascading module 1204 to obtain a target channel feature map.
Step S1008, performing activation processing on the target channel feature map to obtain a spatial attention weight coefficient.
In some embodiments, the target channel feature map may be convolved by the convolution layer 1205, and then the convolved target channel feature map may be activated by the activation layer 1206 to obtain the spatial attention weighting coefficients.
In step S1010, the foot local image features corresponding to the first foot local image block are weighted by the spatial attention weight coefficient, so as to obtain the first foot local feature corresponding to the first foot local image block.
As shown in fig. 11, the spatial attention weighting coefficients and the processing results of the adaptive pooling layer 1102 may be weighted using a weighting network 1104.
And weighting the foot partial image features corresponding to the first foot partial image block through the space attention weight coefficient to obtain the first foot partial features corresponding to the first foot partial image block.
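By way of illustration only, the following sketch shows one possible realization of the spatial attention sub-network of fig. 12 and steps S1002-S1010, assuming a PyTorch-style implementation; the class name SpatialAttention and the convolution kernel size are assumptions of this sketch and are not specified by the application.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention sub-network: channel-wise max/average pooling,
    cascading, convolution, and sigmoid activation (cf. fig. 12)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two input channels: one from max pooling, one from average pooling.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Steps S1002/S1004: compress the channel dimension with max and
        # average pooling, each yielding an H x W x 1 channel feature map.
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        # Step S1006: cascade the two channel feature maps.
        target = torch.cat([max_map, avg_map], dim=1)
        # Step S1008: convolution followed by sigmoid activation gives the
        # spatial attention weight coefficient.
        weight = self.sigmoid(self.conv(target))
        # Step S1010: weight the foot local image features.
        return x * weight
```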
Fig. 13 is a flow chart illustrating a cross-fusion processing method according to an exemplary embodiment.
In some embodiments, the first cross-fusion network may include a first parameter matrix, a second parameter matrix, and a third parameter matrix.
Referring to fig. 13, the above cross-fusion process may include the following steps.
In some embodiments, to fuse the global and local paths at multiple scales, the global features from the global path may be mapped as queries Q, and the local features from the local path may be mapped as keys K and values V.
In step S1302, the first query feature is obtained by performing projection processing on the complete feature of the first foot through the first parameter matrix.
In some embodiments, the first foot complete feature may be projected through the first parameter matrix W_Q to obtain the first query feature Q; reference may be made specifically to formula (2).
In step S1304, a fusion process is performed on the plurality of first local features of the foot to obtain local fusion features of the foot.
In some embodiments, the plurality of first foot partial features may be fused to obtain the foot partial fusion feature Y.
In step S1306, the local fusion feature of the foot is projected through the second parameter matrix, so as to obtain a first key feature.
In some embodiments, the foot local fusion feature Y may be projected through the second parameter matrix W_K to obtain the first key feature K.
Step S1308, performing projection processing on the local fusion feature of the foot through a third parameter matrix to obtain a first value feature.
In some embodiments, the foot local fusion feature may be projected through the third parameter matrix W_V to obtain the first value feature V.
Here, W_Q, W_K and W_V are projection matrices, which are learnable parameters.
In step S1310, activation processing is performed on the first query feature, the first key feature, and the first value feature, so as to obtain a first foot fusion feature.
In some embodiments, the first query feature, the first key feature, and the first value feature may be activated to obtain a first foot fusion feature.
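By way of illustration only, the following sketch shows one way the cross-fusion of steps S1302-S1310 could be realized with learnable projection matrices and a softmax activation, assuming a PyTorch-style implementation; the class name CrossFusion, the token layout of the inputs, and the scaling factor are assumptions of this sketch rather than details fixed by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion(nn.Module):
    """Cross-fusion: project the global (complete) feature to a query Q, project
    the fused local feature to a key K and a value V, then apply a
    softmax-activated attention (cf. steps S1302-S1310)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # first parameter matrix
        self.w_k = nn.Linear(dim, dim, bias=False)  # second parameter matrix
        self.w_v = nn.Linear(dim, dim, bias=False)  # third parameter matrix

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (B, num_blocks, tokens, dim). Step S1304: fuse the local
        # features, here simply by merging the block and token dimensions.
        local_fused = local_feats.flatten(1, 2)            # (B, M, dim)
        q = self.w_q(global_feat)                          # step S1302, (B, N, dim)
        k = self.w_k(local_fused)                          # step S1306, (B, M, dim)
        v = self.w_v(local_fused)                          # step S1308, (B, M, dim)
        # Step S1310: softmax activation over the query-key similarities.
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                    # fused feature, (B, N, dim)
```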
Fig. 14 is a diagram illustrating an image processing network architecture according to an exemplary embodiment.
Referring to fig. 14, the above-described image processing network structure may include: a first global feature extraction network 1402, a second global feature extraction network 1403, a first local feature extraction network 1405, a second local feature extraction network 1406, a first cross-fuse network 1407, a second cross-fuse network 1408, and a prediction network 1409.
Referring to the image processing network structure shown in fig. 14, the present application proposes the image processing method shown in fig. 15.
In step S1502, a foot complete image and a plurality of foot partial image blocks are acquired, wherein the plurality of foot partial image blocks are obtained after the foot complete image is segmented.
In some embodiments, a foot complete image may refer to a complete image that includes the entire foot (e.g., 1401 in fig. 14). A foot partial image block may refer to one of a plurality of image blocks (e.g., 2, 3, 4 or more blocks) obtained after dividing the foot complete image (e.g., 1404 in fig. 14).
In step S1504, global feature extraction is performed on the complete foot image through the first global feature extraction network, so as to obtain a first complete foot feature.
In some embodiments, the first foot-complete feature may be obtained by global feature extraction of a foot-complete image (e.g., 1401 in fig. 14) through a first global feature extraction network (e.g., 1402 in fig. 14). Wherein the first foot integral feature is a global feature extracted from the foot integral image that can describe the integral foot.
In step S1506, local feature extraction is performed on the plurality of local image blocks of the foot through the first local feature extraction network, so as to obtain a plurality of local features of the first foot.
In some embodiments, the plurality of first foot partial features may be obtained by performing partial feature extraction on a plurality of foot partial image blocks (e.g., 1404 in fig. 14) through a first partial feature extraction network (e.g., 1405 in fig. 14), respectively.
In some embodiments, since a foot partial image block is only part of the foot complete image, the features extracted from the foot partial image block describe only the corresponding local part of the foot.
In step S1508, the first foot integrated feature and the plurality of first foot local features are cross-fused by the first cross-fusion network, so as to obtain a first foot fusion feature.
As shown in fig. 14, the first foot fusion feature may be obtained by performing a cross-fusion process on the first foot complete feature and the plurality of first foot partial features through a first cross-fusion network (e.g., 1407 in fig. 14).
In step S1510, global feature extraction is performed on the first foot complete features through the second global feature extraction network, so as to obtain second foot complete features.
In some embodiments, the first foot-complete feature may be globally feature extracted via a second global feature extraction network (e.g., 1403 in fig. 14) to obtain a second foot-complete feature.
Step S1512, performing local feature extraction processing on the first foot local features through the second local feature extraction network to obtain second foot local features.
In some embodiments, the second foot local features may be obtained by performing local feature extraction on the first foot local features through a second local feature extraction network (e.g., 1406 in fig. 14).
And step S1514, performing cross fusion processing on the second foot complete feature and the second foot local feature through a second cross fusion network to obtain a second foot fusion feature.
As shown in fig. 14, the second foot fusion feature may be obtained by cross-fusing the second foot complete feature and the second foot partial feature via a second cross-fusion network (e.g., 1408 in fig. 14).
And step S1516, performing prediction processing on the first foot fusion feature and the second foot fusion feature through a prediction network to determine a foot ulcer evaluation result corresponding to the foot complete image.
As shown in fig. 14, the first foot fusion feature and the second foot fusion feature may be predicted by a prediction network (e.g., 1409 in fig. 14) to determine a foot ulcer evaluation result corresponding to the foot complete image.
The prediction network (1409 in fig. 14) may include a splicing sub-network 14091, a pooling sub-network 14092, and a fully connected sub-network 14093.
The method for predicting the first foot fusion feature and the second foot fusion feature through the prediction network to determine a foot ulcer evaluation result corresponding to a complete foot image comprises the following steps: performing downsampling processing on the first foot fusion feature to obtain a first downsampled fusion feature; performing downsampling processing on the second foot fusion feature to obtain a second downsampled fusion feature, wherein the feature dimension of the first downsampled fusion feature is the same as the feature dimension of the second downsampled fusion feature; splicing the first downsampling fusion feature and the second downsampling fusion feature to obtain a multi-level foot fusion feature; and determining a foot ulcer evaluation result corresponding to the foot complete image according to the multi-level foot fusion characteristics.
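By way of illustration only, the following sketch shows one possible realization of the prediction network (splicing, pooling and fully connected sub-networks) described above, assuming a PyTorch-style implementation; the hidden width of 256 and the use of adaptive average pooling for the downsampling are assumptions of this sketch, while the 6 output classes follow the 0-5 grading used later in this embodiment.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Downsample the two fusion features, splice them into a multi-level foot
    fusion feature, and map it to an ulcer grade."""

    def __init__(self, dim1: int, dim2: int, num_classes: int = 6):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # pooling sub-network (downsampling)
        self.fc = nn.Sequential(                   # fully connected sub-network
            nn.Linear(dim1 + dim2, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, fuse1: torch.Tensor, fuse2: torch.Tensor) -> torch.Tensor:
        f1 = self.pool(fuse1).flatten(1)           # first downsampled fusion feature
        f2 = self.pool(fuse2).flatten(1)           # second downsampled fusion feature
        multi_level = torch.cat([f1, f2], dim=1)   # splicing sub-network
        return self.fc(multi_level)                # logits over the ulcer grades
```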
In the above embodiment, on the one hand, feature extraction may be performed on the foot complete image through the first global feature extraction network to obtain global features of the foot complete image (i.e., the first foot complete features); on the other hand, local feature extraction can be carried out on a plurality of foot local image blocks obtained through foot complete image segmentation through a first local feature extraction module, so as to obtain local features corresponding to foot complete images; then, the global features of the foot complete image and the local features corresponding to the foot complete image are crossed and fused; and finally, predicting the foot ulcer by using the image features after the global features and the local features are crossed and fused. The method combines the global features of the foot image and the local features of the foot image, thereby improving the accuracy of foot evaluation.
The present application will be specifically described below with reference to specific embodiments of the image processing method described above.
The process implemented in this example is to construct a method for classifying severity of diabetic foot ulcers, which essentially comprises the following steps.
(1) Data acquisition. Foot images of diabetic foot patients are acquired, a data set is constructed, image labels are set, and all data are divided into a training set and a verification set in a ratio of 8:2.
(2) Data preprocessing. The foot images of diabetic foot patients are resized to a set size, and an image augmentation strategy is used to increase the number of training images, avoiding over-fitting of the model caused by too little data.
(3) The input image is input into the global feature extraction network of the multi-scale fusion network for global feature extraction.
(4) The image blocks of the input image are input into the local feature extraction network of the multi-scale fusion network for local feature extraction.
(5) The global features and local features of different scales are input into the cross fusion module for feature cross fusion.
(6) The fusion features of different scales are feature-cascaded to obtain a multi-scale fusion feature F.
(7) The fusion feature F is input into a pooling layer, which performs a global average pooling operation to obtain a downsampled feature map Fc.
(8) Fc is input into two fully connected layers and then into a softmax layer to obtain the classification result; the loss is calculated and back-propagated to update the gradients until model training is completed.
(9) Prediction evaluation. The test image is input into the trained model for feature inference to obtain an evaluation result of the foot ulcer degree for the diabetic foot patient.
As shown in fig. 14, the global feature extraction network may include 4 sets of RCM modules (for example, the first global feature extraction network 1402 and the second global feature extraction network 1403; the remaining RCM modules are not labeled with reference numerals in the figure), where each RCM module is a global feature extraction network. As shown in fig. 4, each RCM module may comprise an Ni-layer multi-scale feature extraction network (multi-scale feature extraction network 401 in fig. 4), a pooling layer (adaptive pooling layer 402 in fig. 4), and a channel attention module CAM (channel attention sub-network 403 in fig. 4), Ni being an integer greater than or equal to 1.
The local feature extraction network may be composed of 4 sets of RSM modules; as shown in fig. 11, each RSM module includes an Ni-layer multi-scale feature extraction network (multi-scale feature extraction network 1101 in fig. 11), a pooling layer (adaptive pooling layer 1102 in fig. 11), and a spatial attention module SAM (spatial attention sub-network 1103 in fig. 11).
The CAM optimizes the feature classification effect by extracting the importance of different channel features to the key information. The channel maps of high-level features can be regarded as responses to different semantic features and are interrelated. Channel attention can mine the dependency relationships between channel maps, obtain the importance of each feature channel, selectively attend to the information with large weight values according to this importance, and improve the representation of discriminative semantic features.
The SAM module focuses attention on the spatial position information of key features; by assigning different weights to the position information of features, the network learns, according to the weight distribution, the feature information useful for the classification task, thereby enhancing the expressive power of discriminative features.
The implementation steps of the technical scheme provided by the embodiment are as follows.
(1) Data acquisition: acquire foot images of diabetic foot patients.
(1.1) using a camera as a device for capturing images of the foot, the images being obtained in close-up of the foot, at a distance of about 30-40 cm, parallel to the plane of the ulcer.
(1.2) avoiding the use of a flash as the primary light source, but using enough room light to obtain a color-consistent image.
(1.3) marking the severity of lesions by grade for the acquired images of the foot of a diabetic foot patient by two clinically experienced podiatrists.
(1.4) Data labels are set by grade according to clinical manifestations into the following categories: grade 0 (intact skin), grade 1 (superficial ulcer), grade 2 (deep ulcer reaching bone, tendon, deep fascia or joint capsule), grade 3 (deep ulcer with abscess or osteomyelitis), grade 4 (partial gangrene of the forefoot), grade 5 (gangrene of the whole foot). The greater the number, the more severe the DF.
(1.5) dividing all marked data into a training set and a verification set according to the proportion of 8:2.
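By way of illustration only, the following sketch shows one way the 8:2 split of step (1.5) could be performed; the helper name split_dataset and the sample layout (image path, grade) are assumptions of this sketch.

```python
import random

def split_dataset(samples, train_ratio: float = 0.8, seed: int = 42):
    """Shuffle labelled samples and split them into training and validation sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# Example: samples could be (image_path, grade) pairs with grade in 0-5.
# train_set, val_set = split_dataset(labelled_samples)
```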
(2) Data augmentation is performed on the input images.
(2.1) the size of all images is adjusted to 256×256 to improve the performance and reduce the calculation cost.
(2.2) An image augmentation strategy, including rotation, random cropping, color channel transformations and the like, is used to enlarge the number of images in the data set and avoid over-fitting of the model caused by too little data (a preprocessing sketch is given after step (2.3)).
(2.3) The augmented sample data is input into the multi-scale fusion network for training; the network structure is shown in fig. 14.
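By way of illustration only, the following sketch shows preprocessing consistent with steps (2.1) and (2.2), assuming torchvision is used; the specific transform parameters are examples only, not values fixed by the application.

```python
from torchvision import transforms

# Training-time preprocessing and augmentation pipeline.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),                        # step (2.1): fixed input size
    transforms.RandomRotation(15),                        # rotation
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),  # random cropping
    transforms.ColorJitter(0.2, 0.2, 0.2),                # color channel perturbation
    transforms.ToTensor(),
])
```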
(3) The input image is input into the global feature extraction network of the multi-scale fusion network for global feature extraction.
(3.1) The global feature extraction network consists of 4 sets of RCM modules, each RCM module comprising an Ni-layer multi-scale feature extraction network, a pooling layer and a channel attention module CAM, where Ni is an integer greater than or equal to 1. The RCM module structure is shown in fig. 4.
The purpose of the channel attention module is to make the input features more meaningful: roughly speaking, the network computes the importance (weight) of each channel of the input, pays more attention to the channels containing key information and less attention to unimportant channels, thereby improving the feature representation capability.
(3.2) Given input feature data x ∈ R^(H×W×C), where H×W is the size of the feature matrix and C is the number of feature channels, the feature matrix is input into the multi-scale feature extraction network (the structure is shown in fig. 7) and divided into 4 parts along the channel dimension, denoted x_1, x_2, x_3, x_4, where each feature subset x_i has the same size and i ∈ {1, 2, 3, 4}. x_2, x_3 and x_4 are each passed through a convolution kernel in a hierarchical manner, which can be represented by formula (1):
y_i = x_i (i = 1); y_i = K_i(x_i) (i = 2); y_i = K_i(x_i + y_(i−1)) (2 < i ≤ 4),
where K_i represents the convolution operation and y_i represents the output of x_i through the convolution layer K_i.
(3.3) All y_i are concatenated, and a 1×1 convolution is used for multi-scale feature fusion; reference may be made to formula (5):
y = Conv_(1×1)(Concat(y_1, y_2, y_3, y_4)),
where Conv_(1×1) represents convolution with a 1×1 convolution kernel and Concat(·) represents concatenation of the vectors.
(3.4) The output of the multi-scale feature extraction network is input, through the pooling layer, to the channel attention module CAM, whose structure is shown in fig. 6. In this module, the feature y ∈ R^(H×W×C) is first compressed into R^(1×1×C) using a global average pooling layer (e.g., first pooling layer 602) and a max pooling layer (e.g., second pooling layer 603), generating two different spatial context descriptors F_avg and F_max.
(3.5) F_avg and F_max are input to a multi-layer perceptron (MLP) containing one hidden layer, which performs feature dimensionality reduction and restoration. In the multi-layer perceptron, the number of hidden-layer neurons is C/r, the number of output-layer neurons is C, and the parameters are shared between the two branches.
(3.6) The two output feature maps are summed element by element, and the summed feature map is activated with the Sigmoid activation function to obtain the channel attention weight coefficient M_c, as shown in formula (6):
M_c = σ(MLP(AvgPool(y)) + MLP(MaxPool(y))) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max))),
where σ represents the Sigmoid activation function, MLP(·) represents processing with the multi-layer perceptron, AvgPool(·) represents average pooling, MaxPool(·) represents max pooling, and W_0 and W_1 represent the weights of the hidden layer and the output layer of the multi-layer perceptron, respectively.
(3.7) The weight coefficient M_c is multiplied element by element with the input feature y to obtain the channel-attention-refined output feature y′, which can be represented by formula (7): y′ = M_c ⊗ y, where ⊗ denotes element-wise multiplication.
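By way of illustration only, the following sketch shows one possible realization consistent with steps (3.2)-(3.7), assuming a PyTorch-style implementation; the reduction ratio r, the 3×3 kernel size of the hierarchical convolutions, and the class names are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: average/max pooling, shared MLP, element-wise sum, Sigmoid (formulas (6) and (7))."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),   # hidden layer W_0 (C/r neurons)
            nn.Linear(channels // r, channels),              # output layer W_1 (C neurons)
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = y.shape
        avg = self.mlp(y.mean(dim=(2, 3)))                   # F_avg branch
        mx = self.mlp(y.amax(dim=(2, 3)))                    # F_max branch
        m_c = self.sigmoid(avg + mx).view(b, c, 1, 1)        # formula (6)
        return y * m_c                                       # formula (7)

class MultiScaleBlock(nn.Module):
    """Split into 4 channel groups with hierarchical convolutions, then a 1x1
    fusion convolution (formulas (1) and (5))."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        w = channels // 4
        self.convs = nn.ModuleList([nn.Conv2d(w, w, 3, padding=1) for _ in range(3)])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)
        y1 = x1                                              # identity branch
        y2 = self.convs[0](x2)
        y3 = self.convs[1](x3 + y2)                          # fuse before convolving
        y4 = self.convs[2](x4 + y3)
        return self.fuse(torch.cat([y1, y2, y3, y4], dim=1))  # formula (5)
```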
(4) The image blocks of the input image are input into the local feature extraction network of the multi-scale fusion network for local feature extraction.
(4.1) The local feature extraction network is composed of 4 sets of RSM modules, each RSM module comprising an Ni-layer multi-scale feature extraction network, a pooling layer and a spatial attention module SAM, where Ni is an integer greater than or equal to 1. The RSM module structure is shown in fig. 11.
(4.2) Given input feature data z ∈ R^(H×W×C), where H×W is the size of the feature matrix and C is the number of feature channels, the feature matrix is input into the multi-scale feature extraction network.
(4.3) The output of the multi-scale feature extraction network is input, through the pooling layer, to the spatial attention module SAM, whose structure is shown in fig. 12. The input feature map z is passed through an average pooling layer and a max pooling layer along the channel dimension to obtain two single-channel feature maps F′_avg and F′_max, which are then concatenated to compute a valid feature descriptor. A convolution layer is then applied to compress the channel dimension, giving a feature map of size H×W×1, which is passed through a Sigmoid activation layer to obtain the spatial attention weight coefficient M_s. Finally, the input feature z is multiplied element by element with M_s to obtain the spatially-attention-refined feature map Z. The spatial attention weight coefficient M_s can be represented by formula (8):
M_s = σ(f([AvgPool(z); MaxPool(z)])),
where σ represents the Sigmoid function, f(·) represents the convolution operation, AvgPool(·) represents average pooling, and MaxPool(·) represents max pooling.
(5) The global features and local features of different scales are input into the cross fusion module CA for local-global feature fusion.
(5.1) The global feature extraction network extracts a global depth feature, and the local feature extraction network extracts a local depth feature Z. The global depth feature contains the global context information of the entire input image, while the local depth feature Z contains fine-grained local information from the local image blocks; both have channel number d, height h and width w.
To fuse the global and local paths at multiple scales, the global features from the global path are mapped as queries Q, and the local features from the local path are mapped as keys K and values V; cross fusion is then performed by the cross fusion module (for details, reference may be made to formulas (2), (3) and (4)).
The feature fusion process of the two paths can be represented by formula (9):
A = softmax(Q K^T / √d) V,
where A represents the cross-fused features and T represents the transpose of the matrix.
(6) The fusion features of different scales are feature-cascaded according to formula (10) to obtain the multi-scale fusion feature F, where DS represents a downsampling operation and its subscript represents the downsampling multiple.
(7) The feature map F is input into a pooling layer, which performs a global average pooling operation to obtain the downsampled feature map F_S.
(8) F_S is input into two fully connected layers and then into a softmax layer to obtain the classification result; the loss is calculated and back-propagated to update the gradients until model training is completed.
(8.1) The softmax activation layer outputs 6 prediction labels corresponding to grades 0-5.
(8.2) The classification loss function uses the cross-entropy loss (see formula (11)), defined as:
L = −(1/N) Σ_(i=1..N) y_i · log(p_i),
where N represents the number of samples, p_i represents the i-th prediction result, and y_i represents the label corresponding to the i-th prediction result.
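By way of illustration only, the following sketch shows a training step consistent with step (8) and the cross-entropy loss of formula (11), assuming a PyTorch-style implementation; the function name, data-loader interface and optimizer choice are assumptions of this sketch (nn.CrossEntropyLoss applies log-softmax internally).

```python
import torch
import torch.nn as nn

def train_one_epoch(model: nn.Module, loader, optimizer: torch.optim.Optimizer,
                    device: str = "cpu") -> float:
    """One epoch of step (8): cross-entropy loss, back-propagation, gradient update."""
    criterion = nn.CrossEntropyLoss()      # formula (11) over the 6 ulcer grades
    model.train()
    total = 0.0
    for images, labels in loader:          # labels are integer grades 0-5
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                    # back-propagate to update gradients
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)
```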
(9) Prediction evaluation. A new foot image of a diabetic foot patient is loaded and input into the trained model for feature inference to obtain an evaluation result of the foot ulcer degree for that patient.
Early discovery of the critical pathological changes in the foot that lead to the development of DFU is important. Manual examination by a podiatrist is still the ideal approach for diagnosing DFU, but because the human resources and facilities of medical systems are limited, computer-aided systems based on image processing can assist clinicians in assessing DFU.
The pre-evaluation method for diabetic foot ulcer severity provided by the application completes local feature extraction, global feature extraction and local-global multi-scale feature fusion through the multi-scale fusion network, so as to obtain fusion features with fine-grained local feature representation and long-range context.
Since many diabetic patients have insufficient knowledge of diabetic foot symptoms, the diabetic foot ulcer evaluation method provided by the application can provide rapid feedback for patients, and has great potential for helping medical professionals and patients evaluate and follow up DFU in remote settings in the future.
It should be noted that, in the embodiments of the image processing method described above, the steps may be interleaved, replaced, added, or removed. Therefore, such reasonable permutations, combinations and transformations also belong to the protection scope of the present application, and the protection scope of the present application should not be limited to the embodiments.
Based on the same inventive concept, an image processing apparatus is also provided in the embodiments of the present application, as follows. Since the principle of solving the problem of the embodiment of the device is similar to that of the embodiment of the method, the implementation of the embodiment of the device can be referred to the implementation of the embodiment of the method, and the repetition is omitted.
Fig. 16 is a block diagram of an image processing apparatus according to an exemplary embodiment. Referring to fig. 16, an image processing apparatus 1600 provided by an embodiment of the present application may include: an image acquisition module 1601, a first foot full feature acquisition module 1602, a first foot partial feature acquisition module 1603, a cross fusion module 1604, and a prediction module 1605.
Wherein the image acquisition module 1601 may be configured to acquire a foot complete image and a plurality of foot partial image blocks, wherein the plurality of foot partial image blocks are obtained after dividing the foot complete image; the first foot complete feature obtaining module 1602 may be configured to perform global feature extraction on the foot complete image through a first global feature extraction network to obtain a first foot complete feature; the first foot local feature obtaining module 1603 may be configured to obtain a plurality of first foot local features by performing local feature extraction on the plurality of foot local image blocks through the first local feature extraction network, respectively; the cross fusion module 1604 may be configured to perform cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature; the prediction module 1605 may be configured to perform a prediction process on the first foot fusion feature via a prediction network to determine a foot ulcer evaluation corresponding to the complete image of the foot.
Here, the image acquisition module 1601, the first foot complete feature acquisition module 1602, the first foot local feature acquisition module 1603, the cross fusion module 1604, and the prediction module 1605 correspond to steps S302 to S310 in the method embodiment; these modules implement the same examples and application scenarios as their corresponding steps, but are not limited to the disclosure of the method embodiment. It should be noted that the modules described above may be implemented as part of an apparatus in a computer system, for example as a set of computer-executable instructions.
In some embodiments, the first global feature extraction network comprises a channel attention sub-network comprising a first pooling layer and a second pooling layer; wherein the first foot integrity feature acquisition module comprises: the device comprises a first pooling sub-module, a second pooling sub-module, a summation sub-module, a channel attention weight coefficient determination sub-module and a first foot complete characteristic determination sub-module.
The first pooling submodule is used for carrying out space compression processing on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain first space context description features; the second pooling submodule is used for carrying out space compression processing on the foot complete image features corresponding to the foot complete image through the second pooling layer to obtain second space context description features; the summation sub-module is used for carrying out summation processing on the first spatial context description characteristic and the second spatial context description characteristic to obtain a target spatial context description characteristic; the channel attention weight coefficient determination submodule is used for activating the context description feature of the target space to obtain a channel attention weight coefficient; the first foot complete feature determination submodule is used for carrying out weighting processing on foot complete image features corresponding to the foot complete images through the channel attention weight coefficient so as to obtain first foot complete features.
In some embodiments, the first global feature extraction network further comprises a multi-scale feature extraction network comprising a plurality of convolution sub-networks, wherein the receptive fields of the respective convolution sub-networks are different; wherein the apparatus further comprises: the device comprises a foot complete image feature vector acquisition module, a channel segmentation sub-module, a convolution processing module and a multi-scale fusion processing module.
The foot complete image feature vector acquisition module is used for performing feature extraction processing on the foot complete image to obtain a foot complete image feature vector before the space compression processing is performed on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain the first spatial context description feature, or through the second pooling layer to obtain the second spatial context description feature; the channel segmentation sub-module is used for segmenting the foot complete image feature vector according to channels to obtain a plurality of channel-segmented features, wherein the channel information corresponding to each channel-segmented feature is different; the convolution processing module is used for performing convolution operations on the channel-segmented features through the plurality of convolution sub-networks with different receptive fields to obtain a plurality of scale features with different receptive fields; the multi-scale fusion processing module is used for fusing the plurality of scale features to obtain the foot complete image features corresponding to the foot complete image.
In some embodiments, the plurality of channel-segmented features includes a first channel-segmented feature, a second channel-segmented feature, a third channel-segmented feature, the different-receptive-field convolution sub-networks include a first convolution sub-network, a second convolution sub-network, wherein the convolution kernels of the first convolution sub-network and the second convolution sub-network are different, and the different-receptive-field plurality of scale features includes a first receptive-field feature, a second receptive-field feature, and a third receptive-field feature; wherein, convolution processing module includes: the first receptive field feature determination submodule, the second receptive field feature determination submodule, the receptive field fusion feature determination submodule and the third receptive field feature determination submodule.
The first receptive field feature determination submodule is used for taking the first channel segmented features as first receptive field features; the second receptive field feature determining submodule is used for carrying out convolution processing on the features after the second channel segmentation through the first convolution sub-network to obtain second receptive field features; the receptive field fusion feature determination submodule is used for fusing the second receptive field feature and the third channel segmented feature to obtain receptive field fusion features; the third receptive field feature determining submodule is used for carrying out convolution processing on the receptive field fusion features through the second convolution sub-network to obtain third receptive field features.
In some embodiments, the multi-scale fusion processing module comprises: the device comprises a first splicing sub-module, a channel characteristic determining sub-module, a correlation determining sub-module and a foot complete image characteristic determining sub-module.
The first splicing submodule is used for carrying out splicing treatment on the first receptive field feature, the second receptive field feature and the third receptive field feature to obtain a spliced feature; the channel characteristic determining submodule is used for carrying out pooling treatment on the spliced characteristics in the space dimension through the global average pooling layer to obtain channel characteristics; the correlation determination submodule is used for processing the channel characteristics through the full connection layer so as to fit the correlation among channels of the channel characteristics; the foot complete image feature determination submodule is used for determining foot complete image features according to the channel features processed by the full connection layer and the spliced features.
In some embodiments, the first local feature extraction network comprises a spatial attention sub-network comprising a third pooling layer and a fourth pooling layer, and the plurality of foot local image blocks comprise a first foot local image block; wherein the first foot local feature acquisition module comprises: a first channel feature map determination sub-module, a second channel feature map determination sub-module, a target channel feature map determination sub-module, a spatial attention weight coefficient determination sub-module, and a spatial weighting processing sub-module.
The first channel characteristic map determining submodule is used for carrying out channel compression processing on the foot partial image characteristics corresponding to the first foot partial image block through the third pooling layer to obtain a first channel characteristic map, wherein the foot partial image characteristics corresponding to the first foot partial image block are obtained after the characteristic extraction of the first foot partial image block; the second channel characteristic diagram determining submodule is used for carrying out channel compression processing on the foot partial image characteristics corresponding to the first foot partial image block through the fourth pooling layer to obtain a second channel characteristic diagram; the target channel characteristic diagram determining submodule is used for cascading the first channel characteristic diagram and the second channel characteristic diagram to obtain a target channel characteristic diagram; the space attention weight coefficient determination submodule is used for carrying out activation processing on the target channel feature map to obtain a space attention weight coefficient; the spatial weighting processing sub-module is used for carrying out weighting processing on the foot local image features corresponding to the first foot local image block through the spatial attention weight coefficient to obtain first foot local features corresponding to the first foot local image block.
In some embodiments, the first cross-fusion network includes a first parameter matrix, a second parameter matrix, and a third parameter matrix; wherein, the cross fusion module includes: the system comprises a first query feature determination sub-module, a foot local fusion feature determination sub-module, a first key feature determination sub-module, a first value feature determination sub-module and a first foot fusion feature determination sub-module.
The first query feature determination submodule is used for carrying out projection processing on the complete features of the first foot through the first parameter matrix to obtain first query features; the foot local fusion characteristic determining submodule is used for carrying out fusion processing on the plurality of first foot local characteristics to obtain foot local fusion characteristics; the first key feature determination submodule is used for carrying out projection processing on the local fusion features of the foot through the second parameter matrix to obtain first key features; the first value characteristic determining submodule is used for carrying out projection processing on the local fusion characteristic of the foot through the third parameter matrix to obtain a first value characteristic; the first foot fusion feature determination submodule is used for activating the first query feature, the first key feature and the first value feature to obtain a first foot fusion feature.
In some embodiments, the apparatus further comprises: a second foot integrity feature determination module, a second foot local feature determination module, and a second foot fusion feature determination module.
The second foot complete feature determining module is used for carrying out global feature extraction on the first foot complete feature through a second global feature extraction network to obtain a second foot complete feature; the second foot local feature determining module is used for carrying out local feature extraction processing on the first foot local feature through a second local feature extraction network to obtain a second foot local feature; the second foot fusion feature determination module is used for performing cross fusion processing on the second foot complete feature and the second foot local feature through a second cross fusion network to obtain a second foot fusion feature.
Wherein the prediction module comprises: the evaluation result determining submodule; the evaluation result determination submodule is used for carrying out prediction processing on the first foot fusion characteristic and the second foot fusion characteristic through a prediction network so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
In some embodiments, the evaluation result determination submodule includes: the device comprises a first downsampling unit, a second downsampling unit, a downsampling splicing unit and an evaluation unit.
The first downsampling unit is used for downsampling the first foot fusion feature to obtain a first downsampled fusion feature; the second downsampling unit is used for downsampling the second foot fusion feature to obtain a second downsampling fusion feature, and the feature dimension of the first downsampling fusion feature is the same as the feature dimension of the second downsampling fusion feature; the downsampling splicing unit is used for splicing the first downsampling fusion characteristic and the second downsampling fusion characteristic to obtain a multi-level foot fusion characteristic; the evaluation unit is used for determining a foot ulcer evaluation result corresponding to the foot complete image according to the multi-level foot fusion characteristics.
Since the functions of the apparatus 1600 are described in detail in the corresponding method embodiments, the disclosure is not repeated here.
The modules and/or sub-modules and/or units involved in the embodiments of the present application may be implemented in software or in hardware. The described modules and/or sub-modules and/or units may also be provided in a processor. Wherein the names of the modules and/or sub-modules and/or units do not in some cases constitute a limitation of the module and/or sub-modules and/or units themselves.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module or portion of a program that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer program instructions.
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Fig. 17 shows a schematic diagram of an electronic device suitable for implementing an embodiment of the application. It should be noted that, the electronic device 1700 shown in fig. 17 is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present application.
As shown in fig. 17, the electronic apparatus 1700 includes a Central Processing Unit (CPU) 1701, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1702 or a program loaded from a storage portion 1708 into a Random Access Memory (RAM) 1703. In the RAM 1703, various programs and data necessary for the operation of the electronic device 1700 are also stored. The CPU 1701, ROM 1702, and RAM 1703 are connected to each other through a bus 1704. An input/output (I/O) interface 1705 is also connected to the bus 1704.
The following components are connected to the I/O interface 1705: an input section 1706 including a keyboard, a mouse, and the like; an output portion 1707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage portion 1708 including a hard disk or the like; and a communication section 1709 including a network interface card such as a LAN card, a modem, or the like. The communication section 1709 performs communication processing via a network such as the internet. The driver 1710 is also connected to the I/O interface 1705 as needed. A removable medium 1711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1710 so that a computer program read therefrom is installed into the storage portion 1708 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising computer program instructions for performing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1709, and/or installed from the removable media 1711. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 1701.
The computer readable storage medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable computer program instructions embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Computer program instructions embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
As another aspect, the present application also provides a computer-readable storage medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer-readable storage medium carries one or more programs which, when executed by a device, cause the device to perform functions including: acquiring a foot complete image and a plurality of foot partial image blocks, wherein the foot complete image is segmented to obtain a plurality of foot partial image blocks; performing global feature extraction on the foot complete image through a first global feature extraction network to obtain first foot complete features; respectively extracting local features of the plurality of foot local image blocks through a first local feature extraction network to obtain a plurality of first foot local features; performing cross fusion processing on the first foot complete characteristics and a plurality of first foot local characteristics through a first cross fusion network to obtain first foot fusion characteristics; and carrying out prediction processing on the first foot fusion characteristics through a prediction network so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer program instructions stored in a computer readable storage medium. The computer program instructions are read from a computer-readable storage medium and executed by a processor to implement the methods provided in the various alternative implementations of the above embodiments.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution of the embodiments of the present application may be embodied in the form of a software product, where the software product may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disc, a mobile hard disk, etc.), and includes several computer program instructions for causing an electronic device (may be a server or a terminal device, etc.) to perform a method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the details of construction, the manner of drawing, or the manner of implementation, which has been set forth herein, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. An image processing apparatus, comprising:
the image acquisition module is used for acquiring a foot complete image and a plurality of foot partial image blocks, wherein the foot partial image blocks are obtained after the foot complete image is segmented;
the first foot complete feature acquisition module is used for carrying out global feature extraction on the foot complete image through a first global feature extraction network to obtain a first foot complete feature;
the first foot local feature acquisition module is used for respectively carrying out local feature extraction on the plurality of foot local image blocks through a first local feature extraction network to obtain a plurality of first foot local features;
the cross fusion module is used for carrying out cross fusion processing on the first foot complete characteristics and the plurality of first foot local characteristics through a first cross fusion network to obtain first foot fusion characteristics;
and the prediction module is used for performing prediction processing on the first foot fusion characteristic through a prediction network so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
2. The apparatus of claim 1, wherein the first global feature extraction network comprises a channel attention sub-network comprising a first pooling layer and a second pooling layer; wherein the first foot integrity feature acquisition module comprises:
the first pooling sub-module is used for carrying out space compression processing on the foot complete image features corresponding to the foot complete image through the first pooling layer to obtain first space context description features;
the second pooling sub-module is used for carrying out space compression processing on the foot complete image features corresponding to the foot complete image through the second pooling layer to obtain second space context description features;
the summation sub-module is used for carrying out summation processing on the first spatial context description characteristic and the second spatial context description characteristic to obtain a target spatial context description characteristic;
the channel attention weight coefficient determination submodule is used for activating the context description feature of the target space to obtain a channel attention weight coefficient;
and the first foot complete characteristic determination submodule is used for carrying out weighting processing on the foot complete image characteristic corresponding to the foot complete image through the channel attention weight coefficient so as to obtain the first foot complete characteristic.
3. The apparatus of claim 2, wherein the first global feature extraction network further comprises a multi-scale feature extraction network comprising a plurality of convolution sub-networks, wherein the receptive fields of each convolution sub-network are different; wherein the apparatus further comprises:
the foot complete image feature vector acquisition module is used for carrying out space compression processing on foot complete image features corresponding to the foot complete image through the first pooling layer to obtain a first space context description feature or carrying out space compression processing on foot complete image features corresponding to the foot complete image through the second pooling layer to obtain a second space context description feature, and carrying out feature extraction processing on the foot complete image to obtain a foot complete image feature vector;
the channel segmentation sub-module is used for segmenting the foot complete image feature vector according to channels to obtain a plurality of channel segmented features, wherein channel information corresponding to each channel segmented feature is different;
the convolution processing module is used for carrying out convolution operation on the characteristics after the channels are divided through the convolution sub-networks with different receptive fields to obtain a plurality of scale characteristics with different receptive fields;
And the multi-scale fusion processing module is used for fusing the plurality of scale features to obtain foot complete image features corresponding to the foot complete image.
4. The apparatus of claim 3, wherein the plurality of channel-segmented features comprises a first channel-segmented feature, a second channel-segmented feature, a third channel-segmented feature, the different-receptive-field convolution sub-networks comprise a first convolution sub-network, a second convolution sub-network, wherein the convolution kernels of the first convolution sub-network and the second convolution sub-network are different, the different-receptive-field plurality of scale features comprise a first receptive-field feature, a second receptive-field feature, and a third receptive-field feature; wherein, convolution processing module includes:
a first receptive field feature determination submodule for taking the first channel segmented features as the first receptive field features;
the second receptive field feature determining submodule is used for carrying out convolution processing on the features after the second channel is segmented through the first convolution sub-network to obtain second receptive field features;
the receptive field fusion feature determination submodule is used for fusing the second receptive field feature and the third channel segmented feature to obtain receptive field fusion features;
And the third receptive field feature determining submodule is used for carrying out convolution processing on the receptive field fusion features through the second convolution subnetwork to obtain the third receptive field features.
5. The apparatus of claim 4, wherein the multi-scale fusion processing module comprises:
the first splicing submodule is used for carrying out splicing treatment on the first receptive field feature, the second receptive field feature and the third receptive field feature to obtain a spliced feature;
the channel characteristic determining submodule is used for carrying out pooling treatment on the spliced characteristics in the space dimension through the global average pooling layer to obtain channel characteristics;
a correlation determination submodule, configured to process the channel characteristics through a full connection layer to fit correlations between channels of the channel characteristics;
and the foot complete image feature determining sub-module is used for determining the foot complete image feature according to the channel feature processed by the full connection layer and the spliced feature.
6. The apparatus of claim 1, wherein the first local feature extraction network comprises a spatial attention sub-network comprising a third pooling layer and a fourth pooling layer, the plurality of foot local image blocks comprising a first foot local image block; wherein the first foot local feature acquisition module comprises:
The first channel characteristic map determining submodule is used for carrying out channel compression processing on the foot local image characteristics corresponding to the first foot local image block through the third pooling layer to obtain a first channel characteristic map, wherein the foot local image characteristics corresponding to the first foot local image block are obtained after the characteristic extraction is carried out on the first foot local image block;
the second channel characteristic diagram determining submodule is used for carrying out channel compression processing on the foot partial image characteristics corresponding to the first foot partial image block through the fourth pooling layer to obtain a second channel characteristic diagram;
the target channel characteristic diagram determining submodule is used for cascading the first channel characteristic diagram and the second channel characteristic diagram to obtain a target channel characteristic diagram;
the space attention weight coefficient determining submodule is used for activating the target channel feature map to obtain a space attention weight coefficient;
and the space weighting processing sub-module is used for carrying out weighting processing on the foot local image features corresponding to the first foot local image block through the space attention weight coefficient to obtain the first foot local features corresponding to the first foot local image block.
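A minimal sketch of the spatial attention sub-network of claim 6, assuming the third and fourth pooling layers are channel-wise average and max pooling and that a 7x7 convolution precedes the sigmoid activation; these specific choices are assumptions, not recited in the claim:

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention over a foot local image feature (claim 6)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # assumed 7x7 conv before activation
        self.sigmoid = nn.Sigmoid()

    def forward(self, local_feature: torch.Tensor):
        # Channel compression of the foot local image feature.
        avg_map = torch.mean(local_feature, dim=1, keepdim=True)    # first channel feature map
        max_map, _ = torch.max(local_feature, dim=1, keepdim=True)  # second channel feature map
        target = torch.cat([avg_map, max_map], dim=1)                # target channel feature map (cascaded)
        weight = self.sigmoid(self.conv(target))                     # spatial attention weight coefficient
        return local_feature * weight                                # first foot local feature
```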
7. The apparatus of claim 1, wherein the first cross fusion network comprises a first parameter matrix, a second parameter matrix and a third parameter matrix; wherein the cross fusion module comprises:
a first query feature determining submodule, used for performing projection processing on the first foot complete feature through the first parameter matrix to obtain a first query feature;
a foot local fusion feature determining submodule, used for fusing the plurality of first foot local features to obtain a foot local fusion feature;
a first key feature determining submodule, used for performing projection processing on the foot local fusion feature through the second parameter matrix to obtain a first key feature;
a first value feature determining submodule, used for performing projection processing on the foot local fusion feature through the third parameter matrix to obtain a first value feature;
and a first foot fusion feature determining submodule, used for activating the first query feature, the first key feature and the first value feature to obtain the first foot fusion feature.
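A minimal sketch of the cross fusion of claim 7: the complete feature is projected to a query and the fused local features to a key and a value by three parameter matrices, and the three are combined. Softmax (scaled dot-product) attention and mean pooling of the local features are assumptions; the claim itself only recites projection and activation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossFusion(nn.Module):
    """Cross fusion of a complete feature with fused local features (claim 7)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # first parameter matrix
        self.w_k = nn.Linear(dim, dim, bias=False)   # second parameter matrix
        self.w_v = nn.Linear(dim, dim, bias=False)   # third parameter matrix

    def forward(self, complete_feature, local_features):
        # complete_feature: (batch, dim); local_features: (batch, num_blocks, dim)
        local_fused = local_features.mean(dim=1)              # foot local fusion feature (mean assumed)
        q = self.w_q(complete_feature).unsqueeze(1)           # first query feature
        k = self.w_k(local_fused).unsqueeze(1)                # first key feature
        v = self.w_v(local_fused).unsqueeze(1)                # first value feature
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                          # first foot fusion feature
```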
8. The apparatus of claim 1, wherein the apparatus further comprises:
a second foot complete feature determining module, used for performing global feature extraction on the first foot complete feature through a second global feature extraction network to obtain a second foot complete feature;
a second foot local feature determining module, used for performing local feature extraction on the first foot local feature through a second local feature extraction network to obtain a second foot local feature;
a second foot fusion feature determining module, used for performing cross fusion processing on the second foot complete feature and the second foot local feature through a second cross fusion network to obtain a second foot fusion feature;
wherein the prediction module comprises:
an evaluation result determining submodule, used for performing prediction processing on the first foot fusion feature and the second foot fusion feature through a prediction network, so as to determine the foot ulcer evaluation result corresponding to the foot complete image.
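A minimal sketch of the second-stage cascade and prediction of claim 8, reusing the CrossFusion sketch above; the linear layers merely stand in for the second global and local feature extraction networks, and concatenating the two fusion features before the prediction network is an assumption:

```python
import torch
import torch.nn as nn


class TwoStageFusionHead(nn.Module):
    """Second-stage feature extraction, cross fusion, and prediction (claim 8)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.global2 = nn.Linear(dim, dim)               # second global feature extraction network (stand-in)
        self.local2 = nn.Linear(dim, dim)                # second local feature extraction network (stand-in)
        self.cross2 = CrossFusion(dim)                   # second cross fusion network (see sketch above)
        self.predict = nn.Linear(2 * dim, num_classes)   # prediction network

    def forward(self, first_complete, first_locals, first_fusion):
        second_complete = self.global2(first_complete)                 # second foot complete feature
        second_locals = self.local2(first_locals)                      # second foot local features
        second_fusion = self.cross2(second_complete, second_locals)    # second foot fusion feature
        logits = self.predict(torch.cat([first_fusion, second_fusion], dim=-1))
        return logits                                                  # foot ulcer evaluation result
```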
9. An electronic device, comprising:
a memory and a processor;
the memory is used for storing computer program instructions; the processor invokes the computer program instructions stored by the memory to implement an image processing method comprising:
acquiring a foot complete image and a plurality of foot local image blocks, wherein the plurality of foot local image blocks are obtained by segmenting the foot complete image;
performing global feature extraction on the foot complete image through a first global feature extraction network to obtain a first foot complete feature;
performing local feature extraction on each of the plurality of foot local image blocks through a first local feature extraction network to obtain a plurality of first foot local features;
performing cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature;
and performing prediction processing on the first foot fusion feature through a prediction network, so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
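For illustration only, the acquisition step of claims 9 and 10 might segment the complete image into local blocks on a regular grid; the grid size and function name are assumptions, as the claims do not fix the segmentation scheme:

```python
import torch


def split_into_local_blocks(foot_image: torch.Tensor, grid: int = 2):
    """Segment the foot complete image into foot local image blocks (claims 9-10).
    A regular grid split is assumed."""
    _, _, h, w = foot_image.shape
    bh, bw = h // grid, w // grid
    return [
        foot_image[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
        for i in range(grid)
        for j in range(grid)
    ]


# Example: a 2x2 grid yields four local blocks from one 224x224 complete image.
image = torch.randn(1, 3, 224, 224)
blocks = split_into_local_blocks(image)
assert len(blocks) == 4 and blocks[0].shape == (1, 3, 112, 112)
```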
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement an image processing method comprising:
acquiring a foot complete image and a plurality of foot local image blocks, wherein the plurality of foot local image blocks are obtained by segmenting the foot complete image;
performing global feature extraction on the foot complete image through a first global feature extraction network to obtain a first foot complete feature;
performing local feature extraction on each of the plurality of foot local image blocks through a first local feature extraction network to obtain a plurality of first foot local features;
performing cross fusion processing on the first foot complete feature and the plurality of first foot local features through a first cross fusion network to obtain a first foot fusion feature;
and performing prediction processing on the first foot fusion feature through a prediction network, so as to determine a foot ulcer evaluation result corresponding to the foot complete image.
CN202311395932.6A 2023-10-26 2023-10-26 Image processing apparatus, electronic device, and computer-readable storage medium Active CN117152575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311395932.6A CN117152575B (en) 2023-10-26 2023-10-26 Image processing apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311395932.6A CN117152575B (en) 2023-10-26 2023-10-26 Image processing apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN117152575A true CN117152575A (en) 2023-12-01
CN117152575B CN117152575B (en) 2024-02-02

Family

ID=88901025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311395932.6A Active CN117152575B (en) 2023-10-26 2023-10-26 Image processing apparatus, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN117152575B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457624A (en) * 2022-08-18 2022-12-09 中科天网(广东)科技有限公司 Mask wearing face recognition method, device, equipment and medium with local and overall face features cross-fused
CN115795081A (en) * 2023-01-20 2023-03-14 安徽大学 Cross-domain incomplete footprint image retrieval system based on multi-channel fusion
CN115909455A (en) * 2022-11-16 2023-04-04 航天恒星科技有限公司 Expression recognition method integrating multi-scale feature extraction and attention mechanism
CN115937609A (en) * 2022-10-31 2023-04-07 浙江大学 Corneal disease image detection and classification method and device based on local and global information
CN116188436A (en) * 2023-03-03 2023-05-30 合肥工业大学 Cystoscope image classification method based on fusion of local features and global features
CN116246103A (en) * 2023-02-07 2023-06-09 齐鲁工业大学(山东省科学院) Method and system for classifying diabetic retinopathy images
CN116645347A (en) * 2023-05-29 2023-08-25 中科(安徽)G60智慧健康创新研究院 Diabetes foot prediction method, device and equipment based on multi-view feature fusion
CN116863194A (en) * 2023-05-18 2023-10-10 陕西师范大学 Foot ulcer image classification method, system, equipment and medium

Also Published As

Publication number Publication date
CN117152575B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN111325258A (en) Characteristic information acquisition method, device, equipment and storage medium
CN110414607A (en) Classification method, device, equipment and the medium of capsule endoscope image
CN116433660B (en) Medical image data processing device, electronic apparatus, and computer-readable storage medium
CN112767303A (en) Image detection method, device, equipment and computer readable storage medium
CN114359289A (en) Image processing method and related device
CN114372564A (en) Model training method for object classification, object classification method and device
CN112330624A (en) Medical image processing method and device
Sun et al. Scale-free heterogeneous cycleGAN for defogging from a single image for autonomous driving in fog
CN112115900B (en) Image processing method, device, equipment and storage medium
CN114332553A (en) Image processing method, device, equipment and storage medium
CN117152575B (en) Image processing apparatus, electronic device, and computer-readable storage medium
CN113192639A (en) Training method, device and equipment of information prediction model and storage medium
WO2023197910A1 (en) User behavior prediction method and related device thereof
CN111488887A (en) Image processing method and device based on artificial intelligence
Ali et al. Active contour image segmentation model with de‐hazing constraints
CN113221796B (en) Vector neuron-based pedestrian attribute identification method and system
CN117693754A (en) Training masked automatic encoders for image restoration
Quan et al. TADSRNet: A triple-attention dual-scale residual network for super-resolution image quality assessment
Deng et al. Saturation-based quality assessment for colorful multi-exposure image fusion
CN113256556A (en) Image selection method and device
Xiao et al. Lightweight Multi-modal Representation Learning for RGB Salient Object Detection
Knight et al. Towards a methodology for creating time-critical, cloud-based CUDA applications
Lu et al. Channel splitting attention network for low‐light image enhancement
Shanqing et al. A multi-level feature weight fusion model for salient object detection
Elloumi et al. A locally weighted metric for measuring the perceptual quality of 3D objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant