CN114926479A - Image processing method and device


Info

Publication number
CN114926479A
Authority
CN
China
Prior art keywords
feature map
value
frame image
spatial
memory
Prior art date
Legal status
Pending
Application number
CN202210590677.XA
Other languages
Chinese (zh)
Inventor
郑青青
李嘉陆
王琼
赵保亮
胡颖
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210590677.XA priority Critical patent/CN114926479A/en
Publication of CN114926479A publication Critical patent/CN114926479A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G06T 7/0014 Biomedical image inspection using an image reference approach
    • G06T 7/0016 Biomedical image inspection using an image reference approach involving temporal comparison
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10132 Ultrasound image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The application relates to the technical field of image processing and provides an image processing method and device, including the following steps: acquiring a target frame image, a plurality of memory frame images and a guide frame image, wherein the plurality of memory frame images contain a lesion mask, and the variation between the target frame image and the guide frame image is within a preset range; performing local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images to obtain a temporal feature map; performing spatial similarity calculation on the target frame image and the guide frame image to obtain a spatial feature map; and determining a prediction mask of the lesion in the target frame image according to the temporal feature map and the spatial feature map. By decoupling the temporal and spatial dimensions and processing them in parallel, the computational complexity and time cost are reduced, computing resources are saved, the speed and accuracy of lesion-region detection are improved, and the requirement for real-time, low-latency detection is met.

Description

Image processing method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method and apparatus.
Background
With the continuous development of science and technology, computers can be used to segment lesion regions in ultrasound video and obtain information such as the mask and time of a lesion, which is important for clinical examination and treatment.
The generation of ultrasound video is accompanied by artifacts such as speckle noise, low contrast and intensity non-uniformity, which blur the boundary between lesion regions and non-lesion regions. To address this, semi-supervised space-time memory network (STM) algorithms have been proposed.
Currently, an STM algorithm usually adopts a global attention matching mechanism to distinguish lesion regions from non-lesion regions and thereby segment the ultrasound video. However, the lesion mask is typically located in cross-frame images and in a local region of each image. The global attention matching mechanism may therefore introduce misleading information and a large amount of computation, and the mechanism itself is computationally expensive, making it difficult to meet the requirement of real-time segmentation.
Disclosure of Invention
The present application provides an image processing method and device, which can solve the problems in the related art of a large amount of computation and the inability to meet the requirement for rapidly segmenting lesion regions from non-lesion regions.
In a first aspect, the present application provides an image processing method, including:
acquiring a target frame image, a plurality of memory frame images and a guide frame image, wherein the plurality of memory frame images contain a lesion mask, and the variation between the target frame image and the guide frame image is within a preset range;
performing local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images to obtain a temporal feature map;
performing spatial similarity calculation on the target frame image and the guide frame image to obtain a spatial feature map;
and determining a prediction mask of the lesion in the target frame image according to the temporal feature map and the spatial feature map.
With the image processing method of the first aspect, the target frame image, the plurality of memory frame images and the guide frame image can be acquired. From the perspective of decoupling the temporal and spatial dimensions, attention can be paid to local regions of the target frame image and the plurality of memory frame images: local similarity calculation is performed on the target frame image and each memory frame image to obtain a temporal feature map, so that the local similarity between the target frame image and the plurality of memory frame images with respect to the lesion mask can be determined. Meanwhile, attention can be paid to the global/overall regions of the target frame image and the guide frame image: overall similarity calculation is performed on the target frame image and the guide frame image to obtain a spatial feature map, so that the global similarity between the target frame image and the guide frame image with respect to the variation (such as static background texture) can be determined. The prediction mask of the lesion in the target frame image is then determined according to the temporal feature map and the spatial feature map. Therefore, in the present application, by decoupling the temporal and spatial dimensions and processing them in parallel, the lesion mask in the image can be accurately predicted, the computational complexity and time cost are reduced, computing resources are saved, interference information in the image is effectively filtered, the speed and accuracy of lesion-region detection are improved, and the requirement for real-time, low-latency detection is met.
In a possible design, performing local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images to obtain a temporal feature map includes:
obtaining a first key feature map and a first value feature map according to the target frame image, wherein the first key feature map corresponds to the first value feature map;
obtaining a second key feature map and a second value feature map according to each memory frame image, wherein the second key feature map corresponds to the second value feature map;
dividing the first key feature map, the first value feature map, the plurality of second key feature maps and the plurality of second value feature maps into a preset number of non-overlapping region blocks;
performing, according to the time sequence of the plurality of memory frame images, local similarity calculation on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps to obtain the local similarity;
and obtaining the temporal feature map according to the local similarity, each region block in the first value feature map and each region block in the plurality of second value feature maps.
In one possible design, performing local similarity calculation, according to the time sequence of the plurality of memory frame images, on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps to obtain the local similarity includes:
sorting the region blocks in the plurality of second key feature maps in chronological order according to the same index to obtain a first set;
and performing local similarity calculation and normalization on each region block in the first key feature map and the region blocks corresponding to the same index in the first set to obtain the local similarity.
In one possible design, obtaining the temporal feature map according to the local similarity, each region block in the first value feature map, and each region block in the plurality of second value feature maps includes:
weighting the local similarity and each region block in the plurality of second value feature maps to obtain a memory value feature map;
and merging the memory value feature map with each region block in the first value feature map according to the same region-block index to obtain the temporal feature map.
In one possible design, the method further includes:
obtaining a third value feature map according to the temporal feature map;
dividing the third value feature map into a preset number of non-overlapping region blocks;
and updating the temporal feature map according to the local similarity, each region block in the first value feature map and each region block in the third value feature map.
In one possible design, performing spatial similarity calculation on the target frame image and the guide frame image to obtain the spatial feature map includes:
obtaining a fourth key feature map and a fourth value feature map according to the guide frame image, wherein the fourth key feature map corresponds to the fourth value feature map;
performing spatial similarity calculation on the first key feature map and the fourth key feature map to obtain the spatial similarity;
weighting the spatial similarity and the fourth value feature map to obtain a guide value feature map;
and merging the guide value feature map with the first value feature map to obtain the spatial feature map.
In one possible design, the method further includes:
obtaining a fifth value feature map according to the spatial feature map;
weighting the spatial similarity and the fifth value feature map to update the guide value feature map;
and merging the guide value feature map with the first value feature map to update the spatial feature map.
In a second aspect, embodiments of the present application provide an image processing apparatus configured to perform the image processing method of the first aspect or any of the possible designs of the first aspect.
In a third aspect, an embodiment of the present application provides an image processing apparatus, which includes a memory and a processor. The memory is configured to store instructions; the processor executes the instructions stored in the memory to cause the apparatus to perform the image processing method of the first aspect or any possible design of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the image processing method of the first aspect or any of the possible designs of the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on an apparatus, cause the apparatus to perform the method of image processing of the first aspect or any of the possible designs of the first aspect.
It is to be understood that, for the beneficial effects of the second aspect to the fifth aspect, reference may be made to the relevant description in the first aspect and any possible design of the first aspect, and details are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a block diagram illustrating an architecture of an image processing model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic block diagram of an image processing method according to an embodiment of the present application;
fig. 5 is a block diagram illustrating an image processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The application provides an image processing method, an image processing device, an image processing apparatus, a computer readable storage medium and a computer program product, and provides the following inventive concepts:
for any frame image to be detected, a large number of memory frame images are used to attend, in the temporal dimension, to the similarity between local regions of the images, so that the features of the target in that frame image can be obtained while reducing the computational complexity; and an image with little or no variation from that frame image is used to attend, in the spatial dimension, to the overall/global similarity between the images, so that the overall features of that frame image can be obtained.
Therefore, by decoupling the temporal and spatial dimensions and running these two lightweight processes in parallel, the region where the target is located in the frame image can be obtained automatically based on the features of the target in the frame image and the overall features of the frame image, which solves the technical problems in the related art of a large amount of computation and the inability to meet the requirement of rapid segmentation, and achieves the technical effects of improving the detection speed and accuracy of the target and reducing the computational cost.
The image processing method is applicable to various video/image fields. For example, the image processing method of the present application is suitable for the imaging field of clinical medicine, and provides more possibilities for the application of the field.
For convenience of explanation, the above-described target is exemplified by a lesion in the present application. Correspondingly, the region where the target is located is the mask of the lesion. Therefore, the method is beneficial to quickly and accurately distinguishing the lesion area from the non-lesion area.
Referring to fig. 1, fig. 1 is a block diagram illustrating an architecture of an image processing model according to an embodiment of the present disclosure.
As shown in fig. 1, the image processing model may include: an encoder (Enc), a key-value pair output module, a region block segmentation module, a temporal attention module (Temporal Transformer), a spatial attention module (Spatial Transformer), a merging module (match module), a feature fusion module (Fea), and a decoder (Dec).
The encoder is used for extracting the feature map of any frame image. The number and type of encoders are not limited in the present application. For convenience of explanation, the encoders in fig. 1 are illustrated by taking Enc1, Enc2 and Enc3 as examples.
For example, the encoder of the present application may employ ResNet50 as the backbone network. The encoder of the present application may also adopt other existing feature extraction networks, such as a residual network (ResNet-152), a neural network (AlexNet), a deep convolutional neural network (VGGNet), a convolutional neural network (Inception), and the like.
It should be noted that each frame image input to the encoder may include four channels, i.e., three channels corresponding to color and one channel corresponding to the mask with color removed. And/or, each frame image input to the encoder may also include only the one channel corresponding to the mask with color removed.
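For illustration, a minimal sketch of how such a four-channel input could be assembled is given below. The use of PyTorch, the image resolution and the tensor layout are assumptions made for the example and are not specified by the application.

```python
import torch

# Hypothetical resolution; the application does not fix the image size.
H, W = 256, 256

rgb_frame = torch.rand(3, H, W)                        # three channels corresponding to color
lesion_mask = torch.randint(0, 2, (1, H, W)).float()   # one channel: the mask with color removed

# Four-channel input: the three color channels plus the one mask channel.
frame_with_mask = torch.cat([rgb_frame, lesion_mask], dim=0)   # shape (4, H, W)

# A frame supplied without color information would use only the mask channel.
mask_only_input = lesion_mask                                  # shape (1, H, W)
```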
The key-value pair output module includes two parallel convolution layers (Conv) and is used to receive the feature map output by the encoder and generate two feature maps, a Key and a Value, i.e., a pair consisting of a key feature map and a value feature map. For convenience of explanation, the key-value pair output modules in fig. 1 are illustrated by taking Conv11 and Conv12, Conv21 and Conv22, and Conv31 and Conv32 as examples.
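As a non-authoritative sketch, the key-value pair output module can be pictured as two parallel convolution layers applied to the same encoder feature map. The channel sizes and the 1x1 kernel below are assumptions; the application does not specify them.

```python
import torch
from torch import nn

class KeyValueHead(nn.Module):
    """Two parallel convolution layers that turn one encoder feature map into a
    (key, value) pair, in the spirit of Conv11/Conv12, Conv21/Conv22, Conv31/Conv32."""
    def __init__(self, in_channels=1024, key_channels=128, value_channels=512):
        super().__init__()
        self.key_conv = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels, value_channels, kernel_size=1)

    def forward(self, feat):                      # feat: (B, C, H, W) from the encoder
        return self.key_conv(feat), self.value_conv(feat)

# Example: a target-frame feature map yields the first key/value feature maps K_Q and V_Q.
head = KeyValueHead()
K_Q, V_Q = head(torch.rand(1, 1024, 16, 16))
```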
The region block segmentation module is used for dividing the key feature map and the value feature map into a preset number of non-overlapping region blocks respectively. Non-overlapping here means that the region blocks do not overlap one another. The specific value of the preset number is not limited in the present application. For convenience of explanation, the region block segmentation modules in fig. 1 are illustrated by taking block 1 and block 2 as examples.
The temporal attention module is a lightweight module that attends to local regions of the images and is used to calculate, according to the time dimension of the multiple frame images (i.e., the time sequence of image acquisition), the local similarity between each feature map corresponding to the multiple frame images and the feature map corresponding to any one frame image. Local similarity here refers to the similarity between partial regions of two feature maps. The number of temporal attention modules can be set to N, where N is a positive integer.
The spatial attention module is a lightweight module that attends to the global/overall region of the images and is used to calculate, in the spatial dimension, the spatial similarity between the feature maps corresponding to any two frame images. Spatial similarity here refers to the similarity between all regions (i.e., the entire regions) of two feature maps. The number of spatial attention modules can be set to M, where M is a positive integer.
In addition, M and N may be the same or different, and this is not limited in this application.
It should be noted that the temporal attention module and the spatial attention module are two parallel modules, the temporal attention module focuses on the temporal dimension, and the spatial attention module focuses on the spatial dimension.
And the merging module is used for merging a plurality of feature maps according to the channels of the feature maps.
And the characteristic fusion module is used for carrying out characteristic fusion on the plurality of characteristic graphs. For example, the feature fusion module may implement feature fusion using one convolutional layer.
The decoder is used for outputting the prediction result, denoted pred, restoring it to the size of the image input to the encoder, and thereby realizing segmentation of the image. Parameters such as the number and type of decoders are not limited in the present application.
In some embodiments, the decoder may include: a group of residual convolution blocks, an interpolation module and an adjustment module. The residual convolution blocks are used for enlarging the feature map, the interpolation module is used for processing the feature map, such as bilinear interpolation, and the adjustment module is used for adjusting the size of the feature map to be the same as that of the image input to the encoder.
It should be noted that the feature map output by the decoder is a black-and-white image, and the output channel of the decoder is the one channel corresponding to the mask with color removed.
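The following is a minimal, hedged sketch of such a decoder. The number of residual convolution blocks, the channel counts and the upsampling factor are assumptions; only the general flow (residual refinement, bilinear interpolation, and resizing to the encoder input size with a one-channel mask output) follows the description above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Residual refinement, bilinear interpolation, and a final size adjustment
    producing a one-channel (mask) prediction."""
    def __init__(self, in_channels=256, mid_channels=128):
        super().__init__()
        self.res_block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_channels, mid_channels, 1)
        self.head = nn.Conv2d(mid_channels, 1, 3, padding=1)     # one-channel mask output

    def forward(self, fused_feat, input_size):
        x = self.res_block(fused_feat) + self.skip(fused_feat)   # residual convolution block
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.head(x)
        # Adjustment module: resize to the same size as the image fed to the encoder.
        return F.interpolate(x, size=input_size, mode="bilinear", align_corners=False)

decoder = DecoderSketch()
pred = decoder(torch.rand(1, 256, 32, 32), input_size=(256, 256))   # (1, 1, 256, 256)
```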
In the following, the image processing method of the present application is explained in detail by taking an apparatus using the image processing model in fig. 1 as an example, and combining the drawings and application scenarios.
The apparatus may include a software module, or a physical device formed of software and/or hardware.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating an image processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the image processing method of the present application may include:
s201, acquiring a target frame image, a plurality of memory frame images and a guide frame image.
The target frame image may be a current frame image in the ultrasound video, that is, a current frame image that a user wants to test, and the current frame image t may be marked by Query.
The multiple Memory frame images are historical images acquired before the target frame image in the ultrasonic video, and a Memory can be used for marking T Memory frame images { T-1, …, T-T }, wherein T is a positive integer.
The plurality of memory frame images comprise the focus mask, so that local basis of the focus mask features (namely lesion areas) can be provided for the target frame images conveniently. In addition, the mask of the lesion in the multiple memory frame images may be located in the cross-frame images and/or in a local region in the images. The cross-frame image herein refers to a discontinuous multi-frame image.
The guidance frame image may be a frame image acquired before the target frame image in the ultrasound video, such as a Previous frame image, and the guidance frame image t-1 may be marked by Previous. The change difference (or referred to as motion) between the target frame image and the guide frame image can be within a preset range, that is, the two frame images have no change or small change difference, so that global/overall reference/guide of the features (namely, the lesion region and the non-lesion region) of all the regions is provided for the target frame image. The guide frame image is included in a plurality of memory frame images.
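For illustration, a minimal sketch of how the three kinds of frames could be selected from a video sequence is shown below; representing the video as a Python list and the variable names are assumptions made for the example.

```python
def select_frames(video_frames, t, T):
    """video_frames: list of frames; t: index of the target (Query) frame, with t > T;
    T: number of memory frames to use."""
    query = video_frames[t]                                   # target frame image t (Query)
    memory = [video_frames[t - z] for z in range(1, T + 1)]   # memory frames {t-1, ..., t-T} (Memory)
    previous = video_frames[t - 1]                            # guide frame image t-1 (Previous)
    return query, memory, previous
```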
The parameters such as formats and contents of the target frame image, the multiple memory frame images and the guide frame image are not limited by the application.
S202, according to the time dimension of the plurality of memory frame images, local similarity calculation is performed on the target frame image and each memory frame image to obtain a temporal feature map.
It can be understood that the time dimension of the plurality of memory frame images refers to the time sequence in which the plurality of memory frame images were acquired, and may correspond to their acquisition times from earliest to latest.
Therefore, in the temporal dimension, the image processing model can be used to perform local similarity calculation on each corresponding partial region of the target frame image and each memory frame image to obtain the temporal feature map.
The temporal feature map can be used to represent the local similarity between the target frame image and each memory frame image, that is, the similarity between each partial region of the target frame image and the corresponding partial region of each memory frame image.
S203, spatial similarity calculation is performed on the target frame image and the guide frame image to obtain a spatial feature map.
The spatial feature map can be used to represent the spatial similarity between the target frame image and the guide frame image, that is, the similarity between all regions of the target frame image and all regions of the guide frame image.
Therefore, in the spatial dimension, the image processing model can be used to perform spatial similarity calculation on all regions of the target frame image and the guide frame image to obtain the spatial feature map.
It should be noted that there is no chronological sequence between S202 and S203, and S202 and S203 may be executed simultaneously or sequentially.
In some embodiments, the present application may perform S202 and S203 in parallel. In this way, the computing resources of the computer can be effectively saved, the speed of detecting the lesion mask is increased, interference information is filtered, and the accuracy of detecting the lesion mask is improved.
And S204, a prediction mask of the lesion in the target frame image is determined according to the temporal feature map and the spatial feature map.
The image processing model is used to distinguish the lesion region and/or the non-lesion region of the target frame image according to the temporal feature map and the spatial feature map, thereby determining the prediction mask of the lesion in the target frame image.
In some embodiments, the image processing model may determine the prediction mask of the lesion in the target frame image using the merging module, the feature fusion module and the decoder shown in fig. 1.
The temporal feature map and the spatial feature map may be input into the merging module. The merging module can merge the temporal feature map and the spatial feature map and input the merged feature map into the feature fusion module.
For example, if the temporal feature map is represented by matrix 1 of size H × W × C1 and the spatial feature map is represented by matrix 2 of size H × W × C2, the merged feature map may be represented by matrix 3 of size H × W × (C1 + C2).
The feature fusion module performs feature fusion on the merged feature map and inputs the fused feature map into the decoder. The decoder decodes the fused feature map and can output the prediction mask of the lesion in the target frame image.
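A minimal sketch of this merging and fusion step is given below; the channel-first tensor layout and the channel counts are assumptions, and the single 3x3 convolution stands in for the feature fusion module.

```python
import torch
from torch import nn

B, H, W, C1, C2 = 1, 16, 16, 256, 256
y_T = torch.rand(B, C1, H, W)                    # temporal feature map (H x W x C1)
y_S = torch.rand(B, C2, H, W)                    # spatial feature map (H x W x C2)

merged = torch.cat([y_T, y_S], dim=1)            # merged feature map, H x W x (C1 + C2)
fusion = nn.Conv2d(C1 + C2, 256, kernel_size=3, padding=1)   # feature fusion as one convolution layer
fused = fusion(merged)                           # fused feature map passed on to the decoder
```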
In some embodiments, the group of residual convolution blocks in the decoder may progressively enlarge the feature map and input the enlarged feature map into the interpolation module in the decoder.
For two adjacent residual convolution blocks, the input of the latter residual convolution block is the connection of the feature map output by the former residual convolution block with the feature map of the target frame image output by the encoder.
The interpolation module in the decoder can perform bilinear interpolation on the enlarged feature map and input the interpolated feature map into the adjustment module in the decoder.
The adjustment module in the decoder adjusts the size of the interpolated feature map, that is, the size of the interpolated feature map is adjusted to be the same as that of the target frame image input to the encoder.
Thus, the adjustment module in the decoder outputs the prediction mask of the lesion in the target frame image. In addition, the present application can also use the prediction result output by the decoder and the ground truth of the corresponding image to estimate the recognition accuracy of the image processing model and adjust the model parameters of the image processing model, so that the accuracy of lesion mask recognition can be improved.
In some embodiments, according to the predicted value Ŷ corresponding to the prediction mask of the lesion in the target frame image and the true value (ground truth, GT) Y corresponding to the true mask of the lesion in the target frame image, the total loss L_all between the two can be determined.
The total loss L_all can be expressed by the following formula:
L_all = L_BCE(Ŷ, Y) + α · L_Dice(Ŷ, Y)
where L_BCE(Ŷ, Y) represents the binary cross-entropy loss, L_Dice(Ŷ, Y) represents the Dice loss, and the total loss L_all is the sum of the binary cross-entropy loss and the Dice loss.
m denotes each pixel in the prediction mask, Y+ represents the lesion region in the prediction mask, and Y- represents the non-lesion region in the prediction mask. α is a hyperparameter greater than 0 that weights the two losses, i.e., the binary cross-entropy loss and the Dice loss; its value is set to 1 in the present application.
It can be seen that the value of the total loss L_all can represent the recognition accuracy of the image processing model.
The smaller the value of the total loss L_all, the higher the degree of coincidence between the predicted value Ŷ corresponding to the prediction mask of the lesion in the target frame image and the true value Y corresponding to the true mask of the lesion in the target frame image, that is, the higher the recognition accuracy of the image processing model.
The larger the value of the total loss L_all, the lower the degree of coincidence between the predicted value Ŷ corresponding to the prediction mask of the lesion in the target frame image and the true value Y corresponding to the true mask of the lesion in the target frame image, that is, the lower the recognition accuracy of the image processing model.
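For illustration, a sketch of this total loss is given below. The exact binary cross-entropy and Dice formulations used by the application are carried by figures that are not reproduced here, so standard definitions are assumed.

```python
import torch

def total_loss(pred, gt, alpha=1.0, eps=1e-6):
    """L_all = BCE + alpha * Dice for a predicted lesion mask.
    pred holds probabilities in [0, 1]; gt is the binary ground-truth mask."""
    pred = pred.clamp(eps, 1 - eps)
    bce = -(gt * pred.log() + (1 - gt) * (1 - pred).log()).mean()    # binary cross-entropy loss
    inter = (pred * gt).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + gt.sum() + eps)     # Dice loss
    return bce + alpha * dice                                        # alpha weights the two losses (1 here)

loss = total_loss(torch.rand(1, 1, 64, 64),
                  torch.randint(0, 2, (1, 1, 64, 64)).float())
```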
In addition, compared with the STM algorithm provided in the related art, the image processing method provided by the present application achieves better detection accuracy and faster detection speed in comparisons based on metrics such as the Jaccard index, the F-measure, Precision and Recall. Moreover, the present application can achieve real-time detection.
According to the image processing method of the present application, the target frame image, the plurality of memory frame images and the guide frame image are acquired. From the perspective of decoupling the temporal and spatial dimensions, attention can be paid to local regions of the target frame image and the plurality of memory frame images: local similarity calculation is performed on the target frame image and each memory frame image to obtain a temporal feature map, so that the local similarity between the target frame image and the plurality of memory frame images with respect to the lesion mask can be determined. Meanwhile, attention can be paid to the global/overall regions of the target frame image and the guide frame image: overall similarity calculation is performed on the target frame image and the guide frame image to obtain a spatial feature map, so that the global similarity between the target frame image and the guide frame image with respect to the variation (such as static background texture) can be determined. The prediction mask of the lesion in the target frame image is then determined according to the temporal feature map and the spatial feature map. Therefore, in the present application, by decoupling the temporal and spatial dimensions and processing them in parallel, the lesion mask in the image can be accurately predicted, the computational complexity and time cost are reduced, computing resources are saved, interference information in the image is effectively filtered, the speed and accuracy of lesion-region detection are improved, and the requirement for real-time, low-latency detection is met.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an image processing method according to an embodiment of the present disclosure.
As shown in fig. 3, the image processing method of the present application may include:
s301, obtaining a target frame image, a plurality of memory frame images and a guide frame image, wherein the plurality of memory frame images comprise a focus mask, and the variation difference between the target frame image and the guide frame image is within a preset range.
S301 is similar to the implementation of S201 in the embodiment of fig. 2, and is not described herein again.
S302, obtaining a first key feature map and a first value feature map according to the target frame image, wherein the first key feature map corresponds to the first value feature map.
S303, obtaining a second key feature map and a second value feature map according to each memory frame image, wherein the second key feature map corresponds to the second value feature map.
S304, the first key feature map, the first value feature map, the plurality of second key feature maps and the plurality of second value feature maps are divided into a preset number of non-overlapping region blocks.
S305, according to the time sequence of the plurality of memory frame images, local similarity calculation is performed on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps to obtain the local similarity.
S306, a temporal feature map is obtained according to the local similarity, each region block in the first value feature map and each region block in the plurality of second value feature maps.
S307, a third value feature map is obtained according to the temporal feature map; the third value feature map is divided into a preset number of non-overlapping region blocks; and the temporal feature map is updated according to the local similarity, each region block in the first value feature map and each region block in the third value feature map.
And S308, obtaining a fourth key feature map and a fourth value feature map according to the guide frame image, wherein the fourth key feature map corresponds to the fourth value feature map.
S309, performing spatial similarity calculation on the first key feature map and the fourth key feature map to obtain the spatial similarity.
S310, weighting the spatial similarity and the fourth value feature map to obtain a guide value feature map.
And S311, combining the guide value feature map with the first value feature map to obtain a spatial feature map.
S312, obtaining a fifth value feature map according to the spatial feature map; weighting the spatial similarity and the fifth value feature map to update the guide value feature map; and merging the guide value feature map with the first value feature map to update the spatial feature map.
And S313, determining a prediction mask of the lesion in the target frame image according to the temporal feature map and the spatial feature map.
S313 is similar to the implementation manner of S204 in the embodiment of fig. 2, and is not described herein again.
Wherein, S302-S306 can be executed sequentially, and S308-S311 can be executed sequentially.
It should be noted that there is no chronological sequence between S302-S306 and S308-S311, and S302-S306 and S308-S311 may be executed simultaneously or sequentially.
In S302-S306, the image processing model may utilize the encoder, the key-value pair output module, the region block segmentation module and the temporal attention module shown in fig. 1 to obtain the temporal feature map.
In S302, the target frame image is input into Enc2. The feature layer in Enc2 extracts a feature map of the target frame image and inputs the feature map into Conv21 and Conv22. Conv21 and Conv22 respectively generate a pair of key feature map and value feature map, i.e., the first key feature map and the first value feature map, which correspond to each other.
The first key feature map may be denoted K_Q, and the first value feature map may be denoted V_Q.
In S303, each memory frame image is input into Enc3. The feature layer in the encoder Enc3 extracts a feature map of the memory frame image and inputs the feature map into Conv31 and Conv32. Conv31 and Conv32 respectively generate a pair of key feature map and value feature map, i.e., a second key feature map and a second value feature map.
The second key feature map may be denoted K_M, and the second value feature map may be denoted V_M.
In S304, the first key feature map K_Q and the first value feature map V_Q are input into block 1, and the plurality of second key feature maps K_M and the corresponding plurality of second value feature maps V_M are input into block 2.
block 1 divides the first key feature map K_Q along its height and width into S² non-overlapping region blocks; the region blocks into which the first key feature map is divided may be denoted K_Q^{i,j}.
block 1 also divides the first value feature map V_Q along its height and width into S² non-overlapping region blocks; the region blocks into which the first value feature map is divided may be denoted V_Q^{i,j}.
block 2 divides the plurality of second key feature maps K_M along their height and width into S² non-overlapping region blocks; the region blocks into which the plurality of second key feature maps are divided may be denoted K_M^{i,j,z}.
block 2 divides the plurality of second value feature maps V_M along their height and width into S² non-overlapping region blocks; the region blocks into which the plurality of second value feature maps are divided may be denoted V_M^{i,j,z}.
Here S is a positive integer, z ∈ [1, T] indexes each of the T memory frame images, and i, j ∈ [1, S] are the indices of the region blocks, where i is a positive integer with 1 ≤ i ≤ S and j is a positive integer with 1 ≤ j ≤ S.
In addition, for any one feature map, any two of its S² region blocks do not overlap.
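A minimal sketch of this region-block division is shown below; the channel-first layout and the requirement that the height and width be divisible by S are assumptions made for the example.

```python
import torch

def split_into_blocks(feat, S):
    """Split a (C, H, W) feature map along its height and width into S*S
    non-overlapping region blocks, returned as a (S, S, C, H//S, W//S) tensor
    indexed by {i, j}."""
    C, H, W = feat.shape
    h, w = H // S, W // S
    blocks = feat.reshape(C, S, h, S, w)     # split H into (S, h) and W into (S, w)
    return blocks.permute(1, 3, 0, 2, 4)     # (S, S, C, h, w): block {i, j} is blocks[i, j]

K_Q = torch.rand(128, 16, 16)
K_Q_blocks = split_into_blocks(K_Q, S=4)     # region blocks K_Q^{i,j}, with i, j in [1, S]
```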
In S305, the region blocks in the plurality of second key feature maps are sorted in chronological order according to the same index to obtain a first set. The first set K_M^{i,j} can be expressed as:
K_M^{i,j} = {K_M^{i,j,1}, …, K_M^{i,j,T}}
where K_M^{i,j,1} denotes the region block with index {i, j} in the 1st second key feature map, and K_M^{i,j,T} denotes the region block with index {i, j} in the T-th second key feature map.
Further, local similarity calculation is performed on the S² region blocks in the first key feature map and the region blocks corresponding to the same index in each second key feature map in the first set to obtain the local similarity.
For any region block of each second key feature map in the first set, the region block with the same index is found among the S² region blocks of the first key feature map, and dot-product calculation and normalization are performed on the two region blocks with the same index to obtain the local similarity of the two region blocks with the same index.
The local similarity f(K_Q^{i,j}, K_M^{i,j}) can be expressed by the following formula:
f(K_Q^{i,j}, K_M^{i,j}) = Softmax(K_Q^{i,j} · K_M^{i,j})
where Softmax denotes the normalization, exp denotes the exponential function with the natural constant e as its base (used inside the Softmax), K_Q^{i,j} · K_M^{i,j} denotes the dot-product calculation between the region blocks with the same index in the first key feature map and the plurality of second key feature maps, K_Q^{i,j} denotes the region block with index {i, j} in the first key feature map, and K_M^{i,j} denotes the region block with index {i, j} in the second key feature maps.
In addition, the present application is not limited to the above-described implementation. For example, the apparatus may further perform local similarity calculation on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps according to the time sequence of the plurality of second key feature maps.
In S306, the local similarity is used as the weight of the plurality of second value feature maps, that is, the local similarity corresponding to each region block with the same index is used as the weight of each region block with the same index in each second value feature map. Each region block in the plurality of second value feature maps is weighted according to the local similarity, and the memory value feature map is obtained by calculation.
The memory value feature map V_T includes a memory value feature map for each region block with the same index. The memory value feature map V_T^{i,j} of the region block with index {i, j} is expressed as:
V_T^{i,j} = Σ_{z=1}^{T} f(K_Q^{i,j}, K_M^{i,j,z}) · V_M^{i,j,z}
where f(K_Q^{i,j}, K_M^{i,j,z}) represents the local similarity and V_M^{i,j,z} represents the region block with index {i, j} in the z-th second value feature map.
Further, according to the same region-block index, the memory value feature map V_T and each region block in the first value feature map V_Q are merged to obtain the temporal feature map y_T.
The temporal feature map y_T is expressed as:
y_T = [V_Q, V_T].
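The per-block temporal attention read can be sketched as follows. Treating every position inside a block as a matching unit and normalising the softmax over all memory positions are assumptions about details carried by the original figures; the overall flow (dot product, softmax, weighting of the memory value blocks, concatenation with V_Q) follows the description above.

```python
import torch
import torch.nn.functional as F

def temporal_attention_block(kq_ij, vq_ij, km_ij, vm_ij):
    """One region-block index {i, j}:
    kq_ij: (C_k, h, w)    key block of the target frame, K_Q^{i,j}
    vq_ij: (C_v, h, w)    value block of the target frame, V_Q^{i,j}
    km_ij: (T, C_k, h, w) key blocks of the T memory frames (the first set)
    vm_ij: (T, C_v, h, w) value blocks of the T memory frames"""
    T, C_k, h, w = km_ij.shape
    C_v = vm_ij.shape[1]
    q = kq_ij.flatten(1)                                       # (C_k, P), P = h*w positions
    m = km_ij.permute(1, 0, 2, 3).reshape(C_k, T * h * w)      # (C_k, T*P) memory positions
    sim = F.softmax(q.t() @ m, dim=-1)                         # local similarity, normalised per query position
    v = vm_ij.permute(1, 0, 2, 3).reshape(C_v, T * h * w)      # (C_v, T*P)
    v_t = (v @ sim.t()).reshape(C_v, h, w)                     # memory value block V_T^{i,j}
    return torch.cat([vq_ij, v_t], dim=0)                      # y_T^{i,j} = [V_Q^{i,j}, V_T^{i,j}]

y_T_block = temporal_attention_block(torch.rand(128, 4, 4), torch.rand(512, 4, 4),
                                     torch.rand(5, 128, 4, 4), torch.rand(5, 512, 4, 4))
```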
S307 is an optional step.
In S307, the number of temporal attention modules is N, and the present application does not limit the specific value of N.
When N is equal to 1, the temporal attention module may output the temporal feature map.
When N is greater than 1, the 1st temporal attention module may output a temporal feature map. For any temporal attention module from the 2nd temporal attention module to the N-th temporal attention module, the target frame image and the temporal feature map output by the previous temporal attention module are input into that temporal attention module. That temporal attention module may update the temporal feature map according to the target frame image and the temporal feature map output by the previous temporal attention module.
In this way, through the N-1 stacking passes based on the temporal attention module, the output temporal feature map is more expressive.
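A short sketch of this stacked refinement, assuming a hypothetical list of already-built temporal attention modules, is given below.

```python
def run_temporal_stack(temporal_attention_modules, target_feat, memory_feats):
    """Refine the temporal feature map with N stacked temporal attention modules:
    from the 2nd module on, each module receives the target-frame features together
    with the temporal feature map output by the previous module."""
    y_T = temporal_attention_modules[0](target_feat, memory_feats)
    for module in temporal_attention_modules[1:]:     # N-1 refinement passes
        y_T = module(target_feat, y_T)                # update the temporal feature map
    return y_T
```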
The following describes in detail the implementation of S302-S306 in fig. 3 by using 1 time attention module in conjunction with fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram illustrating an image processing method according to an embodiment of the present disclosure.
As shown in fig. 4, in the temporal attention module, dot-product calculation and a softmax operation are performed on the first key feature map K_Q and each second key feature map K_M^{i,j,z} to obtain the local similarity between the first key feature map K_Q and that second key feature map K_M^{i,j,z}.
The local similarity between the first key feature map K_Q and the second key feature map K_M^{i,j,z} is used as the weight of the corresponding second value feature map V_M^{i,j,z}, and dot-product calculation is performed with the corresponding second value feature map V_M^{i,j,z} to obtain the memory value feature map V_T.
The memory value feature map V_T and the first value feature map V_Q are merged to obtain the temporal feature map y_T.
In the present application, the temporal feature map of the local regions between images can be acquired through the temporal attention module, which can improve the detection accuracy while reducing the amount and complexity of computation.
In S308-S311, the image processing model may utilize the encoder, the key-value pair output module, and the spatial attention module shown in FIG. 1 to obtain a spatial feature map.
In S308, the guide frame image is input into Enc1. The feature layer in Enc1 extracts a feature map of the guide frame image and inputs the feature map into Conv11 and Conv12. Conv11 and Conv12 respectively generate a pair of key feature map and value feature map, i.e., the fourth key feature map and the fourth value feature map, which correspond to each other.
The fourth key feature map may be denoted K_P, and the fourth value feature map may be denoted V_P.
In S309, global similarity calculation is performed on the first key feature map and the fourth key feature map to obtain the spatial similarity.
In some embodiments, dot-product calculation and normalization are performed on the first key feature map and the fourth key feature map to obtain the corresponding spatial similarity.
The spatial similarity f(K_Q, K_P) is expressed as:
f(K_Q, K_P) = Softmax(K_Q · K_P)
where Softmax denotes the normalization, exp denotes the exponential function with the natural constant e as its base (used inside the Softmax), and K_Q · K_P denotes the dot-product calculation between the first key feature map and the fourth key feature map.
In S310, the spatial similarity is used as the weight of the fourth value feature map. The fourth value feature map is weighted according to the spatial similarity, and the guide value feature map is obtained by calculation.
The guide value feature map V_S is expressed as:
V_S = f(K_Q, K_P) V_P
where f(K_Q, K_P) represents the spatial similarity and V_P represents the fourth value feature map.
In S311, the guide value feature map and the first value feature map are merged to obtain the spatial feature map.
The spatial feature map y_S is expressed as:
y_S = [V_Q, V_S] = [V_Q, f(K_Q, K_P) V_P].
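A compact sketch of this spatial attention read is given below; treating every spatial position as a matching unit is an assumption about the layout of the original formulas.

```python
import torch
import torch.nn.functional as F

def spatial_attention(K_Q, V_Q, K_P, V_P):
    """f(K_Q, K_P) = Softmax(K_Q . K_P), V_S = f(K_Q, K_P) V_P, y_S = [V_Q, V_S].
    All tensors are (C, H, W); the guide frame contributes K_P and V_P."""
    q = K_Q.flatten(1).t()                       # (H*W, C_k) target-frame positions
    p = K_P.flatten(1)                           # (C_k, H*W) guide-frame positions
    sim = F.softmax(q @ p, dim=-1)               # spatial similarity over all guide positions
    v_p = V_P.flatten(1).t()                     # (H*W, C_v)
    V_S = (sim @ v_p).t().reshape(V_P.shape)     # guide value feature map V_S
    return torch.cat([V_Q, V_S], dim=0)          # spatial feature map y_S = [V_Q, V_S]

y_S = spatial_attention(torch.rand(128, 16, 16), torch.rand(512, 16, 16),
                        torch.rand(128, 16, 16), torch.rand(512, 16, 16))
```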
S312 is an optional step.
In S312, the number of spatial attention modules is M, and the present application does not limit the specific value of M.
When M is equal to 1, the spatial attention module may output the spatial feature map.
When M is greater than 1, the 1st spatial attention module may output a spatial feature map. For any spatial attention module from the 2nd spatial attention module to the M-th spatial attention module, the target frame image and the spatial feature map output by the previous spatial attention module are input into that spatial attention module. That spatial attention module can update the spatial feature map according to the target frame image and the spatial feature map output by the previous spatial attention module.
In this way, through the M-1 stacking passes based on the spatial attention module, the output spatial feature map is more expressive.
The following describes in detail the implementation of S308-S311 in fig. 3 by using 1 spatial attention module in conjunction with fig. 5.
Referring to fig. 5, fig. 5 is a block diagram illustrating an image processing method according to an embodiment of the present disclosure.
As shown in fig. 5, in the spatial attention module, dot-product calculation and a softmax operation are performed on the first key feature map K_Q and the fourth key feature map K_P to obtain the spatial similarity between the first key feature map K_Q and the fourth key feature map K_P.
The spatial similarity between the first key feature map K_Q and the fourth key feature map K_P is used as the weight of the fourth value feature map V_P, and dot-product calculation is performed with the fourth value feature map V_P to obtain the guide value feature map V_S.
The guide value feature map V_S and the first value feature map V_Q are merged to obtain the spatial feature map y_S.
In the present application, the spatial feature map of similar images can be acquired through the spatial attention module, which is beneficial to improving the detection accuracy.
Illustratively, the embodiment of the application also provides an image processing device.
Fig. 6 is a schematic structural diagram of an image processing apparatus provided in an embodiment of the present application, where the image processing apparatus 10 is configured to implement operations corresponding to a device that can use the image processing model shown in fig. 1 in any of the method embodiments described above, and as shown in fig. 6, the image processing apparatus 10 may include: an acquisition module 11, a first calculation module 12, a second calculation module 13 and a determination module 14.
The acquisition module 11 is configured to acquire a target frame image, a plurality of memory frame images and a guide frame image, where the plurality of memory frame images include a lesion mask, and the variation difference between the target frame image and the guide frame image is within a preset range;
the first calculation module 12 is configured to perform local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images, to obtain a temporal feature map;
the second calculation module 13 is configured to perform spatial similarity calculation on the target frame image and the guide frame image to obtain a spatial feature map;
and the determination module 14 is configured to determine a prediction mask of the lesion in the target frame image according to the temporal feature map and the spatial feature map.
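As a purely structural sketch (not the patent's implementation), the apparatus of fig. 6 can be thought of as four callables wired in sequence; all class, attribute, and method names below are illustrative assumptions.

from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class ImageProcessingApparatus:
    """Structural sketch of image processing apparatus 10 in fig. 6."""
    acquire: Callable[[], Tuple[Any, Any, Any]]    # acquisition module 11
    temporal_attention: Callable[[Any, Any], Any]  # first calculation module 12
    spatial_attention: Callable[[Any, Any], Any]   # second calculation module 13
    predict_mask: Callable[[Any, Any], Any]        # determination module 14

    def run(self) -> Any:
        target, memory_frames, guide = self.acquire()
        temporal_map = self.temporal_attention(target, memory_frames)
        spatial_map = self.spatial_attention(target, guide)
        return self.predict_mask(temporal_map, spatial_map)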
In some embodiments, the first calculation module 12 is configured to:
obtaining a first key feature map and a first value feature map according to the target frame image, wherein the first key feature map corresponds to the first value feature map;
obtaining a second key feature map and a second value feature map according to each memory frame image, wherein the second key feature map corresponds to the second value feature map;
dividing the first key feature map, the first value feature map, the plurality of second key feature maps and the plurality of second value feature maps into a preset number of non-overlapping region blocks;
according to the time sequence of the plurality of memory frame images, local similarity calculation is carried out on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps to obtain local similarity;
and obtaining the temporal feature map according to the local similarity, each region block in the first value feature map and each region block in the plurality of second value feature maps.
In some embodiments, the first calculation module 12 is specifically configured to:
sorting the region blocks with the same index in the plurality of second key feature maps in time sequence to obtain a first set;
and carrying out local similarity calculation and normalization processing on each region block in the first key feature map and the region blocks corresponding to the same index in the first set to obtain local similarity.
In some embodiments, the first calculation module 12 is specifically configured to:
performing weighted calculation on the local similarity and each region block in the plurality of second value feature maps to obtain a memory value feature map;
and merging the memory value feature map with each region block in the first value feature map according to the same region block index to obtain the temporal feature map.
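For illustration only, the following NumPy sketch shows one plausible reading of the block-wise temporal attention performed by the first calculation module 12, assuming each feature map has been split into B non-overlapping blocks of P positions each and reusing the softmax helper from the earlier sketch; all names and shapes are hypothetical.

import numpy as np

def local_temporal_attention(k_q_blocks, v_q_blocks, k_mem_blocks, v_mem_blocks):
    """
    k_q_blocks, v_q_blocks:     (B, P, C) target-frame key/value region blocks
    k_mem_blocks, v_mem_blocks: (T, B, P, C) memory-frame blocks, ordered by time
    Returns (B, P, 2C) region blocks of the temporal feature map.
    """
    T, B, P, C = k_mem_blocks.shape
    out_blocks = []
    for b in range(B):                                   # same block index on every frame
        k_m = k_mem_blocks[:, b].reshape(T * P, C)       # the "first set": blocks stacked over time
        v_m = v_mem_blocks[:, b].reshape(T * P, C)
        sim = softmax(k_q_blocks[b] @ k_m.T, axis=-1)    # local similarity, (P, T*P)
        mem_value = sim @ v_m                            # memory value block, (P, C)
        # merge with the target-frame value block of the same index
        out_blocks.append(np.concatenate([v_q_blocks[b], mem_value], axis=-1))
    return np.stack(out_blocks)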
In some embodiments, the second calculation module 13 is configured to:
obtaining a fourth key feature map and a fourth value feature map according to the guide frame image, wherein the fourth key feature map corresponds to the fourth value feature map;
performing spatial similarity calculation on the first key feature map and the fourth key feature map to obtain spatial similarity;
weighting the spatial similarity and the fourth value feature map to obtain a guide value feature map;
and merging the guide value feature map and the first value feature map to obtain a spatial feature map.
Fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 7, on the basis of the structure shown in fig. 6, the image processing apparatus 10 may further include: a time update module 15 and a space update module 16.
In some embodiments, the time update module 15 is configured to:
obtaining a third value feature map according to the temporal feature map;
dividing the third value feature map into a preset number of non-overlapping region blocks;
the temporal feature map is updated according to the local similarity, each region block in the first value feature map, and each region block in the third value feature map.
In some embodiments, the space update module 16 is configured to:
obtaining a fifth value feature map according to the spatial feature map;
weighting the spatial similarity and the fifth value feature map, and updating the guide value feature map;
and merging the guide value feature map and the first value feature map, and updating the spatial feature map.
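The space update module 16 can be sketched as re-weighting a re-encoded spatial feature map with the saved spatial similarity. The snippet below is an assumption-laden illustration that reuses the hypothetical project helper from the earlier sketch; the time update module 15 would follow an analogous pattern on region blocks.

import numpy as np

def update_spatial_map(spatial_sim, v_q, spatial_map):
    """spatial_sim: (HW, HW) saved spatial similarity; v_q: (HW, C) first value feature map."""
    v5 = project(spatial_map, out_dim=v_q.shape[-1])   # fifth value feature map from y_S
    v_s = spatial_sim @ v5                             # re-weighted guide value feature map
    return np.concatenate([v_q, v_s], axis=-1)         # updated spatial feature map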
The image processing apparatus of the present application may be configured to execute the technical solutions of the corresponding devices in the method embodiments shown in fig. 1 to fig. 5, with similar implementation principles and technical effects. For the operations implemented by each module, reference may further be made to the relevant description of the method embodiments, which is not repeated here. The modules herein may also be replaced with components or circuits.
The present application may divide the image processing apparatus into functional modules according to the above method examples; for example, each functional module may be divided according to a corresponding function, or two or more functions may be integrated into one processing module. In one implementation, the apparatus is part of an operating system. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of the modules in the embodiments of the present application is schematic and is merely a division of logical functions; other division manners are possible in actual implementation.
Fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, and as shown in fig. 8, the image processing apparatus 20 includes:
a memory 21 for storing program instructions; the memory 21 may be a flash memory. The memory 21 also stores the operating system of the device and applications within the device.
And a processor 22, configured to call and execute the program instructions in the memory to implement the steps performed by the corresponding devices in the image processing methods of fig. 1-5. Reference may be made in particular to the description of the foregoing method embodiments.
An input/output interface 23 may also be included. The input/output interface 23 may include a separate output interface and input interface, or may be an integrated interface that integrates input and output. The output interface is used for outputting data, and the input interface is used for acquiring input data; the output data is a general term for the data output in the method embodiments, and the input data is a general term for the data input in the method embodiments.
The image processing device 20 may be configured to perform various steps and/or processes corresponding to the respective devices in the above method embodiments.
The application also provides a readable storage medium, wherein the readable storage medium stores an execution instruction, and when at least one processor of the terminal device executes the execution instruction, the terminal device executes the image processing method in the above method embodiment.
The present application also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the apparatus may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the apparatus to implement the image processing method in the above method embodiment.
The application also provides a chip, wherein the chip is connected with the memory, or the chip is integrated with the memory, and when a software program stored in the memory is executed, the image processing method in the embodiment of the method is realized.
Those of ordinary skill in the art will understand that, in the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that includes one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Claims (10)

1. An image processing method, comprising:
acquiring a target frame image, a plurality of memory frame images and a guide frame image, wherein the plurality of memory frame images comprise a lesion mask, and the variation difference between the target frame image and the guide frame image is within a preset range;
performing local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images to obtain a temporal feature map;
performing spatial similarity calculation on the target frame image and the guide frame image to obtain a spatial feature map;
and determining a prediction mask of the lesion in the target frame image according to the temporal feature map and the spatial feature map.
2. The method according to claim 1, wherein the performing local similarity calculation on the target frame image and each memory frame image according to the time dimension of the plurality of memory frame images to obtain a temporal feature map comprises:
obtaining a first key feature map and a first value feature map according to the target frame image, wherein the first key feature map corresponds to the first value feature map;
obtaining a second key feature map and a second value feature map according to each memory frame image, wherein the second key feature map corresponds to the second value feature map;
dividing the first key feature map, the first value feature map, the plurality of second key feature maps, and the plurality of second value feature maps into a preset number of non-overlapping region blocks, respectively;
according to the time sequence of the plurality of memory frame images, local similarity calculation is carried out on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps to obtain local similarity;
and obtaining the temporal feature map according to the local similarity, each region block in the first value feature map and each region block in the plurality of second value feature maps.
3. The method according to claim 2, wherein the performing local similarity calculation on each region block in the first key feature map and the region blocks corresponding to the same index in the plurality of second key feature maps according to the time sequence of the plurality of memory frame images to obtain local similarity comprises:
sorting the region blocks with the same index in the plurality of second key feature maps in time sequence to obtain a first set;
and performing local similarity calculation and normalization processing on each region block in the first key feature map and the region blocks corresponding to the same index in the first set to obtain the local similarity.
4. The method according to claim 2, wherein the obtaining the temporal feature map according to the local similarity, each region block in the first value feature map, and each region block in the plurality of second value feature maps comprises:
performing weighted calculation on the local similarity and each region block in the plurality of second value feature maps to obtain a memory value feature map;
and merging the memory value feature map with each region block in the first value feature map according to the same region block index to obtain the temporal feature map.
5. The method of any one of claims 2-4, further comprising:
obtaining a third value feature map according to the temporal feature map;
dividing the third value feature map into the preset number of non-overlapping region blocks;
and updating the temporal feature map according to the local similarity, each region block in the first value feature map, and each region block in the third value feature map.
6. The method according to claim 2, wherein the performing spatial similarity calculation on the target frame image and the guide frame image to obtain a spatial feature map comprises:
obtaining a fourth key feature map and a fourth value feature map according to the guide frame image, wherein the fourth key feature map corresponds to the fourth value feature map;
performing spatial similarity calculation on the first key feature map and the fourth key feature map to obtain spatial similarity;
weighting the spatial similarity and the fourth value feature map to obtain a guide value feature map;
and combining the guide value feature map with the first value feature map to obtain the spatial feature map.
7. The method of claim 6, wherein the method further comprises:
obtaining a fifth value feature map according to the spatial feature map;
weighting the spatial similarity and the fifth value feature map, and updating the guide value feature map;
and merging the guide value feature map with the first value feature map, and updating the spatial feature map.
8. An image processing apparatus comprising means for performing the image processing method of any one of claims 1-7.
9. An image processing apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the image processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the image processing method according to any one of claims 1 to 7.
CN202210590677.XA 2022-05-27 2022-05-27 Image processing method and device Pending CN114926479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210590677.XA CN114926479A (en) 2022-05-27 2022-05-27 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210590677.XA CN114926479A (en) 2022-05-27 2022-05-27 Image processing method and device

Publications (1)

Publication Number Publication Date
CN114926479A true CN114926479A (en) 2022-08-19

Family

ID=82810676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210590677.XA Pending CN114926479A (en) 2022-05-27 2022-05-27 Image processing method and device

Country Status (1)

Country Link
CN (1) CN114926479A (en)

Similar Documents

Publication Publication Date Title
US20210158533A1 (en) Image processing method and apparatus, and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
Klibisz et al. Fast, simple calcium imaging segmentation with fully convolutional networks
US11625433B2 (en) Method and apparatus for searching video segment, device, and medium
CN111862127A (en) Image processing method, image processing device, storage medium and electronic equipment
Dharejo et al. TWIST-GAN: Towards wavelet transform and transferred GAN for spatio-temporal single image super resolution
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN113095346A (en) Data labeling method and data labeling device
CN114419020B (en) Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
KR102606734B1 (en) Method and apparatus for spoof detection
CN108229495B (en) Target object detection method and device, electronic equipment and storage medium
CN113256592A (en) Training method, system and device of image feature extraction model
CN115272250B (en) Method, apparatus, computer device and storage medium for determining focus position
CN113780326A (en) Image processing method and device, storage medium and electronic equipment
CN115115855A (en) Training method, device, equipment and medium for image encoder
CN115438804A (en) Prediction model training method, device and equipment and image prediction method
Zhu et al. Srdd: a lightweight end-to-end object detection with transformer
CN117437423A (en) Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
Hosseinzadeh Kassani et al. Automatic polyp segmentation using convolutional neural networks
CN115761371A (en) Medical image classification method and device, storage medium and electronic equipment
CN114926479A (en) Image processing method and device
WO2023226009A1 (en) Image processing method and device
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN112785575A (en) Image processing method, device and storage medium
CN113095211B (en) Image processing method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination