CN112257728B - Image processing method, image processing apparatus, computer device, and storage medium - Google Patents

Image processing method, image processing apparatus, computer device, and storage medium

Info

Publication number
CN112257728B
Authority
CN
China
Prior art keywords
image
feature
instance
features
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011264341.1A
Other languages
Chinese (zh)
Other versions
CN112257728A
Inventor
余双
马锴
郑冶枫
刘含若
王宁利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011264341.1A priority Critical patent/CN112257728B/en
Publication of CN112257728A publication Critical patent/CN112257728A/en
Application granted granted Critical
Publication of CN112257728B publication Critical patent/CN112257728B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12Fingerprints or palmprints
    • G06V40/1347Preprocessing; Feature extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/193Preprocessing; Feature extraction

Abstract

The embodiment of the application discloses an image processing method and apparatus, a computer device, and a storage medium, which belong to computer vision technology in the field of artificial intelligence. The image processing method comprises the following steps: acquiring an image to be identified, and extracting image instance features of the image to be identified, wherein the image instance features comprise N original feature maps, and any feature map pixel of any original feature map corresponds to one instance of the image to be identified; extracting K local key instance features at K scales from the N original feature maps, and superposing the K local key instance features into multi-scale instance features of the image to be identified; extracting global instance weight features of the image to be identified from the N original feature maps; and identifying the multi-scale instance features and the global instance weight features to obtain an image identification result of the image to be identified. By means of the image recognition method and apparatus of the application, image recognition efficiency and accuracy can be improved.

Description

Image processing method, image processing apparatus, computer device, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method and apparatus, a computer device, and a storage medium.
Background
Image recognition refers to processing, analyzing, and understanding images to identify image content. Image recognition is widely applied to face recognition, expression recognition and the like in the security field, to traffic sign recognition and license plate number recognition in the traffic field, and to region-of-interest recognition and the like in the medical field.
At present, the main mode of image recognition is manual recognition, in which image contents are distinguished manually based on past experience and knowledge. However, manual recognition is not only inefficient, it is also strongly affected by subjective human perception, so that the recognition result may be inaccurate.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, computer equipment and a storage medium, which can improve the efficiency and accuracy of image recognition.
An embodiment of the present application provides an image processing method, including:
acquiring an image to be identified, and extracting image instance features of the image to be identified, wherein the image instance features comprise N original feature maps, any feature map pixel of any original feature map corresponds to one instance of the image to be identified, and N is a positive integer;
extracting K local key instance features at K scales from the N original feature maps, and superposing the K local key instance features into multi-scale instance features of the image to be identified, wherein K is a positive integer;
extracting global instance weight features of the image to be identified from the N original feature maps;
and identifying the multi-scale instance features and the global instance weight features to obtain an image identification result of the image to be identified.
An aspect of an embodiment of the present application provides an image processing apparatus, including:
an acquisition module, configured to acquire an image to be recognized and extract image instance features of the image to be recognized, wherein the image instance features comprise N original feature maps, any feature map pixel of any original feature map corresponds to one instance of the image to be recognized, and N is a positive integer;
a first extraction module, configured to extract K local key instance features at K scales from the N original feature maps;
a superposition module, configured to superpose the K local key instance features into multi-scale instance features of the image to be identified, wherein K is a positive integer;
a second extraction module, configured to extract global instance weight features of the image to be identified from the N original feature maps;
and a recognition module, configured to recognize the multi-scale instance features and the global instance weight features to obtain an image recognition result of the image to be identified.
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the method in the foregoing embodiments.
An aspect of the embodiments of the present application provides a computer storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method in the foregoing embodiments is performed.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium, and when the computer instructions are executed by a processor of a computer device, the computer instructions perform the methods in the embodiments described above.
According to the method and the device, the terminal automatically identifies the image to determine the image identification result of the image, manual participation is not needed, the image identification efficiency can be improved, subjective factor interference caused by manual identification can be avoided, the image identification accuracy is improved, and the image identification mode is enriched; moreover, the image recognition result is recognized by extracting the local features and the global features of the image to be recognized, and the local features and the global features are mutually assisted, so that the accuracy of image recognition can be improved; by introducing multi-instance learning and observing the weight distribution of each instance under global and local visual angles, the feature response of an image region with low background contrast can be captured, the feature expression capability is improved, and the accuracy of image identification is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a system architecture diagram of an image processing system according to an embodiment of the present application;
Figs. 2a-2b are schematic diagrams of an image processing scene provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an example provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a deep multi-instance learning model provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of determining an image recognition result according to an embodiment of the present disclosure;
FIG. 7 is a diagram of a network architecture for determining target fusion characteristics according to an embodiment of the present application;
Figs. 8a-8l are diagrams of instance response sets provided by embodiments of the present application;
fig. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The application relates to a Computer Vision technology (CV) belonging to artificial intelligence, in particular to image content identification in the Computer Vision technology, which can specifically identify an image label, a target area of an image and the like.
Computer vision is a science that studies how to make a machine "see". More specifically, it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as identifying, tracking and measuring targets, and to further process images so that the result is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes technologies such as image processing, image recognition, image semantic understanding, image retrieval, and simultaneous localization and mapping, and also includes common biometric technologies such as face recognition and fingerprint recognition.
The application can be applied to the following scenes: when the image label of the image (or the target area of the image) needs to be identified, the image to be identified is acquired, the image label of the image (or the target area of the image) is identified by adopting the depth multi-instance learning scheme of the application, and then the image can be classified based on the image label or segmented based on the target area.
Referring to fig. 1, fig. 1 is a system architecture diagram of image processing according to an embodiment of the present disclosure. The system architecture of the present application relates to a server 10d and a terminal device cluster, and the terminal device cluster may include: terminal device 10a, terminal device 10b, and terminal device 10c.
Taking the terminal device 10a as an example, the terminal device 10a acquires an image to be recognized and transmits the image to the server 10 d. The server 10d extracts image instance features of the image, which include N original feature maps. Extracting K local key example features under K scales from the N original feature maps, and superposing the K local key example features into multi-scale example features of the image to be identified. And extracting global example weight characteristics of the image to be identified from the N original characteristic graphs, and identifying the multi-scale example characteristics and the global example weight characteristics to obtain an image identification result of the image to be identified.
If the image recognition result is an image tag, the server 10d may perform image classification based on the image recognition result; if the image recognition result is an image pixel label, the server 10d may perform image segmentation based on the image recognition result; or the server 10d may directly issue the image recognition result to the terminal device 10a, and the terminal device 10a jointly displays the image to be recognized and the image recognition result on the screen.
The terminal device 10a, the terminal device 10b, the terminal device 10c, and the like shown in fig. 1 may be an intelligent device having an image processing function, such as a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a wearable device, and the like. The terminal device cluster and the server 10d may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The following description explains in detail, by way of example, how the server 10d recognizes an image tag of an image. Please refer to figs. 2a-2b, which are schematic diagrams of an image processing scene according to an embodiment of the present disclosure. As shown in fig. 2a, the server 10d acquires an image 20a to be recognized and inputs the image 20a into an image feature extraction model to extract an instance feature 20b of the image 20a. The image feature extraction model may include a plurality of convolutional layers. The instance feature 20b may include a plurality of feature maps, and the number of feature maps equals the number of categories to be classified (assuming that the number of categories to be classified is 3, the number of feature maps is 3). Each pixel of each feature map may correspond to an instance of the image 20a (colloquially, an instance of the image 20a is an image region, and the entire image 20a may be regarded as a bag), and the pixel value represents the score of the instance on the corresponding category. For example, if the pixel value of the first pixel of the first feature map is equal to 0.3, the score of the instance corresponding to the first pixel on the first category may be considered to be 0.3.
After the server 10d obtains the instance feature 20b, the contribution of each instance to each category to be classified is examined from the global aspect and the local aspect. Firstly, a key example in a plurality of examples is determined through local examination, and the specific process is as follows: a sliding window is set, and a plurality of scales are set. For the scale i, sliding windows on any feature map in the example features 20b, in each window, keeping the pixel value of the pixel of the i-th highest ranking, setting the pixel values of the remaining pixels to 0, and recombining the 3 feature maps with adjusted pixel values into the unit local example feature at the scale i by the server 10 d.
For each scale, the unit local instance feature corresponding to that scale is determined in the above manner, and the plurality of unit local instance features at the plurality of scales are superimposed into the local instance feature 20c. When the server 10d superimposes the plurality of unit local instance features at the plurality of scales, a registration-stacking mode is adopted, that is, the first feature map of the unit local instance feature at scale i, the first feature map of the unit local instance feature at scale i+1, and the first feature map of the unit local instance feature at scale i+2 are registered and stacked. It can be seen that, since only the pixel values are changed, the superimposed local instance feature 20c and the instance feature 20b are identical in both feature size and channel number, i.e., the local instance feature 20c also includes 3 feature maps.
The weights of a plurality of examples are determined through global examination, and the specific process is as follows:
server 10d reduces the dimension of example feature 20b to a weight matrix 20d by calling a single-channel 1 x 1 convolution kernel, the size of which is the same as the size of the feature map in example feature 20 b. Each feature value in the weighted feature 20d may be considered a weight for each instance of the image 20 a.
Thus, the server 10d obtains the local instance feature 20c and the weight matrix 20d, weights the local instance feature 20c and the weight matrix 20d, that is, performs matrix dot product operation on 3 feature maps in the local instance feature 20c and the weight matrix 20d respectively to obtain 3 weighted feature maps (the weighted feature maps may be referred to as a fusion feature map), and the server 10d combines the 3 fusion feature maps into the fusion instance feature 20 e. Analyzing the above process, it can be known that the fused instance feature can represent both the local instance feature and the global instance weight feature of the image 20a, and the image feature of the image 20a is extracted from both the local aspect and the global aspect.
As shown in fig. 2b, for each fused feature map, superimposing the pixel values of the pixels in the fused feature map results in a feature value, which may represent the initial probability of the image 20a on one category. The server determines the feature value of each fused feature map, i.e., determines the initial probabilities of the image 20a on the 3 categories, and the server may combine the 3 initial probabilities into the initial probability set 20f. The initial probability set 20f is activated by calling an activation function to obtain a target probability set, and each target probability in the target probability set represents the matching probability between the image 20a and one category. The server 10d may combine the target probability set and the 3 categories to obtain the image recognition result 20f of the image 20a. As can be seen from fig. 2b, the matching probability of the image 20a with the image category "fitness" is 0.8, the matching probability with the image category "running" is 0.7, and the matching probability with the image category "sports" is 0.6. The server 10d may use the image category corresponding to the maximum matching probability as the image tag of the image 20a, that is, the image tag of the image 20a is "fitness".
Specific processes of acquiring an image to be recognized (such as the image 20a in the above embodiment), extracting image instance features (such as the instance features 20b in the above embodiment), extracting multi-scale instance features (such as the local instance features 20c in the above embodiment), and determining an image recognition result (such as the image recognition result 20f in the above embodiment) may refer to the following embodiments corresponding to fig. 3 to 8 l.
Referring to fig. 3, fig. 3 is a schematic flow chart of an image processing method provided in an embodiment of the present application, where the following embodiment describes a server with better performance (such as the server 10d in the embodiment corresponding to fig. 2a to fig. 2 b) as an execution subject, and the embodiment mainly describes how to determine multi-scale instance features and global instance weight features of an image to be recognized, where the image processing method includes the following steps:
step S101, obtaining an image to be identified, extracting image instance characteristics of the image to be identified, wherein the image instance characteristics comprise N original characteristic graphs, any characteristic image pixel of any original characteristic graph corresponds to one instance of the image to be identified, and N is a positive integer.
Specifically, the server obtains an image to be identified (e.g., the image 20a in the embodiment corresponding to fig. 2a to fig. 2b). The image to be identified may be a medical image in the medical field, and may specifically be a retinal image; a lesion attribute and/or a lesion region of a lesion in the retinal image may be identified according to the scheme of the present application, for example, it may be identified that the retinal medical image belongs to a diabetic retinopathy lesion attribute, a glaucoma lesion attribute, or an age-related macular degeneration lesion attribute. In addition to identifying a lesion attribute and/or a lesion region, visual organs and visual organ regions in the retinal image may be identified, for example, an eyeball region, a lens region, an optic nerve region, etc. in the retinal image.
The image to be recognized can also be a natural image in the non-medical field, and the image category of the natural image can be recognized through the application, for example, when the image to be recognized is a facial expression image, the image category is recognized to be tense or frightened; when the image to be recognized is a traffic sign image, it is recognized whether the image category is a warning sign, a road sign, or the like.
A convolution feature extraction network is called to extract convolution features of the image to be identified. The convolution feature extraction network may comprise a plurality of convolutional layers, and hidden image features of the image to be identified can be extracted through the convolution kernels in each convolutional layer. The convolution features may be regarded as a plurality of convolution feature maps; for example, if the convolution features have a size of 50 × 50 × 10, they may be regarded as 10 convolution feature maps each having a size of 50 × 50. Generally, as the number of convolutional layers increases, the size of the extracted feature maps becomes smaller, but the number of feature maps (that is, the number of channels of the features) increases.
The server converts the convolution features into image instance features of the image to be recognized (such as the instance features 20b in the embodiment corresponding to figs. 2a-2b) through a 1 × 1 convolution kernel in the conversion layer. Convolving the convolution features with a 1 × 1 convolution kernel does not change the size of the feature maps but only changes the number of feature maps, so as to achieve data dimension reduction or dimension increase. The image instance features comprise N original feature maps (that is, the original feature maps have the same size as the convolution feature maps), any feature map pixel of any original feature map (or convolution feature map) corresponds to one instance of the image to be identified, and the pixel value of any feature map pixel of any original feature map can be regarded as the score of one instance on one category. The whole image to be recognized can be regarded as a bag, an instance of the image to be recognized can be regarded as an image region of the image to be recognized, and N is equal to the number of categories to be classified finally, that is, the number of categories of the bag.
The specific process of converting the convolution feature into the image instance feature can be represented by the following formula (1):
X_2 = W_1 * X_1 + b_1    (1)
wherein X_2 represents the image instance features, X_1 represents the convolution features, * denotes the convolution operation, W_1 and b_1 respectively represent the weight and bias terms of the conversion layer, 1 × 1 represents the size of the convolution kernel, and N represents the number of output channels, which is also the number of final categories.
For example, the size of the convolution feature is 50 × 50 × 10, the convolution feature is converted into an image instance feature through a convolution kernel in a conversion layer, the size of the image instance feature is 50 × 50 × N, the size of each original feature map is 50 × 50, and a certain feature map pixel of the first original feature map represents the score of an instance corresponding to the feature map pixel on the first category.
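The conversion from convolution features to image instance features can be illustrated with a short sketch. This is a minimal, non-authoritative example assuming a PyTorch-style implementation; the backbone is left abstract, and the channel numbers (10 input channels, 3 categories) simply follow the 50 × 50 × 10 example above.

```python
import torch
import torch.nn as nn

class InstanceFeatureHead(nn.Module):
    """Converts backbone convolution features X1 (B, C, H, W) into image instance
    features X2 (B, N, H, W) with a 1x1 convolution, as in formula (1). Each spatial
    position of X2 is one instance; channel n holds that instance's score for class n."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # W1 and b1 of the conversion layer are the weight and bias of this 1x1 conv.
        self.conversion = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x1: torch.Tensor) -> torch.Tensor:
        return self.conversion(x1)

# Hypothetical sizes matching the example in the text: 10 convolution feature maps
# of 50x50 converted into N = 3 original feature maps of 50x50.
x1 = torch.randn(1, 10, 50, 50)
x2 = InstanceFeatureHead(in_channels=10, num_classes=3)(x1)
print(x2.shape)  # torch.Size([1, 3, 50, 50])
```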
Referring to fig. 4, fig. 4 is a schematic diagram of an example provided in the embodiment of the present application, and as shown in fig. 4, the convolution kernel size of the convolution layer is 2 × 2, that is, the size of the convolution sliding window on the image to be recognized is 2 × 2, and the example is described by taking 4 pixels of the convolution sliding window currently sliding to the upper left corner area of the image to be recognized. And performing dot multiplication operation on the 4 pixels at the upper left corner of the image to be recognized and the convolution kernel to obtain the pixel value of the first characteristic image pixel at the upper left corner of the convolution characteristic, wherein the pixel value of the first characteristic image pixel at the upper left corner is 0 currently. As can be seen from the mapping relationship shown in fig. 4, the first feature map pixel at the upper left corner of the convolution feature corresponds to the upper left corner region of the image to be recognized, and the upper left corner region of the image to be recognized may also be referred to as an example of the image to be recognized, so that the first feature map pixel at the upper left corner of the convolution feature corresponds to an example (or an image region) of the image to be recognized. Subsequently, the convolution feature can be convolved by using a convolution kernel of 1 × 1 to change the number of channels of the convolution feature map. Although the number of channels is changed, the size of the feature map is not changed, so that each pixel of the feature after convolution by adopting a convolution kernel of 1 × 1 still corresponds to one example of the image to be recognized.
Step S102, extracting K local key example features under K scales from the N original feature maps, and superposing the K local key example features into multi-scale example features of the image to be identified, wherein K is a positive integer.
Specifically, K scales are preset, and local key instance features (unit local instance features in the embodiments corresponding to fig. 2 a-2 b as described above) at each scale are extracted from N original feature maps. For the ith scale, i is more than or equal to 1 and less than or equal to K, the process of extracting local key example features under the ith scale from N original feature maps is as follows:
and setting a polling priority for each original feature map, selecting the original feature map with the highest polling priority from the N original feature maps according to the polling priority, and taking the selected original feature map as a target original feature map. And determining the key features of the unit local instance according to the target original feature map and the scale i. And continuously selecting the target original feature map with the highest polling priority from the rest original feature maps according to the polling priority, and continuously determining the unit local key example feature of the next target original feature map. And continuously circulating, stopping polling when all the original feature maps are determined as target original feature maps, and combining the N units of the local key instance features determined previously into local key instance features under the ith scale. The dimensions of each unit local key instance feature are the same as the dimensions of the original feature map.
Colloquially, the process of determining local key instance features at the ith scale is: and respectively determining a unit local key instance feature from each original feature map, and combining the N unit local key instance features into the local key instance features under the ith scale.
According to the target original feature map and the dimension i, the process of determining the key features of one unit local instance is as follows: and the server acquires the size of the sliding window and divides the target original feature map into a plurality of unit original feature maps according to the size of the sliding window. There may be overlapping feature map pixels or no overlapping feature map pixels between the plurality of unit original feature maps. And respectively adjusting the pixel value of the feature image pixel of each unit original feature map according to the scale i, and taking the unit original feature map with the adjusted pixel value as a unit target feature map. And splicing all unit target feature maps into unit local key example features.
For example, if the size of the target raw feature map is 100 × 100 and the size of the sliding window is 10 × 10, if there are no overlapped feature map pixels between multiple unit raw feature maps, the target raw feature map may be divided into 100 unit raw feature maps, the pixel value of the feature map pixel in each unit raw feature map is adjusted, and the unit raw feature map after the pixel value adjustment is taken as the unit target feature map. And splicing 100 unit target feature maps to obtain a unit local key example feature with the size of 100 multiplied by 100.
Aiming at any unit original feature map in a plurality of unit original feature maps, the pixel value of the feature image pixel of any unit original feature map is adjusted according to the scale i, and the flow of obtaining the unit target feature map corresponding to any unit original feature map is as follows: and performing descending sorting on all the feature image pixels according to the pixel values of the feature image pixels in any unit of original feature image, wherein the first i feature image pixels in the descending sorting are all used as reserved feature image pixels, and the rest feature image pixels are all used as feature image pixels to be adjusted. And adjusting the pixel values of all the characteristic image pixels to be adjusted to be a preset pixel threshold (the pixel threshold can be equal to 0), and keeping the pixel values of the characteristic image pixels unchanged. And the server takes any unit original feature map after the pixel value adjustment as a unit target feature map corresponding to any unit original feature map.
In general, for a unit original feature map, the pixel values of the i pixels having the largest pixel values are left unchanged, and the pixel values of the remaining pixels are all adjusted to 0. In this way, the feature responses of the i instances whose feature response values rank in the top i are retained, and the feature responses of all other instances are suppressed, so that the key instances can be screened out.
For example, assume that i is 2 and consider a 2 × 2 unit original feature map whose pixels are denoted pixel 1, pixel 2 (top row) and pixel 3, pixel 4 (bottom row), where pixel 2 = 0.4, pixel 3 = 0.5, and the pixel value of pixel 3 > the pixel value of pixel 2 > the pixel value of pixel 1 > the pixel value of pixel 4. The 2 pixels with the largest pixel values are therefore pixel 3 and pixel 2, i.e. the pixel values of pixel 3 and pixel 2 are kept unchanged while the pixel values of pixel 1 and pixel 4 are adjusted to 0, and the following unit target feature map at the 2nd scale is obtained:
0    0.4
0.5  0
the specific process of determining the target unit feature map can be represented by the following formula (2):
\hat{X}_{i}^{(m,n)} = X_{i}^{(m,n)}, if X_{i}^{(m,n)} ∈ Top_i(X_i); otherwise \hat{X}_{i}^{(m,n)} = 0    (2)
wherein \hat{X}_{i}^{(m,n)} represents the pixel value of the feature map pixel at the (m, n) position in the unit target feature map at the i-th scale, X_{i}^{(m,n)} represents the pixel value of the feature map pixel at the (m, n) position in the unit original feature map at the i-th scale, and Top_i(X_i) represents the set of the pixel values ranked in the top i in the unit original feature map.
The physical meaning of equation (2) is: if the pixel k of the unit original feature map is the pixel i ranked before the pixel value, the pixel value of the pixel k is unchanged, otherwise, the pixel value of the pixel k is adjusted to 0.
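A small sketch of the window-wise top-i selection described by formula (2), written in Python/numpy. The non-overlapping window split and the concrete values 0.3 and 0.1 used in the demonstration are assumptions for illustration only.

```python
import numpy as np

def keep_top_i(window: np.ndarray, i: int) -> np.ndarray:
    """Formula (2): keep the i largest pixel values in a window, set the rest to 0."""
    flat = window.flatten()
    out = np.zeros_like(flat)
    top_idx = np.argsort(flat)[-i:]          # indices of the i largest values
    out[top_idx] = flat[top_idx]
    return out.reshape(window.shape)

def local_key_instance_feature(feature_map: np.ndarray, i: int, win: int) -> np.ndarray:
    """Split one original feature map into non-overlapping win x win unit feature maps,
    apply the top-i rule in each, and stitch the results back together.
    Assumes the map size is divisible by the window size."""
    h, w = feature_map.shape
    result = np.zeros_like(feature_map)
    for r in range(0, h, win):
        for c in range(0, w, win):
            result[r:r+win, c:c+win] = keep_top_i(feature_map[r:r+win, c:c+win], i)
    return result

# The 2x2 example from the text (scale i = 2): the two largest values survive.
unit = np.array([[0.3, 0.4],
                 [0.5, 0.1]])      # 0.3 and 0.1 are assumed values for pixels 1 and 4
print(keep_top_i(unit, 2))          # [[0.  0.4] [0.5 0. ]]
```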
Optionally, when determining the unit target feature map, in addition to the above sorting manner, the unit original feature map may be processed based on other filter kernels such as gaussian filter, so as to obtain a unit target feature map corresponding to the unit original feature map.
The server can respectively determine the local key instance features under each scale according to the process, and then K local key instance features under K scales can be obtained.
The server may perform superposition and activation processing on the K local key instance features to obtain the multi-scale instance features of the image to be identified, where the size of a feature map of the multi-scale instance features (referred to as a scale feature map) is equal to the size of any feature map of the local key instance features, and the number of scale feature maps included in the multi-scale instance features is equal to the number of feature maps of any local key instance feature, which in turn equals the number of original feature maps. The K local key instance features are superimposed into the multi-scale instance features by alignment addition: the first feature maps of all local key instance features are added to obtain the first scale feature map of the multi-scale instance features, the second feature maps of all local key instance features are added to obtain the second scale feature map, and so on. Finally, the N added scale feature maps are respectively subjected to activation processing, with one scale feature map as a processing unit, and the N activated scale feature maps are combined into the multi-scale instance features (such as the local instance feature 20c in the embodiment corresponding to figs. 2a-2b).
Analyzing the above process, it can be seen that the instance (i.e. feature map pixel) whose feature response ranks p-th in the sampling window is superimposed K − p + 1 times, so that important instances are continuously strengthened and the roles of different key instances are further distinguished. For example, suppose feature map pixel 1 at the upper left corner of original feature map 1 is the feature map pixel whose pixel value ranks first in its unit original feature map; then the pixel value of feature map pixel 1 is retained when determining the local key instance feature at the first scale, at the second scale, and so on up to the K-th scale. Finally, when the K local key instance features are superimposed, the pixel value of feature map pixel 1 is superimposed K times, yielding one feature map pixel of a scale feature map of the multi-scale instance features.
The process of superimposing K local key instance features into a multi-scale instance feature can be described by the following equation (3):
X_3 = softmax(∑_i X_{3,i})    (3)
wherein X_3 represents the multi-scale instance features, X_{3,i} represents the local key instance feature at the i-th scale, and softmax() represents an activation function whose role is to normalize the feature distribution to [0, 1].
The server thus obtains the multi-scale instance features, which also comprise N scale feature maps, and the size of each scale feature map is equal to the size of the original feature maps. In the multi-scale instance features only the key instances are retained (their pixel values are kept), while the non-key instances are directly ignored (their pixel values are set to 0).
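The alignment addition and per-map activation of formula (3) can be sketched as follows; this is a simplified numpy version, and how batching is handled in an actual implementation is not specified in the text.

```python
import numpy as np

def softmax_map(fm: np.ndarray) -> np.ndarray:
    """Softmax over all pixels of one scale feature map, normalising it to [0, 1]."""
    e = np.exp(fm - fm.max())
    return e / e.sum()

def multi_scale_instance_features(local_key_feats: list) -> np.ndarray:
    """Formula (3): align-add the K local key instance features (each of shape (N, H, W)),
    then activate each of the N summed scale feature maps separately."""
    summed = np.sum(np.stack(local_key_feats, axis=0), axis=0)   # (N, H, W)
    return np.stack([softmax_map(fm) for fm in summed])
```

Because a pixel that ranks first in its window survives at every scale, its value is added K times here, which matches the reinforcement effect described above.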
Step S103, extracting global instance weight characteristics of the image to be identified from the N original characteristic graphs.
Specifically, the server performs convolution processing on the N original feature maps to obtain a convolution matrix, wherein a convolution kernel of the convolution processing is a single-channel 1 × 1 convolution kernel, and thus the size of the obtained convolution matrix is the same as that of any original feature map.
For example, the size of the image example feature is 50 × 50 × 10, that is, the image example feature includes 10 original feature maps with the size of 50 × 50, and the convolution matrix with the size of 50 × 50 can be obtained by performing convolution processing on the 10 original feature maps based on a convolution kernel with the size of 1 × 1 × 10 × 1.
And performing activation processing on the convolution matrix to obtain global instance weight characteristics (such as the weight matrix 20d in the corresponding embodiment of the above-mentioned fig. 2 a-2 b) of the image to be identified, wherein the activation processing has the effect of enabling the global instance weight characteristics to be distributed in [0,1 ]. The size of the global instance weight feature is equal to the size of any original feature map, and each feature value in the global instance weight feature represents the weight of an instance, and the greater the weight, the higher the importance of the instance is.
The process of extracting global instance weight features from N original feature maps can be represented by the following formula (4):
M = softmax(relu(W_2 X_2 + b_2))    (4)
where M represents the global instance weight features, X_2 represents the image instance features, W_2 and b_2 respectively represent the weight and bias terms, and softmax() and relu() each represent an activation function.
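The global perception step of formula (4) amounts to a single-channel 1 × 1 convolution followed by relu and softmax. A hedged PyTorch-style sketch (the module name and interface are illustrative, not the patent's own code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalPerception(nn.Module):
    """Formula (4): a single-channel 1x1 convolution over the N original feature maps,
    followed by relu and a softmax over all spatial positions, yielding one weight per
    instance (per feature-map pixel)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.reduce = nn.Conv2d(num_classes, 1, kernel_size=1)  # W2, b2

    def forward(self, x2: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x2.shape
        m = F.relu(self.reduce(x2))              # (B, 1, H, W)
        m = F.softmax(m.view(b, -1), dim=1)      # weights distributed in [0, 1]
        return m.view(b, h, w)                   # same size as one original feature map
```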
It should be noted that the order of determining the multi-scale instance feature in step S102 and determining the global instance weight feature in step S103 is not limited.
And step S104, identifying the multi-scale example features and the global example weight features to obtain an image identification result of the image to be identified.
Specifically, the server fuses the multi-scale instance features and the global instance weight features into target fusion features of the image to be identified (such as the fusion instance features 20e in the corresponding embodiments of fig. 2 a-2 b described above).
And (3) carrying out identification processing on the target fusion features to obtain an image identification result (such as the image identification result 20f in the embodiment corresponding to fig. 2 a-2 b).
Referring to fig. 5, fig. 5 is a schematic diagram of a deep multiple-instance learning model provided in an embodiment of the present application, and important components of the deep multiple-instance learning model are: the local pyramid perception module and the global perception module. After the image to be recognized is input into the feature extraction network, the convolution feature of the image to be recognized can be extracted. And converting the convolution characteristics into image example characteristics through a 1 x 1 convolution kernel in the conversion layer, wherein the image example characteristics comprise N original characteristic maps, the N number is equal to the number of categories, each characteristic map pixel of the original characteristic maps corresponds to one image area of the image to be identified, and the pixel value of one characteristic map pixel of the original characteristic maps represents the score of the example on one category. And respectively inputting the converted image example features into a local pyramid perception module, extracting multi-scale example features, and inputting the image example features into a global perception module to extract global example weight features. The specific process for extracting the multi-scale example features is as follows: the method comprises the steps of presetting K scales, determining local key example features of image example features under each scale, and overlapping the K local key example features into multi-scale example features. The specific process for determining the local key instance features under the ith scale is as follows: in the sliding window of each original feature map of the N original feature maps, if the pixel value of the feature map pixel in the sliding window is ranked at the top i, the pixel value is kept unchanged, otherwise, the pixel value of the feature map pixel is set to be 0. And taking the N original feature maps after the pixel values are adjusted as local key example features under the ith scale.
The specific process for extracting the global instance weight features comprises the following steps: and carrying out convolution operation and activation operation on the image example features by adopting a single-channel convolution kernel of 1 × 1 to obtain the global example weight features.
The extracted global instance weight features and the multi-scale instance features are then aggregated in a weighted manner to obtain a bag-level feature vector (the bag-level feature vector can correspond to the feature to be activated in this application). Each component of this feature vector represents a score between the image to be identified and one category, and is not necessarily in the [0, 1] interval. The bag-level feature vector is activated through an activation function, and each component of the activated feature vector represents the matching probability between the image to be identified and one category, which lies in the [0, 1] interval.
The process of determining the matching probability of the image to be recognized and the image class can be represented by the following formula (5):
p = g(M, X_3) = ∑_{i,j} M_{i,j} X_{3,(i,j)}    (5)
wherein X_3 represents the multi-scale instance features and M represents the global instance weight features. The global instance weight features are first multiplied with each scale feature map of the multi-scale instance features, and the pixel values of the feature maps after the product operation are then superimposed into an N-dimensional feature vector p, where each value of the feature vector represents an initial score of the image to be identified on one image category. The server 10d activates the feature vector p to obtain the matching probabilities of the image to be recognized for the N image categories, i.e. \hat{p} = softmax(p).
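Formula (5) can be sketched directly in numpy; the softmax activation at the end corresponds to obtaining the matching probabilities from the vector p. This is a sketch under the assumption of a single image rather than a batched implementation.

```python
import numpy as np

def bag_probabilities(M: np.ndarray, X3: np.ndarray) -> np.ndarray:
    """Formula (5): weight each scale feature map of X3 (shape (N, H, W)) with the
    global instance weights M (shape (H, W)), sum the weighted pixel values into the
    N-dimensional vector p, then activate p with softmax."""
    p = np.array([(M * fm).sum() for fm in X3])   # initial score per image category
    e = np.exp(p - p.max())
    return e / e.sum()                            # matching probabilities in [0, 1]
```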
It should be noted that, as can be seen from fig. 5, the local-global network and the classifier can identify the category of the image through the feature extraction network in the deep multi-instance learning model. Therefore, when the deep multi-instance learning model is trained, the feature extraction network, the local-global network and the classifier are also trained together as a model.
When the deep multi-instance learning model is trained, the loss value can be determined by adopting a cross-entropy loss function between the true and predicted class labels, for example:
L = −∑_c Y_c log(Ŷ_c)
wherein Y_c represents the true label of the sample for the c-th class, and Ŷ_c represents the predicted label of the sample for the c-th class.
In view of the fact that the contrast between fine lesion regions and the fundus background is low, the present application selects key instances at different local scales through a Local Pyramid Perception Module (LPPM) to increase the importance of locally prominent instances. Meanwhile, considering that fundus disease lesions tend to be scattered, the importance of each instance is measured from the perspective of the entire image by a Global Perception Module (GPM). Finally, under the instance-space paradigm, the locally derived instance representations and the globally derived instance-space weight distribution are fused in a weighted manner, and the probability distribution of the bag is generated by a softmax classifier. By examining both the expression and the weight of the instance features from the local and global perspectives, the feature expression of the image can be enriched and the accuracy of image identification can be improved.
The local-global bidirectional perception deep multi-instance learning model can simply and intuitively replace the global average pooling layer and the fully connected layer in a conventional convolutional neural network structure, so that a conventional convolutional neural network method can be conveniently converted into a deep multi-instance learning scheme. Meanwhile, the deep multi-instance learning framework composed of a CNN (Convolutional Neural Network) backbone and the proposed MIL (Multiple Instance Learning) module can be trained and optimized as a whole in an end-to-end manner.
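A sketch of how the local-global MIL head could replace the global average pooling and fully connected layers of a conventional CNN, assuming PyTorch. The `lppm` and `gpm` arguments stand for the local pyramid perception and global perception modules; their internals (formulas (2)-(4)) are left abstract here, so this is an architectural illustration rather than the patent's actual code.

```python
import torch
import torch.nn as nn

class DeepMILHead(nn.Module):
    """Swap a CNN's global-average-pool + fully-connected head for the local-global
    MIL head described above. The backbone, lppm and gpm modules are placeholders."""
    def __init__(self, backbone: nn.Module, in_channels: int, num_classes: int,
                 lppm: nn.Module, gpm: nn.Module):
        super().__init__()
        self.backbone = backbone                       # CNN trunk, e.g. a ResNet without its head
        self.to_instances = nn.Conv2d(in_channels, num_classes, 1)   # conversion layer, formula (1)
        self.lppm, self.gpm = lppm, gpm

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x1 = self.backbone(image)                      # convolution features
        x2 = self.to_instances(x1)                     # image instance features
        x3 = self.lppm(x2)                             # multi-scale instance features
        m = self.gpm(x2)                               # global instance weight features (B, H, W)
        p = (m.unsqueeze(1) * x3).flatten(2).sum(-1)   # formula (5): per-class bag scores
        return p.softmax(dim=1)                        # matching probabilities, trainable end-to-end
```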
Referring to fig. 6, fig. 6 is a schematic flow chart of determining an image recognition result according to an embodiment of the present application, where this embodiment mainly describes how to determine an image recognition result according to a multi-scale example feature and a global example weight feature, where the determining of the image recognition result includes the following steps S201 to S202, and the steps S201 to S202 are a specific embodiment of the step S104 in fig. 3:
step S201, fusing the multi-scale instance features and the global instance weight features into target fusion features of the image to be recognized.
Specifically, as can be seen from the foregoing, the multi-scale example feature includes N scale feature maps, and the size of each scale feature map is the same as that of each original feature map, and the global example weight feature is a matrix, and the size of the matrix is the same as that of each scale feature map.
And the server performs point multiplication operation on the global instance weight characteristics and each scale characteristic graph to obtain N fusion characteristic graphs, and combines the N fusion characteristic graphs into fused target fusion characteristics.
Step S202, the target fusion characteristics are identified, and an image identification result of the image to be identified is obtained.
Specifically, there is a little difference in the recognition processing based on different business requirements, and the image recognition result determined here may be an image category or an image semantic segmentation result. The following first explains how to determine the image category of the image to be recognized based on the target fusion feature:
as can be seen from the foregoing, the target fusion feature includes N fusion feature maps, the pixel values of all feature map pixels of each fusion feature map are superimposed to be the feature value to be activated, and the N feature values to be activated are combined to be the feature to be activated. For example, the target fusion feature includes 3 fusion feature maps, which are respectively a fusion feature map 1, a fusion feature map 2, and a fusion feature map 3, the pixel values of all feature image pixels in the fusion feature map 1 are superimposed to be the feature value 1 to be activated, the pixel values of all feature image pixels in the fusion feature map 2 are superimposed to be the feature value 2 to be activated, the pixel values of all feature image pixels in the fusion feature map 3 are superimposed to be the feature value 3 to be activated, and the feature value 1 to be activated, the feature value 2 to be activated, and the feature value 3 to be activated are combined to be the feature to be activated.
Or, taking the pixel average value of all the feature image pixels of each fused feature image as the feature value to be activated, and combining the N feature values to be activated into the feature to be activated. For example, the target fusion feature includes 3 fusion feature maps, which are respectively a fusion feature map 1, a fusion feature map 2, and a fusion feature map 3, the pixel average value of all feature image pixels in the fusion feature map 1 is used as a feature value 1 to be activated, the pixel average value of all feature image pixels in the fusion feature map 2 is used as a feature value 2 to be activated, the pixel average value of all feature image pixels in the fusion feature map 3 is used as a feature value 3 to be activated, and the feature value 1 to be activated, the feature value 2 to be activated, and the feature value 3 to be activated are combined into a feature to be activated.
The server activates the feature to be activated to obtain a matching probability set between the image to be identified and the N image categories, and the server can select the image category with the maximum matching probability from the matching probability set as the image category of the image to be identified and use the identified image category as the image identification result.
How to determine the semantic segmentation result of the image to be recognized based on the target fusion features is explained as follows:
as can be seen from the foregoing, the target fusion feature includes N fusion feature maps, and the size of the fusion feature map is generally smaller than the size of the image to be recognized.
And the server carries out interpolation processing on each fusion characteristic graph to obtain N mask matrixes with the same size as the image to be identified, and each value in the mask matrixes represents the score of one pixel of the image to be identified on one category. The N mask matrices may determine a set of matching probabilities between each pixel of the image to be identified and the N pixel classes. For example, the number of the mask matrixes is 3, which are respectively a mask matrix 1, a mask matrix 2 and a mask matrix 3, the mask matrix 1 corresponds to the pixel class 1, the mask matrix 2 corresponds to the pixel class 2, the mask matrix 3 corresponds to the pixel class 3, a value of a first position at the upper left corner of the mask matrix 1 represents a score of a first pixel at the upper left corner of the image to be recognized on the pixel class 1, a value of a first position at the upper left corner of the mask matrix 2 represents a score of a first pixel at the upper left corner of the image to be recognized on the pixel class 2, and a value of a first position at the upper left corner of the mask matrix 3 represents a score of a first pixel at the upper left corner of the image to be recognized on the pixel class 3.
After the server determines the matching probability set of each pixel of the image to be recognized, for a pixel, the pixel class corresponding to the maximum matching probability in the matching probability set of the pixel is used as the pixel class of the pixel, and the server uses the identified pixel class of each pixel as the semantic segmentation result (namely, the image recognition result) of the image to be recognized. Subsequently, semantic segmentation can be performed on the image to be recognized based on the semantic segmentation result of the image to be recognized.
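The segmentation branch described above (interpolating each fusion feature map to the image size and taking a per-pixel argmax over the N mask matrices) might look roughly as follows; the use of scipy.ndimage.zoom for the interpolation is an assumption, since the text only specifies "interpolation processing".

```python
import numpy as np
from scipy.ndimage import zoom

def semantic_segmentation(fusion_maps: np.ndarray, image_hw: tuple) -> np.ndarray:
    """Interpolate each of the N fusion feature maps (shape (N, h, w)) up to the image
    size (H, W) to obtain the N mask matrices, then take the pixel class with the
    maximum matching score for every image pixel."""
    n, h, w = fusion_maps.shape
    H, W = image_hw
    masks = np.stack([zoom(fm, (H / h, W / w), order=1) for fm in fusion_maps])  # (N, H, W)
    return masks.argmax(axis=0)   # semantic segmentation result: one pixel class per pixel
```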
In addition to determining the semantic segmentation result of the image, in the medical field, if the image to be recognized is a retinal image, the object attribute (which may specifically refer to a lesion attribute) of an interest object (which may specifically refer to a lesion object) and the interest region (which may specifically refer to a lesion region) of the interest object in the retinal image may be recognized based on the target fusion features. The image category of the image to be recognized may correspond to the object attribute of the image, and only one branch for recognizing the interest region needs to be added. The specific process of determining the interest region by this branch may be: taking the fusion feature map corresponding to the determined image category of the image to be identified as the target fusion feature map, taking the instance corresponding to the feature map pixel with the maximum pixel value in the target fusion feature map as the target instance, and taking the image region corresponding to the target instance in the image to be identified as the interest region of the interest object.
The object attribute and the interest region are combined into the image recognition result of the retinal image, and the recognized object attribute and interest region can then serve as data support for auxiliary diagnosis, for example for generating an auxiliary diagnosis report. For instance, a risk assessment report may be generated based on the object attribute and the region of interest.
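A sketch of the added branch is given below. It assumes that each feature-map pixel (instance) corresponds to a fixed-stride rectangular area of the input image; the exact instance-to-image mapping depends on the feature extractor and is illustrative here.

```python
import torch

def locate_interest_region(target_fusion_map, image_size):
    # target_fusion_map: (h, w) fusion feature map of the recognized image category.
    h, w = target_fusion_map.shape
    flat_idx = int(torch.argmax(target_fusion_map))
    row, col = divmod(flat_idx, w)                        # target instance position
    stride_y, stride_x = image_size[0] // h, image_size[1] // w
    # Image area corresponding to the target instance: (top, left, bottom, right).
    return (row * stride_y, col * stride_x,
            (row + 1) * stride_y, (col + 1) * stride_x)
```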
Further, if the image to be recognized is a retinal image, the object of interest may also refer to a visual organ, the object attribute to the organ type of the visual organ (for example, the eyeball, the crystalline lens, or the optic nerve), and the region of interest to the region where the visual organ is located. In this case, the object attribute of the retinal image corresponds to the organ type of the visual organ, and the interest region of the retinal image corresponds to the region where that organ is located. The identified object attributes and interest regions can subsequently be annotated in the retinal images and used in fields such as medical teaching.
If the image to be recognized is a face image in the non-medical field, the image recognition result can be the identity information of the face image; if the image to be recognized is a facial expression image in the non-medical field, the image recognition result can be an expression type; if the image to be identified is a traffic sign image in the non-medical field, the image identification result can be a traffic sign category; if the image to be recognized is a license plate number image in the non-medical field, the image recognition result can be a license plate number.
Referring to fig. 7, fig. 7 illustrates a network architecture for determining the target fusion features. After the image instance features of the image to be identified are extracted, local key instance features are extracted at each of the K scales, and the K local key instance features are superimposed into the multi-scale instance features. Meanwhile, a convolution operation is performed on the image instance features through the single-channel 1×1 convolution kernel of the conversion layer to obtain the global instance weight features. The multi-scale instance features and the global instance weight features are then fused into the target fusion features of the image to be recognized.
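The architecture of fig. 7 can be summarised by the forward pass sketched below. The names `LGDPHead` and `local_key_instance_features` are placeholders (the latter is sketched later in this description, after the polling procedure), the sigmoid activation and the superposition by summation are assumptions, and this is not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class LGDPHead(nn.Module):
    """Sketch of the fig. 7 head: multi-scale local features weighted by global instance weights."""
    def __init__(self, n_maps, scales):
        super().__init__()
        self.scales = scales                                # the K scales
        # Single-channel 1x1 convolution kernel of the conversion layer.
        self.convert = nn.Conv2d(n_maps, 1, kernel_size=1)

    def forward(self, instance_features):
        # instance_features: (N, h, w) image instance features (N original feature maps).
        multi_scale = sum(local_key_instance_features(instance_features, i)
                          for i in self.scales)             # superimpose K scales
        weights = torch.sigmoid(
            self.convert(instance_features.unsqueeze(0)).squeeze(0))   # (1, h, w)
        return multi_scale * weights                        # target fusion features (N, h, w)
```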
To illustrate the effectiveness of the scheme of the present application, the scheme was compared with a number of other schemes in experiments on 3 data sets. Table 1 below shows the accuracy of the scheme of the present application compared with the other schemes on the task of retinal disease identification:
TABLE 1
[Table 1: P, R, F1 and Acc of the scheme of the present application and of the comparison schemes (MP, AP, GA MIL, CSA MIL, MS MIL) on the 3 data sets; the table is provided as an image in the original document.]
Wherein P represents precision, R represents recall, F1 represents the F1-measure (the harmonic mean of precision and recall), and Acc represents accuracy. In table 1, MP represents a maximum-value deep multi-instance learning model, AP represents an average-value deep multi-instance learning model, GA MIL represents a deep multi-instance learning model based on gated attention, CSA MIL represents a deep multi-instance learning model based on a "channel-space" attention model, and MS MIL represents a multi-instance multi-scale model.
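These metrics follow their standard definitions, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad
Acc = \frac{TP + TN}{TP + TN + FP + FN}
```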
As can be seen from table 1, the scheme of the present application achieves better recognition accuracy than the other comparison schemes on all 3 data sets.
Tables 2 and 3 respectively list the performance of the present scheme (abbreviated LGDP, Local-Global Dual Perception) when embedded into different convolutional neural network backbones and applied to remote sensing image and natural image recognition tasks. VGG, RN (ResNet) and Inception denote the 3 convolutional neural network backbones. Table 2 shows the performance of the scheme on the remote sensing image task after being embedded into the different backbones, and table 3 shows its performance on the natural image task.
As can be seen from tables 2 and 3, the scheme of the present application yields a significant accuracy improvement on the different convolutional neural network backbones, both in the retinal disease recognition task and in other image recognition tasks.
TABLE 2
Backbone        P      R      F1     Acc
VGG             98.77  76.61  86.29  87.83
VGG+LGDP        94.26  91.52  92.87  92.97
RN              98.69  87.92  92.99  93.38
RN+LGDP         98.39  97.08  97.73  97.74
Inception       96.47  87.13  91.57  91.97
Inception+LGDP  97.25  95.91  96.58  96.60
TABLE 3
Backbone        NWPU        Scene-15
VGG             97.79±0.15  82.48±0.12
VGG+LGDP        92.72±0.17  86.61±0.11
RN              80.37±0.27  81.45±0.21
RN+LGDP         92.99±0.10  87.17±0.18
Inception       80.30±0.38  82.45±0.25
Inception+LGDP  92.70±0.21  87.65±0.22
Please refer to figs. 8a-8l, which are schematic diagrams of a plurality of instance responses provided by an embodiment of the present application. Figs. 8a and 8b form a group, as do figs. 8c and 8d; these 4 images all belong to the diabetic retinopathy lesion category. Figs. 8e and 8f form a group, as do figs. 8g and 8h; these 4 images all belong to the glaucoma lesion category. Figs. 8i and 8j form a group, as do figs. 8k and 8l; these 4 images all belong to the age-related lesion category. Taking the group of figs. 8a and 8b for analysis: fig. 8a is an image to be identified, and fig. 8b is the instance response corresponding to the feature image pixels with larger pixel values in the fusion feature map belonging to the diabetic retinopathy lesion category, where that fusion feature map is one of the target fusion features of fig. 8a determined by the scheme of the present application. As can be seen from fig. 8b, the present scheme can effectively capture the lesion area, which further improves the interpretability of the scheme.
In the above, when the target fusion features are superimposed into the features to be activated, the present application proposes a summation mode, an average mode, a maximum mode, and the like; this plurality of calculation modes enriches the ways in which the features to be activated can be determined. Furthermore, the image processing method can be applied to various image processing fields such as image label identification and image semantic segmentation, and therefore has a certain generalization capability and transferability.
Further, please refer to fig. 9, which is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 9, the image processing apparatus 1 may be applied to the server in the embodiments corresponding to figs. 3-8l described above. Specifically, the image processing apparatus 1 may be a computer program (including program code) running in a computer device, for example application software; the image processing apparatus 1 may be configured to perform the corresponding steps in the method provided by the embodiments of the present application.
The image processing apparatus 1 may include: the device comprises an acquisition module 11, a first extraction module 12, a superposition module 13, a second extraction module 14 and an identification module 15.
an acquisition module 11, configured to acquire an image to be recognized and extract image instance features of the image to be recognized, where the image instance features include N original feature maps, any feature map pixel of any original feature map corresponds to one instance of the image to be recognized, and N is a positive integer;
a first extraction module 12, configured to extract K local key instance features at K scales from the N original feature maps;
a superposition module 13, configured to superpose K local key instance features as the multi-scale instance features of the image to be identified, where K is a positive integer;
a second extraction module 14, configured to extract global instance weight features of the image to be identified from the N original feature maps;
and the identification module 15 is configured to perform identification processing on the multi-scale example features and the global example weight features to obtain an image identification result of the image to be identified.
In a possible implementation manner, for the ith scale of K scales, i is greater than or equal to 1 and less than or equal to K, when the first extraction module 12 is used to extract the local key instance feature at the ith scale from the N original feature maps, it is specifically configured to:
setting a polling priority for each original feature map, and determining a target original feature map for current polling from N original feature maps according to the polling priority;
determining unit local key instance characteristics according to the target original characteristic graph and the scale i;
and when all the original feature maps are determined as target original feature maps, stopping polling, and combining N unit local key instance features into the local key instance features under the ith scale.
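A sketch of this polling procedure is given below, assuming the polling priority is simply the channel order of the N original feature maps (the patent leaves the priority assignment open). The helper `unit_local_key_feature` is sketched after the sliding-window description that follows.

```python
import torch

def local_key_instance_features(instance_features, scale_i):
    # instance_features: (N, h, w); each original feature map is polled in priority order.
    units = []
    for target_map in instance_features:        # target original feature map of the current poll
        units.append(unit_local_key_feature(target_map, scale_i))
    # Polling stops once every original feature map has been the target; the N unit
    # local key instance features are combined into the scale-i local key instance feature.
    return torch.stack(units, dim=0)            # (N, h, w)
```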
In a possible implementation manner, when the first extraction module 12 is configured to determine a unit local key instance feature according to the target original feature map and the scale i, specifically, to:
acquiring the size of a sliding window, and dividing the target original feature map into a plurality of unit original feature maps according to the size of the sliding window;
respectively adjusting the pixel value of the feature image pixel of each unit original feature map according to the scale i to obtain a plurality of unit target feature maps;
and splicing a plurality of unit target feature maps into the unit local key instance features.
In a possible implementation manner, for any unit original feature map in the plurality of unit original feature maps, when the first extraction module 12 is configured to adjust a pixel value of a feature image pixel of the unit original feature map according to the scale i to obtain a unit target feature map corresponding to the unit original feature map, specifically:
according to the pixel value of the feature image pixel of any unit original feature map, sorting the feature image pixel of any unit original feature map in a descending order;
taking the first i characteristic image pixels in the descending order as reserved characteristic image pixels, and taking the characteristic image pixels except the reserved characteristic image pixels as the characteristic image pixels to be adjusted in all the characteristic image pixels of the original characteristic image of any unit;
adjusting the pixel value of the feature image pixel to be adjusted to be a pixel threshold value;
and taking any unit original feature map after the pixel value is adjusted as a unit target feature map corresponding to the unit original feature map.
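A sketch of this adjustment is given below. The window size and the pixel threshold of zero are illustrative assumptions; the patent only requires a sliding-window split of the target original feature map and some threshold for the pixels that are not retained.

```python
import torch

def unit_local_key_feature(target_map, scale_i, window=2, pixel_threshold=0.0):
    # target_map: (h, w) target original feature map, split into non-overlapping windows.
    h, w = target_map.shape
    out = target_map.clone()
    for top in range(0, h, window):
        for left in range(0, w, window):
            patch = out[top:top + window, left:left + window]   # one unit original feature map
            flat = patch.reshape(-1).clone()
            if flat.numel() > scale_i:
                # Keep the first i pixels in descending order of pixel value;
                # adjust the remaining pixels to the pixel threshold.
                order = torch.argsort(flat, descending=True)
                flat[order[scale_i:]] = pixel_threshold
                patch.copy_(flat.view_as(patch))                # unit target feature map
    # The unit target feature maps, spliced back in place, form the unit local key
    # instance feature, which has the same size as the target original feature map.
    return out
```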
In a possible implementation, the second extraction module 14 is specifically configured to:
performing convolution processing on the N original characteristic graphs to obtain a convolution matrix, wherein the size of the convolution matrix is the same as that of any original characteristic graph;
and activating the convolution matrix to obtain the global example weight characteristics of the image to be identified.
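A sketch of this step is given below; the sigmoid activation and the value of N are illustrative assumptions, since the patent only requires a convolution whose output has the size of one original feature map, followed by an activation.

```python
import torch
import torch.nn as nn

N = 3  # number of original feature maps (equal to the number of categories), illustrative
conversion_layer = nn.Conv2d(in_channels=N, out_channels=1, kernel_size=1)

def global_instance_weight_features(instance_features):
    # instance_features: (N, h, w) -> convolution matrix with the size of one feature map.
    conv_matrix = conversion_layer(instance_features.unsqueeze(0)).squeeze(0)   # (1, h, w)
    # The activation yields one weight per instance (per feature-map pixel).
    return torch.sigmoid(conv_matrix)
```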
In a possible implementation manner, when the identifying module 15 is configured to perform the identification processing on the multi-scale instance feature and the global instance weight feature to obtain the image identification result of the image to be identified, specifically, the identifying module is configured to:
fusing the multi-scale instance features and the global instance weight features into target fusion features of the image to be recognized;
and identifying the target fusion characteristics to obtain an image identification result of the image to be identified.
In one possible implementation, the multi-scale example features comprise N scale feature maps, any scale feature map has the same size as any original feature map, and the feature size of the global example weight feature has the same size as any scale feature map;
when the identifying module 15 is configured to fuse the multi-scale instance feature and the global instance weight feature into the target fusion feature of the image to be identified, specifically, it is configured to:
and performing product operation on the global instance weight characteristics and each scale characteristic graph to obtain N fusion characteristic graphs, and combining the N fusion characteristic graphs into the target fusion characteristics of the image to be identified.
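A sketch of this fusion is given below; broadcasting multiplies every scale feature map by the same single-channel weight map, which matches the product operation described above.

```python
def fuse(multi_scale_features, global_weights):
    # multi_scale_features: (N, h, w) scale feature maps;
    # global_weights: (1, h, w) global instance weight features.
    # The element-wise product gives the N fusion feature maps that form
    # the target fusion features of the image to be identified.
    return multi_scale_features * global_weights
```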
In a possible implementation manner, when the recognition module 15 is configured to perform recognition processing on the target fusion feature to obtain an image recognition result of the image to be recognized, specifically, to:
superposing the target fusion features into features to be activated, and activating the features to be activated to obtain a matching probability set between the images to be identified and N image categories;
and determining the image category of the image to be recognized according to the matching probability set, and taking the image category of the image to be recognized as the image recognition result of the image to be recognized.
In one possible embodiment, the target fusion feature comprises N fusion feature maps;
when the recognition module 15 is configured to superimpose the target fusion feature as a feature to be activated, it is specifically configured to:
respectively superposing the pixel values of all the feature image pixels of each fusion feature map into feature values to be activated, and combining the N feature values to be activated into the features to be activated; or, alternatively,
and respectively taking the pixel average value of all the feature image pixels of each fusion feature map as a feature value to be activated, and combining the N feature values to be activated into the feature to be activated.
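A sketch of the two superposition modes described above (summation and averaging) is given below; either mode reduces each of the N fusion feature maps to one feature value to be activated.

```python
def features_to_activate(target_fusion, mode="sum"):
    # target_fusion: (N, h, w) fusion feature maps -> N feature values to be activated.
    if mode == "sum":
        return target_fusion.sum(dim=(1, 2))    # superpose all pixel values of each map
    return target_fusion.mean(dim=(1, 2))       # or use the pixel average of each map
```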
In one possible embodiment, the target fusion feature comprises N fusion feature maps;
the recognition module 15 is specifically configured to, when being configured to perform recognition processing on the target fusion feature to obtain an image recognition result of the image to be recognized:
performing interpolation processing on each fusion characteristic graph to obtain N mask matrixes with the same size as the image to be identified, and determining a matching probability set between each pixel of the image to be identified and N pixel categories according to the N mask matrixes;
determining the pixel category of each pixel of the image to be recognized according to the matching probability set of each pixel of the image to be recognized, and taking the pixel category of each pixel of the image to be recognized as the image recognition result of the image to be recognized.
In a possible implementation manner, the image to be recognized is a retina image, and the image recognition result includes an interest area where an object of interest in the retina image is located and an object attribute of the object of interest.
In a possible implementation, the image to be recognized is a natural image in a non-medical field, and the image recognition result includes an image category of the natural image.
According to an embodiment of the present invention, the steps involved in the methods shown in figs. 3-8l may be performed by various modules in the image processing apparatus shown in fig. 9. For example, steps S101-S104 shown in fig. 3 may be performed by the acquisition module 11, the first extraction module 12, the superposition module 13, the second extraction module 14, and the identification module 15 shown in fig. 9, respectively; as another example, steps S201-S202 shown in fig. 6 may be performed by the recognition module 15 shown in fig. 9.
Further, please refer to fig. 10, which is a schematic structural diagram of a computer device according to an embodiment of the present application. The server in the embodiments corresponding to figs. 3-8l described above may be a computer device 1000. As shown in fig. 10, the computer device 1000 may include: a user interface 1002, a processor 1004, an encoder 1006, and a memory 1008. A signal receiver 1016 is used to receive or transmit data via a cellular interface 1010 or a WIFI interface 1012. The encoder 1006 encodes received data into a computer-processable data format. The memory 1008 stores a computer program, and the processor 1004 is arranged to execute this computer program so as to perform the steps of any of the method embodiments described above. The memory 1008 may include volatile memory (e.g., dynamic random access memory, DRAM) and may also include non-volatile memory (e.g., one-time programmable read-only memory, OTPROM). In some instances, the memory 1008 may further include memory located remotely from the processor 1004, connected to the computer device 1000 via a network. The user interface 1002 may include a keyboard 1018 and a display 1020.
In the computer device 1000 shown in fig. 10, the processor 1004 may be configured to invoke the computer program stored in the memory 1008 to implement:
acquiring an image to be identified, and extracting image instance features of the image to be identified, wherein the image instance features comprise N original feature maps, any feature map pixel of any original feature map corresponds to one instance of the image to be identified, and N is a positive integer;
extracting K local key instance features under K scales from the N original feature maps, and superposing the K local key instance features into multi-scale instance features of the image to be identified, wherein K is a positive integer;
extracting global example weight characteristics of the image to be identified from the N original characteristic graphs;
and identifying the multi-scale example features and the global example weight features to obtain an image identification result of the image to be identified.
In one embodiment, for the ith scale of K scales, i is greater than or equal to 1 and less than or equal to K, when the processor 1004 performs the step of extracting the local key instance feature at the ith scale from the N original feature maps, the following steps are specifically performed:
setting a polling priority for each original feature map, and determining a target original feature map for current polling from N original feature maps according to the polling priority;
determining unit local key instance characteristics according to the target original characteristic graph and the scale i;
and when all the original feature maps are determined as target original feature maps, stopping polling, and combining N unit local key instance features into the local key instance features under the ith scale.
In one embodiment, when the processor 1004 determines a unit local key instance feature according to the target original feature map and the scale i, the following steps are specifically performed:
acquiring the size of a sliding window, and dividing the target original feature map into a plurality of unit original feature maps according to the size of the sliding window;
respectively adjusting the pixel value of the feature image pixel of each unit original feature map according to the scale i to obtain a plurality of unit target feature maps;
and splicing a plurality of unit target feature maps into the unit local key instance features.
In one embodiment, for any unit original feature map in the plurality of unit original feature maps, when the processor 1004 performs, according to the scale i, adjusting a pixel value of a feature image pixel of the unit original feature map to obtain a unit target feature map corresponding to the unit original feature map, specifically perform the following steps:
according to the pixel value of the feature image pixel of any unit original feature map, sorting the feature image pixel of any unit original feature map in a descending order;
taking the first i characteristic image pixels in the descending order as reserved characteristic image pixels, and taking the characteristic image pixels except the reserved characteristic image pixels as the characteristic image pixels to be adjusted in all the characteristic image pixels of the original characteristic image of any unit;
adjusting the pixel value of the feature image pixel to be adjusted to be a pixel threshold value;
and taking any unit original feature map after the pixel value is adjusted as a unit target feature map corresponding to the unit original feature map.
In one embodiment, when the processor 1004 extracts the global instance weight feature of the image to be identified from the N original feature maps, the following steps are specifically performed:
performing convolution processing on the N original characteristic graphs to obtain a convolution matrix, wherein the size of the convolution matrix is the same as that of any original characteristic graph;
and activating the convolution matrix to obtain the global example weight characteristics of the image to be identified.
In an embodiment, when the processor 1004 performs the identification processing on the multi-scale instance feature and the global instance weight feature to obtain the image identification result of the image to be identified, the following steps are specifically performed:
fusing the multi-scale instance features and the global instance weight features into target fusion features of the image to be recognized;
and identifying the target fusion characteristics to obtain an image identification result of the image to be identified.
In one embodiment, the multi-scale example features comprise N scale feature maps, wherein any scale feature map has the same size as any original feature map, and the feature size of the global example weight feature has the same size as any scale feature map;
when the processor 1004 performs the fusion of the multi-scale instance feature and the global instance weight feature as the target fusion feature of the image to be recognized, specifically performing the following steps:
and performing product operation on the global instance weight characteristics and each scale characteristic graph to obtain N fusion characteristic graphs, and combining the N fusion characteristic graphs into the target fusion characteristics of the image to be identified.
In an embodiment, when the processor 1004 executes the recognition processing on the target fusion feature to obtain the image recognition result of the image to be recognized, the following steps are specifically executed:
superposing the target fusion features into features to be activated, and activating the features to be activated to obtain a matching probability set between the images to be identified and N image categories;
and determining the image category of the image to be recognized according to the matching probability set, and taking the image category of the image to be recognized as the image recognition result of the image to be recognized.
In one embodiment, the target fusion feature comprises N fusion feature maps;
when the processor 1004 performs the superposition of the target fusion feature into the feature to be activated, the following steps are specifically performed:
respectively superposing the pixel values of all the feature image pixels of each fusion feature map into feature values to be activated, and combining the N feature values to be activated into the features to be activated; or, alternatively,
and respectively taking the pixel average value of all the feature image pixels of each fusion feature map as a feature value to be activated, and combining the N feature values to be activated into the feature to be activated.
In one embodiment, the target fusion feature comprises N fusion feature maps;
when the processor 1004 executes the identification processing on the target fusion feature to obtain the image identification result of the image to be identified, the following steps are specifically executed:
performing interpolation processing on each fusion characteristic graph to obtain N mask matrixes with the same size as the image to be identified, and determining a matching probability set between each pixel of the image to be identified and N pixel categories according to the N mask matrixes;
determining the pixel category of each pixel of the image to be recognized according to the matching probability set of each pixel of the image to be recognized, and taking the pixel category of each pixel of the image to be recognized as the image recognition result of the image to be recognized.
In one embodiment, the image to be recognized is a retina image, and the image recognition result includes an interest region where an interest object in the retina image is located and an object attribute of the interest object.
In one embodiment, the image to be recognized is a natural image in a non-medical field, and the image recognition result includes an image category of the natural image.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the image processing method in the embodiment corresponding to fig. 3 to 8l, and may also perform the description of the image processing apparatus 1 in the embodiment corresponding to fig. 9, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer storage medium, and the computer storage medium stores the aforementioned computer program executed by the image processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the image processing method in the embodiment corresponding to fig. 3 to 8l can be performed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium referred to in the present application, reference is made to the description of the embodiments of the method of the present application. By way of example, program instructions may be deployed to be executed on one computer device or on multiple computer devices at one site or distributed across multiple sites and interconnected by a communication network, and the multiple computer devices distributed across the multiple sites and interconnected by the communication network may be combined into a blockchain network.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device can execute the method in the embodiment corresponding to fig. 3 to 8l, which will not be described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not to be construed as limiting its scope; equivalent variations and modifications made within the scope of the claims of the present application remain covered by the present application.

Claims (14)

1. An image processing method, comprising:
acquiring an image to be identified, and extracting image instance features of the image to be identified, wherein the image instance features comprise N original feature maps, any feature map pixel of any original feature map corresponds to an instance of the image to be identified, N is a positive integer, and N is equal to the number of categories of the final classification;
extracting K local key instance features under K scales from the N original feature maps, and superposing the K local key instance features into multi-scale instance features of the image to be identified, wherein K is a positive integer; aiming at the ith scale under the K scales, i is more than or equal to 1 and less than or equal to K, the process of extracting the local key example features under the ith scale from the N original feature maps comprises the following steps: setting a polling priority for each original feature map, and determining a target original feature map for current polling from N original feature maps according to the polling priority; determining unit local key instance characteristics according to the target original characteristic graph and the scale i; when all the original feature maps are determined as target original feature maps, stopping polling, and combining N unit local key instance features into local key instance features under the ith scale; the unit local key example feature is an example feature which is obtained by splicing a plurality of unit target feature maps and has the same size with the target original feature map, and the unit target feature map is obtained by adjusting pixel values of non-overlapped feature image pixels which are divided according to the size of a sliding window in the target original feature map;
extracting global example weight characteristics of the image to be identified from the N original characteristic graphs;
and identifying the multi-scale example features and the global example weight features to obtain an image identification result of the image to be identified.
2. The method according to claim 1, wherein the determining unit local key instance features according to the target original feature map and a scale i comprises:
acquiring the size of a sliding window, and dividing the target original feature map into a plurality of unit original feature maps according to the size of the sliding window;
respectively adjusting the pixel value of the feature image pixel of each unit original feature map according to the scale i to obtain a plurality of unit target feature maps;
and splicing a plurality of unit target feature maps into the unit local key instance features.
3. The method according to claim 2, wherein the step of adjusting the pixel value of the feature image pixel of any unit raw feature map in the plurality of unit raw feature maps according to the scale i to obtain the unit target feature map corresponding to the unit raw feature map comprises:
according to the pixel value of the feature image pixel of any unit original feature map, sorting the feature image pixel of any unit original feature map in a descending order;
taking the first i characteristic image pixels in the descending order as reserved characteristic image pixels, and taking the characteristic image pixels except the reserved characteristic image pixels as the characteristic image pixels to be adjusted in all the characteristic image pixels of the original characteristic image of any unit;
adjusting the pixel value of the feature image pixel to be adjusted to be a pixel threshold value;
and taking any unit original feature map after the pixel value is adjusted as a unit target feature map corresponding to the unit original feature map.
4. The method according to claim 1, wherein the extracting global instance weight features of the image to be recognized from the N original feature maps comprises:
performing convolution processing on the N original characteristic graphs to obtain a convolution matrix, wherein the size of the convolution matrix is the same as that of any original characteristic graph;
and activating the convolution matrix to obtain the global example weight characteristics of the image to be identified.
5. The method according to claim 1, wherein the identifying the multi-scale instance features and the global instance weight features to obtain an image identification result of the image to be identified comprises:
fusing the multi-scale instance features and the global instance weight features into target fusion features of the image to be recognized;
and identifying the target fusion characteristics to obtain an image identification result of the image to be identified.
6. The method of claim 5, wherein the multi-scale instance features comprise N scale feature maps, any scale feature map has the same size as any original feature map, and the feature size of the global instance weight feature has the same size as any scale feature map;
the fusing the multi-scale instance features and the global instance weight features into target fusion features of the image to be recognized comprises:
and performing product operation on the global instance weight characteristics and each scale characteristic graph to obtain N fusion characteristic graphs, and combining the N fusion characteristic graphs into the target fusion characteristics of the image to be identified.
7. The method according to claim 5, wherein the identifying the target fusion feature to obtain an image identification result of the image to be identified comprises:
superposing the target fusion features into features to be activated, and activating the features to be activated to obtain a matching probability set between the images to be identified and N image categories;
and determining the image category of the image to be recognized according to the matching probability set, and taking the image category of the image to be recognized as the image recognition result of the image to be recognized.
8. The method of claim 7, wherein the target fusion feature comprises N fusion feature maps;
the overlaying the target fusion feature as a feature to be activated includes:
respectively superposing the pixel values of all the feature image pixels of each fusion feature map into feature values to be activated, and combining the N feature values to be activated into the features to be activated; or, alternatively,
and respectively taking the pixel average value of all the feature image pixels of each fusion feature map as a feature value to be activated, and combining the N feature values to be activated into the feature to be activated.
9. The method of claim 5, wherein the target fusion feature comprises N fusion feature maps;
the identifying the target fusion features to obtain the image identification result of the image to be identified comprises the following steps:
performing interpolation processing on each fusion characteristic graph to obtain N mask matrixes with the same size as the image to be identified, and determining a matching probability set between each pixel of the image to be identified and N pixel categories according to the N mask matrixes;
determining the pixel category of each pixel of the image to be recognized according to the matching probability set of each pixel of the image to be recognized, and taking the pixel category of each pixel of the image to be recognized as the image recognition result of the image to be recognized.
10. The method according to claim 1, wherein the image to be recognized is a retinal image, and the image recognition result includes an interest region where an object of interest in the retinal image is located and an object attribute of the object of interest.
11. The method according to claim 1, wherein the image to be recognized is a natural image of a non-medical field, and the image recognition result includes an image category of the natural image.
12. An image processing apparatus characterized by comprising:
the image recognition system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring an image to be recognized and extracting image instance characteristics of the image to be recognized, the image instance characteristics comprise N original characteristic maps, any characteristic image pixel of any original characteristic map corresponds to an instance of the image to be recognized, N is a positive integer, and N is equal to the number of categories of final classification;
the first extraction module is used for extracting K local key example features under K scales from the N original feature maps;
the superposition module is used for superposing K local key example features into the multi-scale example features of the image to be identified, wherein K is a positive integer; aiming at the ith scale under the K scales, i is more than or equal to 1 and less than or equal to K, the process of extracting the local key example features under the ith scale from the N original feature maps comprises the following steps: setting a polling priority for each original feature map, and determining a target original feature map for current polling from N original feature maps according to the polling priority; determining unit local key instance characteristics according to the target original characteristic graph and the scale i; when all the original feature maps are determined as target original feature maps, stopping polling, and combining N unit local key instance features into local key instance features under the ith scale; the unit local key example feature is an example feature which is obtained by splicing a plurality of unit target feature maps and has the same size with the target original feature map, and the unit target feature map is obtained by adjusting pixel values of non-overlapped feature image pixels which are divided according to the size of a sliding window in the target original feature map;
the second extraction module is used for extracting global example weight characteristics of the image to be identified from the N original characteristic graphs;
and the identification module is used for identifying the multi-scale example characteristics and the global example weight characteristics to obtain an image identification result of the image to be identified.
13. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1-11.
14. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any one of claims 1-11.
CN202011264341.1A 2020-11-12 2020-11-12 Image processing method, image processing apparatus, computer device, and storage medium Active CN112257728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011264341.1A CN112257728B (en) 2020-11-12 2020-11-12 Image processing method, image processing apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011264341.1A CN112257728B (en) 2020-11-12 2020-11-12 Image processing method, image processing apparatus, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN112257728A CN112257728A (en) 2021-01-22
CN112257728B true CN112257728B (en) 2021-08-17

Family

ID=74265827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011264341.1A Active CN112257728B (en) 2020-11-12 2020-11-12 Image processing method, image processing apparatus, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN112257728B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332530A (en) * 2021-12-22 2022-04-12 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN114299522B (en) * 2022-01-10 2023-08-29 北京百度网讯科技有限公司 Image recognition method device, apparatus and storage medium
CN115131612A (en) * 2022-07-02 2022-09-30 哈尔滨理工大学 Retina OCT image classification method based on recursive residual error network
CN115630307B (en) * 2022-11-29 2023-03-21 中国中医科学院中医药信息研究所 Syndrome diagnosis method and device based on cross-view angle and attention convergence
CN116012626B (en) * 2023-03-21 2023-06-30 腾讯科技(深圳)有限公司 Material matching method, device, equipment and storage medium for building elevation image

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514432B (en) * 2012-06-25 2017-09-01 诺基亚技术有限公司 Face feature extraction method, equipment and computer program product
US11586880B2 (en) * 2018-08-28 2023-02-21 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for multi-horizon time series forecasting with dynamic temporal context learning
CN113569798A (en) * 2018-11-16 2021-10-29 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN111460876B (en) * 2019-06-05 2021-05-25 北京京东尚科信息技术有限公司 Method and apparatus for identifying video
CN111402258A (en) * 2020-03-12 2020-07-10 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989336A (en) * 2015-02-13 2016-10-05 中国科学院西安光学精密机械研究所 Scene identification method based on deconvolution deep network learning with weight
CN108742627A (en) * 2018-06-25 2018-11-06 重庆知遨科技有限公司 A kind of detection device based on the classification of brain Medical image fusion
CN110298387A (en) * 2019-06-10 2019-10-01 天津大学 Incorporate the deep neural network object detection method of Pixel-level attention mechanism
US10482603B1 (en) * 2019-06-25 2019-11-19 Artificial Intelligence, Ltd. Medical image segmentation using an integrated edge guidance module and object segmentation network
CN110598715A (en) * 2019-09-04 2019-12-20 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and readable storage medium
CN111242037A (en) * 2020-01-15 2020-06-05 华南理工大学 Lane line detection method based on structural information
CN111652887A (en) * 2020-05-13 2020-09-11 腾讯科技(深圳)有限公司 Image segmentation model training method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Attentive CT Lesion Detection Using Deep Pyramid Inference with Multi-Scale Booster";Qingbin Shao et al.;《arXiv》;20190709;全文 *
"Diffculty-aware Glaucoma Classification with Multi-Rater Consensus Modeling";Shuang Yu et al.;《arXiv》;20200729;全文 *
"基于加权多尺度张量子空间的人脸图像特征提取方法";王仕民等;《数据采集与处理》;20161231;全文 *

Also Published As

Publication number Publication date
CN112257728A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112257728B (en) Image processing method, image processing apparatus, computer device, and storage medium
EP3933693B1 (en) Object recognition method and device
Abdullah et al. Facial expression recognition based on deep learning convolution neural network: A review
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN110796199B (en) Image processing method and device and electronic medical equipment
EP3989112A1 (en) Human body attribute recognition method and apparatus, electronic device and storage medium
CN111291604A (en) Face attribute identification method, device, storage medium and processor
CN106650619A (en) Human action recognition method
CN111652247A (en) Diptera insect identification method based on deep convolutional neural network
US20220254134A1 (en) Region recognition method, apparatus and device, and readable storage medium
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN110222718A (en) The method and device of image procossing
CN112446302A (en) Human body posture detection method and system, electronic equipment and storage medium
CN110765882A (en) Video tag determination method, device, server and storage medium
CN115620384B (en) Model training method, fundus image prediction method and fundus image prediction device
CN111797811B (en) Blind person navigation system based on image understanding
CN111291700A (en) Face attribute identification method, device and equipment and readable storage medium
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN114677754A (en) Behavior recognition method and device, electronic equipment and computer readable storage medium
CN114677730A (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN112580572A (en) Training method of multi-task recognition model, using method, equipment and storage medium
Gangonda et al. VGHN: variations aware geometric moments and histogram features normalization for robust uncontrolled face recognition
Shukla et al. An Efficient Approach of Face Detection and Prediction of Drowsiness Using SVM
Rafiq et al. Real-time vision-based bangla sign language detection using convolutional neural network
KR20180092453A (en) Face recognition method Using convolutional neural network and stereo image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40037377; Country of ref document: HK)
GR01 Patent grant