CN113537350A - Image processing method and device, electronic equipment and storage medium

Info

Publication number: CN113537350A
Application number: CN202110806356.4A
Authority: CN
Inventors: 岳晓宇; 旷章辉; 张伟; 林达华
Original assignee: Bozhi Perceptual Interaction Research Center Co ltd; Sensetime Group Ltd
Current assignee: Bozhi Perceptual Interaction Research Center Co ltd; Sensetime Group Ltd
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2021-10-22
Anticipated expiration: 2041-07-16
Also published as: CN113537350B

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium. The method includes: performing feature extraction on an image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed includes a target object; sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features; and classifying the target object based on the sampling features to obtain a classification result.

Description

Image processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.

Background

Image recognition can effectively determine the category of an object in an image and has important applications in scenarios such as autonomous driving and intelligent review.

However, related image recognition methods often recognize the category of an object based on image features within a fixed range and ignore how content information is distributed within the image itself, which degrades the recognition and classification results.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure proposes an image processing scheme.

According to an aspect of the present disclosure, there is provided an image processing method including:

performing feature extraction on an image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed includes a target object; sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features; and classifying the target object based on the sampling features to obtain a classification result.

In a possible implementation, sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features includes: determining, according to the position of the target object in the image to be processed, a target sampling position for sampling the feature map; and sampling the feature map according to the target sampling position to obtain the sampling features.

In a possible implementation, determining, according to the position of the target object in the image to be processed, a target sampling position for sampling the feature map includes: performing at least one iterative sampling on the feature map to obtain the target sampling position matched with the position of the target object in the image to be processed.

In a possible implementation, in response to performing at least two iterative samplings on the feature map, performing a t-th iterative sampling on the feature map includes: performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling, where t is an integer greater than 1; updating the intermediate sampling position after the (t-1)-th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling; and taking the intermediate sampling position after the t-th iterative sampling as the target sampling position when t reaches a preset number of iterations.

In a possible implementation, updating the intermediate sampling position after the (t-1)-th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling includes: performing position offset prediction according to the intermediate sampling feature after the t-th iterative sampling to generate a position offset; and updating the intermediate sampling position after the (t-1)-th iterative sampling according to the position offset to obtain the intermediate sampling position after the t-th iterative sampling.

In a possible implementation, performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling includes: extracting feature vectors from the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain an intermediate extracted feature of the t-th iterative sampling; fusing the intermediate extracted feature of the t-th iterative sampling, the intermediate sampling feature after the (t-1)-th iterative sampling, and the feature vector corresponding to the intermediate sampling position after the (t-1)-th iterative sampling to obtain an intermediate fused feature of the t-th iterative sampling; and performing a self-attention-based feature encoding transformation on the intermediate fused feature of the t-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling.

In a possible implementation, the first iterative sampling of the feature map includes: extracting feature vectors from the feature map according to a preset initial sampling position to obtain an intermediate extracted feature of the first iterative sampling; fusing the intermediate extracted feature of the first iterative sampling and the feature vector corresponding to the initial sampling position to obtain an intermediate fused feature of the first iterative sampling; performing a self-attention-based feature encoding transformation on the intermediate fused feature of the first iterative sampling to obtain the intermediate sampling feature after the first iterative sampling; and updating the initial sampling position according to the intermediate sampling feature after the first iterative sampling to obtain the intermediate sampling position after the first iterative sampling.

In a possible implementation, classifying the target object based on the sampling features to obtain a classification result includes: performing a self-attention-based feature encoding transformation on the sampling features to obtain transformed sampling features; and performing classification according to the transformed sampling features to obtain the classification result of the target object.

In a possible implementation, performing a self-attention-based feature encoding transformation on the sampling features to obtain transformed sampling features includes: acquiring a plurality of sampling feature vectors of the sampling features; determining a fusion weight for each of the plurality of sampling feature vectors according to the similarity among the plurality of sampling feature vectors; performing weighted fusion on the plurality of sampling feature vectors according to their fusion weights to obtain a weighted fusion result; and generating the transformed sampling features according to the weighted fusion result.
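By way of a non-limiting illustration, the similarity-based weighted fusion described above may be sketched as follows. The sketch assumes scaled dot-product similarity and softmax normalization of the fusion weights, neither of which is specified in the text above; the function names `softmax` and `self_attention_fuse` are hypothetical, and the learned query, key, and value projections of a practical self-attention layer are omitted.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(vectors):
    """Fuse sampling feature vectors using similarity-derived weights.

    Assumption: similarity is a scaled dot product and the fusion weights
    are its softmax; the source text does not fix either choice.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    fused = []
    for query in vectors:
        # Similarity between this vector and every sampling feature vector.
        sims = [sum(a * b for a, b in zip(query, key)) / scale for key in vectors]
        weights = softmax(sims)
        # Weighted fusion of all sampling feature vectors.
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return fused
```

Each output vector is a convex combination of the input vectors, so vectors that are more similar to a given sampling feature vector contribute more to its transformed version.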

In a possible implementation, the method is implemented by a target neural network, the target neural network including: a feature extraction network, configured to perform feature extraction on the image to be processed to obtain the feature map of the image to be processed; a progressive sampling module, configured to sample the feature map according to the position of the target object in the image to be processed to obtain the sampling features; and a classification network module, configured to classify the target object based on the sampling features to obtain the classification result, wherein the classification network module includes a visual transformer network and a classification network connected in sequence.

According to an aspect of the present disclosure, there is provided an image processing apparatus including:

a feature extraction module, configured to perform feature extraction on an image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed includes a target object; a sampling module, configured to sample the feature map according to the position of the target object in the image to be processed to obtain sampling features; and a classification module, configured to classify the target object based on the sampling features to obtain a classification result.

In a possible implementation, the sampling module is configured to: determine, according to the position of the target object in the image to be processed, a target sampling position for sampling the feature map; and sample the feature map according to the target sampling position to obtain the sampling features.

In a possible implementation, the sampling module is further configured to: perform at least one iterative sampling on the feature map to obtain the target sampling position matched with the position of the target object in the image to be processed.

In a possible implementation, in response to performing at least two iterative samplings on the feature map, the sampling module is further configured to: perform the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling, where t is an integer greater than 1; update the intermediate sampling position after the (t-1)-th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling; and take the intermediate sampling position after the t-th iterative sampling as the target sampling position when t reaches a preset number of iterations.

In a possible implementation, the sampling module is further configured to: perform position offset prediction according to the intermediate sampling feature after the t-th iterative sampling to generate a position offset; and update the intermediate sampling position after the (t-1)-th iterative sampling according to the position offset to obtain the intermediate sampling position after the t-th iterative sampling.

In a possible implementation, the sampling module is further configured to: extract feature vectors from the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain an intermediate extracted feature of the t-th iterative sampling; fuse the intermediate extracted feature of the t-th iterative sampling, the intermediate sampling feature after the (t-1)-th iterative sampling, and the feature vector corresponding to the intermediate sampling position after the (t-1)-th iterative sampling to obtain an intermediate fused feature of the t-th iterative sampling; and perform a self-attention-based feature encoding transformation on the intermediate fused feature of the t-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling.

In a possible implementation, the sampling module is further configured to: extract feature vectors from the feature map according to a preset initial sampling position to obtain an intermediate extracted feature of the first iterative sampling; fuse the intermediate extracted feature of the first iterative sampling and the feature vector corresponding to the initial sampling position to obtain an intermediate fused feature of the first iterative sampling; perform a self-attention-based feature encoding transformation on the intermediate fused feature of the first iterative sampling to obtain the intermediate sampling feature after the first iterative sampling; and update the initial sampling position according to the intermediate sampling feature after the first iterative sampling to obtain the intermediate sampling position after the first iterative sampling.

In a possible implementation, the classification module is configured to: perform a self-attention-based feature encoding transformation on the sampling features to obtain transformed sampling features; and perform classification according to the transformed sampling features to obtain the classification result of the target object.

In a possible implementation, the classification module is further configured to: acquire a plurality of sampling feature vectors of the sampling features; determine a fusion weight for each of the plurality of sampling feature vectors according to the similarity among the plurality of sampling feature vectors; perform weighted fusion on the plurality of sampling feature vectors according to their fusion weights to obtain a weighted fusion result; and generate the transformed sampling features according to the weighted fusion result.

In a possible implementation, the apparatus is implemented by a target neural network, the target neural network including: a feature extraction network, configured to perform feature extraction on the image to be processed to obtain the feature map of the image to be processed; a progressive sampling module, configured to sample the feature map according to the position of the target object in the image to be processed to obtain the sampling features; and a classification network module, configured to classify the target object based on the sampling features to obtain the classification result, wherein the classification network module includes a visual transformer network and a classification network connected in sequence.

According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

In the embodiments of the present disclosure, the feature map of the image to be processed is sampled according to the position of the target object in the image to be processed to obtain sampling features, and the target object is classified based on the sampling features to obtain a classification result. In this way, the image processing method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present disclosure can use the image content information of the image to be processed when sampling its feature map. As a result, the features of the target object contained in the sampling features are more complete and comprehensive, the correlation between the sampling features and the target object is improved, and the precision of the classification result obtained from the sampling features is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure.

Fig. 2 shows a flowchart of an image processing method according to an embodiment of the present disclosure.

Fig. 3 shows a schematic diagram of a target sampling position according to an embodiment of the present disclosure.

Fig. 4 shows a schematic diagram of iterative sampling according to an embodiment of the present disclosure.

Fig. 5 shows a flowchart of an image processing method according to an embodiment of the present disclosure.

Fig. 6 shows a schematic structural diagram of a feature encoding layer according to an embodiment of the present disclosure.

Fig. 7 shows a schematic structural diagram of a feature encoding layer according to an embodiment of the present disclosure.

Fig. 8 illustrates a block diagram of an image processing apparatus according to an embodiment of the present disclosure.

Fig. 9 shows a schematic diagram of an application example according to the present disclosure.

Fig. 10 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure.

Fig. 11 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality. For example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure, which may be applied to an image processing apparatus, which may be a terminal device, a server, or other processing device, etc. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like.

In some possible implementations, the image processing method may also be implemented by the processor calling computer readable instructions stored in the memory.

As shown in fig. 1, in one possible implementation, the image processing method may include:

Step S11: performing feature extraction on the image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed includes the target object.

The image to be processed may be any image that needs to be recognized or classified, and the target object included in the image to be processed may be any object to be recognized or classified in that image. The number of images to be processed in the embodiments of the present disclosure can be flexibly determined according to actual situations: one image may be processed at a time, or a plurality of images may be processed simultaneously. The number of target objects included in the image to be processed is likewise not limited in the embodiments of the present disclosure, and may be one or more.

The forms of the image to be processed and the target object can vary with the application scenario. For example, in an autonomous driving scenario, the image to be processed may be an image acquired during driving, and the target object may include various objects of concern during driving, such as pedestrians, license plates, zebra crossings, and traffic lights. In an intelligent review scenario, the image to be processed may be any of various images published on a website, and the target object may include various patterns or marks to be reviewed. In a face recognition scenario, the image to be processed may be an acquired face image, and the target object may include a face object in the image.

The feature map of the image to be processed may include feature information of the entire image to be processed, and the position of feature information in the feature map may correspond to the position of pixels in the image to be processed. The feature map may be obtained by performing feature extraction on the image to be processed. The feature extraction method is not limited in the embodiments of the present disclosure, nor is it limited to the following embodiments. In some possible implementations, the feature map may be obtained by performing convolution or similar processing on the image to be processed. In some possible implementations, the feature map may also be obtained by passing the image to be processed through any neural network having a feature extraction function, such as a convolutional neural network for feature extraction. In one example, to make effective use of the ability of convolution operators to extract context information from the image to be processed, the backbone network of ResNet50 together with the first two residual modules may be used as the feature extraction network to perform feature extraction on the image to be processed and obtain the feature map.
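By way of a non-limiting illustration, obtaining a feature map by convolution may be sketched as follows for a single-channel image and a single kernel. The function name `conv2d_valid` and the example kernel are hypothetical; a practical feature extraction network would stack many learned multi-channel convolutions, which this sketch does not attempt to reproduce.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image without padding and return a feature map.

    As in most neural network libraries, the kernel is not flipped, so this
    is strictly a cross-correlation.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kernel_h)
                for dj in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 3x3 image and 2x2 diagonal-difference kernel.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d_valid(image, kernel)
```

Each entry of the resulting feature map corresponds to a fixed window of the input, which is why positions in the feature map can be related back to positions in the image to be processed.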

Step S12: sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features.

The sampling feature may be feature information obtained by performing sampling processing on the feature map, for example, the sampling feature may include one or more feature vectors. As described in step S12, in the embodiment of the present disclosure, the sampling process may be performed according to the position of the target object in the image to be processed, and therefore, the sampling features obtained in the embodiment of the present disclosure may include feature information of the target object in a higher proportion, and the feature information of the target object may also be more complete and comprehensive.

How the feature map is sampled based on the position of the target object in the image to be processed can be flexibly determined according to actual situations. For example, a rough position of the target object in the image to be processed may be determined first, and the feature map may be sampled based on that rough position. Alternatively, an initial sampling position may be determined randomly; after the feature map is sampled based on the initial sampling position, the initial sampling position is updated according to the sampling result to obtain an updated position that better matches the position of the target object in the image to be processed. Various implementations of step S12 are described in the following embodiments and are not elaborated here.

Step S13: classifying the target object based on the sampling features to obtain a classification result.

The classification result may be the category to which the target object in the image to be processed belongs. For example, in an autonomous driving scenario, the classification result may indicate whether the target object belongs to a category such as pedestrian or traffic light; in an intelligent review scenario, the classification result may indicate that the target object is a sensitive object that needs to be masked; and in a face recognition scenario, the classification result may indicate that the target object is a face object with access permission.

The form of the classification result is not limited in the embodiment of the present disclosure, and may be a category of the target object, or a probability that the target object belongs to various preset categories, and the like, and may be flexibly determined according to an actual situation.

The manner of classifying the target object based on the sampling features is not limited in the embodiments of the present disclosure, nor is it limited to the following embodiments. In some possible implementations, the sampling features may be passed directly through a neural network layer having a classification function, such as a softmax layer, to classify the target object according to the sampling features. In some possible implementations, the sampling features may first undergo feature fusion and then be classified. For example, the sampling features may be fused by a self-attention-based Transformer network, and the fused sampling features may then be classified by a network layer such as softmax to obtain the classification result. Some possible implementations of step S13 are described in the following embodiments and are not elaborated here.
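By way of a non-limiting illustration, classifying sampling features with a linear layer followed by softmax may be sketched as follows. Averaging the sampling feature vectors before classification, as well as the function name `classify` and its `weights` and `biases` parameters, are assumptions made for this sketch and are not specified in the text above.

```python
import math


def classify(sampled_features, weights, biases):
    """Return (predicted class index, class probabilities).

    sampled_features: list of feature vectors (one per sampling point).
    weights, biases: parameters of a hypothetical linear classification layer,
    with one weight row and one bias per class.
    """
    count = len(sampled_features)
    dim = len(sampled_features[0])
    # Assumption: average the sampling feature vectors into one descriptor.
    pooled = [sum(vec[d] for vec in sampled_features) / count for d in range(dim)]
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    # Softmax converts logits into probabilities over the preset categories.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```

The returned pair covers both forms of classification result mentioned above: a single category, or the probability of the target object belonging to each preset category.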

Fig. 2 shows a flowchart of an image processing method according to an embodiment of the present disclosure, and as shown in the figure, in one possible implementation, step S12 may include:

Step S121: determining a target sampling position for sampling the feature map according to the position of the target object in the image to be processed.

The target sampling position may be a position at which a feature vector in the feature map is sampled, and the target sampling position may include position information of one or more sampling points in the feature map. As described in the foregoing disclosure, the position of the feature information in the feature map may correspond to the position of each pixel point in the image to be processed, and therefore, according to the position of the target object in the image to be processed, the target sampling position may be correspondingly determined to sample the feature map.

The target sampling position may include positions of one or more sampling points in the image to be processed, the position distribution of the sampling points may be flexibly determined according to the actual condition of the image to be processed, and in some possible implementation manners, the target sampling position determined according to the position of the target object in the image to be processed may include a plurality of non-uniformly distributed sampling points, so as to improve the pertinence and the accuracy of sampling the target object.

Fig. 3 is a schematic diagram of a target sampling position according to an embodiment of the present disclosure, and as shown in the drawing, in one example, according to a position of a target object, namely a cat, in an image to be processed, a plurality of sampling points belonging to or around the target object may be determined, and positions of the sampling points in the image to be processed may be used as target sampling positions for sampling a feature map.

The implementation of step S121 can be flexibly selected according to actual situations and is not limited to the following embodiments. In some possible implementations, the position of the target object in the image to be processed may be detected, and the positions of one or more sampling points may be determined as the target sampling position based on the detection result. For example, the position of a salient region containing the target object in the image to be processed may be obtained through saliency detection, a plurality of sampling points may be determined randomly or uniformly within the salient region, and the positions of these sampling points may be taken as the target sampling position.
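By way of a non-limiting illustration, determining uniformly distributed sampling points inside a detected salient region may be sketched as follows. The sketch assumes the salient region is available as an axis-aligned box `(x0, y0, x1, y1)`; the function name `uniform_points_in_box` is hypothetical, and the saliency detection itself is outside the scope of the sketch.

```python
def uniform_points_in_box(box, rows, cols):
    """Place rows * cols sampling points at the cell centers of a grid over box.

    box: (x0, y0, x1, y1), an assumed axis-aligned salient region.
    Returns a list of (x, y) positions in row-major order.
    """
    x0, y0, x1, y1 = box
    step_x = (x1 - x0) / cols
    step_y = (y1 - y0) / rows
    return [
        (x0 + (c + 0.5) * step_x, y0 + (r + 0.5) * step_y)
        for r in range(rows)
        for c in range(cols)
    ]


# Hypothetical salient region 4 units wide and 2 units tall, sampled on a 1x2 grid.
points = uniform_points_in_box((0.0, 0.0, 4.0, 2.0), rows=1, cols=2)
```

Random placement inside the same box would be an equally valid reading of the text; the grid is used here only because it is deterministic.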

In some possible implementations, an initial sampling position for sampling the feature map may also be determined, the feature map may be iteratively sampled multiple times starting from the initial sampling position, and the sampling position may be updated after each iteration based on the result of that iteration to obtain the target sampling position. The specific manner of iterative sampling and of updating the initial sampling position is described in the following embodiments and is not elaborated here.

Step S122: sampling the feature map according to the target sampling position to obtain sampling features.

The manner of sampling the feature map according to the target sampling position may be flexibly changed according to different implementation manners of step S121.

In some possible implementations, when the target sampling position is determined by saliency detection or a similar approach, the feature information at the target sampling position may be extracted directly from the feature map to obtain the sampling features. In some possible implementations, when the target sampling position is determined by iterative sampling or a similar approach, the sampling features may be obtained at the target sampling position in the same manner in which features are extracted from the feature map during each iterative sampling. The specific process is described in detail in the following embodiments and is not elaborated here.
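By way of a non-limiting illustration, extracting the feature vector at a target sampling position may be sketched as follows. Because sampling positions need not fall exactly on a cell of the feature map, the sketch assumes bilinear interpolation between the four nearest cells; the interpolation scheme and the function name `bilinear_sample` are assumptions and are not specified in the text above.

```python
def bilinear_sample(feature_map, x, y):
    """Sample a feature vector at a fractional (x, y) position.

    feature_map[row][col] is a list of channel values; x indexes columns and
    y indexes rows. Positions outside the map are clamped to its border.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)  # top-left neighbor (floor, since x and y are >= 0)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0  # fractional offsets inside the cell
    channels = len(feature_map[0][0])
    return [
        (1 - fx) * (1 - fy) * feature_map[y0][x0][c]
        + fx * (1 - fy) * feature_map[y0][x1][c]
        + (1 - fx) * fy * feature_map[y1][x0][c]
        + fx * fy * feature_map[y1][x1][c]
        for c in range(channels)
    ]


# Hypothetical 2x2 feature map with one channel.
fmap = [[[0.0], [1.0]], [[2.0], [3.0]]]
center = bilinear_sample(fmap, 0.5, 0.5)
```

Applying this function to every sampling point of the target sampling position yields one feature vector per point, which together form the sampling features.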

According to the embodiment of the disclosure, a target sampling position matched with the position of a target object in an image to be processed can be determined through various flexible modes such as target object detection or iterative sampling, and based on the target sampling position, sampling processing is performed on a feature map to obtain a sampling feature, on one hand, attention to a background part feature in the image to be processed can be reduced, correlation between the sampling feature and the target object is improved, and by performing feature sampling through the target sampling position, possible cutting of the target object feature can be reduced, and the integrity of the target object feature in the sampling feature is improved; on the other hand, the sampling flexibility can be improved, and the classification flexibility and the classification practicability are improved while the precision of the classification method is improved.

As described in the above embodiments, the target sampling position may be determined by iterative sampling. Therefore, in a possible implementation, step S121 may include: performing at least one iterative sampling on the feature map to obtain a target sampling position matching the position of the target object in the image to be processed.

The number of iterative samplings may be flexibly determined according to the actual situation and may be one or more, which is not limited in the embodiments of the present disclosure. When the feature map is sampled iteratively, each iteration may use the same sampling procedure, and each iteration may be based on the result of the previous one, so that the samplings are chained into an iteration.

By performing at least one iterative sampling on the feature map to obtain a target sampling position matching the position of the target object in the image to be processed, the target sampling position can be adaptively and progressively corrected, which improves the classification precision conveniently and efficiently.

In one possible implementation, the feature map may be sampled iteratively at least twice, in which case the t-th iterative sampling of the feature map may include:

performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the t-1-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling, wherein t is an integer greater than 1;

updating the intermediate sampling position after the t-1 th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling;

and taking the intermediate sampling position after the t-th iterative sampling as the target sampling position in the case that t reaches the preset number of iterations.

The intermediate sampling position may be the position, determined after each iterative sampling, at which the feature map is to be sampled next, and the intermediate sampling feature may be the feature information obtained by sampling the feature map at an intermediate sampling position. The preset number of iterations is a preset threshold on the number of iterative samplings. Its specific value is not limited in the embodiments of the present disclosure and can be flexibly selected according to the actual situation; for example, the preset number of iterations may be 3 to 10.

Through this process, an intermediate sampling feature is obtained each time the feature map is iteratively sampled, and it is used to update the intermediate sampling position from the previous iteration to give the intermediate sampling position after the current iteration. If the number of iterative samplings has not reached the preset number of iterations, the feature map is sampled again at the intermediate sampling position just obtained, producing a new intermediate sampling feature with which the position is updated once more. If the number of iterative samplings has reached the preset number of iterations, the intermediate sampling position obtained after the current iteration is considered to approximate the position of the target object in the image to be processed, so the iterative process can end and that position is taken as the target sampling position.

Through the embodiments of the present disclosure, the intermediate sampling position determined in each iterative sampling is updated continuously over multiple iterations, so that it moves closer and closer to the position of the target object in the image to be processed and a target sampling position matching the position of the target object is obtained. The sampling points in the target sampling position may be non-uniformly distributed; for example, the spacing between sampling points may differ. The sampling position can thus be corrected adaptively and continuously using the content information of the image to be processed, giving a target sampling position that attends more closely to the target object and improving classification precision. The approach is also easy to combine with various classification methods, which broadens its range of application and its practicability.

As described in the embodiments of the disclosure, the intermediate sampling position after the last iterative sampling may be updated according to the intermediate sampling feature after each iterative sampling. The updating mode may be flexibly selected according to an actual situation, for example, in a possible implementation mode, the intermediate sampling position may be directly predicted according to the intermediate sampling feature after the current iterative sampling, and the predicted intermediate sampling position is corrected based on the intermediate sampling position after the last iterative sampling, so as to obtain the updated intermediate sampling position after the current iterative sampling.

In a possible implementation manner, updating the intermediate sampling position after the t-1 th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling may include:

performing position offset prediction according to the intermediate sampling feature after the t-th iterative sampling to generate a position offset;

and updating the intermediate sampling position after the t-1 th iterative sampling according to the position offset to obtain the intermediate sampling position after the t-th iterative sampling.

The position offset may be a displacement vector. The direction and distance by which the intermediate sampling position after the (t-1)-th iterative sampling is to be adjusted can be determined from the position offset, and adjusting that position by this direction and distance yields the intermediate sampling position after the t-th iterative sampling.

The manner of predicting the position offset is not limited in the embodiments of the present disclosure and may be flexibly determined according to the actual situation. In a possible implementation, the intermediate sampling feature after the t-th iterative sampling may be input to a fully connected (FC) layer for processing, and the position offset is obtained as the output of the fully connected layer.
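
The offset prediction and position update can be sketched in NumPy as follows. This is a minimal illustration, assuming that every sampling point carries its own c-dimensional intermediate sampling feature and that the fully connected layer is a single affine map to a 2-D offset. The names `predict_offsets`, `update_positions`, `weight` and `bias` and all numeric values are hypothetical and do not come from the disclosure.

```python
import numpy as np

def predict_offsets(features, weight, bias):
    """Fully connected layer: map (n, c) features to (n, 2) position offsets."""
    return features @ weight + bias

def update_positions(positions, offsets):
    """Shift every sampling point by its predicted offset vector."""
    return positions + offsets

# Two sampling points with 3-dimensional features (toy values).
features = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 0.0]])
weight = np.array([[0.5, 0.0],
                   [0.0, -1.0],
                   [0.25, 0.25]])   # shape (c, 2); learned in practice
bias = np.zeros(2)
positions = np.array([[4.0, 4.0],
                      [8.0, 8.0]])  # positions after the previous iteration

offsets = predict_offsets(features, weight, bias)
new_positions = update_positions(positions, offsets)
```

Each row of `offsets` gives the direction and distance by which the corresponding sampling point moves before the next iteration.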

Through the embodiments of the present disclosure, each iterative sampling is performed at the intermediate sampling position obtained by the previous iteration, the offset of the intermediate sampling position is predicted from the intermediate sampling feature obtained in the current iteration, and the intermediate sampling position of the previous iteration is corrected accordingly to obtain a more accurate target sampling position. Repeated iterative sampling thus makes full use of the data obtained in each iteration and corrects the sampling position step by step, improving the accuracy of the target sampling position. At the same time, computing resources are reused and used more efficiently, which reduces the overall processing cost of the image processing method.

In some possible implementations, the intermediate sampled feature may be a feature obtained by directly sampling the feature map based on the intermediate sampling position. In some possible implementations, the intermediate sampling feature may also be fused with other feature information, such as an intermediate sampling feature obtained before the current iteration sampling. The sampling mode in the iterative sampling can also be flexibly changed along with the difference of the characteristic information contained in the intermediate sampling characteristic.

Therefore, in a possible implementation, performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling may include:

extracting a feature vector in the feature map according to the intermediate sampling position after the t-1 th iterative sampling to obtain intermediate extraction features of the t-th iterative sampling;

fusing the intermediate extraction features of the t-th iterative sampling, the intermediate sampling features after the t-1 th iterative sampling and the feature vectors corresponding to the intermediate sampling positions after the t-1 th iterative sampling to obtain intermediate fusion features of the t-th iterative sampling;

and performing self-attention-based feature coding transformation according to the intermediate fusion features of the t-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling.

The intermediate extraction features may be feature information formed by feature vectors located at intermediate sampling positions in the feature map, and in each iterative sampling process, the feature vectors at corresponding positions in the feature map may be extracted based on the intermediate sampling position after the last iterative sampling, so as to obtain intermediate extraction features.
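
Extracting feature vectors at the sampling positions can be sketched as below. The disclosure does not state how positions that fall between grid cells are handled; bilinear interpolation is assumed here as one common choice. The function name `sample_feature_map`, the (y, x) coordinate convention and the clipping at the border are likewise assumptions for illustration.

```python
import numpy as np

def sample_feature_map(feature_map, positions):
    """Bilinearly sample a (c, h, w) feature map at (n, 2) positions given as (y, x).

    Returns an (n, c) array holding one feature vector per sampling point.
    Positions outside the map are clipped to the border (an assumption).
    """
    c, h, w = feature_map.shape
    y = np.clip(positions[:, 0], 0, h - 1)
    x = np.clip(positions[:, 1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = y - y0                      # vertical interpolation weight
    wx = x - x0                      # horizontal interpolation weight
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T

# Toy feature map: 2 channels on a 2 x 2 grid.
feature_map = np.arange(8.0).reshape(2, 2, 2)
positions = np.array([[0.5, 0.5],    # centre of the grid
                      [1.0, 0.0],    # exactly on a grid cell
                      [5.0, 5.0]])   # outside the map, clipped to (1, 1)
sampled = sample_feature_map(feature_map, positions)
```

Because the sampling is continuous in the position, non-integer and non-uniformly spaced sampling points can be used directly.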

The intermediate fusion feature may be feature information obtained by fusing the intermediate extraction feature with other features, where the type of the other features may be flexibly determined according to the actual situation. In a possible implementation, the other features may include the intermediate sampling feature obtained after the last iterative sampling and/or a feature vector corresponding to the intermediate sampling position after the last iterative sampling. The feature vector corresponding to the intermediate sampling position after the last iterative sampling may be a feature vector obtained by position-encoding that intermediate sampling position with a fully connected layer or another network layer.

The fusion mode is not limited in the embodiment of the present disclosure, and may be directly adding the above features, or performing weighted addition according to a certain preset weight, and the like, and may be flexibly selected according to an actual situation.
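
The two fusion options named above, direct addition and weighted addition, can be sketched as follows, together with a position encoding by a single linear layer. The helper names `encode_positions` and `fuse_features`, the projection matrix `w_pos` and the example weights are hypothetical; the sketch assumes all fused features share the same (n, c) shape.

```python
import numpy as np

def encode_positions(positions, w_pos):
    """Project (n, 2) sampling positions to (n, c) feature vectors with a linear layer."""
    return positions @ w_pos

def fuse_features(features, weights=None):
    """Fuse same-shaped feature arrays by direct or weighted addition."""
    stacked = np.stack(features)                      # (k, n, c)
    if weights is None:
        return stacked.sum(axis=0)                    # direct addition
    w = np.asarray(weights, dtype=stacked.dtype).reshape(-1, 1, 1)
    return (w * stacked).sum(axis=0)                  # weighted addition

extracted = np.ones((2, 3))                 # intermediate extraction features
previous = 2.0 * np.ones((2, 3))            # intermediate sampling features of the last iteration
positions = np.array([[1.0, 2.0], [3.0, 4.0]])
w_pos = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])         # learned in practice
encoded = encode_positions(positions, w_pos)

fused_direct = fuse_features([extracted, previous, encoded])
fused_weighted = fuse_features([extracted, previous, encoded], weights=(1.0, 0.5, 0.0))
```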

In a possible implementation manner, in order to better learn more comprehensive feature information of a target object through iterative sampling, the obtained intermediate fusion features may be subjected to feature coding transformation based on self-attention, so as to obtain intermediate sampling features after the iterative sampling.

In one possible implementation, the intermediate fusion features may be passed through a feature coding network layer of a Transformer (Transformer Encoder Layer), which fuses the feature information in the intermediate fusion features globally and thereby implements the self-attention-based feature coding transformation. How the feature coding network layer performs this transformation is detailed in the embodiments below and is not expanded here.

Through the embodiments of the present disclosure, in each iterative sampling, the feature vectors sampled from the feature map at the intermediate sampling position after the previous iteration are fused with the intermediate sampling feature after the previous iteration and with the feature vector corresponding to that intermediate sampling position. This yields an intermediate sampling feature that is more comprehensive and more closely tied to the previous sampling results, which improves the precision of the target sampling position finally obtained and, in turn, the classification precision and effect.

When the first iterative sampling is performed on the feature map, that is, when the iteration count is 1, no intermediate sampling position or intermediate sampling feature from a previous iteration is available. Therefore, in a possible implementation, in response to performing a single iterative sampling on the feature map, or to performing the first of multiple iterative samplings, the first iterative sampling on the feature map may include:

extracting a feature vector in the feature map according to a preset initial sampling position to obtain an intermediate extraction feature of the first iterative sampling;

fusing the intermediate extraction features of the first iterative sampling and the feature vectors corresponding to the initial sampling positions to obtain intermediate fusion features of the first iterative sampling;

performing self-attention-based feature coding transformation according to the intermediate fusion features of the first iterative sampling to obtain the intermediate sampling feature after the first iterative sampling;

and updating the initial sampling position according to the intermediate sampling feature after the first iterative sampling to obtain the intermediate sampling position after the first iterative sampling.

The initial sampling position may be flexibly set according to the actual situation and is not limited in the embodiments of the present disclosure. In one possible implementation, the initial sampling position may be randomly generated. In some possible implementations, a plurality of sampling points that sample the feature map uniformly may be set, and the positions of these sampling points used as the initial sampling position. In some possible implementations, the position of the target object may be roughly determined by saliency detection or the like and used as the initial sampling position.
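
Two of these initialization options, uniform sampling points and random sampling points, can be sketched as follows. The function names, the (y, x) ordering and the choice of placing uniform points at the centres of equal-sized cells are assumptions made for illustration.

```python
import numpy as np

def uniform_grid_positions(height, width, n_rows, n_cols):
    """Return (n_rows * n_cols, 2) evenly spaced (y, x) positions at cell centres."""
    ys = (np.arange(n_rows) + 0.5) * height / n_rows
    xs = (np.arange(n_cols) + 0.5) * width / n_cols
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_y.ravel(), grid_x.ravel()], axis=1)

def random_positions(height, width, count, seed=0):
    """Return (count, 2) positions drawn uniformly over the feature map."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low=[0.0, 0.0], high=[height, width], size=(count, 2))

grid = uniform_grid_positions(height=4, width=4, n_rows=2, n_cols=2)
scattered = random_positions(height=4, width=4, count=5)
```

Starting from a uniform grid gives every region of the feature map one sampling point, and the iterative updates then move the points toward the target object.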

For the implementation forms of the intermediate extraction features and the intermediate fusion features, reference may be made to the above-mentioned embodiments, which are not described herein again.

The process of extracting the feature vector in the feature map according to the initial sampling position may refer to the above manner of extracting the feature vector according to the intermediate sampling position, and is not described herein again.

In the first iterative sampling, no intermediate sampling feature from a previous iteration is available. Therefore, in a possible implementation, the intermediate extraction feature obtained at the initial sampling position may be fused only with the feature vector corresponding to the initial sampling position to obtain the intermediate fusion feature. For the form of the feature vector corresponding to the initial sampling position, reference may be made to the feature vector corresponding to the intermediate sampling position in the above embodiments, and details are not repeated here.

Similarly, the manner of performing feature coding transformation based on self-attention according to the intermediate fusion feature to obtain the intermediate sampling feature, and the manner of obtaining the intermediate sampling position based on the intermediate sampling feature may also refer to the above-mentioned embodiments, and are not described in detail herein.

Through the embodiments of the present disclosure, the first iterative sampling can be carried out from the preset initial sampling position, which initializes the iterative sampling process, allows the subsequent iterations to proceed, and improves the feasibility of the overall image processing method.

FIG. 4 is a schematic diagram of iterative sampling according to an embodiment of the disclosure. As shown in the figure, after t-1 iterative samplings have been performed on a feature map F, an intermediate sampling feature T_{t-1} and an intermediate sampling position p_t are obtained.

During the t-th iterative sampling, the feature vectors in the feature map F may first be extracted at the intermediate sampling position p_t by the following formula (1) to obtain the intermediate extraction feature T_t':

T_t' = F(p_t)    (1)

The intermediate extraction feature T_t' may then be fused, by formula (2), with the intermediate sampling feature T_{t-1} after the (t-1)-th iterative sampling and with the feature vector P_t to obtain the intermediate fusion feature X_t, where the feature vector P_t may be the feature vector obtained by position-encoding the intermediate sampling position p_t. The encoding process can be represented by the following formula (3):

P_t = W_t p_t    (3)

where W_t is the linear transformation that projects the intermediate sampling position p_t to the feature vector P_t. After the intermediate fusion feature X_t is obtained, X_t may be passed through a feature coding network layer of a Transformer (Transformer Encoder Layer) to implement the self-attention-based feature coding transformation, yielding the intermediate sampling feature T_t after the t-th iterative sampling. Further, the position offset o_t may be predicted from the intermediate sampling feature T_t by a fully connected layer based on the following formula (4):

o_t = M_t T_t, t ∈ {1, ..., N-1}    (4)

where M_t is a learnable linear transformation parameter for predicting the position offset, and N is the preset number of iterations. Based on the position offset o_t, the intermediate sampling position p_t after the (t-1)-th iterative sampling can be updated by the following formula (5) to obtain the intermediate sampling position p_{t+1} after the t-th iterative sampling:

p_{t+1} = p_t + o_t, t ∈ {1, ..., N-1}    (5)

By repeating the above process for the preset number of iterations N, the sampling feature T_N after N iterative samplings can be obtained for subsequent classification processing.
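
The index bookkeeping of formulas (1), (3), (4) and (5) over N iterations can be sketched as follows. This is an illustration under several assumptions: `sample` and `encode` are caller-supplied stand-ins for F(p) and for the Transformer encoder layer, fusion uses direct addition (one of the options the description allows), vectors are stored as rows so that W_t p_t is written `p @ W`, and a single `W` and `M` are shared across iterations for brevity although the formulas index them by t. All names and values are hypothetical.

```python
import numpy as np

def progressive_sampling(sample, encode, p1, W, M, num_iters):
    """Run N sampling iterations and return (T_N, [p_1, ..., p_N]).

    sample(p) -> (n, c) plays the role of F(p); encode(X) -> (n, c) stands in
    for the Transformer encoder layer. W has shape (2, c) and M has shape (c, 2).
    """
    p = np.asarray(p1, dtype=float)
    positions = [p]
    T = None                                  # no T_0 exists before the first iteration
    for t in range(1, num_iters + 1):
        T_prime = sample(p)                   # (1) T_t' = F(p_t)
        P = p @ W                             # (3) P_t = W_t p_t
        X = T_prime + P                       # fusion by direct addition (assumption)
        if T is not None:
            X = X + T                         # include T_{t-1} from the second iteration on
        T = encode(X)                         # intermediate sampling feature T_t
        if t < num_iters:                     # offsets are only defined for t in 1..N-1
            o = T @ M                         # (4) o_t = M_t T_t
            p = p + o                         # (5) p_{t+1} = p_t + o_t
            positions.append(p)
    return T, positions

# Toy run: one sampling point, features equal to the position, identity encoder.
T_N, visited = progressive_sampling(
    sample=lambda p: p.copy(),
    encode=lambda X: X,
    p1=[[2.0, 4.0]],
    W=np.zeros((2, 2)),
    M=0.5 * np.eye(2),
    num_iters=3,
)
```

In this toy run the position moves from (2, 4) to (3, 6) and then to (5.5, 11), and no offset is computed in the last iteration because only T_N is needed afterwards.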

Fig. 5 shows a flowchart of an image processing method according to an embodiment of the present disclosure, and as shown in the figure, in one possible implementation, step S13 may include:

Step S131, performing self-attention-based feature coding transformation according to the sampling features to obtain transformed sampling features.

For the process of performing self-attention-based feature coding transformation according to the sampling features, reference may be made to the above embodiments; that is, the sampling features may be input to a feature coding network layer of a Transformer (Transformer Encoder Layer) to implement the self-attention-based feature coding transformation.

In some possible implementations, the sampling features may undergo the self-attention-based feature coding transformation through a single feature coding layer of a Transformer. In other possible implementations, the transformation may be performed by a visual transformation network (Vision Transformer, ViT). The ViT may include a plurality of Transformer feature coding layers connected in sequence, and the number of such layers may be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure.

In some possible implementations, the sampling features may also be input into the ViT together with a preset classification feature T_cls for the self-attention-based feature coding transformation. The feature information contained in the preset classification feature T_cls is not limited in the embodiments of the present disclosure; in one example, T_cls may be a trainable parameter obtained by training the ViT.
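
One way to feed T_cls into the ViT together with the sampling features is to place it in front of them as an extra token, as sketched below. The disclosure only says that the two are input together; the concatenation along the token axis and the name `prepend_class_token` are assumptions.

```python
import numpy as np

def prepend_class_token(sampled, cls_token):
    """Stack T_cls of shape (c,) in front of (n, c) sampling features -> (n + 1, c)."""
    return np.concatenate([cls_token[None, :], sampled], axis=0)

sampled = np.ones((3, 4))        # three sampling points, four channels (toy values)
cls_token = np.zeros(4)          # trainable in practice, zero-initialised here
tokens = prepend_class_token(sampled, cls_token)
```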

How the self-attention-based feature coding transformation is performed on the sampling features in step S131 by one or more Transformer feature coding layers is described in detail in the embodiments below and is not expanded here.

Step S132, performing classification processing according to the transformed sampling features to obtain a classification result of the target object.

The manner of classification processing can be flexibly determined according to the actual situation. In some possible implementations, the transformed sampling features may be passed through a neural network or a neural network layer with a classification function to obtain the classification result of the target object. In one example, the transformed sampling features may be passed through a softmax classification layer, and the classification result is obtained as the output of the softmax classification layer.
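
A softmax classification layer can be sketched as a linear projection followed by softmax. The sketch assumes that a single c-dimensional feature vector (for example, a pooled feature or the one at the classification position) is classified; the names `classify`, `weight` and `bias` and the toy values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classify(feature, weight, bias):
    """Linear layer followed by softmax; returns (class index, class probabilities)."""
    probs = softmax(feature @ weight + bias)
    return int(np.argmax(probs)), probs

feature = np.array([1.0, 0.0])               # transformed sampling feature (toy)
weight = np.array([[2.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0]])         # (c, number of classes); learned in practice
bias = np.zeros(3)
label, probs = classify(feature, weight, bias)
```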

Through the embodiment of the disclosure, a network layer in a Transformer can be utilized to perform feature coding transformation based on self attention on sampling features, and the transformed sampling features are obtained to perform classification processing, so that a classification result of a target object is obtained.

In one possible implementation, step S131 may include:

acquiring a plurality of sampling feature vectors of the sampling features;

respectively determining fusion weights of the plurality of sampling feature vectors according to the similarity among the plurality of sampling feature vectors;

according to the fusion weight of the sampling feature vectors, performing weighted fusion on the sampling feature vectors to obtain a weighted fusion result;

and generating the transformed sampling feature according to the weighted fusion result.

The plurality of sampling feature vectors of the sampling features may be a plurality of sampling feature vectors obtained by mapping the sampling features to different vector spaces. The vector types included in the sampling feature vectors can be flexibly determined according to actual conditions, and in one possible implementation, the sampling features can be mapped to a query vector space, a key vector space and a value vector space respectively to obtain a query vector (Queries), a key vector (Keys) and a value vector (Values) of the sampling features as the sampling feature vectors.

From the similarity among the three sampling feature vectors, namely the query vector, the key vector and the value vector, the fusion weights of the three vectors can be determined respectively. The sampling feature vectors can then be weighted and fused according to these fusion weights to obtain a weighted fusion result, which may be further processed by a fully connected layer to obtain the transformed sampling features. The way in which the similarity is calculated is not limited in the embodiments of the present disclosure, and the calculated similarity may be used directly as the fusion weight of the sampling feature vectors for weighted fusion.
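
As one concrete instance of similarity-based weighted fusion, the sketch below uses single-head scaled dot-product attention: the query-key dot product serves as the similarity, softmax normalizes it into fusion weights, and the weights are applied to the value vectors. The disclosure leaves the similarity measure open, so this choice, the function name `self_attention` and the projection matrices `w_q`, `w_k`, `w_v` are assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n, c) sampling features.

    Returns the (n, d) weighted fusion result and the (n, n) fusion weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # map to query, key and value spaces
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)     # stabilise the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # normalise each row to sum to 1
    return weights @ v, weights

x = np.array([[1.0, 0.0],
              [0.0, 1.0]])                           # two sampling points (toy values)
identity = np.eye(2)                                 # learned projections in practice
fused, weights = self_attention(x, identity, identity, identity)
```

With identity projections each sampling point is most similar to itself, so the diagonal entries of `weights` are the largest in their rows.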

As described in the foregoing embodiments, the self-attention-based feature coding transformation proposed in the embodiments of the present disclosure may be implemented by a feature coding layer of a Transformer. Fig. 6 and Fig. 7 show structural diagrams of a feature coding layer according to an embodiment of the present disclosure. As shown in Fig. 6, in one example, the feature coding layer of the Transformer may include a multi-head attention module (Multi-Head Attention) and a feed-forward module (Feed Forward) connected in sequence; the structure of the multi-head attention module is shown in Fig. 7. As can be seen from Fig. 7, the query vector Q, the key vector K and the value vector V of the sampling features are each input into the multi-head attention module and linearly transformed. In each scaled dot-product attention unit (Scaled Dot-Product Attention) of the module, the similarities among Q, K and V can then be calculated and the fusion weights determined from them. For example, Q and K may be matrix-multiplied to obtain the similarity between Q and K; this similarity is normalized, and the normalized similarity is matrix-multiplied with V, so that it acts as the weight applied to V. Weighted fusion is performed in each dot-product attention unit according to the fusion weights determined in this way, yielding a weighted fusion result in each unit.

The weighted fusion results of the h dot-product attention units may be concatenated by a concatenation layer and linearly transformed again, and the multi-head attention features are then obtained through an addition and normalization layer (Add & Norm). The multi-head attention features enter the feed-forward module, where they are processed by two fully connected layers and then pass through another addition and normalization layer to give the transformed sampling features.
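
The part of the feature coding layer that follows the attention units can be sketched as below, taking the h weighted fusion results as given. Several details are assumptions not stated in the disclosure: the residual connections inside Add & Norm use the layer input, normalization is a layer normalization without learnable scale and shift, and the activation between the two fully connected layers is ReLU. The function names and random parameters are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer_tail(x, head_outputs, w_o, w1, b1, w2, b2):
    """x: (n, c) layer input; head_outputs: list of h arrays of shape (n, c // h)."""
    concat = np.concatenate(head_outputs, axis=-1)   # connect the h weighted fusion results
    attn = layer_norm(x + concat @ w_o)              # linear transformation, then Add & Norm
    hidden = np.maximum(0.0, attn @ w1 + b1)         # first fully connected layer with ReLU
    return layer_norm(attn + hidden @ w2 + b2)       # second layer, then Add & Norm

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # 4 sampling points, 8 channels
head_outputs = [rng.normal(size=(4, 4)) for _ in range(2)]   # h = 2 attention units
w_o = rng.normal(size=(8, 8))
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
transformed = encoder_layer_tail(x, head_outputs, w_o, w1, b1, w2, b2)
```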

Through the embodiments of the present disclosure, a plurality of sampling feature vectors can be obtained by mapping the sampling features to different vector spaces, and the similarity among them used to perform self-attention-based weighted fusion of the sampling features. The transformed sampling features therefore fuse their own global features better, which improves the completeness and precision of the feature information they carry and, in turn, the precision of the classification result and the classification effect.

In one possible implementation, the method proposed by the embodiment of the present disclosure may be implemented by a target neural network, and the target neural network may include:

the feature extraction network is used for performing feature extraction on the image to be processed to obtain a feature map of the image to be processed;

the progressive sampling module is used for sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features;

and the classification network module is used for classifying the target object based on the sampling characteristics to obtain a classification result, wherein the classification network module comprises a visual transformation network and a classification network which are sequentially connected.

The feature extraction network may be any neural network with a feature extraction function, which is proposed in the above-mentioned disclosed embodiments, and is not described herein again.

The progressive sampling module may perform at least one iterative sampling on the feature map in an iterative sampling manner mentioned in the foregoing disclosed embodiment to obtain the sampling feature, and the implementation form of the progressive sampling module may refer to each of the foregoing disclosed embodiments, for example, the structure of the progressive sampling module may refer to fig. 4 in the foregoing disclosed embodiment, and details are not repeated here.

The classification network module may classify the target object based on the sampling features, and in a possible implementation manner, the classification network module may include a visual transformation network and a classification network, where the visual transformation network may be the ViT network mentioned in the foregoing disclosed embodiment, and details are not described here. The classification network may be any neural network or neural network layer having a classification function, and reference may also be made to the above-described embodiments, which are not described herein again.

In some possible implementation manners, the target neural network may also omit some network structures or modules, for example, only a progressive sampling module and a visual transformation network may be included, and other omitted networks or modules may be implemented by using a relevant algorithm, which is determined flexibly according to the actual situation.

Through the embodiment of the disclosure, image classification can be realized through the target neural network comprising one or more of the feature extraction network, the progressive sampling module, the visual transformation network and the classification network.

Fig. 8 shows a block diagram of an image processing apparatus 20 according to an embodiment of the present disclosure, which, as shown in fig. 8, includes:

the feature extraction module 21, configured to perform feature extraction on the image to be processed to obtain a feature map of the image to be processed, where the image to be processed includes a target object;

the sampling module 22, configured to perform sampling processing on the feature map according to the position of the target object in the image to be processed to obtain sampling features;

the classification module 23, configured to perform classification processing on the target object based on the sampling features to obtain a classification result.

In one possible implementation, the sampling module is configured to: determine a target sampling position for sampling the feature map according to the position of the target object in the image to be processed; and sample the feature map according to the target sampling position to obtain sampling features.

In one possible implementation, the sampling module is further configured to: perform at least one iterative sampling on the feature map to obtain a target sampling position matching the position of the target object in the image to be processed.

In one possible implementation, in response to performing at least two iterative samplings on the feature map, the sampling module is further configured to: perform the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling, where t is an integer greater than 1; update the intermediate sampling position after the (t-1)-th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling; and take the intermediate sampling position after the t-th iterative sampling as the target sampling position in the case that t reaches the preset number of iterations.

In one possible implementation, the sampling module is further configured to: extract feature vectors from the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain intermediate extraction features of the t-th iterative sampling; fuse the intermediate extraction features of the t-th iterative sampling, the intermediate sampling features after the (t-1)-th iterative sampling, and the feature vectors corresponding to the intermediate sampling position after the (t-1)-th iterative sampling to obtain intermediate fusion features of the t-th iterative sampling; and perform self-attention-based feature coding transformation according to the intermediate fusion features of the t-th iterative sampling to obtain the intermediate sampling features after the t-th iterative sampling.
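
The fusion step may be illustrated by the following minimal sketch. This passage does not fix the fusion operator, so element-wise addition is assumed here purely for illustration; the function name `fuse_tokens` and its parameter names are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_tokens(
    extracted: List[Vector],
    previous: List[Vector],
    positional: List[Vector],
) -> List[Vector]:
    """Fuse, per sampling point, the extracted feature, the previous
    intermediate sampling feature, and the position feature vector.
    Element-wise addition is an assumption made for this sketch."""
    if not (len(extracted) == len(previous) == len(positional)):
        raise ValueError("each input needs one vector per sampling point")
    return [
        [e + p + q for e, p, q in zip(ev, pv, qv)]
        for ev, pv, qv in zip(extracted, previous, positional)
    ]
```

The fused vectors would then be passed to the self-attention-based feature coding transformation; other fusion operators, such as concatenation followed by projection, could be substituted.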

In one possible implementation, the classification module is further configured to: acquire a plurality of sampling feature vectors of the sampling features; determine a fusion weight for each of the plurality of sampling feature vectors according to the similarity among the plurality of sampling feature vectors; perform weighted fusion on the sampling feature vectors according to their fusion weights to obtain a weighted fusion result; and generate the transformed sampling features according to the weighted fusion result.
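
The similarity-based weighted fusion may be sketched as follows. The sketch assumes that similarity is a scaled dot product and that fusion weights are obtained by softmax normalization, and it omits any learned projections; these choices and the names `softmax` and `similarity_weighted_fusion` are illustrative assumptions rather than details stated in the present disclosure.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_weighted_fusion(vectors: List[Vector]) -> List[Vector]:
    """For each vector, weight all vectors by their similarity to it
    and return the weighted sums (one fused vector per input vector)."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    fused = []
    for query in vectors:
        # Similarity between this vector and every vector (assumed dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in vectors
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return fused
```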

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

Application scenario example

Fig. 9 is a schematic diagram of an application example according to the present disclosure. As shown in fig. 9, the application example proposes an image processing method that can classify a target object in an input image to be processed. The image processing method may include the following processes:

First, the image to be processed is input into a feature extraction network for feature extraction to obtain a feature map F.

Second, the obtained feature map F is input into the progressive sampling module in fig. 9 for iterative sampling:

As shown in fig. 9, the first iterative sampling may be performed according to an initial sampling position p1, which includes 9 sampling points that equally divide the entire feature map.
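
One possible way to construct such an initial sampling position is sketched below. The sketch assumes the 9 points form a 3 x 3 grid placed at the centers of equal cells; the center placement, the (y, x) ordering, and the name `initial_sampling_positions` are assumptions made for illustration only.

```python
from typing import List, Tuple


def initial_sampling_positions(
    height: float, width: float, points_per_side: int = 3
) -> List[Tuple[float, float]]:
    """Return points_per_side ** 2 evenly spaced (y, x) points, each at the
    center of one cell of a regular grid laid over the feature map."""
    step_y = height / points_per_side
    step_x = width / points_per_side
    return [
        ((row + 0.5) * step_y, (col + 0.5) * step_x)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]
```

For example, `initial_sampling_positions(6, 6)` yields 9 points whose central point lies at the middle of the map.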

In each iterative sampling, features corresponding to the sampling points are extracted from F according to the intermediate sampling position obtained by the previous iterative sampling, so as to obtain intermediate extraction features. The intermediate extraction features are fused with the intermediate sampling features obtained by the previous iterative sampling and with the feature vectors corresponding to the intermediate sampling position obtained by the previous iterative sampling, so as to obtain intermediate fusion features. The intermediate fusion features are processed by a feature coding layer of a Transformer to obtain the intermediate sampling features after the current iterative sampling, and a position offset of each sampling point on the two-dimensional plane is predicted based on these intermediate sampling features. The position offset can be used to update the intermediate sampling position after the previous iterative sampling, so as to obtain the intermediate sampling position after the current iterative sampling.

The progressive sampling module may obtain the sampling features through multiple iterative samplings and input the sampling features into the Vision Transformer (ViT) network shown in fig. 9.

Third, the ViT network includes multiple Transformer feature coding layers, which can perform self-attention-based feature coding transformation on the input sampling features to obtain the transformed sampling features.
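
The extraction of features at non-integer sampling points and the offset-based position update may be sketched as follows. Bilinear interpolation with border clamping is assumed here as the extraction rule, and a single-channel feature map is used for brevity; neither choice, nor the names `bilinear_sample` and `update_positions`, is stated in this passage.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed (y, x) coordinates


def bilinear_sample(feature_map: Sequence[Sequence[float]], y: float, x: float) -> float:
    """Interpolate a single-channel map at a fractional (y, x) location,
    clamping the location to the map borders."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - wx) + feature_map[y0][x1] * wx
    bottom = feature_map[y1][x0] * (1 - wx) + feature_map[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def update_positions(positions: List[Point], offsets: List[Point]) -> List[Point]:
    """Shift each sampling point by its predicted two-dimensional offset."""
    return [(y + dy, x + dx) for (y, x), (dy, dx) in zip(positions, offsets)]
```

In a multi-channel setting the same interpolation would be applied to every channel to form the feature vector of a sampling point.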

Fourth, the transformed sampling features may be input into a classification network for classification processing to obtain a classification result. In one example, the classification result output in fig. 9 may indicate that the target object in the image to be processed is a cat.
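
The final classification may be illustrated with the following sketch, which assumes that the classification network reduces to a single linear layer followed by softmax applied to one pooled feature vector. The structure of the classification network is not specified in this passage, and the name `classify`, the weights, and the labels used below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def classify(
    feature: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Score each class with a linear layer, convert the scores to
    probabilities with softmax, and return the most probable label."""
    logits = [
        sum(w * f for w, f in zip(row, feature)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

With illustrative weights that favor the first class, a call such as `classify([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0], ["cat", "dog"])` returns the label "cat" together with the class probabilities.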

Through this application example, the target object in the image to be processed can be classified by using a Transformer network structure in combination with dynamic sampling of the feature map. On the one hand, adopting the Transformer structure as the network backbone improves the classification effect on large-scale data compared with a convolutional neural network. On the other hand, the sampling position is adjusted through iterative sampling, so that the content information of the image is fully considered and dynamic sampling is performed according to the position of the target object in the image; this reduces cases in which the features of the region where the target object is located are split apart, and effectively improves the classification effect and precision.

It is understood that the above method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from their underlying principles and logic; owing to space limitations, these combinations are not described in detail in the present disclosure.

It will be understood by those skilled in the art that, in the above methods, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

The disclosed embodiments also provide a computer program product comprising computer readable code which, when run on a device, executes instructions for implementing a method as provided by any of the above embodiments.

Embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed, cause a computer to perform the operations of the method provided by any of the above embodiments.

The electronic device may be provided as a terminal, server, or other form of device.

Fig. 10 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or a similar terminal.

Referring to fig. 10, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as a display and keypad of the electronic device 800. The sensor component 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.

Fig. 11 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 11, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.

The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may be embodied in hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An image processing method, characterized in that the method comprises:

performing feature extraction on an image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed comprises a target object;

sampling the feature map according to the position of the target object in the image to be processed to obtain sampling features;

and classifying the target object based on the sampling features to obtain a classification result.

2. The method according to claim 1, wherein the sampling the feature map according to the position of the target object in the image to be processed to obtain a sampled feature comprises:

determining a target sampling position for sampling the feature map according to the position of the target object in the image to be processed;

and sampling the feature map according to the target sampling position to obtain the sampling features.

3. The method according to claim 2, wherein the determining a target sampling position for sampling the feature map according to the position of the target object in the image to be processed comprises:

and obtaining the target sampling position matched with the position of the target object in the image to be processed by performing at least one iterative sampling on the feature map.

4. The method of claim 3, wherein in response to sampling the feature map iteratively at least twice, sampling the feature map iteratively for the t-th time comprises:

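performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain the intermediate sampling feature after the t-th iterative sampling, wherein t is an integer greater than 1;

updating the intermediate sampling position after the (t-1)-th iterative sampling according to the intermediate sampling feature after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling;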

and taking the intermediate sampling position after the t-th iterative sampling as the target sampling position when t reaches a preset number of iterations.

5. The method according to claim 4, wherein the updating the intermediate sampling position after the t-1 th iterative sampling according to the intermediate sampling characteristic after the t-th iterative sampling to obtain the intermediate sampling position after the t-th iterative sampling comprises:

6. The method according to claim 4 or 5, wherein the obtaining the intermediate sampling feature after the t-th iterative sampling by performing the t-th iterative sampling on the feature map according to the intermediate sampling position after the t-1-th iterative sampling comprises:

extracting feature vectors from the feature map according to the intermediate sampling position after the (t-1)-th iterative sampling to obtain intermediate extraction features of the t-th iterative sampling;

fusing the intermediate extraction features of the t-th iterative sampling, the intermediate sampling features after the (t-1)-th iterative sampling, and the feature vectors corresponding to the intermediate sampling position after the (t-1)-th iterative sampling to obtain intermediate fusion features of the t-th iterative sampling;

and performing self-attention-based feature coding transformation according to the intermediate fusion features of the t-th iterative sampling to obtain the intermediate sampling features after the t-th iterative sampling.

7. The method of any one of claims 3 to 6, wherein the first iterative sampling of the feature map comprises:

8. The method according to any one of claims 1 to 7, wherein the classifying the target object based on the sampling feature to obtain a classification result comprises:

performing self-attention-based feature coding transformation according to the sampling features to obtain transformed sampling features;

and performing classification processing according to the transformed sampling features to obtain a classification result of the target object.

9. The method of claim 8, wherein said performing a self-attention based feature-coding transform based on the sampling features to obtain transformed sampling features comprises:

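acquiring a plurality of sampling feature vectors of the sampling features;

respectively determining fusion weights of the plurality of sampling feature vectors according to the similarity among the plurality of sampling feature vectors;

performing weighted fusion on the sampling feature vectors according to the fusion weights of the sampling feature vectors to obtain a weighted fusion result;

and generating the transformed sampling features according to the weighted fusion result.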

10. The method of any one of claims 1 to 9, wherein the method is implemented by a target neural network comprising:

11. An image processing apparatus, characterized in that the apparatus comprises:

a feature extraction module, configured to perform feature extraction on an image to be processed to obtain a feature map of the image to be processed, wherein the image to be processed comprises a target object;

a sampling module, configured to sample the feature map according to the position of the target object in the image to be processed to obtain sampling features;

and a classification module, configured to classify the target object based on the sampling features to obtain a classification result.

12. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 10.

13. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 10.