CN116958570A - Image processing method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116958570A
CN116958570A
Authority
CN
China
Prior art keywords
dimension
feature map
convolution
image
transposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310417354.5A
Other languages
Chinese (zh)
Inventor
李德辉 (Li Dehui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310417354.5A priority Critical patent/CN116958570A/en
Publication of CN116958570A publication Critical patent/CN116958570A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides an image processing method, an apparatus, a computer-readable storage medium, and a computer program product. The image processing method comprises: performing feature extraction on an image to obtain a three-dimensional feature map of the image; fusing convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, where the convolution features in the different dimensions are obtained by performing dimension transposition on the three-dimensional feature map and then performing convolution; and pooling the global feature map to obtain the image features of the image. The technical scheme of the embodiments of the application can accurately acquire image features that characterize the image information, with a small amount of computation and high efficiency.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
Image features obtained by image processing can be applied in many fields, such as image retrieval and image classification. Existing image features are mostly obtained through a global modeling network, in which the features of each position in the image are learned through a self-attention mechanism to obtain the overall features.
Since the self-attention mechanism needs to compute a large number of weight matrices, acquiring and using these matrices generates a large amount of computation, which grows quadratically with the image size and occupies computing resources. Existing image feature acquisition methods therefore suffer from large computation, high cost, long processing time, and low efficiency.
Disclosure of Invention
To solve the above technical problems, embodiments of the present application provide an image processing method and apparatus, an electronic device, a computer readable storage medium, and a computer program product.
According to an aspect of an embodiment of the present application, there is provided an image processing method, including: performing feature extraction on an image to obtain a three-dimensional feature map of the image; fusing convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, where the convolution features in the different dimensions are obtained by performing dimension transposition on the three-dimensional feature map and then performing convolution; and pooling the global feature map to obtain the image features of the image.
According to an aspect of an embodiment of the present application, there is provided an image processing apparatus, including: a feature acquisition module configured to perform feature extraction on an image to obtain a three-dimensional feature map of the image; a multi-direction convolution module configured to fuse convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, where the convolution features in the different dimensions are obtained by performing dimension transposition on the three-dimensional feature map and then performing convolution; and an image feature acquisition module configured to pool the global feature map to obtain image features of the image.
In one embodiment, the feature acquisition module includes: an image dividing unit configured to divide the image into a plurality of image blocks; an image block feature extraction unit configured to perform feature extraction on each of the plurality of image blocks to obtain the image block features corresponding to the image blocks; and a feature acquisition unit configured to integrate the image block features corresponding to the image blocks to obtain the three-dimensional feature map.
In an embodiment, the image block feature extraction unit includes: an image block vector sub-unit configured to acquire the image block vectors corresponding to the plurality of image blocks; and an image block feature extraction sub-unit configured to perform linear mapping on the image block vectors corresponding to the image blocks to obtain the image block features corresponding to the image blocks.
In an embodiment, the three-dimensional feature map includes a first dimension, a second dimension, and a third dimension, and the multi-direction convolution module includes: a first transposition convolution unit configured to perform dimension transposition on the three-dimensional feature map according to the first dimension and convolve the transposed map to obtain a first convolution map; a second transposition convolution unit configured to perform dimension transposition on the first convolution map according to the second dimension and convolve the transposed map to obtain a second convolution map; a third transposition convolution unit configured to perform dimension transposition on the second convolution map according to the third dimension and convolve the transposed map to obtain a third convolution map; and a first global feature map acquisition unit configured to obtain the global feature map based on the third convolution map.
In an embodiment, a preset dimension order exists among the first dimension, the second dimension, and the third dimension of the three-dimensional feature map, and the first transposition convolution unit includes: a transposition sub-unit configured to modify the dimensions of the three-dimensional feature map so that the first dimension comes first in the dimension order of the modified map, obtaining the three-dimensional feature map after dimension transposition.
In an embodiment, a preset dimension order exists among the first dimension, the second dimension, and the third dimension of the three-dimensional feature map, and the first global feature map acquisition unit includes: a dimension order detection sub-unit configured to detect the dimension order of the third convolution map; a first global feature map acquisition sub-unit configured to take the third convolution map as the global feature map if its dimension order is the same as the preset dimension order of the three-dimensional feature map; and a second global feature map acquisition sub-unit configured to, if the dimension order of the third convolution map differs from the preset dimension order, transpose the third convolution map so that its dimension order matches the preset order, obtaining the global feature map.
In one embodiment, the multi-direction convolution module includes: a parallel dimension transposition unit configured to perform dimension transposition on the three-dimensional feature map in the different dimensions, obtaining transposed three-dimensional feature maps for the different dimensions; a parallel convolution unit configured to convolve each transposed three-dimensional feature map, obtaining the convolution feature maps corresponding to the different dimension transpositions; and a second global feature map acquisition unit configured to fuse the convolution feature maps corresponding to the different dimension transpositions, obtaining the global feature map.
In an embodiment, the different dimensions of the three-dimensional feature map include a first dimension, a second dimension, and a third dimension, and the parallel dimension transposition unit includes: a first parallel dimension transposition sub-unit configured to transpose the three-dimensional feature map according to the first dimension; a second parallel dimension transposition sub-unit configured to transpose the three-dimensional feature map according to the second dimension; and a third parallel dimension transposition sub-unit configured to transpose the three-dimensional feature map according to the third dimension, each obtaining the correspondingly transposed three-dimensional feature map.
In an embodiment, a preset dimension order exists among the different dimensions of the three-dimensional feature map, and the second global feature map acquisition unit includes: a dimension transposition sub-unit configured to transpose each convolution feature map corresponding to the different dimension transpositions so that its dimension order matches the preset dimension order; a feature splicing sub-unit configured to splice the transposed convolution feature maps to obtain a high-dimensional feature map; and a second global feature map acquisition sub-unit configured to obtain the global feature map based on the high-dimensional feature map.
In an embodiment, the second global feature map acquisition sub-unit is configured to reduce the number of channels of the high-dimensional feature map to the number of channels of the three-dimensional feature map, obtaining the global feature map.
According to an aspect of an embodiment of the present application, there is provided an electronic device including one or more processors; and storage means for storing one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the image processing method as described above.
According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the image processing method as described above.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image processing methods provided in the above-described various alternative embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the image processing method as described above.
In the technical scheme provided by the embodiments of the application, on one hand, fusing the convolution features of the three-dimensional feature map in different dimensions captures the characterization information of the map in each dimension, yielding a global feature map that accurately characterizes the image information; on the other hand, the convolution features in the different dimensions are learned by transposing the three-dimensional feature map and then convolving it, and fusing these convolution features yields the global feature map of the image with a small amount of computation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a schematic diagram of an implementation environment shown in an exemplary embodiment of the application;
FIG. 2 is a flow chart of an image processing method shown in an exemplary embodiment of the application;
FIG. 3 is a schematic diagram of an image processing network according to an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an image processing method shown based on the embodiment of FIG. 2;
FIG. 5 is a diagram of a feature extraction process in an initial feature extraction module shown in an exemplary embodiment of the application;
FIG. 6 is a flow chart of another image processing method shown based on the embodiment of FIG. 2;
FIG. 7 is a schematic diagram of a multi-way convolution module according to an exemplary embodiment of the present disclosure;
FIG. 8 is a flow chart illustrating the acquisition of a global feature map according to an exemplary embodiment of the present application;
FIG. 9 is a flow chart of another image processing method shown based on the embodiment of FIG. 2;
FIG. 10 is a schematic diagram of a multi-way convolution module shown in another exemplary embodiment of the present disclosure;
FIG. 11 is a flowchart of acquiring a global feature map, shown in accordance with another exemplary embodiment of the present application;
FIG. 12 is a flowchart of an application method of image features obtained by an image processing method according to an exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment;
FIG. 14 is a schematic diagram of a computer system suitable for implementing an embodiment of the application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It should also be noted that in the present application, the term "plurality" means two or more. "And/or" describes an association between objects and indicates that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
It will be appreciated that the specific embodiments of the present application involve the image, the content in the image, and the feature information. When the above embodiments are applied to a specific product or technology, the permission or consent of the user must be obtained, or desensitization filtering of the related data must be performed, and the collection, use, and processing of the related information must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Description of technical words:
global modeling network: a deep learning network capable of acquiring global features.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, locate, and measure targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
In image processing, image features are generally extracted from an image so that subsequent processing, such as image recognition and image classification, can be performed according to the extracted features; image feature extraction usually targets the global features of the image and is implemented through a global modeling network.
Vision Transformer (ViT) and its variants are the global modeling networks currently receiving the most attention. These methods originate from natural language processing and, after adaptation, are used in computer-vision recognition algorithms. They have global modeling capability and can grasp features across the whole image, achieving more robust and accurate recognition; compared with convolutional neural networks (CNNs), Vision Transformer global modeling networks have been shown to achieve better perception on each of the major visual perception tasks.
A Vision Transformer global modeling network first divides the input picture into a number of patches (image blocks), then extracts features from each patch with a linear mapping layer to obtain the corresponding tokens, then lets the tokens interact through an operation called a Transformer (a feature processor), stacks multiple Transformer layers to obtain the output features of the image, and finally maps the output features through a task head for the various downstream image-feature applications.
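The patch-and-linear-mapping step described above can be sketched in NumPy (an illustration of the standard ViT patch embedding, with hypothetical sizes; `weight` stands in for the learned linear mapping layer):

```python
import numpy as np

def patch_embed(img, patch, weight):
    # img: (c, h, w); split into non-overlapping patch x patch blocks,
    # flatten each block, and linearly map it to a token
    c, h, w = img.shape
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vec = img[:, i:i + patch, j:j + patch].reshape(-1)  # patch vector
            tokens.append(vec @ weight)                         # linear mapping
    return np.stack(tokens)  # (num_patches, embed_dim)
```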
The core operation of the Transformer in the global modeling network is the self-attention mechanism. The input size is c×h×w, where c is the number of channels, h the height, and w the width. The input features pass through different 1×1 convolution layers to obtain three features: query, key, and value (the key and the value form a key-value pair), each of size c×h×w. After reshaping, the query is multiplied with the key to obtain a weight matrix of size hw×hw; the weight matrix is multiplied with the value to obtain c×hw features, which are reshaped back into c×h×w output features. At this point, every position on the output feature map has integrated global features.
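The self-attention computation described above can be written out as follows, with the 1×1 convolutions expressed as per-position matrix multiplies (equivalent for 1×1 kernels); the softmax normalization and the 1/√c scaling are standard details of self-attention not spelled out in the text:

```python
import numpy as np

def self_attention(feat, wq, wk, wv):
    # feat: (c, h, w); wq, wk, wv: (c, c) projection matrices
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T          # (hw, c): one row per position
    q, k, v = x @ wq, x @ wk, x @ wv      # query, key, value, each (hw, c)
    a = q @ k.T / np.sqrt(c)              # hw x hw weight matrix
    a = np.exp(a - a.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)  # softmax over the keys
    return (a @ v).T.reshape(c, h, w)     # c x hw features reshaped to c x h x w
```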
In the existing global modeling network, the self-attention operation needs to compute an hw×hw weight matrix, and acquiring and using this matrix generates a huge amount of computation: the computation of the self-attention operation is 3hwc² + 2(hw)²c, which grows quadratically with hw.
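Plugging numbers into this cost formula shows the quadratic blow-up: doubling the side length quadruples hw, and the 2(hw)²c term then grows sixteen-fold.

```python
def attention_flops(h, w, c):
    # 3*h*w*c^2 for the query/key/value projections plus
    # 2*(h*w)^2*c for forming and applying the hw x hw weight matrix
    hw = h * w
    return 3 * hw * c**2 + 2 * hw**2 * c

small = attention_flops(32, 32, 64)  # hw = 1024
large = attention_flops(64, 64, 64)  # hw = 4096: 4x the positions, ~15x the cost
```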
Therefore, although the existing global modeling networks with a self-attention mechanism have been shown to achieve better perception on the major visual perception tasks, the high computational cost inherent in the self-attention structure of the Transformer makes it difficult to run these networks smoothly on platforms with limited computing power, such as mobile phones, vehicles, unmanned aerial vehicles, and robots, which greatly limits the practical deployment of the algorithm.
Based on this, the present embodiments provide an image processing method, an apparatus, an electronic device, a storage medium, and a computer program product, in which image features are extracted by convolving along different dimensions of the image features, replacing the computation-heavy self-attention mechanism. On the basis of accurately acquiring image features, the computation is small, which facilitates deploying the model on various platforms with limited computing power and thus promotes the practical application of global modeling networks.
Referring first to fig. 1, fig. 1 is a schematic diagram of an implementation environment according to the present application. The implementation environment includes a terminal 100 and a server side 200, and communication is performed between the terminal 100 and the server side 200 through a wired or wireless network.
Of course, the number of servers 200 in fig. 1 is merely exemplary, and other numbers are possible in other embodiments. In this embodiment, the terminal 100 may be configured to determine an image to be processed. The purpose of the image processing is to obtain image features that accurately and quickly characterize the image information, so that related image applications can be performed through these features, for example, retrieving from an image set containing a large number of images a target image similar to or of the same category as the image to be processed, or classifying the image to be processed.
The terminal 100 then sends the image to the server 200, so that an image processing network preset in the server 200 processes the image to obtain image features and executes a preset image processing task to obtain the final image processing result, where the image processing task is set according to the image application. The server 200 then returns the image processing result to the terminal 100 for display through the visualization module of the terminal 100.
After obtaining the image, the terminal 100 sends it to the server 200. The server 200 performs feature extraction on the image to obtain a three-dimensional feature map; fuses the convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, where the convolution features in different dimensions are obtained by performing dimension transposition on the feature map followed by convolution; and pools the global feature map to obtain the image features of the image. The image features can then be returned directly to the terminal 100 for display through its built-in visualization module, or used to execute the image processing task, with the final image processing result returned to the terminal 100 for display through its built-in visualization module.
Of course, in some embodiments, the image processing may also be performed directly by the terminal 100; that is, the terminal 100 does not send the image to the server 200, but performs the image processing in its own service system and carries out the subsequent image processing task using the resulting image features.
The terminal 100 may be a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc., which is not limited herein. The server 200 may be an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, where the plurality of servers may form a blockchain with each server as a node on the blockchain. The server 200 may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited herein.
It should be noted that this embodiment is only an exemplary implementation environment provided for the convenience of understanding the idea of the present application, and should not be construed as providing any limitation on the scope of use of the present application.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment, which is applicable to the implementation environment in fig. 1 and is specifically executed by the server side 200 in fig. 1, it should be understood that the method may also be applied to other exemplary implementation environments and is specifically executed by devices in other implementation environments, and the embodiment is not limited to the implementation environment to which the method is applicable.
In an exemplary embodiment, the method may include steps S210 to S250, which are described in detail as follows:
step S210: and extracting the characteristics of the image to obtain a three-dimensional characteristic image of the image.
In this embodiment, in most image applications the image is first processed to obtain its features, and the application is then carried out based on those features.
In this embodiment, the image features may be obtained through an image processing network as shown in fig. 3. The image processing network is a global modeling network and, as shown in fig. 3, includes an initial feature extraction module, a multi-direction convolution module and a pooling module that are sequentially connected.
Specifically, an image to be subjected to image processing is input to an image processing network, initial feature extraction is performed in an initial feature extraction module in the image processing network, and a three-dimensional feature map of the image is obtained.
The feature extraction process may be to divide the image into a plurality of image blocks, extract the image block features of each block, and then integrate the image block features of all blocks to obtain the three-dimensional feature map, where the dimensions of the three-dimensional feature map are c×h×w, c being the number of channels, h the height and w the width.
Step S230: and fusing convolution features of the three-dimensional feature map on different dimensions to obtain a global feature map of the image.
In this embodiment, unlike other global modeling networks, the deep semantics of the image are not learned from the obtained three-dimensional feature map by a self-attention mechanism; instead, they are learned by fusing the convolution features of the three-dimensional feature map in different dimensions to obtain the global feature map of the image. The convolution features in different dimensions are obtained by performing dimension transposition on the feature map followed by convolution.
Specifically, the three-dimensional feature map may be regarded as including three dimensions, i.e. a channel dimension, a height dimension, and a width dimension, and generally, there is a predetermined dimension sequence of the three dimensions of the three-dimensional feature map, i.e. the channel dimension, the height dimension, and the width dimension in sequence.
The dimension sequence of the three-dimensional feature map determines how the subsequent convolution is performed. If the feature map follows the preset dimension sequence, i.e. its dimensions are c×h×w, then during convolution the convolution kernel is fixed in the channel dimension direction and slides along the height and width dimensions of the c×h×w feature map to obtain the convolution result. If the dimension sequence of a feature map is h×c×w, the convolution kernel is fixed in the height dimension direction during convolution and slides along the channel and width dimensions to obtain the convolution result. That is, when convolution is performed after dimension transposition, the first dimension in the dimension sequence of the feature map is fixed, and the convolution is realized by sliding along the other two dimensions.
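The equivalence between "fixing one dimension and sliding over the other two" and "transposing, then convolving normally" can be illustrated with NumPy. The h×h mixing matrix standing in for a 1*1 convolution kernel, and the small sizes, are assumptions made for the sketch:

```python
import numpy as np

# A c x h x w feature map in the preset dimension order.
c, h, w = 3, 4, 5
fmap = np.arange(c * h * w, dtype=float).reshape(c, h, w)

# Transposing the height dimension to the front gives an h x c x w map;
# an ordinary "first-axis-fixed" convolution applied to it now keeps the
# height axis fixed and slides over the channel and width axes.
h_first = fmap.transpose(1, 0, 2)          # h x c x w

# With a 1*1 kernel, the convolution reduces to mixing the leading axis:
kernel = np.random.rand(h, h)              # h input "channels" -> h outputs
out = np.einsum('ij,jcw->icw', kernel, h_first)
assert out.shape == (h, c, w)              # result keeps the h x c x w order
```

The same permutation trick applies to the width and channel dimensions with `transpose` arguments chosen accordingly.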
Therefore, the information on different dimensions of the three-dimensional feature map can be learned by performing convolution processing after performing dimensional transposition processing on the three-dimensional feature map, so that global features of the image can be obtained.
In this embodiment, the global feature map fuses the convolution features of the three-dimensional feature map in different dimensions, which are obtained by performing dimension transposition on the feature map and then performing convolution.
Specifically, the convolution features of the three-dimensional feature map in different dimensions may be obtained as follows: the three-dimensional feature map is transposed along each of the different dimensions; the feature maps obtained after each transposition are then convolved separately to obtain a convolution result for each dimension; finally, these convolution results are fused to obtain the global feature map.
In some embodiments, the convolution features may instead be obtained as follows: the three-dimensional feature map is transposed along one dimension and the transposed map is convolved to obtain a convolution result in that dimension; since this result is itself a three-dimensional feature map, it is then transposed along another dimension and convolved again; this process is repeated until convolution has been performed along all three dimensions, yielding the global feature map. This may be regarded as obtaining the global feature map by serial processing.
In this embodiment, whether serial or parallel processing is used, the resulting global feature map is still a three-dimensional feature map with the same dimensions as the input three-dimensional feature map, namely c×h×w; that is, the dimension sequence of the global feature map is the same preset dimension sequence as that of the three-dimensional feature map.
It should be noted that, in the dimension sequence of the result obtained by transposing the three-dimensional feature map (or an intermediate convolution result) along a target dimension, the target dimension becomes the leading dimension. For example, if the three-dimensional feature map is to be transposed along the height dimension, the height dimension becomes the leading dimension of the resulting sequence; that is, the dimensions of the transposed feature map are h×c×w.
Of course, for a three-dimensional feature map that has not been transposed, the dimensions are c×h×w, i.e. the preset dimension sequence is channel dimension, height dimension and width dimension in turn. Whichever dimension is transposed, only that dimension is moved to the leading position in the sequence; the other dimensions do not change their relative positions. When a feature map that has already been transposed is transposed along another dimension, the target dimension is likewise moved to the leading position, while the remaining two dimensions keep the relative order they have among channel, height and width in the preset sequence. In other words, after transposition along a target dimension, the target dimension leads and the order between the other two dimensions corresponds to the preset dimension sequence.
For example, if a three-dimensional feature map transposed along the height dimension has dimensions h×c×w, then transposing it along the width dimension yields a feature map with dimensions w×c×h: apart from the width dimension, the channel dimension keeps its position before the height dimension.
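The transposition rule just described, where the target dimension moves to the front and the other two keep their preset relative order, corresponds to fixed axis permutations, sketched here with NumPy on assumed small sizes:

```python
import numpy as np

x = np.zeros((3, 4, 5))        # preset order: c x h x w, with c=3, h=4, w=5

# Transpose the height dimension to the front: h x c x w.
x_h = x.transpose(1, 0, 2)
assert x_h.shape == (4, 3, 5)

# Transposing the width dimension of that h x c x w map moves w to the
# front while c and h fall back into their preset relative order: w x c x h.
x_w = x_h.transpose(2, 1, 0)
assert x_w.shape == (5, 3, 4)
```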
Of course, fig. 3 in this embodiment only shows one multi-direction convolution module; in other embodiments, a plurality of multi-direction convolution modules may be stacked, and the three-dimensional feature map is processed by the stacked modules to obtain the global feature map, where each multi-direction convolution module performs the parallel or serial processing described above.
Step S250: and carrying out pooling treatment on the global feature map to obtain the image features of the image.
In this embodiment, the output global features are subjected to global average pooling, so that image features of the image can be obtained, and then different modules can be set according to different image applications to obtain an image application result.
For an image classification task, an image classification module is added after the image processing network of fig. 3; a fully connected layer can be connected in the classification module to map the image features to classification features, from which the image classification is obtained.
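As a hedged sketch of such a classification head: the 10-class fully connected layer, the random weights and the feature-map sizes below are illustrative assumptions, not details from the patent.

```python
import numpy as np

def classify(global_fmap, fc_weight, fc_bias):
    """Global average pooling followed by one fully connected layer."""
    c = global_fmap.shape[0]
    pooled = global_fmap.reshape(c, -1).mean(axis=1)   # image feature, length c
    return fc_weight @ pooled + fc_bias                # class logits

fmap = np.random.rand(8, 4, 4)                         # assumed c x h x w global feature map
logits = classify(fmap, np.random.rand(10, 8), np.zeros(10))
assert logits.shape == (10,)                           # one logit per assumed class
```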
For an image recognition task, an image recognition module is added after the image processing network of fig. 3. In the image recognition module, the similarity between the image features and the features of the images to be analyzed in the image set is computed to obtain target features with high similarity to the image features; the target images corresponding to the target features are images similar to, or of the same category as, the input image. The features of the images to be analyzed in the image set are obtained by performing the steps shown in fig. 2 on those images.
Of course, the image recognition and image classification listed above are merely exemplary; in other embodiments, other image tasks may be performed, and different task modules may be added after the image processing network to implement them.
In the image processing method, the three-dimensional feature map is transposed and then convolved, so that the image features of the three-dimensional feature map along different dimension directions are learned. This improves the accuracy of the resulting global feature map, which can accurately express the feature content of the image; the accuracy of the image features is improved, and the task results obtained in subsequent image task processing based on these features are more accurate.
On the other hand, in the image processing method of this embodiment, the computation cost of obtaining the global feature map by serial dimension transposition and convolution of the three-dimensional feature map is hwc(h+w+c), and that of the parallel variant is hwc(h+w+4c). Compared with the self-attention cost of 3hwc² + 2(hw)²c in a global modeling network, the computation cost of the multi-direction convolution is lower in most cases. For example, when h, w and c are 17, 17 and 196 respectively, the computation cost of the self-attention module is about 66M; that of the serial multi-direction convolution module is about 13M, and that of the parallel multi-direction convolution module is about 46.3M, i.e. only 19.7% and 70.2% of the former, respectively. In practice, the width and height of the three-dimensional feature map may be much higher than 17, making the computational advantage of the multi-direction convolution module even more remarkable.
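The figures above can be checked with a few lines of arithmetic, assuming the self-attention cost is counted as 3hwc² + 2(hw)²c multiply-adds (the three 1*1 projections plus the two hw×hw matrix products); with h = w = 17 and c = 196 this reproduces the approximately 66M, 13M and 46.3M figures:

```python
def self_attention_cost(h, w, c):
    # 3 projection convolutions plus the two hw x hw matrix products
    return 3 * h * w * c**2 + 2 * (h * w)**2 * c

def serial_cost(h, w, c):
    return h * w * c * (h + w + c)

def parallel_cost(h, w, c):
    return h * w * c * (h + w + 4 * c)

h, w, c = 17, 17, 196
print(round(self_attention_cost(h, w, c) / 1e6, 1))  # 66.0 (million multiply-adds)
print(round(serial_cost(h, w, c) / 1e6, 1))          # 13.0
print(round(parallel_cost(h, w, c) / 1e6, 1))        # 46.3
```

The ratios 13.0/66.0 and 46.3/66.0 give the 19.7% and 70.2% quoted above.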
The image processing method provided by this embodiment performs convolution after dimension transposition of the three-dimensional feature map, so that the global feature map is obtained quickly and accurately while the amount and cost of computation are reduced. The image processing network used by the method is therefore lightweight and can be deployed and run on platforms such as mobile phones, unmanned vehicles, unmanned aerial vehicles and robots. For example, on a mobile phone, the image processing network can achieve efficient and accurate image recognition, supporting functions such as photographed-object recognition, image beautification and automatic focusing; on an unmanned vehicle, it can be used to implement a more accurate and faster environment perception algorithm, providing technical support for autonomous driving.
Fig. 4 is a flowchart of an image processing method according to the embodiment of fig. 2, in an exemplary embodiment, the process of extracting features from an image in step S210 of fig. 2 to obtain a three-dimensional feature map of the image may include steps S410 to S450, which are described in detail below:
step S410: an image is divided into a plurality of image blocks.
In this embodiment, the extraction of the image features is performed in the initial feature extraction module, and in this embodiment, the feature extraction process in the initial feature extraction module may refer to fig. 5, specifically, the image is first divided into a plurality of image blocks, so that the local information extraction of the image is performed based on the plurality of image blocks.
Specifically, the image is divided into h×w image blocks, if the original width and height of the image are w0 and h0, the width of each image block is w1=w0/w, and the height is h1=h0/h.
Step S430: and respectively extracting the characteristics of the plurality of image blocks to obtain the image block characteristics corresponding to each image block.
Features are extracted from each image block: an image block vector is obtained for each block, and linear mapping is applied to each vector to obtain the image block features corresponding to each block.
Referring to fig. 5, each image block is flattened into an image block vector of size 1×(h1w1); each vector is then processed by a linear mapping layer to extract the image block features corresponding to each image block.
In this embodiment, the number of input channels of the linear mapping layer is h1w1 and the number of output channels is c; the resulting h×w image block features are regarded as local features, each with c channels.
Step S450: and integrating the features of the image blocks corresponding to the image blocks to obtain a three-dimensional feature map.
In this embodiment, the h×w image block features are integrated (reshaped) into a feature map of size c×h×w, obtaining the three-dimensional feature map of size c×h×w.
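Steps S410 to S450 amount to a patch-embedding operation. A minimal NumPy sketch for a single-channel image follows; the random projection matrix standing in for the linear mapping layer, and the concrete sizes, are illustrative assumptions:

```python
import numpy as np

def patch_embed(image, h, w, proj):
    """Split an h0 x w0 image into h x w blocks, flatten each block,
    apply the linear mapping `proj` (h1*w1 -> c), and reshape to c x h x w."""
    h0, w0 = image.shape
    h1, w1 = h0 // h, w0 // w                        # block height and width
    blocks = (image.reshape(h, h1, w, w1)
                   .transpose(0, 2, 1, 3)
                   .reshape(h * w, h1 * w1))         # (h*w) flattened blocks
    feats = blocks @ proj                            # (h*w) x c block features
    return feats.T.reshape(-1, h, w)                 # c x h x w feature map

img = np.random.rand(32, 32)                         # grayscale for simplicity
fmap = patch_embed(img, 8, 8, np.random.rand(16, 24))  # h1*w1=16 inputs, c=24 outputs
assert fmap.shape == (24, 8, 8)
```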
In this embodiment, the image is segmented so that the local information of each image block is extracted; the image block features representing this local information can then be integrated into a three-dimensional feature map that characterizes the image information, so that deeper information of the image can subsequently be extracted from the three-dimensional feature map and the image features can represent the image information more richly and accurately.
FIG. 6 is a flowchart of another image processing method based on the embodiment of FIG. 2. In an exemplary embodiment, the three-dimensional feature map includes a first dimension, a second dimension and a third dimension, and the process in step S230 of fig. 2 of fusing the convolution features of the three-dimensional feature map in different dimensions to obtain the global feature map of the image may include steps S610 to S670, which are described in detail below:
Step S610: and according to the first dimension, performing dimension transposition processing on the three-dimensional feature map, and performing convolution processing on the three-dimensional feature map subjected to the dimension transposition processing to obtain a first convolution map.
In this embodiment, the first dimension, the second dimension, and the third dimension are three dimensions corresponding to the three-dimensional feature map, that is, a channel dimension, a height dimension, and a width dimension; the first dimension, the second dimension, and the third dimension may be a channel dimension, a height dimension, or a width dimension, which are not specifically limited herein, and only the first dimension, the second dimension, and the third dimension need to be guaranteed to respectively correspond to three different dimensions of the three-dimensional feature map.
In this embodiment, a way of obtaining a global feature map by serial multi-directional convolution is provided, at this time, the corresponding multi-directional convolution module structure may refer to fig. 7, and mainly includes a first serial module, a second serial module and a third serial module that are sequentially linked, and the three-dimensional feature map is input into the multi-directional convolution module, and is processed by the first serial module, the second serial module and the third serial module, so that the global feature map may be obtained.
In this embodiment, the first serial module, the second serial module and the third serial module respectively correspond to the dimension transposition and convolution processing of the first dimension, the second dimension and the third dimension, specifically, the three-dimensional feature map is input into the first serial module, and the dimension transposition processing on the first dimension is performed on the three-dimensional feature map.
The dimension transposition process is as follows: as set forth above, the three dimensions of the three-dimensional feature map have a preset dimension sequence, namely channel dimension, height dimension and width dimension in turn; that is, a preset dimension sequence exists among the first dimension, the second dimension and the third dimension. If the three-dimensional feature map needs to be transposed along the first dimension, its dimension order is modified so that the first dimension becomes the leading dimension of the modified sequence, yielding the three-dimensional feature map after dimension transposition.
In some embodiments, if the first dimension is not the leading dimension of the preset dimension sequence, i.e. the first dimension is not the channel dimension, the three-dimensional feature map is modified so that the first dimension becomes the leading dimension of the modified sequence, while the order between the second and third dimensions corresponds to their order in the preset dimension sequence; that is, if the third dimension precedes the second dimension in the preset sequence, the third dimension still precedes the second dimension in the sequence of the transposed three-dimensional feature map.
Of course, in some embodiments, the first dimension is a channel dimension, that is, the first dimension is a first dimension of a preset dimension sequence in the three-dimensional feature map, and at this time, the three-dimensional feature map may be directly used as a three-dimensional feature map after the dimension transposition process without performing the dimension transposition of the first dimension on the three-dimensional feature map, and then the convolution process is performed.
After the three-dimensional feature map after dimension transposition is obtained, a first-dimension convolution can be performed in the first serial module to obtain the first convolution map; each neuron of the first convolution map has all the information along the first dimension at its position.
The first convolution graph also comprises three dimensions, and the dimension sequence of the first convolution graph is the same as the dimension sequence of the three-dimensional feature graph after the dimension transposition processing.
The first dimension convolution may be implemented by a 1*1 convolution kernel, or may be implemented by convolution kernels of other dimensions, without specific limitation herein.
Step S630: and according to the second dimension, performing dimension transposition processing on the first convolution map, and performing convolution processing on the first convolution map subjected to the dimension transposition processing to obtain a second convolution map.
The first convolution map then reaches the second serial module, where it undergoes dimension transposition in the same way as the three-dimensional feature map was transposed along the first dimension: the second dimension becomes the leading dimension of the transposed first convolution map, and in its dimension sequence the order between the first and third dimensions is the same as in the preset dimension sequence.
And then, the same as the first serial module, performing convolution processing on the second dimension on the first convolution graph after the dimension transposition processing to obtain a second convolution graph with all information of the current position on the second dimension of each neuron.
Similarly, the second convolution map also includes three dimensions, and the dimension sequence of the second convolution map is the same as the dimension sequence of the first convolution map after the dimension transposition process.
Step S650: and according to the third dimension, performing dimension transposition processing on the second convolution map, and performing convolution processing on the second convolution map subjected to the dimension transposition processing to obtain a third convolution map.
And then, the second convolution graph reaches a third serial module, the execution process in the third serial module is the same as the execution process of the second serial module, namely, the third dimension transposition is carried out on the second convolution graph, so that the third dimension is the first dimension in the dimension sequence of the second convolution graph after the dimension transposition, and in the dimension sequence of the third convolution graph after the dimension transposition, the dimension sequence between the first dimension and the second dimension is the same as the dimension sequence between the first dimension and the second dimension in the preset dimension sequence.
And then, carrying out convolution processing on the second convolution graph subjected to the dimension transposition processing on a third dimension to obtain a third convolution graph with all information of the current position of each neuron on the third dimension, wherein the third convolution graph also comprises three dimensions, and the dimensional sequence of the third convolution graph is the same as that of the second convolution graph subjected to the dimension transposition processing.
Step S670: based on the third convolution map, a global feature map is obtained.
In this embodiment, the dimensional sequence of the global feature map obtained finally needs to be the same as the dimensional sequence of the three-dimensional feature map, so that a detection module may be further configured to detect the dimensional sequence of the third convolution map; if the dimension sequence of the third convolution map is the same as the preset dimension sequence of the three-dimensional feature map, the third convolution map is used as a global feature map; if the dimension sequence of the third convolution map is different from the preset dimension sequence of the three-dimensional feature map, performing dimension transposition on the third convolution map, so that the dimension sequence of the third convolution map after the dimension transposition is the same as the preset dimension sequence, and obtaining the global feature map.
Therefore, if the third dimension in the embodiment is the same as the first dimension in the preset dimension sequence, the detection module may not be provided, and at this time, the dimension sequence of the third convolution graph output in the third serial module is the same as the preset dimension sequence of the three-dimensional feature graph, that is, the third convolution graph is the global feature graph.
Therefore, in one embodiment, the third dimension is a channel dimension, so as to reduce the volume of the multi-direction convolution module, as shown in fig. 8, the first dimension is a height dimension, the second dimension is a width dimension, the third dimension is a channel dimension, and the predetermined dimension order (the dimension of the three-dimensional feature map) is c×h×w.
In the existing convolution process, the three-dimensional feature map with the size of c×h×w is directly convolved, that is, the convolution kernel in the convolution process is fixed in the channel dimension direction and then slides along the height dimension and the width dimension.
In fig. 8, a three-dimensional feature map of size c×h×w is input to the first serial module. The height dimension is transposed to obtain a feature map with dimension sequence h×c×w, and the height-dimension convolution is realized with a 1*1 convolution kernel; the direction indicated by the arrow is the convolution direction, i.e. the convolution kernel is fixed in the height dimension direction and slides along the channel and width dimensions. The dimension sequence of the output first convolution map is also h×c×w, and each neuron on it has all the information along the height dimension at its position.
Then, in the second serial module, the first convolution map undergoes dimension transposition of the width dimension to obtain a first convolution map with dimension order w×c×h, and a 1*1 convolution kernel realizes the width-dimension convolution: the convolution kernel is fixed in the width dimension direction and slides along the channel and height dimensions. The dimension order of the output second convolution map is also w×c×h, and at this point each neuron of the output second convolution map has all the spatial (height and width) information at its current position.
In the third serial module, the second convolution map undergoes dimension transposition of the channel dimension to obtain a second convolution map with dimension order c×h×w, and a 1*1 convolution kernel realizes the channel-dimension convolution: the convolution kernel is fixed in the channel dimension direction and slides along the height and width dimensions. The dimension order of the output third convolution map is also c×h×w, and at this point each neuron of the third convolution map has global information, i.e., the third convolution map is the global feature map.
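The serial pipeline of fig. 8 can be sketched in NumPy: a 1*1 convolution over a permuted tensor amounts to mixing the leading axis at every remaining position, which a single matrix product expresses. This is an illustrative sketch, not the patent's implementation; the helper name, random weights, and tensor sizes are assumed.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution over a (d0, d1, d2) tensor: the kernel is fixed along
    the leading axis and slides over the other two, mixing all d0 values at
    each (d1, d2) position. w has shape (d0_out, d0_in)."""
    return np.einsum('ij,jkl->ikl', w, x)

rng = np.random.default_rng(0)
c, h, w = 4, 5, 6
feat = rng.standard_normal((c, h, w))           # three-dimensional feature map, C x H x W

# First serial module: transpose to H x C x W, convolve along the height dimension
x1 = feat.transpose(1, 0, 2)                    # H x C x W
conv1 = conv1x1(x1, rng.standard_normal((h, h)))

# Second serial module: transpose to W x C x H, convolve along the width dimension
x2 = conv1.transpose(2, 1, 0)                   # W x C x H
conv2 = conv1x1(x2, rng.standard_normal((w, w)))

# Third serial module: transpose back to C x H x W, convolve along the channel dimension
x3 = conv2.transpose(1, 2, 0)                   # C x H x W
global_map = conv1x1(x3, rng.standard_normal((c, c)))

print(global_map.shape)                         # (4, 5, 6), matching the preset order C x H x W
```

After the third module the output already carries the preset c×h×w order, so no final transposition is needed in this configuration.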
This embodiment thus provides a method for obtaining a global feature map: convolution is performed after dimension transposition of the three-dimensional feature map, so that feature information of the three-dimensional feature map in different dimensions is learned. The resulting global feature map characterizes the image information more accurately, which improves the accuracy of subsequent image feature extraction, so that image applications can be performed efficiently based on the image features.
FIG. 9 is a flowchart of another image processing method based on the embodiment of FIG. 2. In an exemplary embodiment, the process of merging the convolution features of the three-dimensional feature map in different dimensions to obtain the global feature map of the image in step S230 of FIG. 2 may include steps S910 to S950, which are described in detail below:
Step S910: and respectively carrying out dimension transposition processing on the three-dimensional feature map on different dimensions to obtain the three-dimensional feature map after the different dimension transposition processing.
Similarly, the three-dimensional feature map includes a first dimension, a second dimension, and a third dimension, where the first dimension, the second dimension, and the third dimension correspond to three different dimensions of the three-dimensional feature map, respectively, and three dimensions in the three-dimensional feature map have a preset dimensional sequence, and are in turn a channel dimension, a height dimension, and a width dimension, that is, a preset dimensional sequence exists among the first dimension, the second dimension, and the third dimension.
Fig. 10 is a schematic diagram of a multi-direction convolution module structure according to another embodiment, where the multi-direction convolution module structure obtains global features through parallel multi-direction convolution, and the parallel multi-direction convolution module in fig. 10 includes a first parallel module, a second parallel module, and a third parallel module that are connected in parallel, and a feature fusion module that is connected to output ends of the first parallel module, the second parallel module, and the third parallel module.
The first parallel module, the second parallel module and the third parallel module respectively correspond to dimension transposition and convolution processing of the first dimension, the second dimension and the third dimension, the three-dimensional feature images are respectively input into the first parallel module, the second parallel module and the third parallel module, and the first parallel module, the second parallel module and the third parallel module respectively carry out dimension transposition on the input three-dimensional feature images and then carry out convolution processing.
In the first parallel module, according to the first dimension, the three-dimensional feature map is subjected to dimension transposition, so as to obtain a three-dimensional feature map after the first dimension transposition, i.e. the dimension transposition is the same as the dimension transposition in the first serial module in fig. 7.
If the first dimension is not the first dimension of the preset dimension sequence in the three-dimensional feature map, carrying out dimension modification on the first dimension in the three-dimensional feature map so that the first dimension is the first dimension in the dimension sequence of the three-dimensional feature map after the first dimension transposition treatment, and the dimension sequence between the second dimension and the third dimension corresponds to the dimension sequence between the second dimension and the third dimension in the preset dimension sequence; if the first dimension is the first dimension of the preset dimension sequence in the three-dimensional feature map, the three-dimensional feature map is not subjected to dimension transposition of the first dimension, and the three-dimensional feature map is directly used as the three-dimensional feature map after the dimension transposition processing.
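The transpose-if-needed rule above — move the target dimension to the front of the dimension order while preserving the relative order of the remaining two, or pass the feature map through unchanged when it is already first — can be sketched as follows. The helper name is illustrative, not from the patent.

```python
import numpy as np

def move_dim_front(x, dim):
    """Move axis `dim` of a 3-D array to the front, keeping the relative
    order of the remaining two axes; if it is already first, return x as-is."""
    if dim == 0:
        return x                                # no transposition needed
    rest = [d for d in range(3) if d != dim]    # remaining axes keep their order
    return x.transpose([dim] + rest)

feat = np.zeros((4, 5, 6))                      # preset order C x H x W
print(move_dim_front(feat, 1).shape)            # height first  -> (5, 4, 6)
print(move_dim_front(feat, 2).shape)            # width first   -> (6, 4, 5)
print(move_dim_front(feat, 0).shape)            # already first -> (4, 5, 6)
```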
In the second parallel module, performing dimension transposition on the three-dimensional feature map according to a second dimension to obtain a three-dimensional feature map after the dimension transposition, wherein the processing mode of the three-dimensional feature map is the same as that of the three-dimensional feature map in the first parallel module, and if the second dimension is not the first dimension of a preset dimension sequence in the three-dimensional feature map, performing dimension modification on the second dimension in the three-dimensional feature map to enable the second dimension to be the first dimension of the dimension sequence of the three-dimensional feature map after the dimension transposition; if the second dimension is the first dimension of the preset dimension sequence in the three-dimensional feature map, the three-dimensional feature map is not subjected to dimension transposition of the second dimension, and the three-dimensional feature map is directly used as the three-dimensional feature map after the dimension transposition processing.
In the third parallel module, performing dimension transposition on the three-dimensional feature map according to a third dimension to obtain a three-dimensional feature map after the third dimension transposition, wherein the three-dimensional feature map is processed in the same manner as that in the first parallel module and the second parallel module, and if the third dimension is not the first dimension of a preset dimension sequence in the three-dimensional feature map, performing dimension modification on the third dimension in the three-dimensional feature map so that the third dimension is the first dimension in the dimension sequence of the three-dimensional feature map after the third dimension transposition; if the third dimension is the first dimension of the preset dimension sequence in the three-dimensional feature map, the three-dimensional feature map is not subjected to dimension transposition of the third dimension, and the three-dimensional feature map is directly used as the three-dimensional feature map after the third dimension transposition.
Step S930: and respectively carrying out convolution processing on the three-dimensional feature graphs subjected to the transposition processing of different dimensions to obtain convolution feature graphs corresponding to the transposition of different dimensions.
In this embodiment, after the first parallel module, the second parallel module, and the third parallel module respectively correspond to the first dimension, the second dimension, and the third dimension, convolution processing in the corresponding dimensions is performed in each module based on the three-dimensional feature map after the dimension is transposed, so as to obtain the convolution feature maps output by each of the three modules.
For example, in the first parallel module, convolution processing in the first dimension is performed on the three-dimensional feature map after the first-dimension transposition, obtaining the convolution feature map output by the first parallel module.
The convolution feature images output by the modules all comprise three dimensions, and the dimensional sequence is the same as that of the three-dimensional feature images subjected to dimensional transposition processing in the corresponding modules.
Likewise, the convolution of the corresponding dimensions in each module may be implemented by a 1*1 convolution kernel, or may be implemented by convolution kernels of other sizes, which are not particularly limited herein.
Step S950: and carrying out fusion processing on convolution feature graphs corresponding to the transposition of different dimensions to obtain a global feature graph.
In this embodiment, after obtaining convolution feature graphs corresponding to different dimensional transpositions, fusion processing may be performed to obtain global features of the image, where the fusion processing is completed in a feature fusion module, specifically, the convolution feature graphs corresponding to the different dimensional transpositions are respectively subjected to dimensional transpositions, so that a dimensional sequence of the convolution feature graphs corresponding to the different dimensional transpositions after the dimensional transpositions is the same as a preset dimensional sequence; performing feature stitching on convolution feature graphs corresponding to different dimensional transposition after the dimensional transposition processing to obtain a high-dimensional feature graph; based on the high-dimensional feature map, a global feature map is obtained.
In an embodiment, if the dimension order of the convolution feature images output by the first parallel module is different from the preset dimension order, the convolution feature images output by the first parallel module are dimension transposed, so that the dimension order of the dimension transposed convolution feature images is the same as the preset dimension order.
When feature stitching is performed on the dimension-transposed convolution feature maps corresponding to the different dimension transpositions, three feature maps of size c×h×w are stitched, so the resulting high-dimensional feature map has size 3c×h×w, i.e., its channel count is 3 times that of the three-dimensional feature map. The channel count of the high-dimensional feature map is then reduced to that of the three-dimensional feature map to obtain the global feature map, ensuring that the global feature map has size c×h×w.
In a specific embodiment, the first dimension is the channel dimension, the second dimension is the width dimension, and the third dimension is the height dimension; the process of obtaining global features in parallel is shown in fig. 11. The input three-dimensional feature map has size (in dimension order) c×h×w. In the first parallel module, a 1*1 convolution is applied directly: the convolution kernel is fixed in the channel dimension direction and slides along the height and width dimensions. The output convolution feature map has size c×h×w, and each of its neurons has the information on all channels at its current position.
In the second parallel module, dimension transposition of the width dimension is performed to obtain a feature map with the dimension w×c×h, and the 1*1 convolution kernel is used for realizing the width dimension convolution, namely the convolution kernel is fixed in the width dimension direction and then slides along the channel dimension and the height dimension, the dimension of the output convolution feature map is w×c×h, and at the moment, all the neurons on the output convolution feature map have all the information on the width dimension of the current position.
In a third parallel module, performing dimension transposition of a height dimension to obtain a feature map with the dimension of h×c×w, and using a 1*1 convolution kernel to realize the height dimension convolution, namely fixing the convolution kernel in the height dimension direction, then sliding along the channel dimension and the width dimension, wherein the dimension of the output convolution feature map is h×c×w, and at the moment, each neuron on the output convolution feature map has all information on the height dimension of the current position.
In the feature fusion module, the dimension orders of the convolution feature maps output by the second and third parallel modules are transposed to c×h×w, and the convolution feature maps output by the first, second and third parallel modules are stitched along the channel dimension into a high-dimensional feature map of size 3c×h×w.
And then the 1*1 convolution is used for reducing the dimension of the high-dimension feature map, the dimension of the output global feature map is c x h x w, and at the moment, each neuron on the output global feature map has global information.
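The parallel pipeline of fig. 11 — three branches convolving along the channel, width, and height dimensions respectively, followed by channel-wise concatenation and a 1*1 reduction — can be sketched like this (an illustrative sketch with assumed sizes and random weights, not the patent's implementation):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 conv mixing the leading axis of a (d0, d1, d2) tensor; w is (d0_out, d0_in)."""
    return np.einsum('ij,jkl->ikl', w, x)

rng = np.random.default_rng(0)
c, h, wd = 4, 5, 6
feat = rng.standard_normal((c, h, wd))            # C x H x W

# Branch 1: channel dimension is already first -> convolve directly
b1 = conv1x1(feat, rng.standard_normal((c, c)))   # C x H x W

# Branch 2: width first (W x C x H), convolve, then transpose back to C x H x W
x2 = feat.transpose(2, 0, 1)
b2 = conv1x1(x2, rng.standard_normal((wd, wd))).transpose(1, 2, 0)

# Branch 3: height first (H x C x W), convolve, then transpose back to C x H x W
x3 = feat.transpose(1, 0, 2)
b3 = conv1x1(x3, rng.standard_normal((h, h))).transpose(1, 0, 2)

# Fusion: stitch along channels into a 3C x H x W map, then 1x1-reduce to C channels
high = np.concatenate([b1, b2, b3], axis=0)       # 3C x H x W
global_map = conv1x1(high, rng.standard_normal((c, 3 * c)))
print(global_map.shape)                           # (4, 5, 6) == C x H x W
```

The reduction convolution plays the role of the feature fusion module's dimension-reducing 1*1 convolution, restoring the preset c×h×w size.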
As can be seen from the foregoing, this embodiment fuses convolution features on different dimensions of the three-dimensional feature map in a serial or parallel manner to obtain the global feature. In this process, the computation of the serially obtained global feature map is hwc(h+w+c), and that of the global feature map obtained in parallel is hwc(h+w+4c); both are smaller than the self-attention computation of other global modeling networks. In other words, the global feature acquisition method provided in this embodiment has a small computational cost while still fusing convolution features across the different dimensions of the three-dimensional feature map, yielding high-accuracy global features. The small model computation makes the model easy to deploy on platforms with limited computing power, and the high-accuracy global features improve the efficiency of image applications.
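The cost formulas can be checked with a small multiply count. Each 1*1 convolution over a d0×d1×d2 tensor costs d0·d0·d1·d2 multiplies, so the three serial stages cost h²cw + w²ch + c²hw = hwc(h+w+c), and the parallel version adds the 3c→c fusion convolution (3c²hw), giving hwc(h+w+4c). The feature-map sizes below are assumed for illustration, not taken from the patent:

```python
# Multiply counts for one assumed feature-map size, versus a rough
# O((hw)^2 * c) estimate for global self-attention over the same map.
c, h, w = 64, 56, 56
serial = h * w * c * (h + w + c)
parallel = h * w * c * (h + w + 4 * c)
self_attention = (h * w) ** 2 * c

print(serial, parallel, self_attention)
print(self_attention // parallel)   # how many times costlier self-attention is here
```

The gap widens with resolution: self-attention grows quadratically in h·w, while both multi-direction convolution variants grow only linearly.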
Based on the image processing method shown in fig. 2 to 11, this embodiment proposes a way of obtaining image features by applying the method; see fig. 12, which shows an image classification application. The image to be classified is input to the initial feature extraction module, where it is divided into a plurality of image blocks and the block features corresponding to each image block are extracted by linear mapping. The block features are then integrated by reshape (shape change) into a three-dimensional feature map of size c×h×w. This feature map enters the multi-direction convolution module, where the serial or parallel processing shown in fig. 6 or fig. 9 produces a global feature map of size c×h×w. The global feature map enters the pooling module, where average pooling yields the image features, which then enter a fully connected layer for classification to obtain the category of the image to be classified.
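The end-to-end classification pipeline of fig. 12 can be sketched as follows. The multi-direction convolution stage is left as a pass-through placeholder; the patch size, embedding dimension, class count, and random weights are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 32, 32))             # image to classify (channels, H, W)
p, c, num_classes = 8, 16, 10                      # patch size / embed dim / classes (assumed)

# Initial feature extraction: divide into image blocks, linearly map each to c dims
gh, gw = 32 // p, 32 // p                          # 4 x 4 grid of blocks
patches = img.reshape(3, gh, p, gw, p).transpose(1, 3, 0, 2, 4).reshape(gh * gw, -1)
w_embed = rng.standard_normal((patches.shape[1], c))
tokens = patches @ w_embed                         # (16, c) block features

# Feature integration via reshape into a C x H x W three-dimensional feature map
feat = tokens.T.reshape(c, gh, gw)

# The serial or parallel multi-direction convolution would run here to
# produce a C x H x W global feature map; a placeholder stands in for it.
global_map = feat

# Pooling module: average pooling over H and W, then a fully connected layer
image_feature = global_map.mean(axis=(1, 2))       # (c,) image features
logits = image_feature @ rng.standard_normal((c, num_classes))
print(int(np.argmax(logits)))                      # predicted class index
```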
Fig. 13 is a schematic structural view of an image processing apparatus according to an exemplary embodiment. As shown in fig. 13, in an exemplary embodiment, the image processing apparatus includes: the feature acquisition module 1310 is configured to perform feature extraction on the image to obtain a three-dimensional feature map of the image; the multi-direction convolution module 1330 is configured to fuse the convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image; the convolution characteristics of the three-dimensional feature map on different dimensions are obtained by performing dimension transposition processing on the basis of the three-dimensional feature map and then performing convolution processing; the image feature obtaining module 1350 is configured to pool the global feature map to obtain image features of the image.
The image processing device in the embodiment can accurately acquire the image representation of the representation image information, and the image processing efficiency is high.
In one embodiment, the feature acquisition module includes: an image dividing unit configured to divide an image into a plurality of image blocks; the image block feature extraction unit is configured to respectively perform feature extraction on a plurality of image blocks to obtain image block features corresponding to the image blocks; and the feature acquisition unit is configured to integrate the features of the image blocks corresponding to the image blocks to obtain a three-dimensional feature map.
In an embodiment, the image block feature extraction unit includes: an image block vector sub-unit configured to acquire the image block vectors respectively corresponding to the plurality of image blocks; and an image block feature extraction sub-unit configured to perform linear mapping on the image block vectors corresponding to the image blocks to obtain the image block features corresponding to the image blocks.
In one embodiment, the three-dimensional feature map includes a first dimension, a second dimension, and a third dimension; a multi-way convolution module comprising: the first transfer convolution unit is configured to perform dimension transposition processing on the three-dimensional feature map according to the first dimension, and perform convolution processing on the three-dimensional feature map subjected to the dimension transposition processing to obtain a first convolution map; the second transposition convolution unit is configured to carry out dimension transposition processing on the first convolution map according to the second dimension, and carry out convolution processing on the first convolution map after the dimension transposition processing to obtain a second convolution map; the third transposition convolution unit is configured to perform dimension transposition processing on the second convolution map according to a third dimension, and perform convolution processing on the second convolution map subjected to the dimension transposition processing to obtain a third convolution map; the first global feature map acquisition unit is configured to acquire a global feature map based on the third convolution map.
In an embodiment, a preset dimension order exists among the first dimension, the second dimension and the third dimension of the three-dimensional feature map; the first transposition convolution unit includes: a transposition sub-unit configured to perform dimension modification on the three-dimensional feature map so that the first dimension becomes the first dimension in the dimension order of the modified three-dimensional feature map, obtaining the three-dimensional feature map after dimension transposition.
In an embodiment, a preset dimension order exists among the first dimension, the second dimension and the third dimension of the three-dimensional feature map; the first global feature map acquisition unit includes: a dimension order detection sub-unit configured to detect the dimension order of the third convolution map; a first global feature map acquisition sub-unit configured to take the third convolution map as the global feature map if the dimension order of the third convolution map is the same as the preset dimension order of the three-dimensional feature map; and a second global feature map acquisition sub-unit configured to perform dimension transposition on the third convolution map if its dimension order differs from the preset dimension order of the three-dimensional feature map, so that the dimension order after transposition matches the preset dimension order, obtaining the global feature map.
In one embodiment, the multi-way convolution module includes: the parallel dimension transposition unit is configured to carry out dimension transposition processing on the three-dimensional feature map in different dimensions respectively to obtain a three-dimensional feature map subjected to the different dimension transposition processing; the parallel convolution unit is configured to carry out convolution processing on the three-dimensional feature images subjected to the transposition processing of different dimensions respectively to obtain convolution feature images corresponding to the transposition of different dimensions; the second global feature map obtaining unit is configured to fuse the convolution feature maps corresponding to the transpose of different dimensions to obtain a global feature map.
In one embodiment, the different dimensions of the three-dimensional feature map include a first dimension, a second dimension, and a third dimension; the parallel dimension transposition unit includes: a first parallel dimension transposition sub-unit configured to perform dimension transposition on the three-dimensional feature map according to the first dimension to obtain the three-dimensional feature map after the first-dimension transposition; a second parallel dimension transposition sub-unit configured to perform dimension transposition on the three-dimensional feature map according to the second dimension to obtain the three-dimensional feature map after the second-dimension transposition; and a third parallel dimension transposition sub-unit configured to perform dimension transposition on the three-dimensional feature map according to the third dimension to obtain the three-dimensional feature map after the third-dimension transposition.
In one embodiment, a predetermined dimension order exists between different dimensions of the three-dimensional feature map; the second global feature map acquisition unit includes:
a dimension transposition sub-unit configured to perform dimension transposition on the convolution feature maps corresponding to the different dimension transpositions, so that the dimension order of each transposed convolution feature map is the same as the preset dimension order;
a feature stitching sub-unit configured to perform feature stitching on the dimension-transposed convolution feature maps corresponding to the different dimension transpositions to obtain a high-dimensional feature map;
and a second global feature map acquisition block configured to acquire the global feature map based on the high-dimensional feature map.
In an embodiment, the second global feature map obtaining block includes:
and the second global feature map obtaining sub-block is configured to reduce the number of channels of the high-dimensional feature map to be the same as the number of channels of the three-dimensional feature map, so as to obtain the global feature map.
It should be noted that, the image processing apparatus provided in the foregoing embodiment and the image processing method provided in the foregoing embodiment belong to the same concept, and a specific manner in which each module and unit perform an operation has been described in detail in the method embodiment, which is not described herein again.
The embodiment of the application also provides electronic equipment, which comprises: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the image processing methods provided in the respective embodiments described above.
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1100 of the electronic device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 14, the computer system 1100 includes a central processing unit (Central Processing Unit, CPU) 1101 that can perform various appropriate actions and processes, such as performing the method in the above-described embodiment, according to a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access Memory (Random Access Memory, RAM) 1103. In the RAM 1103, various programs and data required for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the internet. The drive 1110 is also connected to the I/O interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in drive 1110, so that a computer program read therefrom is installed as needed in storage section 1108.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. When executed by a Central Processing Unit (CPU) 1101, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
Another aspect of the application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image processing method as before. The computer-readable storage medium may be included in the electronic device described in the above embodiment or may exist alone without being incorporated in the electronic device.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the image processing methods provided in the respective embodiments described above.
The foregoing is merely illustrative of the preferred embodiments of the present application and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make corresponding variations or modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be defined by the claims.

Claims (14)

1. An image processing method, comprising:
extracting features of an image to obtain a three-dimensional feature map of the image;
fusing convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, wherein the convolution features of the three-dimensional feature map in the different dimensions are obtained by performing dimension transposition processing on the three-dimensional feature map and then performing convolution processing;
and carrying out pooling treatment on the global feature map to obtain the image features of the image.
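The overall pipeline of claim 1 (extract a three-dimensional feature map, transpose and convolve it on each dimension, fuse, then pool) can be sketched numerically. This is a minimal illustration, not the patented implementation: the 1-D kernel, the mean fusion, and the global average pooling are assumed placeholder choices.

```python
import numpy as np

def conv_along(x, kernel, axis):
    """1-D 'same' convolution applied along one axis of a 3-D array."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, x)

def image_features(feature_map, kernel=np.array([0.25, 0.5, 0.25])):
    """Transpose-then-convolve on each dimension, fuse, then pool.

    feature_map: (C, H, W). The kernel, the mean fusion, and the global
    average pooling are illustrative choices, not specified by the claims.
    """
    branches = []
    for axis in range(3):
        t = np.moveaxis(feature_map, axis, 0)     # dimension transposition
        t = conv_along(t, kernel, axis=0)         # convolution on the transposed map
        branches.append(np.moveaxis(t, 0, axis))  # restore the original order
    global_map = np.mean(branches, axis=0)        # fuse the three branches
    return global_map.mean(axis=(1, 2))           # pool to per-channel features
```

For a `(C, H, W)` input this returns a length-`C` feature vector, matching the claim's "image features of the image" after pooling.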
2. The method of claim 1, wherein the three-dimensional feature map comprises a first dimension, a second dimension, and a third dimension; the fusing the convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image comprises the following steps:
performing dimension transposition on the three-dimensional feature map according to the first dimension, and performing convolution on the three-dimensional feature map subjected to the dimension transposition to obtain a first convolution map;
performing dimension transposition on the first convolution map according to the second dimension, and performing convolution on the first convolution map subjected to the dimension transposition to obtain a second convolution map;
performing dimension transposition on the second convolution map according to the third dimension, and performing convolution on the second convolution map subjected to the dimension transposition to obtain a third convolution map;
and acquiring the global feature map based on the third convolution map.
3. The method of claim 2, wherein a predetermined order of dimensions exists between the first dimension, the second dimension, and the third dimension of the three-dimensional feature map; performing dimension transposition processing on the three-dimensional feature map according to the first dimension, and performing convolution processing on the three-dimensional feature map after the dimension transposition processing to obtain a first convolution map, wherein the method comprises the following steps of:
and carrying out dimension modification processing on the three-dimensional feature map so that the first dimension comes first in the dimension sequence of the modified three-dimensional feature map, thereby obtaining the three-dimensional feature map after the dimension transposition processing.
4. The method of claim 2, wherein a predetermined order of dimensions exists between the first dimension, the second dimension, and the third dimension of the three-dimensional feature map; the obtaining the global feature map based on the third convolution map includes:
detecting the dimension sequence of the third convolution map;
if the dimension sequence of the third convolution map is the same as the preset dimension sequence of the three-dimensional feature map, taking the third convolution map as the global feature map;
if the dimension sequence of the third convolution map is different from the preset dimension sequence of the three-dimensional feature map, performing dimension transposition on the third convolution map so that the dimension sequence of the third convolution map after the dimension transposition is identical to the preset dimension sequence, thereby obtaining the global feature map.
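The detect-and-restore step of claim 4 amounts to a permutation check. In this sketch the `order` tuple, which records which preset dimension currently sits at each position, is an assumed bookkeeping device rather than anything the claim specifies.

```python
import numpy as np

def restore_dimension_order(conv_map, order, preset=(0, 1, 2)):
    """Return the global feature map in the preset dimension sequence.

    order[i] names which preset dimension currently sits at position i
    of conv_map (an assumed representation of the dimension sequence).
    """
    if tuple(order) == tuple(preset):
        return conv_map                     # sequences match: use as-is
    inverse = np.argsort(order)             # permutation undoing `order`
    return np.transpose(conv_map, inverse)  # transpose back to the preset
```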
5. The method according to claim 1, wherein the fusing the convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image includes:
performing dimensional transposition processing on the three-dimensional feature map on different dimensions respectively to obtain a three-dimensional feature map subjected to the dimensional transposition processing;
respectively carrying out convolution processing on the three-dimensional feature maps subjected to the different dimension transpositions to obtain convolution feature maps corresponding to the different dimension transpositions;
and carrying out fusion processing on the convolution feature maps corresponding to the different dimension transpositions to obtain the global feature map.
6. The method of claim 5, wherein the different dimensions of the three-dimensional feature map include a first dimension, a second dimension, and a third dimension; the step of performing dimension transposition processing on the three-dimensional feature map on different dimensions to obtain a three-dimensional feature map after the dimension transposition processing comprises the following steps:
according to the first dimension, performing dimension transposition on the three-dimensional feature map to obtain a three-dimensional feature map after the first dimension transposition;
according to the second dimension, performing dimension transposition on the three-dimensional feature map to obtain a three-dimensional feature map after the second dimension transposition;
and according to the third dimension, performing dimension transposition on the three-dimensional feature map to obtain a three-dimensional feature map after the third dimension transposition.
7. The method of claim 5, wherein a predetermined order of dimensions exists between different dimensions of the three-dimensional feature map; the carrying out fusion processing on the convolution feature maps corresponding to the different dimension transpositions to obtain the global feature map comprises the following steps:
performing dimension transposition processing on the convolution feature maps corresponding to the different dimension transpositions respectively, so that the dimension sequence of each convolution feature map after the dimension transposition processing is the same as the preset dimension sequence;
performing feature stitching on the convolution feature maps after the dimension transposition processing to obtain a high-dimensional feature map;
and acquiring the global feature map based on the high-dimensional feature map.
8. The method of claim 7, wherein the obtaining the global feature map based on the high-dimensional feature map comprises:
and reducing the number of channels of the high-dimensional feature map to be the same as the number of channels of the three-dimensional feature map, and obtaining the global feature map.
9. The method according to claim 1, wherein the feature extraction of the image to obtain a three-dimensional feature map of the image comprises:
dividing the image into a plurality of image blocks;
performing feature extraction on the plurality of image blocks respectively to obtain image block features corresponding to each image block;
and carrying out feature integration on the image block features corresponding to the image blocks to obtain the three-dimensional feature map.
10. The method according to claim 9, wherein the performing feature extraction on the plurality of image blocks to obtain image block features corresponding to the image blocks includes:
acquiring image block vectors corresponding to the plurality of image blocks respectively;
and carrying out linear mapping processing on the image block vectors corresponding to the image blocks to obtain the image block characteristics corresponding to the image blocks.
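The block-splitting and linear-mapping steps of claims 9 and 10 resemble a patch-embedding stage: split the image into blocks, flatten each block into an image block vector, and linearly map it to an image block feature. In this sketch the patch size, embedding width, and random mapping weights are all assumptions for illustration.

```python
import numpy as np

def image_block_features(image, patch=4, dim=16, seed=0):
    """Split into blocks, vectorize, and linearly map (claims 9-10 sketch)."""
    h, w, c = image.shape  # assumes H and W are divisible by `patch`
    blocks = (image.reshape(h // patch, patch, w // patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(-1, patch * patch * c))  # image block vectors
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((patch * patch * c, dim))  # linear mapping
    return blocks @ weights  # (num_blocks, dim) image block features
```

The resulting `(num_blocks, dim)` matrix would then be integrated into the three-dimensional feature map of claim 9.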
11. An image processing apparatus, comprising:
the feature acquisition module is configured to perform feature extraction on the image to obtain a three-dimensional feature map of the image;
the multi-direction convolution module is configured to fuse convolution features of the three-dimensional feature map in different dimensions to obtain a global feature map of the image, wherein the convolution features of the three-dimensional feature map in the different dimensions are obtained by performing dimension transposition processing on the three-dimensional feature map and then performing convolution processing;
and the image feature acquisition module is configured to pool the global feature map to obtain image features of the image.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the method of any of claims 1-10.
13. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the method of any of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202310417354.5A 2023-04-11 2023-04-11 Image processing method, device, electronic equipment and storage medium Pending CN116958570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310417354.5A CN116958570A (en) 2023-04-11 2023-04-11 Image processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116958570A true CN116958570A (en) 2023-10-27

Family

ID=88459202

Legal Events

Date Code Title Description
PB01 Publication