CN111931781A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN111931781A
Authority
CN
China
Prior art keywords
feature, feature map, image, maps, map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010801553.2A
Other languages
Chinese (zh)
Inventor
刘建博
李鸿升
任思捷
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime Group Ltd
Original Assignee
Sensetime Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime Group Ltd
Publication of CN111931781A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium, the method including: performing feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, wherein the scales of the first feature maps in the N first feature maps are different from one another; generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps among the N first feature maps; determining a second feature map of the image to be processed according to the n feature codewords and the first weight feature maps; and determining a processing result of the image to be processed according to the second feature map. The disclosed embodiments can achieve high-precision image processing.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
In the field of computer vision, an image usually needs to be processed, for example, by extracting a feature map of the image, analyzing object information in the image scene according to the feature map, and then obtaining a processing result of the image. However, image processing methods in the related art either have a limited feature receptive field, which results in poor processing accuracy, or improve accuracy at the cost of a large amount of computation, which limits their range of application; processing speed and processing accuracy therefore cannot both be achieved at the same time.
Disclosure of Invention
The present disclosure proposes an image processing technical solution.
According to an aspect of the present disclosure, there is provided an image processing method including: performing feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, wherein the scales of the first feature maps in the N first feature maps are different from one another, N being an integer and N ≥ 2; generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps among the N first feature maps, wherein M is an integer and 2 ≤ M ≤ N, and n is an integer greater than 1; determining a second feature map of the image to be processed according to the n feature codewords and the first weight feature maps; and determining a processing result of the image to be processed according to the second feature map.
In a possible implementation manner, generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps among the N first feature maps includes: performing scale reduction and fusion on the M first feature maps to obtain a third feature map; performing convolution transformations on the third feature map to obtain a codebook feature map of the third feature map and a second weight feature map of the n feature codewords, wherein the heights and widths of the codebook feature map and the second weight feature map are the same as those of the third feature map, and the number of channels of the second weight feature map corresponds to the number n; and determining the n feature codewords according to the codebook feature map and the second weight feature map, wherein each feature codeword includes a feature vector whose length is equal to the number of channels of the codebook feature map.
In a possible implementation manner, generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps among the N first feature maps includes: performing scale enlargement and fusion on the M first feature maps to obtain a fourth feature map; and determining a first weight feature map of the n feature codewords according to the fourth feature map and the codebook feature map, wherein the height and width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n.
In a possible implementation manner, determining a first weight feature map of the n feature codewords according to the fourth feature map and the codebook feature map includes: performing global pooling on the codebook feature map to obtain a global average vector of the n feature codewords; and determining the first weight feature map according to the fourth feature map and the global average vector.
In a possible implementation manner, determining a second feature map of the image to be processed according to the n feature codewords and the first weight feature map includes: rearranging the first weight feature map to obtain a weight matrix of the n feature codewords; multiplying the n feature codewords by the weight matrix to obtain a feature matrix of the n feature codewords; and determining the second feature map according to the feature matrix of the n feature codewords.
In a possible implementation manner, determining the second feature map according to the feature matrix of the n feature codewords includes: rearranging the feature matrix of the n feature codewords to obtain a fifth feature map; and fusing the fifth feature map and the fourth feature map to obtain the second feature map.
In a possible implementation manner, determining a processing result of the image to be processed according to the second feature map includes: performing convolution transformation and up-sampling on the second feature map to obtain a segmentation feature map of the image to be processed, wherein the height and width of the segmentation feature map are the same as those of the image to be processed; and segmenting the image to be processed according to the segmentation feature map to obtain the processing result of the image to be processed.
In one possible implementation, the processing result of the image to be processed includes a position and/or a category of an object in the image to be processed.
In one possible implementation, the number n of feature code words is greater than the number of categories of the target.
According to an aspect of the present disclosure, there is provided an image processing apparatus including:
a feature extraction module, configured to perform feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, the scales of the first feature maps in the N first feature maps being different from one another, N being an integer and N ≥ 2; a codeword and weight generating module, configured to generate n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps among the N first feature maps, where M is an integer and 2 ≤ M ≤ N, and n is an integer greater than 1; a feature map determining module, configured to determine a second feature map of the image to be processed according to the n feature codewords and the first weight feature maps; and a result determining module, configured to determine the processing result of the image to be processed according to the second feature map.
In a possible implementation manner, the codeword and weight generating module includes: a scale reduction and fusion submodule, used for performing scale reduction and fusion on the M first feature maps to obtain a third feature map; a transformation submodule, used for performing convolution transformations on the third feature map to obtain a codebook feature map of the third feature map and a second weight feature map of the n feature codewords, the heights and widths of the codebook feature map and the second weight feature map being the same as those of the third feature map, and the number of channels of the second weight feature map corresponding to the number n; and a codeword determining submodule, used for determining the n feature codewords according to the codebook feature map and the second weight feature map, each feature codeword including a feature vector whose length is equal to the number of channels of the codebook feature map.
In a possible implementation manner, the codeword and weight generating module includes: the scale amplification and fusion submodule is used for carrying out scale amplification and fusion on the M first feature maps to obtain a fourth feature map; and the weight map determining submodule is used for determining a first weight feature map of the n feature code words according to the fourth feature map and the codebook feature map, wherein the height and the width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n.
In one possible implementation manner, the weight map determining sub-module is configured to: performing global pooling on the codebook feature map to obtain a global average vector of the n feature codewords; and determining the first weighted feature map according to the fourth feature map and the global average vector.
In one possible implementation, the feature map determining module includes: a rearrangement submodule, used for rearranging the first weight feature map to obtain a weight matrix of the n feature codewords; a multiplication submodule, used for multiplying the n feature codewords by the weight matrix to obtain the feature matrix of the n feature codewords; and a feature map determining submodule, used for determining the second feature map according to the feature matrix of the n feature codewords.
In a possible implementation manner, the feature map determining submodule is configured to: rearrange the feature matrix of the n feature codewords to obtain a fifth feature map; and fuse the fifth feature map and the fourth feature map to obtain the second feature map.
In one possible implementation, the result determination module includes: a transformation and up-sampling submodule, used for performing convolution transformation and up-sampling on the second feature map to obtain a segmentation feature map of the image to be processed, the height and width of the segmentation feature map being the same as those of the image to be processed; and a segmentation submodule, used for segmenting the image to be processed according to the segmentation feature map to obtain the processing result of the image to be processed.
In one possible implementation, the processing result of the image to be processed includes a position and/or a category of an object in the image to be processed.
In one possible implementation, the number n of feature code words is greater than the number of categories of the target.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to the embodiments of the present disclosure, N first feature maps of an image can be extracted; a plurality of feature codewords and corresponding weight feature maps are generated according to M of the N first feature maps, so as to obtain a second feature map with high resolution and rich semantic information; and by determining the processing result of the image based on the second feature map, highly accurate image processing can be realized with low computational complexity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of an image processing network of an image processing method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a process of an image processing method according to an embodiment of the present disclosure.
Fig. 4 illustrates a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Semantic segmentation in the field of computer vision refers to: for a given image and a given set of object categories, judging the specific object category of each pixel by understanding the semantic information in the image scene, so as to recognize the objects in the image. Semantic segmentation has important application value in many fields, for example, intelligent recognition of lane lines and other related target regions in the field of automatic driving, the scene segmentation function in intelligent image editing tasks, the semantic perception part of augmented reality (AR) and simultaneous localization and mapping (SLAM) tasks, and recognition and segmentation of lesion regions in the field of intelligent medical treatment.
A semantic perception method based on deep learning generally uses an end-to-end convolutional neural network (such as a residual network) as a semantic information encoder or backbone network to analyze the semantic information of an image and extract a semantic feature map of the image, and then uses a fully connected layer to derive a semantic category distribution probability map from the semantic feature map so as to obtain the final prediction result. However, the semantic feature map output by the last layer of the semantic information encoder has low resolution and has lost spatial detail information, resulting in poor accuracy.
In the related art, an encoder-decoder structure may be adopted, i.e., the resolution of the feature map is raised step by step by a decoder. However, the upsampling operations in the decoder are a form of local perception, so the resulting high-resolution semantic feature map cannot perceive semantic information over a larger field of view, and the segmentation accuracy is poor.
In the related art, an encoder structure based on dilated (hole) convolution may also be adopted, i.e., in the last network blocks of the encoder, dilated convolution is used instead of ordinary convolution so that the resolution of the semantic feature map is no longer reduced, and higher accuracy can be achieved. However, this approach makes the convolutional layers in the last network blocks of the encoder particularly computation-heavy, which reduces the processing speed, so the approach cannot be widely applied on hardware with limited computing resources.
According to the embodiments of the present disclosure, an image processing method is provided, which can obtain a feature-enhanced high-resolution feature map with essentially no increase in computational complexity and significantly improve the accuracy of semantic segmentation.
Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure, which includes, as shown in fig. 1:
in step S11, performing feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, where the scales of the first feature maps in the N first feature maps are different, N is an integer and N is greater than or equal to 2;
in step S12, according to M first feature maps among the N first feature maps, n feature codewords and first weight feature maps of the n feature codewords are generated, where M is an integer and 2 ≤ M ≤ N, and n is an integer greater than 1;
in step S13, determining a second feature map of the image to be processed according to the n feature codewords and the first weight feature maps;
in step S14, a processing result of the image to be processed is determined according to the second feature map.
In one possible implementation, the image processing method may be performed by an electronic device such as a terminal device or a server, the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA) device, a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the method may be performed by a server.
In one possible implementation, the image to be processed may be an image captured by an image capturing device (e.g., a camera), and the image includes one or more objects to be identified, such as people, animals, vehicles, objects, and so on. The present disclosure is not limited to the source of the image to be processed and the specific type of object in the image to be processed.
In a possible implementation manner, in step S11, feature extraction may be performed on the image to be processed through a feature extraction network, so as to obtain N first feature maps of the image to be processed, where N is an integer and N is greater than or equal to 2. The feature extraction network may, for example, comprise a plurality of convolutional layers, each convolutional layer outputting a first feature map; the feature extraction network may also be, for example, a residual network, comprising a plurality of stages of residual network blocks (ResBlock), each stage of the residual network blocks outputting a first feature map. The present disclosure does not limit the specific network structure of the feature extraction network.
In one possible implementation, the scales of the N first feature maps are different from one another. Each first feature map may be the output of a different level of the feature extraction network, and the resolution of the feature maps (i.e., the height and width of the feature maps) decreases level by level, forming a feature pyramid. For example, in the case where the feature extraction network includes 5 levels with a step size of 2, the heights and widths of the first feature maps are 1/2, 1/4, 1/8, 1/16, and 1/32 of those of the image to be processed, which may be denoted as OS=2, OS=4, OS=8, OS=16, and OS=32, respectively. Assuming that the image to be processed has height H × width W, the feature map with OS=8 has height (H/8) × width (W/8).
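The following is a minimal illustrative sketch (assuming PyTorch; all class and variable names are hypothetical) of how a multi-level backbone can produce such a pyramid of first feature maps, with each stage halving the height and width. The embodiments may use a residual network instead; a plain strided-convolution backbone is used here only to show the shapes.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            # each stage halves height/width, so stage k outputs a map at OS = 2**(k+1)
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # the N first feature maps, finest to coarsest
        return feats

feats = ToyBackbone()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])   # OS=2 ... OS=32
```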
In one possible implementation, a first feature map output by a later network level (which may be referred to as a high level) of the feature extraction network, for example a first feature map with scale OS=16 or OS=32, has lower resolution and has lost most image structure details, but carries richer semantic information for classification; the first feature maps output by earlier network levels (which may be referred to as low levels), such as the first feature maps with scales OS=4 and OS=8, have higher resolution and retain rich image structure information, but come from relatively shallow layers, so their semantic information for classification is not accurate enough to guide an upsampling process. Therefore, image processing can be performed by combining the first feature maps output by multiple network levels.
In one possible implementation manner, in step S12, M first feature maps may be selected from the N first feature maps, where M is an integer and 2 ≤ M ≤ N. For example, if the scales of the first feature maps include OS=2, OS=4, OS=8, OS=16, and OS=32, the three first feature maps with OS=8, OS=16, and OS=32 may be selected for processing. The present disclosure does not limit the selection manner or the number of selected first feature maps.
In a possible implementation manner, n feature codewords (also referred to as semantic codewords) and the first weight feature maps of the n feature codewords may be generated from the M first feature maps. That is, a plurality of feature codewords may be generated from the low-resolution first feature maps to capture global semantic information, and a weight feature map of the plurality of feature codewords (which may be referred to as the first weight feature map) may be generated from the high-resolution first feature maps to retain the structural information of the image.
In one possible implementation, the number n of feature codewords may be greater than the number of categories of the target. The feature code words can be regarded as clustering results of feature information in the feature graph, and the greater the number of the clustering categories, the richer the captured semantic information. In this case, the number n of feature codewords may be set to be greater than the actual number of categories of objects to be recognized in the image, thereby improving the effect of feature enhancement.
In one possible implementation, the number n of feature code words may be set to be about 4 times the number of categories of the target, for example, when the category of the object to be segmented in the data set is 150 categories, the number n of feature code words may be set to be 600; when the class of the object to be segmented in the data set is 60 classes, the number n of the feature code words may be set to 256.
It should be understood that the number n of the feature code words can be set by those skilled in the art according to practical situations, and the present disclosure does not limit this.
In a possible implementation manner, in step S13, a second feature map of the image to be processed may be determined according to the n feature codewords and the first weight feature map. The n feature codewords may be multiplied by the rearranged first weight feature map to obtain a feature matrix; after the feature matrix is rearranged and transformed, the second feature map is obtained. That is, the feature codewords may be linearly combined at each spatial position of the weight feature map to realize feature upsampling and recover image details, so as to obtain a feature map with high resolution and rich semantic information, namely the second feature map.
In one possible implementation manner, in step S14, a processing result of the image to be processed may be determined according to the second feature map. Wherein, the second characteristic diagram can be directly segmented to obtain a segmentation result; the second feature map may be further processed, and the processed feature map may be segmented to obtain a segmentation result.
In a possible implementation manner, the processing result of the image to be processed may be the segmentation result described above, or may be a result obtained by processing the segmentation result again according to an actual image processing task. For example, in an image editing task, a foreground region and a background region may be distinguished according to a segmentation result, and corresponding processing is performed on the foreground region and/or the background region, for example, blurring processing is performed on the background region, so as to obtain a final image processing result. The present disclosure does not limit the segmentation method of the feature map and the specific content included in the processing result.
According to the embodiments of the present disclosure, N first feature maps of an image can be extracted; a plurality of feature codewords and corresponding weight feature maps are generated according to M of the N first feature maps, so as to obtain a second feature map with high resolution and rich semantic information; and by determining the processing result of the image based on the second feature map, highly accurate image processing can be realized with low computational complexity.
In one possible implementation, after the N first feature maps of the image to be processed are extracted in step S11, M first feature maps among the N first feature maps may be processed in step S12. Wherein, the step S12 may include:
carrying out scale reduction and fusion on the M first feature maps to obtain a third feature map;
performing convolution transformations on the third feature map to obtain a codebook feature map of the third feature map and a second weight feature map of the n feature codewords, wherein the heights and widths of the codebook feature map and the second weight feature map are the same as those of the third feature map, and the number of channels of the second weight feature map corresponds to the number n;
and determining the n feature codewords according to the codebook feature map and the second weight feature map, wherein each feature codeword includes a feature vector whose length is equal to the number of channels of the codebook feature map.
For example, a plurality of feature codewords may be generated from the low-resolution first feature maps to capture global semantic information. Taking the first feature map with the smallest scale among the M first feature maps as a reference, the scales of the other M-1 feature maps may be reduced to be the same as the scale of that first feature map. For example, if the first feature maps to be processed include the three first feature maps with OS=8, OS=16, and OS=32, the first feature map with OS=32 may be kept unchanged, and the two first feature maps with OS=8 and OS=16 may be scaled down to OS=32. The scale reduction of the first feature maps may be implemented by linear decimation, downsampling, and the like, which is not limited in this disclosure.
In a possible implementation manner, before the scaling down, the number of channels of the M first feature maps may be adjusted so that the number of channels of the M first feature maps is the same (for example, all the channels are adjusted to 512), and then the scaling down is performed; the number of channels of the M first feature maps may also be adjusted after the scale reduction is completed, so that the number of channels of the M first feature maps is the same, which is not limited in this disclosure.
In a possible implementation manner, the scaled-down M first feature maps may be fused. For example, the M first feature maps may be directly connected to obtain the fused third feature map; alternatively, after the M first feature maps are connected, a 1 × 1 convolution may be applied to obtain the fused third feature map. The scale of the fused third feature map is, for example, 1536 × (H/32) × (W/32). The present disclosure does not limit the specific way of feature map fusion or the scale of the third feature map.
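As a rough sketch of the scale reduction and fusion step (PyTorch assumed; bilinear resizing is one possible choice where the embodiments mention linear decimation or downsampling, and the helper name is hypothetical), the M selected maps can be resized to the smallest spatial size, concatenated along the channel dimension, and mixed with a 1 × 1 convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reduce_and_fuse(feature_maps, fuse_conv):
    """feature_maps: list of (B, C_i, H_i, W_i) tensors from different levels."""
    target_hw = min((f.shape[-2], f.shape[-1]) for f in feature_maps)
    resized = [F.interpolate(f, size=target_hw, mode="bilinear",
                             align_corners=False) for f in feature_maps]
    fused = torch.cat(resized, dim=1)     # channel-wise concatenation
    return fuse_conv(fused)               # 1x1 convolution -> third feature map

# example: OS=8/16/32 maps, each with 512 channels, fused into 1536 channels
maps = [torch.randn(1, 512, s, s) for s in (32, 16, 8)]
third = reduce_and_fuse(maps, nn.Conv2d(3 * 512, 1536, kernel_size=1))
print(third.shape)                        # (1, 1536, 8, 8)
```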
In a possible implementation manner, the number of the feature code words to be generated may be set to n, where n is an integer greater than 1. Each feature codeword may be represented in the form of a feature vector, and the length of the feature vector may be a preset length d. d may be, for example, 1024, and the specific value of the preset length d is not limited by the present disclosure.
In a possible implementation manner, the third feature map may be subjected to convolution transformation through two 1 × 1 convolution operations to obtain the codebook feature map of the third feature map and the second weight feature map of the n feature codewords. The codebook (also called the semantic codebook) includes the n feature codewords, and the codebook feature map may represent the basic features of the codebook; the second weight feature map may represent the weights at the respective spatial positions of the feature map. The height and width of the codebook feature map and the second weight feature map are the same as those of the third feature map; the number of channels of the codebook feature map is equal to the vector length d of the feature codewords; and the number of channels of the second weight feature map corresponds to the number n, for example, equal to n or a multiple of n. For example, the scale of the codebook feature map is d × (H/32) × (W/32), and the scale of the second weight feature map is n × (H/32) × (W/32).
In one possible implementation, for any position (x, y) in the codebook feature map, there is a feature vector B(x, y) of length d at that position, that is, the codebook feature map may include (H/32) × (W/32) feature vectors of length d. In the case where the number of channels of the second weight feature map is n, the feature map Ai of the i-th channel of the second weight feature map may represent the weights of the i-th feature codeword, with a scale of, for example, 1 × (H/32) × (W/32), that is, the i-th feature codeword has (H/32) × (W/32) weights. To ensure that the weights of the second weight feature map are properly normalized, each position of each channel's feature map may be normalized by a softmax function, yielding the normalized second weight feature map.
In a possible implementation manner, for the i-th feature codeword, the (H/32) × (W/32) feature vectors at the corresponding positions of the codebook feature map are weighted and summed according to the normalized (H/32) × (W/32) weights, so that the i-th feature codeword, represented as a vector of length d, can be obtained. By processing in this way, n feature codewords, that is, n vectors of length d, can be obtained. The n feature codewords may form a codebook with a scale of n × d.
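A compact sketch of the codebook generation described above (PyTorch assumed; module and variable names are illustrative): two 1 × 1 convolutions produce the codebook feature map and the second weight feature map, the weights are softmax-normalized over the spatial positions, and each codeword is the weighted sum of the spatial feature vectors:

```python
import torch
import torch.nn as nn

class CodebookHead(nn.Module):
    def __init__(self, in_ch, d=1024, n=600):
        super().__init__()
        self.to_codebook_map = nn.Conv2d(in_ch, d, kernel_size=1)  # codebook feature map
        self.to_weight_map = nn.Conv2d(in_ch, n, kernel_size=1)    # second weight feature map

    def forward(self, third):
        B_map = self.to_codebook_map(third)          # (B, d, h, w)
        A_map = self.to_weight_map(third)            # (B, n, h, w)
        b, d, h, w = B_map.shape
        n = A_map.shape[1]
        # softmax over spatial positions so that each codeword's weights sum to 1
        A = torch.softmax(A_map.view(b, n, h * w), dim=-1)
        Bv = B_map.view(b, d, h * w)
        # each codeword is a weighted sum of the h*w feature vectors of length d
        codebook = torch.bmm(A, Bv.transpose(1, 2))  # (B, n, d)
        return codebook, B_map

codebook, codebook_map = CodebookHead(1536)(torch.randn(1, 1536, 8, 8))
print(codebook.shape)                                # (1, 600, 1024)
```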
In this way, n feature codewords can be obtained, realizing clustering of the image feature information; in subsequent processing, the image feature information can be enhanced according to the feature codewords, improving the accuracy of image processing.
In one possible implementation, step S12 further includes:
carrying out scale amplification and fusion on the M first feature maps to obtain a fourth feature map;
and determining a first weight feature map of the n feature code words according to the fourth feature map and the codebook feature map, wherein the height and the width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n.
For example, a weighted feature map of a plurality of feature codewords may be generated from the high-resolution first feature map to preserve structural information of the image.
In one possible implementation manner, taking the first feature map with the largest scale among the M first feature maps as a reference, the scales of the other M-1 feature maps are enlarged to be the same as the scale of that first feature map. For example, if the first feature maps to be processed include the three first feature maps with OS=8, OS=16, and OS=32, the first feature map with OS=8 may be kept unchanged, and the two first feature maps with OS=16 and OS=32 may be scaled up to OS=8. The scale enlargement of the first feature maps may be implemented by linear interpolation, upsampling, and the like, which is not limited in this disclosure.
In a possible implementation manner, before the scale-up, the number of channels of the M first feature maps may be adjusted so that the number of channels of the M first feature maps is the same (for example, all the channels are adjusted to 512), and then the scale-up is performed; the number of channels of the M first feature maps may also be adjusted after the scale enlargement is completed, so that the number of channels of the M first feature maps is the same, which is not limited in this disclosure.
In a possible implementation manner, the scaled-up M first feature maps may be fused. For example, the M first feature maps may be directly connected to obtain the fused fourth feature map; alternatively, after the scaled-up M first feature maps are connected, a 1 × 1 convolution may be applied to obtain the fused fourth feature map. The scale of the fused fourth feature map is, for example, 1024 × (H/8) × (W/8). The present disclosure does not limit the specific way of feature map fusion or the scale of the fourth feature map.
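The counterpart sketch for scale enlargement and fusion simply takes the largest map as the reference and upsamples the others to it before concatenation and 1 × 1 convolution (again an assumption on the resizing operator; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def enlarge_and_fuse(feature_maps, fuse_conv):
    target_hw = max((f.shape[-2], f.shape[-1]) for f in feature_maps)
    resized = [F.interpolate(f, size=target_hw, mode="bilinear",
                             align_corners=False) for f in feature_maps]
    return fuse_conv(torch.cat(resized, dim=1))   # fourth feature map

maps = [torch.randn(1, 512, s, s) for s in (32, 16, 8)]
fourth = enlarge_and_fuse(maps, nn.Conv2d(3 * 512, 1024, kernel_size=1))
print(fourth.shape)                               # (1, 1024, 32, 32)
```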
In one possible implementation manner, based on the fourth feature map and the codebook feature map obtained in the foregoing steps, the weight feature maps (which may also be referred to as assembly coefficients) of the n feature codewords at high resolution (for example, OS=8) may be generated, so as to enhance the image feature information while preserving the structural information of the image.
In a possible implementation manner, the step of determining a first weighted feature map of the n feature codewords according to the fourth feature map and the codebook feature map may include:
performing global pooling on the codebook feature map to obtain a global average vector of the n feature codewords;
and determining the first weighted feature map according to the fourth feature map and the global average vector.
For example, a global pooling (Pool) operation may be performed on the codebook feature map through a global pooling layer to obtain a global average vector of the n feature codewords, where the vector has length d. The global average vector may include the general codeword information of the feature codewords.
In a possible implementation manner, the global average vector may be added to the fourth feature map to obtain a high-resolution feature map to which the codeword information has been added; a 1 × 1 convolution transformation is then performed to obtain the first weight feature map of the n feature codewords. The height and width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n. For example, when the scale of the fourth feature map is OS=8, the scale of the first weight feature map is n × (H/8) × (W/8). Thus, the feature map of each channel of the first weight feature map may represent the weights of one feature codeword, with a scale of, for example, 1 × (H/8) × (W/8), that is, each feature codeword has (H/8) × (W/8) weights.
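A minimal sketch of this weight-prediction step (PyTorch assumed; the projection of the fourth feature map to d channels follows the description of Fig. 3, and the layer names are hypothetical):

```python
import torch
import torch.nn as nn

class WeightHead(nn.Module):
    def __init__(self, in_ch, d=1024, n=600):
        super().__init__()
        self.project = nn.Conv2d(in_ch, d, kernel_size=1)   # fourth feature map -> d channels
        self.to_weights = nn.Conv2d(d, n, kernel_size=1)    # -> first weight feature map

    def forward(self, fourth, codebook_map):
        # global average vector of the codebook feature map, shape (B, d, 1, 1)
        g = codebook_map.mean(dim=(2, 3), keepdim=True)
        x = self.project(fourth) + g                        # broadcast-add the codeword info
        return self.to_weights(x)                           # (B, n, H/8, W/8)

w1 = WeightHead(1024)(torch.randn(1, 1024, 32, 32), torch.randn(1, 1024, 8, 8))
print(w1.shape)                                             # (1, 600, 32, 32)
```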
In this way, the information of the feature code word can be added into the fourth feature map, and the weighted feature map of the feature code word at high resolution is generated, so that the enhancement of the image feature information is realized under the condition of keeping the structure information of the image.
In a possible implementation manner, after the n feature codewords and the first weighted feature map are obtained in step S12, the feature codewords and the first weighted feature map may be further processed in step S13. Wherein, the step S13 may include:
rearranging the first weight feature map to obtain a weight matrix of the n feature codewords;
multiplying the n feature codewords by the weight matrix to obtain a feature matrix of the n feature codewords;
and determining the second feature map according to the feature matrix of the n feature codewords.
For example, the codebook composed of the n feature codewords is in matrix form, and the first weight feature map may be rearranged for the calculation. That is, the first weight feature map is rearranged (also referred to as reshaped) to obtain the weight matrix of the n feature codewords. For example, when the scale of the first weight feature map is n × (H/8) × (W/8), a weight matrix of n × (HW/64) may be obtained by rearrangement.
In a possible implementation manner, a codebook composed of n feature code words and a transpose matrix of a weight matrix may be subjected to matrix multiplication to obtain a feature matrix of the n feature code words. When the codebook scale is n × d, the scale of the feature matrix is, for example, d × (HW/64).
In a possible implementation manner, the feature matrix may be rearranged to obtain a rearranged feature map, and this feature map may be used as the second feature map; alternatively, the rearranged feature map may be further processed, and the processed feature map may be used as the second feature map. The present disclosure is not limited in this respect.
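The reconstruction can be sketched as follows (PyTorch assumed; the orientation of the matrix product is chosen so that the result has the d × (HW) scale stated above, which is an interpretation of the transpose described in the text):

```python
import torch

def reconstruct(codebook, weight_map):
    """codebook: (B, n, d); weight_map: (B, n, H, W) -> fifth feature map (B, d, H, W)."""
    b, n, H, W = weight_map.shape
    weights = weight_map.view(b, n, H * W)               # weight matrix, n x (H*W)
    # every spatial position becomes a linear combination of the n codewords
    feat = torch.bmm(codebook.transpose(1, 2), weights)  # feature matrix, (B, d, H*W)
    return feat.view(b, -1, H, W)                        # rearranged -> fifth feature map

fifth = reconstruct(torch.randn(1, 600, 1024), torch.randn(1, 600, 32, 32))
print(fifth.shape)                                       # (1, 1024, 32, 32)
```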
By the method, the feature map with high resolution and rich semantic information can be reconstructed, and the enhancement of the image feature information is realized.
In a possible implementation manner, the step of determining the second feature map according to the feature matrix of the n feature code words may include:
rearranging the feature matrix of the n feature codewords to obtain a fifth feature map;
and fusing the fifth feature map and the fourth feature map to obtain the second feature map.
That is, the feature matrix may be rearranged to obtain a rearranged feature map (referred to as a fifth feature map); the fifth feature map and the fourth feature map can be fused, and the fused feature map is used as a final second feature map, so that the loss of the structural information of the image is reduced.
In a possible implementation manner, the fifth feature map and the fourth feature map may, for example, be directly connected to obtain the fused feature map; alternatively, after the fifth feature map and the fourth feature map are connected, a 1 × 1 convolution may be applied to obtain the fused feature map, which serves as the second feature map. The scale of the second feature map is, for example, 1024 × (H/8) × (W/8). The present disclosure does not limit the specific way of feature map fusion or the scale of the second feature map.
In this way, the loss of the structural information of the image can be reduced, and the accuracy of the image processing can be further improved.
In one possible implementation, after obtaining the second feature map of the image to be processed, the image may be further processed in step S14. Wherein, the step S14 may include:
performing convolution transformation and up-sampling on the second feature map to obtain a segmentation feature map of the image to be processed, wherein the height and the width of the segmentation feature map are the same as those of the image to be processed;
and segmenting the image to be processed according to the segmentation feature map to obtain the processing result of the image to be processed.
For example, according to the number m of categories of objects in the image (m being an integer greater than 1), the number of channels of the second feature map may be adjusted through a 1 × 1 convolution transformation; the scale of the transformed second feature map is, for example, m × (H/8) × (W/8). The transformed second feature map is then up-sampled to obtain the segmentation feature map of the image to be processed, whose height and width are the same as those of the image to be processed, that is, OS=1, and whose scale is, for example, m × H × W.
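A minimal sketch of this segmentation head (PyTorch assumed; bilinear upsampling and the final argmax are common choices, not mandated by the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment(second, classifier, out_hw):
    logits = classifier(second)                      # (B, m, H/8, W/8)
    logits = F.interpolate(logits, size=out_hw, mode="bilinear",
                           align_corners=False)      # segmentation feature map, (B, m, H, W)
    return logits.argmax(dim=1)                      # per-pixel category map, (B, H, W)

pred = segment(torch.randn(1, 1024, 32, 32), nn.Conv2d(1024, 21, kernel_size=1),
               out_hw=(256, 256))
print(pred.shape)                                    # (1, 256, 256)
```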
In a possible implementation manner, the image to be processed may be segmented according to the segmentation feature map, so as to obtain a segmentation result of the image to be processed. The segmentation result can be used as the processing result of the image; the segmentation result can also be further processed to obtain a processing result of the image. The present disclosure is not so limited.
In one possible implementation, the processing result of the image to be processed may include the location and/or class of the target in the image to be processed. That is, by image processing, the object type (for example, the type of a person, an animal, a vehicle, or the like) of each position in the image is specified.
In this way, a more accurate semantic segmentation result can be obtained.
The image processing method according to the embodiment of the present disclosure may be implemented by an image processing network, which may be a convolutional neural network, including a convolutional layer, a residual layer, a pooling layer, a full connection layer, and the like.
Fig. 2 shows a schematic diagram of an image processing network of an image processing method according to an embodiment of the present disclosure, which may include, as shown in fig. 2: a feature extraction network 21, a codebook generation network 22, a weight prediction network 23, a reconstruction network 24, and a segmentation network 25.
In an example, the feature extraction network 21 may include 5 stages of residual network blocks 211, each stage of residual network blocks 211 having a step size of 2; the image 26 to be processed is input into the feature extraction network 21 for processing, and the residual network blocks 211 of the stages sequentially output 5 first feature maps, whose heights and widths are 1/2, 1/4, 1/8, 1/16, and 1/32 of those of the image to be processed and may be denoted as OS=2, OS=4, OS=8, OS=16, and OS=32, respectively. Assuming that the image 26 to be processed has height H × width W, the feature map with OS=8 has height (H/8) × width (W/8).
In the example, the three first feature maps with OS=8, OS=16, and OS=32 may be selected and input into the codebook generation network 22 and the weight prediction network 23, respectively, and processed to generate n feature codewords (not shown) and the first weight feature maps (not shown) of the n feature codewords.
In an example, the n feature codewords and the first weight feature map may be input into the reconstruction network 24 to obtain a second feature map (not shown) with high resolution and enhanced features, whose scale is OS=8; the second feature map is then input into the segmentation network 25 for segmentation to obtain a segmentation result 27 of the image 26 to be processed.
Fig. 3 is a schematic diagram illustrating a processing procedure of an image processing method according to an embodiment of the present disclosure. As shown in fig. 3, the scales of the three first feature maps 311, 312, and 313 to be processed are OS=8, OS=16, and OS=32, respectively; the codebook generation network performs scale reduction and fusion 32 on the three first feature maps 311, 312, and 313: they are adjusted to three feature maps 321, 322, and 323 with scale OS=32 and fused to generate a third feature map 324 (with scale OS=32).
In an example, the codebook generation step 33 may be performed by the codebook generation network according to the third feature map 324. The third feature map 324 is subjected to convolution transformation through two 1 × 1 convolution operations to obtain a codebook feature map 331 (with a scale of d × (H/32) × (W/32)) and a second weight feature map 332 (with a scale of n × (H/32) × (W/32)); according to the normalized weights in the second weight feature map 332, the feature vectors in the codebook feature map are weighted and summed to obtain n feature codewords, which form the codebook 333, thereby completing the codebook generation process.
In the example, the weight prediction network first performs scale enlargement and fusion 34 on the three first feature maps 311, 312, and 313: they are adjusted to three feature maps 341, 342, and 343 with scale OS=8 and fused to generate a fourth feature map 344 (with scale OS=8).
In an example, the weight prediction step 35 may be performed by the weight prediction network according to the fourth feature map 344. The fourth feature map 344 is transformed by a 1 × 1 convolution operation to obtain a feature map 351 with a scale of d × (H/8) × (W/8); the codebook feature map 331 is globally pooled to obtain a global average vector; the global average vector is added to the transformed feature map 351, and a convolution transformation is performed to obtain a first weight feature map 352 with a scale of n × (H/8) × (W/8), thereby completing the weight prediction process.
In an example, a fifth feature map 361 can be reconstructed according to the n feature codewords 333 and the first weight feature map 352; the fifth feature map 361 and the fourth feature map 344 are then fused, and the fused feature map is used as the final second feature map 362, completing the whole image feature enhancement process.
In an example, after obtaining the second feature map of the image to be processed, image segmentation may be performed according to the second feature map, so as to obtain a processing result (not shown) of the image to be processed.
In a possible implementation manner, before deploying the image processing network, the image processing network may be trained, and the image processing method according to an embodiment of the present disclosure further includes:
and training the image processing network according to a preset training set, wherein the training set comprises a plurality of sample images and labeling information of the sample images.
For example, sample images in the training set may be input into the image processing network for processing to obtain sample processing results of the sample images; the loss of the image processing network is determined according to the difference between the sample processing results and the annotation information; the network parameters of the image processing network are adjusted in the reverse direction according to the loss; and after multiple iterations, when a training condition (such as network convergence) is satisfied, the trained image processing network is obtained. In this way, the training process of the image processing network can be achieved.
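A minimal training-loop sketch under common semantic-segmentation assumptions (PyTorch, pixel-wise cross-entropy against the annotation maps, SGD; the dataset loader and network objects are placeholders not defined in the text):

```python
import torch
import torch.nn as nn

def train(network, loader, epochs=100, lr=0.01):
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()                # pixel-wise classification loss
    for _ in range(epochs):
        for images, labels in loader:                # labels: (B, H, W) class indices
            logits = network(images)                 # (B, m, H, W) segmentation feature map
            loss = criterion(logits, labels)         # difference from the annotation
            optimizer.zero_grad()
            loss.backward()                          # back-propagate the loss
            optimizer.step()                         # adjust network parameters
    return network
```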
According to the image processing method of the embodiments of the present disclosure, the feature maps output by different levels of the feature extraction network can be utilized: a plurality of feature codewords are generated from the low-resolution feature maps to capture global semantic information, and the weight feature maps of the plurality of feature codewords are generated from the high-resolution first feature maps to retain the structural information of the image. In this way, a feature-enhanced high-resolution feature map can be obtained with essentially no increase in computational complexity, and the accuracy of semantic segmentation can be significantly improved; the method balances processing speed and processing accuracy, realizing fast, high-precision, low-power-consumption processing.
Compared with the related-art processing manner based on a residual network with dilated (hole) convolution, the image processing method according to the embodiments of the present disclosure removes the computationally heavy dilated convolutions and can achieve a speed improvement of at least 3 times with essentially no loss of semantic segmentation accuracy.
Compared with the related-art processing manner based on an encoder-decoder, the image processing method according to the embodiments of the present disclosure effectively reconstructs a high-resolution semantic feature map from multi-level semantic feature maps through a globally guided decoder, and can significantly improve the semantic segmentation accuracy with essentially no increase in computational complexity.
The image processing method can be applied to fields such as intelligent video analysis, intelligent medical treatment, and automatic driving, and improves the accuracy of target recognition in images. For example, the method can be applied to intelligent perception tasks in automatic driving scenarios, recognizing and segmenting target objects such as vehicles, pedestrians, and lane lines in road scenes, thereby realizing intelligent perception of the driving environment. As another example, the method can be applied to intelligent medical scenarios, performing preliminary screening and intelligent extraction of the contours of targets such as lesions in medical images, thereby assisting doctors and improving their processing efficiency.
In an example, the method can be applied to image detection and recognition tasks: it effectively improves unreasonable feature distributions in the semantic feature map and can thereby improve detection and recognition performance.
In an example, the method can be applied to intelligent editing tasks for images and videos, automatically recognizing different objects in an image and then applying different image processing flows to the different objects. For example, in the portrait mode of a smart phone, the background behind the person needs to be blurred to achieve a bokeh effect; the method can identify the portrait region in the image and blur the positions outside the portrait region. As another example, in many video playback scenarios, the people in the picture need to be intelligently edited; through the method, the people in the image can be recognized and subjected to a semantic segmentation operation.
It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; due to space limitations, details are not repeated in this disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the image processing methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here for brevity.
Fig. 4 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure, which includes, as shown in fig. 4:
the feature extraction module 41 is configured to perform feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, where scales of the first feature maps in the N first feature maps are different, N is an integer and N is greater than or equal to 2;
a codeword and weight generating module 42, configured to generate n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps in the N first feature maps, where M is an integer greater than or equal to 2 and less than or equal to N, and n is an integer greater than 1;
a feature map determining module 43, configured to determine a second feature map of the to-be-processed image according to the n feature codewords and the first weighted feature map;
and the result determining module 44 is configured to determine a processing result of the image to be processed according to the second feature map.
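The following minimal PyTorch-style sketch illustrates how the four modules 41-44 could be composed; the class name ImageProcessor and the four callables passed to it are hypothetical placeholders introduced for illustration, not implementations disclosed herein.

    import torch.nn as nn

    class ImageProcessor(nn.Module):
        # Illustrative composition of modules 41-44; the four sub-modules are
        # assumed to be provided by the caller.
        def __init__(self, backbone, codeword_head, reconstruct, segment_head):
            super().__init__()
            self.backbone = backbone            # module 41: N multi-scale feature maps
            self.codeword_head = codeword_head  # module 42: codewords and weight maps
            self.reconstruct = reconstruct      # module 43: second feature map
            self.segment_head = segment_head    # module 44: processing result

        def forward(self, image):
            first_maps = self.backbone(image)
            codewords, first_weights = self.codeword_head(first_maps)
            second_map = self.reconstruct(codewords, first_weights)
            return self.segment_head(second_map)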
In a possible implementation manner, the codeword and weight generating module includes: a scale reduction and fusion submodule, configured to perform scale reduction and fusion on the M first feature maps to obtain a third feature map; a transformation submodule, configured to perform convolution transformations on the third feature map to obtain a codebook feature map of the third feature map and second weight feature maps of the n feature codewords, where the heights and widths of the codebook feature map and the second weight feature maps are the same as those of the third feature map, and the number of channels of the second weight feature maps corresponds to the number n; and a codeword determining submodule, configured to determine the n feature codewords according to the codebook feature map and the second weight feature maps, where each feature codeword includes a feature vector whose length is equal to the number of channels of the codebook feature map.
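A minimal PyTorch sketch of this implementation is given below. It assumes the M first feature maps are ordered from largest to smallest scale, and the softmax-weighted spatial pooling used to derive the n codewords from the codebook feature map is an assumption introduced for illustration, not taken from the disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CodewordGenerator(nn.Module):
        # Sketch of the scale-reduction/fusion, transformation and codeword
        # determining sub-modules described above.
        def __init__(self, concat_channels, codebook_channels, n_codewords):
            super().__init__()
            self.fuse = nn.Conv2d(concat_channels, codebook_channels, 1)
            self.to_codebook = nn.Conv2d(codebook_channels, codebook_channels, 1)
            self.to_second_weights = nn.Conv2d(codebook_channels, n_codewords, 1)

        def forward(self, maps):
            # Scale reduction and fusion: resize the M maps to the smallest
            # resolution, concatenate and fuse into the third feature map.
            h, w = maps[-1].shape[-2:]
            reduced = [F.adaptive_avg_pool2d(m, (h, w)) for m in maps]
            third = self.fuse(torch.cat(reduced, dim=1))
            codebook = self.to_codebook(third)               # (B, C, h, w)
            second_weights = self.to_second_weights(third)   # (B, n, h, w)
            # n feature codewords: weighted spatial aggregation of the codebook
            # feature map, each codeword a C-dimensional vector (assumption).
            attn = F.softmax(second_weights.flatten(2), dim=-1)                # (B, n, h*w)
            codewords = torch.bmm(attn, codebook.flatten(2).transpose(1, 2))   # (B, n, C)
            return codewords, codebook, second_weights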
In a possible implementation manner, the codeword and weight generating module includes: the scale amplification and fusion submodule is used for carrying out scale amplification and fusion on the M first feature maps to obtain a fourth feature map; and the weight map determining submodule is used for determining a first weight feature map of the n feature code words according to the fourth feature map and the codebook feature map, wherein the height and the width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n.
In one possible implementation manner, the weight map determining sub-module is configured to: performing global pooling on the codebook feature map to obtain a global average vector of the n feature codewords; and determining the first weighted feature map according to the fourth feature map and the global average vector.
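The sketch below illustrates one possible reading of the two implementations above. It assumes the M first feature maps are ordered from largest to smallest scale, and conditioning the fourth feature map on the global average vector by broadcast addition (which requires the fused map and the codebook feature map to share the same channel count) is an assumption introduced for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FirstWeightMapGenerator(nn.Module):
        # Sketch of the scale-amplification/fusion and weight-map determining
        # sub-modules described above.
        def __init__(self, concat_channels, codebook_channels, n_codewords):
            super().__init__()
            self.fuse = nn.Conv2d(concat_channels, codebook_channels, 1)
            self.to_first_weights = nn.Conv2d(codebook_channels, n_codewords, 1)

        def forward(self, maps, codebook):
            # Scale amplification and fusion: upsample the M maps to the largest
            # resolution, concatenate and fuse into the fourth feature map.
            h, w = maps[0].shape[-2:]
            enlarged = [F.interpolate(m, size=(h, w), mode='bilinear',
                                      align_corners=False) for m in maps]
            fourth = self.fuse(torch.cat(enlarged, dim=1))     # (B, C, H, W)
            # Global pooling of the codebook feature map -> global average vector.
            global_vec = F.adaptive_avg_pool2d(codebook, 1)    # (B, C, 1, 1)
            # Assumed conditioning: broadcast-add, then project to n channels.
            first_weights = self.to_first_weights(fourth + global_vec)  # (B, n, H, W)
            return first_weights, fourth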
In one possible implementation, the feature map determining module includes: a rearrangement submodule, configured to rearrange the first weight feature map to obtain a weight matrix of the n feature codewords; a multiplication submodule, configured to multiply the n feature codewords by the weight matrix to obtain a feature matrix of the n feature codewords; and a feature map determining submodule, configured to determine the second feature map according to the feature matrix of the n feature codewords.
In a possible implementation manner, the feature map determining submodule is configured to: rearrange the feature matrix of the n feature codewords to obtain a fifth feature map; and fuse the fifth feature map and the fourth feature map to obtain the second feature map.
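Together, the rearrangement, multiplication and feature map determining sub-modules can be read as the following minimal sketch; the softmax normalization over the n codewords and the additive fusion with the fourth feature map are assumptions introduced for illustration.

    import torch
    import torch.nn.functional as F

    def reconstruct_second_map(codewords, first_weights, fourth):
        # codewords: (B, n, C); first_weights: (B, n, H, W); fourth: (B, C, H, W).
        b, n, h, w = first_weights.shape
        # Rearrange the first weight feature map into a (B, H*W, n) weight matrix.
        weight_matrix = F.softmax(first_weights.flatten(2), dim=1).transpose(1, 2)
        # Multiply by the codewords to obtain the (B, H*W, C) feature matrix.
        feature_matrix = torch.bmm(weight_matrix, codewords)
        # Rearrange into the fifth feature map and fuse it with the fourth one.
        fifth = feature_matrix.transpose(1, 2).reshape(b, -1, h, w)
        return fourth + fifth

In this reading, every spatial position of the second feature map is a weighted combination of the n codewords, so high-resolution semantics are rebuilt from a small, globally derived codebook.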
In one possible implementation, the result determining module includes: a transformation and up-sampling submodule, configured to perform convolution transformation and up-sampling on the second feature map to obtain a segmentation feature map of the image to be processed, where the height and width of the segmentation feature map are the same as those of the image to be processed; and a segmentation submodule, configured to segment the image to be processed according to the segmentation feature map to obtain a processing result of the image to be processed.
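A minimal sketch of such a segmentation head is shown below; the 1x1 classifier, bilinear up-sampling and per-pixel argmax are assumptions about how the two sub-modules could be realized, and num_classes is an illustrative parameter.

    import torch.nn as nn
    import torch.nn.functional as F

    class SegmentationHead(nn.Module):
        # Sketch of the transformation/up-sampling and segmentation sub-modules.
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.classifier = nn.Conv2d(in_channels, num_classes, 1)

        def forward(self, second_map, image_size):
            logits = self.classifier(second_map)
            # Up-sample so the segmentation feature map matches the input image.
            logits = F.interpolate(logits, size=image_size,
                                   mode='bilinear', align_corners=False)
            # Per-pixel class labels as the processing result.
            return logits.argmax(dim=1)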
In one possible implementation, the processing result of the image to be processed includes a position and/or a category of an object in the image to be processed.
In one possible implementation, the number n of feature code words is greater than the number of categories of the target.
In some embodiments, functions of, or modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The embodiments of the present disclosure also provide a computer program product, which includes computer readable code, and when the computer readable code runs on a device, a processor in the device executes instructions for implementing the image processing method provided in any one of the above embodiments.
The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the image processing method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 5 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or other such terminal.
Referring to fig. 5, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 6 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 6, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), FreeBSD(TM), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. An image processing method, comprising:
performing feature extraction on an image to be processed to obtain N first feature maps of the image to be processed, wherein the scales of the first feature maps in the N first feature maps are different, N is an integer and is not less than 2;
generating n feature code words and first weight feature maps of the n feature code words according to M first feature maps in the N first feature maps, wherein M is an integer greater than or equal to 2 and less than or equal to N, and n is an integer greater than 1;
determining a second feature map of the image to be processed according to the n feature code words and the first weighted feature map;
and determining a processing result of the image to be processed according to the second feature map.
2. The method of claim 1, wherein generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps of the N first feature maps comprises:
carrying out scale reduction and fusion on the M first feature maps to obtain a third feature map;
performing convolution transformation on the third feature maps respectively to obtain codebook feature maps of the third feature maps and second weight feature maps of the n feature codewords, wherein the heights and the widths of the codebook feature maps and the second weight feature maps are the same as those of the third feature maps, and the number of channels of the second weight feature maps corresponds to the number n;
and determining the n feature code words according to the codebook feature map and the second weight feature maps, wherein each feature code word comprises a feature vector, and the length of the feature vector is equal to the number of channels of the codebook feature map.
3. The method of claim 2, wherein generating n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps of the N first feature maps comprises:
carrying out scale amplification and fusion on the M first feature maps to obtain a fourth feature map;
and determining a first weight feature map of the n feature code words according to the fourth feature map and the codebook feature map, wherein the height and the width of the first weight feature map are the same as those of the fourth feature map, and the number of channels of the first weight feature map corresponds to the number n.
4. The method of claim 3, wherein determining the first weighted feature map of the n feature codewords according to the fourth feature map and the codebook feature map comprises:
performing global pooling on the codebook feature map to obtain a global average vector of the n feature codewords;
and determining the first weighted feature map according to the fourth feature map and the global average vector.
5. The method according to claim 3 or 4, wherein determining the second feature map of the image to be processed according to the n feature code words and the first weighted feature map comprises:
rearranging the first weight characteristic graph to obtain a weight matrix of the n characteristic code words;
multiplying the n characteristic code words by the weight matrix to obtain a characteristic matrix of the n characteristic code words;
and determining the second characteristic diagram according to the characteristic matrix of the n characteristic code words.
6. The method of claim 5, wherein determining the second profile from the profile matrix of the n profile codewords comprises:
rearranging the feature matrixes of the n feature code words to obtain a fifth feature map;
and fusing the fifth characteristic diagram and the fourth characteristic diagram to obtain the second characteristic diagram.
7. The method according to any one of claims 1 to 6, wherein determining the processing result of the image to be processed according to the second feature map comprises:
performing convolution transformation and up-sampling on the second feature map to obtain a segmentation feature map of the image to be processed, wherein the height and the width of the segmentation feature map are the same as those of the image to be processed;
and segmenting the image to be processed according to the segmentation characteristic graph to obtain a processing result of the image to be processed.
8. The method according to any one of claims 1 to 7, wherein the processing result of the image to be processed comprises a position and/or a category of an object in the image to be processed.
9. The method of claim 8, wherein the number n of signature code words is greater than the number of classes of the target.
10. An image processing apparatus characterized by comprising:
the image processing device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is used for extracting features of an image to be processed to obtain N first feature maps of the image to be processed, the scales of the first feature maps in the N first feature maps are different, N is an integer and is more than or equal to 2;
a codeword and weight generating module, configured to generate n feature codewords and first weight feature maps of the n feature codewords according to M first feature maps in the N first feature maps, wherein M is an integer greater than or equal to 2 and less than or equal to N, and n is an integer greater than 1;
a feature map determining module, configured to determine a second feature map of the to-be-processed image according to the n feature codewords and the first weighted feature map;
and the result determining module is used for determining the processing result of the image to be processed according to the second feature map.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 9.
12. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 9.
CN202010801553.2A 2020-08-10 2020-08-11 Image processing method and device, electronic equipment and storage medium Withdrawn CN111931781A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010796410 2020-08-10
CN2020107964107 2020-08-10

Publications (1)

Publication Number Publication Date
CN111931781A true CN111931781A (en) 2020-11-13

Family

ID=73311523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010801553.2A Withdrawn CN111931781A (en) 2020-08-10 2020-08-11 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111931781A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113384872A (en) * 2021-05-12 2021-09-14 网易(杭州)网络有限公司 Method and device for processing information resource in micro-terminal, electronic equipment and storage medium
CN117217627A (en) * 2023-11-09 2023-12-12 宁德市天铭新能源汽车配件有限公司 Machine learning-based automobile part production quality optimization method and system
CN117217627B (en) * 2023-11-09 2024-02-06 宁德市天铭新能源汽车配件有限公司 Machine learning-based automobile part production quality optimization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201113)