CN115147635A - Image processing method and related device - Google Patents

Image processing method and related device

Info

Publication number
CN115147635A
Authority
CN
China
Prior art keywords
feature map
image
region
area
feature
Prior art date
Legal status
Pending
Application number
CN202110341809.0A
Other languages
Chinese (zh)
Inventor
王聪
韩昊成
曹琛
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110341809.0A
Publication of CN115147635A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 - Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method applied to the field of artificial intelligence, comprising: processing a first image through a first network to obtain a plurality of feature maps with different resolutions; searching, through a second network and based on a second image, in a low-resolution first feature map to obtain a first region in the first feature map, wherein the first region is used for changing the classification result of the first feature map; acquiring a mapping region of the first region in a second feature map; searching the mapping region through the second network based on the second image to obtain a second region in the higher-resolution second feature map; and obtaining, according to the second region, a region corresponding to the second region in the first image. According to the scheme, regions are first searched on the low-resolution feature map and then searched only within the corresponding region on the high-resolution feature map, which effectively reduces the area to be searched on the high-resolution feature map, improves efficiency, and reduces the time needed to output the result.

Description

Image processing method and related device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method and a related apparatus.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
In explainable artificial intelligence (XAI), counterfactual interpretation is a form of interpretation that is highly consistent with human cognition. A counterfactual interpretation describes the minimum change that would cause an event to have a different outcome. Based on the counterfactual interpretation, it can be known how an instance would have to change in order to alter its prediction.
Applying counterfactual interpretation to classification problems in the field of computer vision, counterfactual images or feature maps can be generated by perturbing the original image, such that the category of the generated image changes. That is, if the generated counterfactual image or feature map is used as the input image, the classification result of the input image may become a category desired by the user, which is different from the category of the original image.
At present, the counterfactual interpretation result is usually obtained as follows: partial regions in the input image are replaced, one by one, with regions in a comparison image, and it is deduced which replacement changes the classification result of the composite image, so that a composite image with a changed classification result is finally obtained. However, this method needs to replace regions one by one on an image with a relatively high resolution, so the processing efficiency is low and it often takes a long time to output the result.
Disclosure of Invention
The application provides an image processing method, which includes: obtaining a plurality of feature maps with different resolutions for an input image; searching, on a lower-resolution feature map, for a key region that changes the classification result of the feature map; determining, on a higher-resolution feature map, the mapping region of the key region found on the lower-resolution feature map; and further searching for the key region within the mapping region, so as to finally obtain the region of the image that needs to be replaced. By searching for the key region on the low-resolution feature map and then searching only within the corresponding region on the high-resolution feature map, the area to be searched on the high-resolution feature map is effectively reduced, the efficiency is improved, and the time for outputting the result is reduced.
A first aspect of the present application provides an image processing method, including: firstly, processing a first image through a first network to obtain a plurality of feature maps with different resolutions, where the plurality of feature maps include a first feature map and a second feature map, and the resolution of the second feature map is greater than that of the first feature map. The first network may include a plurality of feature extraction layers connected in sequence. After the first image is input into the first network, it is processed by the plurality of feature extraction layers, which respectively output feature maps with different resolutions. For example, for the first feature extraction layer in the first network, the input is the first image, and the first feature extraction layer outputs a feature map after performing feature extraction on the first image. For any feature extraction layer other than the first one, the input is the feature map output by the preceding feature extraction layer; the layer processes the input feature map and outputs a processed feature map, which in turn serves as the input of the following feature extraction layer. By analogy, the feature extraction layers in the first network respectively output feature maps with different resolutions.
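As an illustration only, the following sketch shows how intermediate feature maps of decreasing resolution might be collected from a convolutional backbone. The PyTorch module, layer sizes and names are assumptions for illustration and do not correspond to any specific network disclosed here.

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Toy stand-in for the 'first network': a stack of convolutional layers
    whose outputs (feature maps of decreasing resolution) are all returned."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 1/2 resolution
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 1/4 resolution
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 1/8 resolution
        ])

    def forward(self, x):
        feature_maps = []
        for stage in self.stages:
            x = torch.relu(stage(x))
            feature_maps.append(x)
        return feature_maps  # ordered from higher to lower resolution

first_image = torch.randn(1, 3, 224, 224)
maps = FirstNetwork()(first_image)
# maps[-1] plays the role of the first (lowest-resolution) feature map,
# maps[-2] the second feature map, whose resolution is the closest higher one.
```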
Then, based on the second image, the first feature map is searched through a second network to obtain a first region in the first feature map. The second network may specifically include a classification network, which can predict the category to which a feature map belongs and output a classification result. The first image and the second image have different classification results. The first region is used for changing the classification result of the first feature map. After the first region in the first feature map is replaced with the corresponding region in the second image, a third image (namely the first feature map after the first region has been replaced) is obtained, and the classification result of the third image is the same as that of the second image. The first region can be understood as a key region of the first feature map. A key region of a feature map refers to a region capable of influencing the classification result of the feature map: when the key region of the feature map is replaced with a region of a specific image, the classification result of the feature map after replacement is the same as the classification result of that specific image.
Searching in the first feature map means: regions of the first feature map are replaced, one by one, with the corresponding regions of the second image, and the search determines which replaced region increases the probability that the classification network predicts the first feature map (with that region replaced) as the category to which the second image belongs.
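A minimal sketch of such a cell-by-cell replacement search is given below. It assumes the classification network is exposed as a function returning class probabilities; the function and variable names are illustrative only.

```python
def search_regions(feat, cf_feat, classify_prob, target_class, cells):
    """Replace candidate cells of `feat` with the corresponding cells of the
    counterfactual feature map `cf_feat`, and keep those replacements that
    raise the probability of `target_class` (the second image's category).

    feat, cf_feat : tensors of shape (C, H, W) with identical resolution
    classify_prob : callable mapping a feature map to class probabilities
    cells         : iterable of (y0, y1, x0, x1) candidate regions
    """
    base_prob = classify_prob(feat)[target_class]
    found = []
    for (y0, y1, x0, x1) in cells:
        trial = feat.clone()
        trial[:, y0:y1, x0:x1] = cf_feat[:, y0:y1, x0:x1]
        gain = classify_prob(trial)[target_class] - base_prob
        if gain > 0:
            found.append(((y0, y1, x0, x1), gain))
    # regions with the largest gain change the classification result the most
    return sorted(found, key=lambda item: item[1], reverse=True)
```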
Secondly, a mapping region of the first region in the second feature map is obtained; the content of the first region is related to the mapping region. In short, the feature extraction layer in the first network that outputs the second feature map passes the second feature map to the feature extraction layer that outputs the first feature map. The feature extraction layer that outputs the first feature map processes the second feature map and outputs the first feature map. The first region in the first feature map is thus obtained by that feature extraction layer processing the mapping region in the second feature map.
The mapping region is then searched through the second network based on the second image to obtain a second region in the second feature map. The second region may include one or more regions. The second region in the second feature map is replaced with the corresponding region in the second image to obtain a second feature map with the second region replaced. The classification result of the second feature map after the replacement is the same as the classification result of the second image. That is, the second region is a key region of the second feature map.
Finally, a region corresponding to the second region in the first image is obtained according to the second region, so that the first image marked with the corresponding region can be output.
In this embodiment, a plurality of feature maps with different resolutions of an input image are obtained; a key region that changes the classification result is searched for on a lower-resolution feature map; the mapping region of that key region on a higher-resolution feature map is determined; and the key region is further searched for within the mapping region, so that the region of the image that needs to be replaced is finally obtained. By searching for the key region on the low-resolution feature map and then searching only within the corresponding region on the high-resolution feature map, the area to be searched on the high-resolution feature map is effectively reduced, the efficiency is improved, and the time for outputting the result is reduced.
In one possible implementation, the first feature map is the feature map with the lowest resolution among the plurality of feature maps. The number of regions to be searched in the first feature map is therefore the smallest among the plurality of feature maps, so the region in the first feature map can be found in less time than in any of the other feature maps.
In one possible implementation, the resolution of the feature maps output by the plurality of feature extraction layers in the first network gradually decreases from the first feature extraction layer to the last feature extraction layer. The second feature map is the feature map, among the plurality of feature maps, whose resolution is closest to that of the first feature map. That is, the feature extraction layer that outputs the second feature map and the feature extraction layer that outputs the first feature map may be two adjacent feature extraction layers. For example, when the first feature map is the feature map with the lowest resolution among the plurality of feature maps, the feature extraction layer that outputs the first feature map is the last feature extraction layer of the first network, and the feature extraction layer that outputs the second feature map is the second-to-last feature extraction layer of the first network.
In a possible implementation manner, the resolution of the second feature map is a first preset resolution, that is, the second feature map is extracted by a feature extraction layer capable of extracting the feature map with the first preset resolution. Optionally, the resolution of the second feature map may also be within a preset resolution interval. Or, the second feature map is a feature map output by a target feature extraction layer, where the target feature extraction layer is a feature extraction layer with an output resolution closest to the first preset resolution among the plurality of feature extraction layers of the first network.
In practical application, a user may adjust the first preset resolution according to actual needs. For example, when the user wishes to obtain a more accurate output result, the user may adjust the first preset resolution to a larger value; when the user wishes to obtain the output result faster, the user may adjust the first preset resolution to a smaller value.
In one possible implementation, one or more feature maps of the plurality of feature maps lie between the second feature map and the first feature map, and the resolution of the one or more feature maps is greater than the resolution of the first feature map and less than the resolution of the second feature map. That is, in the first network, the feature extraction layer that outputs the second feature map and the feature extraction layer that outputs the first feature map are not adjacent, and one or more feature extraction layers lie between them.
In one possible implementation, the obtaining a mapping region of the first region in the second feature map includes: acquiring the convolution kernel size and stride of the feature extraction layer that extracts the first feature map; and determining the mapping region in the second feature map according to the first region and the convolution kernel size and stride of that feature extraction layer. That is, the mapping region is actually the receptive field of the first region in the second feature map. The receptive field can be derived from the first region and the convolution kernel size and stride of the feature extraction layer that extracts the first feature map.
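The following sketch illustrates one way of computing such a mapping region for a single convolutional layer with a given kernel size, stride and padding. It is a simplification (single layer, square kernel) and all names are illustrative.

```python
def mapping_region(region, kernel_size, stride, padding=0):
    """Map a region (y0, y1, x0, x1) of the first feature map back to its
    receptive field in the second feature map (the layer's input).
    Coordinates are half-open cell indices."""
    y0, y1, x0, x1 = region
    in_y0 = y0 * stride - padding
    in_x0 = x0 * stride - padding
    in_y1 = (y1 - 1) * stride - padding + kernel_size
    in_x1 = (x1 - 1) * stride - padding + kernel_size
    return (max(in_y0, 0), in_y1, max(in_x0, 0), in_x1)

# a 2x2 cell block in the low-resolution map, kernel 3, stride 2, padding 1:
print(mapping_region((1, 3, 1, 3), kernel_size=3, stride=2, padding=1))  # (1, 6, 1, 6)
```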
In one possible implementation, in order to improve the accuracy of the output result as much as possible, the region search may be performed feature map by feature map over the obtained feature maps, where the search range in each feature map is the mapping region of the region already found in the previous feature map. The first feature map and the second feature map are two of the feature maps on which the region search is performed.
Specifically, the method further includes: repeatedly executing the region searching step until a termination condition is met, to obtain a target region; and outputting the region corresponding to the target region in the first image. The region searching step includes: acquiring a mapping region, in the (i + 1) th feature map, of the region in the ith feature map, where the region in the ith feature map is used for changing the classification result of the ith feature map; searching, through the second network and based on the second image, in the mapping region in the (i + 1) th feature map to obtain a region in the (i + 1) th feature map, where the region in the (i + 1) th feature map is used for changing the classification result of the (i + 1) th feature map; and incrementing the value of i by 1. The target region is the region in the (i + 1) th feature map when the termination condition is met, and the resolution of the ith feature map is smaller than that of the (i + 1) th feature map. Further, i is greater than or equal to 2, and when i is equal to 2, the ith feature map is the second feature map.
In short, after a key region of a feature map is found by searching, the corresponding mapping region is determined in the feature map extracted by the preceding feature extraction layer, and the key region is searched for again within that mapping region, until the termination condition is met.
In one possible implementation, the termination condition includes that the resolution of the (i + 1) th feature map reaches a second preset resolution. It should be understood that the higher the resolution of the feature map used for the region search, the less information each unit area of the feature map contains and the more precise the searched region is, but the larger the area that has to be searched and the more time the search takes. Therefore, in practical applications, the accuracy of the result and the computational efficiency can be balanced according to the requirements of the user, so as to determine the resolution of the last feature map to be searched.
In one possible implementation manner, the termination condition includes that an area ratio between a region in the (i + 1) th feature map and a mapping region in the (i + 1) th feature map is greater than a preset threshold. It should be understood that, in the area searching process, after searching the area of the ith feature map, the process of continuing to search the area in the mapping area of the (i + 1) th feature map is actually to further narrow the area to obtain a more accurate result. When the area ratio between the region in the (i + 1) th feature map and the mapping region in the (i + 1) th feature map is found to be greater than the preset threshold value through the search, it may be considered that the region cannot be reduced by continuing the region search in the mapping region in the (i + 1) th feature map, and thus the region search may be terminated.
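Putting these steps together, a layer-by-layer search under the two termination conditions might look like the sketch below. It reuses the illustrative helpers `search_regions` and `mapping_region` sketched earlier; the feature-map ordering, thresholds and the choice to keep only a single best cell per level are assumptions for brevity.

```python
def coarse_to_fine_search(feature_maps, cf_maps, classify_prob, target_class,
                          layer_params, stop_resolution=56, area_ratio_limit=0.8):
    """feature_maps / cf_maps are (C, H, W) tensors ordered from lowest to
    highest resolution; layer_params[i] = (kernel_size, stride, padding) of
    the layer that produced feature_maps[i] from feature_maps[i + 1]."""
    _, h, w = feature_maps[0].shape
    cells = [(y, y + 1, x, x + 1) for y in range(h) for x in range(w)]
    best_region, _ = search_regions(feature_maps[0], cf_maps[0],
                                    classify_prob, target_class, cells)[0]
    for i in range(1, len(feature_maps)):
        k, s, p = layer_params[i - 1]
        my0, my1, mx0, mx1 = mapping_region(best_region, k, s, p)
        _, hh, ww = feature_maps[i].shape
        cells = [(y, y + 1, x, x + 1)
                 for y in range(my0, min(my1, hh))
                 for x in range(mx0, min(mx1, ww))]
        # keep only the single best cell here for brevity; in general the
        # region found at each level may consist of several such cells
        best_region, _ = search_regions(feature_maps[i], cf_maps[i],
                                        classify_prob, target_class, cells)[0]
        # termination condition 1: the current feature map reached the preset resolution
        if feature_maps[i].shape[-1] >= stop_resolution:
            break
        # termination condition 2: the found region no longer shrinks
        # relative to the mapped search range
        region_area = (best_region[1] - best_region[0]) * (best_region[3] - best_region[2])
        mapped_area = (my1 - my0) * (mx1 - mx0)
        if region_area / mapped_area > area_ratio_limit:
            break
    return best_region
```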
In one possible implementation manner, the searching in the first feature map through the second network based on the second image to obtain the first region in the first feature map includes: and processing the second image through the first network to obtain a third feature map, wherein the resolution of the third feature map is the same as that of the first feature map. And replacing a partial region in the first feature map with a partial region in the third feature map, so that the classification result of the first feature map after replacing the partial region is the same as the classification result of the second image. And determining the replaced area in the first feature map after the partial area is replaced as the first area.
That is, the corresponding region in the second image may refer to a partial region in the third feature map corresponding to the second image, and the resolution of the third feature map corresponding to the second image is the same as the resolution of the first feature map.
In one possible implementation manner, the performing, based on the second image, a region search in the first feature map through a second network to obtain a first region in the first feature map includes: acquiring a non-background area in the first feature map, wherein the non-background area is an area except for a background area in the first feature map; searching for a non-background region in the first feature map through the second network based on the second image. In this application, the non-background region in the image may refer to a region where an object to be classified is located in the image, and the background region in the image may refer to a region other than the region where the object to be classified is located. For example, assuming that the target to be classified in the first feature map is a bird, the non-background region in the first feature map is a region in which the bird is located in the first feature map. The background region in the first feature map is a region outside the region where the bird is located in the first feature map, for example, a region where a scene such as sky, grass, or forest is located. For another example, assuming that the object to be classified in the first feature map is a car, the non-background region in the first feature map is a region where the car in the first feature map is located. The background area in the first feature map is an area outside the area of the vehicle in the first feature map, for example, an area where an object such as a road or a yard is located.
In the scheme, the background area in the input image is identified through an algorithm before the search of the area is executed, so that the background area in the input image is excluded from the search range, and the search range of the area is reduced.
In one possible implementation, the searching in the first feature map through the second network based on the second image includes: acquiring a marked region in the first image, where the marked region may be a region marked by a user and represents a region in which no key-region search needs to be performed; and searching, through the second network and based on the second image, a third region in the first feature map, where the third region is a region of the first feature map other than a fourth region, and the fourth region is the region of the first feature map corresponding to the marked region.
In a possible implementation manner, the obtaining, according to the second region, a region in the first image corresponding to the second region includes: acquiring the resolution of the second feature map and the resolution of the first image; and determining a region corresponding to the second region in the first image according to the position of the second region in the second feature map, the resolution of the second feature map and the resolution of the first image.
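A simple proportional mapping of this kind is sketched below. It assumes the feature map and the image are related by a uniform downsampling factor, which is an approximation; names are illustrative.

```python
def region_on_image(region, feat_resolution, image_resolution):
    """Scale a (y0, y1, x0, x1) region from feature-map coordinates to image
    coordinates using the ratio of the two resolutions."""
    scale = image_resolution / feat_resolution
    y0, y1, x0, x1 = region
    return (int(y0 * scale), int(y1 * scale), int(x0 * scale), int(x1 * scale))

# a region on a 14x14 feature map mapped onto a 224x224 input image
print(region_on_image((3, 5, 6, 9), feat_resolution=14, image_resolution=224))
# -> (48, 80, 96, 144)
```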
In one possible implementation, the second region includes a plurality of sub-regions. The obtaining, according to the second region, a region corresponding to the second region in the first image includes: obtaining a probability corresponding to each of the plurality of sub-regions, where the probability is the increase in the probability that the second feature map is predicted to be the category of the second image after that sub-region is replaced; sorting the probabilities corresponding to the sub-regions from high to low to obtain a sorting result; and obtaining, according to the sorting result and the plurality of sub-regions, a plurality of regions corresponding to the plurality of sub-regions in the first image, where the plurality of regions have different marking modes, and the marking modes are related to the sorting result.
In the scheme, the regions are sorted and displayed according to the contribution degrees of the different regions to the change of the classification result of the second feature map, so as to highlight the corresponding importance degrees of the different regions.
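One possible way of producing such ranked markings is sketched below. It reuses the illustrative `region_on_image` helper from the previous sketch; the marking styles and data layout are purely illustrative assumptions.

```python
def rank_subregions(subregions_with_gain, feat_resolution, image_resolution):
    """subregions_with_gain: list of (region, probability_gain) pairs, where the
    gain is the increase in the target-class probability after the sub-region
    is replaced. Returns image regions paired with a rank-dependent marking."""
    markings = ["solid red", "dashed orange", "dotted yellow"]  # rank 1, 2, 3, ...
    ranked = sorted(subregions_with_gain, key=lambda item: item[1], reverse=True)
    result = []
    for rank, (region, gain) in enumerate(ranked):
        image_region = region_on_image(region, feat_resolution, image_resolution)
        style = markings[min(rank, len(markings) - 1)]
        result.append((image_region, style, gain))
    return result
```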
A second aspect of the present application provides an image processing apparatus including an acquiring unit and a processing unit. The processing unit is configured to process a first image through a first network to obtain a plurality of feature maps, where the plurality of feature maps include a first feature map and a second feature map, and the resolution of the second feature map is greater than that of the first feature map. The processing unit is further configured to search the first feature map through a second network based on a second image to obtain a first region in the first feature map, where the first region is used to change the classification result of the first feature map, the classification result of the first image is different from that of the second image, the classification result of a third image is the same as that of the second image, and the third image is the first feature map obtained by replacing the first region with the corresponding region in the second image. The acquiring unit is configured to acquire a mapping region of the first region in the second feature map, where the content of the first region is related to the mapping region. The processing unit is further configured to search the mapping region through the second network based on the second image to obtain a second region in the second feature map, where the second region is used to change the classification result of the second feature map. The processing unit is further configured to obtain, according to the second region, a region in the first image corresponding to the second region.
In one possible implementation, the first feature map is a feature map with the smallest resolution among the plurality of feature maps.
In one possible implementation, the second feature map is a feature map of the plurality of feature maps that is closest in resolution to the first feature map; or, the resolution of the second feature map is a first preset resolution; or one or more feature maps in the plurality of feature maps are further spaced between the second feature map and the first feature map, and the resolution of the one or more feature maps is greater than that of the first feature map and less than that of the second feature map.
In one possible implementation, the acquiring unit is configured to: acquire the convolution kernel size and stride of the feature extraction layer that outputs the first feature map; and determine the mapping region in the second feature map according to the first region and the convolution kernel size and stride of that feature extraction layer.
In one possible implementation manner, the processing unit is further configured to: repeatedly executing the area searching step until a termination condition is met to obtain a target area; outputting a region corresponding to the target region in the first image; the area searching step includes: acquiring a mapping region of a region in the ith feature map in the (i + 1) th feature map, wherein the region in the ith feature map is used for changing the classification result of the ith feature map; based on the second image, performing region search in a mapping region in the (i + 1) th feature map through the second network to obtain a region in the (i + 1) th feature map, wherein the region in the (i + 1) th feature map is used for changing a classification result of the (i + 1) th feature map; adding 1 to the value of i; the target area is an area in the (i + 1) th feature map when a termination condition is met, the resolution of the (i) th feature map is smaller than that of the (i + 1) th feature map, i is greater than or equal to 2, and when i is equal to 2, the (i) th feature map is the second feature map.
In one possible implementation manner, the termination condition includes that the resolution of the (i + 1) th feature map reaches a second preset resolution.
In one possible implementation manner, the termination condition includes that an area ratio between a region in the (i + 1) th feature map and a mapping region in the (i + 1) th feature map is greater than a preset threshold.
In one possible implementation manner, the processing unit is further configured to: processing the second image through the first network to obtain a third feature map, wherein the resolution of the third feature map is the same as that of the first feature map; replacing a partial region in the first feature map with a partial region in the third feature map, so that the classification result of the first feature map after replacing the partial region is the same as the classification result of the second image; and determining the area replaced in the first feature map after the partial area is replaced as the first area.
In a possible implementation manner, the obtaining unit is further configured to obtain a non-background region in the first feature map, where the non-background region is a region of the first feature map other than a background region; the processing unit is further configured to search for a non-background region in the first feature map through the second network based on the second image.
In a possible implementation manner, the acquiring unit is further configured to acquire a mark region in the first image; the processing unit is further configured to search, based on the second image, for a third area in the first feature map through the second network, where the third area is an area in the first feature map except for a fourth area, and the fourth area is an area in the first feature map corresponding to the mark area.
In a possible implementation manner, the processing unit is further configured to determine, according to a position of the second region in the second feature map, a resolution of the second feature map, and a resolution of the first image, a region in the first image corresponding to the second region.
In one possible implementation, the second region includes a plurality of sub-regions, and the processing unit is further configured to: obtain a probability corresponding to each of the plurality of sub-regions, where the probability is the increase in the probability that the second feature map is predicted to be the category of the second image after that sub-region is replaced; sort the probabilities corresponding to the sub-regions from high to low to obtain a sorting result; and obtain, according to the sorting result and the plurality of sub-regions, a plurality of regions corresponding to the plurality of sub-regions in the first image, where the plurality of regions have different marking modes, and the marking modes are related to the sorting result.
The image processing apparatus provided by the second aspect corresponds to the method described in the first aspect and is configured to implement the method provided by the first aspect, so the same or corresponding beneficial effects as those of the first aspect can be achieved; details are not repeated here.
A third aspect of the present application provides an image processing apparatus, which may comprise a processor, a memory coupled to the processor, the memory storing program instructions, which when executed by the processor, implement the method of the first aspect. For the processor to execute the steps in each possible implementation manner of the first aspect, reference may be made to the first aspect specifically, and details are not described here.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A fifth aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of the first aspect described above.
A sixth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A seventh aspect of the present application provides a chip system, where the chip system includes a processor configured to support a server or an image processing apparatus in implementing the functions referred to in the first aspect, for example, sending or processing the data and/or information referred to in the method. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
FIG. 2 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a replacement operation performed on an image according to an embodiment of the present application;
FIG. 6 (a) is a schematic diagram of an application scenario in an educational product according to an embodiment of the present application;
FIG. 6 (b) is a schematic diagram of an application scenario in AI program development according to an embodiment of the present application;
FIG. 6 (c) is a schematic diagram of another application scenario in AI program development according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 8 (a) is a schematic diagram of calculating a receptive field according to an embodiment of the present application;
FIG. 8 (b) is a schematic diagram of converting a second feature map into a first image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a search region according to an embodiment of the present application;
FIG. 10 (a) is a schematic diagram of performing a region search based on a locked region according to an embodiment of the present application;
FIG. 10 (b) is a schematic diagram of a comparison of region searches based on a locked region according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of searching for a key region according to an embodiment of the present application;
FIG. 13 is a schematic diagram of searching for key regions layer by layer according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a comparison of key regions according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenes, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, technical terms related to the embodiments of the present application will be explained below.
Receptive field: in a convolutional neural network, the receptive field refers to the size of the region on the input image/input feature map to which a pixel on the feature map output by a given layer is mapped; that is, a point on the feature map is computed from all the pixels within the receptive-field region of the input. The larger the receptive field, the larger the range of the original image that a point on the feature map can reach, which also means that it captures more global features of a higher semantic level. Conversely, a smaller receptive field indicates that a point on the feature map contains more local, detailed features.
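For reference, the receptive field of a stack of convolutional layers can be accumulated with the usual recurrence. The sketch below is generic, not tied to any network in this application, and the example layer configuration is an assumption.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, from the first layer onward.
    Returns the receptive field size (in input pixels) of one output point."""
    rf, jump = 1, 1          # start from a single input pixel
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump   # each layer widens the field
        jump *= stride                   # and enlarges the step between outputs
    return rf

# three 3x3 convolutions, each with stride 2
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # -> 15
```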
The general workflow of an artificial intelligence system is described first. Referring to FIG. 1, which shows a schematic structural diagram of an artificial intelligence body framework, the framework is explained below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the general process from data acquisition to processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) An infrastructure.
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. Communication with the outside is achieved through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the basic platform includes distributed computing frameworks, networks and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the basic platform for computation.
(2) And (4) data.
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) And (6) data processing.
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating intelligent human inference in a computer or intelligent system: the machine uses formalized information to reason about and solve problems according to an inference control strategy, and a typical function is searching and matching.
Decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sorting, prediction and the like.
(4) Universal capability.
After the above-mentioned data processing of the data, further based on the result of the data processing, some general capabilities may be formed, such as an algorithm or a general-purpose system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent products and industrial applications.
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decisions into products and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The method provided by the application is described from the model training side and the model application side as follows:
the model training method provided by the embodiment of the application can be particularly applied to data processing methods such as data training, machine learning and deep learning, symbolic and formal intelligent information modeling, extraction, preprocessing, training and the like are carried out on training data, and a trained neural network model (such as a target neural network model in the embodiment of the application) is finally obtained; and the target neural network model can be used for model reasoning, and specifically, input data can be input into the target neural network model to obtain output data.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) A neural network.
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes xs (i.e. input data) and an intercept 1 as inputs, and the output of the operation unit may be: f(∑s Ws·xs + b),
where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is an activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
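As a small illustration of this formula (not part of the application), a single neural unit with a sigmoid activation can be written as:

```python
import math

def neural_unit(xs, ws, b):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid
    activation: f(sum_s ws[s] * xs[s] + b)."""
    s = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-s))

print(neural_unit([0.5, -1.2, 3.0], [0.8, 0.1, -0.4], b=0.2))
```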
(2) Convolutional Neural Networks (CNN). A convolutional neural network is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be regarded as a filter, and the convolution process may be regarded as convolving an input image or a convolutional feature plane (feature map) with a trainable filter. The convolutional layer is a layer of neurons (for example, the first convolutional layer and the second convolutional layer in this embodiment) that performs convolution processing on the input signal in the convolutional neural network. In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons of the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. The neural units of the same feature plane share weights, where the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all locations on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Specifically, as shown in fig. 2, convolutional Neural Network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
The structure formed by the convolutional layer/pooling layer 120 and the neural network layer 130 may be a first convolutional layer and a second convolutional layer described in this application, the input layer 110 is connected to the convolutional layer/pooling layer 120, the convolutional layer/pooling layer 120 is connected to the neural network layer 130, the output of the neural network layer 130 may be input to the active layer, and the active layer may perform nonlinear processing on the output of the neural network layer 130.
Convolutional layer/pooling layer 120. Convolutional layer: the convolutional layer/pooling layer 120 as shown in FIG. 2 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or may be used as the input of another convolutional layer to continue the convolution operation.
Taking the convolutional layer 121 as an example, the convolutional layer 121 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually applied to the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends over the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimension are applied, and the outputs of these weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same dimension, so the feature maps extracted by them also have the same dimension, and the extracted feature maps of the same dimension are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
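The sliding-window application of a single weight matrix described above can be illustrated with the following sketch (numpy-based, single channel, no padding; the kernel values are an illustrative assumption, not a trained filter):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one weight matrix (kernel) over a single-channel image; the same
    weights are reused at every position (weight sharing)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y * stride:y * stride + kh, x * stride:x * stride + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to vertical edges
print(conv2d_single(np.random.rand(6, 6), edge_kernel).shape)  # (4, 4)
```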
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by the later convolutional layers (e.g., 126) become more complex, such as features with high-level semantics; features with higher semantics are more suitable for the problem to be solved.
A pooling layer: since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after a convolutional layer, i.e., layers 121-126 as illustrated by 120 in fig. 2, either one convolutional layer followed by one pooling layer or multiple convolutional layers followed by one or more pooling layers.
The neural network layer 130: after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 needs to generate one or a set of outputs of the number of classes as needed using the neural network layer 130. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 2) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (the propagation from 110 to 140 in FIG. 2 is the forward propagation), the backward propagation (the propagation from 140 to 110 in FIG. 2 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features respectively extracted are all input to the whole neural network layer 130 for processing.
(3) A deep neural network.
A Deep Neural Network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers, where "many" has no particular metric. Dividing a DNN by the position of its layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Typically, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the ith layer is connected with any neuron of the (i + 1) th layer. Although a DNN appears complex, the work of each layer is not complex; in short, it is the following linear relational expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, there are many coefficients W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the kth neuron of layer L-1 to the jth neuron of layer L is defined as W^L_jk. Note that the input layer has no W parameters. In a deep neural network, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, the more parameters, the higher the model complexity and the larger the "capacity", which means that the model can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices (formed by the vectors W of many layers) of all layers of the trained deep neural network.
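A layer-by-layer forward pass of this form can be sketched as follows (illustrative only; the layer sizes and the ReLU activation are assumptions):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Apply y = alpha(W @ x + b) layer by layer, with a ReLU activation."""
    for W, b in zip(weights, biases):
        x = np.maximum(W @ x + b, 0.0)
    return x

# a 3-layer network: 4 -> 5 -> 3 -> 2
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5)), rng.standard_normal((2, 3))]
bs = [np.zeros(5), np.zeros(3), np.zeros(2)]
print(dnn_forward(rng.standard_normal(4), Ws, bs))
```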
(4) A loss function.
In the process of training a deep neural network, because the output of the network is expected to be as close as possible to the value that is really desired to be predicted, the weight vector of each layer can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, an initialization process is usually carried out before the first update, i.e. parameters are preset for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment is carried out continuously until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and training the deep neural network becomes the process of reducing this loss as much as possible.
(5) A back propagation algorithm.
A convolutional neural network can adopt a back propagation (BP) algorithm to adjust the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, an error loss is produced as the input signal is propagated forward until it is output, and the parameters in the initial super-resolution model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrices.
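The following is a minimal sketch (in PyTorch, which is only one possible framework) of how a loss function and back propagation are typically combined in training: the forward pass produces a prediction, the loss function measures the error against the target, and back propagation of that error drives the update of the weights and biases. The model, data sizes, and optimizer settings are illustrative assumptions, not taken from this embodiment.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    loss_fn = nn.CrossEntropyLoss()                  # measures predicted vs. target difference
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(8, 16)                           # a batch of 8 inputs
    target = torch.randint(0, 4, (8,))               # the truly desired classes

    for step in range(100):
        logits = model(x)                            # forward propagation
        loss = loss_fn(logits, target)               # prediction error (loss)
        optimizer.zero_grad()
        loss.backward()                              # back propagation of the error loss
        optimizer.step()                             # update weights and biases to reduce the loss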
(6) Linear operation.
Linearity refers to a proportional, straight-line relationship between quantities; mathematically it can be understood as a function whose first derivative is a constant. Linear operations can be, but are not limited to, addition operations, no-op (null) operations, identity operations, convolution operations, batch normalization (BN) operations, and pooling operations. A linear operation may also be referred to as a linear mapping, which needs to satisfy two conditions: homogeneity and additivity; if either condition is not satisfied, the operation is non-linear.
Here, homogeneity means f(ax) = af(x); additivity means f(x + y) = f(x) + f(y); for example, f(x) = ax is linear. It should be noted that x, a, and f(x) here are not necessarily scalars; they may be vectors or matrices, forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, then when a is a constant this is equivalent to satisfying homogeneity, and when a is a matrix it is equivalent to satisfying additivity. In contrast, a function whose graph is a straight line does not necessarily correspond to a linear mapping; for example, f(x) = ax + b satisfies neither homogeneity nor additivity, and therefore belongs to the non-linear mappings.
In the embodiment of the present application, a composite of multiple linear operations may be referred to as a linear operation, and each linear operation included in the linear operation may also be referred to as a sub-linear operation.
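As a small illustration of the two conditions above, the following numpy sketch numerically checks homogeneity and additivity for a candidate function; the tolerance, test dimension, and example matrix are arbitrary choices for the illustration.

    import numpy as np

    def is_linear(f, dim=3, trials=100, tol=1e-6):
        # Numerically check homogeneity f(a*x) == a*f(x) and additivity
        # f(x + y) == f(x) + f(y) on random inputs.
        rng = np.random.default_rng(0)
        for _ in range(trials):
            x, y = rng.normal(size=dim), rng.normal(size=dim)
            a = rng.normal()
            if not np.allclose(f(a * x), a * f(x), atol=tol):
                return False
            if not np.allclose(f(x + y), f(x) + f(y), atol=tol):
                return False
        return True

    A = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 3.0], [1.0, 1.0, 0.0]])
    print(is_linear(lambda x: A @ x))           # True: matrix multiplication is linear
    print(is_linear(lambda x: A @ x + 1.0))     # False: the constant offset breaks both conditions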
Fig. 4 is a schematic diagram of a system architecture provided in an embodiment of the present application. In fig. 4, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140.
During the preprocessing of the input data by the execution device 110, or during the processing related to the computation performed by the computation module 111 of the execution device 110 (such as implementing the function of the neural network in the present application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
Alternatively, the client device 140 may be, for example, a control unit in an automatic driving system, or a functional algorithm module in a mobile phone terminal, where the functional algorithm module may be used to implement related tasks.
It should be noted that the training device 120 may generate corresponding target models/rules (e.g., target neural network models in this embodiment) based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 4, the user may manually give the input data, for example through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if the client device 140 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific form may be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data and storing them in the database 130. Of course, the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, may also be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 4 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 4, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
In explainable Artificial Intelligence (XAI), counterfactual interpretation is a kind of interpretation that is more consistent with human cognition. A counterfactual interpretation describes the minimum condition that needs to be changed for an event to have a different outcome. Based on a counterfactual interpretation, one can know how an instance needs to change in order to alter its prediction. In this way, by understanding which changes can cause the prediction to change, the user learns what to do so that the prediction gets closer to the expectation.
For example, xiaoming applied for a loan and was rejected by machine learning based banking software, he wanted to know why his application was rejected and how he would improve the chances of getting a loan. Based on the counter-fact explanation, the above-mentioned "why" problem can be expressed as: what is the smallest change in a characteristic (e.g., income, number of credit cards, age, etc.), can be made to change the prediction from refusal to approval?
In some possible examples, the counterfactual is explained as follows:
if Xiaoming earns 10,000 dollars each year, he will get a loan.
If there are fewer minuscule credit cards and there is no delinquent loan 5 years ago, he will get a loan.
Thus, by understanding the conditions that can cause changes in the loan application results, it is possible to know how to do so to bring the loan application results closer to expectations.
Applying the counterfactual interpretation to the classification problem in the computer vision field, a counterfactual image or a feature map can be generated by perturbing an original image, so that the category of the generated image is changed. That is, if the input image is the generated image, the classification result of the input image is a category desired by the user, which is different from that of the original image.
For example, for an image of interest that belongs to category A, a user would like to know why the image of interest does not belong to category B. Therefore, a set of experimental images including the image of interest (belonging to category A) and an interference image (belonging to category B) may be input, and a composite image is obtained by replacing a partial region on the image of interest with the corresponding position on the interference image. In this way, the classification result of the composite image by the classification model is changed from category A to category B.
Briefly, the application of counterfactual explanations in image classification is specifically: the method comprises the steps of obtaining an input image and a comparison image, replacing partial areas in the input image with areas in the comparison image one by one, reasoning which area is replaced to change the classification result of a composite image, and finally obtaining the composite image with the changed classification result.
For example, referring to fig. 5, fig. 5 is a schematic diagram illustrating a replacement operation performed on an image according to an embodiment of the present disclosure. As shown in fig. 5, the input image I and the comparison image I' belong to two different bird categories (the comparison image I' being, for example, a cormorant). In the processing, all the regions in the input image I are used as search candidate regions, and for any search candidate region, all the regions in the comparison image I' need to be substituted one by one, inferring each time whether the classification result of the synthesized image changes after the region is replaced. For example, assuming that the resolution of the input image I is 100 × 100, the input image I may be divided into 100 × 100 search candidate regions; assuming that the resolution of the comparison image I' is also 100 × 100, the comparison image I' may be divided into 100 × 100 regions. For each search candidate region in the input image I, the 100 × 100 regions in the comparison image I' need to be tried one by one. Therefore, every time a critical region is to be found, 100 × 100 × 100 × 100 (i.e., 10 to the 8th power) replacements are required. In the case where the number of critical regions is large, the number of replacements is further multiplied, resulting in an excessively large total number of replacements. A critical region refers to a partial region in the input image such that, after the critical region is replaced by the corresponding partial region in the comparison image, the probability that the input image is predicted as the category to which the comparison image belongs increases. When all critical regions in the input image are replaced by regions in the comparison image, the replaced input image can be predicted as the category to which the comparison image belongs.
To ensure the accuracy of the counterfactual interpretation results, the current way of obtaining counterfactual interpretation results usually requires region-by-region replacement on a relatively high-resolution image. Since the resolution of the image is high, a large number of regions need to be replaced during processing, resulting in low processing efficiency; it often takes a long time to output the result.
In view of this, an embodiment of the present application provides an image processing method, which obtains a plurality of feature maps with different resolutions of an input image, searches a key area for changing a feature map classification result on a feature map with a lower resolution, determines a mapping area of the key area on the feature map with the lower resolution on the feature map with the higher resolution, and further searches the key area on the mapping area, so as to finally obtain an area that needs to be replaced on the image. By searching the key area on the low-resolution characteristic diagram and searching the key area in the corresponding area on the high-resolution characteristic diagram, the area needing to be searched on the high-resolution characteristic diagram is effectively reduced, the efficiency is improved, and the time for outputting the result is reduced.
The image processing method provided by the embodiment of the application can be applied to a terminal. The terminal may be, for example, a digital camera, a surveillance camera, a mobile phone (mobile phone), a Personal Computer (PC), a laptop, a server, a tablet PC, a smart tv, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote surgery (remote medical supply), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in home (smart home), or the like. For convenience of description, the following describes an image processing method provided in an embodiment of the present application, taking an example of applying the image processing method to a terminal.
It should be understood that the image in the embodiment of the present application may be a static image (or referred to as a static picture) or a dynamic image (or referred to as a dynamic picture), such as an RGB image, a black-and-white image, a grayscale image, or the like. For convenience of description, the present application collectively refers to a still image or a moving image as an image in the following embodiments.
For ease of understanding, a scenario to which the image processing method provided in the embodiment of the present application can be applied will be described below. In the image processing method provided by the embodiment of the application, the original image and the comparison image are input, and the output image can be output. The output image is obtained by replacing a partial region in the original image with a partial region in the comparison image, and the classification result of the output image is the same as that of the comparison image. The image processing method provided by the embodiment of the application can be applied to the fields of education products, computer program development, medical treatment or traffic. The present embodiment does not limit the scene to which the image processing method is applied.
Particularly, through the image processing method based on counterfactual interpretation, an interpretation that accords with human cognition can be provided, helping a user understand the input factors that influence the model's decision and how changing these factors makes the decision result of the model closer to the human expectation. By comparing the different interpretations produced by the image processing method on the same data with different models, the user can find problems that may exist in the models, as well as compare how well the models perform. The image processing method can also be applied to practical scenarios involving fine-grained classification problems, such as assisting doctors in medical judgment, or helping beginners quickly understand the differences among different species of the same class.
For example, referring to fig. 6 (a), fig. 6 (a) is a schematic diagram of an application scenario in an educational product according to an embodiment of the present application. As shown in fig. 6 (a), a child of a bird of beginners may select a set of data to be interpreted, which includes: an image of an unknown bird (i.e., the original image described above) and an image of an already recognized bird (i.e., the comparative image described above).
Then, the prepared bird classification model and the data to be interpreted are input into the counterfactual interpretation module. When the bird classification model is input into the counterfactual interpretation module, the bird classification model can be wrapped, for example by adding a calling interface and similar processing operations, so that the counterfactual interpretation module can be connected with the bird classification model as a whole. Thus, after the data to be interpreted is input into the counterfactual interpretation module, the counterfactual interpretation module performs its calculation and outputs a result, the output result being an image obtained by applying the replacement processing to the image of the unknown bird. From the output result, the child can compare the differences between the unknown bird and the known bird, and learn the distinguishing characteristics of the unfamiliar bird.
For example, referring to fig. 6 (b), fig. 6 (b) is a schematic diagram of an application scenario in the development of an AI program according to an embodiment of the present application. As shown in fig. 6 (b), the AI developer prepares a plurality of image classification models in advance, which are models that the AI developer needs to compare. Second, the AI developer picks image data that needs interpretation.
Then, the AI developer inputs one of the image classification models that has been prepared and the data to be interpreted into the counterfactual interpretation module. In this way, after the data to be explained is input to the counterfactual interpretation module, the counterfactual interpretation module calculates and outputs a result, and the output result is an image obtained by performing replacement processing on the image data. After a group of output data corresponding to one image classification model is obtained, the AI developer inputs the other image classification model into the counterfactual interpretation model, and repeatedly inputs the data to be interpreted to obtain the output data of the counterfactual interpretation model. And circulating the steps until the output data corresponding to all the image classification models are obtained. In this way, the AI developer can determine the regions of interest for different image classification models based on different output data, and determine which image classification model is more reliable.
For example, referring to fig. 6 (c), fig. 6 (c) is a schematic diagram of another application scenario in AI program development provided by the embodiment of the present application. As shown in fig. 6 (c), the AI developer prepares an image classification model to be detected in advance, and selects image data to be interpreted.
The AI developer then enters the prepared image classification model and the data to be interpreted into the counterfactual interpretation module. In this way, after the data to be explained is input into the counterfactual interpretation module, the counterfactual interpretation module calculates and outputs the result, and the output result is the image obtained after the replacement processing is carried out on the image data. After obtaining a set of output data corresponding to the image classification model, the AI developer may retrain and adjust the image classification model based on the output data. After obtaining the adjusted image classification model, the AI developer inputs the adjusted image classification model and the data to be explained into the counterfactual interpretation module again to obtain new output data. And circulating the steps until the AI developer is satisfied with the classification precision of the adjusted image classification model. Therefore, an AI developer can continuously adjust the image classification model according to the output data of the counterfactual interpretation module, and the image classification model with higher classification precision is obtained.
Similarly, in the medical field, by inputting the image to be diagnosed and the contrast image, a doctor can be helped to learn and judge which positions in the image to be diagnosed are key positions affecting the diagnosis type corresponding to the image to be diagnosed. The embodiment does not describe any more specific flow of the image processing method applied in each field.
Referring to fig. 7, fig. 7 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 7, the image processing method includes the following steps 701 to 705.
Step 701, processing the first image through a first network to obtain a plurality of feature maps with different resolutions.
In this embodiment, the first network may be a neural network, such as a convolutional neural network. The first network may include a plurality of feature extraction layers, and the plurality of feature extraction layers are connected in sequence. And after the first image is input into the first network, the first image is subjected to feature processing by a plurality of feature extraction layers, and feature maps with different resolutions are respectively output.
The first image may be an image for which counterfactual interpretation needs to be performed, and the result of performing counterfactual interpretation on the first image is: after which regions in the first image are replaced, the category of the replaced first image changes. The first image may be, for example, an image of a bird that is unknown to children as described in the example corresponding to fig. 6 (a).
The resolution of the first image may be greater than or equal to the resolutions of the plurality of feature maps. For example, assuming that the resolution of the first image is 1000 × 1000, the resolutions of the feature maps obtained by processing the first image may be 250 × 250, 50 × 50, and 10 × 10, respectively. In general, in the process of extracting the feature maps, the resolution of the feature maps is gradually reduced within the first network. Therefore, in this case, the higher the resolution of a feature map, the closer the feature map is to the first image.
Step 702, searching in the first feature map through a second network based on a second image to obtain a first area in the first feature map.
In this embodiment, the second image is a contrast image of the first image, and the classification results of the first image and the second image are different, that is, the category to which the first image belongs is different from the category to which the second image belongs. For example, assuming that both the first image and the second image are images of birds, after the first image and the second image are respectively input into a classification network, the classification result of the first image is a cupola cuphea, and the classification result of the second image is a red-face cormorant.
After obtaining the plurality of feature maps corresponding to the first image, one of the feature maps (i.e., the first feature map) may be selected, for example, one of the feature maps with a lower resolution. Then, based on the second image, a key area is searched for in the selected first feature map through the second network. The second network may specifically include a classification network, and the classification network may predict a category to which the feature map belongs based on the feature map to obtain a classification result.
Here, searching in the first feature map means: replacing regions in the first feature map with the corresponding regions in the second image one by one, and determining which region, once replaced, increases the probability that the classification network predicts the first feature map as the category to which the second image belongs. For example, assume that the probability that the initial first feature map is predicted by the classification network as the category to which the second image belongs is 0.1. When a partial region in the first feature map is replaced by the corresponding region in the second image, the probability that the replaced first feature map is predicted as the category to which the second image belongs becomes 0.3, i.e. the probability is increased by 0.2. It may then be determined that the replaced region in the first feature map is the first region, and the first region is used to change the classification result of the first feature map.
Optionally, in this embodiment, the search of the key region is performed on the first feature map, and the resolution of the second image is usually different from the resolution of the first feature map, so when replacing the region in the first feature map, the region in the first feature map may be replaced with the region in the feature map corresponding to the second image.
For example, the performing of the search for the region on the first feature map may specifically include: and processing the second image through the first network to obtain a third feature map, wherein the resolution of the third feature map is the same as that of the first feature map. And replacing a partial region in the first feature map with a partial region in the third feature map, so that the classification result of the first feature map after replacing the partial region is the same as the classification result of the second image. And determining the area replaced in the first feature map after the partial area is replaced as the first area. That is, the corresponding region in the second image may refer to a partial region in the third feature map corresponding to the second image, and the resolution of the third feature map corresponding to the second image is the same as the resolution of the first feature map.
Generally, replacing only one region in the first feature map does not change the classification result of the first feature map to be the same as the classification result corresponding to the second image. Therefore, in the process of performing the search for the first feature map region, the regions in the first feature map may be searched one by one, and after one region is obtained by the search, the searched region in the first feature map is replaced, and the process is circulated until the classification result of the first feature map after replacing one or more regions is changed into the classification result of the second image.
Illustratively, when the first key area search is performed, each area in the first feature map is sequentially replaced by each area in the third feature map, and the classification result of the first feature map after the areas are replaced is predicted through the second network. And based on the classification result predicted by the second network, acquiring the improved probability of the first feature map after each region is replaced when the first feature map is predicted to belong to the category of the second image, thereby determining the region with the highest improved probability when the first feature map is predicted to belong to the category of the second image, and determining the region as the region searched in the first key region searching process. And after the key area is obtained by the first search, replacing the key area in the first characteristic diagram with the corresponding area in the third characteristic diagram to obtain the first characteristic diagram after the area is replaced once.
And then, based on the first feature map after the primary key area is replaced, searching a secondary key area to obtain a new key area. And on the basis of the first characteristic diagram after the primary key area is replaced, replacing a new key area searched in the searching process of the secondary key area to obtain the first characteristic diagram after the secondary key area is replaced. And circulating until the first feature map with the key area replaced for N times is predicted as the class of the second image by the second network.
That is, in the present embodiment, the first region may include one or more critical regions. After the first region in the first feature map is replaced with the corresponding region in the second image, a third image is obtained, namely the first feature map with the first region replaced. The classification result of the third image is the same as the classification result of the second image.
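The greedy region search described above can be sketched roughly as follows; this is not the exact procedure of this embodiment. The sketch treats each spatial cell of the first feature map as one candidate region and assumes the classification network accepts a feature map of shape (1, C, H, W); both assumptions are illustrative.

    import torch

    def greedy_region_search(feat, ref_feat, classifier, target_class, max_steps=None):
        # Repeatedly replace the single spatial cell of `feat` (shape C x H x W) with the
        # corresponding cell of `ref_feat` (the comparison image's feature map) that most
        # increases the probability of `target_class`, until the prediction changes.
        feat = feat.clone()
        C, H, W = feat.shape
        replaced = []
        max_steps = max_steps if max_steps is not None else H * W
        with torch.no_grad():
            for _ in range(max_steps):
                best, best_prob = None, -1.0
                for i in range(H):
                    for j in range(W):
                        if (i, j) in replaced:
                            continue
                        trial = feat.clone()
                        trial[:, i, j] = ref_feat[:, i, j]       # replace one candidate region
                        logits = classifier(trial.unsqueeze(0))  # classify the modified feature map
                        prob = torch.softmax(logits, dim=1)[0, target_class].item()
                        if prob > best_prob:
                            best, best_prob = (i, j), prob
                feat[:, best[0], best[1]] = ref_feat[:, best[0], best[1]]
                replaced.append(best)
                pred = classifier(feat.unsqueeze(0)).argmax(dim=1).item()
                if pred == target_class:                         # classification result has changed
                    break
        return replaced, feat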
Step 703, obtaining a mapping area of the first area in the second feature map, where the content of the first area is related to the mapping area.
The second feature map is extracted by another feature extraction layer in the first network, and the resolution of the second feature map is greater than that of the first feature map. The content of the first region is related to the mapping region. In brief, the feature extraction layer in the first network used for outputting the second feature map outputs the second feature map, and the second feature map is then input into the feature extraction layer used for outputting the first feature map; the latter processes the second feature map and outputs the first feature map. The first region in the first feature map is thus obtained by the feature extraction layer for outputting the first feature map processing the mapping region in the second feature map.
That is, the mapping region is actually the receptive field of the first region in the second feature map. The receptive field can be derived based on the size and step size of the convolution kernel of the feature extraction layer from which the first feature map is extracted and the first region.
Illustratively, the obtaining the mapping region of the first region in the second feature map may specifically include: acquiring the size and the step length of a convolution kernel of a feature extraction layer for extracting the first feature map; and determining the mapping area in the second feature map according to the convolution kernel size and the step length of the first area and the feature extraction layer.
Referring to fig. 8 (a), fig. 8 (a) is a schematic diagram of calculating a receptive field according to an embodiment of the present disclosure. As shown in fig. 8 (a), fig. 8 (a) includes the feature map extracted by the (i-1)-th feature extraction layer (hereinafter referred to as feature map i-1), feature map i, and feature map i+1. The value "5" in feature map i+1 is calculated from the framed 2 × 2 portion of feature map i; therefore, the framed 2 × 2 region in feature map i can be referred to as the receptive field of the region in which the value "5" in feature map i+1 is located. Similarly, each region having the value "3" in feature map i is calculated from a 3 × 3 region in feature map i-1, i.e. the convolution kernel size is 3. The stride from feature map i-1 to feature map i is 2, and the 5 × 5 region in feature map i-1 is the receptive field of the framed 2 × 2 region in feature map i. That is, the receptive field of a particular region of a feature map in the previous-layer feature map is determined by the convolution kernel size and the stride.
Specifically, the mapping region in the second feature map can be found by the following formula 1.
S = (S1 - 1) × stride + ksize    (formula 1)
where S is the size of the mapping region in the second feature map, S1 is the size of the first region, stride is the step size of the feature extraction layer used for extracting the first feature map, and ksize is the convolution kernel size of that feature extraction layer.
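A small sketch of how formula 1 can be used to map a region of the first feature map to its receptive field in the second feature map; the (top, left, height, width) representation and the omission of padding are simplifying assumptions for illustration.

    def mapping_region(first_region, stride, ksize):
        # Map a (top, left, height, width) region of the first (lower-resolution)
        # feature map to its receptive field in the second (higher-resolution)
        # feature map, following S = (S1 - 1) * stride + ksize; padding is ignored.
        top, left, h, w = first_region
        return (top * stride,
                left * stride,
                (h - 1) * stride + ksize,
                (w - 1) * stride + ksize)

    # Example matching fig. 8(a): a 2 x 2 region with kernel size 3 and stride 2
    # corresponds to a 5 x 5 receptive field.
    print(mapping_region((0, 0, 2, 2), stride=2, ksize=3))   # (0, 0, 5, 5)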
In one possible implementation, the first feature map may be a feature map with the smallest resolution among the plurality of feature maps. The number of regions to be searched in the first feature map is the smallest among the plurality of feature maps. Therefore, the region in the first feature map can be searched for with the least time consumption relative to the other feature maps in the plurality of feature maps.
Optionally, the second feature map may be the feature map whose resolution is closest to that of the first feature map among the plurality of feature maps. That is, the feature extraction layer for extracting the second feature map and the feature extraction layer for extracting the first feature map may be two adjacent feature extraction layers. For example, in the case where the first feature map is the feature map with the minimum resolution among the plurality of feature maps, the feature extraction layer for extracting the first feature map is the last feature extraction layer in the first network, and the feature extraction layer for extracting the second feature map is the penultimate feature extraction layer in the first network.
Optionally, the resolution of the second feature map may be a first preset resolution, that is, the second feature map is extracted by a feature extraction layer capable of extracting the feature map with the first preset resolution. For example, assuming that the resolution of the first image is 1000 × 1000, the resolution of the first feature map may be 10 × 10, and the resolution of the second feature map may be 250 × 250. The first preset resolution may be, for example, a resolution specified by a user. In practical application, a user adjusts the first preset resolution according to actual needs. For example, when the user wishes to obtain a more accurate output result, the user may adjust the first preset resolution to a larger resolution; when the user wants to obtain the output result faster, the user can adjust the first preset resolution to be a smaller resolution.
When the resolution of the second feature map is the first preset resolution, the feature extraction layer for extracting the second feature map and the feature extraction layer for extracting the first feature map may be two adjacent feature extraction layers, or one or more other feature extraction layers may lie between the feature extraction layer for extracting the second feature map and the feature extraction layer for extracting the first feature map.
Step 704, searching in the mapping region through the second network based on the second image to obtain a second region in the second feature map.
The second region may include one or more regions. And replacing the second area in the second characteristic diagram with the corresponding area in the second image to obtain a second characteristic diagram after replacing the second area. The classification result of the second feature map after replacing the second region is the same as the classification result of the second image. The corresponding region in the second image may refer to a partial region in a fourth feature map corresponding to the second image, and the resolution of the fourth feature map corresponding to the second image is the same as the resolution of the second feature map.
The process of performing the area search in the mapping area of the second feature map through the second network is similar to the process of performing the area search in the first feature map through the second network, and reference may be specifically made to the description of step 702 above, which is not described herein again. It is noted that in step 702, a region search is performed for all regions in the first feature map. However, in this step, the search for the region is performed in the mapping region in the second feature map, that is, the range of the search region is reduced, and the search for the region is no longer performed for all the regions in the second feature map, so that the calculation resources can be saved and the calculation time can be greatly reduced.
Step 705, obtaining a region corresponding to the second region in the first image according to the second region.
Since the obtained second region is a region in the second feature map, the second region in the second feature map may be converted into the corresponding region in the first image, so as to output the first image marked with the corresponding region.
Optionally, the obtaining, according to the second region, a region corresponding to the second region in the first image specifically includes: and acquiring the resolution of the second feature map and the resolution of the first image. And determining a region corresponding to the second region in the first image according to the position of the second region in the second feature map, the resolution of the second feature map and the resolution of the first image.
For example, referring to fig. 8 (b), fig. 8 (b) is a schematic diagram of converting a second feature map provided in an embodiment of the present application into a first image. As shown in fig. 8 (b), the resolution of the second feature map is 4 × 4, and the resolution of the first image is 8 × 8. In the second feature map, 3 regions in the lower right corner are determined as second regions. Based on the position of the second region in the second feature map, and the resolution of the second feature map and the first image, a corresponding region of the second region may be determined in the first image. The corresponding regions are 12 regions in the lower right corner of the first image.
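A minimal sketch of this conversion, assuming a uniform spatial correspondence between feature-map cells and image pixels; the (top, left, height, width) region representation is an assumption for illustration.

    def to_image_region(feat_region, feat_resolution, image_resolution):
        # Scale a (top, left, height, width) region given in second-feature-map cells
        # to the corresponding pixel region of the first image.
        scale_h = image_resolution[0] / feat_resolution[0]
        scale_w = image_resolution[1] / feat_resolution[1]
        top, left, h, w = feat_region
        return (int(top * scale_h), int(left * scale_w),
                int(h * scale_h), int(w * scale_w))

    # Illustrative example in the spirit of fig. 8(b): a 2 x 2 block in the lower-right
    # corner of a 4 x 4 feature map maps to a 4 x 4 pixel block in an 8 x 8 image.
    print(to_image_region((2, 2, 2, 2), (4, 4), (8, 8)))   # (4, 4, 4, 4)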
In this embodiment, a plurality of feature maps with different resolutions of an input image are acquired, a region search is performed on a feature map with a lower resolution, a mapping region of a region on the feature map with the lower resolution on a feature map with a higher resolution is determined, and a region is further searched on the mapping region, so that a region to be replaced on the image is finally obtained. By searching the area on the low-resolution characteristic diagram and then searching the area in the corresponding area on the high-resolution characteristic diagram, the area needing to be searched on the high-resolution characteristic diagram is effectively reduced, the efficiency is improved, and the time for outputting the result is reduced.
For ease of understanding, the image processing method according to the embodiment of the present application will be described below with reference to specific examples. Referring to fig. 9, fig. 9 is a schematic diagram of a search area according to an embodiment of the present disclosure.
As shown in fig. 9, first, a first image is input into a first network, and feature extraction is performed on the first image by the first network to obtain a first feature map and a second feature map output by the first network. Then, based on the second image, a search for regions is performed on all regions of the first feature map through the second network, resulting in a first region on the first feature map. After the first area is obtained, a mapping area of the first feature map on the second feature map is obtained, and area searching is continuously performed in the mapping area. Finally, after the second area in the second feature map is obtained through searching, the second area in the second feature map is converted into the area corresponding to the first image, and the first image marked with the corresponding area is output.
In one possible embodiment, in order to improve the accuracy of the output result as much as possible, a region search may be performed on the obtained feature maps one by one, and the search range of each feature map is a mapping region of a key region that has been searched in the previous feature map.
Illustratively, the image processing method described above may further include: and repeatedly executing the area searching step until a termination condition is met to obtain a target area, wherein the target area is an area in the feature map searched for the last time. And outputting a region corresponding to the target region in the first image.
Specifically, the area searching step includes: acquiring a mapping region of a region in the ith feature map in the (i + 1) th feature map, wherein the region in the ith feature map is used for changing the classification result of the ith feature map; searching a mapping region in the (i + 1) th feature map through the second network based on the second image to obtain a region in the (i + 1) th feature map, wherein the region in the (i + 1) th feature map is used for changing a classification result of the (i + 1) th feature map; the value of i is incremented by 1.
The ith feature map and the (i + 1) th feature map are feature maps in a plurality of feature maps extracted by a first network, and the resolution of the ith feature map is smaller than that of the (i + 1) th feature map. The i is greater than or equal to 2, and when the i is equal to 2, the ith characteristic diagram is the same as the second characteristic diagram. The target area is an area in the (i + 1) th feature map when a termination condition is met.
For example, if the initial value of i is 2, the 2 nd feature map is the second feature map. Therefore, a mapping region of the region in the 2 nd feature map (i.e., the second region described above) in the 3 rd feature map is obtained; and searching the region in the mapping region in the 3 rd feature map through a second network based on the second image to obtain the region in the 3 rd feature map. Then, an add 1 operation is performed on i, and the value of i becomes 3. Then, acquiring a mapping region of a region in the 3 rd feature map in the 4 th feature map; and searching the region in the mapping region in the 4 th feature map through a second network based on the second image to obtain the region in the 4 th feature map. And by analogy, searching the areas in the plurality of feature maps one by one until a termination condition is met.
In brief, assume that the plurality of feature maps obtained by the first network are extracted by a plurality of feature extraction layers sequentially connected in the first network. The first feature map is extracted by the last feature extraction layer in the first network, and the second feature map is extracted by the penultimate feature extraction layer. Therefore, after the second region in the second feature map is obtained, the mapping region of the second region in the feature map extracted by the third-from-last feature extraction layer in the first network is obtained, and the region search is continued in that mapping region. After the region in the feature map extracted by the third-from-last feature extraction layer is obtained through searching, the mapping region in the feature map extracted by the fourth-from-last feature extraction layer is determined based on that region, so that the region search can continue. By analogy, after a region in a feature map is obtained, the corresponding mapping region in the feature map extracted by the previous feature extraction layer is determined, and the region search is continued in that mapping region until the termination condition is met.
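The repeated region-searching step can be sketched as the following coarse-to-fine loop; the helper functions (search_fn, mapping_fn, terminate_fn) are placeholders standing for the region search, receptive-field mapping, and termination-condition logic described in this embodiment, not concrete APIs.

    def coarse_to_fine_search(feature_maps, ref_feature_maps, classifier, target_class,
                              search_fn, mapping_fn, terminate_fn):
        # feature_maps[0] is the lowest-resolution (first) feature map and feature_maps[-1]
        # the highest-resolution one; ref_feature_maps are the feature maps of the second
        # (comparison) image at the same resolutions.
        # Full search on the lowest-resolution feature map (candidate=None means "whole map").
        region = search_fn(feature_maps[0], ref_feature_maps[0], classifier,
                           target_class, candidate=None)
        for i in range(1, len(feature_maps)):
            candidate = mapping_fn(region, level=i)        # receptive field in feature map i
            region = search_fn(feature_maps[i], ref_feature_maps[i], classifier,
                               target_class, candidate=candidate)
            if terminate_fn(region, candidate, feature_maps[i]):
                break
        return region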
In this embodiment, the above termination condition may be various. In practical applications, the termination condition can be adjusted according to actual needs.
In one possible implementation manner, the termination condition includes that the resolution of the (i+1)-th feature map reaches a second preset resolution. Generally, the higher the resolution of the finally searched feature map, the more precise the obtained region, but the more computation the search requires. Therefore, in practical applications, the accuracy of the result and the computational efficiency can be balanced according to the requirements of the user, so as to determine the resolution of the feature map searched last. That is, the second preset resolution may be a predefined resolution, and may be set by a user or a developer of the network; for example, the second preset resolution may be 250 × 250 or 100 × 100, and it is not limited here.
In another possible implementation manner, the termination condition includes that the area ratio between the region in the (i+1)-th feature map and the mapping region in the (i+1)-th feature map is greater than a preset threshold. Illustratively, the preset threshold may be, for example, 2/3, 3/4, or 4/5; the value of the preset threshold is not specifically limited in this embodiment.
It can be understood that, in the area searching process, after searching the area of the ith feature map, the process of continuing to search the area in the mapping area of the (i + 1) th feature map is actually to further narrow the area to obtain a more accurate result. When the area ratio between the region in the (i + 1) th feature map and the mapping region in the (i + 1) th feature map is found to be greater than the preset threshold value through the search, it may be considered that the region cannot be reduced by continuing the region search in the mapping region in the (i + 1) th feature map, and thus the region search may be terminated.
In practical application, in the process of performing region search in the mapping region in the (i + 1) th feature map, if the area ratio between the replaced region in the (i + 1) th feature map and the mapping region is already greater than or equal to a preset threshold, and the classification result of the (i + 1) th feature map after replacing part of the region is still not changed into the classification result of the second image, it may be considered that the area ratio between the region in the (i + 1) th feature map and the mapping region in the (i + 1) th feature map is greater than the preset threshold. At this time, the search for the region of the (i + 1) th feature map may be terminated, and the mapping region of the (i + 1) th feature map may be regarded as the region of the (i + 1) th feature map, thereby outputting the region of the (i + 1) th feature map.
It will be appreciated that, in general, for an image to be classified, other objects will generally be included in the image to be classified in addition to the target object to be classified. For example, for a bird image, the bird image may include scenes such as branches, sky, or grass, in addition to the birds to be classified. In this case, replacing scenes other than birds in bird images obviously does not change the classification result of the bird images after replacement. That is, for the image to be classified, replacing other objects in the image to be classified in addition to the target object to be classified does not change the classification result of the image to be classified.
Therefore, before the search of the regions is performed, if the object other than the target object can be identified, the regions where the object other than the target object is located can be locked, that is, the search of the regions is not performed in the regions, so that the search range of the regions is reduced, and the search efficiency is improved.
In this embodiment, the search range of the region may be reduced in various ways.
In one possible implementation, the background region in the input image may be identified by an algorithm, so as to exclude the background region in the input image from the search range, and reduce the search range of the region.
Illustratively, the searching in the first feature map through a second network based on a second image includes: and acquiring a non-background area in the first feature map, wherein the non-background area is an area except for a background area in the first feature map. The background region and the non-background region in the first image are identified, for example by an algorithm, and the background region and the non-background region are marked in the first image, respectively. After obtaining a plurality of feature maps of the first image through the first network, a background region and a non-background region corresponding to the feature maps may be determined based on a mapping relationship between the first image and the feature maps. Then, based on the second image, searching through the second network in a non-background region in the first feature map. That is, the background regions in the first feature map may be regarded as locked regions, and these locked regions are not searched any more in the process of performing the search for the regions.
In another possible implementation, a region in which a search is not to be performed may be specified by the user, so that the part of the region specified by the user is excluded from the search range, and the search range of the region is reduced. For example, in a case where the image processing method provided in the embodiment of the present application is applied to an educational product, for an image of a bird unknown to a child (the image of the bird serving as a first image input to a first network), the child may specify a feature region recognized in the bird image, thereby locking the part of the feature region not serving as a search region.
Illustratively, the searching in the first feature map through the second network based on the second image includes: first, obtaining a marked region in the first image, where the marked region may be a region marked by the user and is used to indicate a region in which the region search is not to be performed. The marked region may also be referred to as a locked region indicated by the user. Then, a fourth region in the first feature map corresponding to the marked region is determined. Since the marked region is marked on the first image while the region search is performed on the first feature map, the region of the first feature map corresponding to the marked region in the first image, i.e. the fourth region described above, needs to be determined. Specifically, the process of determining the fourth region in the first feature map may be to determine, based on the way in which the first network performs feature extraction, the region of the first feature map related to the marked region in the first image, and to take that region as the fourth region. Finally, a third region in the first feature map is searched through the second network based on the second image, where the third region is the region of the first feature map excluding the fourth region. After the fourth region is obtained, the region of the first feature map other than the fourth region can be determined as the third region, and the search for the key regions is performed on the third region, thereby achieving the purpose of narrowing the search range.
It will be appreciated that the two implementations described above may also be used in combination. For example, after the background region in the first image is identified by the algorithm, the background region in the first image is determined as the default locked region; the user may continue to mark on the first image to mark other areas that are not locked, resulting in a locked area selected by the user. Thus, the first image includes the default locked region and the user-selected locked region. When the area search is performed on the first feature map, the area search is not performed in the area corresponding to the default locked area and the locked area selected by the user.
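A minimal sketch of how the default locked region and the user-selected locked region might be combined into a single mask of searchable cells; the boolean-mask representation is an assumption for illustration and not part of this embodiment.

    import numpy as np

    def build_search_mask(shape, background_mask=None, user_locked_mask=None):
        # Combine the default locked region (e.g. produced by a background-detection
        # algorithm) with the locked region marked by the user; only cells where the
        # returned mask is True are used as search candidates.
        searchable = np.ones(shape, dtype=bool)
        if background_mask is not None:
            searchable &= ~background_mask            # exclude the detected background
        if user_locked_mask is not None:
            searchable &= ~user_locked_mask           # exclude regions locked by the user
        return searchable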
For example, referring to fig. 10 (a), fig. 10 (a) is a schematic diagram illustrating a region search performed based on a locked region according to an embodiment of the present application. As shown in fig. 10 (a), the first image is an image of a bird, and after the first image is identified by a background identification algorithm, most of the background region in the first image is identified, and the identified background region is determined as a default locked region. On the first image in which the default locked region is determined, the user further marks other regions, i.e., the back and chest of the bird, and determines the region marked by the user as the locked region selected by the user. Thus, the first image includes the default locked region and the user-selected locked region. When the area search is performed on the first feature map corresponding to the first image, the area search is not performed in the area corresponding to the default locked area and the locked area selected by the user. As can be seen from the output result diagram on the right side of fig. 10 (a), the regions corresponding to the regions in the first image are all outside the default locked region and the locked region selected by the user.
For the same image, when the locked area changes, the output result changes accordingly. For example, referring to fig. 10 (b), fig. 10 (b) is a schematic diagram illustrating comparison of performing a region search based on a locked region according to an embodiment of the present application. As shown in fig. 10 (b), in the case where the locked regions are different in the first image, the regions marked in the output result map are also different.
In addition to reducing the search range of the feature map by identifying the background region in the first image and acquiring the mark region in the first image, the search range of the feature map may also be reduced by other methods, which is not specifically limited in this embodiment.
It is understood that the second region on the second feature map may include a plurality of regions, and each region has a different contribution to the change of the classification result of the replaced second feature map to the classification result of the second image. For some regions containing more features or regions containing more obvious features, the probability of being predicted as the classification result of the second image is improved greatly after the second feature map replaces the regions. For some regions containing fewer features or regions containing less obvious features, the probability of being predicted as a classification result of the second image is less increased after the second feature map replaces the regions.
Based on this, in one possible embodiment, the regions may be sorted and displayed according to the contribution degrees of the different regions to the change of the classification result of the second feature map, so as to highlight the corresponding importance degrees of the different regions.
For example, the second region of the second feature map may include a plurality of sub-regions. The obtaining, according to the second region, of the region corresponding to the second region in the first image specifically includes: first, for the plurality of sub-regions in the second feature map, obtaining a probability corresponding to each of the plurality of sub-regions, where the probability is the increase in the probability that the second feature map is predicted as the category of the second image after the sub-region is replaced. For example, for any sub-region in the second feature map, assuming that the probability that the second feature map is predicted as the category of the second image before replacing the sub-region is s, and the probability after replacing the sub-region is n, the probability corresponding to the sub-region is n - s.
Then, the probabilities corresponding to the sub-regions are sorted from high to low to obtain a sorting result. For example, if the probability corresponding to the first sub-region is 0.3, the probability corresponding to the second sub-region is 0.1, and the probability corresponding to the third sub-region is 0.2, the following sorting result can be obtained: first sub-region > third sub-region > second sub-region.
Finally, a plurality of regions corresponding to the plurality of sub-regions are obtained in the first image according to the sorting result and the plurality of sub-regions, where the plurality of regions have different marking modes, and the marking modes are related to the sorting result. The different marking modes may include marking with marking frames of different colors, marking with marking frames of different shapes, or marking with marking frames of different line thicknesses; the marking manner is not limited in this embodiment.
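A minimal sketch of sorting sub-regions by their probability gain; the dictionary representation of sub-regions is an assumption for illustration.

    def rank_subregions(prob_gains):
        # Sort sub-regions by the probability increase (n - s) they contribute,
        # highest first; `prob_gains` maps a sub-region identifier to its gain.
        return sorted(prob_gains.items(), key=lambda kv: kv[1], reverse=True)

    # Example matching the text: first > third > second.
    print(rank_subregions({"first": 0.3, "second": 0.1, "third": 0.2}))
    # [('first', 0.3), ('third', 0.2), ('second', 0.1)]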
Illustratively, as shown in fig. 10 (a), the bird in the first image is most distinguished from the bird in the second image by the regions of the crown, mouth, eyes, and chest, and the degree of distinction is ordered as crown, mouth, eyes, chest. That is, the difference between the crowns of the two birds is greatest, followed by the mouths, then the eyes, and finally the chests. After the regions are sorted by their importance, they may be marked with boxes of different colors based on the sorting result. For example, the area where the crown is located may be marked with a red marking box; the area where the mouth is located may be marked with a yellow marking box; the area where the eyes are located may be marked with a blue marking box; and the area where the chest is located may be marked with a green marking box.
For convenience of understanding, the image processing method provided by the embodiment of the present application will be described in detail below with reference to specific examples. For example, referring to fig. 11 and 12, fig. 11 is a schematic flowchart of an image processing method according to an embodiment of the present application; fig. 12 is a schematic diagram of an example of a search key area provided in an embodiment of the present application. As shown in fig. 11, the flow of searching for the key area includes the following steps S1 to S9.
S1, inputting an original image (namely the first image) and a comparison image (namely the second image) into a counterfactual interpretation network, wherein the counterfactual interpretation network comprises the first network and the second network.
In addition, a locked region, such as a default locked region or a user-selected locked region, may also be included in the original image.
And S2, extracting a plurality of feature maps corresponding to the original image through the counterfactual interpretation network, wherein the resolutions of the plurality of feature maps corresponding to the original image are different from each other. In addition, a plurality of feature maps corresponding to the comparison image can be extracted through the same network structure in the counterfactual interpretation network that extracts the feature maps of the original image, and the resolutions of the plurality of feature maps corresponding to the comparison image are also different from each other. In short, each of the plurality of feature maps corresponding to the original image has a feature map of the comparison image with the same resolution.
And S3, searching for a key region (such as the key region 1 shown in FIG. 12) in the last-layer feature map among the plurality of feature maps corresponding to the original image, wherein the last-layer feature map is the feature map extracted by the last feature extraction layer and is the feature map with the smallest resolution among the plurality of feature maps corresponding to the original image. For the plurality of feature extraction layers that extract feature maps of the original image, the resolution of the extracted feature map gradually increases from the last feature extraction layer toward the first feature extraction layer. For the specific search process, refer to step 702 above; details are not repeated here.
In the case where the locked region is included in the original image, when searching for a key region in the last-layer feature map, the search for the key region is not performed in the locked region.
And S4, after obtaining the key area of the last-layer feature map, determining a mapping area (such as the mapping area 1 shown in FIG. 12) in the penultimate-layer feature map based on the key area of the last-layer feature map, wherein the mapping area in the penultimate-layer feature map is determined based on the key area of the last-layer feature map. The penultimate-layer feature map is the feature map extracted by the penultimate feature extraction layer.
And S5, continuing searching for a key area (such as the key area 2 shown in the figure 12) in the mapping area of the penultimate layer feature map. For a specific search process, reference may be made to step 704 described above, which is not described herein again.
And S6, judging whether a termination condition is met, wherein the termination condition may be that the resolution of the current feature map reaches a second preset resolution, or that the area ratio between the key area and the mapping area in the current feature map is greater than a preset threshold. If the termination condition is met, step S7 is executed; if the termination condition is not met, step S8 is executed.
And S7, outputting the area corresponding to the key area in the original image. When the termination condition is met, the area corresponding to the key area in the original image is calculated and output according to the key area in the finally obtained feature map.
And S8, if the termination condition is not met, continuing to determine a mapping area (such as the mapping area 2 shown in FIG. 12) in the next-layer feature map, for example, a mapping area in the third-to-last, fourth-to-last, or fifth-to-last feature map.
S9, after the mapping area of that feature map is determined, searching for a key area (such as the key area 3 shown in fig. 12) in the mapping area of that feature map, and returning to step S6.
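The following is a compact sketch of steps S3 to S9. The helpers search_key_region (standing in for the second network's region search) and map_region_up (standing in for the receptive-field mapping between adjacent layers), the representation of regions as sets of feature-map cells, and the termination thresholds are all assumptions made here for illustration, not details of the embodiment.

```python
def coarse_to_fine_search(feature_maps, contrast_maps, search_key_region,
                          map_region_up, second_preset_resolution, ratio_threshold):
    """Layer-by-layer key-region search (steps S3 to S9).

    feature_maps / contrast_maps are lists of (C, H, W) arrays ordered from
    high to low resolution, so the last entry is the last-layer feature map.
    Regions are sets of (row, col) cells on the corresponding feature map.
    """
    layer = len(feature_maps) - 1
    # S3: search the whole last-layer feature map (no mapping restriction yet).
    region = search_key_region(feature_maps[layer], contrast_maps[layer], search_area=None)
    while layer > 0:
        # S4/S8: map the key region into the next, higher-resolution layer.
        mapped = map_region_up(region, from_layer=layer, to_layer=layer - 1)
        layer -= 1
        # S5/S9: search only inside the mapping region of this layer.
        region = search_key_region(feature_maps[layer], contrast_maps[layer], search_area=mapped)
        # S6: stop when the resolution or the key-area / mapping-area ratio
        # reaches its preset limit.
        h, w = feature_maps[layer].shape[-2:]
        if min(h, w) >= second_preset_resolution or len(region) / len(mapped) > ratio_threshold:
            break
    # S7: the caller converts `region` back into an area of the original image.
    return layer, region
```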
Referring to fig. 13, fig. 13 is a schematic diagram of searching for a key region layer by layer according to an embodiment of the present disclosure. As shown in fig. 13, the first network includes a plurality of convolutional layers, a plurality of pooling layers, and a plurality of fully-connected layers. After the first image is input into the first network, different feature maps are respectively output by each convolution layer and each pooling layer in the first network. Therefore, in the process of performing the key area search, the pooling layer 4 is used as the last feature extraction layer, and the key area search is performed on the feature map extracted by the pooling layer 4. After obtaining the key region in the feature map extracted by the pooling layer 4, determining a mapping region of the key region in the feature map extracted by the pooling layer 4 in the convolutional layer 4-3, and continuing to perform a search of the key region in the mapping region. By analogy, the key areas are searched layer by layer from the pooling layer 4 until the termination condition is met.
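As a rough illustration of how the multi-resolution feature maps of fig. 13 can be collected, the sketch below defines a small stand-in for the first network with alternating convolution and pooling layers and keeps each pooling output. The layer sizes, channel counts, and class count are assumptions made here and are not those of the embodiment.

```python
import torch
import torch.nn as nn

class FirstNetworkSketch(nn.Module):
    """Toy stand-in for the first network: each pooling output is kept as one
    of the multi-resolution feature maps used by the key-region search."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 1/2 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 1/4 resolution
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 1/8 resolution
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        fmaps = []
        for layer in self.features:
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                fmaps.append(x)                      # keep each pooling output
        logits = self.classifier(x.mean(dim=(2, 3))) # global average pooling
        return logits, fmaps                         # ordered from high to low resolution

# Usage: three feature maps of decreasing resolution for a 224x224 input image.
net = FirstNetworkSketch()
logits, fmaps = net(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in fmaps])  # [(1,64,112,112), (1,128,56,56), (1,256,28,28)]
```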
Referring to fig. 14, fig. 14 is a schematic diagram illustrating a comparison of key regions according to an embodiment of the present disclosure. As shown in fig. 14, when the key area is obtained by searching the feature maps with different resolutions and then the key area is converted into a corresponding area in the original image, the size and the position of the corresponding area in the original image may also be different. For example, when a key region is searched in the feature map extracted by the pooling layer 4 and the key region is converted into a corresponding region in the original image, the corresponding region in the original image is larger (i.e., the region marked by the mark frame in the original image is larger), and the corresponding region in the original image includes other non-distinctive features in addition to the distinctive features. When a key area is searched in the feature map extracted by the pooling layer 2 and the key area is converted into a corresponding area in the original image, the corresponding area in the original image is smaller (i.e. the area marked by the marking frame in the original image is smaller). Wherein, the resolution of the feature map extracted by the pooling layer 4 is smaller than that of the feature map extracted by the pooling layer 2. That is, the higher the resolution of the feature map, the more accurate the final output result is.
In addition, compared with the method of directly searching the key region on the feature map with the specific resolution, the method of searching the key region layer by layer in the mapping region based on the embodiment can effectively improve the calculation speed and consume less memory, so that a user can spend less calculation resources and time to obtain an output result.
Illustratively, let t1 be the time for synthesizing a feature map (i.e., the time required to replace a region in the feature map), and t2 be the time for reasoning on a feature map, i.e., the time for obtaining a classification result of a feature map.
Then, the time taken to search for the key area on the feature map can be calculated by equation 2.
t = N × R² × t1 + N × R² / B × t2        (Formula 2)
Wherein t is the time required for performing the key area search on a feature map, N is the number of key areas obtained by searching the feature map, t1 is the time for synthesizing a feature map, t2 is the time for inference on a feature map, R is the size (side length) of the feature map, R² is the square of R, and B is the number of feature maps in one inference batch.
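Formula 2 can be written as a small helper, used below to compare the two search strategies; the function name and argument names are illustrative only.

```python
def search_time(n_regions, fmap_size, t1, t2, batch):
    """Formula 2: time to perform the key-area search on one feature map.

    n_regions: number of key areas found on the feature map (N)
    fmap_size: side length R of the feature map (or of the search area)
    t1: time to synthesize one feature map; t2: time to infer on one feature map
    batch: number of feature maps inferred per batch (B)
    """
    return n_regions * fmap_size ** 2 * t1 + n_regions * fmap_size ** 2 / batch * t2
```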
Taking fig. 14 as an example, it is assumed that the size of the feature map extracted from the pooling layer 4 is 7 × 7, and the size of the feature map extracted from the convolutional layer 4-3 is 14 × 14. Assume that the number of key regions extracted from pooling layer 4 is N1, the number of key regions extracted from convolutional layer 4-3 is N2, and the size of the receptive field from pooling layer 4 to convolutional layer 4-3 is 2 x 2.
The time required to search for the key region in the feature map of pooling layer 4 is then N1 × 7² × t1 + N1 × 7²/B × t2.
After the key region is searched in the feature map of pooling layer 4, the time required to search the key region in the mapping region of the feature map of convolutional layer 4-3 is N2 × (2×2)² × t1 + N2 × (2×2)²/B × t2.
That is, the total time required to obtain the key region in the feature map of convolutional layer 4-3 based on the method provided in this embodiment is: N1 × 7² × t1 + N2 × (2×2)² × t1 + N1 × 7²/B × t2 + N2 × (2×2)²/B × t2.
If the key region is searched for directly in the feature map of convolutional layer 4-3, the time taken is: N2 × 14² × t1 + N2 × 14²/B × t2.
In general, the inference time of the feature maps is usually negligible, since batch processing of the feature maps takes little time. Therefore, the total time taken for the key area search is mainly affected by the time required for synthesizing (replacing regions of) the feature maps.
In most cases, the values of N1 and N2 do not differ much, so N1 × 7² × t1 + N2 × (2×2)² × t1 is much smaller than N2 × 14² × t1. In other words, compared with directly searching for the key area on the specified feature map, the method provided in this embodiment of the present application searches for the key area layer by layer within the mapping areas, and can effectively reduce the calculation time.
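Reusing the search_time helper sketched after Formula 2, the comparison can be checked with some purely hypothetical timings; the numbers below are assumptions for illustration only.

```python
# Hypothetical values: t1, t2 in seconds, B the batch size, N1/N2 key-area counts.
t1, t2, B = 0.002, 0.010, 32
N1, N2 = 5, 5

# Layer-by-layer: a full search on the 7x7 map, then a search restricted to the
# 2x2 receptive-field mapping of the key areas on the 14x14 map.
layered = search_time(N1, 7, t1, t2, B) + search_time(N2, 2 * 2, t1, t2, B)

# Direct search on the 14x14 feature map.
direct = search_time(N2, 14, t1, t2, B)

print(f"layer-by-layer: {layered:.2f}s, direct: {direct:.2f}s")
# With these illustrative numbers the layered search is roughly three times faster.
```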
Referring to fig. 15, fig. 15 is an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus includes an acquisition unit 1501 and a processing unit 1502.
The processing unit 1502 is configured to process the first image through the first network to obtain a plurality of feature maps, where the plurality of feature maps include a first feature map and a second feature map, and a resolution of the second feature map is greater than a resolution of the first feature map; the processing unit 1502 is further configured to search the first feature map through a second network based on a second image to obtain a first region in the first feature map, where the first region is used to change a classification result of the first feature map, the classification result of the first image is different from that of the second image, a classification result of a third image is the same as that of the second image, and the third image is the first feature map obtained by replacing the first region with a corresponding region in the second image; the acquiring unit 1501 is configured to acquire a mapping region of the first region in the second feature map, where the content of the first region is related to the mapping region; the processing unit 1502 is further configured to search the mapping region through the second network based on the second image to obtain a second region in the second feature map, where the second region is used to change the classification result of the first feature map; the processing unit 1502 is further configured to obtain, according to the second area, an area in the first image corresponding to the second area.
In one possible implementation, the first feature map is a feature map with the smallest resolution among the plurality of feature maps.
In one possible implementation, the second feature map is the feature map, among the plurality of feature maps, whose resolution is closest to that of the first feature map; or, the resolution of the second feature map is a first preset resolution; or, one or more of the plurality of feature maps lie between the second feature map and the first feature map, and the resolution of the one or more feature maps is greater than that of the first feature map and less than that of the second feature map.
In one possible implementation manner, the obtaining unit 1501 is configured to: acquire the convolution kernel size and stride (step length) of the feature extraction layer that outputs the first feature map; and determine the mapping area in the second feature map according to the first area and the convolution kernel size and stride of the feature extraction layer.
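As a rough sketch of this mapping, the function below expands each cell of a low-resolution feature map into the cells of the higher-resolution feature map that fall inside its receptive field for a single convolution or pooling layer; dilation is ignored, and the function and argument names are illustrative only.

```python
def mapping_region(region_cells, kernel_size, stride, padding, high_res_shape):
    """Map cells of a low-resolution feature map to the corresponding mapping
    area on the higher-resolution feature map that produced them."""
    h, w = high_res_shape
    mapped = set()
    for (r, c) in region_cells:
        top, left = r * stride - padding, c * stride - padding
        for dr in range(kernel_size):
            for dc in range(kernel_size):
                rr, cc = top + dr, left + dc
                if 0 <= rr < h and 0 <= cc < w:   # clip to the higher-resolution map
                    mapped.add((rr, cc))
    return mapped

# A 2x2 pooling with stride 2: cell (3, 5) of a 7x7 map maps to the 2x2 block
# at rows 6-7, columns 10-11 of the 14x14 map.
print(sorted(mapping_region({(3, 5)}, kernel_size=2, stride=2, padding=0,
                            high_res_shape=(14, 14))))
```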
In a possible implementation manner, the processing unit 1502 is further configured to: repeatedly executing the area searching step until a termination condition is met to obtain a target area; outputting a region corresponding to the target region in the first image; the area searching step includes: acquiring a mapping region of a region in the ith feature map in the (i + 1) th feature map, wherein the region in the ith feature map is used for changing the classification result of the ith feature map; based on the second image, performing region search in a mapping region in the (i + 1) th feature map through the second network to obtain a region in the (i + 1) th feature map, wherein the region in the (i + 1) th feature map is used for changing a classification result of the (i + 1) th feature map; adding 1 to the value of i; the target area is an area in the (i + 1) th feature map when a termination condition is met, the resolution of the (i) th feature map is smaller than that of the (i + 1) th feature map, the i is greater than or equal to 2, and when the i is equal to 2, the (i) th feature map is the second feature map.
In one possible implementation manner, the termination condition includes that the resolution of the (i + 1) th feature map reaches a second preset resolution.
In one possible implementation manner, the termination condition includes that an area ratio between a region in the (i + 1) th feature map and a mapping region in the (i + 1) th feature map is greater than a preset threshold.
In a possible implementation manner, the processing unit 1502 is further configured to: processing the second image through the first network to obtain a third feature map, wherein the resolution of the third feature map is the same as that of the first feature map; replacing a partial region in the first feature map with a partial region in the third feature map, so that the classification result of the first feature map after replacing the partial region is the same as the classification result of the second image; and determining the replaced area in the first feature map after the partial area is replaced as the first area.
In a possible implementation manner, the obtaining unit 1501 is further configured to obtain a non-background region in the first feature map, where the non-background region is a region of the first feature map other than a background region; the processing unit 1502 is further configured to search for a non-background region in the first feature map through the second network based on the second image.
In a possible implementation manner, the acquiring unit 1501 is further configured to acquire a mark region in the first image; the processing unit 1502 is further configured to search, based on the second image, for a third area in the first feature map through the second network, where the third area is an area in the first feature map except for a fourth area, and the fourth area is an area in the first feature map corresponding to the labeled area.
In a possible implementation manner, the processing unit 1502 is further configured to determine, according to a position of the second region in the second feature map, a resolution of the second feature map, and a resolution of the first image, a region in the first image corresponding to the second region.
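A minimal sketch of this coordinate conversion is given below, assuming a uniform spatial scaling between the second feature map and the first image; the region representation and function name are illustrative only.

```python
def region_in_image(region_cells, fmap_resolution, image_resolution):
    """Convert a region of the second feature map into the corresponding pixel
    rectangle of the first image by scaling with the ratio between the image
    resolution and the feature-map resolution."""
    fh, fw = fmap_resolution
    ih, iw = image_resolution
    rows = [r for r, _ in region_cells]
    cols = [c for _, c in region_cells]
    top = min(rows) * ih // fh
    bottom = (max(rows) + 1) * ih // fh
    left = min(cols) * iw // fw
    right = (max(cols) + 1) * iw // fw
    return top, bottom, left, right   # pixel bounding box in the first image

# A key region on a 14x14 feature map mapped back onto a 224x224 image.
print(region_in_image({(6, 10), (7, 11)}, (14, 14), (224, 224)))  # (96, 128, 160, 192)
```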
In one possible implementation, the second region includes a plurality of sub-regions; the processing unit 1502 is further configured to: obtain a probability corresponding to each sub-region in the plurality of sub-regions, where the probability is the increase in the probability that the second feature map is predicted as the category of the second image after the sub-region is replaced; sort the probabilities corresponding to the sub-regions from high to low to obtain a sorting result; and obtain, according to the sorting result and the plurality of sub-regions, a plurality of regions corresponding to the plurality of sub-regions in the first image, where the plurality of regions have different marking modes, and the marking modes are related to the sorting result.
Referring to fig. 16, fig. 16 is a schematic structural diagram of an execution device provided in the embodiment of the present application. The execution device 1600 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. The execution device 1600 may be deployed with the image processing apparatus described in the embodiment corresponding to fig. 15, and is configured to implement the image processing functions of the embodiment corresponding to fig. 15. Specifically, the execution device 1600 includes: a receiver 1601, a transmitter 1602, a processor 1603, and a memory 1604 (the number of processors 1603 in the execution device 1600 may be one or more; one processor is taken as an example in fig. 16), where the processor 1603 may include an application processor 16031 and a communication processor 16032. In some embodiments of the present application, the receiver 1601, the transmitter 1602, the processor 1603, and the memory 1604 may be connected by a bus or other means.
The memory 1604 may include both read-only memory and random access memory, and provides instructions and data to the processor 1603. A portion of the memory 1604 may also include non-volatile random access memory (NVRAM). The memory 1604 stores operating instructions, executable modules, or data structures, or a subset or an expanded set thereof, where the operating instructions may include various operating instructions for implementing various operations.
Processor 1603 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 1603 or implemented by the processor 1603. The processor 1603 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by hardware integrated logic circuits or by instructions in software form in the processor 1603. The processor 1603 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 1603 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 1604, and the processor 1603 reads the information in the memory 1604 and completes the steps of the above method in combination with its hardware.
The receiver 1601 may be used to receive input numeric or alphanumeric information and generate signal inputs related to performing device related settings and function control. The transmitter 1602 may be configured to output numeric or character information via a first interface; the transmitter 1602 is also operable to send instructions to the disk pack via the first interface to modify data in the disk pack; the transmitter 1602 may also include a display device such as a display screen.
In the embodiment of the present application, in one case, the processor 1603 is configured to execute the image processing method in the corresponding embodiment of fig. 7.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device performs the image processing method described in the above embodiments, or the chip in the training device performs the image processing method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 17, fig. 17 is a schematic structural diagram of a chip provided in the embodiment of the present application. The chip may be embodied as a neural-network processing unit (NPU) 1700. The NPU 1700 is mounted onto a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 1703, and the controller 1704 controls the arithmetic circuit 1703 to extract matrix data from the memory and perform multiplication.
In some implementations, the arithmetic circuit 1703 internally includes a plurality of processing elements (PEs). In some implementations, the arithmetic circuit 1703 is a two-dimensional systolic array. The arithmetic circuit 1703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1703 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the corresponding data of the matrix B from the weight memory 1702 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit fetches the matrix a data from the input memory 1701, performs matrix arithmetic on the matrix a data and the matrix B data, and stores a partial result or a final result of the matrix in an accumulator (accumulator) 1708.
The unified memory 1706 is used for storing input data and output data. The weight data is directly transferred to the weight memory 1702 through a direct memory access controller (DMAC) 1705. Input data is also carried into the unified memory 1706 through the DMAC.
The bus interface unit (BIU) 1717 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1709.
The bus interface unit 1717 is used by the instruction fetch buffer 1709 to obtain instructions from an external memory, and is further used by the memory unit access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1706 or to transfer weight data to the weight memory 1702 or to transfer input data to the input memory 1701.
The vector calculation unit 1707 includes a plurality of operation processing units, and, if necessary, performs further processing on the output of the arithmetic circuit 1703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully connected layer computation in the neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 1707 can store the processed output vector to the unified memory 1706. For example, the vector calculation unit 1707 may apply a linear function or a non-linear function to the output of the arithmetic circuit 1703, for example, performing linear interpolation on the feature planes extracted by the convolutional layers, or applying a non-linear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1707 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1703, for example, for use in subsequent layers of the neural network.
An instruction fetch buffer 1709 connected to the controller 1704, configured to store instructions used by the controller 1704;
the unified memory 1706, input memory 1701, weight memory 1702, and instruction fetch memory 1709 are On-Chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, which may be specifically implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and the specific hardware structures for implementing the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is preferable in more cases. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the procedures or functions described in the embodiments of the present application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a training device or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Claims (11)

1. An image processing method, characterized by comprising:
processing the first image through a first network to obtain a plurality of feature maps, wherein the plurality of feature maps comprise a first feature map and a second feature map, and the resolution of the second feature map is greater than that of the first feature map;
searching in the first feature map through a second network based on a second image to obtain a first region in the first feature map, wherein the first region is used for changing a classification result of the first feature map, the classification result of the first image is different from that of the second image, a classification result of a third image is the same as that of the second image, and the third image is the first feature map obtained by replacing the first region with a corresponding region in the second image;
acquiring a mapping area of the first area in the second feature map, wherein the content of the first area is related to the mapping area;
searching in the mapping area through the second network based on the second image to obtain a second area in the second feature map, wherein the second area is used for changing the classification result of the first feature map;
and obtaining a region corresponding to the second region in the first image according to the second region.
2. The method of claim 1, wherein the first feature map is a feature map of the plurality of feature maps having a smallest resolution.
3. The method according to claim 1 or 2, wherein the second feature map is a feature map of the plurality of feature maps that is closest in resolution to the first feature map;
or, the resolution of the second feature map is a first preset resolution;
or one or more of the plurality of feature maps lie between the second feature map and the first feature map, and the resolution of the one or more feature maps is greater than that of the first feature map and less than that of the second feature map.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the mapping region of the first region in the second feature map comprises:
acquiring the size and step length of a convolution kernel of a feature extraction layer for outputting the first feature map;
and determining the mapping area in the second feature map according to the first area and the convolution kernel size and the step length of the feature extraction layer.
5. The method according to any one of claims 1 to 4, wherein the searching in the first feature map through a second network based on the second image to obtain the first region in the first feature map comprises:
processing the second image through the first network to obtain a third feature map, wherein the resolution of the third feature map is the same as that of the first feature map;
replacing a partial region in the first feature map with a partial region in the third feature map, so that the classification result of the first feature map after the partial region is replaced is the same as the classification result of the second image;
and determining the replaced area in the first feature map after the partial area is replaced as the first area.
6. The method according to any one of claims 1 to 5, wherein the searching in the first feature map through a second network based on the second image comprises:
acquiring a mark region in the first image;
and searching a third area in the first feature map through the second network based on the second image, wherein the third area is an area except a fourth area in the first feature map, and the fourth area is an area corresponding to the mark area in the first feature map.
7. The method according to any one of claims 1 to 6, wherein the obtaining, according to the second region, a region in the first image corresponding to the second region comprises:
and determining a region corresponding to the second region in the first image according to the position of the second region in the second feature map, the resolution of the second feature map and the resolution of the first image.
8. The method of any one of claims 1-7, wherein the second region comprises a plurality of sub-regions;
the obtaining, according to the second region, a region corresponding to the second region in the first image includes:
obtaining a probability corresponding to each sub-region in the plurality of sub-regions, wherein the probability is the increase in the probability that the second feature map is predicted as the category of the second image after the sub-region is replaced;
sequencing the probability corresponding to each subregion from high to low to obtain a sequencing result;
and obtaining a plurality of regions corresponding to the plurality of sub-regions in the first image according to the sorting result and the plurality of sub-regions, wherein the plurality of regions have different marking modes, and the marking modes are related to the sorting result.
9. An image processing apparatus comprising a memory and a processor; the memory stores code, the processor is configured to execute the code, and when executed, the image processing apparatus performs the method of any one of claims 1 to 8.
10. A computer storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 8.
11. A computer program product having stored thereon instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 8.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination