CN112967199A - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN112967199A
CN112967199A (application CN202110247373.9A)
Authority
CN
China
Prior art keywords
feature map
mask
result
analysis
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110247373.9A
Other languages
Chinese (zh)
Inventor
王诗吟
周强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110247373.9A priority Critical patent/CN112967199A/en
Publication of CN112967199A publication Critical patent/CN112967199A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/77Retouching; Inpainting; Scratch removal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method and apparatus. The method includes: acquiring an image to be processed, a first parsing mask, and a second parsing mask, where the image to be processed contains a target object, a partial region of the target object is occluded by an occluder, the first parsing mask is the parsing mask of the region of the target object not occluded by the occluder, and the second parsing mask is the parsing mask of the complete region of the target object; and inputting the image to be processed, the first parsing mask, and the second parsing mask into a neural network model to complete the content of the occluded region of the target object. The content of the region to be completed is thereby inferred accurately, which benefits tasks such as target tracking, target detection, and image segmentation.

Description

Image processing method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus.
Background
During image acquisition, an image is easily damaged or corrupted by noise under the influence of factors such as illumination, equipment, and algorithms, so that its content cannot be expressed correctly. Image inpainting has therefore become an important preprocessing step for computer vision tasks, and its quality directly affects the results of tasks such as target tracking, target detection, and image segmentation.
At present, image repair usually relies on a neural network model to complete the image. Such a model typically infers the content to be completed from the features of the surrounding region; for example, if the left leg of a human body is occluded, the model infers it from the appearance features of the right leg. However, the model often cannot accurately infer the content to be completed, which hinders tasks such as target tracking, target detection, and image segmentation.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides an image processing method and apparatus.
In a first aspect, the present disclosure provides an image processing method, including:
acquiring an image to be processed, a first parsing mask, and a second parsing mask, where the image to be processed contains a target object, a partial region of the target object is occluded by an occluder, the first parsing mask is the parsing mask of the region of the target object not occluded by the occluder, and the second parsing mask is the parsing mask of the complete region of the target object;
and inputting the image to be processed, the first parsing mask, and the second parsing mask into a neural network model, and completing, based on an attention mechanism, the content of the region of the target object that is occluded by the occluder.
With the method provided by the first aspect, the parsing mask of the unoccluded region of the target object and the parsing mask of its complete region are introduced into the attention mechanism, so that attention is focused on the target object. The content of the occluded region can then be accurately inferred from the features of the unoccluded region in the image to be processed and the features of the predicted complete region. The occluded content is thereby completed, the image with the occluded region filled in can be output accurately, and tasks such as target tracking, target detection, and image segmentation can be carried out reliably.
In one possible design, inputting the image to be processed, the first parsing mask, and the second parsing mask into the neural network model and completing, based on the attention mechanism, the content of the occluded region of the target object includes: processing the image to be processed through a plurality of network layers in the neural network model and completing the content of the occluded region based on the attention mechanism; where the processing procedure of at least one of the network layers includes: receiving a first feature map output by the previous network layer, the first feature map being obtained from the image to be processed, the first parsing mask, and the second parsing mask; obtaining a second feature map from the first feature map, the first parsing mask, and the second parsing mask, the second feature map being the feature of the target object after the content of the occluded region is enhanced, where the first feature map, the first parsing mask, and the second parsing mask have the same spatial size and the first and second parsing masks have the same number of channels; and performing convolution on the second feature map and passing the convolved feature map to the next network layer.
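The per-layer processing above can be sketched as follows. This is a minimal NumPy stand-in with hypothetical shapes: the attention-based enhancement is modeled as a single channel-mixing step, and the convolutions as 1x1 convolutions — it illustrates the data flow, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in) -- a 1x1 convolution is a per-pixel channel mix
    return np.tensordot(w, x, axes=([1], [0]))

def layer_step(feat, mask_visible, mask_full, w_enhance, w_out):
    """One network-layer step: take the first feature map `feat`, enhance it
    with the two parsing masks (stand-in for the attention-based enhancement
    that yields the second feature map), then convolve before the next layer."""
    stacked = np.concatenate([feat, mask_visible, mask_full], axis=0)  # channel-wise stack
    second_feature_map = conv1x1(stacked, w_enhance)   # stand-in for the enhancement step
    return conv1x1(second_feature_map, w_out)          # convolution before the next layer

C, K, H, W = 8, 20, 16, 16   # hypothetical channel counts and spatial size
feat = rng.standard_normal((C, H, W))
m1 = rng.integers(0, 2, (K, H, W)).astype(float)                   # first parsing mask (visible region)
m2 = np.maximum(m1, rng.integers(0, 2, (K, H, W)).astype(float))   # full region contains the visible one
w_enh = 0.1 * rng.standard_normal((C, C + 2 * K))
w_out = 0.1 * rng.standard_normal((C, C))
out = layer_step(feat, m1, m2, w_enh, w_out)
print(out.shape)  # (8, 16, 16): same shape as the input feature map, ready for the next layer
```

Note how the feature map, mask spatial sizes, and mask channel counts must agree, exactly as the design requires.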
In one possible design, the plurality of network layers includes a first network layer and a second network layer; the first feature maps received by the two layers differ in size, the first parsing masks used in the processing of the two layers differ in size, and the second parsing masks used in the processing of the two layers differ in size.
In one possible design, obtaining the second feature map from the first feature map, the first parsing mask, and the second parsing mask includes: obtaining a first attention result from the first feature map, the first parsing mask, and the second parsing mask, where the first attention result represents the association between each spatial position of the target object and all spatial positions in the feature map corresponding to the image to be processed; and/or obtaining a second attention result from the first feature map, the first parsing mask, and the second parsing mask, where the second attention result represents the semantic information and depth features of the sub-component regions, within the occluded region of the target object, that correspond to the difference between the first parsing mask and the second parsing mask; and obtaining the second feature map from the first attention result and/or the second attention result.
In one possible design, obtaining the first attention result from the first feature map, the first parsing mask, and the second parsing mask includes: obtaining an associated feature map from the first feature map, the first parsing mask, and the second parsing mask, and taking the associated feature map as the first attention result, where the associated feature map represents the association between the unoccluded region of the target object and its complete region; stacking the associated feature map and the first feature map to obtain a first stacking result; and performing dimension reduction on the first stacking result to obtain the second feature map.
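A minimal sketch of this first-attention branch: the correlation feature map `assoc` is assumed to have been produced by the attention step described later, and the dimension reduction is modeled as a 1x1 convolution (both assumptions, not the patented implementation).

```python
import numpy as np

def first_attention_branch(feat, assoc, w_reduce):
    """Stack the associated feature map (first attention result) with the
    first feature map, then reduce channels to obtain the second feature map."""
    stacked = np.concatenate([assoc, feat], axis=0)          # first stacking result
    return np.tensordot(w_reduce, stacked, axes=([1], [0]))  # channel dimension reduction

rng = np.random.default_rng(1)
C, H, W = 8, 4, 4
feat = rng.standard_normal((C, H, W))
assoc = rng.standard_normal((C, H, W))   # associated feature map; same shape as feat here
w_reduce = 0.1 * rng.standard_normal((C, 2 * C))
second = first_attention_branch(feat, assoc, w_reduce)
print(second.shape)  # (8, 4, 4)
```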
In one possible design, obtaining the first attention result and the second attention result from the first feature map, the first parsing mask, and the second parsing mask includes: obtaining an associated feature map from the first feature map, the first parsing mask, and the second parsing mask, and taking it as the first attention result, where the associated feature map represents the association between the unoccluded region of the target object and its complete region; performing channel-number dimension reduction on the first feature map to obtain a reduced feature map whose number of channels equals that of the first parsing mask or the second parsing mask; dot-multiplying the reduced feature map with the first parsing mask to obtain a first dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the unoccluded region that correspond to the first parsing mask; dot-multiplying the reduced feature map with the second parsing mask to obtain a second dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the complete region that correspond to the second parsing mask; and stacking and convolving the two dot-multiplication results to obtain the second attention result. Obtaining the second feature map from the first attention result and the second attention result then includes: stacking the second attention result, the associated feature map, and the first feature map to obtain a second stacking result; and performing dimension reduction on the second stacking result to obtain the second feature map.
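The second-attention branch can be sketched like this: channel reduction and the final fusion are both modeled as 1x1 convolutions, and the "dot multiplication" as element-wise gating with each mask. Shapes and weights are hypothetical; this only illustrates the data flow.

```python
import numpy as np

def second_attention(feat, m1, m2, w_down, w_fuse):
    """Second attention result: reduce `feat` to the masks' channel count,
    gate it with each parsing mask (dot multiplication), then stack and
    convolve the two gated maps."""
    reduced = np.tensordot(w_down, feat, axes=([1], [0]))  # channel-number dimension reduction
    gated_visible = reduced * m1   # first dot-multiplication result (unoccluded sub-components)
    gated_full = reduced * m2      # second dot-multiplication result (complete-region sub-components)
    fused = np.concatenate([gated_visible, gated_full], axis=0)
    return np.tensordot(w_fuse, fused, axes=([1], [0]))    # 1x1 conv as the fusing convolution

rng = np.random.default_rng(2)
C, K, H, W = 8, 20, 4, 4   # hypothetical: K = parsing-mask channel count
feat = rng.standard_normal((C, H, W))
m1 = rng.integers(0, 2, (K, H, W)).astype(float)
m2 = np.maximum(m1, rng.integers(0, 2, (K, H, W)).astype(float))
w_down = 0.1 * rng.standard_normal((K, C))
w_fuse = 0.1 * rng.standard_normal((C, 2 * K))
att2 = second_attention(feat, m1, m2, w_down, w_fuse)
print(att2.shape)  # (8, 4, 4)
```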
In one possible design, obtaining the associated feature map from the first feature map, the first parsing mask, and the second parsing mask includes: stacking the first feature map and the first parsing mask to obtain a third stacking result; stacking the first feature map and the second parsing mask to obtain a fourth stacking result; and applying attention-mechanism processing to the third and fourth stacking results to obtain the associated feature map.
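One plausible reading of this "attention-mechanism processing" is dot-product (non-local) attention between the two stacks — the visible-region stack as queries and the full-region stack as keys. The sketch below makes that assumption explicit; the real design may use learned projections instead.

```python
import numpy as np

def associated_feature_map(feat, m1, m2):
    """Correlate the (feature map + first mask) stack with the
    (feature map + second mask) stack via dot-product attention."""
    C, H, W = feat.shape
    q = np.concatenate([feat, m1], axis=0).reshape(-1, H * W)  # third stacking result, as queries
    k = np.concatenate([feat, m2], axis=0).reshape(-1, H * W)  # fourth stacking result, as keys
    scores = q.T @ k                                # (HW, HW) affinities between spatial positions
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over spatial positions
    v = feat.reshape(C, H * W)
    return (v @ attn.T).reshape(C, H, W)            # attention-weighted features

rng = np.random.default_rng(3)
C, K, H, W = 4, 5, 3, 3   # hypothetical sizes
feat = rng.standard_normal((C, H, W))
m1 = rng.integers(0, 2, (K, H, W)).astype(float)
m2 = rng.integers(0, 2, (K, H, W)).astype(float)
assoc = associated_feature_map(feat, m1, m2)
print(assoc.shape)  # (4, 3, 3)
```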
In one possible design, obtaining the second attention result from the first feature map, the first parsing mask, and the second parsing mask includes: performing channel-number dimension reduction on the first feature map to obtain a reduced feature map whose number of channels equals that of the first parsing mask or the second parsing mask; dot-multiplying the reduced feature map with the first parsing mask to obtain a first dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the unoccluded region that correspond to the first parsing mask; dot-multiplying the reduced feature map with the second parsing mask to obtain a second dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the complete region that correspond to the second parsing mask; and stacking and convolving the two dot-multiplication results to obtain the second attention result. Obtaining the second feature map from the second attention result then includes: stacking the second attention result and the first feature map to obtain a fifth stacking result; and performing dimension reduction on the fifth stacking result to obtain the second feature map.
In one possible design, the first parsing mask and the second parsing mask take the form of a one-hot encoding.
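For example, an integer-labeled parsing map converts to the one-hot form as follows (a hypothetical 2x2 map with three labels, 0 denoting background):

```python
import numpy as np

labels = np.array([[0, 1],
                   [2, 1]])   # hypothetical parsing map
num_labels = 3
# Compare every pixel against every label index; channel c is 1 where labels == c.
one_hot = (labels[None, :, :] == np.arange(num_labels)[:, None, None]).astype(np.float32)
print(one_hot.shape)  # (3, 2, 2): one channel per label
print(one_hot[1])     # channel for label 1
```

Each pixel is 1 in exactly one channel, which is what makes the per-channel dot multiplications with the feature map select one sub-component region at a time.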
In a second aspect, the present disclosure provides an image processing apparatus comprising:
an acquisition module, configured to acquire an image to be processed, a first parsing mask, and a second parsing mask, where the image to be processed contains a target object, a partial region of the target object is occluded by an occluder, the first parsing mask is the parsing mask of the region of the target object not occluded by the occluder, and the second parsing mask is the parsing mask of the complete region of the target object;
and a processing module, configured to input the image to be processed, the first parsing mask, and the second parsing mask into a neural network model and to complete, based on an attention mechanism, the content of the region of the target object that is occluded by the occluder.
In one possible design, the processing module is specifically configured to process the image to be processed through a plurality of network layers in the neural network model and to complete the content of the occluded region of the target object based on the attention mechanism; where the processing procedure of at least one of the network layers includes: receiving a first feature map output by the previous network layer, the first feature map being obtained from the image to be processed, the first parsing mask, and the second parsing mask; obtaining a second feature map from the first feature map, the first parsing mask, and the second parsing mask, the second feature map being the feature of the target object after the content of the occluded region is enhanced, where the first feature map, the first parsing mask, and the second parsing mask have the same spatial size and the first and second parsing masks have the same number of channels; and performing convolution on the second feature map.
In one possible design, the plurality of network layers includes a first network layer and a second network layer; the first feature maps received by the two layers differ in size, the first parsing masks used in the processing of the two layers differ in size, and the second parsing masks used in the processing of the two layers differ in size.
In one possible design, the processing module is configured to obtain a first attention result from the first feature map, the first parsing mask, and the second parsing mask, where the first attention result represents the association between each spatial position of the target object and all spatial positions in the feature map corresponding to the image to be processed, and/or to obtain a second attention result from the first feature map, the first parsing mask, and the second parsing mask, where the second attention result represents the semantic information and depth features of the sub-component regions, within the occluded region of the target object, that correspond to the difference between the first parsing mask and the second parsing mask; and to obtain the second feature map from the first attention result and/or the second attention result.
In one possible design, the processing module is configured to obtain an associated feature map from the first feature map, the first parsing mask, and the second parsing mask, and to take it as the first attention result, where the associated feature map represents the association between the unoccluded region of the target object and its complete region; to stack the associated feature map and the first feature map to obtain a first stacking result; and to perform dimension reduction on the first stacking result to obtain the second feature map.
In one possible design, the processing module is configured to obtain an associated feature map from the first feature map, the first parsing mask, and the second parsing mask, and to take it as the first attention result, where the associated feature map represents the association between the unoccluded region of the target object and its complete region; to perform channel-number dimension reduction on the first feature map to obtain a reduced feature map whose number of channels equals that of the first parsing mask or the second parsing mask; to dot-multiply the reduced feature map with the first parsing mask to obtain a first dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the unoccluded region that correspond to the first parsing mask; to dot-multiply the reduced feature map with the second parsing mask to obtain a second dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the complete region that correspond to the second parsing mask; to stack and convolve the two dot-multiplication results to obtain the second attention result; to stack the second attention result, the associated feature map, and the first feature map to obtain a second stacking result; and to perform dimension reduction on the second stacking result to obtain the second feature map.
In one possible design, the processing module is configured to stack the first feature map and the first parsing mask to obtain a third stacking result; to stack the first feature map and the second parsing mask to obtain a fourth stacking result; and to apply attention-mechanism processing to the third and fourth stacking results to obtain the associated feature map.
In one possible design, the processing module is configured to perform channel-number dimension reduction on the first feature map to obtain a reduced feature map whose number of channels equals that of the first parsing mask or the second parsing mask; to dot-multiply the reduced feature map with the first parsing mask to obtain a first dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the unoccluded region that correspond to the first parsing mask; to dot-multiply the reduced feature map with the second parsing mask to obtain a second dot-multiplication result, which represents the semantic information and depth features of the sub-component regions of the complete region that correspond to the second parsing mask; to stack and convolve the two dot-multiplication results to obtain the second attention result; to stack the second attention result and the first feature map to obtain a fifth stacking result; and to perform dimension reduction on the fifth stacking result to obtain the second feature map.
In one possible design, the first parsing mask and the second parsing mask take the form of a one-hot encoding.
For the beneficial effects of the image processing apparatus provided in the second aspect and its possible designs, reference may be made to those of the first aspect and its possible embodiments, which are not repeated here.
In a third aspect, the present disclosure provides an electronic device, comprising: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke the program instructions in the memory to cause the electronic device to perform the image processing method of the first aspect and any one of the possible designs of the first aspect.
In a fourth aspect, the present disclosure provides a computer storage medium comprising computer instructions that, when run on an electronic device, cause the electronic device to perform the image processing method of the first aspect and any one of the possible designs of the first aspect.
In a fifth aspect, the present disclosure provides a computer program product for causing a computer to perform the image processing method of the first aspect and any one of the possible designs of the first aspect when the computer program product runs on the computer.
In a sixth aspect, the present disclosure provides a chip system including a processor; when the processor executes computer instructions stored in a memory, an electronic device performs the image processing method of the first aspect and any one of its possible designs.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below; those skilled in the art can derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of 19 parts of a human body according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
fig. 3 is a schematic structural diagram of a neural network model provided in an embodiment of the present disclosure;
FIG. 4A is a schematic diagram of an attention mechanism module provided in an embodiment of the present disclosure;
fig. 4B is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
FIG. 5A is a schematic diagram of an attention mechanism module provided in an embodiment of the present disclosure;
fig. 5B is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
FIG. 6A is a schematic diagram of an attention mechanism module provided in an embodiment of the present disclosure;
fig. 6B is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Illustratively, the present disclosure provides an image processing method, apparatus, device, computer storage medium, and computer program product that introduce the parsing masks of a target object to guide a neural network model to focus attention correctly on the target object's region to be completed, so that the content of that region is inferred accurately, benefiting tasks such as target tracking, target detection, and image segmentation.
The image processing method of the present disclosure is executed by an electronic device. The electronic device may be a tablet computer, a mobile phone (e.g., a folding-screen mobile phone or a large-screen mobile phone), a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a smart television, a smart screen, a high-definition television, a 4K television, a smart speaker, a smart projector, or another Internet of Things (IoT) device; the present disclosure does not limit the specific type of the electronic device.
A parsing mask of the target object may be understood as an encoding in which each sub-component region of the target object, after division, corresponds to a distinct number. For example, when the target object is a human body, the body is divided into N sub-component regions and human parsing (Human Parsing) is performed on them; that is, the N sub-component regions are labeled with the numbers 0 to N respectively to obtain the parsing mask of the human body, where N is a positive integer.
When N is 19, as shown in Fig. 1, the 19 sub-component regions of the human body may include: head a1, neck a2, left shoulder a3, right shoulder a4, left upper arm a5, right upper arm a6, left lower arm a7, right lower arm a8, left hand a9, right hand a10, left hip a11, right hip a12, left thigh a13, right thigh a14, left calf a15, right calf a16, left foot a17, right foot a18, and body a19; the human parsing mask is then an encoding in which each of the 19 sub-component regions corresponds to one number. The present disclosure is not limited to dividing the human body into 19 sub-component regions.
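As a toy illustration of such a labeling (a hypothetical 4x4 parsing mask where 0 is background and the part indices follow the list above):

```python
import numpy as np

# 0 = background; 1 = head (a1), 2 = neck (a2), 19 = body (a19) -- toy 4x4 parsing mask
parsing = np.array([
    [0,  1,  1, 0],
    [0,  2,  2, 0],
    [0, 19, 19, 0],
    [0, 19, 19, 0],
])
head_pixels = int((parsing == 1).sum())
labels_present = sorted(set(parsing.ravel().tolist()))
print(head_pixels)     # 2
print(labels_present)  # [0, 1, 2, 19]
```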
Based on the foregoing description, the image processing method provided by the present disclosure will be explained in detail by taking an electronic device as an example, and combining with the accompanying drawings and application scenarios.
Referring to fig. 2, fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the disclosure. As shown in fig. 2, the image processing method provided by the present disclosure may include:
s101, acquiring an image to be processed, a first analysis mask and a second analysis mask.
The electronic device may acquire an image to be processed. The image to be processed contains a target object, and a partial region of the target object is occluded by an occluder. The present disclosure does not limit parameters of the image to be processed such as its size, format, or content.
It should be noted that the electronic device obtains, as a complete input, the image to be processed together with the first parsing mask and the second parsing mask. The first parsing mask is the mask of the region of the target object not occluded by the occluder, and the second parsing mask is the mask of the complete region of the target object.
Based on the foregoing description, the present disclosure divides the complete area of the target object into an area of the target object that is not occluded by an occlusion and an area of the target object that is occluded by an occlusion.
The target object may include, but is not limited to, a human body, an animal, or an article. The present disclosure does not limit parameters such as the size, shape, and position of an area of the target object that is not blocked by the blocking object and an area of the target object that is blocked by the blocking object.
The electronic device analyzes the sub-component region included in the region of the target object not shielded by the shielding object in the image to be processed to obtain a first analysis mask, namely the first analysis mask is the analysis mask of the region of the target object not shielded by the shielding object.
Correspondingly, the electronic device analyzes the sub-component region included in the complete region of the target object in the image to be processed to obtain a second analysis mask, that is, the second analysis mask is the analysis mask of the complete region of the target object.
Wherein, the complete region of the target object in the image to be processed is a predicted region. In addition, the sub-component regions included in the region of the target object that is not occluded by the occlusion object and the sub-component regions included in the complete region of the target object follow the same division rule.
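The containment relation between the two analysis masks can be illustrated with a toy example. The binary occlusion map and the element-wise derivation of the first mask below are assumptions for illustration only; the disclosure obtains the complete region by prediction, not by this arithmetic:

```python
import numpy as np

# second_mask: one-hot parse of the predicted complete region, shape (H, W, N).
second_mask = np.zeros((4, 4, 2), dtype=np.float32)
second_mask[:, :2, 0] = 1.0   # left half labelled part 0
second_mask[:, 2:, 1] = 1.0   # right half labelled part 1

# occluded: assumed binary map of the pixels hidden by the occluder, shape (H, W).
occluded = np.zeros((4, 4), dtype=np.float32)
occluded[1:3, 1:3] = 1.0      # a 2x2 occluded patch

# The first analysis mask keeps only labels of the visible (non-occluded) region,
# so it is always a subset of the second analysis mask.
first_mask = second_mask * (1.0 - occluded)[..., None]
```

The difference `second_mask - first_mask` then marks exactly the sub-component pixels that must be completed, which is the quantity the second attention result later exploits.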
S102, inputting the image to be processed, the first analysis mask and the second analysis mask into the neural network model, and complementing the content of the area, shielded by the shielding object, of the target object based on the attention mechanism.
The electronic device may use the neural network model to focus the attention mechanism of the model on the target object based on the image to be processed, the first analysis mask and the second analysis mask, so as to complete the content of the region of the target object that is blocked by the blocking object, and to output the processed image in which that region has been completed.
According to the image processing method provided by the present disclosure, the analysis mask of the area of the target object that is not blocked by the blocking object and the analysis mask of the complete area of the target object are introduced into the attention mechanism, so that the attention mechanism can focus on the target object. The content of the blocked area can then be accurately inferred from the features of the unblocked area of the target object in the image to be processed and the features of the predicted complete area of the target object. This completes the content of the blocked area of the target object in the image to be processed, allows the completed image to be output accurately, and enables tasks such as target tracking, target detection and image segmentation to be performed reliably.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a neural network model according to an embodiment of the present disclosure. As shown in fig. 3, the neural network model 10 provided by the present disclosure may include: an input layer 11, an intermediate layer 12, an output layer 13, and an attention mechanism module 14. The attention mechanism module 14 is disposed between the input layer 11 and the intermediate layer 12, and/or the attention mechanism module 14 is disposed between the intermediate layer 12 and the output layer 13, and/or the attention mechanism module 14 is disposed between a plurality of intermediate layers 12.
Also, the present disclosure does not limit the number of intermediate layers 12 and attention mechanism modules 14. In some embodiments, the attention mechanism modules 14 are symmetrically disposed in the neural network model 10.
The input layer 11 is mainly configured to receive an image to be processed (schematically illustrated by a letter a1 in fig. 3), and output a feature map obtained by performing processing such as convolution on the image to be processed to the intermediate layer 12 or the attention mechanism module 14.
The intermediate layer 12 is mainly used for receiving the feature map from the input layer 11, the attention mechanism module 14 or the intermediate layer 12, and performing processing such as feature extraction, convolution, dimension reduction, dimension increase and the like on the feature map to obtain the feature map and outputting the feature map to the output layer 13, the attention mechanism module 14 or the intermediate layer 12.
The output layer 13 is mainly configured to receive the feature map from the intermediate layer 12 or the attention mechanism module 14, and perform processing such as convolution on the feature map to output a to-be-processed image (indicated by a letter a2 in fig. 3) which is obtained by filling up an area of the target object which is blocked by the blocking object.
The attention mechanism module 14 is configured to receive a first analysis mask (illustrated by a letter B1 or B2 in fig. 3) and a second analysis mask (illustrated by a letter C1 or C2 in fig. 3), receive a feature map from the input layer 11 or the intermediate layer 12, and output the feature map obtained by performing attention mechanism calculation on the feature map to the intermediate layer 12 or the output layer 13.
Illustratively, when the neural network model 10 includes a plurality of attention mechanism modules 14, such as a first attention mechanism module and a second attention mechanism module, assume that the first analysis mask received by the first attention mechanism module is B1 and the second analysis mask it receives is C1, while the first analysis mask received by the second attention mechanism module is B2 and the second analysis mask it receives is C2. Then the sizes of B1 and C1 are kept the same, the sizes of B2 and C2 are kept the same, the size of B1 differs from that of B2, and the size of C1 differs from that of C2. Thereby, the attention mechanism of the analysis masks can be fused into layers at different stages of the neural network model 10 at different scales.
Based on the foregoing description, the neural network model 10 may include a plurality of network layers. One network layer may be the input layer 11, one or more intermediate layers 12, or an attention mechanism module 14 together with the layer connected to its output side (i.e., the side near the output layer 13). The layer connected to the output side of the attention mechanism module 14 may be an intermediate layer 12 or the output layer 13. Moreover, the neural network model 10 includes at least one network layer that contains an attention mechanism module 14, that is, a network layer formed by the attention mechanism module 14 and the layer connected to its output side.
Therefore, the electronic equipment can process the image to be processed through a plurality of network layers, and continuously enhance the characteristics corresponding to the content of the area of the target object, which is shielded by the shielding object, so as to complement the content of the area of the target object, which is shielded by the shielding object.
For one network layer 1 of the plurality of network layers, the network layer 1 is a network layer including the attention mechanism module 14, and the network layer 1 is disposed between the network layer 0 and the network layer 2, the network layer 0 is a previous network layer of the network layer 1, and the network layer 2 is a next network layer of the network layer 1.
The network layer 1 may receive a first feature map output by the network layer 0, where the first feature map is obtained based on the image to be processed, the first parsing mask and the second parsing mask. The network layer 1 may obtain a second feature map based on the first feature map, the first analysis mask, and the second analysis mask, perform convolution processing on the second feature map, and transmit the feature map after convolution processing to the network layer 2 (i.e., the first feature map output from the network layer 1 to the network layer 2).
The second feature map represents the features of the target object after the content of the region blocked by the blocking object has been enhanced, so that the characteristics of the blocked region can be accurately inferred. The first feature map, the first analysis mask and the second analysis mask have the same size, and the first analysis mask and the second analysis mask have the same number of channels. Illustratively, the first analysis mask and the second analysis mask are in the form of a one-hot encoding.
Therefore, the network layer 1 can continuously enhance the feature corresponding to the content of the region of the target object which is blocked by the blocking object based on the output second feature map through the above process, so as to accurately infer the content of the region of the target object which is blocked by the blocking object based on the feature map.
In some embodiments, the neural network model 10 may include a first network layer and a second network layer capable of performing the above-described processes. When the size of the first feature map in the processing process of the first network layer is different from that of the first feature map in the processing process of the second network layer, the first parsing mask and the second parsing mask are scaled to the same size as the first feature map received by the network layer, and then input to the attention mechanism module 14.
For example, assume that both network layer 1 and network layer 2 can perform the above-described process, and that the size of the first feature map received by network layer 1 from network layer 0 is larger than the size of the first feature map received by network layer 2 from network layer 1. Since the first analysis mask, the second analysis mask and the first feature map in each network layer's processing must keep the same size, the first analysis mask used by network layer 1 is larger than the first analysis mask used by network layer 2, and the second analysis mask used by network layer 1 is larger than the second analysis mask used by network layer 2.
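Scaling a one-hot parsing mask down to a smaller feature-map size is typically done with nearest-neighbour sampling so that the entries stay exactly 0 or 1. A sketch; the resize rule is an assumption, since the disclosure does not specify the interpolation method:

```python
import numpy as np

def nearest_resize(mask, out_h, out_w):
    """Nearest-neighbour resize of a one-hot mask (H, W, N); keeps entries exactly 0/1."""
    h, w, _ = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows[:, None], cols[None, :], :]

# A 4x4 one-hot mask with 4 part channels, shrunk to match a 2x2 feature map.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
mask = np.eye(4, dtype=np.float32)[labels]      # (4, 4, 4) one-hot
small = nearest_resize(mask, 2, 2)
```

Bilinear interpolation would instead produce fractional values on region borders, breaking the one-hot property the attention modules rely on.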
Besides the above-mentioned way of dividing one of the network layers, the following dividing ways may also be adopted in the present disclosure: one of the network layers may be the attention mechanism module 14 and the layer connected to the input side (i.e., the side near the input layer 11) of the attention mechanism module 14, one or more intermediate layers 12, and the input layer 11. The layer connected to the input side of attention mechanism module 14 may be intermediate layer 12 or input layer 11. Also, the neural network model 10 includes at least one network layer including the attention mechanism module 14 and a layer connected to an input side of the attention mechanism module 14.
For one network layer 1 of the plurality of network layers, the network layer 1 is a network layer including the attention mechanism module 14, and the network layer 1 is disposed between the network layer 0 and the network layer 2, the network layer 0 is a previous network layer of the network layer 1, and the network layer 2 is a next network layer of the network layer 1.
The network layer 1 may receive a second feature map output by the network layer 0, where the second feature map is obtained based on the image to be processed, the first parsing mask and the second parsing mask. The network layer 1 performs convolution processing on the second feature map to obtain a first feature map, obtains a second feature map based on the first feature map, the first analysis mask and the second analysis mask, and transmits the second feature map to the network layer 2 (i.e., the second feature map output from the network layer 1 to the network layer 2).
In addition, the attention mechanism module 14 may be provided separately rather than being bound to other layers; that is, at least one attention mechanism module 14 may be disposed between any two layers.
Based on the foregoing description, the electronic device may utilize the attention mechanism module 14 in at least one network layer to obtain the second feature map according to the first feature map, the first parsing mask and the second parsing mask in various implementations.
The electronic device may utilize the attention mechanism module 14 to obtain the first attention result and/or the second attention result according to the first feature map, the first resolution mask, and the second resolution mask.
The first attention result is used to represent the association relationship between each spatial position point of the target object and all spatial position points in the feature map corresponding to the image to be processed. A spatial position point here is a point in the feature map (H × W × C, where H represents the height, W the width, and C the number of channels); its coordinates are denoted (i, j) and its size is 1 × 1 × C.
In the actual calculation process, the attention mechanism module 14 may calculate the association relationship between each spatial position point and all spatial position points in the feature map corresponding to the image to be processed, covering both the target object and the background region. The attention mechanism module 14 may then focus attention on the target object based on the first and second analysis masks to obtain the first attention result.
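The association between spatial position points described above is, in essence, a pairwise affinity matrix over the H × W points. A minimal sketch of one common concrete form, softmax-normalised dot products in the style of non-local attention; this exact formula is an assumption, as the disclosure does not fix it:

```python
import numpy as np

def spatial_association(feat):
    """Softmax-normalised dot-product affinity between every pair of spatial
    position points (each point is one 1x1xC vector of an (H, W, C) feature map)."""
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)                  # one row per spatial position point
    logits = flat @ flat.T                         # dot product of every pair of points
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)        # rows sum to 1

rng = np.random.default_rng(0)
attn = spatial_association(rng.standard_normal((3, 3, 8)))
```

Row (i, j) of the result gives the strength with which point (i, j) attends to every other point, which is the raw material the masks then restrict to the target object.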
And the second attention result is used for representing semantic information and depth features of a sub-component region corresponding to the difference between the first analysis mask and the second analysis mask in the region of the target object which is blocked by the blocking object.
The semantic information here can be understood as the specific position of the sub-component region, such as the head or left foot of the human body. The depth feature herein may be understood as representing the content of the sub-component region after being filled in the image to be processed, such as the content of the left foot after being filled in the image to be processed.
The attention mechanism module 14 may compare semantic information and depth features of the sub-component regions included in the region of the target object that is occluded by the occlusion based on the difference between the first resolution mask and the second resolution mask to obtain a second attention result.
Thus, the electronic device can utilize the attention mechanism module 14 to obtain the second feature map based on the first attention result and/or the second attention result.
In the following, three possible implementations in which the electronic device can obtain the second feature map by using the attention mechanism module 14 are described with reference to fig. 4A to 4B, fig. 5A to 5B, and fig. 6A to 6B. For convenience of explanation, in fig. 4A to 4B, 5A to 5B, and 6A to 6B, the first feature map is denoted F_i^in, the first analysis mask M_m^p, the second analysis mask M_a^p, and the second feature map F_i^out; the channel number dimension reduction processing is illustrated as a 1 × 1 convolution, the convolution processing as a 3 × 3 convolution, the dot multiplication processing as a dot in a circle, and the stacking processing as a cross in a circle.
Referring to fig. 4A and 4B, fig. 4A is a schematic diagram of an attention mechanism module according to an embodiment of the disclosure, and fig. 4B is a schematic flowchart of an image processing method according to an embodiment of the disclosure. As shown in fig. 4A-4B, the electronic device of the present disclosure may utilize the attention mechanism module 14 to perform the following steps:
S201, obtaining an associated feature map according to the first feature map, the first analysis mask and the second analysis mask, and taking the associated feature map as the first attention result.
The associated feature map corresponds to the first attention result, that is, the associated feature map has the same meaning as the first attention result, and is therefore taken as the first attention result. The associated feature map can be used to represent the association relationship between the area of the target object that is not occluded by the occlusion object and the complete area of the target object, from which the specific position and content of the occluded area of the target object can be inferred.
S202, stacking the associated feature map and the first feature map to obtain a first stacking result. In this way, the partial region missing from the target object is determined, so that the features in the first feature map associated with the content of the occluded region of the target object can be inferred.
S203, performing channel number dimension reduction processing on the first stacking result to obtain the second feature map. In this way, the first feature map and the second feature map keep the same spatial dimensions, so the attention mechanism module 14 does not disturb the rest of the network and can be seamlessly fused into layers of the neural network model 10 at different stages; the current network layer does not need to adjust the spatial dimensions of the second feature map, which simplifies the process of outputting the first feature map to the next network layer.
Based on the above description, the electronic device may introduce a spatial location point attention mechanism based on the analysis mask by using the attention mechanism module 14 to obtain a second feature map representing the content-enhanced feature of the region of the target object that is occluded by the occlusion.
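The steps S201 to S203 can be sketched end to end. The construction of the associated feature map in S201 is an assumed form (masked non-local attention), and the 1 × 1 convolution weights are random placeholders; only the shapes and the overall data flow reflect the disclosure:

```python
import numpy as np

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map over channels; weight: (C_in, C_out)."""
    return x @ weight

def attention_block_v1(feat, mask1, mask2, rng):
    """Sketch of S201-S203: association map -> stack with input -> channel reduction."""
    h, w, c = feat.shape
    # S201 (assumed form): non-local attention over spatial position points, with the
    # union of the two parsing masks used to focus attention on the target object.
    focus = np.clip(mask1.sum(-1) + mask2.sum(-1), 0.0, 1.0)    # (H, W)
    flat = (feat * focus[..., None]).reshape(h * w, c)
    logits = flat @ flat.T
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    assoc = (attn @ flat).reshape(h, w, c)                      # associated feature map
    # S202: stack the associated feature map with the first feature map.
    stacked = np.concatenate([assoc, feat], axis=-1)            # (H, W, 2C)
    # S203: a 1x1 conv brings the channel count back to C, so the block is
    # shape-preserving and can be dropped between any two layers.
    w_red = (rng.standard_normal((2 * c, c)) * 0.1).astype(np.float32)
    return conv1x1(stacked, w_red)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 6)).astype(np.float32)
m1 = np.zeros((4, 4, 3), np.float32); m1[:, :2, 0] = 1.0       # visible region
m2 = np.zeros((4, 4, 3), np.float32); m2[:, :, 0] = 1.0        # complete region
out = attention_block_v1(feat, m1, m2, rng)
```

The shape-preserving property (`out.shape == feat.shape`) is exactly what S203 motivates: the module can be inserted at any stage without forcing the next layer to adapt.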
Referring to fig. 5A and 5B, fig. 5A is a schematic diagram of an attention mechanism module according to an embodiment of the disclosure, and fig. 5B is a schematic flowchart of an image processing method according to an embodiment of the disclosure. As shown in fig. 5A-5B, the electronic device of the present disclosure may utilize the attention mechanism module 14 to perform the following steps:
S301, obtaining an associated feature map according to the first feature map, the first analysis mask and the second analysis mask, and taking the associated feature map as the first attention result.
The specific implementation manner of step S301 may refer to the description of step S201 in fig. 4B, which is not described herein again.
S302, performing channel number dimension reduction processing on the first feature map to obtain a dimension reduction feature map, so that the channel number of the dimension reduction feature map is kept the same as the channel number of the first analysis mask or the second analysis mask, and the processing steps of the attention mechanism module 14 are reduced.
For example, assuming that the number of channels of the first parsing mask or the second parsing mask is 19, the number of channels of the dimension-reduced feature map is usually greater than 19. Therefore, the attention mechanism module 14 executes step S302 to reduce the channel number of the dimension-reduced feature map to 19.
And S303, performing dot product processing on the dimension reduction feature map and the first analysis mask to obtain a first dot product result.
Wherein the first point multiplication result corresponds to the second attention result. The first point multiplication result is used for representing semantic information and depth features of a sub-component region corresponding to the first analysis mask in a region of the target object which is not blocked by the blocking object. The attention mechanism module 14 may determine semantic information and depth features of sub-component regions included in the region of the target object that is not occluded by the occlusion, i.e., a first point product, based on the first parsing mask.
And S304, performing dot product processing on the dimension reduction feature map and the second analysis mask to obtain a second dot product result.
Wherein the second dot product result corresponds to the second attention result. The second dot product result is used to represent semantic information and depth features of a sub-component region corresponding to the second parsing mask in the complete region of the target object. The attention mechanism module 14 may determine semantic information and depth features of the sub-component regions included in the complete region of the target object, i.e., a second dot product result, based on the second parsing mask.
In addition, steps S303 and S304 may be executed in any order: step S303 may be executed before step S304, step S304 may be executed before step S303, or the two steps may be executed simultaneously.
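Steps S302 to S304 can be sketched directly: reduce the channels to the mask's channel count, then dot-multiply with each mask. The 1 × 1 convolution weight below is a random placeholder, and the toy masks (N = 3 instead of 19) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
feat = rng.standard_normal((4, 4, 8)).astype(np.float32)

# S302: a 1x1 conv (random weight, for illustration) reduces the channel count
# to N = 3, matching the number of parse channels of the toy masks below.
w_red = (rng.standard_normal((8, 3)) * 0.1).astype(np.float32)
reduced = feat @ w_red                                           # (4, 4, 3)

mask1 = np.zeros((4, 4, 3), np.float32); mask1[:, :2, 1] = 1.0   # visible region, part 1
mask2 = np.zeros((4, 4, 3), np.float32); mask2[:, :, 1] = 1.0    # complete region, part 1

# S303/S304: per-element dot multiplication keeps only the response of each
# parse channel inside its own sub-component region.
first_dot = reduced * mask1
second_dot = reduced * mask2
```

Because the masks are one-hot, each channel of `first_dot` and `second_dot` carries the reduced feature response of exactly one sub-component region, which is how the dot results encode per-part semantic information.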
And S305, stacking and convolving the first point multiplication result and the second point multiplication result to obtain a second attention result.
S306, stacking the second attention result, the associated feature map and the first feature map to obtain a second stacking result.
In this way, the partial area missing from the target object is determined, that is, the occluded area of the target object is isolated, so that the features in the first feature map associated with the content of that area can be inferred.
Including the first feature map adds redundancy to the process, making the second stacking result more accurate. Alternatively, in step S306, the first feature map may be omitted, and the second attention result and the associated feature map may be stacked to obtain the second stacking result.
S307, performing channel number dimension reduction processing on the second stacking result to obtain the second feature map. In this way, the first feature map and the second feature map keep the same spatial dimensions, the attention mechanism module does not disturb the operation of the current network, and the current network layer does not need to adjust the spatial dimensions of the second feature map, which simplifies the process of outputting the first feature map to the next network layer.
For example, the number of channels of the first feature map is C, and the number of channels of the second stacked result is 2C + 38. Therefore, the attention mechanism module 14 executes step S307, and the channel number of the second stacking result may be reduced to C.
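The channel arithmetic of S305 to S307 can be checked numerically. The zero tensors below only track shapes, and treating the second attention result as keeping 19 + 19 = 38 channels after S305 is an assumption consistent with the 2C + 38 count stated above:

```python
import numpy as np

# Shape-only bookkeeping for S305-S307 (zero tensors; values do not matter here).
h, w, c, n = 4, 4, 16, 19
feat = np.zeros((h, w, c), np.float32)            # first feature map, C channels
assoc = np.zeros((h, w, c), np.float32)           # associated feature map, C channels
# Assumed: second attention result keeps 19 + 19 = 38 channels after S305.
second_attention = np.zeros((h, w, 2 * n), np.float32)

# S306: stacking yields C + C + 38 = 2C + 38 channels.
second_stack = np.concatenate([second_attention, assoc, feat], axis=-1)

# S307: a 1x1 conv (zero weight here, shapes only) reduces the channels back to C.
w_out = np.zeros((2 * c + 2 * n, c), np.float32)
second_feature = second_stack @ w_out
```

For C = 16 this gives 2 × 16 + 38 = 70 stacked channels reduced back to 16, matching the bookkeeping in the text.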
Based on the above description, the electronic device may utilize the attention mechanism module 14 to introduce an analysis mask-based spatial location point attention mechanism and an analysis mask-based channel number attention mechanism to obtain a second feature map representing the content-enhanced feature of the region of the target object that is occluded by the obstruction.
In step S201 in fig. 4B or step S301 in fig. 5B, the electronic device may use the attention mechanism module 14 to obtain the associated feature map according to the first feature map, the first parsing mask and the second parsing mask in various implementations.
In some embodiments, the electronic device may utilize the attention mechanism module 14 to stack the first feature map and the first parsing mask to obtain a third stacking result, so as to infer a feature associated with the content of the region of the target object that is not occluded by the occlusion in the first feature map.
Correspondingly, the electronic device may utilize the attention mechanism module 14 to stack the first feature map and the second parsing mask to obtain a fourth stacking result, so as to infer a feature associated with the content of the complete region of the target object in the first feature map.
Therefore, the electronic device may perform attention mechanism processing on the third stacking result and the fourth stacking result by using the attention mechanism module 14 to obtain the associated feature map, so as to infer the feature associated with the content of the region of the target object, which is occluded by the occlusion object, in the first feature map.
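The third and fourth stacking results described above are plain channel-wise concatenations of the first feature map with each mask. A shape-only sketch with toy sizes (C = 8, N = 3 are illustrative assumptions):

```python
import numpy as np

feat = np.zeros((4, 4, 8), np.float32)     # first feature map, C = 8 channels
mask1 = np.zeros((4, 4, 3), np.float32)    # first analysis mask, N = 3 channels
mask2 = np.zeros((4, 4, 3), np.float32)    # second analysis mask, N = 3 channels

# Third stacking result: features paired with the visible-region parse.
third_stack = np.concatenate([feat, mask1], axis=-1)    # (H, W, C + N)
# Fourth stacking result: features paired with the complete-region parse.
fourth_stack = np.concatenate([feat, mask2], axis=-1)   # (H, W, C + N)
```

Attention processing over these two stacks then relates the visible-region features to the complete-region features, yielding the associated feature map.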
Referring to fig. 6A and fig. 6B, fig. 6A is a schematic diagram of an attention mechanism module according to an embodiment of the disclosure, and fig. 6B is a schematic flowchart of an image processing method according to an embodiment of the disclosure. As shown in fig. 6A-6B, the electronic device of the present disclosure may utilize the attention mechanism module 14 to perform the following steps:
S401, performing channel number dimension reduction processing on the first feature map to obtain a dimension-reduced feature map, so that the number of channels of the dimension-reduced feature map is kept the same as the number of channels of the first analysis mask or the second analysis mask, which reduces the processing steps of the attention mechanism module 14.
S402, performing dot multiplication processing on the dimension reduction feature map and the first analysis mask to obtain a first dot multiplication result. The specific implementation manner of step S402 can refer to the description of step S303 in fig. 5B, which is not described herein again. And S403, performing dot product processing on the dimension reduction feature map and the second analysis mask to obtain a second dot product result. The specific implementation manner of step S403 may refer to the description of step S304 in fig. 5B, which is not described herein again.
S404, stacking and convolution processing are carried out on the first point multiplication result and the second point multiplication result, and a second attention result is obtained.
S405, stacking the second attention result and the first feature map to obtain a fifth stacking result.
In this way, the partial area missing from the target object is determined, so that the features in the first feature map associated with the content of the occluded region of the target object can be inferred.
And S406, performing channel number dimension reduction processing on the fifth stacking result to obtain the second feature map. In this way, the first feature map and the second feature map keep the same spatial dimensions, the attention mechanism module does not disturb the operation of the current network, and the current network layer does not need to adjust the spatial dimensions of the second feature map, which simplifies the process of outputting the first feature map to the next network layer.
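The whole third variant (S401 to S406) can be sketched end to end. All convolution weights are random placeholders and the 1 × 1 convolution stands in for the 3 × 3 one in S404; only the shapes and data flow mirror the described steps:

```python
import numpy as np

def attention_block_v3(feat, mask1, mask2, rng):
    """Sketch of S401-S406: channel-number attention only (no spatial association)."""
    h, w, c = feat.shape
    n = mask1.shape[-1]
    # S401: 1x1 conv reduces the channels to N (random weight, for illustration).
    reduced = feat @ (rng.standard_normal((c, n)) * 0.1).astype(np.float32)
    # S402/S403: dot-multiply the reduced map with each analysis mask.
    first_dot = reduced * mask1
    second_dot = reduced * mask2
    # S404: stack the two dot results and convolve (1x1 conv stands in for 3x3).
    pair = np.concatenate([first_dot, second_dot], axis=-1)          # (H, W, 2N)
    second_attention = pair @ (rng.standard_normal((2 * n, 2 * n)) * 0.1).astype(np.float32)
    # S405: stack the second attention result with the first feature map.
    fifth_stack = np.concatenate([second_attention, feat], axis=-1)  # (H, W, 2N + C)
    # S406: reduce the channels back to C, keeping the module shape-preserving.
    return fifth_stack @ (rng.standard_normal((2 * n + c, c)) * 0.1).astype(np.float32)

rng = np.random.default_rng(2)
feat = rng.standard_normal((4, 4, 8)).astype(np.float32)
m1 = np.zeros((4, 4, 3), np.float32); m1[:, :2, 0] = 1.0    # visible region
m2 = np.ones((4, 4, 3), np.float32) / 3                     # complete region (toy)
out = attention_block_v3(feat, m1, m2, rng)
```

Compared with the first two variants, this one omits the spatial association map entirely, relying only on the per-channel mask attention.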
Based on the above description, the electronic device may use the attention mechanism module 14 to introduce a channel-number attention mechanism based on the analysis masks, so as to obtain a second feature map representing the enhanced features of the content of the region of the target object occluded by the obstruction.

Illustratively, the present disclosure further provides an image processing apparatus.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure. The image processing apparatus of the present disclosure may be disposed in an electronic device and can implement the operations of the electronic device in the image processing method of the embodiments of fig. 1 to 6B described above. As shown in fig. 7, the image processing apparatus 700 provided by the present disclosure may include: an acquisition module 701 and a processing module 702.
The acquiring module 701 is configured to acquire an image to be processed, a first analysis mask and a second analysis mask, where the image to be processed includes a target object, a partial region of the target object is blocked by a blocking object, the first analysis mask is an analysis mask of a region of the target object that is not blocked by the blocking object, and the second analysis mask is an analysis mask of a complete region of the target object;
the processing module 702 is configured to input the image to be processed, the first analysis mask and the second analysis mask into the neural network model, and complement the content of the region of the target object that is blocked by the blocking object based on the attention mechanism.
In some embodiments, the processing module 702 is specifically configured to process the image to be processed through a plurality of network layers in the neural network model, and supplement the content of the area of the target object that is blocked by the blocking object based on the attention mechanism; wherein, the processing procedure of at least one network layer in the plurality of network layers comprises: receiving a first feature map output by a previous network layer, wherein the first feature map is obtained based on an image to be processed, a first analysis mask and a second analysis mask; obtaining a second feature map according to the first feature map, the first analysis mask and the second analysis mask, wherein the second feature map is a feature of the target object after the content of the region shielded by the shielding object is enhanced, the sizes of the first feature map, the first analysis mask and the second analysis mask are the same, and the number of channels of the first analysis mask and the number of channels of the second analysis mask are the same; and performing convolution processing on the second feature map.
In some embodiments, the plurality of network layers includes a first network layer and a second network layer, the first feature map received by the first network layer has a different size than the first feature map received by the second network layer, the first resolution mask during processing by the first network layer has a different size than the first resolution mask during processing by the second network layer, and the second resolution mask during processing by the first network layer has a different size than the second resolution mask during processing by the second network layer.
In one possible design, the processing module 702 is configured to obtain a first attention result according to the first feature map, the first analysis mask and the second analysis mask, where the first attention result is used to indicate an association relationship between each spatial position point and all spatial position points of the target object in the feature map corresponding to the image to be processed, and/or obtain a second attention result according to the first feature map, the first analysis mask and the second analysis mask, where the second attention result is used to indicate semantic information and depth features of a sub-component region corresponding to a difference between the first analysis mask and the second analysis mask in a region of the target object that is blocked by the blocking object; and obtaining a second feature map according to the first attention result and/or the second attention result.
In some embodiments, the processing module 702 is configured to: obtain an associated feature map according to the first feature map, the first analysis mask and the second analysis mask, and use the associated feature map as the first attention result, where the associated feature map represents the association relationship between the area of the target object that is not shielded by the shielding object and the complete area of the target object; stack the associated feature map and the first feature map to obtain a first stacking result; and perform dimension reduction processing on the first stacking result to obtain the second feature map.
In some embodiments, the processing module 702 is configured to: obtain an associated feature map according to the first feature map, the first analysis mask and the second analysis mask, and use the associated feature map as the first attention result, where the associated feature map represents the association relationship between the area of the target object that is not shielded by the shielding object and the complete area of the target object; perform channel-number dimension reduction processing on the first feature map to obtain a dimension-reduction feature map, where the number of channels of the dimension-reduction feature map is the same as that of the first analysis mask or the second analysis mask; perform dot multiplication on the dimension-reduction feature map and the first analysis mask to obtain a first dot multiplication result, which represents semantic information and depth features of the sub-component region corresponding to the first analysis mask in the area of the target object that is not shielded by the shielding object; perform dot multiplication on the dimension-reduction feature map and the second analysis mask to obtain a second dot multiplication result, which represents semantic information and depth features of the sub-component region corresponding to the second analysis mask in the complete area of the target object; stack and convolve the first dot multiplication result and the second dot multiplication result to obtain a second attention result; stack the second attention result, the associated feature map and the first feature map to obtain a second stacking result; and perform dimension reduction processing on the second stacking result to obtain the second feature map.
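The second-attention branch of this embodiment (channel reduction, per-mask dot multiplication, stacking and convolution) can be sketched as below in NumPy, using 1x1 convolutions for the reduction and mixing steps. The weight shapes `w_reduce` and `w_mix` are hypothetical, and the surrounding associated-feature-map step is omitted.

```python
import numpy as np

def conv1x1(w, x):
    # 1x1 convolution as a channel-mixing matmul; x: (C, H, W), w: (C_out, C).
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(-1, h, wd)

def second_attention(first_fm, vis_mask, full_mask, w_reduce, w_mix):
    # Reduce first_fm to the masks' channel count K.
    reduced = conv1x1(w_reduce, first_fm)        # (K, H, W)
    # Dot-multiply with each analysis mask: per-sub-component features of the
    # visible region and of the complete region, respectively.
    first_dot = reduced * vis_mask
    second_dot = reduced * full_mask
    # Stack the two results along channels and convolve them together.
    stacked = np.concatenate([first_dot, second_dot], axis=0)  # (2K, H, W)
    return conv1x1(w_mix, stacked)               # (K, H, W) second attention result

fm = np.random.rand(8, 16, 16)
vis = np.random.rand(5, 16, 16); full = np.random.rand(5, 16, 16)
attn = second_attention(fm, vis, full, np.random.rand(5, 8), np.random.rand(5, 10))
print(attn.shape)  # (5, 16, 16)
```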
In some embodiments, the processing module 702 is configured to: stack the first feature map and the first analysis mask to obtain a third stacking result; stack the first feature map and the second analysis mask to obtain a fourth stacking result; and perform attention mechanism processing on the third stacking result and the fourth stacking result to obtain the associated feature map.
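The associated feature map of this embodiment is produced by relating the (feature map + visible mask) stack to the (feature map + full mask) stack. A sketch using plain scaled dot-product attention follows; the description only says "attention mechanism processing", so the softmax formulation here is an assumption, not the disclosed design.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def associated_feature_map(first_fm, vis_mask, full_mask):
    # Third stacking result: feature map + mask of the un-shielded region.
    q = np.concatenate([first_fm, vis_mask], axis=0)    # (C+K, H, W)
    # Fourth stacking result: feature map + mask of the complete region.
    kv = np.concatenate([first_fm, full_mask], axis=0)  # (C+K, H, W)
    d, h, w = q.shape
    qf, kf = q.reshape(d, h * w), kv.reshape(d, h * w)
    # Every spatial position of the visible stack attends to every position
    # of the complete stack (the position-to-position association relationship).
    weights = softmax(qf.T @ kf / np.sqrt(d), axis=1)   # (H*W, H*W)
    return (weights @ kf.T).T.reshape(d, h, w)

fm = np.random.rand(8, 8, 8)
vis = np.random.rand(5, 8, 8); full = np.random.rand(5, 8, 8)
assoc = associated_feature_map(fm, vis, full)
print(assoc.shape)  # (13, 8, 8)
```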
In some embodiments, the processing module 702 is configured to: perform channel-number dimension reduction processing on the first feature map to obtain a dimension-reduction feature map, where the number of channels of the dimension-reduction feature map is the same as that of the first analysis mask or the second analysis mask; perform dot multiplication on the dimension-reduction feature map and the first analysis mask to obtain a first dot multiplication result, which represents semantic information and depth features of the sub-component region corresponding to the first analysis mask in the area of the target object that is not shielded by the shielding object; perform dot multiplication on the dimension-reduction feature map and the second analysis mask to obtain a second dot multiplication result, which represents semantic information and depth features of the sub-component region corresponding to the second analysis mask in the complete area of the target object; stack and convolve the first dot multiplication result and the second dot multiplication result to obtain a second attention result; stack the second attention result and the first feature map to obtain a fifth stacking result; and perform dimension reduction processing on the fifth stacking result to obtain the second feature map.
In one possible design, the first analysis mask and the second analysis mask take the form of a one-hot encoding.
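A one-hot analysis mask can be built from an integer part-label map, one channel per sub-component, so that each pixel is active in exactly one channel. A short sketch follows; the label convention (background = 0) is an assumption for illustration.

```python
import numpy as np

def to_one_hot(label_map, num_parts):
    # label_map: (H, W) integer sub-component labels; returns (num_parts, H, W).
    return (np.arange(num_parts)[:, None, None] == label_map[None]).astype(np.float32)

labels = np.array([[0, 1],
                   [2, 1]])          # 0 = background, 1..K-1 = parts
one_hot = to_one_hot(labels, 3)
print(one_hot.shape)                 # (3, 2, 2)
print(one_hot.sum(axis=0))           # each pixel is 1 in exactly one channel
```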
The image processing apparatus provided by the present disclosure can implement the above method embodiments; for the specific implementation principles and technical effects, reference may be made to the above method embodiments, and details are not repeated here.
Illustratively, the present disclosure provides an electronic device comprising: one or more processors; a memory; and one or more computer programs; wherein the one or more computer programs are stored in the memory; the one or more processors, when executing the one or more computer programs, cause the electronic device to implement the image processing methods of the foregoing embodiments.
Illustratively, the present disclosure provides a chip system applied to an electronic device that includes a memory and a sensor; the chip system includes a processor configured to perform the image processing method of the foregoing embodiments.
Illustratively, the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it causes an electronic device to implement the image processing method of the foregoing embodiments.
Illustratively, the present disclosure provides a computer program product which, when run on a computer, causes the computer to perform the image processing method of the foregoing embodiments.
In the above-described embodiments, all or part of the functions may be implemented by software, hardware, or a combination of software and hardware. When implemented in software, the functions may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the disclosure are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed, a first analysis mask and a second analysis mask, wherein the image to be processed comprises a target object, a partial region of the target object is shielded by a shielding object, the first analysis mask is an analysis mask of a region of the target object which is not shielded by the shielding object, and the second analysis mask is an analysis mask of a complete region of the target object;
inputting the image to be processed, the first analysis mask and the second analysis mask into a neural network model, and complementing, based on an attention mechanism, the content of the area of the target object that is shielded by the shielding object.
2. The method according to claim 1, wherein inputting the image to be processed, the first analysis mask and the second analysis mask into a neural network model, and complementing the content of the region of the target object occluded by the occlusion object based on an attention mechanism comprises:
processing the image to be processed through a plurality of network layers in the neural network model, and complementing the content of the area of the target object, which is shielded by the shielding object, based on an attention mechanism;
wherein the processing procedure of at least one of the plurality of network layers comprises:
receiving a first feature map output by a previous network layer, wherein the first feature map is obtained based on the image to be processed, the first analysis mask and the second analysis mask;
obtaining a second feature map according to the first feature map, the first analysis mask and the second analysis mask, wherein the second feature map is a feature of the target object after content enhancement of the area shielded by the shielding object, the first feature map, the first analysis mask and the second analysis mask are the same in size, and the first analysis mask and the second analysis mask are the same in channel number;
and carrying out convolution processing on the second feature map, and transmitting the feature map after the convolution processing to a next network layer.
3. The method of claim 2, wherein the plurality of network layers comprises a first network layer and a second network layer, wherein the first feature map received by the first network layer is different in size from the first feature map received by the second network layer, wherein the first analysis mask during processing by the first network layer is different in size from the first analysis mask during processing by the second network layer, and wherein the second analysis mask during processing by the first network layer is different in size from the second analysis mask during processing by the second network layer.
4. The method according to claim 2 or 3, wherein obtaining a second feature map according to the first feature map, the first analysis mask and the second analysis mask comprises:
obtaining a first attention result according to the first feature map, the first analysis mask and the second analysis mask, wherein the first attention result is used for representing the association relationship between each spatial position point and all spatial position points of the target object in the feature map corresponding to the image to be processed, and/or obtaining a second attention result according to the first feature map, the first analysis mask and the second analysis mask, wherein the second attention result is used for representing semantic information and depth features of the sub-component region corresponding to the difference between the first analysis mask and the second analysis mask in the area of the target object that is shielded by the shielding object;
and obtaining the second feature map according to the first attention result and/or the second attention result.
5. The method of claim 4, wherein obtaining a first attention result according to the first feature map, the first analysis mask and the second analysis mask comprises:
obtaining a correlation feature map according to the first feature map, the first analysis mask and the second analysis mask, and taking the correlation feature map as the first attention result, wherein the correlation feature map is used for representing the correlation relationship between the area of the target object which is not covered by the shielding object and the complete area of the target object;
stacking the associated feature map and the first feature map to obtain a first stacking result;
and performing dimension reduction processing on the first stacking result to obtain the second feature map.
6. The method of claim 4, wherein obtaining a first attention result and a second attention result according to the first feature map, the first analysis mask and the second analysis mask comprises:
obtaining a correlation feature map according to the first feature map, the first analysis mask and the second analysis mask, and taking the correlation feature map as the first attention result, wherein the correlation feature map is used for representing the correlation relationship between the area of the target object which is not covered by the shielding object and the complete area of the target object;
performing channel number dimension reduction processing on the first feature map to obtain a dimension reduction feature map, wherein the channel number of the dimension reduction feature map is the same as the channel number of the first analysis mask or the second analysis mask;
performing dot multiplication on the dimension reduction feature map and the first analysis mask to obtain a first dot multiplication result, wherein the first dot multiplication result is used for representing semantic information and depth features of a sub-component region corresponding to the first analysis mask in a region of the target object which is not shielded by the shielding object;
performing dot multiplication on the dimension reduction feature map and the second analysis mask to obtain a second dot multiplication result, wherein the second dot multiplication result is used for representing semantic information and depth features of a sub-component region corresponding to the second analysis mask in the complete region of the target object;
stacking and convolving the first dot multiplication result and the second dot multiplication result to obtain a second attention result;
the obtaining the second feature map according to the first attention result and the second attention result includes:
stacking the second attention result, the associated feature map and the first feature map to obtain a second stacking result;
and performing dimension reduction processing on the second stacking result to obtain the second feature map.
7. The method according to claim 5 or 6, wherein obtaining the correlation feature map according to the first feature map, the first analysis mask and the second analysis mask comprises:
stacking the first feature map and the first analysis mask to obtain a third stacking result;
stacking the first feature map and the second analysis mask to obtain a fourth stacking result;
and performing attention mechanism processing on the third stacking result and the fourth stacking result to obtain the correlation feature map.
8. The method of claim 4, wherein obtaining a second attention result according to the first feature map, the first analysis mask and the second analysis mask comprises:
performing channel number dimension reduction processing on the first feature map to obtain a dimension reduction feature map, wherein the channel number of the dimension reduction feature map is the same as the channel number of the first analysis mask or the second analysis mask;
performing dot multiplication on the dimension reduction feature map and the first analysis mask to obtain a first dot multiplication result, wherein the first dot multiplication result is used for representing semantic information and depth features of a sub-component region corresponding to the first analysis mask in a region of the target object which is not shielded by the shielding object;
performing dot multiplication on the dimension reduction feature map and the second analysis mask to obtain a second dot multiplication result, wherein the second dot multiplication result is used for representing semantic information and depth features of a sub-component region corresponding to the second analysis mask in the complete region of the target object;
stacking and convolving the first dot multiplication result and the second dot multiplication result to obtain a second attention result;
the obtaining the second feature map according to the second attention result includes:
stacking the second attention result and the first feature map to obtain a fifth stacking result;
and performing dimension reduction processing on the fifth stacking result to obtain the second feature map.
9. The method according to any one of claims 1-8, wherein the first analysis mask and the second analysis mask are in the form of a one-hot encoding.
10. An image processing apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an image to be processed, a first analysis mask and a second analysis mask, the image to be processed comprises a target object, a partial region of the target object is shielded by a shielding object, the first analysis mask is an analysis mask of a region of the target object which is not shielded by the shielding object, and the second analysis mask is an analysis mask of a complete region of the target object;
and the processing module is used for inputting the image to be processed, the first analysis mask and the second analysis mask into a neural network model, and complementing, based on an attention mechanism, the content of the area of the target object that is shielded by the shielding object.
11. An electronic device, comprising: one or more processors; a memory; and one or more computer programs; wherein the one or more computer programs are stored in the memory; characterized in that the one or more processors, when executing the one or more computer programs, cause the electronic device to implement the image processing method of any of claims 1-9.
12. A computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the image processing method of any one of claims 1-9.
13. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to carry out the image processing method according to any one of claims 1 to 9.
CN202110247373.9A 2021-03-05 2021-03-05 Image processing method and device Pending CN112967199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247373.9A CN112967199A (en) 2021-03-05 2021-03-05 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247373.9A CN112967199A (en) 2021-03-05 2021-03-05 Image processing method and device

Publications (1)

Publication Number Publication Date
CN112967199A true CN112967199A (en) 2021-06-15

Family

ID=76276722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247373.9A Pending CN112967199A (en) 2021-03-05 2021-03-05 Image processing method and device

Country Status (1)

Country Link
CN (1) CN112967199A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657355A (en) * 2021-10-20 2021-11-16 之江实验室 Global and local perception pedestrian re-identification method fusing segmentation information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101479765A (en) * 2006-06-23 2009-07-08 图象公司 Methods and systems for converting 2D motion pictures for stereoscopic 3D exhibition
EP2541277A1 (en) * 2011-06-30 2013-01-02 Furuno Electric Company Limited AGPS server with SBAS aiding information for satellite based receivers
JP2020013216A (en) * 2018-07-13 2020-01-23 キヤノン株式会社 Device, control method, and program
CN110929651A (en) * 2019-11-25 2020-03-27 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN110956097A (en) * 2019-11-13 2020-04-03 北京影谱科技股份有限公司 Method and module for extracting occluded human body and method and device for scene conversion
CN111079545A (en) * 2019-11-21 2020-04-28 上海工程技术大学 Three-dimensional target detection method and system based on image restoration
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADAM W. HARLEY等: "Segmentation-Aware Convolutional Networks Using Local Attention Masks", 《ARXIV》, 15 August 2017 (2017-08-15), pages 1 - 11 *
CHENYU LI等: "Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation", 《MM \'20》, 12 October 2020 (2020-10-12), pages 3016 - 3024, XP059453783, DOI: 10.1145/3394171.3413960 *
QIANG ZHOU等: "Human De-occlusion: Invisible Perception and Recovery for Humans", 《ARXIV》, 22 March 2021 (2021-03-22), pages 1 - 11 *
周强: "基于先验信息建模的视频目标分割和补全研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 01, 15 January 2023 (2023-01-15), pages 138 - 1227 *


Similar Documents

Publication Publication Date Title
US9679412B2 (en) 3D face model reconstruction apparatus and method
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
WO2020248841A1 (en) Au detection method and apparatus for image, and electronic device and storage medium
CN111862035B (en) Training method of light spot detection model, light spot detection method, device and medium
CN111414879B (en) Face shielding degree identification method and device, electronic equipment and readable storage medium
CN110837796B (en) Image processing method and device
US11822900B2 (en) Filter processing device and method of performing convolution operation at filter processing device
CN111428805B (en) Method for detecting salient object, model, storage medium and electronic device
US20230067934A1 (en) Action Recognition Method, Apparatus and Device, Storage Medium and Computer Program Product
US20180005113A1 (en) Information processing apparatus, non-transitory computer-readable storage medium, and learning-network learning value computing method
CN112967199A (en) Image processing method and device
JP2014123230A (en) Image processor, image processing method, and program
CN112633260B (en) Video motion classification method and device, readable storage medium and equipment
US20190130600A1 (en) Detection Method and Device Thereof
US20180052536A1 (en) Method for detecting input device and detection device
CN117541511A (en) Image processing method and device, electronic equipment and storage medium
CN113222830A (en) Image processing method and device
CN112967197A (en) Image processing method, apparatus, electronic device, medium, and computer program product
CN110619597A (en) Semitransparent watermark removing method and device, electronic equipment and storage medium
CN115965791A (en) Image generation method and device and electronic equipment
CN112967198A (en) Image processing method and device
CN113255700A (en) Image feature map processing method and device, storage medium and terminal
CN113033542A (en) Method and device for generating text recognition model
CN110517239A (en) A kind of medical image detection method and device
CN112200774A (en) Image recognition apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination