CN115908982A - Image processing method, model training method, device, equipment and storage medium - Google Patents

Image processing method, model training method, device, equipment and storage medium

Info

Publication number
CN115908982A
Authority
CN
China
Prior art keywords
saliency
detection
feature
image
significance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211527503.5A
Other languages
Chinese (zh)
Inventor
李阳刚
邢怀飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211527503.5A priority Critical patent/CN115908982A/en
Publication of CN115908982A publication Critical patent/CN115908982A/en
Pending legal-status Critical Current


Abstract

The disclosure provides an image processing method, a model training method, a device, equipment and a storage medium, relates to the field of artificial intelligence, in particular to cloud computing, video coding and media cloud technologies, and can be applied in intelligent cloud scenarios. The specific implementation scheme is as follows: obtaining feature information of a target image based on a feature extraction network in a saliency detection model; processing the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps; and determining a corresponding salient region in the target image based on each of the at least two saliency maps. The embodiments of the disclosure provide a finer-grained measure of the degree of attention paid to different regions, thereby improving the accuracy of saliency detection as well as the detection efficiency.

Description

Image processing method, model training method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to cloud computing, video coding and media cloud technologies, and can be applied in intelligent cloud scenarios.
Background
By introducing visual saliency, significant help and improvement can be brought to computer vision tasks. Saliency detection uses computer vision algorithms to predict which information in an image or video is more interesting to the human eye, i.e. it detects the salient regions in an image or video.
In the related art, saliency detection is generally implemented based on deep learning, and the regions of an image are divided into two categories: salient and non-salient. In practical application scenarios, however, images or videos often contain complex subjects, and the degree of attention the human eye pays to different regions is not limited to a binary salient/non-salient distinction.
Disclosure of Invention
The disclosure provides an image processing method, a model training method, an apparatus, a device and a storage medium.
According to an aspect of the present disclosure, there is provided an image processing method including:
obtaining feature information of a target image based on a feature extraction network in a saliency detection model;
processing the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
determining a corresponding salient region in the target image based on each of the at least two saliency maps.
According to another aspect of the present disclosure, there is provided a model training method, including:
acquiring a sample image, and determining at least two saliency map labels corresponding to the sample image;
training a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is used for obtaining feature information of the sample image, and the at least two detection branches are used for respectively processing the feature information to obtain at least two saliency maps.
According to another aspect of the present disclosure, there is provided an image processing apparatus including:
the feature extraction module is used for obtaining feature information of the target image based on the feature extraction network in the saliency detection model;
the detection module is used for processing the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
the region determination module is used for determining a corresponding salient region in the target image based on each of the at least two saliency maps.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the labeling module is used for acquiring a sample image and determining at least two saliency map labels corresponding to the sample image;
the training module is used for training a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is used for obtaining feature information of the sample image, and the at least two detection branches are used for respectively processing the feature information to obtain at least two saliency maps.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to the methods of the embodiments of the present disclosure, at least two salient regions may be determined for a target image. By distinguishing the regions of the target image into a non-salient region and at least two salient regions, a finer-grained measure is given to the degree of attention paid to different regions, thereby improving the accuracy of saliency detection. In addition, the at least two detection branches of the saliency detection model share the feature extraction network, so that efficient feature reuse is achieved and detection efficiency is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a saliency detection model in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an RSU module according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a saliency detection model in another embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
FIG. 6 is a schematic view of a second saliency map label in an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a face region in an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic block diagram of an image processing apparatus according to another embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of a model training apparatus according to another embodiment of the present disclosure;
FIG. 12 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows a schematic flowchart of an image processing method according to an embodiment of the present disclosure. The method comprises the following steps:
step S110, obtaining feature information of a target image based on a feature extraction network in a saliency detection model;
step S120, processing the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
step S130, determining a corresponding salient region in the target image based on each of the at least two saliency maps.
Illustratively, the saliency detection model may be a deep learning model trained using a training data set. The network structure of the saliency detection model may be a convolutional neural network, for example Unet (a U-shaped network), U2net (a two-level nested Unet), PoolNet (a pooling-based network), and the like.
In an embodiment of the present disclosure, the saliency detection model is used to process a target image into at least two saliency maps. The target image may be any image to be processed that is input to the saliency detection model; for example, the target image may be an independent image or a video frame in a video. A saliency map is used to characterize a salient region. Illustratively, the saliency map is an image of the same size as the target image, in which the salient regions and the non-salient regions are filled with different pixel values so as to distinguish them.
Illustratively, the saliency detection model may comprise a feature extraction network and at least two detection branches. The feature extraction network is used for receiving the target image and processing it to obtain feature information of the target image. The feature information may include one or more feature maps of the target image. Each detection branch is connected to the feature extraction network and is used for receiving the one or more feature maps and obtaining a saliency map based on the feature information.
In the embodiments of the present disclosure, the at least two detection branches of the saliency detection model process the feature information respectively, so that each detection branch produces one saliency map and the at least two detection branches together produce at least two saliency maps. Optionally, each detection branch can be made to produce a different saliency map from the same feature information by setting a different saliency map label for each detection branch during model training.
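As a rough sketch of this shared-backbone, multi-branch structure (illustrative only; the `backbone` placeholder, channel width and head design are assumptions rather than the patent's exact configuration), a PyTorch-style implementation might look like:

```python
import torch
import torch.nn as nn

class TwoBranchSaliencyModel(nn.Module):
    """Shared feature extraction network, two detection branches, two saliency maps."""
    def __init__(self, backbone: nn.Module, feat_channels: int = 64):
        super().__init__()
        self.backbone = backbone  # shared feature extraction network
        # each detection branch turns the shared features into a 1-channel saliency map
        self.face_branch = nn.Sequential(nn.Conv2d(feat_channels, 1, 3, padding=1), nn.Sigmoid())
        self.body_branch = nn.Sequential(nn.Conv2d(feat_channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)        # feature information of the target image
        return self.face_branch(feats), self.body_branch(feats)
```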
For example, the training data set includes a plurality of human images, and for each human image, two saliency map labels may be set, that is, two saliency maps are determined as training labels (or supervision targets) of the human image. The two saliency maps include a first saliency map having a face region in the person image as a salient region, and a second saliency map having a person region in the person image as a salient region. The two detection branches of the model comprise a first detection branch for detecting a human face and a second detection branch for detecting a human body, the first saliency map is used as a training label of the first detection branch, and the second saliency map is used as a training label of the second detection branch. In this way, after the model is trained based on the training data set, the first detection branch can output a saliency map in which the face region is regarded as a salient region for any image, and the second detection branch can output a saliency map in which the person region is regarded as a salient region for any image.
Since a saliency map is used to characterize a salient region, the corresponding salient region can be determined in the target image based on each saliency map output by the saliency detection model. Illustratively, given that a saliency map fills the salient and non-salient areas with different pixel values, the salient region in each saliency map can be determined according to the pixel values, so that at least two salient regions can be determined from the at least two saliency maps.
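For instance, a salient region can be read off a saliency map by simple thresholding of its pixel values; a minimal sketch, assuming maps normalized to [0, 1] and a 0.5 threshold (neither is specified by the disclosure):

```python
import numpy as np

def salient_region_mask(saliency_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean mask of the salient region in a saliency map with values in [0, 1]."""
    return saliency_map >= threshold

# e.g. face_region = salient_region_mask(face_map); body_region = salient_region_mask(body_map)
```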
According to the above method, at least two salient regions can be determined for the target image; for example, for a person image, not only can the region where the person is located be distinguished from the non-salient region, but the region where the person is located can also be distinguished from the region where the face is located. By distinguishing the regions of the target image into a non-salient region and at least two salient regions, a finer-grained measure is given to the degree of attention paid to different regions, thereby improving the accuracy of saliency detection. In addition, the at least two detection branches of the saliency detection model share the feature extraction network, so that efficient feature reuse is achieved, the added computational cost is kept small, and detection efficiency is improved.
In some embodiments of the present disclosure, application scenarios for at least two salient regions are also provided. Illustratively, the image processing method may further include:
and coding at least two significant regions and non-significant regions in the target image by adopting different code rates to obtain a compressed image of the target image.
Illustratively, the at least two salient regions have different degrees of saliency, that is, they receive different degrees of attention; based on this, different code rates are set for the respective regions so that each region is encoded according to its degree of saliency, improving the image quality of the attended regions.
For example, the second significant region of the two significant regions surrounds the first significant region, and the first significant region is a region with more attention, then the code rate of the first significant region is set to be higher than that of the second significant region, and the code rate of the second significant region is set to be higher than that of the non-significant region. In this way, in the encoded compressed image, the information compression of the first salient region is less, the information compression degree of the second salient region is intermediate, and the information compression of the non-salient region is more.
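One possible way to realize this region-wise rate allocation is to translate the two salient regions into a per-pixel quantization-offset map that an ROI-capable encoder can consume; the concrete offset values below are assumptions for illustration only:

```python
import numpy as np

def qp_offset_map(face_region: np.ndarray, body_region: np.ndarray) -> np.ndarray:
    """Per-pixel quantization offsets: negative = spend more bits (less compression),
    positive = spend fewer bits. face_region / body_region are boolean masks."""
    offsets = np.full(body_region.shape, 4, dtype=np.int8)  # non-salient region: compress more
    offsets[body_region] = 0                                 # second salient region: intermediate
    offsets[face_region] = -4                                # first salient region: compress least
    return offsets
```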
According to this embodiment, when the at least two salient regions are applied to the compression coding of an image or video, the code rate allocation among regions is more reasonable, so the image quality of the parts the human eye attends to can be improved while the overall code rate remains low, improving the compression coding effect for the image or video.
Optionally, the at least two significant regions determined by the method of the embodiment of the present disclosure may also be applied to other various computer vision tasks such as automatic image cropping and image enhancement, which are not described herein one by one.
In some application scenarios, such as video coding transmission scenarios, the real-time requirement for image processing is high, and therefore, in some embodiments of the present disclosure, some improvements to the saliency detection model are also provided to improve the image processing efficiency.
Illustratively, in the above embodiment, the feature extraction network may include at least two convolutional layers, some of which are depthwise separable convolutional layers (Depthwise Separable Convolution).
A depthwise separable convolutional layer decomposes a conventional convolution operation into a channel-by-channel (depthwise) convolution and a point-by-point (1×1) convolution. By replacing some of the conventional convolutional layers with depthwise separable convolutional layers, the amount of computation in the feature extraction network can be reduced, thereby improving the efficiency of saliency detection and, correspondingly, the image processing efficiency.
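A standard PyTorch sketch of such a layer (channel counts and kernel size are placeholders):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise (channel-by-channel) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```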
Illustratively, depthwise separable convolutions may be used in the convolutional layers of the feature extraction network that operate on the shallowest, highest-resolution feature maps. Such feature maps have high resolution and a large amount of computation, which the depthwise separable convolution can effectively reduce.
Optionally, in some exemplary embodiments, the step S110 of obtaining the feature information of the target image based on the feature extraction network in the saliency detection model includes:
performing at least two rounds of feature coding and at least two rounds of pooling based on a feature coding network in the feature extraction network and the target image, to obtain at least two coding feature maps of the target image;
performing at least two rounds of upsampling and at least two rounds of feature decoding based on a feature decoding network in the feature extraction network and the at least two coding feature maps, to obtain at least two decoding feature maps of the target image.
That is, in the above embodiment, the feature information output by the feature extraction network includes at least two decoding feature maps of the target image. Specifically, the feature extraction network comprises a feature coding network and a feature decoding network. The feature coding network performs feature coding and pooling multiple times: each round of feature coding yields a coding feature map, which is pooled before the next round of feature coding and pooling, finally producing multiple coding feature maps of different scales. The feature decoding network performs feature decoding and upsampling multiple times: the map obtained by each upsampling is concatenated with the coding feature map of the corresponding scale and then decoded into a decoding feature map, after which the next upsampling and feature decoding are performed, finally producing multiple decoding feature maps of different scales. Illustratively, the network structure of the saliency detection model is a Unet structure or an improved structure based on Unet, such as a U2net structure.
In the above embodiment, the feature extraction network obtains coding feature maps of different scales (i.e. extracts the high-frequency and low-frequency information in the image) through the at least two pooling operations, and performs feature decoding by combining the coding feature maps of different scales in the at least two upsampling operations, so that the at least two decoding feature maps it outputs retain the various kinds of information in the image, which is beneficial to the saliency detection effect.
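An illustrative sketch of this encode-pool / upsample-decode pattern; the number of stages and channel widths are assumptions, not the patent's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUnetExtractor(nn.Module):
    """Sketch of the feature extraction network: multi-scale coding feature maps via
    repeated pooling, then decoding feature maps via upsampling + concatenation."""
    def __init__(self, ch: int = 16):
        super().__init__()
        conv = lambda ci, co: nn.Sequential(nn.Conv2d(ci, co, 3, padding=1),
                                            nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.enc1, self.enc2, self.enc3 = conv(3, ch), conv(ch, ch), conv(ch, ch)
        self.dec2, self.dec1 = conv(2 * ch, ch), conv(2 * ch, ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        e1 = self.enc1(x)                 # coding feature map, full scale
        e2 = self.enc2(self.pool(e1))     # 1/2 scale
        e3 = self.enc3(self.pool(e2))     # 1/4 scale
        u2 = F.interpolate(e3, size=e2.shape[2:], mode='bilinear', align_corners=False)
        d2 = self.dec2(torch.cat([u2, e2], dim=1))   # decoding feature map, 1/2 scale
        u1 = F.interpolate(d2, size=e1.shape[2:], mode='bilinear', align_corners=False)
        d1 = self.dec1(torch.cat([u1, e1], dim=1))   # decoding feature map, full scale
        return [d1, d2]                   # at least two decoding feature maps
```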
Illustratively, the feature coding network comprises at least two RSUs (ReSidual U-blocks) for feature coding; that is, the saliency detection model adopts the U2net structure, a two-level nested U-shaped network. The input convolutional layer in the RSU is a depthwise separable convolutional layer.
To facilitate explanation of the placement of the depthwise separable convolutional layers in this example, fig. 2 shows a schematic structural diagram of a saliency detection model in an embodiment of the present disclosure. As shown in fig. 2, the saliency detection model has a U2net structure and includes a feature extraction network 210 and a detection branch portion 220. The feature extraction network 210 includes a feature coding network 211 and a feature decoding network 212. The feature coding network 211 comprises 5 RSU modules. Each of the first 4 RSU modules is followed by a max-pooling layer that performs 2× downsampling, so each RSU module of the coding network outputs a coding feature map of a different scale. The feature decoding network 212 includes 4 RSU modules, which start from the smallest coding feature map output by the feature coding network, perform layer-by-layer upsampling and concatenation with the multi-scale coding feature maps, and then decode through the RSU modules, thereby obtaining multi-scale decoding feature maps.
Fig. 3 shows a schematic structural diagram of the RSU module. The RSU module is itself a Unet network. Each convolutional layer is followed by Batch Normalization and a ReLU (Rectified Linear Unit) activation function. The numbers under the convolutional layers indicate their numbers of convolution kernels; all kernels are 3×3 in size (except in the depthwise separable convolution, which contains one 3×3 depthwise convolution and one 1×1 pointwise convolution). The input image first passes through an input convolutional layer 310, which adjusts the channel number to be consistent with the channel number of the output coding feature map (12 in the figure). The data then passes through 3 convolution-and-max-pooling stages, followed by a dilated convolutional layer with a dilation rate of 2, each convolutional layer outputting a coding feature map. Starting from the coding feature map output by the dilated convolutional layer, the network upsamples layer by layer, concatenates with the coding feature map of the corresponding scale, and decodes through a convolutional layer. The output of the last decoding convolutional layer is summed (rather than concatenated) with the output of the depthwise separable input convolution, giving the coding feature map output by the RSU module.
Following the above example, the input convolutional layer 310 in fig. 3 uses a depthwise separable convolution instead of a conventional convolution. Because the input convolutional layer of the RSU module operates at the shallowest level, where the input resolution is highest and the amount of computation is largest, using a depthwise separable convolution effectively reduces the computation.
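A simplified sketch of a residual U-block with a depthwise separable input convolution; it has far fewer levels and channels than the RSU of fig. 3, and all sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Sequential):
    def __init__(self, ci, co, dilation=1):
        super().__init__(nn.Conv2d(ci, co, 3, padding=dilation, dilation=dilation),
                         nn.BatchNorm2d(co), nn.ReLU(inplace=True))

class MiniRSU(nn.Module):
    """Simplified residual U-block: depthwise-separable input convolution,
    a small internal Unet, and a residual sum (not concatenation) at the output."""
    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 12):
        super().__init__()
        # input convolution as a depthwise separable convolution (3x3 depthwise + 1x1 pointwise)
        self.conv_in = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.enc1 = ConvBNReLU(out_ch, mid_ch)
        self.pool = nn.MaxPool2d(2)
        self.enc2 = ConvBNReLU(mid_ch, mid_ch)
        self.bottom = ConvBNReLU(mid_ch, mid_ch, dilation=2)   # dilated convolution, rate 2
        self.dec2 = ConvBNReLU(2 * mid_ch, mid_ch)
        self.dec1 = ConvBNReLU(2 * mid_ch, out_ch)

    def forward(self, x):
        xin = self.conv_in(x)                                   # input convolution output
        e1 = self.enc1(xin)
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(e2)
        d2 = self.dec2(torch.cat([b, e2], dim=1))
        d2 = F.interpolate(d2, size=e1.shape[2:], mode='bilinear', align_corners=False)
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return d1 + xin                                         # residual sum, not concatenation
```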
In some examples, each of the at least two detection branches includes at least two saliency detection heads in one-to-one correspondence with the at least two decoding feature maps, wherein each saliency detection head is configured to obtain a detection result based on its corresponding decoding feature map; each detection branch then fuses the detection results of its saliency detection heads to obtain the saliency map corresponding to that branch. The fusion may be performed by concatenation.
Correspondingly, the step S120 of processing the feature information with the at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps may include: obtaining at least two detection results by using, in each detection branch of the saliency detection model, the at least two saliency detection heads in one-to-one correspondence with the at least two decoding feature maps; and fusing the at least two detection results to obtain the saliency map corresponding to each detection branch.
For example, fig. 2 includes two detection branches. One detection branch includes 5 first saliency detection heads in one-to-one correspondence with the feature map output by the feature coding network and the decoding feature maps output by the RSU modules of the feature decoding network; these 5 first saliency detection heads output 5 detection results, i.e. 5 saliency maps of different scales. The 5 saliency maps are then uniformly upsampled back to the original scale, concatenated along the channel dimension, and fused into a first output saliency map through a 1×1 convolutional layer. Similarly, the other detection branch includes 5 second saliency detection heads in one-to-one correspondence with the feature map output by the feature coding network and the decoding feature maps output by the RSU modules of the feature decoding network, and their 5 detection results can be fused to obtain a second saliency map. Each detection head may consist of a 3×3 convolutional layer and a sigmoid activation layer.
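A sketch of one such detection branch, assuming the per-head 3×3 conv + sigmoid design and 1×1 fusion described above (the final sigmoid after fusion is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyBranch(nn.Module):
    """One detection branch: a 3x3-conv + sigmoid head per decoding feature map,
    upsampling of every result to the original scale, channel-wise concatenation,
    and fusion through a 1x1 convolution."""
    def __init__(self, feat_channels):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid()) for c in feat_channels])
        self.fuse = nn.Conv2d(len(feat_channels), 1, kernel_size=1)

    def forward(self, feature_maps, out_size):
        # one detection result (single-channel saliency map) per decoding feature map
        results = [F.interpolate(head(f), size=out_size, mode='bilinear', align_corners=False)
                   for head, f in zip(self.heads, feature_maps)]
        fused = self.fuse(torch.cat(results, dim=1))   # fuse along the channel dimension
        return torch.sigmoid(fused)                    # final saliency map of this branch
```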
In practical applications, in order to improve image processing efficiency, this multi-scale fusion may be retained during model training but removed during model inference, i.e. in the image processing method.
Specifically, in another example, the step S120 of processing the feature information with the at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps may include: processing the maximum decoding feature map of the at least two decoding feature maps with the at least two detection branches respectively to obtain the at least two saliency maps.
The maximum decoding feature map is the feature map with the highest resolution (largest size) in the at least two decoding feature maps, that is, the decoding feature map of the highest layer.
Exemplarily, fig. 4 shows a schematic structural diagram of the saliency detection model in the above example. As shown in fig. 4, each detection branch no longer has detection heads for the other decoding feature maps nor fuses multiple output saliency maps; only the detection head corresponding to the maximum decoding feature map is retained, and processing the maximum decoding feature map yields the corresponding saliency map. Two saliency maps are obtained using the two detection branches.
Observation of the trained network parameters shows that the fusion weight of the highest-layer, high-resolution decoding feature map is usually large while the weights of the low-resolution decoding feature maps are small; therefore, removing the multi-scale fusion and the low-resolution detection reduces the computation of the model while keeping the accuracy of the output saliency map essentially unchanged.
Corresponding to the image processing method, the embodiments of the present disclosure further provide a model training method for training the above saliency detection model. Fig. 5 shows a flowchart of a model training method according to an embodiment of the present disclosure. As shown in fig. 5, the method may include:
step S510, acquiring a sample image and determining at least two saliency map labels corresponding to the sample image;
step S520, training a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is used for obtaining feature information of the sample image, and the at least two detection branches are used for respectively processing the feature information to obtain at least two saliency maps.
The network structure of the preset model, that is, the network structure of the saliency detection model, can be implemented with reference to the foregoing embodiments.
The saliency detection model obtained by training according to this model training method can be used to implement the image processing method of the foregoing embodiments and has the corresponding beneficial effects.
In some embodiments of the present disclosure, exemplary implementations of labeling the sample image are also provided. Optionally, in step S510, the determining at least two saliency map labels corresponding to the sample image may include: processing the sample image with a target detection algorithm to obtain the position information of a target detection frame in the sample image; and obtaining a first saliency map label corresponding to the sample image based on the position information of the target detection frame.
Taking the sample image being a person image and the target detection algorithm comprising a face detection algorithm as an example, the person image can be processed with the face detection algorithm to obtain the position information of a face detection frame, so that a saliency map representing the face region is obtained based on that position information; this saliency map can be used as the first saliency map label among the at least two saliency map labels.
Obtaining the target detection frame through the target detection algorithm and then deriving the first saliency map label from it enables automatic labeling, which avoids the cost of manual labeling and helps improve processing efficiency.
Optionally, in some exemplary embodiments, obtaining the first saliency map label corresponding to the sample image based on the position information of the target detection frame includes:
determining an enclosing area of the target detection frame based on the position information of the target detection frame;
determining a second salient region in the sample image based on a second saliency map label acquired in advance;
determining a first salient region in the sample image based on the intersection of the surrounding region of the target detection frame and the second salient region;
and determining a first saliency map label corresponding to the sample image based on the first saliency region.
By way of example and not limitation, the sample image may be a sample image from a public dataset. Public datasets usually also provide labeled saliency map labels, and the second saliency map label may be such a labeled saliency map label from the public dataset.
Fig. 6 illustrates a schematic diagram of a second saliency map label in an embodiment of the present disclosure, taking the sample image as an example of a crowd image. The second saliency map label is a saliency map for representing a person main body region in the crowd image. As shown in fig. 6, the white area in the saliency map is a second salient area (human subject area) whose pixel value is 255, and the black area is an insignificant area whose pixel value is 0.
In practical application, the faces in the crowd image are detected using any face detection algorithm, which yields the position information and confidence of each face detection frame (a rectangular detection frame). The position information of each face includes 4 values: the x-coordinate of the top-left vertex, the y-coordinate of the top-left vertex, the x-coordinate of the bottom-right vertex, and the y-coordinate of the bottom-right vertex. The confidence is a number between 0 and 1 reflecting the credibility of the detection result; the larger the value, the more credible the detection. A low-confidence face is usually caused by a poor angle, a too-small size, or a false detection, so the detected faces can be further screened according to confidence. Illustratively, a confidence threshold (e.g., 0.7) is preset, and only detection boxes with confidence greater than the threshold are retained.
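A minimal sketch of this confidence-based screening, assuming each detection is returned as (x1, y1, x2, y2, confidence):

```python
def filter_faces(detections, conf_threshold=0.7):
    """Keep only face detection boxes whose confidence exceeds the preset threshold.
    Each detection is assumed to be a tuple (x1, y1, x2, y2, confidence)."""
    return [det for det in detections if det[4] > conf_threshold]
```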
According to the above exemplary embodiment, after the face detection frame is obtained, the intersection of the surrounding area of the face detection frame and the second salient region labeled in the public dataset is taken as the first salient region (i.e. the face region), as in formula (1):
A_face = A_sal ∩ A_box    formula (1)
where A_sal is the second salient region (represented as a set of pixels), A_box is the surrounding region of the face detection frame (another set of pixels), and A_face is their overlapping part (the intersection of the two pixel sets), i.e. the face region. Fig. 7 shows a schematic diagram of the face region. As shown in fig. 7, the rectangular frame is the face detection frame, and the irregular region 720, obtained by intersecting the surrounding region of the face detection frame with the human subject region 710, is the face region.
In practical application, the saliency map corresponding to the region of the intersection may be used as the first saliency map label. For example, the first saliency map label may be obtained by setting the pixel values of the intersection region to 255 and the pixel values of the other regions to 0 in an image of the same size as the sample image.
In addition, the first saliency map label and the second saliency map label can also be aggregated into one image. For example, the pixel values of the face region can be modified on the second saliency map label according to the intersection to obtain a multi-level saliency map that contains both the first saliency map label and the second saliency map label: pixels belonging to A_face are set to 2, pixels belonging to A_sal but not to A_face are set to 1, and the remaining pixels are set to 0.
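A sketch of this labeling step, assuming the second saliency map label has already been converted to a boolean body mask and the face boxes have been screened as above:

```python
import numpy as np

def build_labels(body_mask, face_boxes):
    """body_mask: boolean array, True where the second saliency map label is 255.
    face_boxes: screened face detections as (x1, y1, x2, y2, confidence) tuples.
    Returns the first saliency map label and the aggregated multi-level map."""
    box_mask = np.zeros_like(body_mask, dtype=bool)
    for x1, y1, x2, y2, _conf in face_boxes:
        box_mask[int(y1):int(y2), int(x1):int(x2)] = True   # A_box: surrounding region
    face_mask = body_mask & box_mask                          # A_face = A_sal ∩ A_box
    first_label = np.where(face_mask, 255, 0).astype(np.uint8)
    multi_level = np.zeros(body_mask.shape, dtype=np.uint8)
    multi_level[body_mask] = 1                                # second salient region
    multi_level[face_mask] = 2                                # first salient region
    return first_label, multi_level
```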
According to this embodiment, the first salient region in the sample image is determined based on the intersection of the surrounding region of the target detection frame and the second salient region, which improves the labeling accuracy of the first salient region and thus the accuracy of the saliency detection model.
Optionally, in step S520 of the model training method, training the preset model by using the sample image and the at least two saliency map labels includes:
processing the sample image with the preset model to obtain at least two saliency maps; wherein the at least two saliency maps comprise a first saliency map and a second saliency map, and the salient region in the first saliency map is smaller than the salient region in the second saliency map;
determining a first loss based on a first weight, the first saliency map and its corresponding saliency map label;
determining a second loss based on a second weight, the second saliency map and its corresponding saliency map label; wherein the second weight is smaller than the first weight;
determining a total loss based on the first loss and the second loss, and updating the parameters of the preset model based on the total loss.
Illustratively, the above-mentioned losses can be calculated using a balanced cross-entropy loss function.
Taking the first saliency map as a face region map and the second saliency map as a main body region map as an example, the loss function can be calculated by referring to the following formula (2):
L = L_sal + L_face    formula (2)
where L_sal is the second loss (the loss of the person body region predicted by the body detection branch), L_face is the first loss (the loss of the face region predicted by the face detection branch), and L is the total loss.
The first loss L_face can be calculated with reference to the following formula (3), a weighted cross-entropy over all pixel coordinates:
L_face = -w_face · Σ_(x,y) [ G_face(x,y) · log P_face(x,y) + (1 - G_face(x,y)) · log(1 - P_face(x,y)) ]    formula (3)
where w_face is the first weight (face region weight); G_face(x, y) denotes the pixel value at coordinate (x, y) in the first saliency map label; and P_face(x, y) denotes the pixel value at coordinate (x, y) in the face region map output by the preset model.
The second loss L_sal can be calculated analogously with reference to the following formula (4):
L_sal = -w_sal · Σ_(x,y) [ G_sal(x,y) · log P_sal(x,y) + (1 - G_sal(x,y)) · log(1 - P_sal(x,y)) ]    formula (4)
where w_sal is the second weight (person body region weight); G_sal(x, y) denotes the pixel value at coordinate (x, y) in the second saliency map label; and P_sal(x, y) denotes the pixel value at coordinate (x, y) in the body region map output by the preset model.
According to the above embodiment, the second weight is smaller than the first weight, e.g. w_sal is 2 and w_face is 6.
Because the salient region in the first saliency map is smaller than the salient region in the second saliency map, a larger weight is set for the first saliency map and a smaller one for the second, which increases the influence of the small region in the loss function and avoids the sample imbalance caused by its small number of pixels. In this way, the detection recall of the at least two saliency maps can be effectively improved.
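A sketch of this weighted training loss; the placement of the weights follows the reconstruction of formulas (3) and (4) above and is therefore an assumption rather than the patent's verbatim formula:

```python
import torch.nn.functional as F

def total_loss(p_face, g_face, p_sal, g_sal, w_face=6.0, w_sal=2.0):
    """L = L_sal + L_face, each branch's cross-entropy scaled by its region weight."""
    l_face = w_face * F.binary_cross_entropy(p_face, g_face, reduction='sum')
    l_sal = w_sal * F.binary_cross_entropy(p_sal, g_sal, reduction='sum')
    return l_sal + l_face
```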
In some embodiments of the present disclosure, processing the sample image by using a preset model to obtain at least two saliency maps may include:
obtaining feature information of the sample image by using the feature extraction network in the preset model, the feature information of the sample image comprising at least two decoding feature maps;
obtaining at least two detection results by using, in each detection branch of the preset model, the at least two saliency detection heads in one-to-one correspondence with the at least two decoding feature maps;
and fusing the at least two detection results with the fusion unit in each detection branch to obtain the saliency map corresponding to that detection branch.
That is to say, in the preset model, each of the at least two detection branches includes at least two saliency detection heads in one-to-one correspondence with the at least two decoding feature maps, wherein each saliency detection head is configured to obtain a detection result based on its corresponding decoding feature map, and each detection branch fuses the detection results of its saliency detection heads to obtain the corresponding saliency map. The preset model has the same network structure as the saliency detection model of the foregoing embodiments, which is therefore not described again here.
It should be noted that, in this way, even though multi-scale fusion is removed during inference, the complete U2net structure can still be used for multi-scale decoding and fusion during training, which improves the accuracy of the finally trained saliency detection model.
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, the present disclosure also provides an image processing apparatus. Fig. 8 shows a schematic block diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus may include:
the feature extraction module 810 is configured to obtain feature information of a target image based on a feature extraction network in a saliency detection model;
the detection module 820 is configured to process the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
the region determining module 830 is configured to determine a corresponding salient region in the target image based on each of the at least two saliency maps.
Fig. 9 shows a schematic block diagram of an image processing apparatus according to another embodiment of the present disclosure. As shown in fig. 9, in another embodiment of the present disclosure, on the basis of the above-mentioned image processing apparatus, the apparatus further includes:
the image encoding module 910 is configured to encode at least two significant regions and non-significant regions in the target image with different code rates to obtain a compressed image of the target image.
Illustratively, the feature extraction network includes at least two convolutional layers, some of which are depth separable convolutional layers.
Alternatively, as shown in fig. 9, the feature extraction module may include:
the encoding unit 921 is configured to perform feature encoding at least twice and pooling at least twice based on a feature encoding network in a feature extraction network and a target image to obtain at least two encoding feature maps of the target image;
the decoding unit 922 is configured to perform at least two upsampling and at least two feature decoding processes based on the feature decoding network and the at least two coding feature maps in the feature extraction network to obtain at least two decoding feature maps of the target image.
Optionally, the detection module is specifically configured to:
and respectively processing the maximum decoding characteristic diagram in the at least two decoding characteristic diagrams by utilizing the at least two detection branches to obtain at least two significance diagrams.
FIG. 10 shows a schematic block diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in fig. 10, the apparatus may include:
the labeling module 1010 is configured to obtain a sample image and determine at least two saliency map labels corresponding to the sample image;
a training module 1020, configured to train a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is used for obtaining feature information of the sample image, and the at least two detection branches are used for respectively processing the feature information to obtain at least two saliency maps.
FIG. 11 shows a schematic block diagram of a model training apparatus according to another embodiment of the present disclosure. As shown in fig. 11, in another embodiment of the present disclosure, the labeling module includes:
the target detection unit 1110 is configured to process the sample image by using a target detection algorithm to obtain position information of a target detection frame in the sample image;
the first labeling unit 1120 is configured to obtain a first saliency map label corresponding to the sample image based on the position information of the target detection frame.
Optionally, as shown in fig. 11, the training module includes:
a model processing unit 1130, configured to process the sample image by using a preset model to obtain at least two saliency maps; wherein the at least two saliency maps comprise a first saliency map and a second saliency map, and a saliency region in the first saliency map is smaller than a saliency region in the second saliency map;
a first loss determining unit 1140, configured to determine a first loss based on a first weight, the first saliency map and its corresponding saliency map label;
a second loss determining unit 1150, configured to determine a second loss based on a second weight, the second saliency map and its corresponding saliency map label; wherein the second weight is smaller than the first weight;
a total loss determination unit 1160 for determining a total loss based on the first loss and the second loss;
an updating unit 1170 for updating the parameters in the preset model based on the total loss.
For a description of specific functions and examples of each module and each sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding steps in the foregoing method embodiments, and details are not repeated here.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200, which can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1201 performs the respective methods and processes described above, such as the image processing method and the model training method. For example, in some embodiments, the image processing method and the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the image processing method and the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the image processing method and the model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (22)

1. An image processing method, comprising:
obtaining feature information of a target image based on a feature extraction network in a saliency detection model;
processing the feature information with at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
determining a corresponding salient region in the target image based on each of the at least two saliency maps.
2. The method of claim 1, further comprising:
and coding at least two significant regions and non-significant regions in the target image by adopting different code rates to obtain a compressed image of the target image.
3. The method of claim 1 or 2, wherein the feature extraction network comprises at least two convolutional layers, some of which are depth separable convolutional layers.
4. The method according to claim 1 or 2, wherein the obtaining feature information of the target image based on the feature extraction network in the saliency detection model comprises:
performing feature coding at least twice and pooling at least twice based on the feature coding network in the feature extraction network and the target image to obtain at least two coding feature maps of the target image;
and performing at least twice upsampling and at least twice feature decoding on the basis of the feature decoding network in the feature extraction network and the at least two coding feature maps to obtain at least two decoding feature maps of the target image.
5. The method of claim 4, wherein the feature encoding network comprises at least two residual U-shaped blocks for feature encoding, and the input convolutional layer in each residual U-shaped block is a depthwise separable convolutional layer.
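The sketch below illustrates claims 3-5 under stated assumptions: a depthwise separable convolution (a depthwise convolution followed by a 1x1 pointwise convolution) used as the input convolution of a much-simplified residual U-shaped block. Real residual U-shaped blocks are deeper and carry skip connections at every level; the channel counts here are arbitrary.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class ResidualUBlock(nn.Module):
    """Toy residual U-shaped block: encode, decode, then add the input branch back.
    Assumes an even spatial size so pooling and upsampling round-trip exactly."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_in = DepthwiseSeparableConv(in_ch, out_ch)  # input conv is depthwise separable
        self.encode = nn.Sequential(
            nn.Conv2d(out_ch, mid_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(mid_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        base = self.conv_in(x)
        return base + self.decode(self.encode(base))  # residual connection around the U
```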
6. The method according to claim 4, wherein the processing the feature information by using at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps comprises:
processing the largest decoding feature map among the at least two decoding feature maps by using the at least two detection branches respectively to obtain the at least two saliency maps.
7. A model training method, comprising:
acquiring a sample image, and determining at least two saliency map labels corresponding to the sample image;
training a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; wherein the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is configured to obtain feature information of the sample image, and the at least two detection branches are configured to respectively process the feature information to obtain at least two saliency maps.
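A minimal training-loop sketch for claim 7 is shown below. The data loader is assumed to yield a sample image together with its two saliency map labels, and `weighted_loss` stands for a loss such as the one sketched after claim 10; the optimizer, learning rate, and epoch count are illustrative.

```python
import torch


def train(model, loader, weighted_loss, epochs=10, lr=1e-3):
    """Train the preset model (feature extraction network + two detection
    branches) using two saliency map labels per sample image."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for image, (label_first, label_second) in loader:
            pred_first, pred_second = model(image)  # one prediction per detection branch
            loss = weighted_loss(pred_first, label_first, pred_second, label_second)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```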
8. The method of claim 7, wherein the determining at least two saliency map labels corresponding to the sample image comprises:
processing the sample image by using a target detection algorithm to obtain position information of a target detection box in the sample image;
obtaining a first saliency map label corresponding to the sample image based on the position information of the target detection box.
9. The method of claim 8, wherein the obtaining a first saliency map label corresponding to the sample image based on the position information of the target detection box comprises:
determining a bounding region of the target detection box based on the position information of the target detection box;
determining a second salient region in the sample image based on a pre-acquired second saliency map label;
determining a first salient region in the sample image based on an intersection of the bounding region of the target detection box and the second salient region;
determining the first saliency map label corresponding to the sample image based on the first salient region.
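The label-generation step of claims 8 and 9 can be pictured with the sketch below: detection boxes from an off-the-shelf object detector are rasterized into a mask, intersected with the salient region of a pre-acquired (second) saliency map label, and the intersection becomes the first saliency map label. The box format and binarization threshold are assumptions.

```python
import numpy as np


def first_saliency_label(boxes, second_label, threshold=0.5):
    """boxes: iterable of (x1, y1, x2, y2) pixel coordinates of target detection boxes.
    second_label: pre-acquired saliency map label in [0, 1] with shape (H, W)."""
    box_region = np.zeros(second_label.shape, dtype=bool)
    for x1, y1, x2, y2 in boxes:
        box_region[y1:y2, x1:x2] = True           # bounding region of the detection boxes
    second_region = second_label > threshold       # second salient region
    first_region = box_region & second_region      # intersection -> first salient region
    return first_region.astype(np.float32)         # first saliency map label
```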
10. The method according to any one of claims 7-9, wherein the training a preset model by using the sample image and the at least two saliency map labels comprises:
processing the sample image by using the preset model to obtain the at least two saliency maps; wherein the at least two saliency maps comprise a first saliency map and a second saliency map, and a salient region in the first saliency map is smaller than a salient region in the second saliency map;
determining a first loss based on a first weight, the first saliency map and its corresponding saliency map label;
determining a second loss based on a second weight, the second saliency map and its corresponding saliency map label; wherein the second weight is less than the first weight;
determining a total loss based on the first loss and the second loss, and updating parameters in the preset model based on the total loss.
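The weighted two-branch loss of claim 10 could look like the sketch below, where the first saliency map (the smaller, more strongly attended region) receives the larger weight. Binary cross-entropy and the weights 1.0/0.5 are assumptions; the claim only requires the second weight to be less than the first.

```python
import torch.nn.functional as F


def weighted_loss(pred_first, label_first, pred_second, label_second,
                  w_first=1.0, w_second=0.5):
    """Total loss = first loss + second loss, with w_second < w_first."""
    first_loss = w_first * F.binary_cross_entropy(pred_first, label_first)
    second_loss = w_second * F.binary_cross_entropy(pred_second, label_second)
    return first_loss + second_loss
```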
11. The method of claim 10, wherein the processing the sample image by using the preset model to obtain the at least two saliency maps comprises:
obtaining feature information of the sample image by using the feature extraction network in the preset model; wherein the feature information of the sample image comprises at least two decoding feature maps;
obtaining at least two detection results by using, in each detection branch of the preset model, at least two saliency detection heads that are in one-to-one correspondence with the at least two decoding feature maps;
fusing the at least two detection results by using a fusion unit in each detection branch to obtain a saliency map corresponding to each detection branch.
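Claim 11's per-branch structure, one saliency detection head per decoding feature map followed by a fusion unit, is sketched below. The decoder channel counts, the common output size, and the 1x1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DetectionBranch(nn.Module):
    """One detection head per decoding feature map, fused into a single saliency map."""
    def __init__(self, decoder_channels=(64, 128), out_size=(224, 224)):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, 1, 3, padding=1) for c in decoder_channels])
        self.fuse = nn.Conv2d(len(decoder_channels), 1, 1)  # fusion unit
        self.out_size = out_size

    def forward(self, decoder_maps):
        # one detection result per decoding feature map, resized to a common size
        results = [F.interpolate(head(fmap), size=self.out_size,
                                 mode="bilinear", align_corners=False)
                   for head, fmap in zip(self.heads, decoder_maps)]
        return torch.sigmoid(self.fuse(torch.cat(results, dim=1)))
```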
12. An image processing apparatus comprising:
a feature extraction module configured to obtain feature information of a target image based on a feature extraction network in a saliency detection model;
a detection module configured to process the feature information by using at least two detection branches in the saliency detection model respectively to obtain at least two saliency maps;
a region determination module configured to determine a corresponding salient region in the target image based on each of the at least two saliency maps.
13. The apparatus of claim 12, further comprising:
an image encoding module configured to encode at least two salient regions and non-salient regions in the target image with different code rates to obtain a compressed image of the target image.
14. The apparatus of claim 12 or 13, wherein the feature extraction network comprises at least two convolutional layers, some of which are depthwise separable convolutional layers.
15. The apparatus of claim 12 or 13, wherein the feature extraction module comprises:
an encoding unit configured to perform feature encoding at least twice and pooling at least twice based on a feature encoding network in the feature extraction network and the target image to obtain at least two encoding feature maps of the target image;
a decoding unit configured to perform upsampling at least twice and feature decoding at least twice based on a feature decoding network in the feature extraction network and the at least two encoding feature maps to obtain at least two decoding feature maps of the target image.
16. The apparatus of claim 15, wherein the detection module is specifically configured to:
process the largest decoding feature map among the at least two decoding feature maps by using the at least two detection branches respectively to obtain the at least two saliency maps.
17. A model training apparatus comprising:
an annotation module configured to acquire a sample image and determine at least two saliency map labels corresponding to the sample image;
a training module configured to train a preset model by using the sample image and the at least two saliency map labels to obtain a saliency detection model; wherein the preset model comprises a feature extraction network and at least two detection branches, the feature extraction network is configured to obtain feature information of the sample image, and the at least two detection branches are configured to respectively process the feature information to obtain at least two saliency maps.
18. The apparatus of claim 17, wherein the annotation module comprises:
a target detection unit configured to process the sample image by using a target detection algorithm to obtain position information of a target detection box in the sample image;
a first labeling unit configured to obtain a first saliency map label corresponding to the sample image based on the position information of the target detection box.
19. The apparatus of claim 17 or 18, wherein the training module comprises:
a model processing unit configured to process the sample image by using the preset model to obtain the at least two saliency maps; wherein the at least two saliency maps comprise a first saliency map and a second saliency map, and a salient region in the first saliency map is smaller than a salient region in the second saliency map;
a first loss determination unit configured to determine a first loss based on a first weight, the first saliency map and its corresponding saliency map label;
a second loss determination unit configured to determine a second loss based on a second weight, the second saliency map and its corresponding saliency map label; wherein the second weight is less than the first weight;
a total loss determination unit configured to determine a total loss based on the first loss and the second loss;
an updating unit configured to update the parameters in the preset model based on the total loss.
20. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
21. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.
22. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.
CN202211527503.5A 2022-12-01 2022-12-01 Image processing method, model training method, device, equipment and storage medium Pending CN115908982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211527503.5A CN115908982A (en) 2022-12-01 2022-12-01 Image processing method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115908982A true CN115908982A (en) 2023-04-04

Family

ID=86489294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211527503.5A Pending CN115908982A (en) 2022-12-01 2022-12-01 Image processing method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115908982A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN110866897A (en) * 2019-10-30 2020-03-06 上海联影智能医疗科技有限公司 Image detection method and computer readable storage medium
CN112132156A (en) * 2020-08-18 2020-12-25 山东大学 Multi-depth feature fusion image saliency target detection method and system
WO2022021029A1 (en) * 2020-07-27 2022-02-03 深圳市大疆创新科技有限公司 Detection model training method and device, detection model using method and storage medium
CN115272705A (en) * 2022-07-29 2022-11-01 北京百度网讯科技有限公司 Method, device and equipment for training salient object detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination