CN116665215A - Image salient region extraction method, device, computer equipment and storage medium - Google Patents

Image salient region extraction method, device, computer equipment and storage medium

Info

Publication number
CN116665215A
Authority
CN
China
Prior art keywords
image
feature
map
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310601820.5A
Other languages
Chinese (zh)
Inventor
陈超
王彬燕
刘子强
王海涛
潘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sailstar Yongzhi Software Technology Co ltd
Original Assignee
Beijing Sailstar Yongzhi Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sailstar Yongzhi Software Technology Co ltd filed Critical Beijing Sailstar Yongzhi Software Technology Co ltd
Priority to CN202310601820.5A priority Critical patent/CN116665215A/en
Publication of CN116665215A publication Critical patent/CN116665215A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/1444Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/15Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/18143Extracting features based on salient regional features, e.g. scale invariant feature transform [SIFT] keypoints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image extraction, and discloses an image salient region extraction method and device, a computer device, and a storage medium. The extraction method is applied to a nested deep semantic segmentation model that comprises an encoder and a decoder, and includes the following steps: acquiring an image to be extracted; encoding and fusing the image to be extracted with the encoder to generate a first coding feature map; decoding the first coding feature map with the decoder to generate a multi-channel probability map; and performing threshold filtering and a connected-domain operation on the multi-channel probability map to determine the image salient region. The method saves the manpower and time of the salient-region extraction process, and solves the problems of high cost and long processing time of archive salient-region extraction based on manual labeling.

Description

Image salient region extraction method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of image extraction technologies, and in particular to an image salient region extraction method and device, a computer device, and a storage medium.
Background
With the continuous development of computer information technology, Internet information technology has passed through four stages. The first stage was construction and transmission, covering the build-out of networks and computer hardware and the development and application of computer software. The second stage was the rise of information applications and Internet users, involving the construction and improvement of the network application environment. The third stage was the rise of social and mobile computing, including the emergence of social networks and the popularization of mobile terminals. The fourth stage is the rise of concurrent applications and intelligence, involving big data, artificial intelligence, the Internet of Things, and related technologies. As these technologies have matured, they have become an important force driving industry development and have brought convenience to daily life.
In order to reduce the complexity of traditional document scanning, the processing of archive images is gradually shifting from mere digitization to intelligent processing. The main difficulties are, for example, that when a document is scanned, background unrelated to the salient region is often captured, or content of an adjacent page is captured because the bound archive cannot be split apart. Users typically only care about the scanned page content that contains the salient region, and the other, irrelevant parts need to be removed. Even with the computer-aided processing tools that exist today, salient regions cannot be extracted effectively. In addition, archive images are highly diverse, and yellowed or stained paper greatly hinders the screening and extraction of salient regions. At present, digitizing archives in the traditional manual way requires a large amount of manpower and material resources for manual labeling, which constrains archive processing throughput.
Intelligent archive processing has therefore become an effective solution.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide an image salient region extraction method and device, a computer device, and a storage medium, so as to solve the problems of high cost and long processing time of archive-image salient-region extraction based on manual labeling.
In a first aspect, an embodiment of the present invention provides an image salient region extraction method, which is applied to a nested deep semantic segmentation model, where the nested deep semantic segmentation model includes an encoder and a decoder; the extraction method comprises the following steps:
acquiring an image to be extracted;
encoding and fusing the image to be extracted by using an encoder to generate a first encoding feature map;
decoding the first coding feature map by using a decoder to generate a multi-channel probability map;
and carrying out threshold filtering and connected domain operation on the multichannel probability map to determine an image saliency region.
According to the image salient region extraction method provided by this embodiment of the invention, the encoder of the nested deep semantic segmentation model encodes and fuses the image to be extracted to generate a first coding feature map; the decoder of the model decodes the first coding feature map to generate a multi-channel probability map; and finally, threshold filtering and a connected-domain operation are performed on the multi-channel probability map to determine the image salient region. This greatly saves the manpower and time of the salient-region extraction process, and solves the problems of high cost and long processing time of archive-image salient-region extraction based on manual annotation.
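For illustration only, the following is a minimal sketch of the first three steps of this pipeline, assuming a PyTorch implementation; `model` stands in for the nested deep semantic segmentation model, whose internals are sketched in the detailed description below, and all names are illustrative rather than taken from the patent.

```python
import cv2
import numpy as np
import torch

def run_model(model: torch.nn.Module, path: str) -> np.ndarray:
    """Steps 1-3: acquire the image, encode and fuse, decode to a probability map."""
    image = cv2.imread(path)                             # step 1: acquire the image
    x = torch.from_numpy(image[:, :, ::-1].copy())       # BGR -> RGB
    x = x.permute(2, 0, 1).unsqueeze(0).float() / 255.0  # NCHW, values in [0, 1]
    with torch.no_grad():
        prob_map = model(x)                              # steps 2-3 happen inside the model
    # Step 4 (threshold filtering + connected-domain operation) is sketched
    # under step S104 in the detailed description.
    return prob_map[0].numpy()
```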
With reference to the first aspect, in an implementation manner, before generating the first coding feature map after coding and fusing the image to be extracted by using the encoder, the method further includes:
and carrying out deviation rectifying operation on the image to be extracted to obtain an image standard chart.
According to the image salient region extraction method provided by this embodiment of the invention, the rectification operation performed on the image to be extracted yields an image standard chart, which reduces the learning complexity of the nested deep semantic segmentation model and improves the quality of the model's processing.
In an alternative embodiment, generating the first coding feature map after coding the image to be extracted with the encoder includes:
carrying out multi-layer downsampling characteristic extraction on an image to be extracted by using an encoder;
and carrying out feature fusion on the image subjected to feature extraction to generate a first coding feature map.
According to the image salient region extraction method provided by this embodiment of the invention, the encoder performs multi-layer downsampling feature extraction on the image standard chart, and the extracted image features are fused to generate the first coding feature map; this addresses the problem that the diverse forms of the images to be extracted make their varied features difficult to extract effectively.
In an alternative embodiment, generating the multi-channel probability map after decoding the first encoded feature map with the decoder includes:
performing multi-layer up-sampling feature extraction on the first coding feature map by using a decoder to generate a second coding feature map;
feature fusion is carried out on the first coding feature map and the second coding feature map, and then a feature map with the same resolution as that of the original image is generated;
and performing convolution mapping and probability conversion on the feature map to generate a multichannel probability map.
In an alternative embodiment, generating the feature map with the same resolution as the original image after feature fusion of the first encoding feature map and the second encoding feature map includes:
splicing and fusing the up-sampling and down-sampling feature images of each layer in the first coding feature image and the second coding feature image to obtain an auxiliary decoding feature image of each layer;
and splicing and fusing the auxiliary decoding feature images of each layer to obtain the feature images with the same resolution as the original image.
According to the image salient region extraction method provided by this embodiment of the invention, the decoder performs multi-layer up-sampling feature extraction on the first coding feature map to generate a second coding feature map; the up-sampled and down-sampled feature maps of each layer in the first and second coding feature maps are spliced and fused to obtain the auxiliary decoding feature map of each layer; and the auxiliary decoding feature maps of all layers are spliced and fused to obtain a feature map with the same resolution as the original image. Through this multi-layer up-sampling feature extraction, the layer-by-layer splicing and fusion of the up-sampled and down-sampled feature maps, and the convolution mapping and probability conversion that produce the multi-channel probability map, the nested deep semantic segmentation model becomes more perceptive of texture details at different scales in the image, so it can extract image features better and separate the salient region of the image from the non-salient regions more cleanly.
In an alternative embodiment, convolving the feature map and probability transforming to generate the multi-channel probability map includes:
carrying out convolution mapping on the channel number in the feature map according to the preset segmentation class number to obtain a feature map with the preset segmentation class number;
and carrying out probability conversion on each pixel value of each channel in the feature map to generate a multichannel probability map with preset resolution.
According to the image salient region extraction method provided by this embodiment of the invention, convolution mapping the number of channels of the feature map according to the preset number of segmentation classes yields a feature map with the preset number of classes, which reduces the parameter count of the nested deep semantic segmentation model while maintaining high accuracy. Probability conversion of each pixel value of each channel then produces a multi-channel probability map with the preset resolution, in which every pixel value of every channel is a probability; this makes the image easy for the nested deep semantic segmentation model to process.
In an alternative embodiment, thresholding the multi-channel probability map and connected-domain operation to determine the image saliency region includes:
filtering the multi-channel probability map according to a preset threshold value to generate a mask map;
calculating the maximum connected domain in the mask map;
and determining an image saliency area according to the maximum connected domain.
According to the image salient region extraction method provided by this embodiment of the invention, the multi-channel probability map is filtered with a preset threshold to generate a mask map, the maximum connected domain in the mask map is calculated, and the image salient region is determined from the maximum connected domain. This enables accurate extraction of the salient region of the image to be extracted and greatly improves extraction efficiency.
In a second aspect, an embodiment of the present invention provides an image salient region extraction apparatus, including:
the acquisition module is used for acquiring an image to be extracted;
the coding module is used for generating a first coding feature map after coding and fusing the images to be extracted by using the coder;
the decoding module is used for decoding the first coding feature map by using a decoder to generate a multi-channel probability map;
and the post-processing module is used for carrying out threshold filtering and connected domain operation on the multichannel probability map to determine an image saliency region.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the image saliency region extraction method of the first aspect or any of its corresponding implementations.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where computer instructions are stored on the computer readable storage medium, where the computer instructions are configured to cause a computer to perform the image saliency area extraction method of the first aspect or any embodiment corresponding to the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a method of extracting regions of image saliency, according to some embodiments of the invention;
FIG. 2 is a flow diagram of another image saliency region extraction method, according to some embodiments of the invention;
FIG. 3 is a flow diagram of yet another image saliency region extraction method, according to some embodiments of the invention;
FIG. 4 is a network architecture diagram of a nested deep semantic segmentation model in an image salient region extraction method according to some embodiments of the present invention;
FIG. 5 is a network architecture diagram of a nested inner layer U-shaped feature fusion module in an image salient region extraction method according to some embodiments of the present invention;
fig. 6 is a block diagram of the structure of an image saliency area extraction apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to an embodiment of the present invention, there is provided an image saliency region extraction method embodiment, it being noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
In this embodiment, an image salient region extraction method is provided and applied to a nested deep semantic segmentation model, where the nested deep semantic segmentation model includes an encoder and a decoder, and fig. 1 is a flowchart of the image salient region extraction method according to an embodiment of the present invention, as shown in fig. 1, and the flowchart includes the following steps:
step S101, an image to be extracted is acquired. Specifically, including but not limited to scanning an archive or other paper document with a scanner to obtain an image to be extracted.
Step S102, the encoder encodes and fuses the image to be extracted to generate a first coding feature map. Specifically, archive images come in many varieties, such as text, tables, and photographs, so representing archive features with a fixed, shallow feature-extraction scheme is not feasible. To identify document features accurately, the archive image is input into the encoder of the selected nested deep semantic segmentation network to obtain feature maps of different hierarchical depths and corresponding scales as the coding features of the archive image. In this way, the scanned archive or other paper document undergoes multi-layer downsampling, nested inner-layer U-shaped feature fusion, and related operations in the encoder to generate the first coding feature map, which may consist of coding feature maps of the image to be extracted at different scales.
Step S103, the decoder is utilized to decode the first coding feature map to generate a multi-channel probability map.
Specifically, the decoder of the nested deep semantic segmentation model decodes the first coding feature map using multi-layer upsampling, which restores the first coding feature map to a feature map with the same resolution as the original image; this full-resolution feature map is then processed to generate the multi-channel probability map.
Step S104, threshold filtering and a connected-domain operation are performed on the multi-channel probability map to determine the image salient region. Specifically, the maximum inter-class variance method (the Otsu algorithm, OTSU) computes a threshold between the salient-region class and the other regions of the multi-channel probability map, a mask map of the salient region is generated by threshold filtering, and the image salient region is determined by finding the maximum connected domain.
For example: for the class channel of the processed archive image to which the salient region belongs, the Otsu method computes a probability threshold between the salient-region class and the other regions in the multi-channel probability map, and this threshold is used to generate a mask map of the image to be extracted. A pixel whose probability is greater than or equal to the threshold receives the value 255 in the mask map, and a pixel whose probability is below the threshold receives the value 0. Finally, the maximum connected domain formed by the pixels with value 255 in the mask map is computed; this connected domain is the salient region of the archive scan produced by the final calculation.
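For illustration only, a minimal OpenCV sketch of this post-processing step, assuming the salient-class channel of the multi-channel probability map is available as a float array with values in [0, 1]; the function and variable names are illustrative, not from the patent.

```python
import cv2
import numpy as np

def postprocess(salient_prob: np.ndarray) -> np.ndarray:
    """Otsu threshold filtering followed by keeping the maximum connected domain."""
    gray = (salient_prob * 255).astype(np.uint8)
    # Otsu (maximum inter-class variance) picks the threshold automatically;
    # pixels at or above it become 255 in the mask map, the rest become 0.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:  # no foreground component at all
        return np.zeros_like(mask)
    # Label 0 is the background; keep only the largest foreground component.
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return np.where(labels == largest, 255, 0).astype(np.uint8)
```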
According to this image salient region extraction method, the encoder of the nested deep semantic segmentation model encodes and fuses the image to be extracted to generate the first coding feature map, the decoder decodes the first coding feature map to generate the multi-channel probability map, and threshold filtering and a connected-domain operation are finally performed on the multi-channel probability map to determine the image salient region. This greatly saves the manpower and time of the salient-region extraction process and solves the problems of high cost and long processing time of archive-image salient-region extraction based on manual labeling.
In this embodiment, an image salient region extraction method is provided and applied to a nested deep semantic segmentation model, where the nested deep semantic segmentation model includes an encoder and a decoder, and fig. 2 is a flowchart of the image salient region extraction method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S201, an image to be extracted is acquired. Please refer to step S101 in the embodiment shown in fig. 1 in detail, which is not described herein.
In an alternative embodiment, before the encoding fusion is performed on the image to be extracted by using the encoder to generate the first encoding feature map, the method further includes:
And carrying out deviation rectifying operation on the image to be extracted to obtain an image standard chart.
Specifically, take an archive image as an example: the image is scanned by a user with a scanner of a specified configuration, but the position in which the paper archive is placed in the scanner is not completely fixed during scanning, so the scanned image may be skewed. To address this, the archive image can be rotated by an appropriate angle so that the archive body region lies on the horizontal; this reduces the learning complexity of the deep semantic segmentation model and improves the quality of model inference.
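The patent does not specify a particular rectification algorithm; as a hedged illustration, one common approach estimates the skew angle from the minimum-area bounding rectangle of the foreground pixels and rotates the scan back to horizontal:

```python
import cv2
import numpy as np

def deskew(scan: np.ndarray) -> np.ndarray:
    """Rotate a skewed scan so the document body lies on the horizontal."""
    gray = cv2.cvtColor(scan, cv2.COLOR_BGR2GRAY)
    # Invert so document content becomes the foreground, then binarize with Otsu.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    if coords.size == 0:                 # blank page: nothing to rectify
        return scan
    angle = cv2.minAreaRect(coords)[-1]  # tilt of the tightest bounding box
    if angle > 45:                       # normalize to the (-45, 45] range
        angle -= 90
    h, w = scan.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(scan, m, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REPLICATE)
```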
Step S202, an encoder is utilized to encode and fuse the images to be extracted, and then a first encoding feature map is generated.
Specifically, the step S202 includes:
Step S2021, the encoder performs multi-layer downsampling feature extraction on the image to be extracted. Specifically, multi-layer downsampling serves as feature extraction: the rectified archive standard chart is input into the selected nested deep semantic segmentation model, and feature extraction through the multi-layer downsampling operation yields feature maps of different hierarchical depths and corresponding scales as the coding features of each level of the standard archive image.
Step S2022, feature fusion is performed on the extracted features to generate the first coding feature map. Specifically, the nested deep semantic segmentation model further includes nested inner-layer U-shaped feature fusion modules, and the encoder applies a multi-layer downsampling operation to the coding feature map passed down from the layer above. During feature encoding, the nested inner-layer U-shaped feature fusion module of each layer applies convolution, downsampling, splicing, fusion, and related operations to the feature map of the current layer to obtain the fused feature map of that layer. The fused feature map of each layer is then downsampled again, and the nested inner-layer U-shaped feature fusion proceeds iteratively layer by layer downward in this manner. By repeatedly reducing the spatial size of the feature map while increasing its number of channels, semantic information of the image from shallow to deep layers is obtained, from which the nested deep semantic segmentation model generates first coding feature maps at different scales.
Illustratively, with reference to fig. 4, the network structure of the nested deep semantic segmentation model comprises, in order: encoder 0, encoder 1, encoder 2, encoder 3, encoder 4, decoder 0, decoder 1, decoder 2, and decoder 3. Encoder 0 and decoder 0 constitute layer 0 of the nested deep semantic segmentation model, encoder 1 and decoder 1 constitute layer 1, encoder 2 and decoder 2 constitute layer 2, encoder 3 and decoder 3 constitute layer 3, and encoder 4 constitutes layer 4. The arrows in fig. 4 represent the data-flow paths of the feature maps. "Max pooling 2×2" means the pooling kernel is 2×2, so the width and height of the pooled feature map become 1/2 of the input feature map, implementing downsampling. "2× linear upsampling" doubles the width and height of the input feature map by linear interpolation, implementing upsampling. "Feature-map splicing" concatenates two feature maps of the same resolution along the channel (third) dimension, so the number of channels after splicing is the sum of the channel counts of the two input feature maps; this completes multi-scale feature fusion.
As shown in fig. 5, the nested inner-layer U-shaped feature fusion module comprises two 3×3 convolution + BN layer + ReLU layer blocks; four 2× downsampling + 3×3 convolution + BN layer + ReLU layer blocks; one splicing layer + 3×3 convolution layer; and four 2× upsampling + splicing layer + 3×3 convolution + BN layer + ReLU layer blocks.
During feature extraction from the archive image, the 3-channel archive image is first input into the nested inner-layer U-shaped feature fusion module of layer 0, namely encoder 0, which outputs an encoded feature map. Inside the module, the feature map is processed by a double convolution layer, a BN layer, and a ReLU layer (Rectified Linear Unit, a nonlinear layer). The network structure of the nested inner-layer U-shaped feature fusion module can be adjusted dynamically per layer: from top to bottom, the number of network layers inside the modules of the different layers decreases in turn. The feature map passes in sequence through several downsampling layers, convolution layers, BN layers, and ReLU layers, and then through upsampling layers, convolution layers, BN layers, and ReLU layers, yielding the features fused by the nested inner-layer U-shaped feature fusion module. The BN layer (Batch Normalization layer) is a technique commonly used in deep learning to accelerate the convergence of the network structure of the nested deep semantic segmentation model, reduce the model's dependence on initial parameters, and improve its robustness; it normalizes each mini-batch of data, which makes the inputs of the model more stable and improves its convergence speed and generalization ability.
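For illustration only, a condensed PyTorch sketch of such a nested inner-layer U-shaped feature fusion module, assuming the conv + BN + ReLU building block and 2× down/upsampling described above; the class names, layer counts, and channel widths are illustrative, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    """3x3 convolution + BN layer + ReLU layer, the basic unit described above."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class InnerUFusion(nn.Module):
    """Inner U: encode downward, then upsample and splice the features back up."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, depth: int = 4):
        super().__init__()
        self.stem = ConvBNReLU(in_ch, out_ch)
        self.enc = nn.ModuleList(
            [ConvBNReLU(out_ch, mid_ch)] +
            [ConvBNReLU(mid_ch, mid_ch) for _ in range(depth - 1)])
        self.mid = ConvBNReLU(mid_ch, mid_ch)  # bottom of the inner U
        self.dec = nn.ModuleList(
            [ConvBNReLU(mid_ch * 2, mid_ch) for _ in range(depth - 1)] +
            [ConvBNReLU(mid_ch * 2, out_ch)])

    def forward(self, x):
        top = self.stem(x)
        skips, h = [], top
        for conv in self.enc:
            h = conv(h)
            skips.append(h)
            h = F.max_pool2d(h, 2)                        # 2x downsampling
        h = self.mid(h)
        for conv, skip in zip(self.dec, reversed(skips)):
            h = F.interpolate(h, size=skip.shape[2:], mode='bilinear',
                              align_corners=False)        # 2x linear upsampling
            h = conv(torch.cat([h, skip], dim=1))         # splice, then 3x3 conv
        return h + top                                    # fuse with the stem feature
```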
After the inner-layer feature fusion of the archive image is complete, the outer network of the encoder of the nested deep semantic segmentation model downsamples the picture; each downsampling changes the width and height of the feature map passed down from the layer above to 1/2 of the original.
The fused feature map of each layer continues to be downsampled, and the nested inner-layer U-shaped feature fusion proceeds iteratively layer by layer downward in this manner; semantic information of the image from shallow to deep layers is obtained by repeatedly reducing the spatial size of the feature map while increasing its number of channels.
From top to bottom, layers 0 through 4 represent the feature-extraction process of the encoder of the nested deep semantic segmentation model: through encoders 0 to 4, the scale of the feature maps gradually decreases while their channel count gradually increases. With this network structure, feature maps of different hierarchical depths and corresponding scales are obtained as the first coding features of the archive image. The multi-layer downsampling operations and the U-shaped feature fusion solve the problem that the diverse forms of archive images make their varied features difficult to extract effectively.
As shown in fig. 5, encoder 0 and decoder 0 use the structure of fig. 5. The input archive image first passes through two 3×3 convolution + BN + ReLU layers to obtain a first coding feature map of the corresponding scale and channel count; coding feature maps at different scales are then obtained through four rounds of 2× downsampling + 3×3 convolution + BN + ReLU. Next, one splicing layer + 3×3 convolution layer and four rounds of 2× upsampling + splicing + 3×3 convolution + BN + ReLU fuse the spliced features with the coding features of the same scale, yielding decoded features at different scales. Encoder 0 downsamples 4 times, i.e., the input feature map is downsampled by a factor of 16 overall, producing feature maps at 1/2, 1/4, 1/8, and 1/16 of the original width and height; decoder 0 likewise upsamples 4 times, restoring the feature map to the original width and height.
In addition, encoders 1 to 4 and decoders 1 to 3 adopt similar structures, differing only in the number of modules within the dashed box of fig. 5: encoder 1 and decoder 1 downsample and upsample 3 times, encoder 2 and decoder 2 downsample and upsample 2 times, encoder 3 and decoder 3 downsample and upsample once, and encoder 4 performs no up- or downsampling.
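Continuing the InnerUFusion sketch above, this per-layer difference could be expressed by varying the depth argument; the channel widths below are illustrative guesses, not the patent's configuration.

```python
import torch.nn as nn

encoders = nn.ModuleList([
    InnerUFusion(3,   16,  64, depth=4),   # encoder 0: 4 down/upsampling steps
    InnerUFusion(64,  16, 128, depth=3),   # encoder 1: 3 steps
    InnerUFusion(128, 32, 256, depth=2),   # encoder 2: 2 steps
    InnerUFusion(256, 64, 512, depth=1),   # encoder 3: 1 step
    # Encoder 4 performs no inner resampling; a pooling-free variant of the
    # module (e.g. built from dilated convolutions) would stand in for it here.
])
```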
Step S203, the decoder is utilized to decode the first coding feature map to generate a multi-channel probability map. Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S204, threshold filtering and connected domain operation are carried out on the multi-channel probability map to determine an image saliency region. Please refer to step S104 in the embodiment shown in fig. 1 in detail, which is not described herein.
According to the image salient region extraction method provided by this embodiment, a scanner scans the paper archive, the archive picture is rectified, first coding feature maps of the original image at different scales are obtained through the multi-layer downsampling operations of the encoder of the nested deep semantic segmentation model, the decoder of the model decodes them to generate a multi-channel pixel probability map, and finally the multi-channel probability map is filtered by a threshold and subjected to a connected-domain operation to determine the image salient region. Extracting the image salient region with the nested deep semantic segmentation model in this way improves the model's perception of texture details at different scales in the image, so the model can extract image features better and separate the salient region of the image from the non-salient regions more cleanly.
In this embodiment, an image salient region extraction method is provided and applied to a nested deep semantic segmentation model, where the nested deep semantic segmentation model includes an encoder and a decoder, and fig. 3 is a flowchart of the image salient region extraction method according to an embodiment of the present invention, as shown in fig. 3, where the flowchart includes the following steps:
step S301, an image to be extracted is acquired. Please refer to step S101 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S302, an encoder is utilized to encode and fuse the image to be extracted, and then a first encoding feature map is generated. Please refer to step S202 in the embodiment shown in fig. 2, which is not described herein.
Step S303, a decoder is utilized to decode the first coding feature map to generate a multi-channel probability map.
Specifically, the step S303 includes:
Step S3031, the decoder performs multi-layer up-sampling feature extraction on the first coding feature map to generate a second coding feature map. Specifically, upsampling, also known as interpolation, is the inverse of downsampling; both are forms of image feature extraction. The decoder upsamples the first coding feature map to generate the second coding feature map.
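For example, one 2× linear upsampling step can be sketched with a single interpolation call (illustrative, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

feature_map = torch.randn(1, 64, 32, 32)   # an illustrative NCHW coding feature map
upsampled = F.interpolate(feature_map, scale_factor=2,
                          mode='bilinear', align_corners=False)
assert upsampled.shape == (1, 64, 64, 64)  # width and height are doubled
```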
Step S3032, feature fusion is performed on the first coding feature map and the second coding feature map to generate a feature map with the same resolution as the original image. Specifically, the decoder of the nested deep semantic segmentation model applies multi-layer upsampling to the first coding feature maps of different scales; each upsampled second coding feature map is spliced and fused with the first coding feature map of the same resolution produced during the earlier downsampling, the nested inner-layer U-shaped feature fusion operation then yields the auxiliary decoding feature map of each layer, and the auxiliary decoding feature maps of all layers are finally spliced and fused at the top layer to obtain a feature map with the same resolution as the original image.
In some optional embodiments, step S3032 includes:
step a1, splicing and fusing the up-sampling and down-sampling feature images of each layer in the first coding feature image and the second coding feature image to obtain an auxiliary decoding feature image of each layer;
and a2, splicing and fusing the auxiliary decoding feature images of each layer to obtain a feature image with the same resolution as the original image.
Specifically, during feature fusion, with reference to figs. 4 and 5, the feature map of layer 4 is first spliced and fused with the coding feature map of layer 3 and then fed into the nested inner-layer U-shaped feature fusion module for feature fusion. The fused features are upsampled, so that the width and height of the feature map become twice the original, spliced and fused with the coding feature map of layer 2, and fed into the module again. This iterative layer-by-layer processing continues until the features fused with the coding feature map of layer 0 are spliced, fused, and fed into the nested inner-layer U-shaped feature fusion module, yielding the auxiliary decoding feature map of layer 0.
From bottom to top, layers 4 through 0 represent the multi-scale feature fusion process of the decoder of the deep semantic segmentation model: the scale of the feature maps gradually increases back to the original picture size, while their channel count gradually decreases.
From layer 0 to layer 4, the auxiliary decoding feature map of each layer can be obtained; these maps can be used for auxiliary training of the overall model. Finally, the auxiliary decoding feature maps of all layers are spliced and fused at the top layer to obtain a feature map with the same resolution as the original image.
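Building on the InnerUFusion sketch above, one decoder stage of this fusion process could be sketched as follows; the function name and the assumption that the channel counts line up are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_stage(fuse: "InnerUFusion", deeper: torch.Tensor,
                 encoder_skip: torch.Tensor) -> torch.Tensor:
    """One decoder layer: 2x upsample the deeper feature, splice it with the
    same-resolution coding feature from the encoder, then fuse in the inner U."""
    up = F.interpolate(deeper, size=encoder_skip.shape[2:],
                       mode='bilinear', align_corners=False)
    return fuse(torch.cat([up, encoder_skip], dim=1))  # channel counts add up
```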
Step S3033, convolution mapping and probability conversion are carried out on the feature map to generate a multi-channel probability map.
In some optional embodiments, step S3033 includes:
Step b1, convolution mapping is performed on the channel count of the feature map according to the preset number of segmentation classes to obtain a feature map with the preset number of segmentation classes. Specifically, the fused auxiliary feature map of each layer is upsampled in turn and passed through a 3×3 convolution, which maps the channel count of the feature map to the number of segmentation classes; for example, when distinguishing the body region from the other regions, the number of classes is 2.
Step b2, probability conversion is performed on each pixel value of each channel of the feature map to generate a multi-channel probability map with the preset resolution. Specifically, a Sigmoid function (a neural-network activation function) converts each pixel value of each channel so that its value range is limited to [0, 1], generating, in probability form, a multi-channel probability map with the preset resolution.
Finally, the auxiliary decoding feature maps of all layers are spliced and fused at the top layer to obtain a feature map at the resolution of the original input image; a 3×3 convolution then maps the channel count of the feature map to the number of segmentation classes (for example, 2 when distinguishing the body region from the other regions), and a Sigmoid function converts each pixel value of each channel into the [0, 1] range, producing the multi-channel probability map in probability form.
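A minimal PyTorch sketch of this output head, assuming the two-class case described above; the input channel count of 64 and the spatial size are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 2                                  # salient region vs. other regions
head = nn.Conv2d(64, num_classes, kernel_size=3, padding=1)  # 3x3 convolution mapping

fused = torch.randn(1, 64, 512, 512)             # fused full-resolution feature map
prob_map = torch.sigmoid(head(fused))            # each pixel value mapped into [0, 1]
assert prob_map.shape == (1, num_classes, 512, 512)
```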
With reference to fig. 4, the arrows in the figure represent the data-flow paths of the feature maps through the nested deep semantic segmentation model, including encoding by the encoder and decoding by the decoder. "Max pooling 2×2" means the pooling kernel is 2×2, so the width and height of the pooled feature map become 1/2 of the input feature map, implementing downsampling; "2× linear upsampling" doubles the width and height of the input feature map by linear interpolation, implementing upsampling; and "feature-map splicing" concatenates two feature maps of the same resolution along the channel dimension, so the number of channels after splicing is the sum of the channel counts of the two inputs, completing multi-scale feature fusion. "3×3 convolution + upsampling + Sigmoid" means a 3×3 convolution adjusts the channel count of the feature map, 2× linear upsampling doubles its width and height by linear interpolation, and a Sigmoid function maps every pixel value of every channel into the [0, 1] interval; "3×3 convolution + Sigmoid" is the same without the upsampling. A final output feature map with 2 channels means the network can ultimately distinguish 2 classes per pixel, namely the salient region and the non-salient region.
Viewed longitudinally, the nested deep semantic segmentation model contains 5 layers, from layer 0 to layer 4, and the different layers contain lateral deep network structures such as the nested inner-layer U-shaped feature fusion modules. The model therefore has a nested deep network structure with a strong capability for extracting detailed semantic features of the image and for reconstructive feature fusion.
Step S304, threshold filtering and connected domain operation are carried out on the multi-channel probability map to determine an image saliency region. Please refer to step S104 in the embodiment shown in fig. 1 in detail, which is not described herein.
According to the image salient region extraction method provided by this embodiment, a scanner scans the paper archive, the archive picture is rectified, first coding feature maps of the original image at different scales are obtained through the multi-layer downsampling operations of the encoder of the nested deep semantic segmentation model, second coding feature maps at different scales are obtained through the upsampling of the decoder, the first and second coding feature maps are spliced and fused to obtain the auxiliary decoding feature map of each layer, the auxiliary decoding feature maps of all layers are spliced and fused to obtain the multi-channel probability map, and finally the multi-channel probability map is filtered by a threshold and subjected to a connected-domain operation to determine the image salient region. Convolution mapping the channel count of the feature map according to the preset number of segmentation classes yields a feature map with the preset number of classes, which reduces the parameter count of the nested deep semantic segmentation model while maintaining high accuracy. Probability conversion of each pixel value of each channel then produces a multi-channel probability map with the preset resolution, in which every pixel value of every channel is a probability, making the image easy for the nested deep semantic segmentation model to process.
The embodiment also provides an image salient region extraction device, which is used for implementing the above embodiment and the preferred implementation, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides an image salient region extraction apparatus, as shown in fig. 6, including:
an acquisition module 601, configured to acquire an image to be extracted;
the encoding module 602 is configured to generate a first encoding feature map after encoding and fusing an image to be extracted by using an encoder;
a decoding module 603, configured to generate a multi-channel probability map after decoding the first coding feature map by using a decoder;
the post-processing module 604 is configured to perform threshold filtering and connected domain operation on the multi-channel probability map to determine an image saliency region.
In some alternative embodiments, the encoding module 602 includes:
a first feature extraction unit, configured to perform multi-layer downsampling feature extraction on the image standard graph by using an encoder;
And the first feature fusion unit is used for carrying out feature fusion on the image subjected to the feature extraction to generate a first coding feature map.
In some alternative embodiments, the decoding module 603 includes:
the second feature extraction unit is used for carrying out multi-layer up-sampling feature extraction on the first coding feature map by using a decoder to generate a second coding feature map;
the second feature fusion unit is used for carrying out feature fusion on the first coding feature map and the second coding feature map to generate a feature map with the same resolution as the original image;
and the mapping conversion unit is used for carrying out convolution mapping and probability conversion on the feature map to generate a multichannel probability map.
In some alternative embodiments, the second feature fusion unit includes:
the auxiliary decoding subunit is used for splicing and fusing the up-sampling and down-sampling feature images of each layer in the first coding feature image and the second coding feature image to obtain an auxiliary decoding feature image of each layer;
and the feature fusion subunit is used for splicing and fusing the auxiliary decoding feature images of each layer to obtain a feature image with the same resolution as the original image.
In some alternative embodiments, the mapping conversion unit includes:
the mapping subunit is used for carrying out convolution mapping on the channel number in the feature map according to the preset segmentation class number to obtain the feature map with the preset segmentation class number;
And the conversion subunit is used for carrying out probability conversion on each pixel value of each channel in the feature map to generate a multichannel probability map with preset resolution.
In some alternative embodiments, the post-processing module 604 includes:
the filtering unit is used for filtering the multi-channel probability map according to a preset threshold value to generate a mask map;
a calculating unit, configured to calculate a maximum connected domain in the mask map;
and the determining unit is used for determining the image saliency area according to the maximum connected domain.
The image salient region extraction apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more pieces of software or firmware, and/or other devices that can provide the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides computer equipment, which is provided with the image salient region extraction device shown in the figure 6.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 7, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple computer devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is taken as an example in fig. 7.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and the application programs required by at least one function, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may include volatile memory, such as random access memory; it may also include non-volatile memory, such as flash memory, a hard disk, or a solid-state disk; the memory 20 may also include a combination of the above types of memory.
The computer device further comprises an input device 30 and an output device 40. The processor 10, the memory 20, the input device 30, and the output device 40 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 7.
The input device 30 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the computer device, for example a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 40 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., a vibration motor), and the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments described above may be implemented in hardware or firmware, or realized as computer code that can be recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored on a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, a flash memory, a hard disk, a solid-state disk, or the like; further, the storage medium may also comprise a combination of the above kinds of memory. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. The image saliency region extraction method is characterized by being applied to a nested deep semantic segmentation model, wherein the nested deep semantic segmentation model comprises an encoder and a decoder; the method comprises the following steps:
acquiring an image to be extracted;
the image to be extracted is subjected to coding fusion by using an encoder to generate a first coding feature map;
decoding the first coding feature map by using a decoder to generate a multi-channel probability map;
and carrying out threshold filtering and connected domain operation on the multi-channel probability map to determine an image saliency region.
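For illustration only, the four steps recited in claim 1 can be exercised end to end with a stand-in model. Everything below is an assumption made for the sketch rather than the patent's disclosed implementation: the toy two-layer network standing in for the nested encoder-decoder, the 0.5 threshold, the choice of channel 1 as the salient class, and the use of SciPy for the connected-domain operation.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy import ndimage

# Toy stand-in for the nested deep semantic segmentation model (encoder + decoder).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=1))               # 2 channels: background / salient

image = torch.rand(1, 3, 128, 128)                 # step 1: acquire the image to be extracted
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)     # steps 2-3: multi-channel probability map
mask = probs[0, 1].numpy() > 0.5                   # step 4a: threshold filtering
labels, n = ndimage.label(mask)                    # step 4b: connected-domain labelling
if n > 0:
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    region = labels == (1 + int(np.argmax(sizes))) # keep the maximum connected domain
```

Claims 2 to 7 refine each of these steps; hedged sketches of the individual stages follow the respective claims.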
2. The method of claim 1, further comprising, prior to the encoding and fusing of the image to be extracted with the encoder to generate the first coding feature map, the following step:
performing a deviation rectification (deskew) operation on the image to be extracted to obtain a standardized image.
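The claim does not fix the rectification algorithm. One common deskew heuristic, sketched here under the assumptions of OpenCV 4.5 or later and a single global skew angle, estimates the minimum-area rectangle of the foreground pixels and rotates the image to compensate:

```python
import cv2
import numpy as np

def deskew(image: np.ndarray) -> np.ndarray:
    """Estimate a global skew angle from the foreground and rotate it out."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]    # OpenCV >= 4.5 reports angles in (0, 90]
    if angle > 45:
        angle -= 90                        # pick the smaller of the two rotation candidates
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REPLICATE)
```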
3. The method of claim 1, wherein the encoding and fusing of the image to be extracted with the encoder to generate the first coding feature map comprises:
carrying out multi-layer downsampling feature extraction on the image to be extracted by using an encoder;
and carrying out feature fusion on the extracted features to generate the first coding feature map.
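A minimal PyTorch sketch of claim 3: multi-layer downsampling feature extraction followed by fusion. The depth, the channel widths, and fusion by bilinear upsampling plus channel concatenation are assumptions, since the claim does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_ch: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, prev = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True)))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(x)                  # keep each layer's feature map
            if i < len(self.stages) - 1:
                x = F.max_pool2d(x, 2)       # downsample between layers
        # fusion: bring every layer to the first layer's size and concatenate
        size = feats[0].shape[-2:]
        fused = torch.cat([F.interpolate(f, size=size, mode='bilinear',
                                         align_corners=False) for f in feats], dim=1)
        return fused, feats                  # fused map plus per-layer features
```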
4. The method of claim 3, wherein generating a multi-channel probability map after decoding the first encoded feature map with a decoder comprises:
performing multi-layer up-sampling feature extraction on the first coding feature map by using a decoder to generate a second coding feature map;
carrying out feature fusion on the first coding feature map and the second coding feature map to generate a feature map with the same resolution as the original image;
and performing convolution mapping and probability conversion on the feature map to generate a multi-channel probability map.
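Claim 4 reads as the decoder half of a U-shaped network. The sketch below assumes the per-layer encoder features of the Encoder sketch after claim 3 (shallow first, deepest last); the layer widths, the skip-connection layout, and a two-class head are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, widths=(256, 128, 64, 32), n_classes: int = 2):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(widths[i] + widths[i + 1], widths[i + 1], 3, padding=1),
                nn.ReLU(inplace=True))
            for i in range(len(widths) - 1)])
        self.head = nn.Conv2d(widths[-1], n_classes, kernel_size=1)

    def forward(self, enc_feats):
        x = enc_feats[-1]                             # deepest encoder feature map
        for stage, skip in zip(self.stages, reversed(enc_feats[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear',
                              align_corners=False)    # multi-layer upsampling
            x = stage(torch.cat([x, skip], dim=1))    # fuse with the encoder features
        logits = self.head(x)                         # convolution mapping to classes
        return torch.softmax(logits, dim=1)           # multi-channel probability map

# dummy per-layer features with assumed shapes, shallow to deep
feats = [torch.randn(1, c, s, s) for c, s in zip((32, 64, 128, 256), (128, 64, 32, 16))]
probs = Decoder()(feats)                              # -> (1, 2, 128, 128)
```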
5. The method of claim 4, wherein generating a feature map having the same resolution as the original image after feature fusion of the first and second encoded feature maps comprises:
splicing and fusing the up-sampling and down-sampling feature maps of each layer in the first coding feature map and the second coding feature map to obtain an auxiliary decoding feature map for each layer;
and splicing and fusing the auxiliary decoding feature maps of the layers to obtain a feature map with the same resolution as the original image.
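One hedged reading of claim 5, with shapes and bilinear interpolation assumed: each layer's encoder (downsampling) and decoder (upsampling) feature maps are concatenated channel-wise into an auxiliary decoding map, and the auxiliary maps are then brought to the input resolution and concatenated again:

```python
import torch
import torch.nn.functional as F

def fuse_to_original_resolution(enc_feats, dec_feats, out_size):
    """enc_feats / dec_feats: per-layer feature maps of matching spatial sizes."""
    aux = [torch.cat([e, d], dim=1)                  # per-layer splice-and-fuse
           for e, d in zip(enc_feats, dec_feats)]    # -> auxiliary decoding maps
    aux = [F.interpolate(a, size=out_size, mode='bilinear', align_corners=False)
           for a in aux]                             # every layer at the original resolution
    return torch.cat(aux, dim=1)                     # final splice across layers
```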
6. The method of claim 4, wherein convolving the feature map and probability transforming to generate a multi-channel probability map comprises:
carrying out convolution mapping on the channel number in the feature map according to the preset segmentation class number to obtain a feature map with the preset segmentation class number;
and carrying out probability conversion on each pixel value of each channel in the feature map to generate a multi-channel probability map with preset resolution.
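The two operations of claim 6 correspond directly to a 1x1 convolution down to the preset class count followed by a per-pixel softmax; the channel count and resolution below are assumed for the sketch:

```python
import torch
import torch.nn as nn

fused = torch.randn(1, 128, 256, 256)        # fused feature map (assumed shape)
head = nn.Conv2d(128, 2, kernel_size=1)      # preset segmentation class number = 2
logits = head(fused)                         # one channel per segmentation class
probs = torch.softmax(logits, dim=1)         # probability conversion per pixel
assert torch.allclose(probs.sum(dim=1), torch.ones(1, 256, 256), atol=1e-5)
```

Softmax over the channel dimension guarantees that the per-pixel class probabilities sum to one, which is what makes the later threshold filtering well defined.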
7. The method of claim 6, wherein the carrying out of threshold filtering and connected domain operation on the multi-channel probability map to determine the image saliency region comprises:
filtering the multi-channel probability map according to a preset threshold value to generate a mask map;
calculating the maximum connected domain in the mask map;
and determining the image saliency region according to the maximum connected domain.
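A sketch of claim 7's post-processing using OpenCV's connected-component statistics; the 0.5 threshold and treating channel 1 as the salient class are assumptions:

```python
import cv2
import numpy as np

def saliency_region(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """probs: (C, H, W) multi-channel probability map as a NumPy array."""
    mask = (probs[1] > threshold).astype(np.uint8)      # threshold filtering -> mask map
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n <= 1:                                          # label 0 is the background
        return np.zeros_like(mask)
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return (labels == largest).astype(np.uint8)         # the maximum connected domain
```

Keeping only the largest connected component suppresses small spurious activations, so the returned mask is a single contiguous saliency region.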
8. An image saliency region extraction apparatus, characterized by comprising:
the acquisition module is used for acquiring an image to be extracted;
the coding module is used for generating a first coding feature map after coding and fusing the image to be extracted by using an encoder;
the decoding module is used for decoding the first coding feature map by using a decoder to generate a multi-channel probability map;
and the post-processing module is used for carrying out threshold filtering and connected domain operation on the multi-channel probability map to determine the image saliency region.
9. A computer device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the image saliency region extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the image saliency area extraction method according to any one of claims 1 to 7.
CN202310601820.5A 2023-05-25 2023-05-25 Image salient region extraction method, device, computer equipment and storage medium Pending CN116665215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310601820.5A CN116665215A (en) 2023-05-25 2023-05-25 Image salient region extraction method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116665215A true CN116665215A (en) 2023-08-29

Family

ID=87709061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310601820.5A Pending CN116665215A (en) 2023-05-25 2023-05-25 Image salient region extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116665215A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542105A (en) * 2024-01-09 2024-02-09 江西师范大学 Facial super-resolution and expression recognition method for low-resolution images under classroom teaching

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109896387A (en) * 2019-03-15 2019-06-18 浙江新再灵科技股份有限公司 A kind of elevator passenger hazardous act detection method and system based on 3D measuring technique
CN110210387A (en) * 2019-05-31 2019-09-06 华北电力大学(保定) Insulator object detection method, system, the device of knowledge based map
US20200184252A1 (en) * 2018-12-10 2020-06-11 International Business Machines Corporation Deep Learning Network for Salient Region Identification in Images
CN112308000A (en) * 2020-11-06 2021-02-02 安徽清新互联信息科技有限公司 High-altitude parabolic detection method based on space-time information
US20210201499A1 (en) * 2019-12-30 2021-07-01 Medo Dx Pte. Ltd Apparatus and method for image segmentation using a deep convolutional neural network with a nested u-structure
CN114862892A (en) * 2022-05-30 2022-08-05 中南大学 Two-dimensional and three-dimensional pavement crack identification method based on enhanced depth edge characteristics
CN115019041A (en) * 2022-06-02 2022-09-06 中国科学院苏州生物医学工程技术研究所 Focal segmentation fusion calibration method, device, medium and product based on PET/CT imaging
CN115760810A (en) * 2022-11-24 2023-03-07 江南大学 Medical image segmentation apparatus, method and computer-readable storage medium
CN115761550A (en) * 2022-12-20 2023-03-07 江苏优思微智能科技有限公司 Water surface target detection method based on laser radar point cloud and camera image fusion

Similar Documents

Publication Publication Date Title
CN108062754B (en) Segmentation and identification method and device based on dense network image
US11676282B2 (en) Enhanced semantic segmentation of images
Zhang et al. Viscode: Embedding information in visualization images using encoder-decoder network
CN111932546A (en) Image segmentation model training method, image segmentation method, device, equipment and medium
US11676279B2 (en) Utilizing a segmentation neural network to process initial object segmentations and object user indicators within a digital image to generate improved object segmentations
CN110084172B (en) Character recognition method and device and electronic equipment
CN112560861A (en) Bill processing method, device, equipment and storage medium
WO2019226429A1 (en) Data compression by local entropy encoding
CN116665215A (en) Image salient region extraction method, device, computer equipment and storage medium
CN113343958B (en) Text recognition method, device, equipment and medium
US20220392025A1 (en) Restoring degraded digital images through a deep learning framework
CN114332150A (en) Handwriting erasing method, device, equipment and readable storage medium
KR20210040873A (en) Image filling method and apparatus, device, storage medium, and program
CN112053338A (en) Image decomposition method and related device and equipment
US11887277B2 (en) Removing compression artifacts from digital images and videos utilizing generative machine-learning models
CN113837965A (en) Image definition recognition method and device, electronic equipment and storage medium
CN113297986A (en) Handwritten character recognition method, device, medium and electronic equipment
CN111898638A (en) Image processing method, electronic device and medium fusing different visual tasks
CN115171023A (en) Style migration model training method, video processing method and related device
CN113643173A (en) Watermark removing method, watermark removing device, terminal equipment and readable storage medium
Grailu et al. An improved pattern matching technique for lossy/lossless compression of binary printed Farsi and Arabic textual images
CN111914654B (en) Text layout analysis method, device, equipment and medium
US20240161327A1 (en) Diffusion models having continuous scaling through patch-wise image generation
CN116912345B (en) Portrait cartoon processing method, device, equipment and storage medium
CN114187408B (en) Three-dimensional face model reconstruction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination