CN110390327B - Foreground extraction method and device, computer equipment and storage medium

Info

Publication number: CN110390327B
Application number: CN201910555311.7A
Authority: CN
Inventors: 徐彬彬; 胡勇波; 李曙鹏; 谢永康; 施恩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2022-06-28
Anticipated expiration: 2039-06-25
Also published as: CN110390327A

Abstract

The invention discloses a foreground extraction method, a foreground extraction device, computer equipment and a storage medium, wherein the method comprises the following steps: performing saliency detection on an original image to obtain a grayscale image corresponding to the original image; generating a Trimap image containing a foreground, a background and an unknown region according to the grayscale image; and, in combination with the original image, performing foreground-background segmentation on the unknown region in the Trimap image to obtain a MASK image containing only the foreground and the background. Applying the scheme of the invention saves labor and time cost and improves the accuracy of the processing result.

Description

Foreground extraction method and device, computer equipment and storage medium

[ technical field ]

The present invention relates to computer application technologies, and in particular, to a foreground extraction method and apparatus, a computer device, and a storage medium.

[ background of the invention ]

In the field of computer vision, foreground extraction has long been an active research problem and is applied in scenarios such as workpiece inspection and medical detection. With the development of deep learning technology, foreground extraction is also widely used for data augmentation in object detection and image segmentation.

The current foreground extraction mode mainly comprises an interactive annotation extraction mode, an image segmentation mode based on color and texture characteristics and the like.

The interactive annotation extraction method needs to manually mark a rough foreground and a background on an input image, and then a segmentation algorithm is used for segmenting an object from the background. However, this method requires a large cost of labor and time.

In an image segmentation mode based on color and texture features, input images are clustered and classified according to the color features or the texture features, and segmentation of a foreground and a background is achieved. But this approach requires significant color or texture differences between the foreground and background, otherwise the accuracy of the processing results may be low.

[ summary of the invention ]

In view of the above, the present invention provides a foreground extraction method, apparatus, computer device and storage medium.

The specific technical scheme is as follows:

a foreground extraction method, comprising:

performing saliency detection on an original image to obtain a grayscale image corresponding to the original image;

generating a Trimap image containing a foreground, a background and an unknown region according to the gray level image;

and, in combination with the original image, performing foreground-background segmentation on the unknown region in the Trimap image to obtain a mask (MASK) image containing only a foreground and a background.

According to a preferred embodiment of the present invention, the performing saliency detection on an original image to obtain a grayscale image corresponding to the original image includes:

performing saliency detection on the original image according to a first saliency detection mode to obtain a first gray image corresponding to the original image;

and performing saliency detection on the original image according to a second saliency detection mode to obtain a second grayscale image corresponding to the original image.

According to a preferred embodiment of the present invention, the performing saliency detection on the original image according to a first saliency detection manner to obtain a first grayscale image corresponding to the original image includes:

respectively obtaining image segmentation results of M different levels of the original image, wherein M is a positive integer greater than one;

respectively acquiring a corresponding saliency image for the image segmentation result of each level;

and fusing the M saliency images to obtain the first grayscale image.

According to a preferred embodiment of the present invention, the obtaining the image segmentation results of M different levels of the original image respectively includes:

segmenting the original image into K1 sub-regions by a graph-based image segmentation algorithm to obtain a first-layer image segmentation result, wherein K1 is a positive integer greater than one;

Obtaining an ith layer image segmentation result by the following method, wherein i is a positive integer which is greater than one and less than or equal to M: and carrying out region merging on the i-1 layer image segmentation result to obtain an i-layer image segmentation result, wherein the number of sub-regions contained in the i-layer image segmentation result is less than that of the sub-regions contained in the i-1 layer image segmentation result.

According to a preferred embodiment of the present invention, the obtaining, for the image segmentation result of each hierarchy, the corresponding saliency image respectively includes:

aiming at each sub-region in the image segmentation result of any layer, respectively performing the following processing:

respectively acquiring the regional contrast, regional attributes and regional background features of the sub-regions, and generating feature vectors according to the acquired information;

performing regression on the feature vector by using a random forest to obtain a saliency value of the sub-region, wherein the saliency value is 0 or 1;

generating a saliency image corresponding to the image segmentation result of the hierarchy, wherein the value of each pixel point in the saliency image is as follows: and the salient value of the subregion where the pixel point is located.

According to a preferred embodiment of the present invention, the fusing the M saliency images to obtain the first grayscale image includes:

assigning a value to each pixel point in the first grayscale image in the following manner:

adding the values of the pixel point in the M saliency images, dividing the sum by M, and assigning the obtained quotient to the pixel point in the first grayscale image.

According to a preferred embodiment of the present invention, the performing saliency detection on the original image according to a second saliency detection method to obtain a second grayscale image corresponding to the original image includes:

dividing the original image into N image blocks with the same size, wherein N is a positive integer larger than one;

respectively acquiring a super pixel area corresponding to each image block;

respectively obtaining a foreground and background classification result corresponding to each super pixel area;

and fusing the classification results of the foreground and the background to obtain the second gray image.

According to a preferred embodiment of the present invention, the respectively obtaining the super pixel areas corresponding to each image block includes:

aiming at any image block, randomly selecting a pixel point from the image block as a clustering center, and executing the following preset operations:

performing clustering operation according to the clustering center to obtain a pixel point cluster;

calculating the average value of the values of the pixels in the pixel cluster;

Taking a pixel point which is closest to the value of the mean value in an assigned subarea in a temporary area formed by the pixels in the pixel point cluster as an updated clustering center, wherein the assigned subarea is a preset-size area which takes the center point of the temporary area as the center;

and if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold value, taking the temporary region as a super pixel region corresponding to the image block, otherwise, repeating the preset operation according to the updated clustering center.

According to a preferred embodiment of the present invention, the respectively obtaining the foreground and background classification results corresponding to each super-pixel region includes:

for any super pixel area, the following processing is respectively carried out:

inputting the super-pixel region into a Convolutional Neural Network (CNN) classification model to obtain a first foreground-background classification result corresponding to the super-pixel region, wherein the first foreground-background classification result comprises a classification result of whether each pixel point in the super-pixel region belongs to a foreground or a background;

cropping, from the original image, an extended area which contains the super-pixel region and is larger than the super-pixel region, inputting the extended area into the CNN classification model, and obtaining a second foreground-background classification result corresponding to the super-pixel region, wherein the second foreground-background classification result contains a classification result of whether each pixel point in the extended area belongs to a foreground or a background;

and, each time classification is performed, assigning 1 to the pixel points classified as the foreground and 0 to the pixel points classified as the background.

According to a preferred embodiment of the present invention, the fusing the classification results of the foreground and the background to obtain the second grayscale image includes:

assigning values to each pixel point in the second gray image according to the following modes:

counting the times of the classification participation of the pixel points;

and adding the assignment of the pixel points each time the pixel points participate in classification, dividing the added sum by the times, and assigning the obtained quotient to the pixel points in the second gray scale image.
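
By way of non-limiting illustration, the per-pixel fusion described above may be sketched in Python as follows. The input layout (one dictionary per classification pass, mapping pixel coordinates to a 0/1 assignment), the function name fuse_classifications, and the choice to leave pixels that were never classified at 0 are assumptions made for this sketch and are not specified by the source.

```python
from collections import defaultdict


def fuse_classifications(passes, height, width):
    """Average the 0/1 assignments a pixel received over all passes that covered it.

    passes: list of dicts mapping (row, col) -> 0 or 1 (hypothetical layout).
    Returns a height x width grid of values in [0, 1].
    """
    sums = defaultdict(int)    # sum of the 0/1 assignments per pixel
    counts = defaultdict(int)  # number of times each pixel took part in classification
    for labels in passes:
        for pixel, value in labels.items():
            sums[pixel] += value
            counts[pixel] += 1

    # Assumption: pixels that never took part in any classification stay at 0.
    gray = [[0.0] * width for _ in range(height)]
    for (row, col), times in counts.items():
        gray[row][col] = sums[(row, col)] / times
    return gray
```

For instance, a pixel classified three times as foreground, background, and foreground would receive the value 2/3 under this sketch.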

According to a preferred embodiment of the present invention, before generating the Trimap image including the foreground, the background, and the unknown region according to the grayscale image, the method further includes:

generating a third gray image according to the first gray image and the second gray image;

assigning values to each pixel point in the third gray level image according to the following modes:

adding values of the pixel points in the first gray level image and the second gray level image, dividing the added sum by 2, and assigning the obtained quotient to the pixel points in the third gray level image.

According to a preferred embodiment of the present invention, the generating a Trimap image including a foreground, a background, and an unknown region according to the grayscale image includes:

comparing the value of each pixel point in the third gray image with a preset second threshold and a preset third threshold, wherein the second threshold is larger than the third threshold, if the value of the pixel point is larger than the second threshold, the pixel point is determined as a foreground, if the value of the pixel point is smaller than the third threshold, the pixel point is determined as a background, and if the value of the pixel point is larger than or equal to the third threshold and smaller than or equal to the second threshold, the pixel point is determined as an unknown area;

and generating the Trimap image according to the determination result of each pixel point.
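
By way of non-limiting illustration, the two-threshold comparison described above may be sketched in Python as follows. The numeric encoding of the three labels (255, 0, 128), the function name make_trimap, and the example threshold values are assumptions made for this sketch; the source only requires that the second threshold be larger than the third threshold.

```python
# Hypothetical label encoding for the three Trimap classes.
FOREGROUND, BACKGROUND, UNKNOWN = 255, 0, 128


def make_trimap(gray, second_threshold, third_threshold):
    """Label each pixel of a grayscale grid as foreground, background, or unknown."""
    if not second_threshold > third_threshold:
        raise ValueError("the second threshold must be larger than the third threshold")

    def classify(value):
        if value > second_threshold:
            return FOREGROUND
        if value < third_threshold:
            return BACKGROUND
        # third_threshold <= value <= second_threshold
        return UNKNOWN

    return [[classify(value) for value in row] for row in gray]
```

With thresholds of 0.7 and 0.3, for example, a pixel whose value is exactly 0.7 or exactly 0.3 falls into the unknown region, since both boundaries are inclusive for that class.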

A foreground extraction apparatus comprising: a first processing unit, a second processing unit and a third processing unit;

the first processing unit is used for carrying out significance detection on an original image to obtain a gray image corresponding to the original image;

the second processing unit is used for generating a Trimap image containing a foreground, a background and an unknown region according to the gray level image;

And the third processing unit is used for performing foreground-background segmentation on the unknown region in the Trimap image by combining the original image to obtain a MASK image only containing a foreground and a background.

According to a preferred embodiment of the present invention, the first processing unit performs saliency detection on the original image according to a first saliency detection method to obtain a first grayscale image corresponding to the original image, and performs saliency detection on the original image according to a second saliency detection method to obtain a second grayscale image corresponding to the original image.

According to a preferred embodiment of the present invention, the first processing unit respectively obtains image segmentation results of M different levels of the original image, where M is a positive integer greater than one, and respectively obtains corresponding saliency images for the image segmentation result of each level, and fuses the M saliency images to obtain the first grayscale image.

According to a preferred embodiment of the present invention, the first processing unit divides the original image into K1 sub-regions by a graph-based image segmentation algorithm to obtain a first-layer image segmentation result, where K1 is a positive integer greater than one, and obtains an ith-layer image segmentation result by: and carrying out region merging on the i-1 layer image segmentation result to obtain an i-layer image segmentation result, wherein the number of sub-regions contained in the i-layer image segmentation result is less than that of the sub-regions contained in the i-1 layer image segmentation result.

According to a preferred embodiment of the present invention, the first processing unit performs the following processing for each sub-region in the image segmentation result of any level: respectively acquiring the regional contrast, regional attributes and regional background features of the sub-region, and generating a feature vector according to the acquired information; performing regression on the feature vector by using a random forest to obtain a saliency value of the sub-region, wherein the saliency value is 0 or 1; generating a saliency image corresponding to the image segmentation result of the level, wherein the value of each pixel point in the saliency image is the saliency value of the sub-region where the pixel point is located.

According to a preferred embodiment of the present invention, the first processing unit assigns a value to each pixel point in the first grayscale image in the following manner: adding the values of the pixel point in the M saliency images, dividing the sum by M, and assigning the obtained quotient to the pixel point in the first grayscale image.

According to a preferred embodiment of the present invention, the first processing unit divides the original image into N image blocks with the same size, where N is a positive integer greater than one, respectively obtains a super-pixel region corresponding to each image block, respectively obtains a foreground-background classification result corresponding to each super-pixel region, and fuses the foreground-background classification results to obtain the second grayscale image.

According to a preferred embodiment of the present invention, the first processing unit randomly selects a pixel point from any image block as a clustering center, and performs the following predetermined operations: performing clustering operation according to the clustering center to obtain pixel point clusters; calculating the average value of the values of the pixels in the pixel cluster; taking a pixel point which is closest to the value of the mean value in an assigned subarea in a temporary area formed by the pixels in the pixel point cluster as an updated clustering center, wherein the assigned subarea is a preset-size area which takes the center point of the temporary area as the center; and if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold, taking the temporary region as a super-pixel region corresponding to the image block, otherwise, repeating the preset operation according to the updated clustering center.

According to a preferred embodiment of the present invention, the first processing unit performs the following processing for each super-pixel region: inputting the super-pixel region into a Convolutional Neural Network (CNN) classification model to obtain a first foreground-background classification result corresponding to the super-pixel region, wherein the first foreground-background classification result comprises a classification result of whether each pixel point in the super-pixel region belongs to a foreground or a background; cropping, from the original image, an extended area which contains the super-pixel region and is larger than the super-pixel region, inputting the extended area into the CNN classification model, and obtaining a second foreground-background classification result corresponding to the super-pixel region, wherein the second foreground-background classification result contains a classification result of whether each pixel point in the extended area belongs to a foreground or a background; and, each time classification is performed, assigning 1 to the pixel points classified as the foreground and 0 to the pixel points classified as the background.

According to a preferred embodiment of the present invention, the first processing unit assigns a value to each pixel point in the second gray scale image according to the following manners: counting the number of times that the pixel points participate in classification, adding the assignments of the pixel points each time the pixel points participate in classification, dividing the added sum by the number of times, and assigning the obtained quotient to the pixel points in the second gray level image.

According to a preferred embodiment of the present invention, the second processing unit is further configured to generate a third grayscale image according to the first grayscale image and the second grayscale image; wherein, assigning is performed for each pixel point in the third gray image according to the following modes: adding values of the pixel points in the first gray level image and the second gray level image, dividing the added sum by 2, and assigning the obtained quotient to the pixel points in the third gray level image.

According to a preferred embodiment of the present invention, the second processing unit compares, for each pixel point in the third grayscale image, a value of the pixel point with a preset second threshold and a preset third threshold, where the second threshold is greater than the third threshold, and if the value of the pixel point is greater than the second threshold, the pixel point is determined as a foreground, and if the value of the pixel point is less than the third threshold, the pixel point is determined as a background, and if the value of the pixel point is greater than or equal to the third threshold and less than or equal to the second threshold, the pixel point is determined as an unknown area; and generating the Trimap image according to the determination result of each pixel point.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.

As can be seen from the above description, the scheme of the invention can realize foreground extraction automatically, thereby saving labor and time cost; it places no requirement on the difference in color or texture between the foreground and the background, is applicable to various scenes, and achieves higher accuracy.

[ description of the drawings ]

Fig. 1 is a flowchart of an embodiment of the foreground extraction method according to the present invention.

Fig. 2 is a schematic diagram of an overall implementation process of the foreground extraction method of the present invention.

Fig. 3 is a schematic structural diagram of a foreground extracting apparatus according to an embodiment of the present invention.

FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.

[ detailed description ]

In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.

It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

Fig. 1 is a flowchart of an embodiment of a foreground extraction method according to the present invention. As shown in fig. 1, the following detailed implementation is included.

In 101, saliency detection is performed on the original image to obtain a grayscale image corresponding to the original image.

In 102, a trimap (Trimap) image containing a foreground, a background, and an unknown region is generated from the grayscale image.

In 103, in combination with the original image, foreground-background segmentation is performed on the unknown region in the Trimap image to obtain a mask (MASK) image containing only the foreground and the background.
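
By way of non-limiting illustration only, the idea in 103 of resolving the unknown region with the help of the original image may be sketched in Python as follows. This section does not specify which segmentation algorithm is used; the nearest-mean-color rule below is a stand-in chosen purely for illustration, and the function name resolve_unknown and the label encoding (255 for foreground, 0 for background, any other value for unknown) are assumptions of the sketch.

```python
def resolve_unknown(image, trimap, fg=255, bg=0):
    """Turn a Trimap into a two-class mask using colors from the original image.

    image:  2D grid of (r, g, b) tuples (the original image).
    trimap: 2D grid of labels; anything other than fg or bg counts as unknown.
    Stand-in rule (assumption): an unknown pixel joins the class whose mean
    color, computed over the known pixels of that class, is closer to its own.
    """
    def mean_color(label):
        pixels = [image[r][c]
                  for r, row in enumerate(trimap)
                  for c, value in enumerate(row) if value == label]
        if not pixels:
            raise ValueError("the Trimap needs at least one known pixel per class")
        return tuple(sum(channel) / len(pixels) for channel in zip(*pixels))

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    fg_mean, bg_mean = mean_color(fg), mean_color(bg)
    mask = []
    for r, row in enumerate(trimap):
        out_row = []
        for c, value in enumerate(row):
            if value in (fg, bg):
                out_row.append(value)  # known pixels keep their label
            else:
                pixel = image[r][c]
                closer_to_fg = (squared_distance(pixel, fg_mean)
                                <= squared_distance(pixel, bg_mean))
                out_row.append(fg if closer_to_fg else bg)
        mask.append(out_row)
    return mask
```

The resulting grid contains only the two known labels, which matches the stated property of the MASK image; a practical implementation would likely use a stronger segmentation or matting technique in place of the stand-in rule.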

In this embodiment, two saliency detection modes may be adopted, that is, the saliency detection may be performed on the original image according to the first saliency detection mode to obtain a first grayscale image corresponding to the original image, and the saliency detection may be performed on the original image according to the second saliency detection mode to obtain a second grayscale image corresponding to the original image.

The first saliency detection mode may be saliency detection based on region features, and the second saliency detection mode may be multi-background saliency detection. The two saliency detection modes are described below.

One) region feature-based saliency detection

The original image is typically a red, green, and blue (RGB) image. In this saliency detection mode, an original image containing the object to be detected is input, and a grayscale image representing the saliency of the object is output. To distinguish it from the grayscale images that appear later, the grayscale image obtained here is called the first grayscale image.

The method mainly comprises the following steps:

1) multi-level image segmentation

The image segmentation results of M different levels of the original image can be obtained respectively, where M is a positive integer greater than one, and a specific value can be determined according to actual needs, for example, can be 3.

Specifically, the original image may be first segmented into K1 sub-regions by using an existing graph-based image segmentation algorithm, so as to obtain a first-layer image segmentation result, where K1 is a positive integer greater than one, and a specific value may also be determined according to an actual need.

Then, the image segmentation result of the ith layer can be obtained by the following methods respectively, wherein i is a positive integer greater than one and less than or equal to M: and carrying out region merging on the i-1 layer image segmentation result to obtain an i-layer image segmentation result, wherein the number of sub-regions contained in the i-layer image segmentation result is less than that of the sub-regions contained in the i-1 layer image segmentation result.

For example, the first-layer image segmentation result is $S_1 = \{R^1_1, R^1_2, \ldots, R^1_{K1}\}$, where R denotes one sub-region and there are K1 sub-regions in total. Assuming that M takes a value of 3, region merging can be performed on the sub-regions $\{R^1_1, R^1_2, \ldots, R^1_{K1}\}$ of $S_1$ to obtain the next layer, namely the second-layer image segmentation result $S_2 = \{R^2_1, R^2_2, \ldots, R^2_{K2}\}$, where K2 is smaller than K1. Region merging can then be performed on the sub-regions $\{R^2_1, R^2_2, \ldots, R^2_{K2}\}$ of $S_2$ to obtain the third-layer image segmentation result $S_3 = \{R^3_1, R^3_2, \ldots, R^3_{K3}\}$, where K3 is smaller than K2.

The specific region merging manner is not limited. Taking the sub-regions $\{R^1_1, R^1_2, \ldots, R^1_{K1}\}$ of $S_1$ as an example, for each sub-region, the mean of the pixel values in the sub-region and the mean of the pixel values in each sub-region adjacent to it may be calculated; if the difference between the mean of an adjacent sub-region and the mean of the sub-region is small (for example, smaller than a preset merging threshold), the adjacent sub-region and the sub-region may be merged.

The specific value of the merging threshold can be determined according to actual needs. In addition, the next-layer image segmentation result can be obtained by increasing the merging threshold; that is, the merging threshold used when merging the sub-regions $\{R^1_1, R^1_2, \ldots, R^1_{K1}\}$ of $S_1$ is smaller than the merging threshold used when merging the sub-regions $\{R^2_1, R^2_2, \ldots, R^2_{K2}\}$ of $S_2$.

By the method, a plurality of image segmentation results of different levels can be obtained.
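
By way of non-limiting illustration, the layer-by-layer merging with an increasing merging threshold may be sketched in Python as follows. The region adjacency representation (dictionaries of per-region means and pixel counts plus a list of adjacent region pairs with integer identifiers), the function names merge_regions and build_levels, and the use of a union-find structure are assumptions of this sketch; the initial graph-based segmentation is taken as given.

```python
def merge_regions(means, sizes, edges, threshold):
    """One merging pass: join adjacent regions whose mean values differ by less than threshold.

    means, sizes: dicts region_id -> mean pixel value / pixel count.
    edges: iterable of (a, b) pairs of adjacent region ids (integers).
    Returns (new_means, new_sizes, new_edges) for the next layer.
    """
    parent = {region: region for region in means}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        if abs(means[a] - means[b]) < threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)

    new_sizes, totals = {}, {}
    for region in means:
        root = find(region)
        new_sizes[root] = new_sizes.get(root, 0) + sizes[region]
        totals[root] = totals.get(root, 0.0) + means[region] * sizes[region]
    # The mean of a merged region is the size-weighted mean of its parts.
    new_means = {root: totals[root] / new_sizes[root] for root in totals}
    new_edges = sorted({(min(find(a), find(b)), max(find(a), find(b)))
                        for a, b in edges if find(a) != find(b)})
    return new_means, new_sizes, new_edges


def build_levels(means, sizes, edges, thresholds):
    """Layer 1 is the input segmentation; each threshold yields one further layer."""
    levels = [(means, sizes, edges)]
    for threshold in thresholds:
        levels.append(merge_regions(*levels[-1], threshold))
    return levels
```

Passing an increasing list of thresholds, such as two values for M = 3, mirrors the description above. Note that a pass in this sketch merges nothing if no adjacent pair falls under its threshold, so the thresholds would need to be chosen such that each layer actually contains fewer sub-regions than the previous one.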

2) Multi-level region saliency calculation

And aiming at the image segmentation result of each layer, respectively acquiring the corresponding saliency image.

Specifically, for each sub-region in the image segmentation result of any level, the following processing can be performed: respectively acquiring the regional contrast, regional attributes and regional background features of the sub-region, and generating a feature vector according to the acquired information; performing regression on the feature vector by using a random forest to obtain a saliency value of the sub-region, wherein the saliency value is 0 or 1 (if the value range of the grayscale image is represented by 0-255, 1 corresponds to 255; the same applies hereinafter); and generating a saliency image corresponding to the image segmentation result of the level, wherein the value of each pixel point in the saliency image is the saliency value of the sub-region where the pixel point is located.

For example, for each sub-region in the first-layer image segmentation result $S_1$, the regional contrast, regional attributes, and regional background features of the sub-region may first be obtained. Contrast is represented by the difference in brightness within an image, and how to acquire it is prior art. The regional attribute may refer to the position of the sub-region in the original image; for example, taking the upper left corner of the original image as the coordinate origin, the coordinate information of the sub-region is obtained. The regional background feature indicates whether the sub-region belongs to the background; the sub-region can be input into a recognition model obtained through pre-training, and the regional background feature output by the model is obtained. The regional contrast, regional attributes and regional background features can each be mapped into a vector according to a predetermined rule, and the three vectors can be spliced (for example, connected end to end) to obtain the required feature vector. Further, a random forest may be used to perform regression on the feature vector, so as to obtain the saliency value $a = f(x)$ of the sub-region, where x represents the feature vector and f(x) represents performing regression on the feature vector with the random forest; the obtained saliency value may be 0 or 1. Then, a saliency image corresponding to the first-layer image segmentation result can be generated according to the saliency value of each sub-region, wherein the value of each pixel point in the saliency image is the saliency value of the sub-region where the pixel point is located. That is to say, the saliency image may be a binary image in which the value of each pixel point is either 0 or 1, and the pixel points belonging to the same sub-region all take the saliency value of that sub-region. 1 denotes foreground and 0 denotes background.

In the above manner, the saliency images $\{A_1, A_2, \ldots, A_M\}$ respectively corresponding to the M image segmentation results $\{S_1, S_2, \ldots, S_M\}$ of different levels can be obtained, where $A_1$ is the saliency image corresponding to the first-layer image segmentation result $S_1$, $A_2$ is the saliency image corresponding to the second-layer image segmentation result $S_2$, and so on.
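
By way of non-limiting illustration, the per-level computation may be sketched in Python as follows: the three groups of features are spliced end to end, a regressor maps each feature vector to a saliency value, and every pixel takes the value of its sub-region. The trained random forest is represented here by an arbitrary callable named regressor; that callable, the function names, and the label-map layout are assumptions of this sketch, and no particular library interface is implied.

```python
def build_feature_vector(contrast, attributes, background):
    """Splice the three per-region feature vectors end to end."""
    return list(contrast) + list(attributes) + list(background)


def saliency_image(label_map, region_features, regressor):
    """Build the saliency image of one level.

    label_map:       2D grid giving the sub-region id of every pixel.
    region_features: dict region_id -> (contrast, attributes, background) vectors.
    regressor:       callable mapping a feature vector to 0 or 1; stands in for
                     the trained random forest (assumption of this sketch).
    """
    values = {region: regressor(build_feature_vector(*parts))
              for region, parts in region_features.items()}
    # Every pixel takes the saliency value of the sub-region it belongs to.
    return [[values[label] for label in row] for row in label_map]
```

A trivial stand-in such as `lambda x: 1 if x[0] > 0.5 else 0` is enough to exercise the sketch; it is not meant to approximate the behavior of a trained model.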

3) Multi-level significance fusion

The M saliency images can be fused to obtain a first gray image.

The saliency images $\{A_1, A_2, \ldots, A_M\}$ can be linearly combined to obtain the final required first grayscale image representing the saliency of the object.

A value can be assigned to each pixel point in the first grayscale image in the following manner: adding the values of the pixel point in the M saliency images, dividing the sum by M, and assigning the obtained quotient to the pixel point in the first grayscale image.

Assuming that the value of M is 3, the value of each pixel point in the first grayscale image is one of 0/3, 1/3, 2/3, and 3/3. Taking 1/3 as an example, it indicates that the value of the pixel point is 1 in one saliency image and 0 in the other two saliency images.
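
By way of non-limiting illustration, the equal-weight combination may be sketched in Python as follows. The function name fuse_saliency and the use of exact fractions (so that values such as 1/3 are represented without rounding) are assumptions of this sketch.

```python
from fractions import Fraction


def fuse_saliency(images):
    """Average M binary saliency images pixel by pixel.

    images: list of M grids of 0/1 values, all of the same shape.
    Returns a grid of exact fractions k/M, where k is the number of
    saliency images in which the pixel has the value 1.
    """
    m = len(images)
    if m == 0:
        raise ValueError("at least one saliency image is required")
    height, width = len(images[0]), len(images[0][0])
    return [[Fraction(sum(image[r][c] for image in images), m)
             for c in range(width)]
            for r in range(height)]
```

With M = 3, every output value of this sketch is one of 0, 1/3, 2/3, and 1, in line with the example above.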

Two) multiple background saliency detection

In this mode, super-pixel segmentation may be performed first: the original image is divided into N image blocks of the same size, where N is a positive integer greater than one whose specific value can be determined according to actual needs, and the super-pixel region corresponding to each image block is then obtained.

For any image block, the manner of acquiring the corresponding super pixel area may be: randomly selecting a pixel point in the image block as a clustering center, and executing the following preset operations: performing clustering operation according to the clustering center to obtain pixel point clusters; calculating the average value of the values of the pixels in the pixel cluster; taking the pixel point which is closest to the value of the mean value in an assigned subarea in a temporary area formed by the pixel points in the pixel point cluster as an updated clustering center, wherein the assigned subarea can be a preset size area which takes the central point of the temporary area as the center; if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold, the temporary region can be used as the super-pixel region corresponding to the image block, otherwise, the preset operation can be repeated according to the updated clustering center.

The size of the temporary area can be determined according to actual needs, and the specific value of the first threshold can also be determined according to actual needs.

It can be seen that the clustering center is dynamically changed, and is initially a pixel point randomly selected from the image block, and then is a pixel point determined according to the mean value of the values of the pixel points in the pixel point cluster obtained by clustering each time. If the distance between the clustering centers obtained in two times is smaller than the first threshold, the iteration can be stopped, and the newly obtained temporary region is used as the required super-pixel region.

The image block has a rectangular shape, but the corresponding super-pixel region does not have to be rectangular, and may have an arbitrary shape.

According to the mode, the super pixel regions corresponding to each image block can be obtained respectively, and therefore N super pixel regions are obtained in total.
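The iterative center update can be sketched as below. The source does not specify the clustering operation, so this sketch assumes, purely for illustration, that a cluster consists of the pixels of the image block whose value lies within a tolerance of the center's value. The names `grow_superpixel`, `value_tol` and `sub_radius`, and the `max_iters` safety cap, are hypothetical and do not come from the source.

```python
import math
import random
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def grow_superpixel(
    image: List[List[float]],
    block: Tuple[int, int, int, int],  # (top, left, height, width)
    first_threshold: float = 1.0,
    value_tol: float = 10.0,   # assumed clustering tolerance
    sub_radius: float = 1.0,   # assumed half-size of the assigned subarea
    max_iters: int = 20,       # safety cap, not part of the source
    rng: Optional[random.Random] = None,
) -> Set[Pixel]:
    rng = rng or random.Random(0)
    top, left, height, width = block
    # Initial clustering center: a random pixel of the image block.
    center = (rng.randrange(top, top + height), rng.randrange(left, left + width))
    region: Set[Pixel] = set()
    for _ in range(max_iters):
        center_value = image[center[0]][center[1]]
        # Assumed clustering step: block pixels with a value close to the center's.
        region = {
            (r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)
            if abs(image[r][c] - center_value) <= value_tol
        }
        mean = sum(image[r][c] for r, c in region) / len(region)
        # Assigned subarea: a window around the center point of the temporary region.
        mid_r = sum(r for r, _ in region) / len(region)
        mid_c = sum(c for _, c in region) / len(region)
        candidates = sorted(
            p for p in region
            if abs(p[0] - mid_r) <= sub_radius and abs(p[1] - mid_c) <= sub_radius
        ) or sorted(region)
        # Updated center: the candidate whose value is closest to the mean.
        new_center = min(candidates, key=lambda p: abs(image[p[0]][p[1]] - mean))
        moved = math.dist(center, new_center)
        center = new_center
        if moved < first_threshold:
            break  # center has stabilized; the temporary region is the result
    return region


# Toy 4x4 image: dark left half, bright right half.
toy = [[10, 10, 200, 200] for _ in range(4)]
superpixel = grow_superpixel(toy, (0, 0, 4, 4))
```

On the toy image the returned region is one of the two uniform halves, which shows that the result need not coincide with the rectangular image block.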

And then, respectively obtaining a foreground and background classification result corresponding to each super-pixel area, and fusing the foreground and background classification results to obtain a second gray image.

For any super pixel region, the following processes can be respectively carried out: inputting the super-pixel region into a Convolutional Neural Network (CNN) classification model to obtain a first foreground-background classification result corresponding to the super-pixel region, wherein the first foreground-background classification result comprises a classification result of whether each pixel point in the super-pixel region belongs to a foreground or a background; and intercepting an extended area which contains the super pixel area and is larger than the super pixel area from the original image, inputting the extended area into the CNN classification model, and obtaining a second foreground-background classification result corresponding to the super pixel area, wherein the second foreground-background classification result contains a classification result of whether each pixel point in the extended area belongs to a foreground or a background.

In order to avoid interference of other objects in the image on the object to be detected (the subject object), each super-pixel region may be expanded (padding) respectively, that is, an expanded region including the super-pixel region and larger than the super-pixel region is extracted from the original image, and the specific size of the expanded region may be determined according to actual needs, for example, the center point of the super-pixel region may be used as the center, the super-pixel region is expanded by 2 times or 4 times in the original image, and the expanded region is used as a required expanded region and sent to the CNN classification model. The CNN classification model can be trained according to the prior art.
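One way to compute the extended region is sketched below. It assumes that "expanded by 2 times" refers to the side lengths of the super-pixel region's bounding box and that the result is clipped to the image bounds. Both are interpretations rather than statements from the source, and `extended_region_box` is a hypothetical helper name.

```python
import math
from typing import Iterable, Tuple

Pixel = Tuple[int, int]  # (row, col)


def extended_region_box(
    region: Iterable[Pixel],
    image_height: int,
    image_width: int,
    scale: float = 2.0,
) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), bottom/right exclusive, of the extended region."""
    pixels = list(region)
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    top, bottom = min(rows), max(rows) + 1
    left, right = min(cols), max(cols) + 1
    # Scale the bounding box about its center point.
    center_r, center_c = (top + bottom) / 2, (left + right) / 2
    half_h = (bottom - top) * scale / 2
    half_w = (right - left) * scale / 2
    # Clip to the image so the crop stays inside the original image.
    return (
        max(0, math.floor(center_r - half_h)),
        max(0, math.floor(center_c - half_w)),
        min(image_height, math.ceil(center_r + half_h)),
        min(image_width, math.ceil(center_c + half_w)),
    )


inner = {(4, 4), (4, 5), (5, 4), (5, 5)}
box = extended_region_box(inner, image_height=10, image_width=10, scale=2.0)
corner_box = extended_region_box({(0, 0)}, image_height=10, image_width=10, scale=4.0)
```

The crop defined by the returned box would then be passed to the CNN classification model alongside the super-pixel region itself.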

In this way, two foreground-background classification results, i.e., a first foreground-background classification result and a second foreground-background classification result, can be obtained for each super-pixel region. Accordingly, 2N foreground-background classification results can be obtained for N super-pixel regions. Each time classification is performed, the pixel points in the region that are classified as foreground are assigned 1, and the pixel points classified as background are assigned 0.

The 2N foreground-background classification results can be fused to obtain a second grayscale image. Each pixel point in the second grayscale image can be assigned a value as follows: count the number of times the pixel point participates in classification; add the assignments of the pixel point over all the classifications in which it participates, divide the sum by the counted number of times, and assign the quotient to the pixel point in the second grayscale image.

For example, suppose a certain pixel point belongs to super-pixel region a, and therefore also to extended region a of super-pixel region a, and that it additionally belongs to extended region b of super-pixel region b and extended region c of super-pixel region c. If the pixel point is classified as foreground in super-pixel region a and as background in extended regions a, b and c, then the number of times it participates in classification is 4, and adding its assignments over these classifications gives 1+0+0+0 = 1, so 1/4 can be assigned to the pixel point in the second grayscale image. When the value range of the grayscale image is expressed as 0 to 255, 1/4 corresponds to 255/4.
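The fusion can be sketched as a per-pixel sum divided by a per-pixel count, consistent with the worked example that yields 1/4. Representing each classification result as a mapping from pixel coordinates to a 0/1 label, and assigning 0 to a pixel that never takes part in any classification, are assumptions of this sketch. `fuse_classifications` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # (row, col)


def fuse_classifications(
    results: List[Dict[Pixel, int]], height: int, width: int
) -> List[List[float]]:
    """Fuse per-region 0/1 labels into a grayscale image with values in [0, 1]."""
    total = [[0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for result in results:
        for (r, c), label in result.items():
            total[r][c] += label  # 1 for foreground, 0 for background
            count[r][c] += 1      # one more participation for this pixel
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]


# The pixel (0, 0) from the example: foreground in super-pixel region a,
# background in extended regions a, b and c.
results = [
    {(0, 0): 1},  # super-pixel region a
    {(0, 0): 0},  # extended region a
    {(0, 0): 0},  # extended region b
    {(0, 0): 0},  # extended region c
]
second_gray = fuse_classifications(results, height=1, width=1)
```

Here `second_gray[0][0]` is 1/4, the value derived in the example.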

After the first gray level image and the second gray level image are obtained, a third gray level image can be further generated according to the first gray level image and the second gray level image, for example, the first gray level image and the second gray level image can be linearly fused, and effective complementation of two significance detection modes can be realized through a fusion mode.

The assignment can be performed for each pixel point in the third gray image respectively according to the following modes: adding the values of the pixel point in the first gray level image and the second gray level image, dividing the added sum by 2, and assigning the obtained quotient to the pixel point in the third gray level image.
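Written as a formula, with G₁, G₂ and G₃ denoting the first, second and third grayscale images and p a pixel point (this notation is introduced here only for illustration), the fusion is a per-pixel average. The numeric case reuses the example values 1/3 and 1/4 from the two detection modes.

```latex
G_3(p) = \frac{G_1(p) + G_2(p)}{2},
\qquad \text{e.g. } G_1(p) = \tfrac{1}{3},\; G_2(p) = \tfrac{1}{4}
\;\Rightarrow\; G_3(p) = \tfrac{7}{24}
```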

Then, a Trimap image including the foreground, the background and the unknown region can be generated according to the third grayscale image, and the Trimap image can be obtained by separating the foreground, the background and the unknown region in a threshold segmentation manner, for example.

Specifically, for each pixel point in the third grayscale image, the value of the pixel point may be compared with a preset second threshold and a preset third threshold, where the second threshold is greater than the third threshold, and if the value of the pixel point is greater than the second threshold, the pixel point may be determined as a foreground, and if the value of the pixel point is less than the third threshold, the pixel point may be determined as a background, and if the value of the pixel point is greater than or equal to the third threshold and less than or equal to the second threshold, the pixel point may be determined as an unknown area; a Trimap image can be generated according to the determination result of each pixel point. The specific values of the second threshold and the third threshold can be determined according to actual needs.

The pixel points in the Trimap image take only three values. For example, pixel points belonging to the foreground are 1, pixel points belonging to the background are 0, and pixel points belonging to the unknown region take a value between 0 and 1, corresponding to the three colors white, black and gray, respectively.

Further, foreground-background segmentation, i.e., foreground extraction, can be performed on the unknown region in the Trimap image in combination with the original image, thereby obtaining a mask (MASK) image containing only foreground and background.
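The threshold segmentation can be sketched as below. The default thresholds 0.7 and 0.3 and the value 0.5 for the unknown region are placeholders chosen for illustration, since the source leaves the thresholds to actual needs and only says the unknown value lies between 0 and 1. `make_trimap` is a hypothetical name.

```python
from typing import List


def make_trimap(
    gray: List[List[float]],
    second_threshold: float = 0.7,  # placeholder value
    third_threshold: float = 0.3,   # placeholder value
    unknown_value: float = 0.5,     # placeholder value between 0 and 1
) -> List[List[float]]:
    """Map a grayscale image in [0, 1] to foreground (1), background (0) or unknown."""
    if second_threshold <= third_threshold:
        raise ValueError("the second threshold must be greater than the third")

    def classify(value: float) -> float:
        if value > second_threshold:
            return 1.0  # foreground (white)
        if value < third_threshold:
            return 0.0  # background (black)
        return unknown_value  # unknown region (gray), thresholds inclusive

    return [[classify(v) for v in row] for row in gray]


trimap = make_trimap([[0.9, 0.7, 0.5, 0.3, 0.1]])
```

Values exactly equal to either threshold fall into the unknown region, as stated in the comparison rule.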

The unknown region in the Trimap image can be segmented into foreground and background by using an image foreground extraction method, such as the conventional learning-based LBDM algorithm, so that the finally required MASK image is obtained.
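To illustrate only the input and output of this step, the sketch below resolves unknown pixels with a deliberately naive stand-in: each unknown pixel is assigned to whichever of the known foreground or known background has the closer mean intensity. This is not the LBDM algorithm and is not taken from the source. The function name `resolve_unknown` and the single-channel image are assumptions.

```python
from typing import List


def resolve_unknown(
    original: List[List[float]], trimap: List[List[float]]
) -> List[List[int]]:
    """Turn a trimap into a binary mask using mean intensities (naive stand-in)."""
    fg = [o for o_row, t_row in zip(original, trimap)
          for o, t in zip(o_row, t_row) if t == 1.0]
    bg = [o for o_row, t_row in zip(original, trimap)
          for o, t in zip(o_row, t_row) if t == 0.0]
    if not fg or not bg:
        raise ValueError("trimap needs both known foreground and known background")
    fg_mean, bg_mean = sum(fg) / len(fg), sum(bg) / len(bg)

    def decide(value: float, label: float) -> int:
        if label == 1.0:
            return 1
        if label == 0.0:
            return 0
        # Unknown pixel: pick the class with the closer mean intensity.
        return 1 if abs(value - fg_mean) <= abs(value - bg_mean) else 0

    return [
        [decide(o, t) for o, t in zip(o_row, t_row)]
        for o_row, t_row in zip(original, trimap)
    ]


mask = resolve_unknown([[200, 190, 60, 50]], [[1.0, 0.5, 0.5, 0.0]])
```

The resulting mask contains only 0 and 1, which is the property the MASK image is required to have. A real matting method would also produce better boundaries than this stand-in.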

In summary, fig. 2 is a schematic diagram of an overall implementation process of the foreground extraction method according to the present invention. As shown in fig. 2, for an original image to be processed, saliency detection based on regional features and multi-background saliency detection may be performed on the original image to be processed, so as to obtain a first grayscale image and a second grayscale image, respectively, and then image fusion may be performed on the first grayscale image and the second grayscale image, so as to obtain a third grayscale image, and for the third grayscale image, a Trimap image may be generated by threshold segmentation, and then, in combination with the original image, foreground extraction may be performed on the Trimap image, so as to obtain a final required MASK image.

It should be noted that, for simplicity of description, the aforementioned method embodiments are described as a series of combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.

In short, the scheme of the embodiment of the method can automatically realize foreground extraction, thereby saving labor and time cost, has no requirement on the difference of the color or texture of the foreground and the background, can be suitable for various scenes, and has higher accuracy and the like.

The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.

Fig. 3 is a schematic structural diagram of a foreground extraction apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes: a first processing unit 301, a second processing unit 302 and a third processing unit 303.

The first processing unit 301 is configured to perform saliency detection on an original image to obtain a grayscale image corresponding to the original image.

The second processing unit 302 is configured to generate a Trimap image including a foreground, a background, and an unknown region according to the grayscale image.

The third processing unit 303 is configured to perform foreground-background segmentation on the unknown region in the Trimap image by combining the original image, so as to obtain a MASK image only including a foreground and a background.
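The division into three units can be sketched as a small class that chains three callables. The class name `ForegroundExtractor`, its field names, and the toy stand-in functions are hypothetical. The sketch shows only the data flow between units 301, 302 and 303, not their internals.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]


@dataclass
class ForegroundExtractor:
    detect_saliency: Callable[[Image], Image]         # role of first processing unit 301
    build_trimap: Callable[[Image], Image]            # role of second processing unit 302
    segment_unknown: Callable[[Image, Image], Image]  # role of third processing unit 303

    def extract(self, original: Image) -> Image:
        gray = self.detect_saliency(original)          # original image -> grayscale image
        trimap = self.build_trimap(gray)               # grayscale image -> Trimap image
        return self.segment_unknown(original, trimap)  # Trimap + original -> MASK image


# Toy stand-ins, for illustration only.
extractor = ForegroundExtractor(
    detect_saliency=lambda img: img,
    build_trimap=lambda gray: [
        [1.0 if v > 0.7 else 0.0 if v < 0.3 else 0.5 for v in row] for row in gray
    ],
    segment_unknown=lambda img, tri: [
        [1.0 if t == 1.0 or (t == 0.5 and o >= 0.5) else 0.0
         for o, t in zip(o_row, t_row)]
        for o_row, t_row in zip(img, tri)
    ],
)
toy_mask = extractor.extract([[0.9, 0.6, 0.4, 0.1]])
```

Keeping the three stages as separate callables mirrors the statement that the unit division is only a logical one and that other divisions are possible in practice.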

The first processing unit 301 may perform saliency detection on the original image according to a first saliency detection manner to obtain a first grayscale image corresponding to the original image, and may perform saliency detection on the original image according to a second saliency detection manner to obtain a second grayscale image corresponding to the original image.

In the first saliency detection manner, the first processing unit 301 may respectively obtain image segmentation results of M different levels of the original image, where M is a positive integer greater than one, and for the image segmentation result of each level, may respectively obtain corresponding saliency images, and may fuse the M saliency images, thereby obtaining the first grayscale image.

Specifically, the first processing unit 301 may segment the original image into K1 sub-regions by a graph-based image segmentation algorithm, resulting in a first-layer image segmentation result, K1 being a positive integer greater than one, and may obtain an ith-layer image segmentation result by: and carrying out region merging on the i-1 layer image segmentation result to obtain an i-layer image segmentation result, wherein the number of sub-regions contained in the i-layer image segmentation result is less than that of the sub-regions contained in the i-1 layer image segmentation result.

The first processing unit 301 may perform the following processing for each sub-region in the image segmentation result of any layer: acquiring the regional contrast, regional attributes and regional background features of the sub-region, and generating a feature vector from the acquired information; performing regression on the feature vector using a random forest to obtain a saliency value of the sub-region, where the saliency value is 0 or 1; and generating a saliency image corresponding to the image segmentation result of the layer, where the value of each pixel point in the saliency image is the saliency value of the sub-region in which the pixel point is located.

The first processing unit 301 may assign a value to each pixel point in the first grayscale image as follows: adding the values of the pixel point in the M saliency images, dividing the sum by M, and assigning the quotient to the pixel point in the first grayscale image.

In the second saliency detection manner, the first processing unit 301 may divide the original image into N image blocks with the same size, where N is a positive integer greater than one, respectively obtain the super pixel regions corresponding to each image block, respectively obtain the foreground-background classification results corresponding to each super pixel region, and fuse the foreground-background classification results to obtain the second gray scale image.

Specifically, the first processing unit 301 may randomly select a pixel point from any image block as a clustering center, and perform the following predetermined operations: performing clustering operation according to the clustering center to obtain pixel point clusters; calculating the average value of the values of the pixels in the pixel cluster; taking the pixel point which is closest to the value of the mean value in an assigned subarea in a temporary area formed by the pixel points in the pixel point cluster as an updated clustering center, wherein the assigned subarea can be a preset size area which takes the central point of the temporary area as the center; if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold, the temporary region can be used as the super-pixel region corresponding to the image block, otherwise, the preset operation can be repeated according to the updated clustering center.

The first processing unit 301 may perform the following processing for any super-pixel region: inputting the super-pixel region into a CNN classification model to obtain a first foreground-background classification result corresponding to the super-pixel region, where the first foreground-background classification result includes a classification result of whether each pixel point in the super-pixel region belongs to the foreground or the background; and intercepting, from the original image, an extended region that contains the super-pixel region and is larger than the super-pixel region, and inputting the extended region into the CNN classification model to obtain a second foreground-background classification result corresponding to the super-pixel region, where the second foreground-background classification result includes a classification result of whether each pixel point in the extended region belongs to the foreground or the background. Each time classification is performed, the pixel points in the region that are classified as foreground are assigned 1, and the pixel points classified as background are assigned 0.

The first processing unit 301 may assign a value to each pixel point in the second grayscale image as follows: counting the number of times the pixel point participates in classification, adding the assignments of the pixel point over all the classifications in which it participates, dividing the sum by the counted number of times, and assigning the quotient to the pixel point in the second grayscale image.

After the first grayscale image and the second grayscale image are obtained, the second processing unit 302 can further generate a third grayscale image from the first grayscale image and the second grayscale image. Each pixel point in the third grayscale image can be assigned a value as follows: adding the values of the pixel point in the first grayscale image and the second grayscale image, dividing the sum by 2, and assigning the quotient to the pixel point in the third grayscale image.

The second processing unit 302 may further generate a Trimap image including the foreground, the background, and the unknown region according to the third grayscale image, for example, the separation of the foreground, the background, and the unknown region may be implemented in a threshold segmentation manner, so as to obtain the Trimap image.

Specifically, the second processing unit 302 may compare, for each pixel point in the third grayscale image, a value of the pixel point with a preset second threshold and a preset third threshold, where the second threshold is greater than the third threshold, if the value of the pixel point is greater than the second threshold, the pixel point is determined as a foreground, if the value of the pixel point is less than the third threshold, the pixel point is determined as a background, and if the value of the pixel point is greater than or equal to the third threshold and is less than or equal to the second threshold, the pixel point is determined as an unknown area; and generating a Trimap image according to the determination result of each pixel point.

Then, the third processing unit 303 may perform foreground-background segmentation on the unknown region in the Trimap image in combination with the original image, so as to obtain a MASK image containing only foreground and background.

For a specific work flow of the apparatus embodiment shown in fig. 3, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 4 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.

As shown in FIG. 4, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the invention as described.

The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 4, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processor 16 executes various functional applications and data processing, such as implementing the method in the embodiment shown in fig. 1, by executing programs stored in the memory 28.

The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiment shown in fig. 1.

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A foreground extraction method, comprising:

performing saliency detection on an original image to obtain a grayscale image corresponding to the original image, including: dividing the original image into N image blocks with the same size, wherein N is a positive integer larger than one; respectively acquiring a super pixel area corresponding to each image block, wherein the super pixel area is an area with any shape, including: for any image block, randomly selecting a pixel point from the image block as a clustering center, and executing the following preset operations: performing clustering operation according to the clustering center to obtain a pixel point cluster; calculating the average value of the values of the pixels in the pixel cluster; determining an updated clustering center according to the average value; if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold, taking a temporary region formed by the pixel points in the pixel point cluster as a super pixel region corresponding to the image block, otherwise, repeating the preset operation according to the updated clustering center; respectively obtaining a foreground and background classification result corresponding to each super pixel area; fusing the classification results of the foreground and the background to obtain a second gray image;

Generating a Trimap image containing a foreground, a background and an unknown region according to the gray image;

and combining the original image, performing foreground-background segmentation on the unknown region in the Trimap image to obtain a MASK image only containing foreground and background.

2. The method of claim 1,

the performing the saliency detection on the original image to obtain the grayscale image corresponding to the original image further includes:

and carrying out saliency detection on the original image according to a first saliency detection mode to obtain a first gray image corresponding to the original image.

3. The method of claim 2,

the performing saliency detection on the original image according to a first saliency detection mode to obtain a first grayscale image corresponding to the original image includes:

and fusing the M saliency images to obtain the first grayscale image.

4. The method of claim 3,

The obtaining of the image segmentation results of the M different levels of the original image respectively includes:

5. The method of claim 4,

the step of respectively acquiring corresponding saliency images for the image segmentation result of each layer includes:

and generating a saliency image corresponding to the image segmentation result of the layer, wherein the value of each pixel point in the saliency image is the saliency value of the sub-region in which the pixel point is located.

6. The method of claim 5,

the linearly combining the M saliency images to obtain the first grayscale image includes:

7. The method of claim 1,

the determining the updated clustering center according to the mean value comprises:

and taking the pixel point which is closest to the value of the mean value in an assigned subarea in a temporary area formed by the pixel points in the pixel point cluster as an updated clustering center, wherein the assigned subarea is a preset-size area which takes the center point of the temporary area as the center.

8. The method of claim 7,

the respectively obtaining the foreground and background classification results corresponding to each super pixel region includes:

For any super pixel region, the following processing is respectively carried out:

and assigning 1 to the pixel points classified as the foreground in the region and 0 to the pixel points classified as the background each time the classification is performed.

9. The method of claim 8,

the fusing the classification results of the foreground and the background to obtain the second gray image comprises:

Counting the times of the classification participation of the pixel points;

10. The method of claim 2,

before generating the Trimap image including the foreground, the background and the unknown region according to the gray-scale image, the method further includes:

11. The method of claim 10,

the generating a Trimap image containing a foreground, a background and an unknown region according to the gray level image comprises:

12. A foreground extraction apparatus, comprising: a first processing unit, a second processing unit and a third processing unit;

the first processing unit is configured to perform saliency detection on an original image to obtain a grayscale image corresponding to the original image, including: dividing the original image into N image blocks of the same size, wherein N is a positive integer greater than one; respectively acquiring a super pixel region corresponding to each image block, wherein the super pixel region is a region of arbitrary shape, including: for any image block, randomly selecting a pixel point from the image block as a clustering center, and executing the following predetermined operation: performing a clustering operation according to the clustering center to obtain a pixel point cluster; calculating the mean value of the values of the pixel points in the pixel point cluster; determining an updated clustering center according to the mean value; if the distance between the updated clustering center and the clustering center before updating is smaller than a preset first threshold, taking a temporary region formed by the pixel points in the pixel point cluster as the super pixel region corresponding to the image block, and otherwise repeating the predetermined operation according to the updated clustering center; respectively obtaining a foreground and background classification result corresponding to each super pixel region; and fusing the foreground and background classification results to obtain a second grayscale image;

the second processing unit is configured to generate a Trimap image containing a foreground, a background and an unknown region according to the grayscale image;
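The iterative clustering recited for the first processing unit in claim 12 can be illustrated with a minimal Python sketch. The claim does not specify the clustering operation, so the sketch assumes a simple rule: a pixel joins the cluster if its value is within a hypothetical `tolerance` of the value at the clustering center. The center update is also simplified to "the cluster pixel whose value is closest to the mean". The function name, parameters, the `max_iters` safeguard and the use of nested lists as a stand-in for an image block are all illustrative assumptions, not part of the claimed device.

```python
import math
import random


def grow_superpixel(block, first_threshold=1.0, tolerance=10, max_iters=20, seed=0):
    """Return the set of (row, col) pixels forming one super pixel region.

    block: 2D list of gray values (stand-in for one image block).
    tolerance, max_iters, seed: hypothetical parameters for this sketch.
    """
    rng = random.Random(seed)
    height, width = len(block), len(block[0])
    # Randomly select a pixel point as the initial clustering center.
    center = (rng.randrange(height), rng.randrange(width))
    cluster = {center}
    for _ in range(max_iters):
        cy, cx = center
        # Clustering operation (assumed rule): keep pixels whose value is
        # close to the value at the current clustering center.
        cluster = {
            (y, x)
            for y in range(height)
            for x in range(width)
            if abs(block[y][x] - block[cy][cx]) <= tolerance
        }
        # Mean of the values of the pixel points in the cluster.
        mean = sum(block[y][x] for y, x in cluster) / len(cluster)
        # Updated center (simplified): cluster pixel closest in value to the mean.
        new_center = min(sorted(cluster), key=lambda p: abs(block[p[0]][p[1]] - mean))
        moved = math.hypot(new_center[0] - cy, new_center[1] - cx)
        center = new_center
        # Stop once the center moves less than the first threshold.
        if moved < first_threshold:
            break
    return cluster


# A 4x4 block whose left half is dark and right half is bright.
block = [[10, 10, 200, 200] for _ in range(4)]
region = grow_superpixel(block)
```

In this toy block the returned region is one of the two uniform halves, whichever contains the randomly chosen starting pixel. A region produced this way need not be rectangular or connected, which is consistent with the claim's "region of arbitrary shape".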

13. The apparatus of claim 12,

the first processing unit is further configured to perform saliency detection on the original image according to a first saliency detection manner to obtain a first grayscale image corresponding to the original image.

14. The apparatus of claim 13,

the first processing unit respectively obtains M image segmentation results of different layers of the original image, wherein M is a positive integer greater than one, respectively obtains a corresponding saliency image for the image segmentation result of each layer, and fuses the M saliency images to obtain the first grayscale image.

15. The apparatus of claim 14,

the first processing unit divides the original image into K1 sub-regions by a graph-based image segmentation algorithm to obtain a first-layer image segmentation result, wherein K1 is a positive integer greater than one, and obtains an i-th layer image segmentation result in the following manner, wherein i is a positive integer greater than one and less than or equal to M: performing region merging on the (i-1)-th layer image segmentation result to obtain the i-th layer image segmentation result, wherein the number of sub-regions contained in the i-th layer image segmentation result is less than the number of sub-regions contained in the (i-1)-th layer image segmentation result.

16. The apparatus of claim 15,

the first processing unit respectively performs the following processing for each sub-region in the image segmentation result of any layer: respectively acquiring the regional contrast, regional attributes and regional background features of the sub-region, and generating a feature vector according to the acquired information; performing regression on the feature vector using a random forest to obtain a saliency value of the sub-region, wherein the saliency value is 0 or 1; and generating a saliency image corresponding to the image segmentation result of the layer, wherein the value of each pixel point in the saliency image is the saliency value of the sub-region in which the pixel point is located.

17. The apparatus of claim 16,

the first processing unit assigns a value to each pixel point in the first grayscale image in the following manner: adding the values of the pixel point in the M saliency images, dividing the sum by M, and assigning the obtained quotient to the pixel point in the first grayscale image.
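The per-pixel averaging described in claim 17 can be sketched as follows. This is a minimal illustration that assumes the M saliency images are same-sized nested lists of numbers; the function name `fuse_saliency_images` is hypothetical.

```python
def fuse_saliency_images(saliency_images):
    """Average M same-sized saliency images pixel by pixel."""
    m = len(saliency_images)
    if m < 2:
        raise ValueError("M must be a positive integer greater than one")
    height = len(saliency_images[0])
    width = len(saliency_images[0][0])
    # For each pixel: add its values across the M images, then divide by M.
    return [
        [sum(img[y][x] for img in saliency_images) / m for x in range(width)]
        for y in range(height)
    ]


# Two 2x2 saliency images (M = 2) holding saliency values of 0 or 1.
layer_1 = [[1, 0], [1, 1]]
layer_2 = [[1, 0], [0, 1]]
first_gray = fuse_saliency_images([layer_1, layer_2])
# first_gray == [[1.0, 0.0], [0.5, 1.0]]
```

Pixels on which the layers disagree receive intermediate values, which is what makes the fused result a grayscale image rather than a binary one.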

18. The apparatus of claim 12,

the first processing unit takes, as the updated clustering center, the pixel point whose value is closest to the mean value within a designated sub-region of the temporary region formed by the pixel points in the pixel point cluster, wherein the designated sub-region is a region of predetermined size centered on the center point of the temporary region.
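One way to read the center update of claim 18 is sketched below. The claim does not define how the center point of the temporary region is computed or how large the designated sub-region is, so this sketch assumes the center of the cluster's bounding box and a square window with a hypothetical `half_size`. The fallback to the whole cluster when the window is empty is likewise an assumption.

```python
def pick_updated_center(values, cluster, half_size=1):
    """Pick the cluster pixel, inside a window around the region center,
    whose value is closest to the cluster mean.

    values: 2D list of gray values; cluster: set of (row, col) pixels.
    half_size: hypothetical half-width of the designated sub-region.
    """
    mean = sum(values[y][x] for y, x in cluster) / len(cluster)
    rows = [y for y, _ in cluster]
    cols = [x for _, x in cluster]
    # Center point of the temporary region (assumed: bounding-box center).
    cy = (min(rows) + max(rows)) // 2
    cx = (min(cols) + max(cols)) // 2
    # Designated sub-region: cluster pixels within the window around the center.
    candidates = [
        (y, x) for y, x in cluster
        if abs(y - cy) <= half_size and abs(x - cx) <= half_size
    ]
    if not candidates:  # assumed fallback, not stated in the claim
        candidates = list(cluster)
    return min(sorted(candidates), key=lambda p: abs(values[p[0]][p[1]] - mean))


# One row of five pixels; the mean is 12.
values = [[12, 0, 9, 14, 25]]
cluster = {(0, x) for x in range(5)}
new_center = pick_updated_center(values, cluster)
# new_center == (0, 3)
```

Pixel (0, 0) matches the mean exactly but lies outside the window around the region center, so the pixel at (0, 3) with value 14 is chosen instead. Restricting the search in this way keeps the updated center near the middle of the region.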

19. The apparatus of claim 18,

the first processing unit respectively performs the following processing for any super-pixel region: inputting the super-pixel region into a convolutional neural network (CNN) classification model to obtain a first foreground-background classification result corresponding to the super-pixel region, wherein the first foreground-background classification result comprises a classification result of whether each pixel point in the super-pixel region belongs to the foreground or the background; intercepting, from the original image, an extended region which contains the super-pixel region and is larger than the super-pixel region, and inputting the extended region into the CNN classification model to obtain a second foreground-background classification result corresponding to the super-pixel region, wherein the second foreground-background classification result comprises a classification result of whether each pixel point in the extended region belongs to the foreground or the background; and, each time the classification is performed, assigning 1 to the pixel points classified as the foreground and 0 to the pixel points classified as the background.

20. The apparatus of claim 19,

the first processing unit assigns a value to each pixel point in the second grayscale image in the following manner: counting the number of times the pixel point participates in classification, adding the assignments the pixel point receives each time it participates in classification, dividing the sum by the number of times, and assigning the obtained quotient to the pixel point in the second grayscale image.
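The count-and-average fusion of claim 20 can be illustrated with a short sketch. It assumes each classification pass is represented as a dictionary mapping a pixel coordinate to its 0/1 assignment, so that overlapping super-pixel and extended regions naturally give some pixels more passes than others. The function name and the value 0.0 for pixels that never take part in classification are assumptions.

```python
def fuse_classifications(height, width, passes):
    """Average the 0/1 assignments each pixel received over all passes.

    passes: list of dicts mapping (row, col) -> 1 (foreground) or 0 (background).
    """
    totals = [[0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for labels in passes:
        for (y, x), label in labels.items():
            totals[y][x] += label   # sum of assignments
            counts[y][x] += 1       # number of times the pixel was classified
    # Quotient of sum and count; 0.0 for unclassified pixels (assumption).
    return [
        [totals[y][x] / counts[y][x] if counts[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


# Pass 1 covers a super-pixel region; pass 2 covers its larger extended region.
passes = [
    {(0, 0): 1, (0, 1): 0},
    {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 0},
]
second_gray = fuse_classifications(2, 2, passes)
# second_gray == [[1.0, 0.5], [0.0, 0.0]]
```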

21. The apparatus of claim 13,

the second processing unit is further configured to generate a third grayscale image according to the first grayscale image and the second grayscale image, wherein a value is assigned to each pixel point in the third grayscale image in the following manner: adding the values of the pixel point in the first grayscale image and the second grayscale image, dividing the sum by 2, and assigning the obtained quotient to the pixel point in the third grayscale image.

22. The apparatus of claim 21,

the second processing unit compares the value of each pixel point in the third grayscale image with a preset second threshold and a preset third threshold, wherein the second threshold is greater than the third threshold; if the value of the pixel point is greater than the second threshold, the pixel point is determined to be foreground; if the value of the pixel point is smaller than the third threshold, the pixel point is determined to be background; and if the value of the pixel point is greater than or equal to the third threshold and smaller than or equal to the second threshold, the pixel point is determined to be an unknown region; and the second processing unit generates the Trimap image according to the determination result of each pixel point.
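The averaging of claim 21 and the two-threshold rule of claim 22 can be sketched together. The threshold values 0.7 and 0.3 and the label encoding 255 / 128 / 0 are illustrative assumptions; the claims only require that the second threshold be greater than the third.

```python
FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0  # assumed label encoding


def average_gray(first_gray, second_gray):
    """Third grayscale image: per-pixel mean of the first and second images."""
    return [
        [(a + b) / 2 for a, b in zip(row_1, row_2)]
        for row_1, row_2 in zip(first_gray, second_gray)
    ]


def to_trimap(gray, second_threshold=0.7, third_threshold=0.3):
    """Label each pixel as foreground, background or unknown."""
    if second_threshold <= third_threshold:
        raise ValueError("second threshold must be greater than third threshold")

    def label(value):
        if value > second_threshold:
            return FOREGROUND
        if value < third_threshold:
            return BACKGROUND
        return UNKNOWN  # third_threshold <= value <= second_threshold

    return [[label(v) for v in row] for row in gray]


first_gray = [[1.0, 0.5], [0.0, 0.25]]
second_gray = [[1.0, 0.5], [0.0, 0.75]]
third_gray = average_gray(first_gray, second_gray)
trimap = to_trimap(third_gray)
# trimap == [[255, 128], [0, 128]]
```

Values exactly equal to either threshold fall into the unknown region, matching the "greater than or equal to" and "smaller than or equal to" wording of the claim.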

23. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any of claims 1 to 11.

24. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 11.