CN112070793A

CN112070793A - Target extraction method and device

Info

Publication number: CN112070793A
Application number: CN202010952509.1A
Authority: CN
Inventors: 王晓茹; 徐培容; 曲昭伟; 张珩; 谷嘉航
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2020-12-11

Abstract

The application provides a target extraction method and a target extraction device, wherein the method comprises the following steps: acquiring an image and position information of a target to be extracted; the position information comprises position information of foreground points and position information of background points; calculating a foreground binary image according to the position information of the foreground point, and calculating a background binary image according to the position information of the background point; carrying out channel combination on the image, the foreground binary image and the background binary image; inputting the image after channel combination into a trained target semantic segmentation model to obtain a mask for distinguishing a target from a non-target in the image; the target semantic segmentation model is obtained by replacing at least Xception-65 in a Deeplab v3+ semantic segmentation network with a residual error network ResNet-101; and extracting the target in the image according to the mask and the image. According to the target extraction method, the speed and the precision of extracting the target are improved.

Description

Target extraction method and device

Technical Field

The present application relates to the field of image processing, and in particular, to a method and an apparatus for extracting a target.

Background

In some application scenarios, a user needs to extract a desired object from an image, for convenience of description, the object that the user needs to extract in the image is referred to as a target, and contents other than the target in the image are referred to as non-targets. Typically, a user clicks a preset first number of location points (referred to as foreground points for descriptive convenience) within a target in an image, the foreground points being used to represent the target. Meanwhile, the user clicks a preset second number of position points (referred to as background points for convenience of description) on the non-object of the image, and the background points are used for representing the non-object.

At present, a target is extracted by processing through a semantic segmentation model based on an image, a foreground point and a background point of the target to be extracted.

However, the accuracy and speed of extracting the target are low at present.

Disclosure of Invention

The application provides a target extraction method and a target extraction device, and aims to solve the problems of low precision and low speed of target extraction.

In order to achieve the above object, the present application provides the following technical solutions:

the application provides a target extraction method, which comprises the following steps:

acquiring an image and position information of a target to be extracted; the position information comprises position information of foreground points and position information of background points;

calculating a foreground binary image according to the position information of the foreground point;

calculating a background binary image according to the position information of the background points;

performing channel combination on the image, the foreground binary image and the background binary image to obtain an image after channel combination;

inputting the image after the channel combination into a trained target semantic segmentation model to obtain a mask for distinguishing a target from a non-target in the image; the target semantic segmentation model is obtained by replacing at least Xception-65 in a Deeplab v3+ semantic segmentation network with a residual error network ResNet-101;

and extracting the target in the image according to the mask and the image.

Optionally, the calculating a foreground binary image according to the position information of the foreground point includes:

in the image, taking each foreground point as a circle center and taking a preset radius as a circle respectively to obtain a circle corresponding to each foreground point;

and setting the pixel value of the pixel point in the area where the circle corresponding to each foreground point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the foreground point as 0 to obtain the foreground binary image.

Optionally, the calculating a background binary image according to the position information of the background point includes:

in the image, taking each background point as a circle center and taking the preset radius as a circle respectively to obtain a circle corresponding to each background point;

setting the pixel value of the pixel point in the area where the circle corresponding to each background point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the background point as 0 to obtain the background binary image.

Optionally, the number of channels of the final output network of the target semantic segmentation network is 2;

the mask comprises a first mask and a second mask; the value of any position point in the first mask represents the probability that the pixel point which is the same as the position point in the image belongs to the target; and the value of any position point in the second mask represents the probability that the pixel point which is the same as the position point in the image belongs to the non-target.

The extracting the target in the image according to the mask and the image comprises:

comparing the values of the target position points in the first mask and the second mask to obtain a comparison result; the target position points are the same position points;

in the case that the comparison result indicates: under the condition that the value of the target position point in the first mask is larger than that of the target position point in the second mask, taking a pixel point indicated by the target position point in the image as a target pixel point;

and taking the area formed by the target pixel points in the image as the target.

Optionally, the image is an RGB image; the number of input channels of the residual network ResNet-101 is 5.

Optionally, the training process of the target semantic segmentation model includes:

acquiring a training sample and a sample label; any one training sample is obtained by combining an RGB original image, a foreground binary image and a background binary image through a channel; the sample label of the training sample is a 2-channel mask for distinguishing a target from a non-target;

training the target semantic segmentation model by adopting the training sample, the sample label and a target loss function to obtain a trained target semantic segmentation model; the target loss function is the sum of a preset mark loss function and a cross entropy; the expression of the mark loss function is:

the LabelLoss represents the marker loss function value,

l represents a non-target mask in the result mask output by the target semantic segmentation model on any input training sample, S¹Representing the foreground binary image in the training sample, S⁰Representing a background binary image in the training sample, W representing the width of the RGB raw image, and H representing the height of the RGB raw image.

Optionally, the obtaining of the training sample and the sample label includes:

acquiring a training example sample; any one of the training example samples includes: the RGB original image and a single-class mask; the single-class mask is a single-channel mask used for dividing the RGB original image into a target and a non-target;

selecting foreground points and background points for masks of a single category in each training example sample respectively to obtain a single-channel mask after marking corresponding to each example sample;

respectively generating a foreground binary image and a background binary image according to the foreground points and the background points of the marked single-channel mask to obtain a foreground binary image and a background binary image of each example sample;

respectively carrying out channel combination on the RGB original image in each example sample and the corresponding foreground binary image and background binary image to obtain an image after channel combination corresponding to each example sample;

respectively updating each single-class mask in the training example sample into a 2-channel mask through a preset function to obtain a single-class 2-channel mask;

and taking the image obtained by combining the channels corresponding to any one example sample as a training sample, and taking the 2-channel mask of a single class in the example sample as a sample label of the training sample.

Optionally, the obtaining a training example sample includes:

obtaining a semantic segmentation training data set; the semantic segmentation training data set comprises the RBG original image and a mask label; the mask label is a single-channel mask for distinguishing different areas in the RGB original image;

respectively determining a single type of mask corresponding to each area in the mask label; the single-category mask corresponding to any one area in the mask labels is obtained by setting the value of the position point of the area to be 1 and the values of other position points to be 0;

and the RGB original image and each single-class mask respectively form the training example sample.

The present application further provides a target extraction apparatus, including:

the acquisition module is used for acquiring an image and position information of a target to be extracted; the position information comprises position information of foreground points and position information of background points;

the first calculation module is used for calculating a foreground binary image according to the position information of the foreground point;

the second calculation module is used for calculating a background binary image according to the position information of the background point;

the combination module is used for carrying out channel combination on the image, the foreground binary image and the background binary image to obtain an image after channel combination;

the input module is used for inputting the image after the channel combination into a trained target semantic segmentation model to obtain a mask for distinguishing a target from a non-target in the image; the target semantic segmentation model is obtained by replacing at least Xception-65 in a Deeplab v3+ semantic segmentation network with a residual error network ResNet-101;

and the extraction module is used for extracting the target in the image according to the mask and the image.

Optionally, the first calculating module is configured to calculate a foreground binary image according to the position information of the foreground point, and includes:

the first calculation module is specifically configured to make a circle with a preset radius by taking each foreground point as a center of a circle in the image, so as to obtain a circle corresponding to each foreground point;

Optionally, the second calculating module is configured to calculate a background binary image according to the position information of the background point, and includes:

the second calculation module is specifically configured to take each background point as a circle center and take the preset radius as a circle in the image to obtain a circle corresponding to each background point;

The extracting module is configured to extract the target in the image according to the mask and the image, and includes:

the extraction module is specifically configured to compare values of target position points in the first mask and the second mask to obtain a comparison result; the target position points are the same position points;

Optionally, the method further includes: the training module is used for the training process of the target semantic segmentation model and comprises the following steps:

the LabelLoss represents the marker loss function value,

Optionally, the training module is configured to obtain a training sample and a sample label, and includes:

the training module is specifically used for acquiring a training example sample; any one of the training example samples includes: the RGB original image and a single-class mask; the single-class mask is a single-channel mask used for dividing the RGB original image into a target and a non-target;

Optionally, the training module is configured to obtain a training example sample, and includes:

the training module is specifically used for acquiring a semantic segmentation training data set; the semantic segmentation training data set comprises the RBG original image and a mask label; the mask label is a single-channel mask for distinguishing different areas in the RGB original image;

In the target extraction method and the target extraction device, an image and position information of a target to be extracted are obtained, wherein the position information comprises position information of foreground points and position information of background points; calculating a foreground binary image according to the position information of the foreground point, and calculating a background binary image according to the position information of the background point; because the information content of the binary image is small, the image, the foreground binary image and the background binary image are subjected to channel combination, and the information content of the obtained image after channel combination is small; furthermore, the image after channel combination is input into the trained target semantic segmentation model, so that the processing speed of the trained semantic segmentation model can be improved.

Because the target semantic segmentation model is obtained by replacing at least the Xeception-65 in the Deeplab v3+ semantic segmentation network with the residual error network ResNet-101, compared with the Xeception-65, the residual error network ResNet-101 can extract more semantic information, and therefore, the mask output by the target semantic segmentation model provided by the application has higher accuracy. Furthermore, according to the mask, the accuracy of the target extracted from the image is higher.

In summary, the speed and accuracy of extracting the target are improved by the target extraction method provided by the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of a training process of a target semantic segmentation model disclosed in an embodiment of the present application;

fig. 2 is a flowchart of a target extraction method disclosed in an embodiment of the present application;

fig. 3 is a schematic diagram of an architecture of a target extraction process disclosed in an embodiment of the present application;

fig. 4(a) is a schematic image of an object to be extracted according to an embodiment of the present application;

FIG. 4(b) is a schematic diagram of an extraction result disclosed in the embodiment of the present application;

fig. 4(c) is a schematic image of another object to be extracted disclosed in the embodiment of the present application;

FIG. 4(d) is a schematic diagram of another extraction result disclosed in the embodiment of the present application;

fig. 5 is a schematic structural diagram of a target extraction device disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the embodiment of the present application, the image type suitable for the extraction target may be an RGB image, or may be another type of image.

Fig. 1 is a schematic diagram of a training process of a target semantic segmentation model according to an embodiment of the present disclosure.

In this embodiment, the target semantic segmentation model is obtained by improving a deepab v3+ semantic segmentation network, specifically, Xception-65 in a deepab v3+ semantic segmentation network is replaced by a residual error network ResNet-101, the number of input channels of the residual error network ResNet-101 is modified to 5, and in this embodiment, the number of the channels of the deepab v3+ the final output network of the semantic segmentation network, which are replaced by the residual error network ResNet-101, is modified to 2, so as to obtain the target semantic segmentation model. The specific structure of the residual error network ResNet-101 and the specific structure of the Deeplab v3+ semantic segmentation network are the prior art, and are not described herein again.

In practice, when the image is not an RGB image but another type of image, the input channel number of the residual network ResNet-101 is modified to a target channel number, where the target channel number is the total channel number obtained by adding 2 to the image channel number.

It should be further noted that, in this embodiment, the number of channels of the final output network of the deplab v3+ semantic segmentation network replaced by the residual network ResNet-101 is modified to 2, that is, the number of channels of the final output network of the target semantic segmentation model is 2, that is, the target semantic segmentation model outputs a mask of 2 channels, that is, the target in the image is determined through the mask of two channels. In practice, the number of channels of the final output network of the target semantic segmentation model may also be 1, that is, the target semantic segmentation model outputs a single-channel mask, that is, the target in the image is determined through the single-channel mask. In this embodiment, a training process of the target semantic segmentation model is introduced by taking a mask of 2 channels output by the target semantic segmentation model as an example.

In this embodiment, in the process of training the target semantic segmentation model, an Adam optimizer may be adopted, the batch size (batch size) may be set to 4, the learning rate may be set to 0.0001, and the value of the preset radius may be set to 5. Of course, in practice, other optimizers and parameters may also be used to train the target semantic segmentation model, and the specific content of the optimizers and parameters is not limited in this embodiment. The specific training process may include the steps of:

s101, obtaining a semantic segmentation training data set.

In this step, the semantic segmentation training data set may be a Pascal VOC2012 semantic segmentation data set, and certainly, in practice, other semantic segmentation data sets may also be adopted, which is described in this embodiment by taking the Pascal VOC2012 semantic segmentation data set as an example.

The semantic segmentation model training dataset comprises: the image processing method comprises RGB original images and mask labels, wherein one RGB original image corresponds to one mask label, and the mask label of any frame of RGB original image is obtained through manual marking.

For any frame of RGB original image, the RGB original image is composed of multiple regions, the mask label corresponding to the RGB original image is a single-channel mask for distinguishing different regions in the RGB original image, namely the mask label corresponding to the RGB original image comprises multiple regions, wherein one region in the mask label corresponds to one region in the RGB original image. The number of the regional categories in the RGB original image in the semantic segmentation training data set is determined in advance.

For example, a frame of RGB original image is composed of 21 regions, and the corresponding mask label of the RGB original image includes 21 regions, wherein one region may be represented by 1 number.

S102, determining the masks of the single type corresponding to each area in the mask labels respectively.

And respectively determining a mask of a single category corresponding to each region in the corresponding mask label for each frame of RGB original image in the semantic segmentation model.

Taking an RGB original image of any frame as an example, a single-type mask corresponding to any one region in a mask label corresponding to the RGB original image is obtained by setting the value of the position point of the region to 1, and setting the values of the other position points to 0.

Assuming that the frame RGB original image includes 21 regions, the mask label corresponding to the frame RGB original image includes 21 different regions, and the different regions are represented by different numbers. Assuming that any region in the mask label corresponding to the RGB original image is a region corresponding to the number "7" in the mask label, setting a value of a position point corresponding to the number "7" in the mask label to 1, and setting values of position points where other numbers except for the number "7" in the mask label to 0, thereby obtaining a single-type mask corresponding to the region. Then in this step 21 masks of a single class are obtained.

S103, combining the RGB original image and each single-class mask respectively to form a training example sample.

Taking the RGB original image including 21 regions as an example, in this step, the frame of RGB original image is combined with 21 masks of a single type to form a training example sample, i.e. 21 training example examples are obtained.

The above-mentioned purposes of S101 to S103 are: training example samples are obtained. Wherein any one of the training example samples comprises: an RGB original image and a single-class mask; the single class mask is a single channel mask for dividing the RGB original image into a target and a non-target.

And S104, selecting foreground points and background points for the masks of the single category in each training example sample respectively to obtain the marked single-channel masks corresponding to each example sample respectively.

In this step, the first location point (belonging to the target and representing the target) that simulates the user click is called the foreground point. The second location point (belonging to and representing a non-object) that simulates a user click is called a background point. Taking the RGB image including 21 regions as an example, 21 marked masks are obtained in this step.

In this step, foreground points and background points may be randomly selected.

And S105, respectively generating a foreground binary image and a background binary image according to the marked foreground points and background points of the single-channel mask corresponding to each training example sample, so as to obtain the foreground binary image and the background binary image of each example sample.

In this embodiment, the principle of generating a foreground binary image and the principle of generating a background binary image for the foreground point and the background point in the marked mask corresponding to each training example sample are the same, and for convenience of description, the principle of generating a foreground binary image and a background binary image according to the foreground point and the background point in the marked mask corresponding to the training example sample is described by taking the marked mask corresponding to any one training example sample as an example:

in the RGB original image in the training example sample, respectively taking each foreground point as the center of a circle and making a circle by a preset radius to obtain a circle corresponding to each foreground point; and setting the pixel value of the pixel point of the area where the circle corresponding to each foreground point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the foreground point as 0 to obtain a foreground binary image.

Similarly, in the RGB original image in the training example sample, each background point is taken as a circle center, and a circle is made with the preset radius, so as to obtain a circle corresponding to each background point; and setting the pixel value of the pixel point in the area where the circle corresponding to each background point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the background point as 0 to obtain a background binary image.

Specifically, the background binary image and the foreground binary image may be calculated according to the following formula (1):

wherein t is 0

Background II indicating marked mask(i, j) pixel values in the value image indicating pixel points,

representing a set of background points in the marked mask,

indicating that (i, j) indicates a pixel point and

the minimum distance value among the distances of each background point in the image. r denotes a preset radius value and sgn denotes a sign function.

Wherein (i, j) indicates a pixel point and

the distance between any one of the background points may be an euclidean distance, and a specific calculation formula is shown in the following formula (2):

in the formula, A_xRepresents the pixel point indicated by (i, j) in formula (1), B_xTo represent

Any background point in (2).

When t is 1

Representing pixel values of (i, j) indicated pixel points in the foreground binary image of the marked mask,

representing a set of foreground points in the marked mask,

indicating that (i, j) indicates a pixel point and

Wherein (i, j) indicates a pixel point and

the distance between any foreground points may be an euclidean distance, and the specific calculation principle may refer to formula (2), which is not described herein again.

And S106, respectively carrying out channel combination on the RGB original image in each example sample and the corresponding foreground binary image and background binary image to obtain an image after channel combination corresponding to each example sample.

In this step, the channel combination of the RGB original image in each example sample and the corresponding foreground binary image and background binary image is performed in the same manner, and the channel combination of the RGB original image of any example sample and the corresponding foreground binary image and background binary image is used as an example for introduction.

Since the RGB original image is a 3-channel image, the foreground binary image corresponding to the RGB original image is a single-channel image, and the background binary image corresponding to the RGB original image is a single-channel image, in this step, the RGB original image, the corresponding foreground binary image, and the background binary image are subjected to channel combination to obtain a 5-channel image.

S107, respectively updating the mask of each single type in the training example sample into a 2-channel mask through a preset function, and obtaining the 2-channel mask of the single type.

Specifically, in this step, the specific implementation manner of changing the mask of a single category in the training example sample into the sample label of the 2-channel by using the preset function is the prior art, and is not described herein again. In the masks of the 2 channels, the value of each position point in the mask of one channel is used for representing the probability that the corresponding pixel point in the RGB original image belongs to the target, and the value of each position point in the mask of the other channel is used for representing the probability that the corresponding pixel point in the RGB original image belongs to the non-target.

And S108, taking the image obtained by combining the channels corresponding to any example sample as a training sample, and taking the 2-channel mask of a single category in the example sample as a sample label of the training sample.

The purpose of the above S101 to S108 is: acquiring a training sample and a sample label; any training sample is obtained by combining an RGB original image, a foreground binary image and a background binary image through a channel; the sample label of the training sample is a 2-channel mask used to distinguish between targets and non-targets.

S109, training the target semantic segmentation model by adopting the training samples, the sample labels and the target loss function to obtain the trained target semantic segmentation model.

In this step, the training process of the target semantic segmentation model may include: and for any training sample input into the target semantic segmentation model, the result mask of the training sample output by the target semantic segmentation model, wherein the number of channels of the result mask is 2. And calculating a loss function value between a non-target mask in the result mask and a mask label of the training sample by adopting a target loss function, and adjusting parameters of the target semantic segmentation model according to the loss function value to finally obtain the target semantic segmentation model.

In this embodiment, the target loss function is the sum of the mark loss function and the cross entropy.

Wherein, the expression of the mark loss function is shown in the following formula (3):

wherein LabelLoss represents a loss function value,

for limiting the input range to (0,1), L represents a non-target mask in the result mask output by the target semantic division model, i.e. represents the output of the target semantic division modelChannel in the result mask representing the probability of background points, S¹Representing a foreground binary image, S⁰Representing a background binary image, W representing the width of the image, and H representing the height of the image.

(S¹-L)²Performing square calculation on each element in a difference matrix of the foreground binary image and a non-target mask in the result mask to obtain a first square matrix; s¹(S¹-L)²And expressing that elements in the first square matrix are correspondingly multiplied with elements in the foreground binary image to obtain a first multiplication matrix.

In the same way, (S)⁰-L)²Representing a second square matrix obtained by squaring each element in a difference matrix between the background binary image and a non-target mask in the result mask, S⁰(S⁰-L)²And expressing that elements in the second square matrix are correspondingly multiplied with elements in the background binary image to obtain a second multiplication matrix.

In this embodiment, the cross entropy is a cross entropy between the result mask and the sample label of the training sample, and it should be noted that in the process of calculating the cross entropy, two masks, namely a target mask and a non-target mask, in the result mask output by the target semantic segmentation model are used.

Fig. 2 is a method for extracting an object according to an embodiment of the present application, including the following steps:

s201, obtaining an image and position information of a target to be extracted.

In this step, the image of the target to be extracted is an RGB image, and the position information is a foreground point and a background point marked on the image of the target to be extracted by the user, where the foreground point belongs to the target to be extracted and the background point belongs to a non-target.

In order to more intuitively show the execution flow of the embodiment, an architectural diagram of a target extraction flow is provided in the embodiment of the present application, as shown in fig. 3.

The image and the position information of the target to be extracted acquired in this step correspond to the image indicated by "image 1" in fig. 3.

S202, calculating a foreground binary image according to the position information of the foreground point, and calculating a background binary image according to the position information of the background point.

The specific implementation manner of this step may refer to S105, which is not described herein again.

The foreground binary image obtained in this step corresponds to "image 3" in fig. 3, and the background binary image corresponds to "image 4" in fig. 3 "

And S203, carrying out channel combination on the image, the foreground binary image and the background binary image to obtain an image after channel combination.

In this step, the image of the target to be extracted, the foreground binary image of the target to be extracted, and the background binary image of the target to be extracted are subjected to channel combination to obtain an image after channel combination corresponding to the image of the target to be extracted.

It should be noted that, this step is only to combine channels of different images, and does not modify information in the images.

In this step, the image, the foreground binary image and the background binary image are subjected to channel combination, and corresponding to fig. 3, "image 2", "image 3" and "image 4" are subjected to channel combination.

And S204, inputting the image after the channel combination into the trained target semantic segmentation model to obtain a mask.

In this step, the trained target semantic segmentation model is the target semantic segmentation model obtained in S109 in the embodiment corresponding to fig. 1.

In this embodiment, the resulting mask is used to distinguish between objects and non-objects in the image. In this step, the obtained mask may be a single-channel mask or a 2-channel mask. Specifically, the number of channels of the mask depends on how to train the target semantic segmentation model, and the number of channels of the mask obtained in this step is not limited in this embodiment.

The target semantic segmentation model of this step corresponds to the target semantic segmentation model in fig. 3. This step results in a mask corresponding to "image 5" in FIG. 3.

S205, extracting the target in the image of the target to be extracted according to the mask and the image of the target to be extracted.

Optionally, in this step, taking the mask as a 2-channel mask as an example, specifically, the 2-channel mask is a first mask and a second mask, where a value taken from any position point in the first mask indicates a probability that a pixel point in the image that is the same as the position point belongs to the target; the value of any position point in the second mask represents the probability that the pixel point in the image same as the position point belongs to the non-target.

Specifically, extracting the target in the image of the target to be extracted according to the mask and the image of the target to be extracted includes the following steps a1 to A3:

and A1, comparing the values of the target position points in the first mask and the second mask to obtain a comparison result.

In this step, the target position points are the same position points, that is, each pair of position points in the first mask and the second mask, wherein any pair of position points is formed by any one of the same position points in the first mask and the second mask.

In this step, the values of each target location point in the first mask and the second mask are compared, and one target location point corresponds to one comparison result.

A2, in the comparison: and under the condition that the value of the target position point in the first mask is larger than that of the target position point in the second mask, taking the pixel point indicated by the target position point in the image of the target to be extracted as a target pixel point.

The value of the target position point in the first mask is larger than that of the target position point in the second mask, so that the pixel point indicated by the target position point in the image of the target to be extracted belongs to the target. For convenience of description, the pixel point indicated by the target position point in the image of the target to be extracted is taken as a target pixel point, namely the target pixel point belongs to the target.

And A3, taking a region formed by target pixel points in the image of the target to be extracted as a target, and obtaining the image only comprising the target.

S205 this step corresponds to the process of "image 2" and "image 5" to "image 6" in fig. 3.

In the present embodiment, a variety of programming languages and tools may be used for implementation. Specifically, the programming language may be a python interpreter version may be 3.6.8. The deep learning framework may be tensiorflow (both CPU and GPU versions) and the version may be 1.13.1. The tools used to determine the background binary image and the foreground binary image may include Numpy, Opencv-python, H5 py. The visualization process may employ matplotlib.

The beneficial effects of this embodiment include:

the beneficial effects are that:

acquiring an image and position information of a target to be extracted, wherein the position information comprises position information of foreground points and position information of background points; calculating a foreground binary image according to the position information of the foreground point, and calculating a background binary image according to the position information of the background point; on one hand, the foreground binary image and the background binary image are calculated, so that the calculation speed is increased, and the target extraction speed is increased; on the other hand, because the information content of the binary image is small, the image, the foreground binary image and the background binary image are subjected to channel combination, and the information content of the obtained image after channel combination is small; furthermore, the image after channel combination is input into the trained target semantic segmentation model, so that the processing speed of the trained semantic segmentation model can be improved.

Since the target semantic segmentation model in this embodiment is obtained by replacing at least the decapsulation v3+ Xception-65 in the semantic segmentation network with the residual network ResNet-101, the residual network ResNet-101 can extract more semantic information than the Xception-65, and therefore, the mask output by the target semantic segmentation model provided by this embodiment has higher accuracy. Furthermore, according to the mask, the accuracy of the target extracted from the image is higher.

In summary, the speed and the accuracy of extracting the target by the target extracting method provided by the embodiment are improved.

The beneficial effects are that:

in this embodiment, the loss function used in the training process of the model includes a labeled loss function, and the labeled loss function can improve the convergence speed of the training model, and further can improve the training speed of the model.

Fig. 4 is a schematic diagram of an experimental result provided in an embodiment of the present application. Specifically, fig. 4(a), 4(b), 4(c), and 4(d) are included.

Fig. 4(a) is an image of an object to be extracted, where two sheep are present in the image, one sheep on the right side is the object to be extracted, a foreground point is marked in a region where the sheep on the right side is located, and a background point is marked in a region outside the sheep on the right side. Fig. 4(b) is a schematic diagram of the result obtained by the target extraction method according to the embodiment of the present application.

Fig. 4(c) is another image of the target to be extracted, where two sheep exist in the image, one sheep on the left side is the target to be extracted, a foreground point is marked in the region where the sheep on the left side is located, and a background point is marked in the region outside the sheep on the left side. Fig. 4(d) is a schematic diagram of the result obtained by the target extraction method according to the embodiment of the present application.

Fig. 5 is a schematic structural diagram of a target extraction apparatus according to an embodiment of the present application, including: an acquisition module 501, a first calculation module 502, a second calculation module 503, a combination module 504, an input module 505, and an extraction module 506. Wherein the content of the first and second substances,

an obtaining module 501, configured to obtain an image and position information of a target to be extracted; the position information includes position information of foreground points and position information of background points.

The first calculating module 502 is configured to calculate a foreground binary image according to the position information of the foreground point.

And a second calculating module 503, configured to calculate a background binary image according to the position information of the background point.

And the combining module 504 is configured to perform channel combination on the image, the foreground binary image, and the background binary image to obtain an image after channel combination.

An input module 505, configured to input the image after channel combination into the trained target semantic segmentation model, so as to obtain a mask for distinguishing a target from a non-target in the image; the target semantic segmentation model is obtained by replacing at least the Xeception-65 in the Deeplab v3+ semantic segmentation network with a residual network ResNet-101.

And an extracting module 506, configured to extract an object in the image according to the mask and the image.

Optionally, the first calculating module 502 is configured to calculate a foreground binary image according to the position information of the foreground point, and includes:

the first calculating module 502 is specifically configured to make a circle with a preset radius by taking each foreground point as a center of the circle in the image, so as to obtain a circle corresponding to each foreground point; and setting the pixel value of the pixel point of the area where the circle corresponding to each foreground point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the foreground point as 0 to obtain a foreground binary image.

Optionally, the second calculating module 503 is configured to calculate a background binary image according to the position information of the background point, and includes:

the second calculating module 503 is specifically configured to take each background point as a circle center and take a preset radius as a circle in the image to obtain a circle corresponding to each background point; and setting the pixel value of the pixel point in the area where the circle corresponding to each background point is located as 1, and setting the pixel value of the pixel point outside the circle corresponding to the background point as 0 to obtain a background binary image.

the mask comprises a first mask and a second mask; the value taking of any position point in the first mask represents the probability that the pixel point which is the same as the position point in the image belongs to the target; the value of any position point in the second mask represents the probability that the pixel point in the image same as the position point belongs to the non-target.

An extracting module 506, configured to extract an object in the image according to the mask and the image, including:

an extracting module 506, specifically configured to compare values of the target location points in the first mask and the second mask to obtain a comparison result; the target position points are the same position points; the comparison results show that: under the condition that the value of the target position point in the first mask is larger than that of the target position point in the second mask, taking the pixel point indicated by the target position point in the image as a target pixel point; and taking the area formed by the target pixel points in the image as a target.

Optionally, the method further includes: the training module is used for the training process of the target semantic segmentation model and comprises the following steps: acquiring a training sample and a sample label; any training sample is obtained by combining an RGB original image, a foreground binary image and a background binary image through a channel; the sample label of the training sample is a 2-channel mask for distinguishing a target from a non-target;

training a target semantic segmentation model by adopting the training sample, the sample label and the target loss function to obtain a trained target semantic segmentation model; the target loss function is the sum of a preset mark loss function and the cross entropy; the expression for the mark loss function is:

LabelLoss represents the marker loss function value,

l represents a non-target mask in the result mask output by the target semantic segmentation model on any input training sample, S¹Representing the foreground binary image in the training sample, S⁰Representing the background binary image in the training sample, W representing the width of the RGB raw image, and H representing the height of the RGB raw image.

the training module is specifically used for acquiring a training example sample; any one of the training example samples includes: an RGB original image and a single-class mask; the single-class mask is a single-channel mask used for dividing the RGB original image into a target and a non-target;

and taking the image after the channel combination corresponding to any example sample as a training sample, and taking the 2-channel mask of a single category in the example sample as a sample label of the training sample.

the training module is specifically used for acquiring a semantic segmentation training data set; the semantic segmentation training data set comprises RBG original images and mask labels; the mask label is a single-channel mask for distinguishing different areas in the RGB original image;

respectively determining a single type of mask corresponding to each area in the mask label; the single-category mask corresponding to any region in the mask label is obtained by setting the value of the position point of the region to be 1 and the values of other position points to be 0;

the RGB original image and each single-class mask form a training example sample.

The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of target extraction, comprising:

and extracting the target in the image according to the mask and the image.

2. The method of claim 1, wherein the computing a foreground binary image from the location information of the foreground points comprises:

3. The method according to claim 1, wherein the calculating a background binary image according to the position information of the background point comprises:

4. The method according to claim 1, wherein the number of channels of the final output network of the target semantic segmentation network is 2;

the mask comprises a first mask and a second mask; the value of any position point in the first mask represents the probability that the pixel point which is the same as the position point in the image belongs to the target; the value of any position point in the second mask represents the probability that the pixel point which is the same as the position point in the image belongs to the non-target;

5. The method of claim 4, wherein the image is an RGB image; the number of input channels of the residual network ResNet-101 is 5.

6. The method of claim 5, wherein the training process for the target semantic segmentation model comprises:

the LabelLoss represents the marker loss function value,

7. The method of claim 6, wherein the obtaining of the training sample and the sample label comprises:

8. The method of claim 7, wherein the obtaining training instance samples comprises:

9. An object extraction device, comprising:

10. The apparatus of claim 9, wherein the first computing module is configured to compute a foreground binary image according to the position information of the foreground point, and comprises: