CN115272378A - Character image segmentation method based on characteristic contour - Google Patents

Info

Publication number
CN115272378A
CN115272378A (application CN202210622266.4A)
Authority
CN
China
Prior art keywords
image
processed
feature
contour
pixel point
Prior art date
Legal status
Pending
Application number
CN202210622266.4A
Other languages
Chinese (zh)
Inventor
朱娟娟
职玉
王晓博
谈旭
郑世鑫
刘佳琪
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210622266.4A
Publication of CN115272378A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The invention discloses a character image segmentation method based on characteristic contours, which comprises the following steps: acquiring a character image to be processed; performing Mean-Shift preprocessing on the character image to be processed to obtain an edge curve; determining an effective domain for each pixel point on the edge curve, and determining contour points according to the effective domains; determining a preset number of characteristic contours from the plurality of contours formed by the contour points according to the area and perimeter of each contour; inputting the character image to be processed and the characteristic contours into a semantic segmentation network to obtain a ternary map of the semantically segmented character image; and inputting the character image to be processed and the ternary map into a fine segmentation network, so that the fine segmentation network predicts the transparency of the pixel points in the unknown region of the ternary map, obtaining a segmentation result of the character image to be processed. The method can automatically generate a high-quality ternary map, which not only improves the generation efficiency and fineness of the ternary map but also helps segment the character image more accurately.

Description

Character image segmentation method based on characteristic contour
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to a character image segmentation method based on a characteristic contour.
Background
Image segmentation belongs to the field of computer vision. The term originally referred only to image semantic segmentation, but it now covers both image semantic segmentation and fine segmentation, where fine segmentation technology is mainly applied to close-range segmentation tasks that take a portrait as the target.
At present, most image fine segmentation methods are based on deep learning and can be divided into trimap-based and trimap-free methods. A trimap-based segmentation method needs to acquire a fine ternary map (trimap), then uses the fine trimap and the original image as the input of a network, and generates the corresponding transparency mask (alpha matte) through the computation and prediction of the network. A trimap-free segmentation method directly predicts the foreground and background pixels and their transparency from the original image, and its segmentation precision is difficult to guarantee; therefore, trimap-based image segmentation has become a research hotspot in the field.
A trimap is generally composed of a determined foreground region, a determined background region, and an unknown region; fine image segmentation amounts to accurately predicting the unknown region of the trimap so as to obtain the final alpha matte, so a high-quality ternary map is important to the efficiency and accuracy of fine segmentation. In the related art, trimap generation methods include manual marking and automatic generation algorithms. Manual marking tools include Photoshop, AutoCAD, Visio, and the like; although manual marking can improve accuracy, it requires a lot of time and labor. The automatic generation method applies erosion and dilation operations to the result of a coarse segmentation; however, the erosion-dilation operation generally uses a convolution kernel of fixed size, and the quality of the resulting ternary map differs with the kernel size, so the method is not highly practical and lacks universality.
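For illustration, the following minimal sketch (Python with OpenCV; not part of the patent) shows the conventional erosion-dilation trimap generation criticized here. The kernel size, which fixes the width of the unknown band, is exactly the parameter whose arbitrary choice limits practicality:

```python
import cv2
import numpy as np

def trimap_from_mask(mask: np.ndarray, kernel_size: int = 10) -> np.ndarray:
    """mask: binary (0/255) coarse segmentation; returns a 0/128/255 trimap."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel)              # certain foreground
    band = cv2.dilate(mask, kernel) - fg      # uncertain band around the boundary
    trimap = np.zeros_like(mask)
    trimap[band > 0] = 128                    # unknown region -> gray
    trimap[fg > 0] = 255                      # determined foreground -> white
    return trimap
```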
After the ternary map is obtained, the prior art adopts the Deep Image Matting algorithm for portrait segmentation: an encoder-decoder network predicts a coarse alpha matte of the portrait, and a shallow network then refines the blurred edges of the alpha matte. However, the transparency of the portrait edges segmented by this method has large errors, and the refinement affects the degree of automation of the algorithm and requires an additional loss function. Therefore, the automation level and segmentation precision of existing portrait segmentation methods need to be improved.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method for segmenting a human image based on a feature contour. The technical problem to be solved by the invention is realized by the following technical scheme:
the invention provides a character image segmentation method based on a characteristic contour, which comprises the following steps:
acquiring a character image to be processed;
performing Mean-Shift preprocessing on the character image to be processed to obtain an edge curve;
determining the effective domain of each pixel point on the edge curve, and determining contour points from all the pixel points of the edge curve according to the effective domain;
respectively calculating the area and the perimeter of each contour according to a plurality of contours formed by the contour points, and determining a preset number of characteristic contours according to the areas and the perimeters;
inputting the character image to be processed and the feature outline to a pre-trained semantic segmentation network so that the semantic segmentation network generates a low-level feature stream, generating a feature map according to the feature outline and the character image to be processed, further recovering according to the feature map and the low-level feature stream to obtain a first image, classifying pixel points in the first image into a foreground region, a background region or an unknown region, and obtaining a ternary map of the character image to be processed after semantic segmentation;
inputting the character image to be processed and the ternary image into a pre-trained fine segmentation network so that the fine segmentation network predicts the transparency of pixel points of an unknown region in the ternary image to obtain a fine segmentation result of the character image to be processed.
In an embodiment of the present invention, the step of determining the effective domain of each pixel point on the edge curve, and determining the contour point from all the pixel points of the edge curve according to the effective domain includes:
determining the discrete curvature of a pixel point P_i on the edge curve according to the expression y = f(x) of the edge curve and a curvature detection algorithm:

S(P_i) = f_{i+1} - f_i

wherein P_i represents the i-th pixel point on the edge curve, and S(P_i) represents the discrete curvature of the pixel point P_i;

taking the pixel points P_{i-k} and P_{i+k} adjacent to the pixel point P_i on the edge curve, for k = 1, 2, 3, ..., K, and respectively calculating the length l_{ik} of the chord P_{i-k}P_{i+k} and the perpendicular distance d_{ik} from P_i to the chord; wherein P_{i-k}P_{i+k} represents the chord between the pixel points P_{i-k} and P_{i+k};

when the length l_{ik} of the chord and the perpendicular distance d_{ik} satisfy the preset conditions, determining, according to the maximum value k_i of k, the effective-domain radius and the effective domain D(P_i) corresponding to the pixel point P_i;

for a pixel point P_i on the edge curve, when it satisfies the condition |S(P_i)| ≥ |S(P_j)| for all j with |i - j| ≤ k_i/2, retaining the pixel point; otherwise, deleting the pixel point, together with pixel points whose discrete curvature is 0;

for the remaining pixel points on the edge curve, when the effective-domain radius corresponding to a pixel point P_i is k_i = 1 and P_{i-1} or P_{i+1} lies in the effective domain D(P_i), deleting the pixel point P_i that satisfies the condition |S(P_i)| ≤ |S(P_{i-1})| or |S(P_i)| ≤ |S(P_{i+1})|;

for the remaining pixel points on the edge curve, if an effective domain contains more than 2 pixel points, deleting the pixel points P_i other than the effective-domain endpoints; if an effective domain contains 2 pixel points, deleting the pixel point P_{i+1} when |S(P_i)| > |S(P_{i+1})|, and deleting the pixel point P_i when |S(P_i)| < |S(P_{i+1})|;

and determining the remaining pixel points as contour points.
In one embodiment of the present invention, the preset conditions include a first preset condition and a second preset condition; wherein:

the first preset condition is: l_{ik} ≥ l_{i,k+1};

the second preset condition relates the perpendicular distance d_{ik} to the chord length l_{ik} (the formula is given as an image in the original).
in one embodiment of the present invention, said
Figure RE-GDA0003864740880000041
Length l ofikAnd PiTo
Figure RE-GDA0003864740880000042
Perpendicular distance d ofikWhen the preset condition is met, k is taken according to the maximum value of kiDetermining a pixel point PiCorresponding valid fieldRadius and effective field D (P)i) The method comprises the following steps:
when the temperature is higher than the set temperature
Figure RE-GDA0003864740880000043
Length l ofikMeets the first preset condition and/or
Figure RE-GDA0003864740880000044
Length l ofikAnd PiTo
Figure RE-GDA0003864740880000045
Perpendicular distance d ofikWhen the second preset condition is met, the maximum value k is takeniIs determined as a pixel point PiThe radius of the effective domain, and determining the pixel point P according to the radiusiEffective domain D (P) ofi)。
In an embodiment of the present invention, the step of calculating an area and a perimeter of each contour according to a plurality of contours composed of the contour points, and determining a preset number of feature contours according to the area and the perimeter includes:
numbering a plurality of contours formed by the contour points respectively and establishing a mesh structure;
calculating the area s and perimeter l of each contour to obtain two arrays: L = [l_1, l_2, ..., l_m] and S = [s_1, s_2, ..., s_m], where m represents the number of the contour;

when the corresponding elements in the two arrays satisfy the condition (l_m ≥ a) ∪ (s_m ≥ b), determining the contour numbered m as a characteristic contour; wherein a represents a preset perimeter threshold and b represents a preset area threshold.
In one embodiment of the invention, the semantic segmentation network comprises: a first encoder, an atrous spatial pyramid pooling (ASPP) module, a first decoder, and a Softmax layer, the first encoder comprising a plurality of Blocks of the MobileNetV3-mid network; wherein:
the first encoder is used for generating a low-level feature stream and generating a sub-feature map according to the input feature outline and the to-be-processed person image;
the ASPP module is used for extracting different scale features in the sub-feature graph and generating a feature graph according to the sub-feature graph and the different scale features;
the first decoder is used for recovering to obtain a first image according to the feature map and the low-level feature stream, and the size of the first image is the same as that of the figure image to be processed;
the Softmax layer is used for classifying all pixel points in the first image into a foreground region, a background region or an unknown region and outputting a ternary image.
In one embodiment of the invention, the fine segmentation network comprises a U-net, which comprises a second encoder and a second decoder, and a contextual attention (CAM) module; wherein:
the second encoder is used for extracting the transparency characteristic of the character image to be processed and generating an Alpha characteristic stream after the ternary image and the character image to be processed are obtained;
the CAM module is used for predicting the transparency of the pixels classified into the unknown region according to the transparency of the pixels classified into the foreground region and the background region
And the second decoder is used for recovering the original resolution of the figure image to be processed and predicting a transparency mask image according to the Alpha characteristic stream to obtain a segmentation result of the figure image to be processed.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a character image segmentation method based on a characteristic contour, which comprises the steps of firstly generating a character image characteristic contour a priori, marking a character image to be processed by the characteristic contour, then utilizing a semantic segmentation network to classify and predict a ternary diagram in a three-way mode, and further utilizing a fine segmentation network to segment high-quality alpha matching.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flowchart of a method for segmenting a human image based on a feature contour according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a method for segmenting a human image based on a feature contour according to an embodiment of the present invention;
FIG. 3a is a diagram showing a result of a contour detection method in the related art;
FIG. 3b is a diagram of an example of a feature profile provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a semantic segmentation network provided by an embodiment of the present invention;
fig. 5 is a schematic diagram of Block in a MobileNetV3 network according to an embodiment of the present invention;
FIG. 6 is a diagram of a Last Stage structure in a MobileNet V3-mid network according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a fine segmentation network according to an embodiment of the present invention;
fig. 8 is a schematic diagram of the Block modules in the fine segmentation network according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the CAM module according to an embodiment of the invention;
FIG. 10 is a comparison graph of the results of the generation of the ternary graph provided by the embodiments of the present invention;
fig. 11 is a comparison diagram of the human image segmentation result provided by the embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Fig. 1 is a flowchart of a method for segmenting a human image based on a feature contour according to an embodiment of the present invention, and fig. 2 is a schematic diagram of the method for segmenting a human image based on a feature contour according to an embodiment of the present invention. Referring to fig. 1-2, an embodiment of the present invention provides a method for segmenting a human image based on a feature contour, including:
S1, acquiring a character image to be processed;
S2, performing Mean-Shift preprocessing on the character image to be processed to obtain an edge curve;
S3, determining the effective domain of each pixel point on the edge curve, and determining contour points from all the pixel points of the edge curve according to the effective domains;
S4, respectively calculating the area and the perimeter of each contour according to the plurality of contours formed by the contour points, and determining a preset number of characteristic contours according to the areas and perimeters;
S5, inputting the character image to be processed and the characteristic contours into a pre-trained semantic segmentation network, so that the semantic segmentation network generates a low-level feature stream, generates a feature map according to the characteristic contours and the character image to be processed, recovers a first image according to the feature map and the low-level feature stream, and classifies the pixel points in the first image into a foreground region, a background region, or an unknown region, obtaining a ternary map of the semantically segmented character image;
S6, inputting the character image to be processed and the ternary map into a pre-trained fine segmentation network, so that the fine segmentation network predicts the transparency of the pixel points in the unknown region of the ternary map, obtaining the segmentation result of the character image to be processed.
In the person image segmentation method, Mean-Shift preprocessing may first be performed on the character image to be processed. Specifically, the Mean-Shift vector is defined as follows: let the dimension of the space be d, and let the character image to be processed contain N pixel points x_i, i = 1, ..., N; the Mean-Shift vector at a point x is then defined as:

M_h(x) = (1/k) Σ_{x_i ∈ S_h} (x_i - x)

S_h is a point set containing the pixel points that satisfy the following condition:

S_h(x) ≡ { y : (y - x)^T (y - x) ≤ h² }

wherein k represents the number of all pixel points satisfying the above condition; S_h can also be regarded as a high-dimensional sphere with radius h. A spherical space is then created with an arbitrary point P_0 as the origin, and S_p and C_r denote the radii in physical space and color space respectively, where S_p = 1 and C_r = 70. Each point in the sphere defines a color vector relative to the point P_0; the sum of all the vectors is calculated, and the vector sum is computed iteratively for each point, the termination condition of the iteration being that the end point of the vector sum is exactly the sphere center point P_N.
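A minimal sketch of this preprocessing step, assuming OpenCV's pyrMeanShiftFiltering as the Mean-Shift implementation and Canny as the edge extractor (the patent only states that an edge curve is obtained); the radii follow S_p = 1 and C_r = 70 given above:

```python
import cv2
import numpy as np

def mean_shift_edges(image_bgr: np.ndarray) -> np.ndarray:
    """Mean-Shift smoothing (Sp = 1, Cr = 70 as above), then edge extraction."""
    smoothed = cv2.pyrMeanShiftFiltering(image_bgr, sp=1, sr=70)
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)   # Canny thresholds are illustrative
    return edges
```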
Optionally, in step S3, the step of determining an effective domain of each pixel point on the edge curve, and determining a contour point from all pixel points of the edge curve according to the effective domain includes:
S301, determining the discrete curvature of a pixel point P_i on the edge curve according to the expression y = f(x) of the edge curve and a curvature detection algorithm:

S(P_i) = f_{i+1} - f_i

wherein P_i indicates the i-th pixel point on the edge curve, and S(P_i) represents the discrete curvature of the pixel point P_i.

S302, taking the pixel points P_{i-k} and P_{i+k} adjacent to the pixel point P_i on the edge curve, and calculating, for k = 1, 2, 3, ..., K, the length l_{ik} of the chord P_{i-k}P_{i+k} and the perpendicular distance d_{ik} from P_i to the chord; wherein P_{i-k}P_{i+k} represents the chord between the pixel points P_{i-k} and P_{i+k}.

Optionally, in step S302, the chord length may be computed as l_{ik} = √((x_{i+k} - x_{i-k})² + (y_{i+k} - y_{i-k})²), and d_{ik} is the point-to-line distance from P_i to this chord.

S303, when the length l_{ik} of the chord and the perpendicular distance d_{ik} satisfy the preset conditions, determining, according to the maximum value k_i of k, the effective-domain radius and the effective domain D(P_i) corresponding to the pixel point P_i.

Specifically, the preset conditions include a first preset condition and a second preset condition; the first preset condition is l_{ik} ≥ l_{i,k+1}, and the second preset condition relates the perpendicular distance d_{ik} to the chord length l_{ik} (the formula is given as an image in the original).

Optionally, when the length l_{ik} satisfies the first preset condition and/or the length l_{ik} and the perpendicular distance d_{ik} satisfy the second preset condition, the maximum value k_i of k is determined as the effective-domain radius of the pixel point P_i, and the effective domain D(P_i) of the pixel point P_i is determined according to that radius; wherein A and B denote the first preset condition and the second preset condition, respectively (the defining formula for k_i is given as an image in the original).

S304, for a pixel point P_i on the edge curve, when it satisfies the condition |S(P_i)| ≥ |S(P_j)| for all j with |i - j| ≤ k_i/2, retaining the pixel point; otherwise, deleting the pixel point, together with pixel points whose discrete curvature is 0.

S305, for the remaining pixel points on the edge curve, when the effective-domain radius corresponding to a pixel point P_i is k_i = 1 and P_{i-1} or P_{i+1} lies in the effective domain D(P_i), deleting the pixel point P_i that satisfies the condition |S(P_i)| ≤ |S(P_{i-1})| or |S(P_i)| ≤ |S(P_{i+1})|.

S306, for the remaining pixel points on the edge curve, if an effective domain contains more than 2 pixel points, deleting the pixel points P_i other than the effective-domain endpoints; if an effective domain contains 2 pixel points, deleting the pixel point P_{i+1} when |S(P_i)| > |S(P_{i+1})|, and deleting the pixel point P_i when |S(P_i)| < |S(P_{i+1})|.

S307, determining the remaining pixel points as contour points.
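The following sketch condenses steps S301-S307 into a simplified dominant-point selector. The form of the second preset condition (growth of d_ik/l_ik) and the collapsing of the deletion rules S304-S306 into one non-maximum-suppression pass are assumptions made for illustration:

```python
import numpy as np

def chord_and_distance(curve: np.ndarray, i: int, k: int):
    """Chord length l_ik of P_{i-k}P_{i+k} and perpendicular distance d_ik from P_i."""
    n = len(curve)
    a, b, p = curve[(i - k) % n], curve[(i + k) % n], curve[i]
    l_ik = float(np.hypot(*(b - a)))
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return l_ik, abs(cross) / (l_ik + 1e-12)

def contour_points(curve: np.ndarray, K: int = 5) -> np.ndarray:
    """curve: (N, 2) float array of (x, y) points along the closed edge curve."""
    n = len(curve)
    f = curve[:, 1]
    S = np.roll(f, -1) - f                  # discrete curvature S(P_i) = f_{i+1} - f_i

    # Effective-domain radius k_i: grow k while the chord keeps lengthening
    # (first condition) and d_ik / l_ik keeps growing (assumed second condition).
    radii = np.ones(n, dtype=int)
    for i in range(n):
        l_prev, r_prev = 0.0, 0.0
        for k in range(1, K + 1):
            l_ik, d_ik = chord_and_distance(curve, i, k)
            ratio = d_ik / (l_ik + 1e-12)
            if l_ik < l_prev or ratio < r_prev:
                break
            radii[i], l_prev, r_prev = k, l_ik, ratio

    # Keep P_i only if |S(P_i)| dominates its neighbours within k_i / 2 and its
    # curvature is nonzero (steps S304-S306 condensed into one suppression pass).
    kept = []
    for i in range(n):
        w = max(radii[i] // 2, 1)
        window = [abs(S[(i + j) % n]) for j in range(-w, w + 1)]
        if abs(S[i]) > 0 and abs(S[i]) >= max(window):
            kept.append(i)
    return curve[kept]
```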
In the step S4, the step of respectively calculating the area and the perimeter of each contour according to a plurality of contours formed by the contour points, and determining a preset number of feature contours according to the area and the perimeter includes:
S401, numbering the plurality of contours formed by the contour points and establishing a mesh structure;
S402, calculating the area s and perimeter l of each contour to obtain two arrays: L = [l_1, l_2, ..., l_m] and S = [s_1, s_2, ..., s_m], where m represents the number of the contour;
S403, when the corresponding elements in the two arrays satisfy the condition (l_m ≥ a) ∪ (s_m ≥ b), determining the contour numbered m as a characteristic contour; wherein a represents a preset perimeter threshold and b represents a preset area threshold.
Fig. 3a shows the result of a contour detection method in the related art, and fig. 3b shows an example of the characteristic contours provided by an embodiment of the present invention. It should be noted that in this embodiment 2 to 4 characteristic contours must finally be determined, and the perimeter threshold a and the area threshold b may be set flexibly according to actual needs. If the number of determined contours is greater than 4, the perimeter threshold a and the area threshold b are increased, with an adjustment step of 1; conversely, if the number of determined contours is less than 2, the perimeter threshold a and the area threshold b are correspondingly decreased. The characteristic contours of the portrait are obtained by this iterative calculation, and interfering contours inside and outside the portrait are eliminated to a certain degree.
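A sketch of steps S401-S403 together with the iterative threshold adjustment just described, assuming OpenCV contours; the starting thresholds a and b are illustrative, since the patent specifies only the selection rule and the adjustment step of 1:

```python
import cv2
import numpy as np

def feature_contours(edge_map: np.ndarray, a: float = 100.0, b: float = 500.0,
                     max_iters: int = 1000):
    """edge_map: binary edge image; returns the selected characteristic contours."""
    contours, _ = cv2.findContours(edge_map, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    selected = list(contours)
    for _ in range(max_iters):
        # selection rule (l_m >= a) OR (s_m >= b) from steps S402-S403
        selected = [c for c in contours
                    if cv2.arcLength(c, True) >= a or cv2.contourArea(c) >= b]
        if len(selected) > 4:                # too many contours: raise thresholds
            a, b = a + 1, b + 1
        elif len(selected) < 2:              # too few contours: lower thresholds
            a, b = max(a - 1, 0.0), max(b - 1, 0.0)
        else:
            break                            # 2-4 characteristic contours found
    return selected
```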
Fig. 4 is a schematic diagram of a semantic segmentation network according to an embodiment of the present invention, fig. 5 is a schematic diagram of the Block in a MobileNetV3 network according to an embodiment of the present invention, and fig. 6 is a schematic diagram of the Last Stage structure in the MobileNetV3-mid network according to an embodiment of the present invention. Referring to figs. 4-6, in the present embodiment, the semantic segmentation network includes: a first encoder, an atrous spatial pyramid pooling (ASPP) module, a first decoder, and a Softmax layer, the first encoder comprising a plurality of Blocks of the MobileNetV3-mid network; wherein:
the first encoder is used for generating a low-level feature stream and generating a sub-feature map according to the input feature outline and the to-be-processed person image;
the ASPP module is used for extracting different scale characteristics in the sub-characteristic graph and generating a characteristic graph according to the sub-characteristic graph and the different scale characteristics;
the first decoder is used for recovering a first image according to the feature map and the low-level feature stream, the size of the first image being the same as that of the character image to be processed;
the Softmax layer is used for classifying all pixel points in the first image into a foreground area, a background area or an unknown area and outputting a ternary image.
In this embodiment, MobileNet Blocks with SE modules are added to the semantic segmentation network, whose structure is shown in fig. 4. The input image feature block first undergoes a 1 × 1 point-wise convolution and then a 3 × 3 depth-wise convolution, and normalization and non-linear processing are performed after each convolution using BN and the ReLU6 activation function. Attention weights are then added via an SE Block, which mainly performs the Squeeze (F_sq) and Excitation (F_ex) operations.

In MobileNetV3, the activation function of the Block adopts a non-linear activation function NL, and the second fully-connected layer adopts the hard-swish activation function, computed as:

h-swish(x) = x · ReLU6(x + 3) / 6
it should be noted that in this embodiment, the MobileNetV3 network still uses the ReLU6 activation function with higher precision in the Block of the first half, and uses the h-swish activation function in the second half.
Further, the overall structure of the MobileNetV3 network is shown in Table 1. Taking a character image to be processed with a resolution of 320 × 320 as an example, the image is input into the network; since the number of convolution kernels in the first convolution is reduced from 32 to 16, the change in the number of channels at the head of the network structure is avoided, the corresponding parameter amount is reduced, about 2 ms is saved, and the accuracy is almost unaffected. In Table 1, input represents the shape of the feature map input to the current layer, out represents the number of output channels, and Bottleneck represents the Block of a MobileNetV2 network, where SE indicates that the Block contains an SE module, NL is the activation function type, and s is the stride. The number attached to the bottom of Bottleneck is the convolution kernel size of the DW convolution, and NBN indicates that no normalization is used.
TABLE 1
(Table 1, the layer-by-layer structure of the MobileNetV3-mid network, is given as images in the original.)
Compared with the conventional MobileNetV3-large network, the MobileNetV3-mid network provided by the invention removes 2 Bottlenecks, but the number of feature extractions is not reduced and remains 5; only one Bottleneck is removed between two of the feature extractions. MobileNetV3-mid is effectively faster than MobileNetV3-large in the image semantic segmentation task, and the loss in accuracy is negligible.
Further, as shown in Table 2, when the MobileNetV3-mid network structure is used as the backbone network in the present embodiment, an ASPP module is added to capture the multi-scale features of the image, so the Last Stage can be simplified. The simplified Last Stage structure is shown in fig. 6: after the first Conv convolution, the pooling layer and the subsequent convolution operations are discarded, and those operations are instead performed directly in the subsequent ASPP.
The feature map output by the first 1 × 1 convolution of the structure shown in Table 2, namely the 10 × 480 deep feature output, is input into the ASPP module, which can extract features of different scales. The atrous convolutions used enlarge the receptive field of the network and avoid the information loss of the pooling process, so the convolution output contains wider and larger-scale information.
Specifically, the ASPP module used in the present invention connects a 1 × 1 convolution, three atrous convolutions with dilation rates of 6, 12, and 18, and a global-information layer in parallel, the output channels of each branch being 256; these parallel layers are finally concatenated, and the result is output after a 1 × 1 convolution. The ASPP module uses depthwise-separable atrous convolutions with different dilation rates for feature extraction, and the design of the 5 parallel modules is shown in Table 2. To increase the network convergence speed, batch normalization (BN) and ReLU activation functions are used after each convolution layer in the improved ASPP module of this embodiment, and the outputs of the five structure blocks are denoted a0, a1, a2, a3, and a4 respectively. The five blocks are concatenated, i.e., a0 + a1 + a2 + a3 + a4, with a concatenated size of 10 × 1280; the number of channels is changed to 256 by a 1 × 1 convolution, and the final output is 10 × 256, denoted y. The invention improves the ASPP module, captures more global information using a larger sampling rate, and thus helps improve the segmentation effect.
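A sketch of the ASPP module as described: a 1 × 1 convolution, three depthwise-separable atrous convolutions with rates 6, 12, and 18, and a global-information branch, each with 256 output channels, concatenated and fused by a final 1 × 1 convolution. The channel sizes follow the text (480 in, 5 × 256 concatenated, 256 out); the exact depthwise-separable decomposition is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int = 480, out_ch: int = 256):
        super().__init__()
        def branch(rate: int) -> nn.Sequential:
            # depthwise atrous convolution followed by a pointwise convolution
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate, groups=in_ch),
                nn.Conv2d(in_ch, out_ch, 1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.b0 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.b1, self.b2, self.b3 = branch(6), branch(12), branch(18)
        self.gp = nn.Sequential(nn.AdaptiveAvgPool2d(1),       # global information
                                nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.project = nn.Sequential(nn.Conv2d(5 * out_ch, out_ch, 1),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = F.interpolate(self.gp(x), size=x.shape[2:], mode="bilinear",
                          align_corners=False)
        y = torch.cat([self.b0(x), self.b1(x), self.b2(x), self.b3(x), g], dim=1)
        return self.project(y)   # 5 x 256 -> 256 channels, as in the text
```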
TABLE 2
(Table 2, the design of the five parallel ASPP branches, is given as an image in the original.)
TABLE 3
(Table 3, the structure of the first decoder, is given as images in the original.)
As shown in fig. 4, the semantic segmentation network further comprises a first decoder. For the first decoder, as shown in Table 3, the 80 × 24 low-level feature stream from the MobileNetV3-mid module is processed first: the number of channels is adjusted to 64 by a 1 × 1 convolution, followed by BN and ReLU operations. The output y of the ASPP module is then up-sampled to enlarge the feature map, using resize_images to change the size to 80 × 256. The two are then concatenated, the concatenated feature map passes through two 3 × 3 convolution kernels and two 1 × 1 convolutions, and finally resize_images is used to change the size to 320 × 3.
For the obtained 320 × 3 feature map, the softmax layer reshapes it to a size of 102400 × 3 and then performs a three-class operation on the 102400 pixels, judging whether each pixel belongs to the background region, the foreground region, or the unknown region, thereby completing the semantic segmentation.
Illustratively, the softmax function is:

y_i = e^{x_i} / Σ_{j=1}^{n} e^{x_j}

where y_i is the probability distribution computed by softmax over the 102400 × 3 output image block, and n represents the number of categories; here n = 3.
In this embodiment, the three classes of the ternary map are represented by white, black, and gray respectively; each pixel point of the character image to be processed is traversed, the class with the highest probability for that pixel is determined, and the pixel is classified into the corresponding region.
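The final trimap assignment can be sketched as a per-pixel argmax over the three-class output; the class ordering (background, foreground, unknown) and the gray value 128 are assumptions for illustration:

```python
import numpy as np

def probs_to_trimap(probs: np.ndarray) -> np.ndarray:
    """probs: (H, W, 3) softmax output over (background, foreground, unknown)."""
    labels = probs.argmax(axis=-1)
    trimap = np.zeros(labels.shape, dtype=np.uint8)   # background -> black (0)
    trimap[labels == 1] = 255                          # foreground -> white
    trimap[labels == 2] = 128                          # unknown -> gray
    return trimap
```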
Fig. 7 is a schematic diagram of a fine segmentation network according to an embodiment of the present invention, fig. 8 is a schematic diagram of the Block modules in the fine segmentation network according to an embodiment of the present invention, and fig. 9 is a schematic diagram of the CAM module according to an embodiment of the present invention. Referring to figs. 7-9, the fine segmentation network includes a U-net, comprising a second encoder and a second decoder, and a contextual attention (CAM) module; wherein:
the second encoder is used for extracting the transparency characteristic of the character image to be processed and generating an Alpha characteristic stream after acquiring the ternary image and the character image to be processed;
the CAM module is used for predicting the transparency of the pixels classified into the unknown region according to the transparency of the pixels classified into the foreground region and the background region
The second decoder is used for recovering the original resolution of the character image to be processed and predicting the transparency mask image according to the Alpha feature stream, to obtain the segmentation result of the character image to be processed.
Specifically, the structure of the fine segmentation network is shown in fig. 7, and the processing procedure of the image in the fine segmentation network is as follows:
and extracting the characteristics of the character image to be processed by using a second encoder of the fine segmentation network, wherein the overall structure of the fine segmentation network adopts U-net, the shape of the fine segmentation network is similar to that of the letter U, and the downward part of the fine segmentation network is the second encoder. Optionally, the second encoder has five layers, each layer is short-circuited by a short cut block, so that the second decoder combines features of the second encoder before the upsampling block, and can avoid performing more convolution on a feature map after the second encoder. In addition, the embodiment aligns the channels of the second encoder feature map with two layer short cut blocks to realize feature fusion. In addition, the original input of the image is directly connected to the last convolution layer of the network after short cut, so that the features can not generate any calculation operation in the backbone network, and the detail features and the gradient information are well stored.
The structure of each block is shown in fig. 8. The shortcut block module consists of two 3 × 3 strided convolutions; the Res Block consists of two 3 × 3 ordinary convolutions and one 1 × 1 convolution; and the four downsampling convolution blocks (Res Down Block), each consisting of two 3 × 3 convolutions, one average pooling operation, and one 1 × 1 convolution, extract the depth features of the image.
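A sketch of the Res Down Block described above (two 3 × 3 convolutions, an average-pooling operation, and a 1 × 1 convolution); how the 1 × 1 convolution and pooling sit on the residual path is an assumption:

```python
import torch
import torch.nn as nn

class ResDownBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(          # two 3x3 convolutions + downsample
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(2))
        self.skip = nn.Sequential(          # 1x1 convolution on the skip path
            nn.AvgPool2d(2), nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.skip(x)  # residual addition
```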
The structure of the CAM module is shown in fig. 9. The CAM module is used to predict the opacity of the unknown region according to the information around the foreground and background regions. The CAM module computes affinities over the learned low-level features and propagates high-level opacity information based on the appearance information, so that the rich features generated by the network can be used effectively.
It should be noted that the low-level feature stream is the feature information of the intermediate features of the semantic segmentation network after passing through the Feature Change module, and the Alpha prediction feature stream is the information stream transmitted by the second encoder in the fine segmentation network, containing an unknown region and a known region. Since the low-level feature stream has the same size as the Alpha prediction feature stream, the information of the Alpha prediction feature stream is used to divide the detail features of the low-level feature stream into two parts: a known region (foreground and background regions) and an unknown region. The whole image feature stream is cut with a 3 × 3 window; the cut pieces are reshaped into one-dimensional features to form a set of feature patches, which is used as a convolution kernel for similarity computation with the feature map of the unknown region:

s_{(x,y),(x',y')} = < U_{x,y} / ||U_{x,y}||, I_{x',y'} / ||I_{x',y'}|| >

where U_{x,y} represents a feature patch of the unknown region centered on (x, y), I_{x',y'} represents an image feature patch centered on (x', y'), and s_{(x,y),(x',y')} indicates the similarity of the two regions; U_{x,y} ∈ μ is also an element of the image feature patch set τ (μ ∈ τ), and the constant λ is a penalty hyper-parameter, set to -10^4 and added to the similarity at a patch's own location, which prevents an unknown-region patch from matching itself too strongly.
Starting from (x', y'), a scaled softmax is performed to obtain the attention weight of each feature patch according to the following formula:

a_{(x,y),(x',y')} = softmax( ω(μ, κ, x', y') · s_{(x,y),(x',y')} )

clamp(φ) = min(max(φ, 0.1), 10)

where ω(·) represents a weight function and κ = I - μ is the set of known-region patches in the image feature patch set. However, the size of the unknown region in the ternary map is uncertain, and there may be a large number of unknown regions, so the present embodiment sets the weight of each feature patch according to the size of the known region; the designed weight function is built from the clamp function above (its formula is given as an image in the original).
When the known region is larger, its feature patches carry more accurate detail information to distinguish foreground from background, and can therefore be given a large weight; conversely, if the known region is small, its feature patches provide only a small amount of appearance information, insufficient for predicting opacity, so they are given a smaller weight. Meanwhile, the Alpha prediction feature stream is divided in the same way and reshaped for use as a convolution kernel; a deconvolution operation then reproduces the unknown region on an Alpha feature map, and the final result is fused with the original Alpha features and passed downward for training.
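A heavily simplified sketch of this contextual-attention step: cosine similarity between unknown-region patches and all image patches, a scaled softmax with the λ = -10^4 self-penalty, and weighted aggregation of the Alpha-stream patches. Patch extraction and the ω(μ, κ, x', y') weighting are condensed into a single scale factor here:

```python
import torch
import torch.nn.functional as F

def contextual_attention(unknown_feats: torch.Tensor,
                         image_patches: torch.Tensor,
                         alpha_patches: torch.Tensor,
                         self_index: torch.Tensor,
                         scale: float = 1.0) -> torch.Tensor:
    """unknown_feats: (M, C) unknown-region patch features U_{x,y};
    image_patches: (N, C) image feature patches I_{x',y'};
    alpha_patches: (N, C) corresponding Alpha-stream patches;
    self_index: (M,) long tensor, index of each unknown patch in image_patches."""
    u = F.normalize(unknown_feats, dim=1)
    p = F.normalize(image_patches, dim=1)
    sim = u @ p.t()                                    # cosine similarity s
    sim[torch.arange(len(u)), self_index] = -1e4       # lambda self-penalty
    attn = F.softmax(scale * sim, dim=1)               # scaled softmax weights
    return attn @ alpha_patches                        # propagate opacity info
```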
Further, the picture size is restored using the second decoder of the fine segmentation network, i.e., the ascending part of the U-shaped network, which also has 5 layers. Referring again to fig. 8, the upsampling convolution block (Res Up Block) of the first four decoder layers consists of two 3 × 3 convolutions, a nearest-neighbor upsampling operation, and a 1 × 1 convolution; the features output by the upsampling block are processed by a Res module and then added, by element-wise feature addition, to the bottom-layer information from the shortcut block. The Feature Change module is used to adjust the size and channels of the intermediate features in order to obtain the same size and channel number as the other input feature map of the CAM module. The intermediate feature of the semantic segmentation network is 80 × 24; a single 3 × 3 convolution with stride 2 brings the output size to 40 × 40, and one 1 × 1 convolution adjusts the number of channels to 64. Using the original image instead of the 80 × 24 intermediate feature would require re-adjusting it by three 3 × 3 convolutions with stride 2, increasing the computational burden of the network. This design makes full use of the semantic segmentation network, strengthens the connection between the two sub-networks, and reduces unnecessary computational redundancy. The last layer of the decoder consists of one Deconv and one Conv operation. All convolution operations in the figure add spectral normalization (SN) and batch normalization (BN) to speed up convergence.
It should be noted that the fine segmentation network predicts depth alpha features around the portrait contour, so the supervision signal used during training is the alpha mask of the unknown region in the ternary map, defined as the absolute difference between the predicted alpha matte and the Ground Truth over the unknown region. The loss function of the fine segmentation network is:

L_d = m_μ ||α_p - α_g||_1

where m_μ is a binary mask indicating whether a pixel lies in the unknown region of the ternary map (m_μ = 1 means the pixel is in the unknown region, m_μ = 0 means it is in the known region), α_p denotes the predicted alpha matte, and α_g denotes the Ground-Truth alpha matte. Masking the determined regions with m_μ plays the role of attention and effectively simplifies the loss computation and feedback; predictions in the m_μ = 0 regions may not be accurate enough, and the details inside the portrait foreground in particular are too difficult to predict, which would consume large computational resources and affect the segmentation efficiency of the whole network.
The portrait fine segmentation algorithm based on characteristic contours consists of a semantic segmentation network and a fine segmentation sub-network; each network plays its own role, with independent prediction tasks and loss functions, yet they are tightly connected in an end-to-end computation and training process. Therefore, a joint loss function is needed to compute the loss of the whole network; the loss is the weighted sum of the loss function L_t of the semantic segmentation network and the loss function L_d of the fine segmentation network:

Loss = ε L_t + (1 - ε) L_d

where ε represents the weight of the loss function; its value is 0.1 during training, emphasizing the fineness of the alpha matte while using a high-quality ternary map as constraint.
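A minimal sketch of the two training losses; L_d and the joint loss follow the formulas above with ε = 0.1, while the form of the semantic segmentation loss L_t is assumed to be a standard three-class cross-entropy, since the text does not spell it out:

```python
import torch
import torch.nn.functional as F

def fine_loss(alpha_pred: torch.Tensor, alpha_gt: torch.Tensor,
              unknown_mask: torch.Tensor) -> torch.Tensor:
    """L_d = m_mu * ||alpha_p - alpha_g||_1, averaged over the unknown pixels."""
    diff = unknown_mask * (alpha_pred - alpha_gt).abs()
    return diff.sum() / unknown_mask.sum().clamp(min=1.0)

def joint_loss(trimap_logits: torch.Tensor, trimap_gt: torch.Tensor,
               alpha_pred: torch.Tensor, alpha_gt: torch.Tensor,
               unknown_mask: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Loss = eps * L_t + (1 - eps) * L_d, with eps = 0.1 as in the text."""
    l_t = F.cross_entropy(trimap_logits, trimap_gt)  # assumed form of L_t
    l_d = fine_loss(alpha_pred, alpha_gt, unknown_mask)
    return epsilon * l_t + (1.0 - epsilon) * l_d
```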
The present invention further describes the above method for segmenting a human image based on a feature contour through experiments.
FIG. 10 is a comparison of ternary map generation results provided by embodiments of the present invention. Referring to fig. 10, the experimental results of generating ternary maps with the semantic segmentation network based on characteristic contours are analyzed as follows. The first column is the character image to be processed, the second column is the Ground Truth of the first-column image, and the third column is the ternary map obtained by applying erosion-dilation to the second column. For more complicated contour regions (such as hair and clothing edges), more area needs to be marked as unknown; if the convolution kernel of the morphological operation is enlarged, the unknown region of the whole ternary map expands, affecting the efficiency and precision of the subsequent fine segmentation. The fourth column is the ternary map obtained after semantic segmentation and further classification with DeepLabV3+; it can be seen that the portrait edges are not complete and clear enough, detail features are lost, and more positions are predicted as unknown regions relative to the fifth column. The fifth column is the ternary map generated by the semantic segmentation network provided by the invention; after the characteristic-contour generation preprocessing, the unknown region of the ternary map is smaller in area and its edges are clearer, with fewer cases of roughness and misjudgment, and the added attention mechanism and ASPP module improve the accuracy and segmentation efficiency of high-frequency features, so the quality of the fifth-column ternary maps is clearly higher than that of the fourth column. As shown in fig. 10, most unknown regions are marked in bands of 10 pixel units during manual marking; the marked area and accuracy can be controlled manually, but the overall efficiency is low: the average time to process one image with Photoshop is 3 min, whereas the automatic semantic segmentation network generates ternary maps at an average speed of 22.3 fps, a millisecond-level generation speed that manual operation cannot match.
As shown in Table 4, the present invention uses the mean intersection-over-union (MIoU) and the percentage of unknown-region pixels in the image (UP) for analytical validation of the data:
TABLE 4
(Table 4 is given as an image in the original.)
The ternary map generated by the semantic segmentation network based on characteristic contours (FCSS) is closest to manual labeling, with an average unknown-region area ratio of 3.88%, 30% less than that of the ternary map generated by the DeepLabV3+ algorithm. Taking the manually marked ternary map as the Ground Truth, the intersection-over-union (IoU) between each algorithm's result and the Ground Truth is calculated. The foreground-region IoU (fgIoU), background-region IoU (bgIoU), unknown-region IoU (uniIoU), and mean IoU (MIoU) of the semantic segmentation algorithm based on characteristic contours are all the highest, and the classification performance of the algorithm is clearly improved. The parameter count of the DeepLabV3+ algorithm using MobileNetV3-large as the backbone network is 5.6M. The invention proposes MobileNetV3-mid and designs the semantic segmentation network FCSS with MobileNetV3-mid as its backbone; the parameter count of the whole network model is only 4.8M, and the processing efficiency is also improved. The characteristic-contour generation preprocessing algorithm provided by the invention predicts most contour edges well, and the attention module introduced in the design of the semantic segmentation network accurately locates and strengthens edge information. ASPP then fuses information of different sizes, greatly improving classification performance. The semantic segmentation network based on characteristic contours thus achieves a better segmentation effect by both subjective and objective evaluation, and the generated high-quality ternary map provides a good basis for the subsequent fine segmentation.
Fig. 11 is a comparison of person image segmentation results provided by an embodiment of the present invention. Referring to fig. 11, the overall effect of the characteristic-contour-based portrait segmentation is analyzed as follows. Model training and verification are carried out on the Deep Image Matting and PPM-100 data sets. In fig. 11, the first column is the portrait image to be processed; the seventh column is the Ground Truth of the original segmentation map; the second column is the segmentation effect of DCNN; the third column shows the MODNet segmentation effect; and the fourth column is the segmentation effect of the Deep Image Matting algorithm, which uses the ternary map generated by erosion-dilation as input. The fifth column uses DeepLabV3+ as the semantic segmentation network to generate the ternary map, combined with the CAM-based fine segmentation network. The sixth column is the segmentation effect of the algorithm of the present invention. The small red-framed regions in the segmentation maps are enlarged detail views, selected from clearly distinguishable regions in each group of images. As can be seen from fig. 11, segmentation accuracy increases from left to right, and the visual effect also gradually improves. For close-range high-definition portraits, which contain many fine features such as hair, the segmentation effect of the present algorithm reaches the level of individual hairs. The amount of detail the algorithms retain still differs greatly: the portrait edge detail obtained with the DCNN algorithm is minimal, portrait features are easily lost in whole blocks, and stability is insufficient. The segmentation map of the MODNet algorithm, since it does not use a ternary map as input, is slightly worse: fine edge features are blurred, several hairs are often predicted as one blurred strand, there are more mushy shadows, regions with many dark parts such as shadows are recognized as portrait, and the segmentation effect is also unstable. The segmentation effect of the Deep Image Matting algorithm is clearly better than the preceding algorithms, retains the most detail features, and yields very sharp, clear portrait edges. However, careful observation shows over-segmentation in its alpha matte: some segmented hairs are thicker than in the Ground Truth, details are too sharp and not soft enough, and some features inside the portrait are wrongly determined as background, a certain misjudgment problem. In particular, for semi-transparent wedding-dress images its transparency determination is inferior to the algorithm of the present invention, and the parameter count of its complex network architecture causes some images in the validation set with resolutions exceeding 2500 × 4000 to overflow GPU memory during processing. The segmentation effect of the fifth column is close to that of the sixth, with slightly less retained detail; under more complex backgrounds, the portrait contour edge is misjudged as background, so the edge is not continuous enough.
In conclusion, the segmentation effect of the present algorithm is closest to the Ground Truth: edges are clear, more details are retained, the portrait is neither blurred nor too hard, and over-segmentation is rare. The segmentation effect is also stable for different object distances, with strong adaptability to environments such as over-dark and over-bright scenes.
As shown in Table 5, the present invention calculated the MSE, SAD, Gradient, and Connectivity values on the validation set; the objective analysis of the segmentation effect is as follows:
TABLE 5
Algorithm          Trimap   MSE (×10⁻¹)   SAD      Grad     Conn
DIM-Trimapless     -        0.110         70.31    70.06    70.05
DCNN               -        0.079         122.40   129.57   121.80
MODNet             -        0.041         50.05    42.31    -
DIM+DE             ✓        0.014         50.04    31.00    50.08
DIM+FCSS           ✓        0.013         46.03    36.32    49.22
Alpha GAN          ✓        0.030         52.40    38.00    -
Ours+Deeplabv3+    ✓        0.0108        41.12    22.46    36.74
Ours+FCSS          ✓        0.0091        35.22    18.81    44.23
Ours denotes the CAM-based fine segmentation network proposed by the present invention. As can be seen from Table 5, the segmentation algorithm proposed by the invention is superior to the control-group algorithms on the evaluation indices. Whether each algorithm uses a ternary map also greatly influences segmentation precision: DIM-Trimapless indicates that after the trimap channel is deleted from the network structure of the Deep Image Matting algorithm, the network predicts the alpha values directly, and the precision of the algorithm drops significantly. DIM+DE means the DIM algorithm takes the ternary map generated by erosion-dilation as input; DIM+FCSS generates the ternary map with the semantic segmentation algorithm provided by the invention and inputs it into the DIM network; and Alpha GAN is also a segmentation algorithm that requires a ternary map. The MSE, SAD, and other metrics of the trimap-free MODNet algorithm are better than those of the other trimap-free designs, but if MODNet is combined with the semantic segmentation network FCSS proposed herein, i.e., a ternary map channel is added during network training, the effect also improves markedly, with MSE reduced by about 50% on the PPM-100 data set. Meanwhile, the quality of the input ternary map during training also influences segmentation precision: as a comparison, when the ternary maps generated by the DeepLabV3+ algorithm are input into the DIM network and the CAM-based fine segmentation network, the MSEs are 0.013 and 0.0112 respectively, but after the ternary maps generated by the FCSS algorithm are used as input, the MSEs drop by 18.8% and 12.5% on the PPM-100 data set. The fine segmentation algorithm of the present invention (Ours) reduces MSE by 27% on average relative to the DIM algorithm, and Ours+FCSS is the best segmentation in these groups of experiments. The experimental results on the two data sets differ little, objectively demonstrating the advantages and disadvantages of each algorithm.
The beneficial effects of the invention are that:
the invention provides a character image segmentation method based on a characteristic contour, which comprises the steps of firstly carrying out prior generation of the characteristic contour of a character, marking the character image to be processed by the characteristic contour, then utilizing a semantic segmentation network to carry out three-classification prediction on a ternary image, and further utilizing a fine segmentation network to segment high-quality alpha matching.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this specification can be combined and combined by those skilled in the art.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, and all such deductions and substitutions shall be deemed to fall within the scope of the invention.

Claims (7)

1. A character image segmentation method based on characteristic contours is characterized by comprising the following steps:
acquiring a character image to be processed;
performing Mean-Shift preprocessing on the character image to be processed to obtain an edge curve;
determining the effective domain of each pixel point on the edge curve, and determining contour points from all the pixel points of the edge curve according to the effective domain;
respectively calculating the area and the perimeter of each contour according to a plurality of contours formed by the contour points, and determining a preset number of characteristic contours according to the areas and the perimeters;
inputting the character image to be processed and the feature contour into a pre-trained semantic segmentation network, so that the semantic segmentation network generates a low-level feature stream, generates a feature map according to the feature contour and the character image to be processed, recovers a first image according to the feature map and the low-level feature stream, and classifies the pixel points in the first image into a foreground region, a background region or an unknown region, thereby obtaining a ternary map of the character image to be processed after semantic segmentation;
inputting the character image to be processed and the ternary map into a pre-trained fine segmentation network, so that the fine segmentation network predicts the transparency of the pixel points of the unknown region in the ternary map to obtain a segmentation result of the character image to be processed.
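As a concrete reading of the pre-processing step recited in claim 1, the minimal sketch below uses OpenCV: Mean-Shift filtering flattens colour texture before the edge curves are extracted. The filter radii, the Canny thresholds and the use of findContours to trace the edge curves are illustrative assumptions; the patent does not prescribe a specific implementation.

```python
import cv2

def edge_curves(image_bgr, sp=21, sr=30):
    """Mean-Shift preprocessing followed by edge-curve extraction (claim 1)."""
    smoothed = cv2.pyrMeanShiftFiltering(image_bgr, sp=sp, sr=sr)
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)             # binary edge map
    # Each traced contour plays the role of one "edge curve" y = f(x).
    curves, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    return curves
```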
2. The method of claim 1, wherein the step of determining the effective domain of each pixel point on the edge curve and determining contour points from all the pixel points of the edge curve according to the effective domain comprises:
determining the discrete curvature of each pixel point P_i on the edge curve according to the expression y = f(x) of the edge curve and a curvature detection algorithm:
S(P_i) = f_{i+1} - f_i
wherein P_i represents the i-th pixel point on the edge curve, and S(P_i) represents the discrete curvature of pixel point P_i;
taking pixel points P_{i-k} and P_{i+k} adjacent to pixel point P_i on the edge curve, and for k = 1, 2, 3, ..., calculating the length l_{ik} of the chord P_{i-k}P_{i+k} and the perpendicular distance d_{ik} from P_i to the chord P_{i-k}P_{i+k}; wherein P_{i-k}P_{i+k} represents the chord between pixel points P_{i-k} and P_{i+k};
when the length l_{ik} of the chord P_{i-k}P_{i+k} and the perpendicular distance d_{ik} from P_i to the chord satisfy a preset condition, determining, according to the maximum value k_i of k, the effective-domain radius corresponding to pixel point P_i and the effective domain D(P_i);
for each pixel point P_i on the edge curve, retaining the pixel point when it satisfies the condition |S(P_i)| ≥ |S(P_j)| for |i-j| < k_i/2; otherwise, deleting the pixel point, and also deleting the pixel points whose discrete curvature is 0;
for the remaining pixel points on the edge curve, when a pixel point P_i has effective-domain radius k_i = 1 and its effective domain D(P_i) contains P_{i-1} or P_{i+1}, deleting the pixel point P_i that satisfies the condition |S(P_i)| ≤ |S(P_{i-1})| or |S(P_i)| ≤ |S(P_{i+1})|;
for the remaining pixel points on the edge curve, if an effective domain containing more than 2 pixel points exists, deleting the pixel points P_i other than the end points of that effective domain; if an effective domain containing 2 pixel points exists, deleting pixel point P_{i+1} when |S(P_i)| > |S(P_{i+1})| and deleting pixel point P_i when |S(P_i)| < |S(P_{i+1})|;
and determining the remaining pixel points as contour points.
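The screening in claim 2 can be read as a dominant-point detector over a closed curve. The sketch below computes the discrete curvature of claim 2 and a region-of-support radius: the growing-chord test matches the first preset condition of claim 3, while the d/l ratio test stands in for the second condition and is an assumption in the style of classical dominant-point detection. Non-maximum suppression over |S(P_i)| within each effective domain, as recited in the claim, then leaves the contour points.

```python
import numpy as np

def discrete_curvature(points):
    """S(P_i) = f_{i+1} - f_i for a closed curve sampled as y = f(x) (claim 2)."""
    y = points[:, 1].astype(float)
    return np.roll(y, -1) - y

def effective_domain_radius(points, i, k_max=10):
    """Radius k_i: the largest k whose chord P_{i-k}P_{i+k} keeps lengthening
    (first preset condition) while d_ik / l_ik does not grow (assumed reading
    of the second preset condition)."""
    n = len(points)
    best_k, prev_l, prev_r = 1, 0.0, float("inf")
    for k in range(1, k_max + 1):
        a = points[(i - k) % n].astype(float)
        b = points[(i + k) % n].astype(float)
        chord = b - a
        l = float(np.hypot(chord[0], chord[1]))                     # length l_ik
        v = points[i].astype(float) - a
        d = abs(chord[0] * v[1] - chord[1] * v[0]) / max(l, 1e-9)   # distance d_ik
        r = d / max(l, 1e-9)
        if l < prev_l or r > prev_r:
            break
        best_k, prev_l, prev_r = k, l, r
    return best_k
```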
3. The method of claim 2, wherein the preset conditions comprise a first preset condition and a second preset condition; wherein:
the first preset condition is: l_{ik} ≥ l_{i,k+1};
the second preset condition is: d_{ik}/l_{ik} ≥ d_{i,k+1}/l_{i,k+1}.
4. The method of claim 3, wherein the step of determining, according to the maximum value k_i of k, the effective-domain radius corresponding to pixel point P_i and the effective domain D(P_i) when the length l_{ik} of the chord P_{i-k}P_{i+k} and the perpendicular distance d_{ik} from P_i to the chord satisfy the preset condition comprises:
when the length l_{ik} of the chord P_{i-k}P_{i+k} satisfies the first preset condition, and/or the length l_{ik} and the perpendicular distance d_{ik} from P_i to the chord satisfy the second preset condition, determining the maximum value k_i of k as the effective-domain radius of pixel point P_i, and determining the effective domain D(P_i) of pixel point P_i according to that radius.
5. The method of claim 2, wherein the step of calculating an area and a perimeter of each contour according to a plurality of contours formed by the contour points and determining a predetermined number of feature contours according to the area and the perimeter comprises:
numbering a plurality of contours formed by the contour points respectively and establishing a mesh structure;
calculating the area s and the perimeter l of each contour to obtain two arrays: L = [l_1, l_2, ..., l_m] and S = [s_1, s_2, ..., s_m], wherein m represents the serial number of a contour;
when the corresponding elements in the two arrays satisfy the condition (l_m ≥ a) ∪ (s_m ≥ b), determining the contour numbered m as a feature contour; wherein a represents a preset perimeter threshold and b represents a preset area threshold.
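The screening rule of claim 5 maps directly onto OpenCV's contour measurements. A minimal sketch, assuming the perimeter threshold a and area threshold b are tuned per data set (the patent leaves both as preset values):

```python
import cv2

def select_feature_contours(contours, a=200.0, b=500.0):
    """Keep contour m when (l_m >= a) or (s_m >= b), per claim 5."""
    kept = []
    for c in contours:
        l = cv2.arcLength(c, closed=True)   # perimeter l_m
        s = cv2.contourArea(c)              # area s_m
        if l >= a or s >= b:
            kept.append(c)
    return kept
```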
6. The method of claim 1, wherein the semantic segmentation network comprises: a first encoder, an atrous spatial pyramid pooling (ASPP) module, a first decoder and a Softmax layer, the first encoder comprising a plurality of blocks of a MobileNet V3-mid network; wherein:
the first encoder is used for generating a low-level feature stream and generating a sub-feature map according to the input feature contour and the character image to be processed;
the ASPP module is used for extracting features of different scales from the sub-feature map and generating a feature map according to the sub-feature map and the features of different scales;
the first decoder is used for recovering a first image according to the feature map and the low-level feature stream, the size of the first image being the same as that of the character image to be processed;
the Softmax layer is used for classifying each pixel point in the first image into a foreground region, a background region or an unknown region and outputting a ternary map.
7. The method for segmenting a character image based on feature contours according to claim 6, wherein the fine segmentation network comprises a U-net and a contextual attention (CAM) module, the U-net comprising a second encoder and a second decoder; wherein:
the second encoder is used for extracting transparency features of the character image to be processed after receiving the ternary map and the character image to be processed, and for generating an Alpha feature stream;
the CAM module is used for predicting the transparency of the pixel points classified into the unknown region according to the transparency of the pixel points classified into the foreground region and the background region;
and the second decoder is used for recovering the original resolution of the character image to be processed and predicting a transparency mask map according to the Alpha feature stream, so as to obtain the segmentation result of the character image to be processed.
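Finally, the division of labour between the ternary map of claim 6 and the transparency prediction of claim 7 can be summarised in a few lines: only the unknown band takes the predicted alpha, while the known regions stay fixed. A sketch, assuming the common 255/128/0 grey-level encoding of the ternary map:

```python
import numpy as np

def assemble_alpha(trimap, predicted_alpha):
    """Combine known regions with the fine network's prediction (claim 7)."""
    alpha = np.zeros(trimap.shape, dtype=np.float32)
    alpha[trimap == 255] = 1.0                       # foreground stays opaque
    unknown = trimap == 128
    alpha[unknown] = predicted_alpha[unknown]        # CAM-refined band
    return alpha
```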
CN202210622266.4A 2022-06-02 2022-06-02 Character image segmentation method based on characteristic contour Pending CN115272378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210622266.4A CN115272378A (en) 2022-06-02 2022-06-02 Character image segmentation method based on characteristic contour

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210622266.4A CN115272378A (en) 2022-06-02 2022-06-02 Character image segmentation method based on characteristic contour

Publications (1)

Publication Number Publication Date
CN115272378A true CN115272378A (en) 2022-11-01

Family

ID=83759906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210622266.4A Pending CN115272378A (en) 2022-06-02 2022-06-02 Character image segmentation method based on characteristic contour

Country Status (1)

Country Link
CN (1) CN115272378A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576076A (en) * 2023-12-14 2024-02-20 湖州宇泛智能科技有限公司 Bare soil detection method and device and electronic equipment


Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111325203B (en) American license plate recognition method and system based on image correction
CN112232349B (en) Model training method, image segmentation method and device
CN109583425B (en) Remote sensing image ship integrated recognition method based on deep learning
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN111553837B (en) Artistic text image generation method based on neural style migration
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN107609549A (en) The Method for text detection of certificate image under a kind of natural scene
CN108629286B (en) Remote sensing airport target detection method based on subjective perception significance model
CN112036231B (en) Vehicle-mounted video-based lane line and pavement indication mark detection and identification method
CN107066916A (en) Scene Semantics dividing method based on deconvolution neutral net
CN110706235B (en) Far infrared pedestrian detection method based on two-stage cascade segmentation
CN113158862A (en) Lightweight real-time face detection method based on multiple tasks
CN114663346A (en) Strip steel surface defect detection method based on improved YOLOv5 network
CN109948533B (en) Text detection method, device and equipment and readable storage medium
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN110807384A (en) Small target detection method and system under low visibility
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114037640A (en) Image generation method and device
CN115272378A (en) Character image segmentation method based on characteristic contour
CN114550134A (en) Deep learning-based traffic sign detection and identification method
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination