CN115331254A - Anchor-free instance-level portrait semantic parsing method - Google Patents

Anchor-free instance-level portrait semantic parsing method

Info

Publication number
CN115331254A
CN115331254A (application CN202210203916.1A)
Authority
CN
China
Prior art keywords
character
module
instance
semantic
portrait
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210203916.1A
Other languages
Chinese (zh)
Inventor
鲍虎军
李特
操晓春
张三义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Zhejiang Lab
Original Assignee
Institute of Information Engineering of CAS
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, Zhejiang Lab
Priority to CN202210203916.1A
Publication of CN115331254A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an anchor-free instance-level portrait semantic parsing method. An instance portrait semantic parsing model is trained on a training set; the model comprises a feature extraction module, a person instance detection branch, and a person instance fine-grained perception branch, the latter comprising a detail-preserving module, a human body part context encoding module, a person instance parsing module, and a parsing-result refining module. The invention adopts a one-stage anchor-free detector based on center-point prediction for person instance bounding-box prediction and an edge-guided person instance semantic parsing module for recognizing human semantic parts. The anchor-free person detector inherits the advantages of pixel-level design and effectively avoids the hyperparameter sensitivity caused by candidate bounding-box generation, while the edge-guided parsing module effectively distinguishes the positions of different person instances and adjacent portrait semantic categories.

Description

Anchor-free instance-level portrait semantic parsing method
Technical Field
The invention belongs to the field of instance-level portrait semantic segmentation in computer vision, and in particular relates to an anchor-free instance-level portrait semantic parsing method.
Background
Instance-level portrait semantic parsing is a fundamental problem in the computer vision and multimedia fields, focusing on human-centric pixel-level content analysis in real scenes. Its goal is not only to distinguish effectively the regions of different person instances, but also to parse accurately the human semantic category of every pixel within each instance region.
Current mainstream instance-level portrait semantic parsing models, such as Parsing R-CNN, RP R-CNN, and M-CE2P, all follow the Mask R-CNN framework, in which a two-stage person instance detector proposes anchor-based candidate regions and fine-grained part segmentation is then performed. However, the two-stage anchor-based person instance detector has two drawbacks. First, its performance is sensitive to the predefined hyperparameters of anchor generation, such as the aspect ratio, area, and scale of the anchor boxes. When a new detection task or a new dataset is encountered, these hyperparameters must be carefully fine-tuned or redesigned for good performance. In addition, selecting bounding-box samples with high recall and controlling the imbalance between positive and negative samples during training both incur extra computation and memory cost. Second, two-stage anchor-based detection is a non-pixel-level prediction mode, which is inconsistent with the pixel-level prediction mode of the instance-level portrait parsing task. A fully convolutional, anchor-free, one-stage detector that also avoids hyperparameter sensitivity is therefore a better fit for the instance-level portrait semantic parsing task.
For the instance-level portrait semantic parsing task, a powerful fully convolutional person detector alone is not enough; accurately parsing the pixel-level prediction within each person instance region is the ultimate goal. Given a person instance region, the aim is to segment it at fine granularity and determine the category of every pixel. A person instance region typically contains several human body part categories, so edge information is introduced as a meaningful cue for separating the boundaries between different person instances and the borders between adjacent part-category regions inside a single instance. A simple edge prediction branch runs in parallel with the person instance semantic parsing branch; this design lets the person instance detector provide relatively accurate bounding-box predictions while the parsing module concentrates on fine-grained recognition.
In addition, because the instance-level portrait parsing task must handle both instance-level detection and part-level segmentation, optimizing only a pixel-level cross-entropy loss cannot yield an ideal parsing result. Since instance-level parsing is performed on predicted candidate bounding boxes, some low-quality person instance boxes are inevitable, and the predicted instance regions become inaccurate. Likewise, the part-level portrait semantic predictions inside such candidate boxes are often unsatisfactory.
Disclosure of Invention
Aiming at these technical problems in the prior art, the invention provides an anchor-free instance-level portrait semantic parsing method. The method addresses the hyperparameter sensitivity of current instance-level parsing models, the difficulty of effectively distinguishing the boundaries of adjacent person instances and of adjacent portrait semantic-category regions, and the poor instance parsing results caused by low-confidence predicted boxes.
The technical scheme of the invention is as follows:
An anchor-free instance-level portrait semantic parsing method comprises the following steps:
In the model training stage, a training set is obtained or generated; each person image in the training set is annotated with the spatial position of every person and the semantic category of every pixel. An instance portrait semantic parsing model is trained on this set; it comprises a feature extraction module, a person instance detection branch, and a person instance fine-grained perception branch. The fine-grained perception branch comprises a detail-preserving module, a human body part context encoding module, a person instance parsing module, and a parsing-result refining module; wherein
The feature extraction module extracts features of the person image at several scales, one feature map per scale; the i-th scale corresponds to feature map $P_i$, $i = 1, \dots, n$, where n is the set number of scales;
Each feature map $P_i$ is input in parallel to the person instance detection branch and the person instance fine-grained perception branch; the detection branch predicts, for every pixel of $P_i$, the category of the corresponding candidate box, the center-awareness probability of the corresponding bounding box, and the offsets of the bounding-box position;
For each candidate box classified as a person, the detail-preserving module performs region-of-interest aligned pooling on a feature map of a set scale to obtain the feature of the corresponding person instance bounding box, which is fed to the human body part context encoding module. This module comprises a human body part context pyramid module and a non-local mining module: the pyramid module extracts multi-scale context information of the person instance from the input feature and sends it to the non-local mining module, which mines spatial-position correlations in the multi-scale context and passes the result to the person instance parsing module. The parsing module predicts a pixel-level semantic parsing map and an edge map of the person instance from the input feature;
the example analysis result refining module is used for evaluating the quality of the analysis result according to the pixel-level semantic analysis graph obtained in the character example analysis module;
In the model training phase, each module is optimized with its own loss function. The loss of the person instance detection branch is

$$L_{det} = L_{cls} + L_{reg} + L_{center}$$

where $L_{reg}$ computes the offset-regression loss of the person instance candidate boxes, $L_{cls}$ computes their classification loss, and $L_{center}$ computes the center-awareness probability loss of the corresponding bounding boxes.
The loss of the person instance parsing module is

$$L_{parsing} = \alpha L_{parse} + \beta L_{edge}$$

where $L_{parse}$ computes the loss of the pixel-level semantic parsing map of the person instance and $L_{edge}$ computes the loss of its edge map.
The loss of the refining module is

$$L_{refine} = \theta L_{miou} + \gamma L_{miou\text{-}score}$$

where $L_{miou}$ computes the quality loss of the pixel-level semantic parsing map of the person instance and $L_{miou\text{-}score}$ computes its confidence-probability loss; $\alpha$, $\beta$, $\theta$, $\gamma$ are weight coefficients.
The total loss is the sum of the losses of all modules:

$$L_{total} = L_{det} + L_{parsing} + L_{refine}$$
In the model application stage, for a person image to be parsed, the feature extraction module extracts several feature maps of different sizes from the input image; these features are fed to the person instance detection branch to obtain the detection result, i.e. a rectangular box marking the region of each person instance. The person instance parsing module then applies the pooling operation to feature P3 within each detected rectangular region to obtain the final pixel-level parsing result of every person instance.
To solve the technical problem, the anchor-free instance-level portrait semantic parsing method comprises the following parts:
(1) Person instance detection branch
Given an input person image, the backbone network extracts multi-level features, specifically features P3, P4, P5, P6, and P7 at five different levels, whose sizes are 1/8, 1/16, 1/32, 1/64, and 1/128 of the input image, respectively. The multi-level features P3 to P7 share one person instance detection sub-model. For each spatial position (x, y) in feature map $P_i$ there are three prediction outputs: a four-dimensional bounding-box offset vector $t^*$, a person classification score, and a center-awareness score. The offset vector is written $t^* = (l^*, t^*, r^*, b^*)$, where $l^*$, $t^*$, $r^*$, $b^*$ are the offset distances from the spatial position (x, y) to the left, top, right, and bottom sides of the bounding box.
(2) Edge-guided person instance fine-grained perception branch
The edge-guided person instance fine-grained perception branch comprises four main parts: a detail-preserving module, a human body part context encoding module, a portrait parsing module, and a refining module.
a) Detail-preserving module
The detail-preserving module performs region-of-interest aligned pooling on feature P3, the finest-grained output of the feature pyramid, i.e. the one with the largest spatial size. A detail-preserving mechanism suits the instance parsing task for two reasons. First, training a robust model requires sufficiently informative annotated person instance samples, and small-scale instances carry little appearance information; a coarse resolution (small spatial size) provides limited appearance cues, especially for body part categories with relatively small extent within the person region. If region-of-interest pooling were performed on a coarser feature (such as P7), small part categories such as gloves, left hands, or glasses could be lost to the downsampling. Fine-grained resolution therefore suits the fine-grained portrait parsing task better, supplying more appearance detail and better assisting the segmentation of person instances. Second, inspired by the high-resolution preservation strategy of the semantic segmentation model DeepLabV3+, detail in lower-layer feature maps can help recover fine-grained information, so the finest pyramid feature P3 is chosen for region-of-interest pooling. Intuitively, pooling only on P3 might seem insufficient because high-level semantic information could be missed, but in fact P3 already fuses the high-level semantic features through the top-down pathway.
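As a concrete illustration, the detail-preserving pooling can be written with torchvision's roi_align; this is a minimal sketch, with the 32 × 32 output size taken from step (3) of the detailed description below, while the function name, sampling_ratio, and aligned settings are assumptions.

```python
import torch
from torchvision.ops import roi_align

def detail_preserving_pool(p3, person_boxes, out_size=32):
    """RoIAlign pooling of person boxes on the finest pyramid level P3.

    p3:           (B, 256, H/8, W/8) feature map.
    person_boxes: (K, 5) tensor of (batch_index, x0, y0, x1, y1) in
                  input-image coordinates.
    """
    # spatial_scale maps image coordinates onto the 1/8-resolution P3 map
    return roi_align(p3, person_boxes, output_size=(out_size, out_size),
                     spatial_scale=1.0 / 8, sampling_ratio=2, aligned=True)
```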
b) Human body part context encoding module
The human body part context encoding module operates on the output features of the detail-preserving module. Context information is a very meaningful cue whose effectiveness has been demonstrated in tasks such as semantic segmentation and portrait semantic parsing. The invention mines context in two ways: by scale and by category relations. Scale context addresses the problem of scale inconsistency: human part categories span multiple scales, so capturing global and local information to form a multi-scale context representation greatly helps the subsequent fine-grained parsing step. The feature pyramid is a popular and effective architecture for fusing multi-scale information; here, a portrait part context pyramid module explores the multi-scale context of a person instance, with several parallel aggregation-excitation units using different spatial amplitude ratios to explore different scales (see the sketch after this paragraph). The second way mines category relations, which provide valuable correlations among human part categories and improve the context representation through a correlation adjacency matrix. The context of a spatial position is generally related to a set of other positions, and each spatial position (pixel) corresponds to a real human part category, so modeling the relation between a position and its context positions reflects the relation between different part categories. Inspired by the self-attention mechanism, which can capture long-range spatial dependencies, a non-local operation is finally chosen to mine the spatial dependencies within the input multi-scale context. The two ways of exploring context are combined into the human body part context encoding module, which provides rich context features that help recognize fine-grained human part categories.
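A minimal sketch of the context pyramid described here, with branch counts and feature sizes taken from step (3) of the detailed description below; the activation functions and the exact form of the excitation multiplication are assumptions, and all module names are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_separable(ch, stride):
    # depthwise 3x3 followed by pointwise 1x1 (one separable conv layer)
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=stride, padding=1, groups=ch),
        nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

class PartContextPyramid(nn.Module):
    """Five-branch pyramid over a 32x32x256 RoI feature (sizes from step (3))."""
    def __init__(self, ch=256, out_ch=512):
        super().__init__()
        # branch 1: a 3x3 convolution followed by a 1x1 convolution
        self.local = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.Conv2d(ch, ch, 1))
        # branches 2-5: 2..5 stride-2 separable convs -> 8x8, 4x4, 2x2, 1x1
        self.branches = nn.ModuleList([
            nn.Sequential(*[dw_separable(ch, 2) for _ in range(n)])
            for n in (2, 3, 4, 5)])
        self.excite = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])
        self.fuse = nn.Conv2d(ch * 5, out_ch, 3, padding=1)  # 1280 -> 512

    def forward(self, x):
        outs = [self.local(x)]
        for branch, excite in zip(self.branches, self.excite):
            # upsample back to 32x32, modulate the input element-wise
            g = F.interpolate(branch(x), size=x.shape[-2:], mode="bilinear",
                              align_corners=False)
            outs.append(excite(x * g))
        return self.fuse(torch.cat(outs, dim=1))  # 32x32x1280 -> 32x32x512
```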
c) Person instance parsing module
The person instance parsing module has two parallel outputs: person instance parsing and edge prediction. One motivation for introducing the edge prediction branch is to help distinguish the different human part categories within a single instance region: an instance region usually contains several human semantic categories, and accurately separating adjacent part categories greatly helps to parse all portrait semantic regions correctly. Another motivation is that overlapping person instances often appear within a candidate bounding-box region, so effectively separating the overlapping instances is also critical. Edge information has already proven effective in portrait semantic parsing as a cue for delineating the boundaries of human body parts; the invention borrows this cue and extends its role, both distinguishing different person instances and helping to parse each instance region accurately. The parsing module is applied to the enhanced context feature representation: four consecutive convolutional layers produce its features, which are then fed to two prediction branches, portrait semantic categories and edge prediction.
d) Parsing-result refining module
Due to the predicted bounding boxes, the instance-level portrait parsing result suffers from two main problems: low-quality global bounding boxes and incorrect part-level semantic parsing maps. An inaccurate low-quality box region lowers the intersection-over-union score and hurts the instance-level average precision score. Meanwhile, if the parsing map is inaccurate, some portrait semantic categories obtain low IoU scores, making the global parsing result unsatisfactory. The refining module therefore adopts a complementary mechanism: the global bounding-box path focuses on improving the quality of the predicted box, while the part-level parsing path focuses on improving the quality of each portrait semantic category. Specifically, for the global bounding box, a sub-network estimates the IoU score and thereby the quality of the predicted instance portrait inside the detected candidate box; the sub-network comprises five convolutional layers and finally outputs a one-dimensional IoU score. The part-level parsing result is improved by optimizing a structure-aware solvable proxy loss.
Characteristics and beneficial effects of the invention:
The anchor-free instance-level portrait semantic parsing method runs at 13.3 frames per second (fps) on a machine configured with an Intel i7-3770 CPU @ 3.40 GHz and an 11 GB NVIDIA 2080 Ti GPU.
Drawings
FIG. 1 is the overall flowchart of anchor-free instance-level portrait semantic parsing.
FIG. 2 is a diagram of the human body part context encoding module.
FIG. 3 is a diagram of the portrait part context pyramid module.
FIG. 4 is a schematic diagram of the non-local mining module.
FIG. 5 is a block diagram of the parsing-result refining module.
Detailed Description
Aiming at the two problems of heuristic bounding-box design and non-pixel-level prediction in existing two-stage anchor-based instance-level portrait parsing models, the invention provides an anchor-free instance-level portrait semantic parsing method. The invention and its embodiments are further described below with reference to the drawings.
For the portrait semantic parsing task in real, complex scene environments, the invention provides an anchor-free instance-level portrait semantic parsing method that effectively parses the position of every person in an image and the human semantic category of every pixel belonging to a person. The model is a pixel-level, fully convolutional design. Specifically, a one-stage anchor-free detector based on center-point prediction handles person instance bounding-box prediction, and an edge-guided person instance semantic parsing module handles the recognition of human semantic parts. The anchor-free person detector inherits the advantages of pixel-level design and effectively avoids the hyperparameter sensitivity caused by candidate bounding-box generation; the edge-guided parsing module effectively distinguishes the positions of different person instances and adjacent portrait semantic categories. The overall flow is shown in FIG. 1, with the following specific steps:
(1) Given an input person image, the backbone network first extracts features at multiple scales. Specifically, repeated convolution and downsampling operations produce five features of different spatial sizes, C1, C2, C3, C4, and C5, which are 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image, respectively (C1 and C2 are omitted from FIG. 1). The three features C3, C4, and C5 are then used to build the five pyramid levels P3, P4, P5, P6, and P7 at five different strides, whose sizes are 1/8, 1/16, 1/32, 1/64, and 1/128 of the input image, respectively. As shown in FIG. 1, P3, P4, and P5 are obtained from C3, C4, and C5 through convolutional layers with a 1 × 1 kernel and top-down connections; P6 and P7 are obtained by applying a 3 × 3 convolution with stride 2 to P5 and P6, respectively.
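For illustration, the pyramid construction just described can be sketched in PyTorch as follows. This is a minimal sketch, not the patented implementation: module and parameter names are invented, and the 256-channel width and the ReLU before P7 are assumptions borrowed from common FPN/FCOS practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Builds P3-P7 from backbone features C3-C5 as described above."""
    def __init__(self, c3_ch, c4_ch, c5_ch, out_ch=256):  # out_ch is an assumption
        super().__init__()
        # 1x1 lateral convolutions feeding the top-down pathway
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        # stride-2 3x3 convolutions produce the two coarsest levels
        self.conv6 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
        self.conv7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p6 = self.conv6(p5)          # 1/64 of the input image
        p7 = self.conv7(F.relu(p6))  # 1/128; the ReLU is an assumption
        return p3, p4, p5, p6, p7
```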
(2) For the five feature maps {P3, P4, P5, P6, P7} obtained in step (1), each map $P_i$ passes through two parallel prediction branches to obtain, for every pixel, the category of its candidate box, the offsets of the bounding-box position, and the center-awareness prediction. One branch extracts features with four consecutive convolutional layers and predicts, via a classification convolution, whether the target box at each pixel is a person or background. The other branch likewise extracts features with four consecutive convolutional layers; a centerness convolution predicts the center-awareness probability of the bounding box at each pixel, and a regression convolution predicts the four displacement coordinates of the bounding box. The offset vector $t^* = (l^*, t^*, r^*, b^*)$ consists of the offset distances from the spatial position (m, n) to the left, top, right, and bottom sides of the bounding box. Let the bounding box corresponding to position (m, n) be $B = (x_0, y_0, x_1, y_1)$, where $(x_0, y_0)$ and $(x_1, y_1)$ are the coordinates of its top-left and bottom-right corners; the four-dimensional offset vector $t^*$ can then be expressed as:

$$l^* = m - x_0, \quad t^* = n - y_0,$$
$$r^* = x_1 - m, \quad b^* = y_1 - n.$$
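The offset targets above translate directly into code. A minimal sketch, assuming locations and their ground-truth boxes are already matched; the center-awareness formula is an assumption taken from FCOS-style detectors, since the text names the score but not its exact form.

```python
import torch

def bbox_offset_targets(locations, boxes):
    """Offset targets (l*, t*, r*, b*) for each location, per the formulas above.

    locations: (N, 2) tensor of (m, n) coordinates on the input image.
    boxes:     (N, 4) tensor of assigned boxes (x0, y0, x1, y1).
    """
    m, n = locations[:, 0], locations[:, 1]
    l = m - boxes[:, 0]          # distance to the left side
    t = n - boxes[:, 1]          # distance to the top side
    r = boxes[:, 2] - m          # distance to the right side
    b = boxes[:, 3] - n          # distance to the bottom side
    offsets = torch.stack([l, t, r, b], dim=1)
    # Center-awareness target in the FCOS style (an assumption: the patent
    # specifies a center perception score but not this exact formula).
    lr, tb = offsets[:, [0, 2]], offsets[:, [1, 3]]
    centerness = torch.sqrt((lr.min(1).values / lr.max(1).values) *
                            (tb.min(1).values / tb.max(1).values))
    return offsets, centerness
```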
the penalty function for two parallel predicted branches is written as:
Figure BDA0003530663970000066
wherein the content of the first and second substances,
Figure BDA0003530663970000071
represents the focal loss function, focal loss,
Figure BDA0003530663970000072
the loss of the cross-over ratio is expressed,
Figure BDA0003530663970000073
representing a binary cross entropy loss.
Figure BDA0003530663970000074
And
Figure BDA0003530663970000075
representing the loss of three outputs of classification, offset regression and central perceptibility, respectively.
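A sketch of this composite detection loss under the stated choices (focal loss, IoU loss, binary cross-entropy). Tensor layouts, the positive-sample mask, and the normalization are assumptions; sigmoid_focal_loss comes from torchvision.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_logits, center_logits, reg_pred,
                   cls_tgt, center_tgt, reg_tgt):
    """L_det = focal(cls) + IoU(reg) + BCE(center); reg/center on positives only."""
    pos = cls_tgt > 0
    l_cls = sigmoid_focal_loss(cls_logits, cls_tgt.float(), reduction="mean")
    # IoU loss between predicted and target (l, t, r, b) offsets
    pl, pt, pr, pb = reg_pred[pos].unbind(-1)
    tl, tt, tr, tb = reg_tgt[pos].unbind(-1)
    inter = ((torch.min(pl, tl) + torch.min(pr, tr)) *
             (torch.min(pt, tt) + torch.min(pb, tb)))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    l_reg = -torch.log((inter / union.clamp(min=1e-6)).clamp(min=1e-6)).mean()
    l_center = F.binary_cross_entropy_with_logits(center_logits[pos],
                                                  center_tgt[pos])
    return l_cls + l_reg + l_center
```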
(3) Region-of-interest aligned pooling (RoIAlign) is applied to each person detection box from step (2) (the candidate boxes predicted as person) to obtain the feature of each person instance bounding box. The pooled output size is 32 × 32, so every candidate box yields a 32 × 32 × 256 feature, which the human body part context encoding module (FIG. 2) enhances into a new 32 × 32 × 512 feature. The module has two parts: the human body part context pyramid module (FIG. 3) and the non-local mining module (FIG. 4). As shown in FIG. 3, the pyramid module takes the pooled 32 × 32 × 256 candidate-box feature and produces, through five parallel convolutional branches, features of sizes 32 × 32 × 256, 8 × 8 × 256, 4 × 4 × 256, 2 × 2 × 256, and 1 × 1 × 256. The first branch consists of a 3 × 3 convolution followed by a 1 × 1 convolution. The second to fifth branches encode features with 2, 3, 4, and 5 depthwise-separable 3 × 3 convolutional layers, respectively, upsample the results by 4×, 8×, 16×, and 32×, multiply them element-wise with the input feature, and apply a 1 × 1 convolution. The outputs of the five branches are concatenated into a new 32 × 32 × 1280 candidate-box feature, and a final 3 × 3 convolution gives the module output of size 32 × 32 × 512. This output then serves as the input of the non-local mining module (FIG. 4), where it passes through two branches: the first applies a 1 × 1 convolution to obtain a new 32 × 32 × 512 feature; the second applies a max pooling followed by two 1 × 1 convolutions to obtain two groups of 16 × 16 × 512 features. One of these groups and the output of the first branch are each reshaped and multiplied to compute the similarity between spatial positions, giving a similarity matrix of size (32 × 32) × (16 × 16).
After Softmax normalization, the similarity matrix is multiplied with the other feature group of the second branch and reshaped to obtain a new 32 × 32 × 512 feature; a 1 × 1 convolution and group normalization (GN) follow, and the result is added element-wise to the input feature to give the new 32 × 32 × 512 output of the non-local mining module.
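A sketch of this non-local mining block, assuming the 32 × 32 × 512 input described above; the class name and the number of normalization groups are assumptions.

```python
import torch
import torch.nn as nn

class NonLocalMining(nn.Module):
    """Non-local block over the 32x32x512 context feature (sizes from step (3))."""
    def __init__(self, ch=512, groups=32):
        super().__init__()
        self.query = nn.Conv2d(ch, ch, 1)       # branch 1: 1x1 convolution
        self.pool = nn.MaxPool2d(2)             # branch 2: 32x32 -> 16x16
        self.key = nn.Conv2d(ch, ch, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.out = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.GroupNorm(groups, ch))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C)
        pooled = self.pool(x)
        k = self.key(pooled).flatten(2)                    # (B, C, hw)
        v = self.value(pooled).flatten(2).transpose(1, 2)  # (B, hw, C)
        # (32*32) x (16*16) similarity matrix between spatial positions
        attn = torch.softmax(q @ k, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.out(y)                             # element-wise residual
```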
The output of the non-local mining module is fed to the person instance parsing module, which first applies four consecutive convolutional layers with a group normalization layer to the input feature, and then predicts, through two parallel branches, the pixel-level semantic parsing map of the person instance (size 256 × 256 × C, where C is the number of human body part categories) and its edge map (size 256 × 256 × 2). Both prediction branches are implemented with deconvolution: the parsing map uses a 4 × 4 deconvolution with stride 2, and the edge prediction uses a 2 × 2 deconvolution with stride 2.
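A sketch of the parsing module as just described. Note that the stride-2 deconvolutions alone only double the 32 × 32 RoI resolution, so the extra upsampling to the stated 256 × 256 outputs is an assumption; num_parts stands for the part-category count C and is dataset dependent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersonParsingHead(nn.Module):
    """Parallel parsing / edge predictions over the mined context feature."""
    def __init__(self, ch=512, num_parts=20):  # num_parts = C, an assumption
        super().__init__()
        layers = []
        for _ in range(4):                     # four consecutive conv layers
            layers += [nn.Conv2d(ch, ch, 3, padding=1),
                       nn.GroupNorm(32, ch), nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*layers)
        self.parse = nn.ConvTranspose2d(ch, num_parts, 4, stride=2, padding=1)
        self.edge = nn.ConvTranspose2d(ch, 2, 2, stride=2)

    def forward(self, x):
        x = self.tower(x)
        # deconvolutions give 64x64; bilinear upsampling to the stated
        # 256x256 output resolution is an assumption
        parse = F.interpolate(self.parse(x), size=(256, 256),
                              mode="bilinear", align_corners=False)
        edge = F.interpolate(self.edge(x), size=(256, 256),
                             mode="bilinear", align_corners=False)
        return parse, edge
```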
The loss function of the person instance parsing module can be expressed as

$$L_{parsing} = \alpha L_{parse} + \beta L_{edge}$$

where $L_{parse}$ is the standard cross-entropy loss of the parsing map, $L_{edge}$ is a weighted cross-entropy loss of the edge map, and $\alpha = \beta = 2$.
(4) The prediction results from step (3) are further optimized by the refining module. First, the pixel-level semantic parsing result of each person instance is refined with a mean intersection-over-union loss function. Second, a portrait semantic-parsing quality prediction model estimates how good the parsing result is; its structure is shown in FIG. 5. The parsing result (256 × 256 × C) is first reduced to a 32 × 32 × C feature by max pooling, then concatenated with the 32 × 32 × 512 last-layer feature from step (3) to form a new 32 × 32 × (512 + C) feature. This passes through two 3 × 3 convolutions with 128 kernels each, an average pooling operation, and three 1 × 1 convolutional layers (output sizes 256, 256, and 1), finally yielding the semantic-parsing quality score of the person instance bounding box.
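The quality sub-network of FIG. 5 can be sketched as follows; the max-pooling stride of 8 (256 → 32) and the ReLU placements are assumptions consistent with the stated layer sizes.

```python
import torch
import torch.nn as nn

class ParsingQualityHead(nn.Module):
    """Predicts a scalar IoU quality score per RoI, as described above."""
    def __init__(self, feat_ch=512, num_parts=20):
        super().__init__()
        self.pool = nn.MaxPool2d(8)              # 256x256xC -> 32x32xC
        self.convs = nn.Sequential(
            nn.Conv2d(feat_ch + num_parts, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))             # global average pooling
        self.fc = nn.Sequential(                 # three 1x1 conv layers
            nn.Conv2d(128, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))                # one-dimensional IoU score

    def forward(self, parse_logits, roi_feat):
        # concatenate the pooled parsing map with the 32x32x512 RoI feature
        x = torch.cat([self.pool(parse_logits), roi_feat], dim=1)
        return self.fc(self.convs(x)).flatten(1)  # (N, 1) quality score
```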
The loss function of the refining module is expressed as

$$L_{refine} = \theta L_{miou} + \gamma L_{miou\text{-}score}$$

where $L_{miou}$ is the mean intersection-over-union loss function, $L_{miou\text{-}score}$ is a mean-squared-error function, $\theta = 2$, and $\gamma = 1$.

The overall loss function is formulated as follows:

$$L_{total} = L_{det} + L_{parsing} + L_{refine}$$
in the model training stage, the input image needs to label the spatial position of each person in the image, and the spatial position is represented by the position coordinate of the central point, the length and the width. In addition, pixel-level semantic category information of each person needs to be labeled, namely human semantic categories to which each pixel belongs, such as a face, a left arm, a right arm, a hat and the like. The length of the short side of the output image is randomly selected from the range of [640,800] as the length of the short side, then the length of the short side is adjusted to 800 pixels, the length-width ratio is kept unchanged, and the length of the long side is smaller than or equal to 1333 pixels. The model was trained using a Stochastic Gradient Descent (SGD) optimization method for a total of 75 cycles, the batch size was 8 images, the initial learning rate was set to 0.005, and was reduced to one tenth of the previous learning rate at 50 and 65 cycles, respectively. The weight attenuation and the momentum factor are set to 0.0001 and 0.9, respectively. The initialization of the backbone network adopts model parameters pre-trained on ImageNet classification data. In the test inference stage, the person instance detection branch outputs 50 highest confidence scoring person instance bounding boxes, and then inputs these candidate boxes to the subsequent edge-guided instance portrait resolution branch to predict the result of the portrait semantic resolution.
The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent structures or equivalent process transformations based on this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the claimed scope of the present invention.

Claims (8)

1. An anchor-free instance-level portrait semantic parsing method, characterized by comprising the following steps:
a model training stage: acquiring or generating a training set, wherein each person image in the training set is annotated with the spatial position information of every person and the semantic category information of every pixel; and training an instance portrait semantic parsing model with the training set, the model comprising a feature extraction module, a person instance detection branch, and a person instance fine-grained perception branch, the person instance fine-grained perception branch comprising a detail-preserving module, a human body part context encoding module, a person instance parsing module, and a parsing-result refining module; wherein
the feature extraction module extracts features of the person image at several scales, one feature map per scale, the i-th scale corresponding to feature map $P_i$, $i = 1, \dots, n$, where n is the set number of scales;
each feature map $P_i$ is input in parallel to the person instance detection branch and the person instance fine-grained perception branch; the person instance detection branch predicts, for every pixel of $P_i$, the category of the corresponding candidate box, the center-awareness probability of the corresponding bounding box, and the offsets of the bounding-box position;
for each candidate box whose category is person, the detail-preserving module performs region-of-interest aligned pooling on a feature map of a set scale to obtain the feature of the corresponding person instance bounding box and inputs it to the human body part context encoding module; the human body part context encoding module comprises a human body part context pyramid module and a non-local mining module; the human body part context pyramid module acquires multi-scale context information of the person instance from the input feature and sends it to the non-local mining module; the non-local mining module mines the spatial-position correlations in the input multi-scale context information and inputs the result to the person instance parsing module; the person instance parsing module predicts a pixel-level semantic parsing map and an edge map of the person instance from the input feature; and the parsing-result refining module evaluates the quality of the parsing result from the pixel-level semantic parsing map obtained by the person instance parsing module;
wherein the loss function adopted for training the instance portrait semantic parsing model is

$$L_{total} = L_{det} + L_{parsing} + L_{refine}$$

where $L_{det}$ is the loss function of the person instance detection branch, which includes the offset-regression loss of the person instance candidate boxes; $L_{parsing} = \alpha L_{parse} + \beta L_{edge}$ is the loss function of the person instance parsing module, where $L_{parse}$ computes the loss of the pixel-level semantic parsing map of a person instance and $L_{edge}$ computes the loss of its edge map; $L_{refine} = \theta L_{miou} + \gamma L_{miou\text{-}score}$ is the loss function of the parsing-result refining module, where $L_{miou}$ computes the quality loss of the pixel-level semantic parsing map of a person instance and $L_{miou\text{-}score}$ computes its confidence-probability loss; and $\alpha$, $\beta$, $\theta$, $\gamma$ are weight coefficients;
and a model application stage: for a person image to be parsed, extracting several different-scale features of the image with the feature extraction module and inputting them to the person instance detection branch, thereby obtaining the pixel-level parsing result of each person instance in the image.
2. The method of claim 1, wherein the loss function of the person instance detection branch is

$$L_{det} = L_{cls} + L_{reg} + L_{center}$$

where $L_{cls}$ computes the classification loss of the person instance candidate boxes, $L_{center}$ computes the center-awareness probability loss of the corresponding bounding boxes, and $L_{reg}$ is an intersection-over-union loss.
3. The method of claim 2, wherein $L_{cls}$ is a focal loss function and $L_{center}$ is a binary cross-entropy loss function.
4. The method of claim 1, wherein $L_{miou}$ is a mean intersection-over-union loss function and $L_{miou\text{-}score}$ is a mean-squared-error function.
5. The method of claim 1, wherein the semantic category information comprises face, left arm, right arm, and hat.
6. The method of claim 1, wherein the feature map of the set scale is the feature map corresponding to the largest scale.
7. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method of any one of claims 1 to 6.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202210203916.1A 2022-03-03 2022-03-03 Anchor-free instance-level portrait semantic parsing method Pending CN115331254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210203916.1A 2022-03-03 2022-03-03 Anchor-free instance-level portrait semantic parsing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210203916.1A 2022-03-03 2022-03-03 Anchor-free instance-level portrait semantic parsing method

Publications (1)

Publication Number Publication Date
CN115331254A (en) 2022-11-11

Family

ID=83916311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210203916.1A Pending CN115331254A (en) 2022-03-03 2022-03-03 Anchor-free instance-level portrait semantic parsing method

Country Status (1)

Country Link
CN (1) CN115331254A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036868A (en) * 2023-10-08 2023-11-10 之江实验室 Training method and device of human body perception model, medium and electronic equipment
CN117036868B (en) * 2023-10-08 2024-01-26 之江实验室 Training method and device of human body perception model, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination