CN116863509B - Method for detecting human-shaped outline and recognizing gesture by using improved polar mask - Google Patents

Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Info

Publication number
CN116863509B
Authority
CN
China
Prior art keywords
human
humanoid
polar
model
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311119512.5A
Other languages
Chinese (zh)
Other versions
CN116863509A (en)
Inventor
Wen Tingxi (温廷羲)
Tong Binbin (童斌斌)
Hou Qingfei (侯晴霏)
Chen Yuping (陈雨萍)
Xie Jianhua (谢建华)
Zeng Huanqiang (曾焕强)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Huanyutong Technology Co., Ltd.
Huaqiao University
Original Assignee
Fujian Huanyutong Technology Co., Ltd.
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Huanyutong Technology Co., Ltd. and Huaqiao University
Priority to CN202311119512.5A
Publication of CN116863509A
Application granted
Publication of CN116863509B
Legal status: Active


Classifications

    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/096 — Transfer learning
    • Y02T10/40 — Engine management systems


Abstract

The invention detects the human contour and recognizes posture using an improved PolarMask model. First, polar-coordinate modeling of the human contour is designed around the characteristics of the human silhouette. An improved PolarMask model is then built as the human contour segmentation model: a channel attention module is added, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during feature fusion. Finally, a weak-label training strategy is adopted to train a preliminary segmentation model that can recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, so that the predicted human contour converges steadily while learning real contours, and the human contour and posture class are recognized accurately.

Description

Method for detecting human-shaped outline and recognizing gesture by using improved polar mask
Technical Field
The invention belongs to the field of image recognition, and in particular relates to a method for detecting the human contour and recognizing posture using an improved PolarMask.
Background
With the development of computer and imaging technology, human contour detection and posture recognition methods are receiving more and more attention. These techniques can play a key role in human-computer interaction, motion analysis, health monitoring, virtual reality, augmented reality, security monitoring and other fields, providing more intuitive, efficient and immersive solutions for various application scenarios and promoting the practical application of artificial intelligence. For human behavior recognition and intelligent monitoring in particular, most existing video-based methods rely on object recognition: they roughly identify a person's position and the current state of the body, such as standing or fallen. A target box containing the human figure is first obtained by object recognition and posture recognition, and the posture is then recognized from the coordinates of skeleton joints. Although such methods can identify the position and state of the human body to a certain extent, they cannot accurately recover the body's contour. A human contour recognition method is therefore needed that not only recognizes the human body accurately but also segments its contour precisely, providing more useful information for judging posture, improving the accuracy of posture recognition, and making the technique easier to apply in related fields.
The accuracy of the PolarMask-based instance segmentation method is far lower than that of other instance segmentation methods such as Mask R-CNN. In addition, because PolarMask is a deep learning model, a large dataset is required for training to achieve the expected results; segmentation-type annotation of a picture is very time-consuming and labor-intensive, taking about one minute per picture on average, so training the model requires substantial cost.
Disclosure of Invention
The invention aims to provide a method for human contour detection and posture recognition using an improved PolarMask, so as to achieve contour segmentation and posture recognition of the human body.
In the invention's method for detecting the human contour and recognizing posture with an improved PolarMask, an improved PolarMask model is adopted. Based on the characteristics of the human contour, the number of rays in each region inside the bounding box is allocated according to the ratio of the length to the width of the human bounding box identified by the PolarMask model, yielding the polar-coordinate modeling of the human contour. An improved PolarMask model is then constructed as the human contour segmentation model: before the feature pyramid network performs feature fusion, a channel attention module is added after the features of each scale, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during fusion. Finally, a weak-label dataset is used to pre-train the segmentation model, producing a preliminary model able to recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, so that the predicted human contour converges steadily while learning real contours, and the human contour and posture class are recognized accurately;
the improved polar mask model is constructed as a humanoid contour segmentation model, the original polar mask model is taken as a basis, and a Yolo V7 network structure is taken as a basis, so that a FPN network structure of the original polar mask model is replaced by a Yolo AT_FPN characteristic pyramid network, and a backbone network and a characteristic pyramid structure of the original polar mask model are improved; the humanoid contour segmentation model consists of an encoder and three decoders;
the encoder employs a yolat_fpn feature pyramid network that is modified based on the backbone network of yolat 7 as follows:
(1) The activation function of the original convolution module is replaced: the original SiLU is replaced with the nonlinear activation function GELU, widely used in natural language processing;
(2) A channel attention mechanism module is added before feature fusion is carried out in the feature pyramid;
(3) The multi-scale feature maps extracted by the original YOLOv7 backbone pass through the channel attention module to further extract important detail information; shallow and deep information are then linked by skip connections through 1×1 convolutions, compensating for the detail information lost during feature fusion;
the three decoders refer to three branches which are processed in parallel and are respectively a classification branch, a centrality branch and a polar coordinate mask branch, wherein the classification branch sequentially uses 4×4 Conv and 1×1 Conv to extract features, an H×W×N feature map is generated to predict N gestures, prediction of classification target types is realized, H, W respectively represents the length and the width of the input feature map, and N represents the type of gesture to be predicted; the centrality branch uses 4×4 Conv and 1×1 Conv shared by the classification branches to extract features, and generates a feature map of HxW×1 to predict polar coordinate center points; the polar mask branches are sequentially subjected to feature extraction by using 4×4 Conv and 1×1 Conv, and a feature map of h×w×60 is generated to predict the distances of 60 rays of the polar coordinates.
In the improved PolarMask model, based on the characteristics of the human contour, the number of rays in each region inside the bounding box is allocated according to the ratio of the length to the width of the human bounding box identified by the model. Specifically, the four vertices A, B, C, D of the identified bounding box and the body center O form four regions; the number of rays in each region is allocated by the length-to-width ratio of the box, giving the polar-coordinate modeling of the human contour, with the calculation shown in formula (2):
where O is the body center point; the four vertices A, B, C, D of the bounding box and the center O form four regions AOB, BOC, COD and AOD; Number_AOB is the number of rays in region AOB needed to construct the body contour, and Number_COD, Number_AOD and Number_BOC are the corresponding counts for regions COD, AOD and BOC; N is the total number of rays; y is the height of the bounding box; x is its width.
The channel attention module uses a SENet model, which comprises a squeeze stage and an excitation stage: global spatial information is compressed in the squeeze stage, feature learning is then performed along the channel dimension to form an attention weight for each channel, and the generated weights are finally applied to the corresponding channels in the excitation stage. Specifically:
The squeeze stage runs first, using global pooling to compress the H×W×C input into a 1×1×C output. The excitation stage follows with two fully connected layers: the first has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation; the second has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation, where r is the reduction ratio of the first fully connected layer. The excitation stage learns the feature information of each channel to generate its attention weight, and the final 1×1×C channel attention weights are multiplied onto the corresponding channels of the original feature map.
A human profile detection and gesture recognition device employing an improved polar mask, said device comprising a processor and a memory; the memory is used for storing a computer program; the processor is used for executing any one of the methods for detecting the human-shaped outline and recognizing the gesture by using the improved polar mask according to the computer program.
A computer readable storage medium for storing a computer program for executing any one of the above methods for human profile detection and gesture recognition using the modified polar mask.
A chip for executing instructions for performing any of the above methods for human profile detection and gesture recognition using the modified polar mask.
The improved PolarMask model builds on the original PolarMask while drawing on the YOLOv7 network structure: skip connections and an attention module are added to the backbone, the improved PolarMask is applied to human contour instance segmentation and posture recognition, and box-type weak labels are used to pre-train the model, introducing transfer learning into the contour segmentation model. Compared with the prior art, the invention has the following technical effects:
(1) The human contour is segmented with the improved PolarMask model, and posture recognition is assisted by computing the distances and angles of the 60 rays. The model is pre-trained on an easily annotated box-type weak-label dataset, and the resulting pre-trained weights are used for transfer learning. If COCO-style instance segmentation labels were used for PolarMask pre-training, preparing the dataset would be very time-consuming and costly. The method instead pre-trains on box-type weak labels, whose annotation effort and time are far smaller than those of instance segmentation labels, reducing the required cost. After pre-training on the weak-label dataset, a good result can be reached with only a small number of instance segmentation labels for fine-tuning. Compared with other PolarMask-based methods, this saves cost to a great extent.
(2) The invention redesigns the polar-coordinate modeling, changing PolarMask's originally equally spaced rays into rays distributed non-uniformly according to the characteristics of the human figure: more rays represent complex parts of the contour and fewer rays represent simple parts. This removes the redundancy of uniformly distributed rays in the original model, represents the human contour with fewer rays, reduces model parameters and yields a more accurate contour.
(3) The invention replaces the backbone of the original PolarMask model, helping it extract features more effectively. The new backbone uses the YOLOv7 network structure, improves the activation function of the convolution module and adds skip connections to strengthen the network's extraction of detail information.
(4) The invention introduces the attention module into the human contour segmentation model, applying the channel attention network before feature fusion, so that the model focuses more on important feature information, which improves the segmentation accuracy of the network to a certain extent.
(5) By adopting transfer learning, the pre-trained weights start closer to the optimal convergence point, so subsequent formal training reaches convergence in a shorter time and the network converges to the optimum more easily, improving accuracy and efficiency; the generalization ability of the model is also improved to a certain extent.
Drawings
FIG. 1 is a diagram of a model structure of the present invention;
FIG. 2 is a modified view of a convolution module of the present invention;
FIG. 3 is a polar modeling improvement graph based on humanoid features of the present invention;
fig. 4 is a flowchart of a migration training method based on weak labels according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, it being apparent that the described embodiments are only some, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The first embodiment of the invention is a method for detecting the human contour and recognizing posture with an improved PolarMask. The polar-coordinate modeling of the improved PolarMask model is redesigned by combining the characteristics of the human contour. The backbone of the original PolarMask is then rebuilt on the YOLOv7 backbone: the original convolution module is improved; before the feature pyramid network performs feature fusion, an attention module is added after the features of each scale; and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during fusion. Finally, a weak-label training strategy is adopted: a weak-label dataset pre-trains the human contour segmentation model, the pre-trained weights are used for transfer learning during formal training, and the predicted contour converges steadily while learning real contours, so the human contour and posture class are recognized accurately. The method specifically comprises the following steps:
step 1, constructing an improved polar mask model as a humanoid contour segmentation model
As shown in fig. 1, the original PolarMask model is taken as the base and the YOLOv7 network structure as the reference; the backbone and feature pyramid of the original PolarMask are improved, and its FPN structure is replaced by the designed YOLOAT_FPN feature pyramid network. The human contour segmentation model consists of one encoder and three decoders;
The encoder employs the YOLOAT_FPN feature pyramid network, which modifies the YOLOv7 backbone as follows:
(1) The activation function of the original convolution module is replaced: the original SiLU is replaced with the nonlinear activation function GELU (Gaussian Error Linear Unit) used in natural language processing, as shown in fig. 2. GELU is a smooth, continuously differentiable nonlinear function that adapts better to gradient descent algorithms and converges more easily during training. Its mathematical form is given in formula (1):
GELU(x) = 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))    (1)
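As a quick illustration of formula (1), the tanh approximation of GELU can be computed directly; the comparison against SiLU below is a sketch for intuition only (plain Python, no deep learning framework):

```python
import math

def gelu(x: float) -> float:
    """Tanh approximation of GELU, as in formula (1)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x: float) -> float:
    """The original SiLU activation that GELU replaces: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Both are smooth, tend to 0 for large negative inputs and to x for
# large positive inputs, and differ slightly around zero.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  GELU={gelu(v):+.4f}  SiLU={silu(v):+.4f}")
```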
(2) A channel attention module is added before feature fusion in the feature pyramid, helping the model attend to important information and ignore unimportant information, thereby improving segmentation accuracy;
(3) The feature pyramid structure is improved with skip connections, giving the YOLOAT_FPN feature pyramid network that helps the model extract detail information. Although the original YOLOv7 structure performs multi-scale fusion through a feature pyramid, detail features are inevitably lost as the network deepens during fusion, and the importance of the shallow layers to recognition is ignored; the added skip connections compensate for this loss.
The three decoders are three parallel branches: a classification branch, a centerness branch and a polar mask branch. The classification branch extracts features with 4×4 Conv and 1×1 Conv in turn and generates an H×W×N feature map to predict N postures, realizing classification of the target class, where H and W are the height and width of the input feature map and N is the number of posture classes to predict; the centerness branch shares the 4×4 Conv and 1×1 Conv of the classification branch and generates an H×W×1 feature map to predict the polar-coordinate center point; the polar mask branch extracts features with 4×4 Conv and 1×1 Conv in turn and generates an H×W×60 feature map to predict the distances of the 60 polar rays. The invention also redesigns the arrangement of these 60 rays according to the characteristics of the human figure, making it better suited to human segmentation; the specific method is described in detail in step 2.
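To make the decoder outputs concrete: given a predicted center and the 60 predicted ray distances, the mask is recovered by walking each ray out from the center and connecting the endpoints. The sketch below assumes equally spaced angles by default (the original PolarMask scheme, which step 2 replaces with a non-uniform arrangement) and is illustrative only:

```python
import math

def contour_from_rays(center, distances, angles=None):
    """Recover contour vertices from a polar mask prediction: ray i
    leaves `center` at angle angles[i] and meets the contour at
    distance distances[i]; connecting the endpoints in order gives
    the predicted outline."""
    cx, cy = center
    n = len(distances)
    if angles is None:
        # default: n equally spaced rays, as in the original PolarMask
        angles = [2.0 * math.pi * i / n for i in range(n)]
    return [(cx + d * math.cos(a), cy + d * math.sin(a))
            for d, a in zip(distances, angles)]

# 60 unit-length rays from the origin trace out a unit circle
pts = contour_from_rays((0.0, 0.0), [1.0] * 60)
```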
The attention module uses a SENet (Squeeze-and-Excitation Network) model belonging to channel attention. It comprises a squeeze stage and an excitation stage: global spatial information is compressed in the squeeze stage, feature learning is then performed along the channel dimension to form an attention weight for each channel, and the generated weights are finally applied to the corresponding channels in the excitation stage. Specifically:
The squeeze stage runs first, using global pooling to compress the H×W×C input into a 1×1×C output. The excitation stage follows with two fully connected layers: the first has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation; the second has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation, where r is the reduction ratio of the first fully connected layer, usually set to 16 for good results. The excitation stage learns the feature information of each channel to generate its attention weight, and the final 1×1×C channel attention weights are multiplied onto the corresponding channels of the original feature map.
Step 2: based on the characteristics of the human contour, the four vertices A, B, C, D of the human bounding box identified by the PolarMask model and the body center O form four regions; the number of rays in each region is allocated by the length-to-width ratio of the identified box, giving the polar-coordinate modeling of the human contour
The original PolarMask model emits N rays at equal angular intervals from the center of the object to be segmented and forms the contour by connecting the ray endpoints in turn, as shown in fig. 3 a). This contour modeling suits roughly circular objects. Applied directly to the human contour, it is highly redundant: it adds unnecessary computation and still cannot segment finer contours, because some parts of the human outline can be represented well by a few ray segments while more complex parts need more rays.
As shown in fig. 3 b), the number of rays in each region is allocated according to the ratio of the length to the width of the human bounding box identified by the PolarMask model; the specific calculation is given in formula (2):
where O is the body center point; the four vertices A, B, C, D of the bounding box and the center O form four regions AOB, BOC, COD and AOD; Number_AOB is the number of rays in region AOB needed to construct the body contour, as in fig. 3 b), and Number_COD, Number_AOD and Number_BOC are the corresponding counts for regions COD, AOD and BOC; N is the total number of rays; y is the height of the bounding box; x is its width.
The improved polar-coordinate modeling of the human contour both represents the contour more finely and reduces redundancy, and the non-uniform modeling also reduces model parameters: tests found that 90 uniformly distributed rays were originally needed to represent the human contour accurately, whereas the improved modeling represents it well with only 60 rays.
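Since formula (2) itself is not reproduced in this text, the sketch below uses an assumed proportional split for illustration: the 60 rays are divided between the two pairs of opposite regions according to the height-to-width ratio of the box, so a tall, narrow person box gets more rays in the regions covering its longer sides. Both the split rule and the pairing of regions with the height share are assumptions, not the patented allocation:

```python
def allocate_rays(x: float, y: float, n: int = 60):
    """Split n contour rays among the four regions AOB, BOC, COD, AOD
    formed by the bounding-box corners A, B, C, D and the body center O.

    x: box width, y: box height. The proportional rule below is a
    hypothetical stand-in for formula (2)."""
    per_pair = n // 2                       # two opposite regions share each count
    n_tall = round(per_pair * y / (x + y))  # regions assumed to face the long sides
    n_wide = per_pair - n_tall
    return {"AOB": n_tall, "COD": n_tall, "BOC": n_wide, "AOD": n_wide}

# a typical upright person box, twice as tall as wide
counts = allocate_rays(x=40.0, y=80.0)  # {'AOB': 20, 'COD': 20, 'BOC': 10, 'AOD': 10}
```

For a square box the split degenerates to the uniform case (15 rays per region), matching the intuition that non-uniformity only matters for elongated silhouettes.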
Step 3: the human contour segmentation model is pre-trained with a box-type weak-label dataset, yielding a preliminary model able to recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, and the human contour and posture class are finally recognized accurately;
because the improved polar mask model needs to predict the human-shaped bounding box and the center point in the training process, and the ray numbers of all areas in the image are distributed according to the predicted bounding box and the center point, the model is extremely important for predicting the bounding box and the center point, so that in order to predict the human-shaped outline more quickly and accurately, the invention adopts a transfer learning method based on weak labels, utilizes pre-training weights to learn the relevant information of the human-shaped outline in advance, and is more beneficial to converging the segmentation model to the optimal point, thereby improving the segmentation accuracy of the model. As shown in fig. 4.
Because some pictures contain multiple overlapping figures, directly using true labels would consume a great deal of manpower in label creation and collection. The method therefore uses a Box-type weak-label dataset to pre-train the humanoid contour segmentation model. Because Box labels are mostly weak labels in VOC format, they must be converted into COCO-format labels that the polar mask model can use for preliminary contour segmentation. The specific conversion is as follows: the two points representing an identification target's rectangular box in VOC format, namely the minimum point at the upper-left corner of the box and the maximum point at the lower-right corner, are converted into the four corner points of the "segmentation" polygon in COCO format.
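The conversion described above amounts to expanding the two VOC corner points into a four-point rectangular polygon. A minimal sketch under that reading (function names are illustrative, not from the patent):

```python
def voc_box_to_coco_segmentation(xmin, ymin, xmax, ymax):
    """Convert a VOC-style box (upper-left minimum point, lower-right
    maximum point) into a COCO 'segmentation' polygon: the four corner
    points listed as one flat [x1, y1, ..., x4, y4] ring."""
    return [[xmin, ymin,   # upper-left
             xmax, ymin,   # upper-right
             xmax, ymax,   # lower-right
             xmin, ymax]]  # lower-left

def voc_box_to_coco_bbox(xmin, ymin, xmax, ymax):
    """COCO's 'bbox' field is [x, y, width, height] rather than the
    two-corner VOC encoding."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]
```

A box annotated in VOC as (10, 20)-(110, 220) thus becomes the degenerate rectangular mask [[10, 20, 110, 20, 110, 220, 10, 220]], which PolarMask-style pipelines can ingest like any other COCO polygon.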
According to the method, transfer learning with pre-training weights trained on the Box-type weak-label dataset can effectively improve the segmentation accuracy of the model. Such datasets are plentiful, and a humanoid contour is easier to annotate with a bounding box than with a segmentation-type mask. Obtaining a more accurate rectangular-frame prediction before the true labels are used helps the predicted humanoid contour converge correctly during training, so that the humanoid contour and gesture type are finally identified accurately.
Example II
The second embodiment of the invention provides a device for detecting a humanoid contour and recognizing a gesture using an improved polar mask. The device may be a terminal device or a server, or a terminal device or server that connects with other terminal devices or servers to implement the method of the embodiment of the invention.
The apparatus may include: a processor (e.g., a CPU), a memory, and a data acquisition device; the processor is connected to and controls the data acquisition device. The memory may store various instructions for performing the processing functions and implementing the processing steps described in the method of the previous embodiment.
Example III
The third embodiment of the present invention also provides a computer-readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the processing steps described in the method of the first embodiment.
Example IV
The fourth embodiment of the present invention further provides a chip for executing instructions, where the chip is configured to perform the processing steps described in the method of the foregoing embodiment.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The foregoing description of the embodiments illustrates the general principles of the invention and is not meant to limit the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A method for detecting a humanoid contour and recognizing a gesture using an improved polar mask, characterized by comprising the following steps: adopting an improved polar mask model, allocating the number of rays of each region in the bounding box according to the ratio of the length to the width of the humanoid bounding box identified by the polar mask model, based on humanoid contour characteristics, to design the polar coordinate modeling of the humanoid contour; then constructing the improved polar mask model as a humanoid contour segmentation model, in which, before the feature pyramid network performs feature fusion, a channel attention mechanism module is added after each feature of a different scale, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for the detail information lost in the feature fusion process; finally, pre-training the humanoid contour segmentation model with a weak-label dataset, the pre-training producing a preliminary humanoid contour segmentation model capable of identifying a rectangular frame containing humanoid position information and the human gesture type; in the formal training process, pre-trained weights carrying humanoid-contour-related information are used for transfer learning, so that while the real humanoid contours are learned the predicted humanoid contour converges continuously, and the humanoid contour and gesture type are identified accurately;
the improved polar mask model is constructed as the humanoid contour segmentation model on the basis of the original polar mask model and the YOLOv7 network structure: the FPN structure of the original polar mask model is replaced with the YOLOAT_FPN feature pyramid network, improving the backbone network and the feature pyramid structure of the original polar mask model; the humanoid contour segmentation model consists of an encoder and three decoders;
the encoder employs the YOLOAT_FPN feature pyramid network, which is modified from the backbone network of YOLOv7 as follows:
(1) Replacing the activation function of the original convolution module: the original activation function SiLU is replaced with the nonlinear activation function GELU used in natural language processing;
(2) Adding a channel attention mechanism module before feature fusion is carried out in the feature pyramid; the channel attention mechanism module uses the SENet model, which comprises two stages, compression and excitation: the compression stage compresses the global spatial information, feature learning is then carried out in the channel dimension to form the attention weight of each channel, and finally the attention weights generated in the excitation stage are applied to the corresponding channels, specifically:
first, the compression stage uses global pooling to compress the H×W×C input into a 1×1×C output; then the excitation stage, which comprises two fully connected layers, is performed: the first fully connected layer has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation function; the second fully connected layer has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation function, where r is the compression ratio of the first fully connected layer; the excitation stage generates the attention weight of each channel by learning the feature information of each channel, and the final 1×1×C channel attention weights are multiplied with the corresponding channels of the original feature map;
(3) The multi-scale feature maps extracted by the backbone network of the original YOLOv7 pass through the channel attention mechanism module to further extract important detail information, and shallow information and deep information are skip-connected through 1×1 convolution kernels to compensate for the detail information lost in feature fusion;
the three decoders refer to three branches processed in parallel, namely a classification branch, a centerness branch and a polar coordinate mask branch; the classification branch extracts features using 4×4 Conv and 1×1 Conv in turn and generates an H×W×N feature map to predict N gestures, realizing the prediction of the classification target type, where H and W respectively represent the length and width of the input feature map and N represents the number of gesture types to be predicted; the centerness branch extracts features using the 4×4 Conv and 1×1 Conv shared with the classification branch and generates an H×W×1 feature map to predict the polar coordinate center point; the polar mask branch extracts features using 4×4 Conv and 1×1 Conv in turn and generates an H×W×60 feature map to predict the distances of the 60 polar coordinate rays.
2. The method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to claim 1, characterized in that the polar coordinate modeling of the humanoid contour allocates, with the improved polar mask model, the number of rays of each region in the bounding box according to the ratio of the length to the width of the humanoid bounding box identified by the polar mask model, based on humanoid contour characteristics, specifically: the four vertices A, B, C, D of the humanoid bounding box identified by the polar mask model and the human body center O form four regions, the number of rays of each region is allocated according to the ratio of the length to the width of the identified bounding box to design the polar coordinate modeling of the humanoid contour, and the calculation formula is shown in formula (2):
wherein O is the human center point; the four vertices A, B, C, D of the humanoid bounding box and the human center point O form four regions, namely the AOB, BOC, COD and AOD regions; Number AOB represents the number of rays required to construct the body contour in the AOB region; Number COD represents the number of rays required in the COD region; Number AOD represents the number of rays required in the AOD region; Number BOC represents the number of rays required in the BOC region; N represents the total number of rays; y represents the height of the bounding box; x represents the width of the bounding box.
3. A device for detecting a humanoid contour and recognizing a gesture using an improved polar mask, characterized in that: the device includes a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute, according to the computer program, the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
4. A computer-readable storage medium, characterized in that: the computer-readable storage medium is used for storing a computer program, the computer program being used for executing the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
5. A chip for executing instructions, characterized in that: the chip is used for executing the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
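As an illustration of the SENet compression and excitation stages recited in claim 1, the following is a minimal NumPy sketch of the forward pass; the weight arrays w1, b1, w2, b2 stand in for learned parameters and their names are assumptions, not values from the patent:

```python
import numpy as np

def se_block_forward(x, w1, b1, w2, b2):
    """Squeeze-and-excitation forward pass on one feature map.

    x  : (H, W, C) input feature map
    w1 : (C, C//r) weights of the first FC layer (squeeze to C/r)
    w2 : (C//r, C) weights of the second FC layer (restore to C)
    """
    # Compression: global average pooling, H x W x C -> 1 x 1 x C.
    z = x.mean(axis=(0, 1))                       # shape (C,)
    # Excitation, layer 1: FC(C -> C/r) followed by ReLU.
    h = np.maximum(z @ w1 + b1, 0.0)
    # Excitation, layer 2: FC(C/r -> C) followed by Sigmoid,
    # yielding one attention weight in (0, 1) per channel.
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # shape (C,)
    # Re-weight: multiply each channel of x by its attention weight
    # (broadcasts over the H and W dimensions).
    return x * s
```

The 1×1×C vector s plays the role of the channel attention weights described in the claim; multiplying it back onto the feature map leaves the spatial layout untouched while emphasizing informative channels.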
CN202311119512.5A 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask Active CN116863509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311119512.5A CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311119512.5A CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Publications (2)

Publication Number Publication Date
CN116863509A CN116863509A (en) 2023-10-10
CN116863509B true CN116863509B (en) 2024-02-20

Family

ID=88219371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311119512.5A Active CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Country Status (1)

Country Link
CN (1) CN116863509B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223027A (en) * 2021-04-13 2021-08-06 山东师范大学 Immature persimmon segmentation method and system based on PolarMask
CN116188785A (en) * 2023-05-04 2023-05-30 福建环宇通信息科技股份公司 Polar mask old man contour segmentation method using weak labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220309275A1 (en) * 2021-03-29 2022-09-29 Hewlett-Packard Development Company, L.P. Extraction of segmentation masks for documents within captured image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223027A (en) * 2021-04-13 2021-08-06 山东师范大学 Immature persimmon segmentation method and system based on PolarMask
CN116188785A (en) * 2023-05-04 2023-05-30 福建环宇通信息科技股份公司 Polar mask old man contour segmentation method using weak labels

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cell image instance segmentation based on PolarMask using weak labels; Binbin Tong et al.; Computer Methods and Programs in Biomedicine; Vol. 231; pp. 1-10 *

Also Published As

Publication number Publication date
CN116863509A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
Dvornik et al. On the importance of visual context for data augmentation in scene understanding
Wu et al. Object detection based on RGC mask R‐CNN
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
US9202144B2 (en) Regionlets with shift invariant neural patterns for object detection
CN110738207A (en) character detection method for fusing character area edge information in character image
CN110991444B (en) License plate recognition method and device for complex scene
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN109478239A (en) The method and object detection systems of object in detection image
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN109508675B (en) Pedestrian detection method for complex scene
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN111414916A (en) Method and device for extracting and generating text content in image and readable storage medium
CN112528845A (en) Physical circuit diagram identification method based on deep learning and application thereof
CN106874913A (en) A kind of vegetable detection method
CN112906520A (en) Gesture coding-based action recognition method and device
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN114764941A (en) Expression recognition method and device and electronic equipment
CN116188785A (en) Polar mask old man contour segmentation method using weak labels
CN116863509B (en) Method for detecting human-shaped outline and recognizing gesture by using improved polar mask
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN113610015A (en) Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113158870A (en) Countermeasure type training method, system and medium for 2D multi-person attitude estimation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231212

Address after: 362000 North China Road, Dongcheng, Fengze District, Quanzhou City, Fujian Province, 269

Applicant after: HUAQIAO University

Applicant after: FUJIAN HUANYUTONG TECHNOLOGY CO.,LTD.

Address before: 362000, 7th Floor, Office Building, Haixi Electronic Information Industry Development Base, Keji Road, High tech Industrial Park (formerly Xunmei Industrial Zone), Fengze District, Quanzhou City, Fujian Province

Applicant before: FUJIAN HUANYUTONG TECHNOLOGY CO.,LTD.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wen Tingxi

Inventor after: Tong Binbin

Inventor after: Hou Qingfei

Inventor after: Chen Yuping

Inventor after: Xie Jianhua

Inventor after: Zeng Huanqiang

Inventor before: Wen Tingxi

Inventor before: Tong Binbin

Inventor before: Hou Qingfei

Inventor before: Chen Yuping

Inventor before: Xie Jianhua

Inventor before: Zeng Huanqiang
