CN116168413A - Hand detection method, device, equipment and medium - Google Patents


Info

Publication number
CN116168413A
CN116168413A CN202111390280.8A
Authority
CN
China
Prior art keywords
image
neural network
network model
detected
target
Prior art date
Legal status
Pending
Application number
CN202111390280.8A
Other languages
Chinese (zh)
Inventor
王博
高雪松
陈维强
Current Assignee
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202111390280.8A priority Critical patent/CN116168413A/en
Publication of CN116168413A publication Critical patent/CN116168413A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a hand detection method, device, equipment and medium. In the method, a feature extraction network of a pre-trained neural network model obtains a feature map of the image to be detected at each preset scale from the convolution layer corresponding to that scale; the neural network model then determines each prediction region and its confidence based on the feature maps of the preset scales, and outputs the target prediction region with the maximum confidence as the hand region.

Description

Hand detection method, device, equipment and medium
Technical Field
The present disclosure relates to the field of model training and recognition technologies, and in particular, to a method, an apparatus, a device, and a medium for hand detection.
Background
Computer simulation of human vision has long been an important branch of artificial intelligence, and the need for automatic video analysis keeps increasing with the explosive growth of video. In video analysis, target detection is one of the key technologies: it is widely applied in many civil and military fields, has broad development prospects, and has become one of the most closely watched frontier research directions in computer vision. Target detection is a very challenging task, with many difficulties such as illumination changes, pose and shape changes, background clutter, partial or full occlusion, blurring, and noise. These factors can cause drift or even missed targets during detection.
Existing target detection includes hand detection, face detection and the like, and hand detection is an important component for improving user experience across different technical fields and platforms. For example, hand detection can aid sign language understanding and enable gesture control, and can also allow digital content and information to be overlaid on the physical world in augmented reality (Augmented Reality, AR). Real-time hand detection is a very challenging computer vision task, because hands often occlude themselves or each other (for example, occlusions between fingers or palms, or during a handshake) and lack high-contrast internal patterns.
The existing hand detection method mainly follows the traditional pipeline of region selection, feature extraction, and classifier-based classification. A sliding window traverses the image, and a feature vector is extracted from each window; the label of the feature vector of a window containing a hand region is set to 1, and the label of a non-hand window is set to 0. The feature vector of each window and its corresponding label are input into a classifier model for training; the trained classifier then predicts label information for the feature vector of each window of the image to be detected, and the windows whose predicted label is 1 are determined to be the target positions of the hand region. The extracted feature vectors mainly include scale-invariant feature transform (Scale Invariant Feature Transform, SIFT) feature vectors and histogram of oriented gradients (Histogram of Oriented Gradient, HOG) feature vectors, and the classifier is mainly a support vector machine (Support Vector Machine, SVM), an AdaBoost classifier, or the like.
Because the existing hand detection method needs to traverse the image with a sliding window, hand detection in the prior art has high time complexity and suffers from window redundancy.
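As a rough illustration of this classical pipeline (not the patent's method), a sliding-window detector with a HOG-like descriptor and a stand-in linear classifier can be sketched in NumPy; the window size, stride, and random "classifier" weights are all illustrative assumptions:

```python
import numpy as np

def sliding_windows(image, win=64, stride=32):
    """Yield (x, y, patch) for every window position -- the exhaustive
    traversal that makes the classical approach slow and redundant."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

def hog_like_feature(patch, bins=9):
    """Crude orientation histogram standing in for a real HOG descriptor."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

rng = np.random.default_rng(0)
weights = rng.normal(size=9)                      # stand-in linear classifier
image = rng.integers(0, 255, size=(128, 128))

# Label 1 ("hand") wherever the classifier score is positive.
detections = [(x, y) for x, y, p in sliding_windows(image)
              if weights @ hog_like_feature(p) > 0]
print(len(list(sliding_windows(image))))          # 9 windows even for 128x128
```

Even this tiny 128x128 image yields 9 overlapping windows, each needing its own feature extraction and classification, which is the redundancy the method of this application avoids.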
Disclosure of Invention
The application provides a hand detection method, device, equipment and medium, which are used for solving the problems of high time complexity and window redundancy in hand detection in the prior art.
In a first aspect, the present application provides a hand detection method, the method comprising:
based on a feature extraction network of a neural network model which is trained in advance, acquiring a feature map of each preset scale of an image to be detected according to a convolution layer corresponding to each preset scale;
and the neural network model determines each prediction region and the corresponding confidence coefficient based on the feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient as a hand region.
Further, the neural network model determining each prediction region and the corresponding confidence level based on the feature map of each preset scale includes:
based on the neural network model, carrying out convolution and up-sampling processing on the feature images of each preset scale to obtain each output feature image; and determining each prediction area and the corresponding confidence degree in each output feature map.
Further, before the feature extraction network based on the neural network model which is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further includes:
and performing distance augmentation processing on the image to be detected.
Further, before the feature extraction network based on the neural network model which is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further includes:
according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected is adjusted, and a target image to be detected corresponding to each preset brightness is obtained;
and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
Further, the method further comprises:
and adjusting the size of the target image to be detected to a preset size.
Further, the training process of the neural network model includes:
Acquiring first tag data of a hand region position marked in advance in each sample image containing the hand region;
inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image;
and adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
Further, the processing each sample image based on the original neural network model, and determining each target prediction area corresponding to each sample image includes:
for each sample image, based on a feature extraction network of an original neural network model, acquiring a sample feature map of each preset scale of the sample image according to a convolution layer corresponding to each preset scale; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
In a second aspect, the present application provides a hand detection device, the device comprising:
the extraction module is used for extracting a network based on the characteristics of the neural network model which is trained in advance, and acquiring a characteristic diagram of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale;
the determining module is used for determining each prediction area and the corresponding confidence coefficient based on the feature map of each preset scale by the neural network model, and outputting a target prediction area with the maximum confidence coefficient as a hand area.
Further, the determining module is specifically configured to perform convolution and upsampling processing on the feature map of each preset scale based on the neural network model, so as to obtain each output feature map; and determining each prediction area and the corresponding confidence degree in each output feature map.
Further, the apparatus further comprises:
the processing module is used for performing distance augmentation processing on the image to be detected before the feature map of each preset scale of the image to be detected is acquired according to the convolution layer corresponding to each preset scale based on the feature extraction network of the neural network model which is trained in advance.
Further, the processing module is further configured to adjust, according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected before obtaining the feature map of each preset scale of the image to be detected according to the convolutional layer corresponding to each preset scale, so as to obtain a target image to be detected corresponding to each preset brightness; and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
Further, the processing module is further configured to adjust a size of the target image to be detected to a preset size.
Further, the apparatus further comprises:
the training module is used for acquiring each sample image containing the hand region and first label data of the hand region position marked in advance in each sample image; inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image; and adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
Further, the training module is specifically configured to obtain, for each sample image, a sample feature map of each preset scale of the sample image according to the convolution layer corresponding to each preset scale based on a feature extraction network of an original neural network model; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
In a third aspect, the present application provides an electronic device comprising a processor and a memory for storing program instructions, the processor being adapted to implement the steps of any one of the methods of hand detection described above when executing a computer program stored in the memory.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of any one of the above-described hand detection methods.
The application provides a hand detection method, device, equipment and medium. In the method, a feature extraction network of a pre-trained neural network model obtains a feature map of the image to be detected at each preset scale from the convolution layer corresponding to that scale; the neural network model then determines each prediction region and its confidence based on the feature maps of the preset scales, and outputs the target prediction region with the maximum confidence as the hand region.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic process diagram of a hand detection method provided in the present application;
fig. 2 is a schematic diagram of an image to be detected provided in the present application;
FIG. 3 is a schematic diagram of an image to be detected after a distance augmentation process provided in the present application;
fig. 4 is a target to-be-detected image corresponding to preset brightness provided by the application;
fig. 5 is a target to-be-detected image corresponding to another preset brightness provided in the present application;
fig. 6 is a schematic diagram of a fused target to-be-detected image provided in the present application;
FIG. 7 is a schematic view of the hand area of FIG. 2 provided herein;
FIG. 8 is a schematic view of the hand area of FIG. 6 provided herein;
FIG. 9 is a schematic diagram of the YOLOv3 backbone network Darknet-53 provided herein;
FIG. 10 is an overall block diagram of YOLOv3 provided herein;
FIG. 11 is a schematic diagram of the basic component DBL of YOLOv3 provided herein;
FIG. 12 is a schematic diagram of the basic component Res_Unit of YOLOv3 provided herein;
FIG. 13 is a schematic diagram of the basic component Resblock_body of YOLOv3 provided herein;
fig. 14 is a schematic structural diagram of a hand detection device provided in the present application;
fig. 15 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, wherein it is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In order to reduce the time complexity of hand detection, the application provides a hand detection method, a device, equipment and a medium.
Fig. 1 is a schematic process diagram of a hand detection method provided in the present application, where the process includes the following steps:
S101: based on a feature extraction network of the neural network model which is trained in advance, acquiring a feature map of each preset scale of an image to be detected according to a convolution layer corresponding to each preset scale.
The hand detection method is applied to electronic equipment, wherein the electronic equipment can be a PC (personal computer), a tablet personal computer, an intelligent terminal, a server and the like, and the server can be a local server or a cloud server.
In order to reduce the time complexity of hand detection, the electronic device stores each preset scale used in feature extraction in advance, and also stores a neural network model that has been trained in advance to identify the hand region of a human body in an image. The neural network model can be a YOLO model or a convolutional neural network model.
Based on a feature extraction network of the pre-trained neural network model and each pre-stored preset scale, a convolution layer is corresponding to each preset scale in the feature extraction network, and a feature map of each preset scale of the image to be detected is extracted according to the convolution layer corresponding to each preset scale.
S102: and the neural network model determines each prediction region and the corresponding confidence coefficient based on the feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient as a hand region.
In order to determine a hand region in an image to be detected, in the method, a neural network model is based on a feature map of each preset scale, and the feature map of each preset scale is processed to obtain each prediction region and corresponding confidence.
And determining a target prediction area with the maximum confidence according to each prediction area and the corresponding confidence, and determining the target prediction area as a hand area for output.
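Selecting the hand region from the candidate prediction regions is simply an argmax over confidences; a minimal sketch with made-up boxes and scores:

```python
import numpy as np

# Candidate prediction regions as (x1, y1, x2, y2) with one confidence
# each; the coordinates and scores below are illustrative only.
regions = np.array([[10, 10, 60, 60], [12, 8, 64, 58], [200, 30, 240, 70]])
confidences = np.array([0.62, 0.91, 0.15])

# The target prediction region with maximum confidence is output as the hand.
hand_region = regions[np.argmax(confidences)]
print(hand_region)  # [12  8 64 58]
```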
In order to determine each prediction area and the corresponding confidence coefficient, in the present application, the determining, by using the neural network model based on the feature map of each preset scale, each prediction area and the corresponding confidence coefficient includes:
based on the neural network model, carrying out convolution and up-sampling processing on the feature images of each preset scale to obtain each output feature image; and determining each prediction area and the corresponding confidence degree in each output feature map.
In order to determine each prediction region and the corresponding confidence coefficient, in the present application, based on a trained neural network model, convolution and upsampling processing are performed on each feature map with a preset scale, so as to obtain each processed output feature map, decoding processing is performed on each output feature map, and each prediction region and the corresponding confidence coefficient in each output feature map are determined.
For example, suppose there are three feature maps with preset scales in the present application. Convolution processing is performed on the first feature map output by the 79th convolution layer, obtaining a first output feature map and a fourth feature map output midway through the convolution processing. The fourth feature map is convolved and upsampled, then stitched with the second feature map output by the 61st convolution layer to obtain a first stitched feature map; convolution processing on the first stitched feature map yields a second output feature map and a fifth feature map output midway through the processing. The fifth feature map is convolved and upsampled, then stitched with the third feature map output by the 36th convolution layer to obtain a second stitched feature map, and convolution processing on the second stitched feature map yields a third output feature map. The first, second and third output feature maps are decoded to determine each prediction region and the corresponding confidence in each output feature map.
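The upsample-and-stitch pattern described above can be sketched shape-wise in NumPy. The channel counts and the channel slice standing in for a 1x1 convolution are illustrative assumptions; only the 2x upsampling and concatenation structure follows the text:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical feature maps at the three preset scales (channel counts
# and grid sizes are assumptions, in the spirit of Darknet-53 outputs).
f1 = np.zeros((512, 13, 13))   # deepest map (e.g. from the 79th layer)
f2 = np.zeros((256, 26, 26))   # middle map (e.g. from the 61st layer)
f3 = np.zeros((128, 52, 52))   # shallowest map (e.g. from the 36th layer)

branch = upsample2x(f1[:128])                 # "convolution" stub + upsample
cat1 = np.concatenate([branch, f2], axis=0)   # first stitched feature map
branch2 = upsample2x(cat1[:64])
cat2 = np.concatenate([branch2, f3], axis=0)  # second stitched feature map

print(cat1.shape, cat2.shape)  # (384, 26, 26) (192, 52, 52)
```

The key point is that each stitch only works because upsampling doubles the spatial grid so it matches the shallower map before concatenation along the channel axis.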
In order to improve accuracy of hand recognition in a complex scene, based on the foregoing embodiments, in the present application, the method further includes, before the feature extraction network based on the neural network model that is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale:
And performing distance augmentation processing on the image to be detected.
In order to improve accuracy of hand recognition in a complex scene, in the application, distance augmentation processing is further performed on the image to be detected. Specifically, according to the first center point and the first size of the image to be detected, a target image is determined whose center point is the first center point and whose size is a preset multiple of the first size; the target image is the image to be detected after the distance augmentation processing.
For example, when the image to be detected is an image shot at a distance of 4 meters, taking the center point of the image as the origin and keeping half of the image's extent upward, downward, leftward and rightward yields an image half the size of the original, which is equivalent to an image shot at a viewing distance of 2 meters, thereby realizing the distance augmentation processing of the image to be detected.
Fig. 2 is a schematic diagram of an image to be detected provided by the present application, fig. 3 is a schematic diagram of an image to be detected after distance augmentation provided by the present application, a center point of fig. 3 is the same as that of fig. 2, and a size of fig. 3 is half of a size of fig. 2.
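A minimal sketch of this distance augmentation, assuming the image is a NumPy array; the 4 m to 2 m example above corresponds to `factor=0.5`:

```python
import numpy as np

def distance_augment(image, factor=0.5):
    """Center-crop the image to `factor` of its original size, simulating
    an image taken at a proportionally shorter viewing distance."""
    h, w = image.shape[:2]
    ch, cw = int(h * factor), int(w * factor)
    top, left = (h - ch) // 2, (w - cw) // 2
    return image[top:top + ch, left:left + cw]

img = np.zeros((480, 640, 3), dtype=np.uint8)
print(distance_augment(img).shape)  # (240, 320, 3)
```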
In order to improve accuracy of hand recognition in a complex scene, based on the foregoing embodiments, in the present application, the method further includes, before the feature extraction network based on the neural network model that is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale:
According to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected is adjusted, and a target image to be detected corresponding to each preset brightness is obtained;
and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
In order to improve accuracy of hand recognition in a complex scene, in the application, the electronic device further performs image enhancement processing on an image to be detected. The electronic equipment pre-stores an image enhancement function and each preset brightness, adjusts the brightness of the image to be detected based on the image enhancement function, updates the brightness of the image to be detected to each preset brightness, and obtains a target image to be detected corresponding to each preset brightness. Wherein the image enhancement function is an imadjust function.
For example, when the preset brightness is 0.8 times the brightness of the image to be detected, the result is shown in fig. 4, which is a target image to be detected corresponding to one preset brightness provided in the present application; when the preset brightness is 1.2 times the brightness of the image to be detected, the result is shown in fig. 5, which is a target image to be detected corresponding to another preset brightness provided in the application.
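The brightness adjustment can be approximated as follows. The text names MATLAB's imadjust; the gain-and-clip below is a simplified stand-in for illustration, not the patent's exact enhancement function:

```python
import numpy as np

def adjust_brightness(image, gain):
    """Scale 8-bit pixel intensities by `gain` (0.8 or 1.2 in the examples
    above), clipping to the valid range. Simplified imadjust stand-in."""
    out = image.astype(np.float64) * gain
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.full((2, 2), 200, dtype=np.uint8)
print(adjust_brightness(img, 1.2)[0, 0])  # 240
print(adjust_brightness(img, 0.8)[0, 0])  # 160
```

Applying each preset gain to the same input yields one target image to be detected per preset brightness, as described above.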
And carrying out human body identification on the target to-be-detected image according to the determined target to-be-detected image corresponding to each preset brightness, and identifying a human body area image in the target to-be-detected image. Specifically, the process of performing human body recognition on the target image to be detected based on the human body recognition algorithm belongs to the prior art, and the description of the process is omitted.
And fusing the target to-be-detected image corresponding to each identified preset brightness with each background image according to each pre-stored background image, and determining the fused target to-be-detected image. For example, when the preset brightness is the brightness of the image to be detected, the image to be detected is fused with the complex background image, as shown in fig. 6, and fig. 6 is a schematic diagram of the fused target image to be detected provided in the application.
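The human-region/background fusion can be sketched as mask-based compositing, assuming the human-recognition step yields a binary mask (the mask and images below are synthetic stand-ins):

```python
import numpy as np

def fuse_with_background(image, mask, background):
    """Composite the human region (mask == 1) onto a new background.
    `mask` is assumed to come from a separate human-recognition step."""
    mask3 = mask[..., None].astype(bool)   # broadcast over the 3 channels
    return np.where(mask3, image, background)

img = np.full((4, 4, 3), 50, dtype=np.uint8)    # "person" pixels = 50
bg = np.full((4, 4, 3), 200, dtype=np.uint8)    # background pixels = 200
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1                              # central human region
fused = fuse_with_background(img, mask, bg)
print(fused[0, 0, 0], fused[1, 1, 0])  # 200 50
```

Repeating this with each pre-stored background image produces one fused target image to be detected per background.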
The image to be detected in fig. 2 is input into a neural network model trained in advance, and the hand region in fig. 2 is determined, as shown in fig. 7, and fig. 7 is a schematic diagram of the hand region in fig. 2 provided by the application. The hand region in fig. 6 is determined by inputting fig. 6 as an image to be detected into a neural network model trained in advance, as shown in fig. 8, and fig. 8 is a schematic diagram of the hand region in fig. 6 provided by the application.
In order to improve the efficiency of hand detection, in the present application, the method further includes:
and adjusting the size of the target image to be detected to a preset size.
If the sizes of the background images are different, the sizes of the target to-be-detected images after the fusion of the human body region image and the background images are different, and in order to improve the efficiency of hand detection, in the application, the electronic equipment further stores preset sizes, and adjusts the sizes of the target to-be-detected images according to a pre-stored size adjustment function to obtain the target to-be-detected images with the adjusted preset sizes.
In order to obtain a trained neural network model, on the basis of the above embodiments, in the present application, the training process of the neural network model includes:
acquiring first tag data of a hand region position marked in advance in each sample image containing the hand region;
inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image;
And adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
In order to realize training of the deep learning model, a sample set for training is stored in the application, a sample image in the sample set comprises a hand region image of a human body, first tag data of the sample image in the sample set is manually marked in advance, and the first tag data is used for marking a set range region of the hand region contained in the sample image.
In the application, after any sample image in a sample set and first tag data of the sample image are acquired, the sample image is input into an original deep learning model, and the original deep learning model outputs target tag data of the sample image. The target tag data identifies a set range area containing a hand area in the sample image identified by the original deep learning model.
After determining the target tag data of the sample image according to the original deep learning model, training the original deep learning model according to the target tag data and the first tag data of the sample image so as to adjust the parameter values of various parameters of the original deep learning model.
The above operation is performed for each sample image contained in the sample set used for training the original deep learning model, and the trained deep learning model is obtained when a preset condition is met. The preset condition may be that the number of sample images in the sample set for which the target label data obtained after training is consistent with the first label data exceeds a set number; or that the number of iterations of training the original deep learning model reaches a set maximum number of iterations. The present application is not limited in this respect.
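The training loop described above (predict, compare with the labelled first label data, adjust parameter values, stop at a maximum iteration count) can be illustrated with a toy linear model in NumPy; the model, loss, and data are synthetic stand-ins, not the patent's network:

```python
import numpy as np

# Synthetic setup: a linear "model" predicts labels from features, and
# its parameters are nudged toward the pre-labelled data.
rng = np.random.default_rng(1)
true_w = np.array([1.0, 2.0, 3.0, 4.0])
features = rng.normal(size=(32, 4))      # one feature row per sample image
labels = features @ true_w               # the labelled "first label data"

params = rng.normal(size=4)              # model parameters to adjust
lr, max_iters = 0.1, 2000                # stopping on max iterations is one
for _ in range(max_iters):               # of the preset conditions above
    preds = features @ params            # predictions of this pass
    grad = features.T @ (preds - labels) / len(labels)
    params -= lr * grad                  # adjust the parameter values

print(np.round(params, 2))               # converges close to true_w
```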
In order to determine each target prediction area corresponding to each sample image, in this application, the processing, based on the original neural network model, each sample image includes:
for each sample image, based on a feature extraction network of an original neural network model, acquiring a sample feature map of each preset scale of the sample image according to a convolution layer corresponding to each preset scale; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
In order to determine each target prediction area of each sample image, in the application, for each sample image, the sample image is input into a feature extraction network of an original neural network model, and a sample feature map of each prediction scale of the sample image is obtained in each convolution layer according to the convolution layer corresponding to each preset scale.
The original neural network model processes the sample feature map based on the sample feature map of each preset scale, wherein convolution processing and up-sampling processing are carried out on the sample feature map, and each processed output sample feature map, each prediction area in each output sample feature map and the corresponding confidence level are determined. And determining a target prediction area with the maximum confidence coefficient according to each prediction area and the corresponding confidence coefficient, and outputting the target prediction area.
As a possible implementation manner, when training the original deep learning model, the sample images in the sample set may be divided into training sample images and test sample images, the original deep learning model is trained based on the training sample images, and then the reliability of the trained deep learning model is tested based on the test sample images.
The training strategy for the original deep learning model in the application is as follows. Each prediction region is assigned to one of three cases: positive, negative, or ignore. For any ground truth box, the IOU with all detection boxes is calculated, and the prediction region with the maximum IOU is the positive example; a prediction region can be assigned to only one ground truth. For example, if the first ground truth has already matched a positive detection box, the next ground truth finds, among the remaining detection boxes, the one with the largest IOU as its positive example; the order in which the ground truths are processed is immaterial. Positive examples generate a confidence loss, a detection-box loss, and a category loss: the regression target of the prediction region is the corresponding ground truth box; the label of the corresponding category is 1 and the rest are 0; the confidence label is 1. Prediction regions that are not positive but whose IOU with any ground truth exceeds the threshold are ignored samples; ignored samples do not generate any loss. The remaining prediction regions, whose IOU with all ground truths is below the threshold, are negative examples (with the exception that the detection box with the largest IOU for a ground truth remains a positive example even if that IOU is below the threshold). Negative examples only generate a confidence loss, with a confidence label of 0.
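The positive/ignore/negative assignment described above can be sketched with a toy IOU matcher; the box coordinates, threshold, and helper names are illustrative only:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def assign(preds, gts, thresh=0.5):
    """Per the strategy above: the best-IOU box per ground truth is
    positive, other boxes over the threshold are ignored, the rest
    are negatives. Each box matches at most one ground truth."""
    labels = ["negative"] * len(preds)
    taken = set()
    for gt in gts:
        ious = [iou(p, gt) for p in preds]
        best = int(np.argmax([v if i not in taken else -1.0
                              for i, v in enumerate(ious)]))
        labels[best] = "positive"
        taken.add(best)
        for i, v in enumerate(ious):
            if i != best and v > thresh and labels[i] != "positive":
                labels[i] = "ignore"
    return labels

preds = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10)]
print(assign(preds, gts))  # ['positive', 'ignore', 'negative']
```

The second box overlaps the ground truth with IOU ≈ 0.68, above the threshold but not the best match, so it is ignored and contributes no loss.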
When the deep learning model is the target detection model YOLOv3, the backbone network of YOLOv3 is the deep learning framework Darknet-53, composed of 52 convolution layers and a final fully connected layer. The YOLOv3 network structure adopts 1x1 and 3x3 convolution kernels, which reduces the number of parameters and the amount of computation in model inference. YOLOv3 improves on YOLOv2 by adding 5 residual blocks (Residual) to the backbone network structure, that is, it uses the identity-mapping principle of the residual network ResNet so that the deep network performs no worse than a shallow network, avoiding the gradient problems caused by an excessively deep network.
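The identity mapping mentioned above can be reduced to y = x + F(x): the input is added back onto the transformed output. A minimal numeric sketch, where F is an arbitrary stand-in transform rather than the patent's convolution layers:

```python
def residual_unit(x, f):
    """Identity mapping y = x + F(x): the input is added to F's output.

    x is a list of numbers; f is any transform of the same shape.
    """
    return [xi + fi for xi, fi in zip(x, f(x))]

# F is an arbitrary stand-in transform for illustration.
double = lambda v: [2 * xi for xi in v]
out = residual_unit([1.0, 2.0], double)
# Even if F collapsed to zero, the input would pass through unchanged,
# which is what keeps gradients flowing in very deep networks.
passthrough = residual_unit([5.0], lambda v: [0.0 for _ in v])
```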
The YOLOv3 reasoning process adopts cross-scale prediction (Predictions Across Scales), drawing on the principle of the feature pyramid network (feature pyramid networks, FPN): targets of different sizes are detected at multiple scales, and finer grids can detect smaller objects. YOLOv3 provides 3 bounding boxes of different sizes, that is, 3 prediction regions of different sizes are obtained for each target to be predicted, and probability calculation is performed on the prediction regions to screen out the best matching result. The system uses this idea to extract features of different sizes to form a pyramid network. The last convolutional layer of the YOLOv3 network predicts a three-dimensional tensor encoding the prediction region, objectness and class. The tensor obtained on the COCO dataset is N x N x [3 x (4 + 1 + 80)]: 4 bounding-box offsets, 1 objectness prediction, and 80 class predictions.
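The per-cell channel count can be checked with a short sketch (the function name is illustrative). Note that the 75 output channels mentioned elsewhere in this application correspond to 3 x (4 + 1 + 20), i.e. 20 classes:

```python
def yolo_output_shape(grid, anchors_per_scale=3, num_classes=80):
    """Channels per cell: anchors * (4 box offsets + 1 objectness + classes)."""
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return (grid, grid, channels)

# COCO (80 classes): each scale predicts 3 * (4 + 1 + 80) = 255 channels.
coco_shape = yolo_output_shape(13)
# A 20-class setup gives the 75 channels seen in the yolo-layer figures.
voc_shape = yolo_output_shape(13, num_classes=20)
```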
Fig. 9 is a schematic diagram of the YOLOv3 backbone network Darknet-53 provided in the present application. As shown in fig. 9, Convolutional denotes a convolution layer, Residual denotes a residual block, Type denotes the network layer type, Filters denotes the number of convolution kernels in the convolution layer, Size denotes the kernel size, Output denotes the output size, Avgpool denotes an average pooling layer, Connected denotes a fully connected layer, and Softmax is the function that normalizes the outputs.
Fig. 10 is an overall structure diagram of YOLOv3 provided in the present application. As shown in fig. 10, the first row is the smallest-scale yolo layer: 13 x 13 feature maps with 1024 channels are input; through a series of convolution operations the feature map size is unchanged but the number of channels is reduced to 75; 13 x 13 feature maps with 75 channels are finally output, on which classification and position regression are performed.
The second row is the medium-scale yolo layer: the 13 x 13, 512-channel feature maps of layer 79 are convolved to generate 13 x 13, 256-channel feature maps, then upsampled to generate 26 x 26, 256-channel feature maps, which are concatenated with the medium-scale 26 x 26, 512-channel feature maps. After a series of convolution operations the feature map size is unchanged but the number of channels is reduced to 75; 26 x 26 feature maps with 75 channels are finally output, on which classification and position regression are performed.
The third row is the large-scale yolo layer: the 26 x 26, 512-channel feature maps of layer 91 are convolved to generate 26 x 26, 128-channel feature maps, then upsampled to generate 52 x 52, 128-channel feature maps, which are concatenated with the 52 x 52, 256-channel feature maps of layer 36. After a series of convolution operations the feature map size is unchanged but the number of channels is reduced to 75; 52 x 52 feature maps with 75 channels are finally output, on which classification and position regression are performed.
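The shape bookkeeping in the three rows above — upsampling doubles the spatial size and keeps the channels, concatenation requires matching spatial sizes and adds the channels — can be traced with pure shape arithmetic (no actual network weights; function names are illustrative):

```python
def upsample(shape):
    """Nearest-neighbour upsampling: doubles spatial size, keeps channels."""
    h, w, c = shape
    return (2 * h, 2 * w, c)

def concat(a, b):
    """Channel-wise concatenation; spatial dimensions must match."""
    assert a[:2] == b[:2], "cannot concatenate feature maps of different sizes"
    return (a[0], a[1], a[2] + b[2])

# Medium-scale branch: 13x13x256 route, upsampled, merged with 26x26x512.
medium = concat(upsample((13, 13, 256)), (26, 26, 512))
# Large-scale branch: 26x26x128 route, upsampled, merged with 52x52x256.
large = concat(upsample((26, 26, 128)), (52, 52, 256))
```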
Fig. 11 is a schematic diagram of the basic component DBL of YOLOv3 provided in the present application. As shown in fig. 11, the DBL consists of a convolution layer, batch normalization (BN), and Leaky ReLU; for YOLOv3, BN and Leaky ReLU are inseparable parts of the convolution layer, and together they form the minimum component DBL.
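The Leaky ReLU activation at the end of the DBL block passes positive inputs through and scales negative inputs by a small slope. A minimal sketch; the 0.1 slope is the conventional Darknet default, used here as an assumption rather than a value stated in the patent:

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for x >= 0, small linear slope for x < 0."""
    return x if x >= 0 else negative_slope * x

positive = leaky_relu(2.0)    # unchanged
negative = leaky_relu(-2.0)   # scaled by the negative slope
```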
Fig. 12 is a schematic diagram of a basic component res_unit of YOLOv3 provided in the present application, and as shown in fig. 12, res_unit includes two basic components DBL and an add layer.
Fig. 13 is a schematic diagram of the basic component resblock_body of YOLOv3 provided in the present application. As shown in fig. 13, the resblock_body includes the basic component DBL, zero padding, and res_unit.
Fig. 14 is a schematic structural diagram of a hand detection device provided in the present application, as shown in fig. 14, the device includes:
An extraction module 1401, configured to obtain a feature map of each preset scale of an image to be detected according to a convolutional layer corresponding to each preset scale based on a feature extraction network of a neural network model that is trained in advance;
the determining module 1402 is configured to determine each prediction area and a corresponding confidence level based on the feature map of each preset scale, and output a target prediction area with the largest confidence level as a hand area.
Further, the determining module is specifically configured to perform convolution and upsampling processing on the feature map of each preset scale based on the neural network model, so as to obtain each output feature map; and determining each prediction area and the corresponding confidence degree in each output feature map.
Further, the apparatus further comprises:
the processing module is used for performing data augmentation processing on the image to be detected before the feature map of each preset scale of the image to be detected is acquired according to the convolution layer corresponding to each preset scale based on the feature extraction network of the neural network model which is trained in advance.
Further, the processing module is further configured to adjust, according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected before obtaining the feature map of each preset scale of the image to be detected according to the convolutional layer corresponding to each preset scale, so as to obtain a target image to be detected corresponding to each preset brightness; and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
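The brightness-adjustment step can be sketched as a simple intensity scaling. The patent does not specify its image enhancement function, so the scaling below is an illustrative stand-in: one augmented copy of the image is produced per preset brightness level.

```python
def adjust_brightness(pixels, factor):
    """Scale pixel intensities by a factor, clamping to the 0-255 range.

    `pixels` is a 2-D list of grayscale values; `factor` plays the role
    of one preset brightness level. This is a stand-in for the patent's
    unspecified image enhancement function.
    """
    return [[min(255, max(0, round(p * factor))) for p in row]
            for row in pixels]

image = [[100, 200], [0, 255]]
# One target image to be detected per preset brightness level.
augmented = {f: adjust_brightness(image, f) for f in (0.5, 1.0, 1.5)}
```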
Further, the processing module is further configured to adjust a size of the target image to be detected to a preset size.
Further, the apparatus further comprises:
the training module is used for acquiring each sample image containing the hand region and first label data of the hand region position marked in advance in each sample image; inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image; and adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
Further, the training module is specifically configured to obtain, for each sample image, a sample feature map of each preset scale of the sample image according to the convolution layer corresponding to each preset scale based on a feature extraction network of an original neural network model; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
Fig. 15 is a schematic structural diagram of an electronic device provided in the present application, and on the basis of the foregoing embodiments, the present application further provides an electronic device, as shown in fig. 15, including: the device comprises a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, wherein the processor 1501, the communication interface 1502 and the memory 1503 are in communication with each other through the communication bus 1504.
The memory 1503 stores a computer program that, when executed by the processor 1501, causes the processor 1501 to perform the steps of:
based on a feature extraction network of a neural network model which is trained in advance, acquiring a feature map of each preset scale of an image to be detected according to a convolution layer corresponding to each preset scale;
and the neural network model determines each prediction region and the corresponding confidence coefficient based on the feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient as a hand region.
Further, the processor 1501 is specifically configured to determine, based on the feature map of each preset scale, each prediction region and the corresponding confidence level by using the neural network model, where the determining includes:
based on the neural network model, carrying out convolution and up-sampling processing on the feature images of each preset scale to obtain each output feature image; and determining each prediction area and the corresponding confidence degree in each output feature map.
Further, the processor 1501 is further configured to, in the feature extraction network based on the neural network model that is trained in advance, before obtaining the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, further include:
and performing distance augmentation processing on the image to be detected.
Further, the processor 1501 is further configured to, in the feature extraction network based on the neural network model that is trained in advance, before obtaining the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, further include:
according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected is adjusted, and a target image to be detected corresponding to each preset brightness is obtained;
and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
Further, the processor 1501 is further configured to adjust the size of the target image to be detected to a preset size.
Further, the training process of the processor 1501 specifically for the neural network model includes:
acquiring first tag data of a hand region position marked in advance in each sample image containing the hand region;
inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image;
and adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
Further, the processor 1501 is specifically configured to process each sample image based on the original neural network model, and determining each target prediction area corresponding to each sample image includes:
for each sample image, based on a feature extraction network of an original neural network model, acquiring a sample feature map of each preset scale of the sample image according to a convolution layer corresponding to each preset scale; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
The communication bus mentioned above for the electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figures, but this does not mean that there is only one bus or one type of bus.
The communication interface 1502 is used for communication between the above-described electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or may include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
On the basis of the above embodiments, the present application further provides a computer readable storage medium having stored therein a computer program executable by a processor, which when run on the processor, causes the processor to perform the steps of:
based on a feature extraction network of a neural network model which is trained in advance, acquiring a feature map of each preset scale of an image to be detected according to a convolution layer corresponding to each preset scale;
and the neural network model determines each prediction region and the corresponding confidence coefficient based on the feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient as a hand region.
Further, the neural network model determining each prediction region and the corresponding confidence level based on the feature map of each preset scale includes:
based on the neural network model, carrying out convolution and up-sampling processing on the feature images of each preset scale to obtain each output feature image; and determining each prediction area and the corresponding confidence degree in each output feature map.
Further, before the feature extraction network based on the neural network model which is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further includes:
and performing data augmentation processing on the image to be detected.
Further, before the feature extraction network based on the neural network model which is trained in advance obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further includes:
according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected is adjusted, and a target image to be detected corresponding to each preset brightness is obtained;
and identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
Further, the method further comprises:
and adjusting the size of the target image to be detected to a preset size.
Further, the training process of the neural network model includes:
acquiring first tag data of a hand region position marked in advance in each sample image containing the hand region;
inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image;
And adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
Further, the processing each sample image based on the original neural network model, and determining each target prediction area corresponding to each sample image includes:
for each sample image, based on a feature extraction network of an original neural network model, acquiring a sample feature map of each preset scale of the sample image according to a convolution layer corresponding to each preset scale; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A method of hand detection, the method comprising:
based on a feature extraction network of a neural network model which is trained in advance, acquiring a feature map of each preset scale of an image to be detected according to a convolution layer corresponding to each preset scale;
and the neural network model determines each prediction region and the corresponding confidence coefficient based on the feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient as a hand region.
2. The method of claim 1, wherein the neural network model determining each prediction region and the corresponding confidence level based on the feature map of each preset scale comprises:
based on the neural network model, carrying out convolution and up-sampling processing on the feature images of each preset scale to obtain each output feature image; and determining each prediction area and the corresponding confidence degree in each output feature map.
3. The method according to claim 1, wherein before the feature extraction network based on the pre-trained neural network model obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further comprises:
and performing data augmentation processing on the image to be detected.
4. A method according to claim 1 or 3, wherein before the feature extraction network based on the pre-trained neural network model obtains the feature map of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale, the method further comprises:
according to a pre-stored image enhancement function and each preset brightness, the brightness of the image to be detected is adjusted, and a target image to be detected corresponding to each preset brightness is obtained;
And identifying human body region images in the target to-be-detected images corresponding to each preset brightness, fusing the human body region images with each pre-stored background image, and determining the fused target to-be-detected images.
5. The method according to claim 4, wherein the method further comprises:
and adjusting the size of the target image to be detected to a preset size.
6. The method of claim 1, wherein the training process of the neural network model comprises:
acquiring first tag data of a hand region position marked in advance in each sample image containing the hand region;
inputting each sample image into an original neural network model, processing each sample image based on the original neural network model, determining each target prediction area corresponding to each sample image, and determining the position of each target prediction area as each target label data of a hand area in each sample image;
and adjusting the parameter values of the parameters of the original neural network model according to the first label data and the target label data to obtain the trained neural network model.
7. The method of claim 6, wherein the processing each sample image based on the original neural network model, determining each target prediction region corresponding to each sample image comprises:
for each sample image, based on a feature extraction network of an original neural network model, acquiring a sample feature map of each preset scale of the sample image according to a convolution layer corresponding to each preset scale; and the original neural network model determines each prediction region and the corresponding confidence coefficient based on the sample feature map of each preset scale, and outputs a target prediction region with the maximum confidence coefficient.
8. A hand detection device, the device comprising:
the extraction module is used for extracting a network based on the characteristics of the neural network model which is trained in advance, and acquiring a characteristic diagram of each preset scale of the image to be detected according to the convolution layer corresponding to each preset scale;
the determining module is used for determining each prediction area and the corresponding confidence coefficient based on the feature map of each preset scale by the neural network model, and outputting a target prediction area with the maximum confidence coefficient as a hand area.
9. An electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory has stored therein a computer program which, when executed by the processor, causes the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that it stores a computer program executable by a processor, which when run on the processor causes the processor to perform the method of any of claims 1-7.
CN202111390280.8A 2021-11-23 2021-11-23 Hand detection method, device, equipment and medium Pending CN116168413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111390280.8A CN116168413A (en) 2021-11-23 2021-11-23 Hand detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116168413A true CN116168413A (en) 2023-05-26

Family

ID=86415030



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination