CN117095339A - RGB video frame handheld object detection method based on deep learning - Google Patents

RGB video frame handheld object detection method based on deep learning

Info

Publication number
CN117095339A
CN117095339A (application CN202311359648.3A)
Authority
CN
China
Prior art keywords
hand
pixels
area
image
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311359648.3A
Other languages
Chinese (zh)
Other versions
CN117095339B (en)
Inventor
陶涛
马勇
邹健
刘玲蒙
赵涵
唐泳
苏松
李凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN202311359648.3A priority Critical patent/CN117095339B/en
Publication of CN117095339A publication Critical patent/CN117095339A/en
Application granted granted Critical
Publication of CN117095339B publication Critical patent/CN117095339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/763: Image or video recognition using clustering, non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 40/107: Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGB video frame handheld object detection method based on deep learning. A short video to be detected is sampled at a fixed frame sampling density to obtain an ordered video frame sequence, which is then passed through a hand-object target area selection model and a collaborative salient object detection model constructed by the method to obtain a hand-object image. The hand-object image is converted from RGB format to HSV format and clustered with the K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels, from which the hand-object ratio is calculated. Whether the hand-object ratio is larger than a ratio threshold θ then determines whether an object is held in the hand. In this way, the method removes the current reliance on an RGB-D camera for handheld object detection and reduces shooting difficulty and shooting cost.

Description

RGB video frame handheld object detection method based on deep learning
Technical Field
The invention relates to the field of visual recognition, in particular to a deep-learning-based method for detecting handheld objects in RGB video frames.
Background
Handheld object detection has important applications in scenarios such as intelligent interaction and behavior analysis. In an intelligent interaction scenario, detecting the object a user holds in the hand, such as a pen or a remote controller, enables a more natural and flexible mode of human-computer interaction; in a behavior analysis scenario, detecting the objects pedestrians hold in the hand, such as beverages or food, allows the user's interests and hobbies to be analyzed and personalized, accurate service recommendations to be provided. In the prior art, an RGB-D image is usually required to obtain the mask of the handheld object area in the image, while a single RGB image must be combined with a target detection algorithm to locate and identify the handheld object; the target detection algorithm needs a dedicated data set constructed for each specific detection scene for training, which costs a great deal of time and labor, generalizes poorly, and is difficult to apply directly to other similar scenes. The invention therefore provides a deep-learning-based method for detecting handheld objects in RGB video frames, which has important practical significance.
Patent document CN111553891A, entitled "A handheld object presence detection method", uses RGB-D images: an RGB camera and a depth camera are first calibrated, the calibrated RGB-D camera is then used for image acquisition, a hand joint point detection program is combined with the depth values of the depth image, a region growing algorithm is used to obtain the hand-related area mask, the scaling factor of the arm skin is finally calculated, and whether an object is held in the hand is determined according to a set threshold. However, this method requires a depth camera, the shooting process is complicated, and the shooting cost is high.
Patent document CN112016398A, entitled "Handheld object recognition method and device", uses deep-learning target detection to detect objects and hands in RGB pictures and designs hardware equipment for intelligent robot interaction. However, the method requires a data set of the handheld objects to be recognized, collected in advance for the specific application scenario, so that the deep learning algorithm can be trained; moreover, it can only produce a detection box for the handheld object and cannot generate a high-precision pixel-level area mask.
Disclosure of Invention
In order to solve the above technical problems, the invention adopts the following technical scheme: an RGB video frame handheld object detection method based on deep learning, comprising the following steps:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
s20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
s40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
s60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
s70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
in the bit multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask;
the hand-object ratio is obtained by dividing the number of hand area pixels by the number of hand object area pixels;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
Further, the step S10 includes:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
Further, the step S20 includes:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
S22, calculate the minimum circumscribed rectangle containing the 21 hand key points;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
S23, calculate the target area bounding box from the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
S26, perform steps S21-S25 on each picture of the ordered video frame sequence to obtain the final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
Further, the minimum circumscribed rectangle of the 21 hand key points is obtained as follows:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
further, the step S30 includes:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used to extract the consistency information contained in the feature vectors of different pictures;
the consistency salient objects are the objects commonly contained in all pictures;
S34, superimpose the consistency salient-object feature map onto the feature map of each picture, and obtain the prediction mask of the salient object in each picture through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
Further, the S50 includes:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:
1. The RGB video frame handheld object detection method based on deep learning provided by the invention requires only multi-frame RGB images to perform handheld object detection, which solves the problem that handheld object detection currently has to rely on an RGB-D camera and reduces shooting difficulty and shooting cost.
2. The method applies a collaborative saliency detection model to handheld object detection for the first time: across multiple video frames the hand and the handheld object can be regarded as a whole whose relative position remains consistent, so even when the background changes, the masks of the hand and the handheld object can be obtained efficiently and accurately with the collaborative saliency detection model.
3. The deep learning model used has strong generalization ability and can adaptively detect the consistent salient object across video frames without separate training for a specific detection scene, which avoids retraining the deep learning model on large amounts of scene-specific data and saves the time and computational cost of training the neural network.
4. The method can detect the handheld object in video frames and obtain its position and appearance information in each frame; using the detection result, a simple classifier can further provide the category of the handheld object. It can be applied in many scenarios, such as human-computer interaction of intelligent robots, monitoring and counting of objects picked up and put back on a workbench, monitoring of handheld dangerous objects in public safety management and control, and virtual reality.
Drawings
Fig. 1 is a flowchart of a method for detecting an RGB video frame handheld object based on deep learning.
Fig. 2 is a flowchart of constructing the hand-object target area selection model in the RGB video frame handheld object detection method based on deep learning provided by the invention.
Fig. 3 is a flowchart of constructing the collaborative saliency object detection model in the RGB video frame handheld object detection method based on deep learning provided by the invention.
Fig. 4 is a flowchart of K-Means clustering of the RGB video frame handheld object detection method based on deep learning provided by the invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is thereby defined more clearly.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention; however, the present invention may be practiced otherwise than as described herein. It will be apparent that the embodiments described in the specification are only some, rather than all, of the embodiments of the invention.
Fig. 1 is a flowchart of a method for detecting an RGB video frame handheld object based on deep learning according to an embodiment of the present invention, where the method includes:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
further, the step S10 includes:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
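A minimal sketch of this frame extraction step with OpenCV is given below. Interpreting the sampling density as the number of frames kept per second, and the 30 fps fallback, are assumptions of this illustration; the patent only states that the default density value is 3.

```python
import cv2

def extract_frames(video_path, sampling_density=3):
    """Extract an ordered frame sequence from a short RGB video segment.

    `sampling_density` is assumed here to mean frames kept per second.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back to 30 fps if unknown
    step = max(int(round(fps / sampling_density)), 1)  # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV decodes frames as BGR; convert to RGB for downstream processing
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames  # ordered video frame sequence f_1 ... f_B
```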
S20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
further, referring to fig. 2, the step S20 includes:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
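The following sketch shows how the 21 hand key points could be obtained with the MediaPipe Hands solution. Keeping only the first detected hand and scaling the normalized landmarks to pixel coordinates are assumptions of this illustration, not requirements stated by the patent.

```python
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def detect_hand_keypoints(frame_rgb):
    """Return the 21 hand key points of one frame as pixel coordinates (x_i, y_i)."""
    h, w = frame_rgb.shape[:2]
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(frame_rgb)
    if not result.multi_hand_landmarks:
        return None  # no hand detected in this frame
    landmarks = result.multi_hand_landmarks[0].landmark
    # MediaPipe returns normalized coordinates; scale to the pixel grid,
    # origin at the top-left corner, x to the right, y downward.
    return np.array([(lm.x * w, lm.y * h) for lm in landmarks])
```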
S22, calculate the minimum circumscribed rectangle containing the 21 hand key points;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
S23, calculate the target area bounding box from the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
s26, executing S21-S25 steps on each picture of the ordered video frame sequence to obtain a final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
Further, the minimum circumscribed rectangle of the 21 hand key points is obtained as follows:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
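A small illustration of steps S22-S25 under the definitions above: compute the minimum circumscribed rectangle of the key points, enlarge it by the expansion ratio, and crop the hand-object region. Clipping the expanded box to the image border is an added assumption not stated in the patent.

```python
def hand_object_crop(frame_rgb, keypoints, expansion_ratio=3.0):
    """Minimum circumscribed rectangle of the 21 key points, expanded and cropped."""
    h, w = frame_rgb.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    x_c, y_c = (x_min + x_max) / 2, (y_min + y_max) / 2   # rectangle center
    rw, rh = (x_max - x_min), (y_max - y_min)             # rectangle width and height
    bw, bh = rw * expansion_ratio, rh * expansion_ratio   # expanded target bounding box
    # Clip the expanded box to the image border (implementation detail assumed here)
    x0, x1 = max(int(x_c - bw / 2), 0), min(int(x_c + bw / 2), w)
    y0, y1 = max(int(y_c - bh / 2), 0), min(int(y_c + bh / 2), h)
    return frame_rgb[y0:y1, x0:x1]   # hand-object region picture g_t
```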
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
further, referring to fig. 3, the step S30 includes:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
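One possible preprocessing pipeline for S31, sketched with torchvision transforms. The augmentation magnitudes, the 224x224 target size and the ImageNet normalization statistics are assumptions for illustration, not values given by the patent.

```python
import torchvision.transforms as T

# Illustrative Aug / Resize / Norm pipeline for the hand-object region pictures
preprocess = T.Compose([
    T.ToPILImage(),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # translation + rotation (Aug)
    T.ColorJitter(brightness=0.2, contrast=0.2),       # color transformation (Aug)
    T.Resize((224, 224)),                              # Resize to a specified size
    T.ToTensor(),                                      # pixel values scaled to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],            # Norm: assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```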
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
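A hedged sketch of the S32 encoder: a ResNet101 backbone producing the feature maps X and, through global pooling plus a linear layer, one feature vector per picture. The pooling step and Dim = 256 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet101 backbone returning feature maps X and per-picture feature vectors V."""
    def __init__(self, dim=256):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(2048, dim)                # linear mapping to Dim

    def forward(self, images):                          # images: (B, 3, H, W)
        x = self.features(images)                       # feature maps X: (B, 2048, H', W')
        v = self.proj(self.pool(x).flatten(1))          # feature vectors V: (B, Dim)
        return x, v
```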
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used to extract the consistency information contained in the feature vectors of different pictures;
the consistency salient objects are the objects commonly contained in all pictures;
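The patent does not disclose the exact cross-attention formulation, so the following is only a minimal stand-in: the feature vectors of the B frames attend to one another so that information shared across the frames (the consistent salient object) is emphasized.

```python
import torch.nn as nn

class ConsistencyAttention(nn.Module):
    """Minimal cross-attention sketch over the feature vectors of one frame group."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    def forward(self, v):                   # v: (B, Dim) feature vectors of the group
        tokens = v.unsqueeze(0)             # treat the B frames as one sequence: (1, B, Dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.squeeze(0)             # (B, Dim) consistency-aware features
```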
S34, superimpose the consistency salient-object feature map onto the feature map of each picture, and obtain the prediction mask of the salient object in each picture through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
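An illustrative decoder for S34 under the reconstruction above: the consistency feature is broadcast over each picture's feature map, concatenated (cat), decoded with a transposed convolution and upsampling, and turned into a two-class probability map by Softmax. The channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class SegHead(nn.Module):
    """Sketch of the semantic segmentation module Seg(cat(X_t, F_co)) followed by Softmax."""
    def __init__(self, in_ch=2048, dim=256, num_classes=2):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_ch + dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x, v_cons, out_size):            # x: (B, 2048, H', W'), v_cons: (B, Dim)
        v_map = v_cons[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        logits = self.decode(torch.cat([x, v_map], dim=1))
        logits = nn.functional.interpolate(logits, size=out_size,
                                           mode='bilinear', align_corners=False)
        return torch.softmax(logits, dim=1)             # per-pixel foreground/background probs
```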
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
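A possible training step for S35-S36, assuming the Encoder, ConsistencyAttention and SegHead sketches above. The optimizer, learning rate, and the use of NLL on log-probabilities (equivalent to cross-entropy on the Softmax output) are implementation assumptions.

```python
import torch
import torch.nn as nn

# Assumes the Encoder, ConsistencyAttention and SegHead sketches defined earlier.
encoder, attention, seg_head = Encoder(), ConsistencyAttention(), SegHead()
criterion = nn.NLLLoss()                        # cross-entropy of the softmax probabilities
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(attention.parameters()) + list(seg_head.parameters()),
    lr=1e-4)

def train_step(images, gt_masks):
    """One optimization step: images (B, 3, H, W), gt_masks (B, H, W) with values {0, 1}."""
    x, v = encoder(images)
    v_cons = attention(v)
    probs = seg_head(x, v_cons, out_size=gt_masks.shape[-2:])    # (B, 2, H, W) probabilities
    loss = criterion(torch.log(probs + 1e-8), gt_masks.long())   # loss value vs real mask label
    optimizer.zero_grad()
    loss.backward()                 # back-propagation to optimize the network parameters
    optimizer.step()
    return loss.item()
```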
S40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
in the bit-wise multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask.
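The bit-wise multiplication of S40 can be sketched as follows; binarizing the predicted mask at 0.5 is an added assumption.

```python
import numpy as np

def apply_mask(region_rgb, mask):
    """Multiply each RGB channel by the single-channel hand-object region mask,
    keeping only the hand and the held object and zeroing the background."""
    mask = (mask > 0.5).astype(region_rgb.dtype)   # binarize the predicted mask
    return region_rgb * mask[:, :, None]           # broadcast over the 3 channels
```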
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
further, referring to fig. 4, the S50 includes:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
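A sketch of S50 with OpenCV and scikit-learn: the H, S, V values of each foreground pixel form the data point set, which is split into two clusters with K-Means (K = 2). How the two clusters are assigned to "hand" and "hand-held object" is not specified by the patent and is left open here.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def count_hand_and_object_pixels(hand_object_rgb):
    """Cluster foreground HSV pixels into two clusters and return their pixel counts."""
    hsv = cv2.cvtColor(hand_object_rgb, cv2.COLOR_RGB2HSV)
    foreground = np.any(hand_object_rgb != 0, axis=2)   # exclude [0, 0, 0] background pixels
    points = hsv[foreground].astype(np.float32)         # data point set, one row per pixel
    if len(points) < 2:
        return len(points), 0                            # degenerate case: almost no foreground
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    n0, n1 = np.bincount(labels, minlength=2)
    return int(n0), int(n1)                              # pixel counts of the two clusters
```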
S60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
the hand-object ratio is determined by dividing the number of hand area pixels by the number of hand-held object area pixels.
S70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
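The decision rule of S60-S70 reduces to a few lines; the division-by-zero guard is an added assumption.

```python
def is_object_held(hand_pixels, object_pixels, theta=0.1):
    """Hand-object ratio test as described above: ratio = hand pixels / object pixels.
    A ratio larger than the threshold theta indicates no object is held; otherwise
    an object is held in the hand."""
    if object_pixels == 0:
        return False                      # no candidate object pixels at all
    ratio = hand_pixels / object_pixels
    return ratio <= theta                 # True -> an object is held in the hand
```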
According to the RGB video frame handheld object detection method based on deep learning provided by the invention, handheld object detection can be achieved using only multi-frame RGB images, which solves the problem that handheld object detection currently has to rely on an RGB-D camera and reduces shooting difficulty and shooting cost. The collaborative saliency detection model is applied to handheld object detection for the first time: across multiple video frames the hand and the handheld object can be regarded as a whole whose relative position remains consistent, so even when the background changes, the masks of the hand and the handheld object can be obtained efficiently and accurately with the collaborative saliency detection model. The deep learning model used has strong generalization ability and can adaptively detect the consistent salient object across video frames without separate training for a specific detection scene, which avoids retraining the deep learning model on large amounts of scene-specific data and saves the time and computational cost of training the neural network. The method can detect the handheld object in video frames, obtain its position and appearance information in each frame, and, using the detection result, a simple classifier can further provide the category of the handheld object; it can be applied in many scenarios, such as human-computer interaction of intelligent robots, monitoring and counting of objects picked up and put back on a workbench, monitoring of handheld dangerous objects in public safety management and control, and virtual reality.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for detecting handheld objects in RGB video frames based on deep learning, characterized by comprising the following steps:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
s20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
s40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
s60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
s70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
in the bit multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask;
the hand-object ratio is obtained by dividing the number of hand area pixels by the number of hand object area pixels;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
2. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S10 comprises:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
3. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S20 comprises:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
s22, calculating to obtain a minimum circumscribed rectangle containing 21 key points of the hand;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
s23, calculating a boundary box of the target area through the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
s26, executing S21-S25 steps on each picture of the ordered video frame sequence to obtain a final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
4. A method for detecting an RGB video frame handheld object based on deep learning as claimed in claim 3, wherein the minimum bounding rectangle of 21 key points of the hand comprises:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
5. the method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S30 comprises:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used for extracting consistency information contained in the feature vectors of different pictures;
the consistency salients are objects commonly contained in all pictures;
s34, respectively overlapping the feature images of the consistent salient objects into feature images corresponding to each image, and obtaining a prediction mask of the salient objects in each image through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
6. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S50 comprises:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
CN202311359648.3A 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning Active CN117095339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311359648.3A CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311359648.3A CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN117095339A true CN117095339A (en) 2023-11-21
CN117095339B (en) 2024-01-30

Family

ID=88770171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311359648.3A Active CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN117095339B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106387A (en) * 2011-11-15 2013-05-15 中国科学院深圳先进技术研究院 Method and device of image recognition
CN108062525A (en) * 2017-12-14 2018-05-22 中国科学技术大学 A kind of deep learning hand detection method based on hand region prediction
CN111382600A (en) * 2018-12-28 2020-07-07 山东华软金盾软件股份有限公司 Security video monochromatic shelter detection device and method
CN111553326A (en) * 2020-05-29 2020-08-18 上海依图网络科技有限公司 Hand motion recognition method and device, electronic equipment and storage medium
CN112001313A (en) * 2020-08-25 2020-11-27 北京深醒科技有限公司 Image identification method and device based on attribution key points
CN112016398A (en) * 2020-07-29 2020-12-01 华为技术有限公司 Handheld object identification method and device
DE102020127508A1 (en) * 2019-10-24 2021-04-29 Nvidia Corporation POSITION TRACKING OBJECTS IN HAND
CN114821764A (en) * 2022-01-25 2022-07-29 哈尔滨工程大学 Gesture image recognition method and system based on KCF tracking detection

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106387A (en) * 2011-11-15 2013-05-15 中国科学院深圳先进技术研究院 Method and device of image recognition
CN108062525A (en) * 2017-12-14 2018-05-22 中国科学技术大学 A kind of deep learning hand detection method based on hand region prediction
CN111382600A (en) * 2018-12-28 2020-07-07 山东华软金盾软件股份有限公司 Security video monochromatic shelter detection device and method
DE102020127508A1 (en) * 2019-10-24 2021-04-29 Nvidia Corporation POSITION TRACKING OBJECTS IN HAND
CN111553326A (en) * 2020-05-29 2020-08-18 上海依图网络科技有限公司 Hand motion recognition method and device, electronic equipment and storage medium
CN112016398A (en) * 2020-07-29 2020-12-01 华为技术有限公司 Handheld object identification method and device
WO2022022292A1 (en) * 2020-07-29 2022-02-03 华为技术有限公司 Method and device for recognizing handheld object
CN112001313A (en) * 2020-08-25 2020-11-27 北京深醒科技有限公司 Image identification method and device based on attribution key points
CN114821764A (en) * 2022-01-25 2022-07-29 哈尔滨工程大学 Gesture image recognition method and system based on KCF tracking detection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ABHINAV G. et al.: "SHOP: A Deep Learning Based Pipeline for Near Real-Time Detection of Small Handheld Objects Present in Blurry Video", SOUTHEASTCON 2022
KRISHNEEL C. et al.: "Learning to Segment Generic Handheld Objects Using Class-Agnostic Deep Comparison and Segmentation Network", IEEE ROBOTICS AND AUTOMATION LETTERS
刘祎莹: "Handheld object recognition technology and implementation based on zero-shot learning" (基于零样本学习的手持物识别技术与实现), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库)

Also Published As

Publication number Publication date
CN117095339B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN108492343B (en) Image synthesis method for training data for expanding target recognition
CN109583483B (en) Target detection method and system based on convolutional neural network
US6961466B2 (en) Method and apparatus for object recognition
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN110334762B (en) Feature matching method based on quad tree combined with ORB and SIFT
CN109948549B (en) OCR data generation method and device, computer equipment and storage medium
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN110991258B (en) Face fusion feature extraction method and system
CN112784869B (en) Fine-grained image identification method based on attention perception and counterstudy
CN112396036B (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN110866900A (en) Water body color identification method and device
CN111325184B (en) Intelligent interpretation and change information detection method for remote sensing image
CN115147932A (en) Static gesture recognition method and system based on deep learning
CN111767854A (en) SLAM loop detection method combined with scene text semantic information
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
WO2022063321A1 (en) Image processing method and apparatus, device and storage medium
Wang et al. Perception-guided multi-channel visual feature fusion for image retargeting
CN117095339B (en) RGB video frame handheld object detection method based on deep learning
CN116432160A (en) Slider verification code identification method and system based on RPA and LBP characteristics
CN116503622A (en) Data acquisition and reading method based on computer vision image
KR100532129B1 (en) lip region segmentation and feature extraction method for Speech Recognition
CN112418344B (en) Training method, target detection method, medium and electronic equipment
CN111881732B (en) SVM (support vector machine) -based face quality evaluation method
CN116704518A (en) Text recognition method and device, electronic equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant