CN117095339A - RGB video frame handheld object detection method based on deep learning - Google Patents

RGB video frame handheld object detection method based on deep learning

Info

Publication number
CN117095339A
CN117095339A (application CN202311359648.3A)
Authority
CN
China
Prior art keywords
hand
pixels
area
image
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311359648.3A
Other languages
Chinese (zh)
Other versions
CN117095339B (en)
Inventor
陶涛
马勇
邹健
刘玲蒙
赵涵
唐泳
苏松
李凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN202311359648.3A priority Critical patent/CN117095339B/en
Publication of CN117095339A publication Critical patent/CN117095339A/en
Application granted granted Critical
Publication of CN117095339B publication Critical patent/CN117095339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/763: Image or video recognition using clustering, non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 40/107: Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGB video frame handheld object detection method based on deep learning. A short video to be detected is sampled at a fixed frame sampling density to obtain an ordered video frame sequence, which is then passed through a hand-object target area selection model and a collaborative salient object detection model constructed by the method to obtain a hand-object image. The hand-object image is converted from RGB format to HSV format and clustered with the K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels, from which the hand-object ratio is calculated. Whether the hand-object ratio is larger than a ratio threshold θ then determines whether an object is held in the hand. In this way, the method removes the current reliance on an RGB-D camera for handheld object detection and reduces shooting difficulty and shooting cost.

Description

RGB video frame handheld object detection method based on deep learning
Technical Field
The invention relates to the field of visual recognition, in particular to a deep-learning-based method for detecting handheld objects in RGB video frames.
Background
Handheld object detection has important applications in scenarios such as intelligent interaction and behavior analysis. In an intelligent interaction scenario, detecting the object a user holds in the hand, such as a pen or a remote controller, enables a more natural and flexible mode of human-computer interaction; in a behavior analysis scenario, detecting the objects pedestrians hold in the hand, such as beverages or food, allows the user's interests and hobbies to be analyzed and personalized, accurate service recommendations to be provided. In the prior art, an RGB-D image is usually required to obtain the mask of the handheld object area in the image, while a single RGB image must be combined with a target detection algorithm to locate and identify the handheld object; the target detection algorithm needs a dedicated data set constructed for each specific detection scene for training, which costs a great deal of time and labor, generalizes poorly, and is difficult to apply directly to other similar scenes. The invention therefore provides a deep-learning-based method for detecting handheld objects in RGB video frames, which has important practical significance.
Patent document CN111553891A, entitled "A handheld object presence detection method", uses RGB-D images: an RGB camera and a depth camera are first calibrated, the calibrated RGB-D camera is then used for image acquisition, a hand joint point detection program is combined with the depth values of the depth image, a region growing algorithm is used to obtain the hand-related area mask, the scaling factor of the arm skin is finally calculated, and whether an object is held in the hand is determined according to a set threshold. However, this method requires a depth camera, the shooting process is complicated, and the shooting cost is high.
Patent document CN112016398A, entitled "Handheld object recognition method and device", uses deep-learning target detection to detect objects and hands in RGB pictures and designs hardware equipment for intelligent robot interaction. However, the method requires a data set of the handheld objects to be recognized, collected in advance for the specific application scenario, so that the deep learning algorithm can be trained; moreover, it can only produce a detection box for the handheld object and cannot generate a high-precision pixel-level area mask.
Disclosure of Invention
In order to solve the above technical problems, the invention adopts the following technical scheme: an RGB video frame handheld object detection method based on deep learning, comprising the following steps:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
s20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
s40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
s60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
s70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
in the bit multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask;
the hand-object ratio is obtained by dividing the number of hand area pixels by the number of hand object area pixels;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
Further, the step S10 includes:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
Further, the step S20 includes:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
S22, calculate the minimum circumscribed rectangle containing the 21 hand key points;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
S23, calculate the target area bounding box from the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
S26, perform steps S21-S25 on each picture of the ordered video frame sequence to obtain the final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
Further, the minimum circumscribed rectangle of the 21 hand key points is obtained as follows:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
further, the step S30 includes:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used to extract the consistency information contained in the feature vectors of different pictures;
the consistency salient objects are the objects commonly contained in all pictures;
S34, superimpose the consistency salient-object feature map onto the feature map of each picture, and obtain the prediction mask of the salient object in each picture through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
Further, the S50 includes:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:
1. The RGB video frame handheld object detection method based on deep learning provided by the invention requires only multi-frame RGB images to perform handheld object detection, which solves the problem that handheld object detection currently has to rely on an RGB-D camera and reduces shooting difficulty and shooting cost.
2. The method applies a collaborative saliency detection model to handheld object detection for the first time: across multiple video frames the hand and the handheld object can be regarded as a whole whose relative position remains consistent, so even when the background changes, the masks of the hand and the handheld object can be obtained efficiently and accurately with the collaborative saliency detection model.
3. The deep learning model used has strong generalization ability and can adaptively detect the consistent salient object across video frames without separate training for a specific detection scene, which avoids retraining the deep learning model on large amounts of scene-specific data and saves the time and computational cost of training the neural network.
4. The method can detect the handheld object in video frames and obtain its position and appearance information in each frame; using the detection result, a simple classifier can further provide the category of the handheld object. It can be applied in many scenarios, such as human-computer interaction of intelligent robots, monitoring and counting of objects picked up and put back on a workbench, monitoring of handheld dangerous objects in public safety management and control, and virtual reality.
Drawings
Fig. 1 is a flowchart of a method for detecting an RGB video frame handheld object based on deep learning.
Fig. 2 is a flowchart of constructing the hand-object target area selection model in the RGB video frame handheld object detection method based on deep learning provided by the invention.
Fig. 3 is a flowchart of constructing the collaborative saliency object detection model in the RGB video frame handheld object detection method based on deep learning provided by the invention.
Fig. 4 is a flowchart of K-Means clustering of the RGB video frame handheld object detection method based on deep learning provided by the invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is thereby defined more clearly.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention; however, the present invention may be practiced otherwise than as described herein. It will be apparent that the embodiments described in the specification are only some, rather than all, of the embodiments of the invention.
Fig. 1 is a flowchart of a method for detecting an RGB video frame handheld object based on deep learning according to an embodiment of the present invention, where the method includes:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
further, the step S10 includes:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
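A minimal sketch of this frame extraction step with OpenCV is given below. Interpreting the sampling density as the number of frames kept per second, and the 30 fps fallback, are assumptions of this illustration; the patent only states that the default density value is 3.

```python
import cv2

def extract_frames(video_path, sampling_density=3):
    """Extract an ordered frame sequence from a short RGB video segment.

    `sampling_density` is assumed here to mean frames kept per second.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back to 30 fps if unknown
    step = max(int(round(fps / sampling_density)), 1)  # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV decodes frames as BGR; convert to RGB for downstream processing
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames  # ordered video frame sequence f_1 ... f_B
```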
S20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
further, referring to fig. 2, the step S20 includes:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
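The following sketch shows how the 21 hand key points could be obtained with the MediaPipe Hands solution. Keeping only the first detected hand and scaling the normalized landmarks to pixel coordinates are assumptions of this illustration, not requirements stated by the patent.

```python
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def detect_hand_keypoints(frame_rgb):
    """Return the 21 hand key points of one frame as pixel coordinates (x_i, y_i)."""
    h, w = frame_rgb.shape[:2]
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(frame_rgb)
    if not result.multi_hand_landmarks:
        return None  # no hand detected in this frame
    landmarks = result.multi_hand_landmarks[0].landmark
    # MediaPipe returns normalized coordinates; scale to the pixel grid,
    # origin at the top-left corner, x to the right, y downward.
    return np.array([(lm.x * w, lm.y * h) for lm in landmarks])
```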
S22, calculate the minimum circumscribed rectangle containing the 21 hand key points;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
S23, calculate the target area bounding box from the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
s26, executing S21-S25 steps on each picture of the ordered video frame sequence to obtain a final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
Further, the minimum circumscribed rectangle of the 21 hand key points is obtained as follows:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
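A small illustration of steps S22-S25 under the definitions above: compute the minimum circumscribed rectangle of the key points, enlarge it by the expansion ratio, and crop the hand-object region. Clipping the expanded box to the image border is an added assumption not stated in the patent.

```python
def hand_object_crop(frame_rgb, keypoints, expansion_ratio=3.0):
    """Minimum circumscribed rectangle of the 21 key points, expanded and cropped."""
    h, w = frame_rgb.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    x_c, y_c = (x_min + x_max) / 2, (y_min + y_max) / 2   # rectangle center
    rw, rh = (x_max - x_min), (y_max - y_min)             # rectangle width and height
    bw, bh = rw * expansion_ratio, rh * expansion_ratio   # expanded target bounding box
    # Clip the expanded box to the image border (implementation detail assumed here)
    x0, x1 = max(int(x_c - bw / 2), 0), min(int(x_c + bw / 2), w)
    y0, y1 = max(int(y_c - bh / 2), 0), min(int(y_c + bh / 2), h)
    return frame_rgb[y0:y1, x0:x1]   # hand-object region picture g_t
```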
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
further, referring to fig. 3, the step S30 includes:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
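One possible preprocessing pipeline for S31, sketched with torchvision transforms. The augmentation magnitudes, the 224x224 target size and the ImageNet normalization statistics are assumptions for illustration, not values given by the patent.

```python
import torchvision.transforms as T

# Illustrative Aug / Resize / Norm pipeline for the hand-object region pictures
preprocess = T.Compose([
    T.ToPILImage(),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # translation + rotation (Aug)
    T.ColorJitter(brightness=0.2, contrast=0.2),       # color transformation (Aug)
    T.Resize((224, 224)),                              # Resize to a specified size
    T.ToTensor(),                                      # pixel values scaled to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],            # Norm: assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```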
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
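A hedged sketch of the S32 encoder: a ResNet101 backbone producing the feature maps X and, through global pooling plus a linear layer, one feature vector per picture. The pooling step and Dim = 256 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet101 backbone returning feature maps X and per-picture feature vectors V."""
    def __init__(self, dim=256):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(2048, dim)                # linear mapping to Dim

    def forward(self, images):                          # images: (B, 3, H, W)
        x = self.features(images)                       # feature maps X: (B, 2048, H', W')
        v = self.proj(self.pool(x).flatten(1))          # feature vectors V: (B, Dim)
        return x, v
```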
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used to extract the consistency information contained in the feature vectors of different pictures;
the consistency salient objects are the objects commonly contained in all pictures;
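The patent does not disclose the exact cross-attention formulation, so the following is only a minimal stand-in: the feature vectors of the B frames attend to one another so that information shared across the frames (the consistent salient object) is emphasized.

```python
import torch.nn as nn

class ConsistencyAttention(nn.Module):
    """Minimal cross-attention sketch over the feature vectors of one frame group."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    def forward(self, v):                   # v: (B, Dim) feature vectors of the group
        tokens = v.unsqueeze(0)             # treat the B frames as one sequence: (1, B, Dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.squeeze(0)             # (B, Dim) consistency-aware features
```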
S34, superimpose the consistency salient-object feature map onto the feature map of each picture, and obtain the prediction mask of the salient object in each picture through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
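An illustrative decoder for S34 under the reconstruction above: the consistency feature is broadcast over each picture's feature map, concatenated (cat), decoded with a transposed convolution and upsampling, and turned into a two-class probability map by Softmax. The channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class SegHead(nn.Module):
    """Sketch of the semantic segmentation module Seg(cat(X_t, F_co)) followed by Softmax."""
    def __init__(self, in_ch=2048, dim=256, num_classes=2):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_ch + dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x, v_cons, out_size):            # x: (B, 2048, H', W'), v_cons: (B, Dim)
        v_map = v_cons[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        logits = self.decode(torch.cat([x, v_map], dim=1))
        logits = nn.functional.interpolate(logits, size=out_size,
                                           mode='bilinear', align_corners=False)
        return torch.softmax(logits, dim=1)             # per-pixel foreground/background probs
```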
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
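A possible training step for S35-S36, assuming the Encoder, ConsistencyAttention and SegHead sketches above. The optimizer, learning rate, and the use of NLL on log-probabilities (equivalent to cross-entropy on the Softmax output) are implementation assumptions.

```python
import torch
import torch.nn as nn

# Assumes the Encoder, ConsistencyAttention and SegHead sketches defined earlier.
encoder, attention, seg_head = Encoder(), ConsistencyAttention(), SegHead()
criterion = nn.NLLLoss()                        # cross-entropy of the softmax probabilities
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(attention.parameters()) + list(seg_head.parameters()),
    lr=1e-4)

def train_step(images, gt_masks):
    """One optimization step: images (B, 3, H, W), gt_masks (B, H, W) with values {0, 1}."""
    x, v = encoder(images)
    v_cons = attention(v)
    probs = seg_head(x, v_cons, out_size=gt_masks.shape[-2:])    # (B, 2, H, W) probabilities
    loss = criterion(torch.log(probs + 1e-8), gt_masks.long())   # loss value vs real mask label
    optimizer.zero_grad()
    loss.backward()                 # back-propagation to optimize the network parameters
    optimizer.step()
    return loss.item()
```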
S40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
in the bit-wise multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask.
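The bit-wise multiplication of S40 can be sketched as follows; binarizing the predicted mask at 0.5 is an added assumption.

```python
import numpy as np

def apply_mask(region_rgb, mask):
    """Multiply each RGB channel by the single-channel hand-object region mask,
    keeping only the hand and the held object and zeroing the background."""
    mask = (mask > 0.5).astype(region_rgb.dtype)   # binarize the predicted mask
    return region_rgb * mask[:, :, None]           # broadcast over the 3 channels
```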
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
further, referring to fig. 4, the S50 includes:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
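A sketch of S50 with OpenCV and scikit-learn: the H, S, V values of each foreground pixel form the data point set, which is split into two clusters with K-Means (K = 2). How the two clusters are assigned to "hand" and "hand-held object" is not specified by the patent and is left open here.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def count_hand_and_object_pixels(hand_object_rgb):
    """Cluster foreground HSV pixels into two clusters and return their pixel counts."""
    hsv = cv2.cvtColor(hand_object_rgb, cv2.COLOR_RGB2HSV)
    foreground = np.any(hand_object_rgb != 0, axis=2)   # exclude [0, 0, 0] background pixels
    points = hsv[foreground].astype(np.float32)         # data point set, one row per pixel
    if len(points) < 2:
        return len(points), 0                            # degenerate case: almost no foreground
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    n0, n1 = np.bincount(labels, minlength=2)
    return int(n0), int(n1)                              # pixel counts of the two clusters
```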
S60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
the hand-object ratio is determined by dividing the number of hand area pixels by the number of hand-held object area pixels.
S70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
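The decision rule of S60-S70 reduces to a few lines; the division-by-zero guard is an added assumption.

```python
def is_object_held(hand_pixels, object_pixels, theta=0.1):
    """Hand-object ratio test as described above: ratio = hand pixels / object pixels.
    A ratio larger than the threshold theta indicates no object is held; otherwise
    an object is held in the hand."""
    if object_pixels == 0:
        return False                      # no candidate object pixels at all
    ratio = hand_pixels / object_pixels
    return ratio <= theta                 # True -> an object is held in the hand
```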
According to the RGB video frame handheld object detection method based on deep learning provided by the invention, handheld object detection can be achieved using only multi-frame RGB images, which solves the problem that handheld object detection currently has to rely on an RGB-D camera and reduces shooting difficulty and shooting cost. The collaborative saliency detection model is applied to handheld object detection for the first time: across multiple video frames the hand and the handheld object can be regarded as a whole whose relative position remains consistent, so even when the background changes, the masks of the hand and the handheld object can be obtained efficiently and accurately with the collaborative saliency detection model. The deep learning model used has strong generalization ability and can adaptively detect the consistent salient object across video frames without separate training for a specific detection scene, which avoids retraining the deep learning model on large amounts of scene-specific data and saves the time and computational cost of training the neural network. The method can detect the handheld object in video frames, obtain its position and appearance information in each frame, and, using the detection result, a simple classifier can further provide the category of the handheld object; it can be applied in many scenarios, such as human-computer interaction of intelligent robots, monitoring and counting of objects picked up and put back on a workbench, monitoring of handheld dangerous objects in public safety management and control, and virtual reality.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for detecting handheld objects in RGB video frames based on deep learning, characterized by comprising the following steps:
s10, acquiring a small-segment RGB video to be detected, and performing frame extraction processing on the small-segment RGB video to be detected through fixed frame sampling density to obtain an ordered video frame sequence;
s20, constructing a hand-object target area selection model, and inputting the ordered video frame sequence into the hand-object target area selection model to obtain a hand-object area picture sequence;
s30, constructing a collaborative saliency object detection model, and inputting the hand-object region picture sequence into the collaborative saliency object detection model to obtain a hand-object region mask;
s40, carrying out bit multiplication on the hand-object region picture sequence and the hand-object region mask to obtain a hand-object image;
S50, converting the hand-object image into an HSV format, and clustering the hand-object image in the HSV format by using a K-Means clustering algorithm to obtain the number of hand area pixels and the number of hand-held object area pixels;
s60, calculating the hand-object ratio through the number of the pixels of the hand area and the number of the pixels of the hand object area;
s70, judging whether the hand-object ratio is larger than a ratio threshold value theta, if so, indicating that an object is not held on the hand; if not, indicating that the hand holds the object, and obtaining a hand-held object image;
in the bit multiplication, the pixel values of the three RGB channels of each picture in the hand-object region picture sequence are multiplied by the pixel values of the corresponding single-channel hand-object region mask;
the hand-object ratio is obtained by dividing the number of hand area pixels by the number of hand object area pixels;
the ratio threshold θ is 0.1;
the hand-held object image is an aggregate area formed by all hand-held object area pixels on the hand-held object image.
2. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S10 comprises:
the small-section RGB video to be detected is a segment to be detected which is selected from the monitoring video, and the duration is 3 to 5 seconds;
the fixed frame sampling density is determined according to the length of the small-segment RGB video to be detected, and the default value is 3;
the ordered video frame sequence is represented as F = {f_1, f_2, …, f_B}, where f_t is the t-th frame picture, B is the number of pictures, and B and t are positive integers.
3. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S20 comprises:
S21, acquire the position information of the 21 hand key points in picture f_t through the Hand Landmark Detection algorithm in the MediaPipe tool;
the 21 hand key points are expressed as P = {p_1, p_2, …, p_21}, with p_i = (x_i, y_i),
where p_i is the i-th hand joint point in picture f_t, i is a positive integer, and x_i and y_i are the horizontal and vertical coordinates of point p_i; the coordinate system takes the upper left corner of the picture as the origin, with the x axis pointing right in the horizontal direction and the y axis pointing down in the vertical direction;
s22, calculating to obtain a minimum circumscribed rectangle containing 21 key points of the hand;
the minimum circumscribed rectangle is expressed as Rect = (x_c, y_c, w, h),
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the minimum circumscribed rectangle, and h is its height;
s23, calculating a boundary box of the target area through the minimum circumscribed rectangle;
the target area bounding box is expressed as T = (x_c, y_c, λ·w, λ·h),
where λ is the expansion ratio, λ = 3;
s24, creating a hand-object region picture sequence;
S25, crop the hand-object region picture g_t from picture f_t through the target area bounding box and add it to the hand-object region picture sequence;
s26, executing S21-S25 steps on each picture of the ordered video frame sequence to obtain a final hand-object region picture sequence;
the final hand-object region picture sequence is expressed as G = {g_1, g_2, …, g_B},
where g_t is the image cropped from each picture through its target area bounding box.
4. A method for detecting an RGB video frame handheld object based on deep learning as claimed in claim 3, wherein the minimum bounding rectangle of 21 key points of the hand comprises:
the center point coordinates, width and height of the minimum circumscribed rectangle Rect are calculated as
x_c = (x_min + x_max) / 2, y_c = (y_min + y_max) / 2, w = x_max - x_min, h = y_max - y_min,
where x_c is the abscissa of the center point of the minimum circumscribed rectangle, y_c is the ordinate of the center point, w is the width of the rectangle, h is its height, and x_min = min_i x_i, x_max = max_i x_i, y_min = min_i y_i, y_max = max_i y_i are the minimum and maximum values of the abscissas and ordinates of the 21 key points.
5. the method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S30 comprises:
s31, carrying out data enhancement and size normalization operation on each hand-object region picture;
the data enhancement and size normalization operation applies to each hand-object region picture the data enhancement Aug(·), which comprises translation, rotation and color transformation operations, the pixel-value normalization Norm(·), and the resizing operation Resize(·), which adjusts images of different resolutions to a specified size by linear interpolation or downsampling;
S32, extract features of the input pictures through the Encoder feature encoder to generate a batch of feature maps X, and obtain the feature vectors V through linear mapping;
the Encoder feature encoder uses ResNet101 as the backbone network;
the feature vectors obtained by the linear mapping are computed as X = Encoder(I), V = Linear(X),
where I is a batch of pictures, a tensor of shape (B, C, H, W), B is the number of pictures, C is the number of channels per picture (C = 3), and H and W are the height and width of the pictures after the resizing operation; V is a tensor of shape (B, Dim), where Dim is the dimension of the feature vector;
S33, compute the feature map F_co of the consistency salient object from the feature vectors V;
the feature map F_co is obtained by passing the feature vectors of all pictures through a cross-attention module;
the cross-attention module is used for extracting consistency information contained in the feature vectors of different pictures;
the consistency salients are objects commonly contained in all pictures;
s34, respectively overlapping the feature images of the consistent salient objects into feature images corresponding to each image, and obtaining a prediction mask of the salient objects in each image through a semantic segmentation module;
the prediction mask is calculated as M_t = Softmax(Seg(cat(X_t, F_co))),
where cat(·) is a concatenation operation and Seg(·) is the semantic segmentation module, which uses transposed convolution and upsampling to restore the feature map to the original picture size; the Softmax function is
Softmax(z_j) = exp(z_j) / Σ_c exp(z_c), with the sum running over the C classes,
where z_j is an element of the matrix, C is the number of classes, C is a positive integer, and here C = 2, representing foreground pixels and background pixels;
s35, performing cross entropy loss calculation on the prediction mask and a real mask label of the image to obtain a loss value;
s36, carrying out back propagation optimization network parameters through the loss value, and completing training of the collaborative saliency object detection model;
the real mask labels and the pictures used in the training process are provided by the public CoSOD3k dataset.
6. The method for detecting an RGB video frame handheld object based on deep learning of claim 1, wherein S50 comprises:
s51, converting the hand-object image from an RGB format to an HSV format;
s52, traversing foreground pixels of the hand-object image, and taking H, S, V values of each foreground pixel as characteristics of each foreground pixel;
s53, creating a data point set, and storing the characteristics of all foreground pixels in the hand-object image into the data point set;
s54, clustering the data point sets by using a K-Means clustering algorithm to obtain the number of clustered hand area pixels and the number of hand object area pixels;
the foreground pixels are the pixels in the hand-object image other than the background pixels, and the pixel value of a background pixel is [0, 0, 0];
the K-Means clustering algorithm sets the number of clustered class clusters to 2, namely k=2, and the two class clusters are hand area pixels and hand object area pixels respectively.
CN202311359648.3A 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning Active CN117095339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311359648.3A CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311359648.3A CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN117095339A true CN117095339A (en) 2023-11-21
CN117095339B (en) 2024-01-30

Family

ID=88770171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311359648.3A Active CN117095339B (en) 2023-10-20 2023-10-20 RGB video frame handheld object detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN117095339B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106387A (en) * 2011-11-15 2013-05-15 中国科学院深圳先进技术研究院 Method and device of image recognition
CN108062525A (en) * 2017-12-14 2018-05-22 中国科学技术大学 A kind of deep learning hand detection method based on hand region prediction
CN111382600A (en) * 2018-12-28 2020-07-07 山东华软金盾软件股份有限公司 Security video monochromatic shelter detection device and method
CN111553326A (en) * 2020-05-29 2020-08-18 上海依图网络科技有限公司 Hand motion recognition method and device, electronic equipment and storage medium
CN112001313A (en) * 2020-08-25 2020-11-27 北京深醒科技有限公司 Image identification method and device based on attribution key points
CN112016398A (en) * 2020-07-29 2020-12-01 华为技术有限公司 Handheld object identification method and device
DE102020127508A1 (en) * 2019-10-24 2021-04-29 Nvidia Corporation POSITION TRACKING OBJECTS IN HAND
CN114821764A (en) * 2022-01-25 2022-07-29 哈尔滨工程大学 Gesture image recognition method and system based on KCF tracking detection

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106387A (en) * 2011-11-15 2013-05-15 中国科学院深圳先进技术研究院 Method and device of image recognition
CN108062525A (en) * 2017-12-14 2018-05-22 中国科学技术大学 A kind of deep learning hand detection method based on hand region prediction
CN111382600A (en) * 2018-12-28 2020-07-07 山东华软金盾软件股份有限公司 Security video monochromatic shelter detection device and method
DE102020127508A1 (en) * 2019-10-24 2021-04-29 Nvidia Corporation POSITION TRACKING OBJECTS IN HAND
CN111553326A (en) * 2020-05-29 2020-08-18 上海依图网络科技有限公司 Hand motion recognition method and device, electronic equipment and storage medium
CN112016398A (en) * 2020-07-29 2020-12-01 华为技术有限公司 Handheld object identification method and device
WO2022022292A1 (en) * 2020-07-29 2022-02-03 华为技术有限公司 Method and device for recognizing handheld object
CN112001313A (en) * 2020-08-25 2020-11-27 北京深醒科技有限公司 Image identification method and device based on attribution key points
CN114821764A (en) * 2022-01-25 2022-07-29 哈尔滨工程大学 Gesture image recognition method and system based on KCF tracking detection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ABHINAV G. et al.: "SHOP: A Deep Learning Based Pipeline for Near Real-Time Detection of Small Handheld Objects Present in Blurry Video", SOUTHEASTCON 2022
KRISHNEEL C. et al.: "Learning to Segment Generic Handheld Objects Using Class-Agnostic Deep Comparison and Segmentation Network", IEEE ROBOTICS AND AUTOMATION LETTERS
刘祎莹: "Handheld object recognition technology and implementation based on zero-shot learning" (基于零样本学习的手持物识别技术与实现), China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库)

Also Published As

Publication number Publication date
CN117095339B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN108492343B (en) Image synthesis method for training data for expanding target recognition
CN109583483B (en) Target detection method and system based on convolutional neural network
US6961466B2 (en) Method and apparatus for object recognition
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN110334762B (en) Feature matching method based on quad tree combined with ORB and SIFT
CN109948549B (en) OCR data generation method and device, computer equipment and storage medium
CN112541491B (en) End-to-end text detection and recognition method based on image character region perception
CN110991258B (en) Face fusion feature extraction method and system
CN112784869B (en) Fine-grained image identification method based on attention perception and counterstudy
CN112396036B (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN110866900A (en) Water body color identification method and device
CN111325184B (en) Intelligent interpretation and change information detection method for remote sensing image
CN115147932A (en) Static gesture recognition method and system based on deep learning
CN111767854A (en) SLAM loop detection method combined with scene text semantic information
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
WO2022063321A1 (en) Image processing method and apparatus, device and storage medium
Wang et al. Perception-guided multi-channel visual feature fusion for image retargeting
CN117095339B (en) RGB video frame handheld object detection method based on deep learning
CN116432160A (en) Slider verification code identification method and system based on RPA and LBP characteristics
CN116503622A (en) Data acquisition and reading method based on computer vision image
KR100532129B1 (en) lip region segmentation and feature extraction method for Speech Recognition
CN112418344B (en) Training method, target detection method, medium and electronic equipment
CN111881732B (en) SVM (support vector machine) -based face quality evaluation method
CN116704518A (en) Text recognition method and device, electronic equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant