WO2023174098A1 - Real-time gesture detection method and apparatus - Google Patents

Real-time gesture detection method and apparatus

Info

Publication number
WO2023174098A1
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
feature
feature map
real
gesture detection
Prior art date
Application number
PCT/CN2023/080066
Other languages
English (en)
Chinese (zh)
Inventor
裴超
Original Assignee
百果园技术(新加坡)有限公司
裴超
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 裴超
Publication of WO2023174098A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the embodiments of the present application relate to the field of image processing technology, and in particular, to a real-time gesture detection method and device.
  • gestures can be used for emotional expression, interactive entertainment, virtual games, etc.
  • Gesture detection can directly obtain the position of the hand in the image and the type of gesture currently made, which is of great significance for the interaction of live broadcast and short video applications.
  • Existing gesture detection methods are mainly divided into two categories: gesture detection based on traditional features such as SIFT and gesture detection based on convolutional neural networks.
  • the former calculates the gesture position and category in the image by extracting some scale-invariant features in the image.
  • features are generally designed manually, and their ability to express features contained in images is very limited, making them prone to missed detections and false detections.
  • the latter extracts image features through a multi-layer convolutional neural network, and then returns the position and category of the gesture in the image.
  • However, a general convolutional neural network involves a huge amount of computation, while the computing power, memory and heat dissipation capacity of mobile devices are limited, so it cannot be directly applied to scenarios with high real-time requirements such as live broadcasts.
  • Embodiments of the present application provide a real-time gesture detection method and device to solve the technical problem in the related art that gesture recognition cannot meet real-time requirements because of the large amount of computation of convolutional neural networks and the limited processing capability of mobile devices; the method reduces the computation of gesture detection and can effectively meet the real-time requirements of gesture recognition.
  • In a first aspect, embodiments of the present application provide a real-time gesture detection method, applied to a real-time gesture detection device, including: acquiring an image to be recognized; inputting the image to be recognized into a trained gesture detection model, so that the gesture detection model outputs a gesture recognition result based on the image to be recognized, wherein the gesture detection model is configured to obtain multiple original feature maps of different levels of the input image based on a separable convolution structure and a residual structure, fuse the multiple original feature maps to obtain multiple fused feature maps, and perform gesture recognition based on the multiple fused feature maps to output the gesture recognition result; and determining the gesture type and gesture position based on the gesture recognition result output by the gesture detection model.
  • embodiments of the present application provide a real-time gesture detection device, including an image acquisition module, a gesture recognition module and a gesture determination module, wherein:
  • the image acquisition module is configured to acquire an image to be recognized
  • the gesture recognition module is configured to input the image to be recognized into a trained gesture detection model, so that the gesture detection model outputs a gesture recognition result based on the image to be recognized, wherein the gesture detection model is configured to obtain multiple original feature maps of different levels of the input image based on the separable convolution structure and the residual structure, fuse the multiple original feature maps to obtain multiple fused feature maps, and perform gesture recognition based on the multiple fused feature maps to output the gesture recognition result;
  • the gesture determination module is configured to determine the gesture type and gesture position based on the gesture recognition result output by the gesture detection model.
  • embodiments of the present application provide a real-time gesture detection device, including: a memory and one or more processors;
  • the memory is used to store one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the real-time gesture detection method as described in the first aspect.
  • embodiments of the present application provide a storage medium that stores computer-executable instructions, which when executed by a computer processor are used to perform the real-time gesture detection method as described in the first aspect.
  • embodiments of the present application provide a computer program product.
  • the computer program product includes a computer program.
  • the computer program is stored in a computer-readable storage medium.
  • At least one processor of the device reads the computer program from the computer-readable storage medium and executes it, causing the device to perform the real-time gesture detection method as described in the first aspect.
  • the embodiment of the present application performs gesture recognition by acquiring an image to be recognized and inputting the image to be recognized into a gesture detection model, and determines the gesture type and gesture position based on the gesture recognition result output by the gesture detection model.
  • The gesture detection model extracts multiple levels of original feature maps of the input image based on a separable convolution structure and a residual structure, which reduces the computation of feature extraction and gesture detection, and fuses the multiple original feature maps to obtain fused feature maps. The fused features enhance the target detection capability to make up for the performance loss caused by the reduction of parameters, and at the same time improve the detection of small targets and blurred scenes. Gesture recognition is then performed based on the fused feature maps and the gesture recognition result is output, which can effectively meet the real-time requirements of gesture recognition.
  • Figure 1 is a flow chart of a real-time gesture detection method provided by an embodiment of the present application
  • Figure 2 is a schematic flowchart of feature extraction from an input image provided by an embodiment of the present application
  • Figure 3 is a schematic structural diagram of a basic feature extraction network provided by an embodiment of the present application.
  • Figure 4 is a schematic flowchart of a fusion of original feature maps provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a feature fusion network provided by an embodiment of the present application.
  • Figure 6 is a schematic flowchart of gesture recognition on fused feature maps provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a separate detection head network provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of the relationship between a fusion feature map and a priori frame provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a real-time gesture detection device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a real-time gesture detection device provided by an embodiment of the present application.
  • Figure 1 shows a flow chart of a real-time gesture detection method provided by an embodiment of the present application.
  • The real-time gesture detection method provided by the embodiment of the present application can be executed by a real-time gesture detection apparatus, which can be implemented in hardware and/or software and integrated into a real-time gesture detection device.
  • the real-time gesture detection method includes:
  • S101 Acquire the image to be recognized. The image to be recognized can be obtained from videos or images on the Internet or in a local gallery, or captured in real time by a camera module mounted on the real-time gesture detection device. For example, a video application (such as live video software) is installed on a real-time gesture detection device (such as a mobile terminal); while video is being shot, each frame is used as an image to be recognized, and after the gesture type and gesture position are determined on the image to be recognized, subsequent processing can be performed based on the gesture type and gesture position.
  • S102 Input the image to be recognized into the trained gesture detection model, so that the gesture detection model outputs the gesture recognition result based on the image to be recognized, where the gesture detection model is configured to obtain multiple original feature maps of different levels of the input image based on the separable convolution structure and the residual structure, fuse the multiple original feature maps to obtain multiple fused feature maps, and perform gesture recognition based on the multiple fused feature maps to output the gesture recognition result.
  • In this embodiment, a trained gesture detection model is configured in the real-time gesture detection device. After the image to be recognized is acquired, it is input into the gesture detection model, and the gesture detection model performs gesture recognition on the received image to be recognized and outputs the corresponding gesture recognition result.
  • The gesture detection model is built based on the separable convolution structure and the residual structure. It obtains multiple original feature maps of different levels of the input image (that is, the image to be recognized), fuses the multiple original feature maps to obtain multiple fused feature maps, and performs gesture recognition based on the fused feature maps.
  • The gesture detection model provided by this solution extracts multiple levels of original feature maps of the input image based on separable convolution structures and residual structures, effectively reducing the computation of feature extraction and therefore of gesture detection, and fuses the multiple original feature maps to obtain fused feature maps. The fused features enhance the detection capability for targets to compensate for the performance loss caused by the reduced number of parameters, and at the same time improve the detection of small targets and blurred scenes, which can effectively meet the real-time requirements of gesture recognition.
  • the gesture detection model provided by this application includes a hierarchical feature extraction network, a feature fusion network and a separate detection head network connected in sequence.
  • the hierarchical feature extraction network is configured to obtain multiple original feature maps of different levels of the input image based on the separable convolution structure and the residual structure
  • the feature fusion network is configured to fuse the multiple original feature maps output by the hierarchical feature extraction network to obtain multiple fused feature maps
  • the separate detection head network is configured to perform gesture recognition based on multiple fused feature maps and output gesture recognition results.
  • the gesture recognition results output by the separate detection head network include predicted gesture categories, gesture confidence, and predicted gesture positions.
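  • For orientation only, a minimal PyTorch-style sketch of how these three sub-networks could be chained; the class names are hypothetical and this is not the patented implementation:

```python
# Hypothetical sketch of the overall structure described above:
# hierarchical feature extraction -> feature fusion -> separate detection head.
import torch.nn as nn

class GestureDetectionModel(nn.Module):
    def __init__(self, backbone, fusion, head):
        super().__init__()
        self.backbone = backbone  # hierarchical feature extraction network
        self.fusion = fusion      # feature fusion network
        self.head = head          # separate detection head network

    def forward(self, image):
        # multiple original feature maps of different levels
        original_feature_maps = self.backbone(image)
        # fuse the last three levels into fused feature maps
        fused_feature_maps = self.fusion(original_feature_maps)
        # gesture recognition result: predicted category, confidence, position
        return self.head(fused_feature_maps)
```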
  • the hierarchical feature extraction network provided by this application includes multiple serial basic feature extraction networks, and the basic feature extraction network at each level is configured to perform feature extraction on the input image to obtain the original feature map of the corresponding level. And the size of the original feature map output by the basic feature extraction network at each level is halved relative to the size of the input image, and the number of channels of the original feature map (number of convolution structure channels) is doubled relative to the number of channels of the input image.
  • the output image of one level of basic feature extraction network is used as the input image of the next level of basic feature extraction network.
  • Specifically, the first-level basic feature extraction network uses the acquired image to be recognized as its input image, halves the size and doubles the number of channels, performs feature extraction on the image to be recognized, and outputs the first-level original feature map. The first-level original feature map is then used as the input image of the second-level basic feature extraction network, which again halves the size and doubles the number of channels, performs feature extraction and outputs the second-level original feature map, and so on, so that the original feature map of each level is obtained.
  • In an embodiment, the hierarchical feature extraction network of this application is composed of 5 serial basic feature extraction networks. The size (length and width) of the original feature map obtained by each layer of the basic feature extraction network is reduced by half relative to its input image, so the downsampling step size of the entire hierarchical feature extraction network is 32. The resulting original feature maps (downsampling step size 32) are highly abstract and rich in high-level visual features.
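  • A minimal sketch of the level-by-level bookkeeping implied here; the 256x256 input size and the 32-channel first level are assumptions, chosen so that the last three levels come out at 128, 256 and 512 channels as quoted later in the description:

```python
# Illustrative shape bookkeeping for 5 serial basic feature extraction networks:
# each level halves the spatial size and doubles the channel count,
# so the overall downsampling step size is 2**5 = 32.
def level_shapes(size=256, first_out_channels=32, levels=5):
    shapes = []
    channels = first_out_channels
    stride = 1
    for level in range(1, levels + 1):
        size //= 2               # spatial size halved at every level
        stride *= 2              # cumulative downsampling step size doubles
        shapes.append((level, channels, size, stride))
        channels *= 2            # channel count doubled for the next level
    return shapes

for level, c, s, stride in level_shapes():
    print(f"level {level}: {c} channels, {s}x{s}, step size x{stride}")
# levels 3/4/5 give 128/256/512 channels at step sizes x8/x16/x32,
# matching the figures quoted later in the description.
```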
  • the basic feature extraction network provided by this application specifically includes steps S1021-S1023 when extracting features of the input image:
  • S1021 Use the basic convolution module to perform a convolution structure channel halving operation on the input image, and use the separable convolution module to perform feature extraction on the input image after the convolution structure channel is halved to obtain the feature extraction result.
  • The basic feature extraction network is built from the basic convolution module and the separable convolution module; the basic convolution module can be used to change the number of channels of the input, and the separable convolution module is used for the main feature extraction.
  • After the basic feature extraction network receives the input image (the input image of the first-level basic feature extraction network is the image to be recognized, and the input image of each subsequent level is the original feature map output by the previous level), it uses the basic convolution module to halve the number of channels of the input image to reduce the computation of feature extraction, and sends the channel-halved input image to the separable convolution module for feature extraction to obtain the feature extraction result.
  • S1022 Perform element-by-element addition on the input image and feature extraction results after the convolution structure channel is halved to obtain the element-by-element addition result, and perform a confusion operation on the element-by-element addition result through the basic convolution module to obtain the element-by-element addition confusion result.
  • Specifically, the input image whose channels were halved by the preceding basic convolution module and the feature extraction result output by the separable convolution module are added element by element (that is, corresponding pixels of the channel-halved input image and the feature extraction result are added) to obtain the element-wise addition result, and the basic convolution module is used to perform a confusion operation on the element-wise addition result to obtain the element-wise addition confusion result.
  • S1023 Concatenate the element-wise addition confusion result and the channel-halved input image in the channel dimension to obtain the concatenation result, and downsample the concatenation result to obtain the original feature map.
  • Specifically, the element-wise addition confusion result is concatenated, in the channel dimension, with the input image whose channels were halved by the other basic convolution module to obtain the concatenation result, and the concatenation result is then downsampled (with a downsampling step size of 2) to obtain the original feature map, whose size is halved and whose number of channels is doubled relative to the input image of the current-level basic feature extraction network.
  • efficient separable convolution and residual structure can be used to construct a basic feature extraction network.
  • the basic feature extraction network (Layer in the figure) provided by this solution is built based on the basic convolution module (CBL in the figure) and the separable convolution module (DwUnit in the figure).
  • the basic convolution module includes a 1*1 convolution kernel (1x1Conv in the figure), a BatchNorm normalization unit (BatchNorm in the figure) and a LeakyReLU activation function unit (LeakyReLU in the figure) connected in sequence.
  • The LeakyReLU activation function is a nonlinear activation function derived from the ReLU activation function. Compared with other activation functions, it has the advantages of high computational efficiency and fast convergence, and it reduces the sparsity problem of the ReLU activation function.
  • The separable convolution module includes a first basic convolution module (the CBL before DwCBL in the figure), a feature extraction module (DwCBL in the figure), and a second basic convolution module (the CBL after DwCBL in the figure), connected in sequence.
  • the feature extraction module includes a 3*3 depth separable convolution kernel (3x3DwConv in the figure), a BatchNorm normalization unit (BatchNorm in the figure) and a LeakyReLU activation function unit (LeakyReLU in the figure) connected in sequence.
  • The depthwise separable convolution DwConv differs from traditional convolution: each channel of the DwConv kernel only performs convolution with part of the channels of the input feature (the number of channels participating in the calculation can be preset), which greatly reduces the amount of computation, but the feature extraction capability of the feature extraction module DwCBL is correspondingly weakened. Therefore, the basic convolution module CBL is first used to increase the number of channels before the feature extraction module DwConv, and the basic convolution module CBL is used again to reduce the number of channels after DwConv.
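  • A hedged PyTorch-style sketch of these two building blocks as described (1x1 conv + BatchNorm + LeakyReLU for CBL; CBL expansion, 3x3 depthwise DwCBL, CBL reduction for DwUnit); the expansion ratio and the LeakyReLU slope are illustrative assumptions:

```python
import torch.nn as nn

def cbl(in_ch, out_ch, k=1, stride=1, groups=1):
    """Basic convolution module CBL: Conv + BatchNorm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),   # slope 0.1 is an assumed value
    )

class DwUnit(nn.Module):
    """Separable convolution module: CBL (expand) -> 3x3 depthwise DwCBL -> CBL (reduce)."""
    def __init__(self, channels, expand=2, stride=1):
        super().__init__()
        mid = channels * expand                                   # assumed expansion ratio
        self.expand = cbl(channels, mid, k=1)                     # raise channels before DwConv
        self.dwcbl = cbl(mid, mid, k=3, stride=stride, groups=mid)  # depthwise 3x3 convolution
        # when used with stride 2 for downsampling, double the output channels
        self.reduce = cbl(mid, channels if stride == 1 else channels * 2, k=1)

    def forward(self, x):
        return self.reduce(self.dwcbl(self.expand(x)))
```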
  • the basic feature extraction network Layer is built based on the basic convolution module CBL and the separable convolution module DwUnit.
  • Input in the figure is the image receiving module used to receive the input image. On each side, a basic convolution module CBL is used to halve the number of channels of the input image, reducing the data to be processed by half. The left side of the basic feature extraction network Layer in the figure is the residual structure built mainly from the separable convolution module DwUnit, while the right side performs no further operations after the channel halving.
  • On the left side, after the image receiving module Input is connected to a basic convolution module CBL, it is connected in sequence to the separable convolution module DwUnit, the element addition module Add, a basic convolution module CBL, the channel connection module concat and a separable convolution module DwUnit with step size 2 (Stride 2). On the right side, after the image receiving module Input is connected to a basic convolution module CBL, it is connected directly to the channel connection module concat. Together these form one basic feature extraction network layer of the hierarchical feature extraction network.
  • Specifically, after the input image is received, its channels are halved by the basic convolution modules CBL on the two sides respectively. On the left side, the channel-halved input image is passed through the separable convolution module DwUnit, which performs feature extraction on it to obtain the feature extraction result. The element addition module Add adds the channel-halved input image and the feature extraction result element by element to obtain the element-wise addition result, and the basic convolution module CBL after the element addition module Add performs a confusion operation on this result to obtain the element-wise addition confusion result. The channel connection module concat then connects the element-wise addition confusion result output by the left-side basic convolution module CBL and the channel-halved input image output by the right-side basic convolution module CBL (that is, the outputs of the left and right basic convolution modules CBL are concat-connected in the channel dimension) to obtain the connection result. Finally, the connection result is downsampled by the separable convolution module DwUnit with step size 2 to obtain the original feature map, whose size is halved and whose channels are doubled relative to the input image.
  • The basic feature extraction network provided by this solution performs the main convolution operations only on the left half of the channels, roughly halving the amount of computation, while the residual structure maintains information flow through the deep network well.
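  • Building on the cbl and DwUnit sketches above, one possible arrangement of the basic feature extraction network Layer following the data flow just described; the channel-split choice and module names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    """One basic feature extraction network (sketch): channel-split residual branch,
    channel concat, and a stride-2 downsample that halves the size and doubles the channels."""
    def __init__(self, in_ch):
        super().__init__()
        half = in_ch // 2
        self.left_cbl = cbl(in_ch, half)        # left branch: halve channels
        self.right_cbl = cbl(in_ch, half)       # right branch: halve channels, no further ops
        self.dwunit = DwUnit(half)              # feature extraction on the left branch
        self.mix_cbl = cbl(half, half)          # "confusion" op after element-wise addition
        self.down = DwUnit(in_ch, stride=2)     # stride-2 downsample, doubles channels

    def forward(self, x):
        left = self.left_cbl(x)
        right = self.right_cbl(x)
        feat = self.dwunit(left)                     # feature extraction result
        mixed = self.mix_cbl(left + feat)            # element-wise add, then confusion
        merged = torch.cat([mixed, right], dim=1)    # concat in the channel dimension
        return self.down(merged)                     # original feature map of this level
```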
  • five of the above-mentioned basic feature extraction networks are used to form a hierarchical feature extraction network.
  • the length and width of the original feature map obtained at each layer are reduced by half, and the downsampling step size of the entire network is 32.
  • the final original feature map inevitably loses some basic features, and some targets may be missed. Therefore, the original feature maps of different levels can be fused, and feature fusion and confusion can be used to enhance the gesture recognition capability.
  • When the feature fusion network fuses the multiple original feature maps output by the hierarchical feature extraction network to obtain multiple fused feature maps, it specifically fuses the last three layers of original feature maps output by the hierarchical feature extraction network to obtain three fused feature maps. For example, the fusion may adopt an element-wise (feature multiplication and addition) fusion method.
  • The process in which the feature fusion network fuses the last three layers of original feature maps output by the hierarchical feature extraction network to obtain multiple fused feature maps, as shown in Figure 4, includes steps S1024-S1026:
  • S1024 Perform downsampling step size halving and channel halving operations on the last layer of original feature maps output by the hierarchical feature extraction network to obtain the first intermediate feature map, and add the first intermediate feature map and the penultimate layer of original feature maps output by the hierarchical feature extraction network element by element to obtain the second fused feature map.
  • In this embodiment, the last three layers of original feature maps output by the hierarchical feature extraction network are used as the basis for fusion. Since the sizes of the original feature maps at different levels differ, taking a hierarchical feature extraction network with 5 basic feature extraction networks as an example, the downsampling step sizes of the original feature maps in the last three stages are x8, x16 and x32 respectively, and the corresponding channel numbers are 128, 256 and 512 respectively. The downsampling step size and channel number therefore need to be adjusted before fusing the original feature maps, so that the two original feature maps used for fusion are at the same downsampling step size and channel number level. For example, the downsampling step size and channel number of the last layer of original feature maps are twice those of the penultimate layer of original feature maps.
  • This solution performs downsampling step size halving and channel halving operations on the last layer of original feature maps to obtain the first intermediate feature map, and the first intermediate feature map and the penultimate layer of original feature maps output by the hierarchical feature extraction network are added element by element (for example, corresponding pixels of the first intermediate feature map and the penultimate-layer original feature map are added) to obtain the second fused feature map. Feature confusion processing can be further performed on the second fused feature map to further enhance its feature expression ability.
  • S1025 Perform downsampling step size halving and channel halving operations on the second fused feature map to obtain the second intermediate feature map, and add the second intermediate feature map and the third-to-last layer of original feature maps output by the hierarchical feature extraction network element by element to obtain the third fused feature map.
  • The fusion of the last layer of original feature maps and the penultimate layer of original feature maps is performed as described above. Since the second fused feature map combines the characteristics of the last and penultimate layers of original feature maps, its feature expression ability is stronger. Based on this, the second fused feature map can be used in place of the penultimate-layer original feature map at this stage, that is, fusion proceeds with the second fused feature map and the third-to-last layer of original feature maps: the second fused feature map undergoes downsampling step size halving and channel halving to obtain the second intermediate feature map, and the second intermediate feature map and the third-to-last layer of original feature maps output by the hierarchical feature extraction network are added element by element (for example, corresponding pixels of the second intermediate feature map and the third-to-last-layer original feature map are added) to obtain the third fused feature map. Feature confusion processing can be further performed on the third fused feature map to further enhance its feature expression ability.
  • S1026 Perform a downsampling step size doubling operation on the second fused feature map to obtain the third intermediate feature map, and add the third intermediate feature map and the last layer of original feature maps output by the hierarchical feature extraction network element by element to obtain the first fused feature map.
  • Specifically, the second fused feature map is subjected to a downsampling step size doubling operation to obtain the third intermediate feature map, and the third intermediate feature map is added element by element to the last layer of original feature maps output by the hierarchical feature extraction network (for example, corresponding pixels of the third intermediate feature map and the last-layer original feature map are added) to obtain an enhanced high-level feature map, namely the first fused feature map, which can be used to detect large targets in the image to be recognized.
  • As shown in Figure 5, F5, F4 and F3 in the figure are the last, penultimate and third-to-last layers of original feature maps output by the hierarchical feature extraction network; their downsampling step sizes are x32, x16 and x8 respectively, and their channel numbers are 512, 256 and 128 respectively. For the original feature map F5, the x2 upsampling module (UpSample) and the basic convolution module (1x1CBL) are used to halve the downsampling step size of F5 (reducing it to x16) and halve its channels (reducing them to 256), so that the first intermediate feature map P5 is obtained; P5 and the original feature map F4 are fused element by element to obtain the second fused feature map FF2. Similarly, the upsampling module (UpSample) and the basic convolution module (1x1CBL) are used to halve the downsampling step size of the second fused feature map FF2 (reducing it to x8) and halve its channels (reducing them to 128), so that the second intermediate feature map P4 is obtained; P4 and the original feature map F3 are fused element by element to obtain the third fused feature map FF3. Finally, the second fused feature map FF2, after the downsampling step size doubling operation, and the original feature map F5 are fused element by element to obtain the first fused feature map FF1.
  • the second fusion feature map FF2 and the third fusion feature map FF3 both adopt the forward feature fusion method.
  • The third fused feature map FF3 combines the perceptual features of the original feature maps F3, F4 and F5 and has a larger visual receptive field, so it can better detect small targets and handle blurry scenes, while the enhanced first fused feature map FF1 is mainly used to detect large targets, and the second fused feature map FF2 balances the two.
  • the three fusion feature maps complement each other and effectively improve the performance of gesture detection.
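  • For illustration, a PyTorch-style sketch of this fusion, reusing the cbl helper sketched earlier; the use of a stride-2 convolution for the downsampling step size doubling operation is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionNetwork(nn.Module):
    """Fuses the last three original feature maps F3, F4, F5 into FF1, FF2, FF3 (sketch)."""
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.reduce_f5 = cbl(c5, c4)                 # halve channels of F5 before fusing with F4
        self.reduce_ff2 = cbl(c4, c3)                # halve channels of FF2 before fusing with F3
        self.down_ff2 = cbl(c4, c5, k=3, stride=2)   # assumed downsampling step size doubling op

    def forward(self, f3, f4, f5):
        p5 = self.reduce_f5(F.interpolate(f5, scale_factor=2))   # x2 upsample + channel halving
        ff2 = p5 + f4                                             # second fused feature map
        p4 = self.reduce_ff2(F.interpolate(ff2, scale_factor=2))
        ff3 = p4 + f3                                             # third fused feature map
        ff1 = self.down_ff2(ff2) + f5                             # first fused feature map
        return ff1, ff2, ff3
```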
  • S1027 For each fused feature map, separate the fused feature map through the basic convolution module to obtain the first separated feature map, the second separated feature map and the third separated feature map.
  • S1028 Determine the predicted gesture category based on the first separated feature map, determine the gesture confidence based on the second separated feature map, and determine the predicted gesture position based on the third separated feature map.
  • For each fused feature map (including the first fused feature map FF1, the second fused feature map FF2 and the third fused feature map FF3 provided above), three basic convolution modules (1x1CBL) are used to separate the fused feature map into three branches, namely the first separated feature map, the second separated feature map and the third separated feature map. These three branches are used to predict the gesture category, gesture confidence and hand position respectively, and finally the outputs of the three branches are merged as the final output.
  • A 1x1conv convolution kernel may be used to determine the predicted gesture category from the first separated feature map, a 1x1conv convolution kernel may be used to determine the gesture confidence from the second separated feature map, and a 1x1conv convolution kernel may be used to determine the predicted gesture position from the third separated feature map. The outputs of the three branches are then concat-connected to output the gesture recognition result, which includes the predicted gesture category, the gesture confidence and the predicted gesture position.
  • a basic convolution module (1x1CBL) can be used to reduce the number of channels corresponding to the separated feature maps and reduce the amount of calculation.
  • the softmax normalization module can be used to normalize the predicted gesture category.
  • The sigmoid normalization module can be used to normalize the gesture confidence to between 0 and 1: if the normalized gesture confidence is greater than 0.5, the prior box contains a valid target; if it is less than 0.5, the prior box does not contain a valid target.
  • Specifically, for the first separated feature map, a 1x1conv is used to obtain an output of the same dimension as the preset number of categories (the preset number of categories is the number of annotated gesture categories; for example, if 10 gestures need to be recognized, probabilities for the 10 gestures are output and the one with the highest probability is taken as the predicted gesture category); softmax is then used to normalize the category probabilities, and the category with the highest probability is determined as the predicted gesture category.
  • For the second separated feature map, a 1x1conv is used to obtain the gesture confidence, and the sigmoid function normalizes it to between 0 and 1: if the output is greater than 0.5, the prior box contains a valid target; if it is less than 0.5, it does not. For the third separated feature map, a 1x1conv is used to obtain the predicted gesture position.
  • the three branches are connected through concat, and the gesture recognition results including the predicted gesture category, gesture confidence and predicted gesture position are output.
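  • A sketch of one separate detection head, reusing the cbl helper from the earlier sketch; the intermediate channel width is an assumption, and 3 prior boxes per point follows the description later in the text:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Separate detection head for one fused feature map (sketch).
    Predicts class scores, objectness confidence and box encoding per prior box."""
    def __init__(self, in_ch, num_classes, num_anchors=3, mid_ch=128):
        super().__init__()
        self.split_cls = cbl(in_ch, mid_ch)    # first separated feature map
        self.split_obj = cbl(in_ch, mid_ch)    # second separated feature map
        self.split_box = cbl(in_ch, mid_ch)    # third separated feature map
        self.cls = nn.Conv2d(mid_ch, num_anchors * num_classes, 1)
        self.obj = nn.Conv2d(mid_ch, num_anchors, 1)
        self.box = nn.Conv2d(mid_ch, num_anchors * 4, 1)   # (tx, ty, tw, th) encodings

    def forward(self, x):
        cls = self.cls(self.split_cls(x))                   # softmax over classes at inference
        obj = torch.sigmoid(self.obj(self.split_obj(x)))    # confidence in (0, 1)
        box = self.box(self.split_box(x))                   # position encoding branch
        return torch.cat([box, obj, cls], dim=1)            # merged output of the three branches
```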
  • For the prediction of the gesture position, this solution can use a grid position encoding based on prior boxes (anchors) to represent the position information. The model needs to predict the target position (the position of the target frame, i.e. the predicted box containing the target), but the range of directly predicted positions is too large. This solution therefore sets prior boxes, and the predicted target position is the prior box plus an offset (the encoding).
  • the predicted gesture position is represented based on the grid position coding of the target frame.
  • the grid position coding is used to represent the coding coordinates of the target frame in the feature grid.
  • the feature grid is obtained by dividing the fused feature map according to the set unit length.
  • The predicted gesture position is determined based on the decoded coordinates of the target frame on the fused feature map, the decoded size, and the downsampling step size of the fused feature map. That is, the absolute coordinates of the target frame on the corresponding fused feature map are determined from the predicted decoded coordinates and decoded size of the target frame, and these coordinates are then multiplied by the downsampling step size of the fused feature map to obtain the global absolute coordinates of the target frame on the image to be recognized, which are the predicted gesture position.
  • As shown in Figure 8, the figure shows a prior box (dashed box) and the fused feature map. With N denoting the size of the fused feature map, the fused feature map is divided into NxN feature grids (cells), and the length and width of each feature grid are 1. This solution uses relative offset coordinate encoding: the encoding result is predicted, and by decoding the predicted result, the global absolute coordinates of the target (the preset gesture) in the image to be recognized can be obtained, where:
  • (b_x, b_y) are the decoded coordinates of the center of the target frame on the fused feature map
  • (c_x, c_y) are the coordinates of the upper-left corner of the current feature grid
  • σ(t_x) and σ(t_y) are the offsets of the prior box from the upper-left corner of the current feature grid
  • (t_x, t_y) are the encoded coordinates of the center of the prior box on the fused feature map.
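  • Written out explicitly (as an assumption, following the standard grid-offset decoding implied by these definitions), the center decoding would be:

```latex
b_x = c_x + \sigma(t_x), \qquad b_y = c_y + \sigma(t_y)
```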
  • the decoding size provided by this solution can be determined based on the following formula:
  • b_h and b_w are the length and width of the decoded size of the target frame
  • p_h and p_w are the length and width of the encoded size of the prior box
  • t_h and t_w are the exponential coefficients obtained by training the gesture detection model.
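  • Under the same assumption (the standard exponential anchor-size decoding consistent with the definitions above), the decoding size would be:

```latex
b_w = p_w \, e^{t_w}, \qquad b_h = p_h \, e^{t_h}
```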
  • The calculated b_x, b_y, b_h and b_w are then multiplied by the downsampling step size of the fused feature map to obtain the global absolute coordinates of the target frame on the image to be recognized, and these global absolute coordinates are the predicted gesture position.
  • For training the gesture detection model, different types of gesture pictures are collected, the gesture targets in the pictures are manually annotated (including gesture types and gesture positions), and a training set and a validation set are constructed. Back-propagation and gradient descent are used to iteratively train the model and continuously update its parameters based on the loss function. After the gesture detection model converges on the validation set, its parameters are saved and the model file of the gesture detection model is output.
  • On a real-time gesture detection device such as a mobile application product, the saved gesture detection model file is loaded through a neural network inference framework, the image to be recognized is used as input, and the forward computation of the gesture detection model is performed to obtain the gesture categories and positions contained in the image to be recognized. These results (gesture categories and positions) can be used as input signals for other technology chains such as special-effect rendering, meeting various mobile application requirements.
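  • As a loose illustration of this deployment flow (the inference framework call, file name and tensor shape are placeholders, not taken from the original):

```python
import torch

# Hypothetical on-device inference loop: load the exported model file once,
# then run the forward pass on each captured frame.
model = torch.jit.load("gesture_detector.pt")   # placeholder file name
model.eval()

def detect_gestures(frame_tensor):
    """frame_tensor: preprocessed image tensor of shape (1, 3, H, W)."""
    with torch.no_grad():
        raw_result = model(frame_tensor)        # forward computation only
    # Box decoding and confidence thresholding, as described elsewhere in this
    # document, would be applied to raw_result before use.
    return raw_result
```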
  • This solution uses an end-to-end network structure.
  • the end-to-end supervised training method is also used to train the gesture detection model, and the stochastic gradient descent method can be used for optimization and solution.
  • the detection network used in this solution has three prediction branches that already have a priori frames, so an optimized joint training method can be used to train the gesture detection model.
  • The gesture detection model is trained based on a joint loss function, where the joint loss function is determined based on whether the prior box contains a target, the coordinate error between the prior box and the prediction box, and the loss values between the predicted target and the matching real target.
  • the joint loss function provided by this solution can be determined based on the following formula:
  • W is the width of the fused feature map
  • H is the length of the fused feature map
  • A is the number of a priori boxes for each point on the fused feature map
  • maxiou is the maximum overlap ratio between each a priori box and all real targets.
  • Overlap ratio thresh is the set overlap ratio screening threshold
  • λ_noobj is the set negative-sample loss function weight
  • o is the target score corresponding to the prior box
  • t is the number of training iterations
  • λ_prior is the weight of the warmup loss function
  • r represents the predicted coordinates
  • λ_coord is the loss function weight of the coordinates
  • truth^r is the coordinate value of the labeled target in the training sample
  • λ_obj is the loss function weight of whether the target is included
  • truth^c is the category of the labeled target
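  • A joint loss consistent with these symbol definitions and with the loss1/loss2/loss3 description below (essentially a YOLOv2-style region loss; this exact form is an assumption) can be sketched as:

```latex
L = \sum_{i}^{W}\sum_{j}^{H}\sum_{k}^{A}\Big[
      \mathbb{1}_{\mathrm{maxiou}_{ijk}<\mathrm{thresh}}\,\lambda_{\mathrm{noobj}}\,o_{ijk}^{2}
    + \mathbb{1}_{t<12800}\,\lambda_{\mathrm{prior}}\sum_{r}\big(\mathrm{prior}^{r}-b^{r}_{ijk}\big)^{2}
    + \mathbb{1}^{\mathrm{truth}}_{ijk}\Big(\lambda_{\mathrm{coord}}\sum_{r}\big(\mathrm{truth}^{r}-b^{r}_{ijk}\big)^{2}
    + \lambda_{\mathrm{obj}}\big(\mathrm{IoU}^{\mathrm{truth}}_{ijk}-o_{ijk}\big)^{2}
    + \mathrm{CE}\big(\mathrm{truth}^{c},\operatorname{softmax}(p_{ijk})\big)\Big)\Big]
```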
  • The first loss function loss1 is used to determine whether a prediction box contains a target. First, the overlap ratio (Intersection-over-Union, IoU) between each prediction box and all labeled ground-truths is calculated and the maximum value maxiou is taken; if this value is less than the preset threshold (a preset hyperparameter, such as 0.65), the prediction box is marked as the background category, and the noobj (negative sample) confidence error is calculated for it. Here, a real target is a gesture annotated on the sample image.
  • The second loss function loss2 is used to calculate the coordinate error between the prior boxes and the prediction boxes, but it is only calculated for the first 12,800 iterations (this process is called warmup; it enhances the shape convergence of the prediction boxes and effectively speeds up overall training). This second loss function mainly enables the gesture detection model to quickly learn the length and width of the prior boxes and speeds up the convergence of the overall training.
  • The third loss function loss3 is used to calculate the various loss values between a predicted target and its matching real target (ground-truth). Each feature grid on the fused feature map predicts 3 target boxes, while the number of real targets in an image is very small; each real target corresponds to only one prediction box, which is the positive sample, and the remaining prediction boxes are negative samples.
  • This solution can use matching to distinguish positive and negative samples: for a real target, first determine which feature grid its center point falls in, then calculate the IoU between each of that feature grid's 3 prior boxes and the real target (since coordinates are not considered in this IoU, only the shape, their upper-left corners can be shifted to the origin before calculation), and select the prior box with the largest IoU as the match; a sketch of this rule is given below. The prediction box corresponding to this prior box is the positive sample and is used for subsequent calculations; all prediction boxes not matched by any real target are negative samples, so the number of negative samples is particularly large. For negative samples, this solution follows the setting of the first loss function and only selects prediction boxes whose maxiou is less than the threshold for calculation; the remaining prediction boxes are discarded.
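  • A small sketch of this matching rule (shape-only IoU between a real target and the 3 prior boxes of its grid cell; the function names and example prior sizes are illustrative):

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes aligned at the origin, i.e. comparing shapes only."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def match_prior(truth_wh, prior_whs):
    """Return the index of the prior box whose shape best matches the real target.
    The prediction box of that prior (in the grid cell containing the target's
    center) becomes the positive sample; all unmatched boxes are negatives."""
    ious = [shape_iou(truth_wh, p) for p in prior_whs]
    return max(range(len(prior_whs)), key=lambda k: ious[k])

# Example: a labeled hand of size (0.8, 1.2) grid units against 3 assumed priors.
priors = [(0.5, 0.7), (1.0, 1.4), (2.0, 2.6)]
print(match_prior((0.8, 1.2), priors))   # -> 1
```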
  • The loss for positive samples is also divided into three parts, corresponding to the three prediction branches (the first, second and third separated feature maps): the first item is the coordinate loss between the prediction box and the real target, computed with a squared-difference loss; the second item is the confidence loss, where a smaller IoU gives a larger loss value; the third item is the classification loss, where the category corresponding to the real target is 1 and the other categories are 0, computed with a cross-entropy loss on the softmax output.
  • This solution uses a jointly optimized joint loss function to directly train the entire gesture detection model at once, and at the same time sets a corresponding matching mechanism for positive and negative samples to reduce poor training results due to an imbalance in the number of positive and negative samples.
  • After receiving the image to be recognized, the gesture detection model performs gesture recognition on it and outputs the corresponding gesture recognition result. Based on this result, it can be determined whether a gesture of a set type is recognized in the image to be recognized and, when it is, which gesture type is recognized and the gesture position corresponding to each gesture type. For example, the predicted gesture category, gesture confidence and predicted gesture position in the gesture recognition result are examined; the gesture categories and predicted gesture positions whose gesture confidence reaches the set confidence threshold are determined as the gesture type and gesture position. When the gesture confidence corresponding to every gesture category and predicted gesture position is less than the set confidence threshold, it is determined that no target gesture is recognized in the image to be recognized.
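  • For illustration, a minimal post-processing sketch (the tuple-based result format and the example labels are assumptions; the 0.5 threshold follows the sigmoid discussion above):

```python
def select_gestures(predictions, conf_threshold=0.5):
    """predictions: iterable of (category, confidence, box) tuples produced by the model.
    Returns the recognized gesture types and positions, or an empty list when no
    confidence reaches the threshold (no target gesture in the image)."""
    return [(category, box) for category, confidence, box in predictions
            if confidence >= conf_threshold]

# Example with assumed values: one confident "heart" gesture, one rejected candidate.
detections = [("heart", 0.91, (120, 80, 60, 60)), ("ok", 0.32, (40, 40, 30, 30))]
print(select_gestures(detections))   # -> [('heart', (120, 80, 60, 60))]
```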
  • the gesture response mode and gesture response position may be determined based on the gesture type. Determining the gesture response mode may be determining a special effect type for special effect rendering, and correspondingly, the gesture response position may be a rendering position of the corresponding special effect.
  • multiple different types of special effect information can be configured in the real-time gesture detection device, special effects can be rendered according to the special effect information and the corresponding special effects can be displayed on the interactive interface.
  • For example, the live video software is configured with the gesture detection model and with special effect information of the special effect type "heartbeat". When the anchor starts the live video software for a live broadcast, the software submits the video frames collected in real time to the gesture detection model, and the gesture detection model outputs a gesture recognition result indicating that a "heart comparison" gesture type is detected at a certain location. The live video software determines from the gesture recognition result that the "heart comparison" gesture has been detected, and can determine that the corresponding special effect type is "heartbeat".
  • gesture recognition is performed by obtaining the image to be recognized and inputting the image to be recognized into the gesture detection model, and determining the gesture type and gesture position according to the gesture recognition result output by the gesture detection model.
  • The gesture detection model extracts multiple levels of original feature maps of the input image based on the separable convolution structure and the residual structure to reduce the computation of feature extraction and gesture detection, and fuses the multiple original feature maps to obtain fused feature maps. The fused features enhance the target detection capability to make up for the performance loss caused by the reduction in parameter size, and at the same time improve the detection of small targets and blurry scenes. Gesture recognition is then performed based on the fused feature maps and the gesture recognition result is output, which can effectively meet the real-time requirements of gesture recognition.
  • In this solution, the number of model parameters is reduced through separable convolution, and a channel-level residual structure is used to reduce the number of input channels for convolution calculations, thereby achieving a lightweight gesture detection model and effectively reducing the amount of model parameters and computation; training objectives such as warmup, position and category are unified into a joint loss function for joint optimization to speed up model convergence and improve operating efficiency.
  • the residual structure and feature fusion are used to compensate for the performance loss caused by the reduction in parameter size, and at the same time enhance the detection effect of small targets and blurry scenes, effectively solving the problem of poor end-to-end detection performance for small targets and blurred backgrounds.
  • the prediction of gesture position is expressed using coding, which reduces the prediction error caused by the difference in coordinate extreme values and speeds up the convergence speed of training.
  • This solution does not use connection layers such as full connection and pooling, but extracts the features of the image to be recognized through the convolutional neural network. It then regresses and outputs the positions and categories of all gestures in the image to be recognized based on the features.
  • The use of depthwise separable convolution and a feature pyramid structure takes into account both computational efficiency and feature extraction granularity, which can effectively reduce the calculation scale of the network while ensuring the accuracy of the neural network, and can achieve a better gesture recognition effect in mobile applications.
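  • A quick arithmetic check of the saving claimed here, comparing a standard 3x3 convolution with a depthwise 3x3 plus pointwise 1x1 combination (channel counts and feature-map size are assumed for illustration):

```python
def conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulate operations of a standard k x k convolution."""
    return c_in * c_out * k * k * h * w

def depthwise_separable_macs(c_in, c_out, k, h, w):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    return c_in * k * k * h * w + c_in * c_out * h * w

# Assumed example: 128 -> 128 channels on a 40 x 40 feature map.
standard = conv_macs(128, 128, 3, 40, 40)
separable = depthwise_separable_macs(128, 128, 3, 40, 40)
print(standard, separable, round(standard / separable, 1))  # roughly 8x fewer MACs
```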
  • Figure 9 is a schematic structural diagram of a real-time gesture detection device provided by an embodiment of the present application.
  • the real-time gesture detection device includes an image acquisition module 21 , a gesture recognition module 22 and a gesture determination module 23 .
  • the image acquisition module 21 is configured to acquire the image to be recognized;
  • the gesture recognition module 22 is configured to input the image to be recognized into the trained gesture detection model, so that the gesture detection model outputs the gesture recognition result based on the image to be recognized, wherein the gesture detection model is configured to obtain multiple original feature maps of different levels of the input image based on the separable convolution structure and the residual structure, fuse the multiple original feature maps to obtain multiple fused feature maps, and perform gesture recognition based on the multiple fused feature maps to output the gesture recognition result; the gesture determination module 23 is configured to determine the gesture type and gesture position based on the gesture recognition result output by the gesture detection model.
  • gesture recognition is performed by obtaining the image to be recognized and inputting the image to be recognized into the gesture detection model, and determining the gesture type and gesture position according to the gesture recognition result output by the gesture detection model.
  • The gesture detection model extracts multiple levels of original feature maps of the input image based on the separable convolution structure and the residual structure to reduce the computation of feature extraction and gesture detection, and fuses the multiple original feature maps to obtain fused feature maps. The fused features enhance the target detection capability to make up for the performance loss caused by the reduction in parameter size, and at the same time improve the detection of small targets and blurry scenes. Gesture recognition is then performed based on the fused feature maps and the gesture recognition result is output, which can effectively meet the real-time requirements of gesture recognition.
  • the gesture detection model includes a hierarchical feature extraction network, a feature fusion network and a separate detection head network, where:
  • a hierarchical feature extraction network configured to obtain multiple different levels of original feature maps of the input image based on the separable convolution structure and the residual structure;
  • the feature fusion network is configured to fuse multiple original feature maps output by the hierarchical feature extraction network to obtain multiple fused feature maps
  • a separate detection head network configured to perform gesture recognition based on multiple fused feature maps and output gesture recognition results, which include predicted gesture categories, gesture confidence, and predicted gesture positions.
  • the hierarchical feature extraction network includes multiple serial basic feature extraction networks, and the basic feature extraction network at each level is configured to perform feature extraction on the input image to obtain the original feature map of the corresponding level, where, The size of the original feature map is halved relative to the size of the input image, and the number of channels of the original feature map is doubled relative to the number of channels of the input image.
  • the basic feature extraction network includes a feature extraction module, an element addition confusion module and a data connection module, where:
  • the feature extraction module is configured to perform a convolution structure channel halving operation on the input image through the basic convolution module, and perform feature extraction on the input image after the convolution structure channel is halved through the separable convolution module to obtain the feature extraction result;
  • the element-wise addition confusion module is configured to perform element-by-element addition on the input image and feature extraction results after the convolution structure channel is halved to obtain the element-by-element addition result, and performs a confusion operation on the element-by-element addition result through the basic convolution module. Obtain the confusing result of element addition;
  • the data connection module is configured to perform string concatenation on the element addition confusion result and the input image after the convolution structure channel is halved to obtain the concatenation result, and perform downsampling on the concatenation result to obtain the original feature map.
  • the basic convolution module includes a 1*1 convolution kernel, a BatchNorm normalization unit and a LeakyReLU activation function unit that are connected in sequence
  • the separable convolution module includes a first basic convolution module, a feature extraction module and a second basic convolution module that are connected in sequence;
  • the feature extraction module includes a 3*3 depthwise separable convolution kernel, a BatchNorm normalization unit and a LeakyReLU activation function unit connected in sequence.
  • the hierarchical feature extraction network includes 5 serial basic feature extraction networks.
  • the feature fusion network is configured to fuse the last three layers of original feature maps output by the hierarchical feature extraction network to obtain three fused feature maps.
  • the feature fusion network includes a first fusion module, a second fusion module and a third fusion module, where:
  • the second fusion module is configured to perform downsampling step size halving and channel halving operations on the last layer of original feature maps output by the hierarchical feature extraction network to obtain the first intermediate feature map, and add the first intermediate feature map and the penultimate layer of original feature maps output by the hierarchical feature extraction network element by element to obtain the second fused feature map;
  • the third fusion module is configured to perform downsampling step size halving and channel halving operations on the second fused feature map to obtain the second intermediate feature map, and add the second intermediate feature map and the third-to-last layer of original feature maps output by the hierarchical feature extraction network element by element to obtain the third fused feature map;
  • the first fusion module is configured to perform a downsampling step doubling operation on the second fusion feature map to obtain a third intermediate feature map, and perform step-by-step processing on the third intermediate feature map and the last layer of original feature map output by the hierarchical feature extraction network. The elements are added to obtain the first fusion feature map.
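The three fusion modules described above can be sketched in the same style. The channel counts (128/256/512 for the last three original feature maps), nearest-neighbour upsampling, and the assumption that the stride-doubling branch also restores the channel count so that the element-wise addition with the last original feature map is well defined, are all illustrative choices not spelled out in the items above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(in_ch, out_ch, k=1, s=1):
    """k x k convolution + BatchNorm + LeakyReLU, used here as the basic fusion building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class FeatureFusion(nn.Module):
    """Fuses the last three backbone maps c3 (stride 8), c4 (stride 16), c5 (stride 32)."""
    def __init__(self, ch3=128, ch4=256, ch5=512):
        super().__init__()
        self.halve5 = conv_bn_act(ch5, ch4)        # channel halving before upsampling c5
        self.halve4 = conv_bn_act(ch4, ch3)        # channel halving before upsampling the fused map
        # Assumed: the stride-doubling branch also restores the channel count so that
        # the element-wise addition with c5 is well defined.
        self.double = conv_bn_act(ch4, ch5, k=3, s=2)

    def forward(self, c3, c4, c5):
        # second fusion module: downsampling step halved (2x upsample), channels halved, add to c4
        m1 = F.interpolate(self.halve5(c5), scale_factor=2, mode="nearest")
        f2 = m1 + c4                               # second fused feature map (stride 16)
        # third fusion module: same operation on f2, add to c3
        m2 = F.interpolate(self.halve4(f2), scale_factor=2, mode="nearest")
        f3 = m2 + c3                               # third fused feature map (stride 8)
        # first fusion module: downsampling step doubled (2x downsample), add to c5
        m3 = self.double(f2)
        f1 = m3 + c5                               # first fused feature map (stride 32)
        return f1, f2, f3

c3 = torch.randn(1, 128, 40, 40)                   # stride-8 original feature map
c4 = torch.randn(1, 256, 20, 20)                   # stride-16 original feature map
c5 = torch.randn(1, 512, 10, 10)                   # stride-32 original feature map
f1, f2, f3 = FeatureFusion()(c3, c4, c5)
print(f1.shape, f2.shape, f3.shape)                # (1, 512, 10, 10) (1, 256, 20, 20) (1, 128, 40, 40)
```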
  • the separate detection head network includes a feature map separation module and a gesture prediction module, where:
  • the feature map separation module is configured to separate each fused feature map through the basic convolution module to obtain a first separated feature map, a second separated feature map and a third separated feature map;
  • the gesture prediction module is configured to determine the predicted gesture category according to the first separated feature map, determine the gesture confidence based on the second separated feature map, and determine the predicted gesture position based on the third separated feature map.
  • the predicted gesture position is represented based on the grid position encoding of the target frame;
  • the grid position encoding is configured to represent the encoded coordinates of the target frame in the feature grid;
  • the feature grid is obtained by dividing the fused feature map into cells of a set unit length (an illustrative sketch of the detection head follows these items).
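One way to realize the separation of each fused feature map into category, confidence and position predictions is a set of parallel 1*1 branches, as sketched below in PyTorch; num_classes, num_anchors and the exact branch depth are assumed values for illustration only.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Separates one fused feature map into category, confidence and position predictions."""
    def __init__(self, in_ch, num_classes, num_anchors):
        super().__init__()
        def branch(out_ch):
            # 1x1 basic convolution used to separate the fused feature map, then a 1x1 projection
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(in_ch, out_ch, kernel_size=1),
            )
        self.cls_branch = branch(num_anchors * num_classes)  # predicted gesture category
        self.conf_branch = branch(num_anchors * 1)           # gesture confidence
        self.box_branch = branch(num_anchors * 4)            # predicted gesture position (t_x, t_y, t_w, t_h)

    def forward(self, fused):
        return self.cls_branch(fused), self.conf_branch(fused), self.box_branch(fused)

head = DetectionHead(in_ch=256, num_classes=10, num_anchors=3)
cls_map, conf_map, box_map = head(torch.randn(1, 256, 20, 20))
print(cls_map.shape, conf_map.shape, box_map.shape)  # (1, 30, 20, 20) (1, 3, 20, 20) (1, 12, 20, 20)
```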
  • the predicted gesture position is determined based on the decoded coordinates of the target frame on the fused feature map, the decoded size, and the downsampling step size of the fused feature map (a decoding sketch follows these items);
  • the decoded coordinates are determined based on the following formula: b_x = c_x + σ(t_x), b_y = c_y + σ(t_y), where:
  • (b_x, b_y) are the decoded coordinates of the center of the target frame on the fused feature map;
  • (c_x, c_y) are the coordinates of the upper left corner of the current feature grid cell;
  • σ(t_x) and σ(t_y) are the offsets of the prior box from the upper left corner of the current feature grid cell;
  • (t_x, t_y) are the encoded coordinates of the center of the prior box on the fused feature map;
  • the decoded size is determined based on the following formula: b_w = p_w · e^(t_w), b_h = p_h · e^(t_h), where:
  • b_h and b_w are the length and width of the decoded size of the target frame;
  • p_h and p_w are the length and width of the encoded size of the prior frame;
  • t_h and t_w are the exponential coefficients obtained by training the gesture detection model.
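The decoding above follows a familiar anchor-based scheme. A short plain-Python sketch is given below; the assumption that image-level coordinates are recovered by multiplying the decoded values with the downsampling step size is drawn from the first item in this group rather than from an explicit formula.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decodes one predicted box from grid-relative encodings to image coordinates.

    (tx, ty, tw, th) are the raw network outputs for one prior box, (cx, cy) the
    upper-left corner of the current feature-grid cell, (pw, ph) the prior-box size
    on the fused feature map, and stride the downsampling step size of that map.
    """
    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    bx = cx + sigmoid(tx)      # decoded centre x on the fused feature map
    by = cy + sigmoid(ty)      # decoded centre y on the fused feature map
    bw = pw * math.exp(tw)     # decoded width on the fused feature map
    bh = ph * math.exp(th)     # decoded height on the fused feature map
    # scale back to the input image by the downsampling step size (assumed)
    return bx * stride, by * stride, bw * stride, bh * stride

# e.g. a cell at (7, 4) on a stride-16 fused feature map with a 3x5 prior box:
print(decode_box(0.2, -0.1, 0.3, 0.1, 7, 4, 3, 5, 16))
```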
  • the gesture detection model is trained based on a joint loss function.
  • the joint loss function is determined based on whether a target is included in the prior box, the coordinate error between the predicted box and the prior box, and the loss value between the predicted target and the matching prior box.
  • the joint loss function is determined based on the following formula:

    L = λ_noobj · Σ_{i=1..W·H} Σ_{j=1..A} 1(maxiou_ij < thresh) · o_ij²
        + λ_prior · 1(t < N_warmup) · Σ_{i=1..W·H} Σ_{j=1..A} Σ_r (prior_ijr − b_ijr)²
        + Σ_{matched (i,j)} [ λ_coord · Σ_r (truth_r − b_ijr)² + λ_obj · (iou_ij − o_ij)² + Σ_c (truth_c − p_ij(c))² ]

    in which 1(·) is an indicator function, b_ijr is the predicted value of coordinate r for prior box j at grid point i, prior_ijr is the corresponding preset prior-box coordinate, iou_ij is the overlap ratio between the prediction and its matched labeled target, p_ij(c) is the predicted probability of category c, N_warmup is the set number of warm-up training iterations, and:
  • W is the width of the fused feature map
  • H is the length of the fused feature map
  • A is the number of a priori boxes for each point on the fused feature map
  • maxiou is the maximum overlap ratio between each a priori box and all real targets.
  • thresh is the set overlap ratio screening threshold
  • ⁇ noobj is the set negative sample loss function weight
  • o is the target score corresponding to the prior box
  • t is the number of training iterations
  • ⁇ prior is the weight of the warmup loss function
  • r represents the preset coordinates
  • ⁇ coord is the loss function weight of the coordinates
  • truth r is the coordinate value of the labeled target in the training sample
  • ⁇ obj is the loss function weight of whether the target is included
  • truth c is the category of the labeled target in the training sample (an illustrative sketch of this joint loss follows these items).
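Reading the weights and symbols above in the usual way, the joint loss combines a negative-sample confidence term, a warm-up prior-regression term, and coordinate, confidence and category terms for matched prior boxes. The sketch below is purely illustrative: the data layout, dictionary keys and default weight values are assumptions, not values taken from the embodiments.

```python
def joint_loss(priors, t, thresh=0.6, warmup=12800,
               w_noobj=0.5, w_prior=0.01, w_coord=1.0, w_obj=5.0, w_cls=1.0):
    """Illustrative scalar version of a YOLO-style joint loss.

    `priors` is assumed to be a list of dicts, one per prior box on the fused
    feature maps, with keys: 'o' (predicted confidence), 'maxiou', 'coords'
    (predicted x, y, w, h), 'prior_coords' (preset x, y, w, h), and 'matched'
    (None for negatives, otherwise a dict with 'truth_coords', 'iou',
    'truth_cls' one-hot vector and 'cls' predicted probabilities).
    All keys and default weights are assumptions made for this sketch.
    """
    loss = 0.0
    for p in priors:
        if p["matched"] is None:
            # negative sample: penalise confidence only when no labeled target overlaps enough
            if p["maxiou"] < thresh:
                loss += w_noobj * p["o"] ** 2
            # warm-up term: pull predicted coordinates towards the prior box early in training
            if t < warmup:
                loss += w_prior * sum((pr - c) ** 2
                                      for pr, c in zip(p["prior_coords"], p["coords"]))
        else:
            m = p["matched"]
            # coordinate error against the labeled target
            loss += w_coord * sum((tr - c) ** 2
                                  for tr, c in zip(m["truth_coords"], p["coords"]))
            # confidence should regress towards the overlap with the matched target
            loss += w_obj * (m["iou"] - p["o"]) ** 2
            # category error against the labeled category
            loss += w_cls * sum((tc - pc) ** 2
                                for tc, pc in zip(m["truth_cls"], m["cls"]))
    return loss
```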
  • the embodiment of the present application also provides a real-time gesture detection device, which can integrate the real-time gesture detection apparatus provided by the embodiments of the present application.
  • Figure 10 is a schematic structural diagram of a real-time gesture detection device provided by an embodiment of the present application.
  • the real-time gesture detection device includes: an input device 33, an output device 34, a memory 32 and one or more processors 31; the memory 32 is used to store one or more programs; when the one or more programs are executed by the one or more processors 31, the one or more processors 31 implement the real-time gesture detection method provided in the above embodiments.
  • the real-time gesture detection apparatus and device provided above can be used to execute the real-time gesture detection method provided by any of the above embodiments, and have corresponding functions and beneficial effects.
  • Embodiments of the present application also provide a storage medium that stores computer-executable instructions.
  • the computer-executable instructions, when executed by a computer processor, are used to perform the real-time gesture detection method provided by the above embodiments;
  • in the storage medium storing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the real-time gesture detection method described above, and can also execute related operations of the real-time gesture detection method provided by any embodiment of the present application;
  • the real-time gesture detection apparatus, device and storage medium provided in the above embodiments can execute the real-time gesture detection method provided by any embodiment of the present application.
  • various aspects of the method provided by the present disclosure can also be implemented in the form of a program product, which includes program code;
  • when the program product is run on a computer device, the program code is used to cause the computer device to perform the steps of the methods described above according to various exemplary embodiments of the present disclosure, so that the computer device can perform the real-time gesture detection method described in the embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application relate to a real-time gesture detection method and apparatus. According to the technical solution provided in the embodiments of the present application, an image to be recognized is obtained, the image to be recognized is input into a gesture detection model for gesture recognition, and a gesture type and a gesture position are determined according to a gesture recognition result output by the gesture detection model. The gesture detection model extracts original feature maps of a plurality of levels from the input image on the basis of a separable convolution structure and a residual structure, so that the computational complexity of feature extraction is reduced and the computational complexity of gesture detection is reduced; furthermore, the plurality of original feature maps are fused to obtain fused feature maps, and the fused features are used to improve the detection capability for the target, so as to compensate for the performance loss caused by parameter reduction and to improve the detection effect for small targets and blurred scenes; then, gesture recognition is performed according to the fused feature maps and the gesture recognition result is output, so that the real-time performance requirement of gesture recognition can be effectively met.
PCT/CN2023/080066 2022-03-14 2023-03-07 Procédé et appareil de détection de geste en temps réel WO2023174098A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210249415.7A CN114612832A (zh) 2022-03-14 2022-03-14 一种实时手势检测方法及装置
CN202210249415.7 2022-03-14

Publications (1)

Publication Number Publication Date
WO2023174098A1 true WO2023174098A1 (fr) 2023-09-21

Family

ID=81863469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080066 WO2023174098A1 (fr) 2022-03-14 2023-03-07 Procédé et appareil de détection de geste en temps réel

Country Status (2)

Country Link
CN (1) CN114612832A (fr)
WO (1) WO2023174098A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351420A (zh) * 2023-10-18 2024-01-05 江苏思行达信息技术有限公司 一种智能开关门检测方法
CN117593516A (zh) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117893413A (zh) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 基于图像增强的车载终端人机交互方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612832A (zh) * 2022-03-14 2022-06-10 百果园技术(新加坡)有限公司 一种实时手势检测方法及装置
CN115690853B (zh) * 2022-12-30 2023-04-28 广州蚁窝智能科技有限公司 手势识别方法及电动卫生罩启闭控制系统
CN118172801B (zh) * 2024-05-15 2024-08-23 南昌虚拟现实研究院股份有限公司 一种手势检测方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740534A (zh) * 2018-12-29 2019-05-10 北京旷视科技有限公司 图像处理方法、装置及处理设备
CN110135237A (zh) * 2019-03-24 2019-08-16 北京化工大学 一种手势识别方法
CN112906794A (zh) * 2021-02-22 2021-06-04 珠海格力电器股份有限公司 一种目标检测方法、装置、存储介质及终端
US20210174149A1 (en) * 2018-11-20 2021-06-10 Xidian University Feature fusion and dense connection-based method for infrared plane object detection
CN114612832A (zh) * 2022-03-14 2022-06-10 百果园技术(新加坡)有限公司 一种实时手势检测方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898432B (zh) * 2020-06-24 2022-10-14 南京理工大学 一种基于改进YOLOv3算法的行人检测系统及方法
CN113298181A (zh) * 2021-06-16 2021-08-24 合肥工业大学智能制造技术研究院 基于密集连接的Yolov3网络的井下管道异常目标识别方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210174149A1 (en) * 2018-11-20 2021-06-10 Xidian University Feature fusion and dense connection-based method for infrared plane object detection
CN109740534A (zh) * 2018-12-29 2019-05-10 北京旷视科技有限公司 图像处理方法、装置及处理设备
CN110135237A (zh) * 2019-03-24 2019-08-16 北京化工大学 一种手势识别方法
CN112906794A (zh) * 2021-02-22 2021-06-04 珠海格力电器股份有限公司 一种目标检测方法、装置、存储介质及终端
CN114612832A (zh) * 2022-03-14 2022-06-10 百果园技术(新加坡)有限公司 一种实时手势检测方法及装置

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351420A (zh) * 2023-10-18 2024-01-05 江苏思行达信息技术有限公司 一种智能开关门检测方法
CN117351420B (zh) * 2023-10-18 2024-06-04 江苏思行达信息技术股份有限公司 一种智能开关门检测方法
CN117593516A (zh) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117593516B (zh) * 2024-01-18 2024-03-22 苏州元脑智能科技有限公司 一种目标检测方法、装置、设备及存储介质
CN117893413A (zh) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 基于图像增强的车载终端人机交互方法
CN117893413B (zh) * 2024-03-15 2024-06-11 博创联动科技股份有限公司 基于图像增强的车载终端人机交互方法

Also Published As

Publication number Publication date
CN114612832A (zh) 2022-06-10

Similar Documents

Publication Publication Date Title
WO2023174098A1 (fr) Procédé et appareil de détection de geste en temps réel
US11551333B2 (en) Image reconstruction method and device
CN109558832B (zh) 一种人体姿态检测方法、装置、设备及存储介质
CN111402130B (zh) 数据处理方法和数据处理装置
WO2021248859A1 (fr) Procédé et appareil de classification vidéo, ainsi que dispositif et support de stockage lisible par ordinateur
CN111480169B (zh) 用于模式识别的方法、系统和装置
CN112070044B (zh) 一种视频物体分类方法及装置
GB2555136A (en) A method for analysing media content
CN114973049B (zh) 一种统一卷积与自注意力的轻量视频分类方法
WO2021103731A1 (fr) Procédé de segmentation sémantique et procédé et appareil d'apprentissage de modèle
CN112749666B (zh) 一种动作识别模型的训练及动作识别方法与相关装置
CN110751649A (zh) 视频质量评估方法、装置、电子设备及存储介质
CN113901909B (zh) 基于视频的目标检测方法、装置、电子设备和存储介质
CN110569814A (zh) 视频类别识别方法、装置、计算机设备及计算机存储介质
CN110532959B (zh) 基于双通道三维卷积神经网络的实时暴力行为检测系统
US20230030431A1 (en) Method and apparatus for extracting feature, device, and storage medium
WO2023036157A1 (fr) Apprentissage auto-supervisé d'une représentation spatio-temporelle par exploration de la continuité vidéo
CN114821096A (zh) 一种图像处理方法、神经网络的训练方法以及相关设备
CN111242181B (zh) 基于图像语义和细节的rgb-d显著性物体检测器
Chen et al. Video‐based action recognition using spurious‐3D residual attention networks
Wang et al. Global contextual guided residual attention network for salient object detection
CN116363361A (zh) 基于实时语义分割网络的自动驾驶方法
CN111401267A (zh) 基于自学习局部特征表征的视频行人再识别方法及系统
CN113569687A (zh) 基于双流网络的场景分类方法、系统、设备及介质
CN110659641A (zh) 一种文字识别的方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23769604

Country of ref document: EP

Kind code of ref document: A1