WO2021098587A1 - Gesture analysis method, apparatus, device, and computer-readable storage medium

Gesture analysis method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2021098587A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
key point
coordinates
depth
gesture analysis
Prior art date
Application number
PCT/CN2020/128469
Other languages
English (en)
French (fr)
Inventor
周扬 (Zhou Yang)
Original Assignee
Oppo广东移动通信有限公司 (Guangdong Oppo Mobile Telecommunications Corp., Ltd.)
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 (Guangdong Oppo Mobile Telecommunications Corp., Ltd.)
Publication of WO2021098587A1
Priority to US 17/746,956 (published as US20220351547A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects

Definitions

  • The embodiments of the present application relate to the field of Internet technology, and relate to, but are not limited to, a gesture analysis method, apparatus, device, and computer-readable storage medium.
  • Gesture recognition and gesture analysis techniques are used in many fields; their purpose is to estimate the coordinates of several joint points of the hand by analyzing an image. Because the movement of human hands can be accurately and effectively reconstructed from images, gesture analysis is expected to enable exciting new applications in immersive virtual reality and augmented reality, robot control, and sign language recognition.
  • However, gesture analysis is still a difficult task, and the accuracy of gesture analysis methods in the related art needs to be improved.
  • The embodiments of the present application provide a gesture analysis method, apparatus, device, and computer-readable storage medium that separate the gesture estimation tasks of the fingers and the palm.
  • In this separated architecture, finger key points and palm key points are processed by dedicated branches, and the processing results are combined to realize the gesture analysis of the entire hand. In this way, the accuracy of the gesture analysis can be greatly improved.
  • An embodiment of the present application provides a gesture analysis method, including: performing feature extraction on an acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features; performing UV coordinate regression processing and depth regression processing on each first key point feature and each second key point feature to obtain the corresponding UV coordinates and depth coordinates of the finger key points and the palm key points; and performing, according to these coordinates, gesture analysis on the image to be analyzed to obtain a gesture analysis result.
  • An embodiment of the present application provides a gesture analysis device, including:
  • the feature extraction module is used to perform feature extraction on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features;
  • the UV coordinate regression processing module is used to perform UV coordinate regression processing on each of the first key point features and each of the second key point features, to correspondingly obtain the first UV coordinates of each finger key point and the second UV coordinates of each palm key point;
  • the depth regression processing module is used to perform depth regression processing on each of the first key point features and each of the second key point features, to correspondingly obtain the first depth coordinates of each finger key point and the second depth coordinates of each palm key point;
  • the gesture analysis module is configured to perform gesture analysis on the image to be analyzed according to the first UV coordinates, the first depth coordinates, the second UV coordinates, and the second depth coordinates to obtain a gesture analysis result.
  • An embodiment of the present application provides a gesture analysis device, including:
  • a memory for storing executable instructions; and a processor configured to implement the aforementioned gesture analysis method when executing the executable instructions stored in the memory.
  • the embodiment of the present application provides a computer-readable storage medium that stores executable instructions for causing a processor to execute the executable instructions to implement the aforementioned gesture analysis method.
  • In the embodiments of the present application, the gesture estimation tasks of the fingers and the palm are separated: feature extraction is performed on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features; each first key point feature and each second key point feature is then subjected to UV coordinate regression processing and depth regression processing respectively; and, based on the results of the UV coordinate regression processing and the depth regression processing, gesture analysis is performed on the image to be analyzed. In this way, the accuracy of the gesture analysis can be greatly improved.
  • FIG. 1 is a schematic diagram of an optional architecture of a gesture analysis system provided by an embodiment of the present application
  • FIG. 2 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application
  • FIG. 3 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application.
  • FIG. 4 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application.
  • FIG. 5 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application.
  • FIG. 6 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application.
  • FIG. 7 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application.
  • FIG. 8 is an optional flowchart of a gesture analysis model training method provided by an embodiment of the present application.
  • Fig. 9 is an example image captured by a TOF camera provided by an embodiment of the present application.
  • FIG. 10 is a hand detection result including a prediction range and hand existence probability provided by an embodiment of the present application.
  • FIG. 11 is an example diagram of the position of key points of the hand provided by an embodiment of the present application.
  • FIG. 12 is an example diagram of a two-dimensional hand posture estimation result provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a hand detection and hand posture estimation process provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of RoI Align provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the result of the NMS provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of IoU provided by an embodiment of the present application.
  • FIG. 17 is a framework diagram of a regional integration network of a pose guidance structure provided by an embodiment of the present application.
  • FIG. 18 is a flowchart of a gesture analysis method provided by an embodiment of the present application.
  • Figure 19 is a network architecture diagram of a gesture estimation module provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a gesture analysis device provided by an embodiment of the present application.
  • As shown in FIG. 1, the gesture analysis system 10 provided in the embodiment of the present application includes a terminal 100, a network 200, and a server 300. The terminal 100 runs a video playback application, has a video recording unit, or runs an image display application. A video recorded in real time or in advance by the video recording unit is played through the video playback application, and each video frame of the video is taken as an image to be analyzed by the method of the embodiments of this application, so that gesture analysis can be performed on the hand in the image; alternatively, gesture analysis is performed on an image to be analyzed that is displayed by the image display application.
  • The terminal 100 sends the image to be analyzed to the server 300 through the network 200. The server 300 performs feature extraction on the image to be analyzed to obtain a first number of first key point features and a second number of second key point features; then performs UV coordinate regression processing on each first key point feature and each second key point feature, correspondingly obtaining the first UV coordinates of each finger key point and the second UV coordinates of each palm key point; and performs depth regression processing on each first key point feature and each second key point feature, correspondingly obtaining the first depth coordinates of each finger key point and the second depth coordinates of each palm key point. Finally, according to the first UV coordinates, the first depth coordinates, the second UV coordinates, and the second depth coordinates, gesture analysis is performed on the image to be analyzed to obtain a gesture analysis result.
  • After obtaining the gesture analysis result, the server 300 sends the gesture analysis result to the terminal 100, and the terminal 100 displays, on the current interface 100-1, a marked image annotated with the gesture analysis result, or directly displays the gesture analysis result.
  • In this way, the accuracy of gesture analysis can be greatly improved.
  • The gesture analysis device provided by the embodiments of the present application can be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), a smart robot, a smart video surveillance device, or any other type of terminal.
  • The gesture analysis device provided in the embodiments of the application can also be implemented as a server.
  • In the following, an exemplary application in which the gesture analysis device is implemented as a server will be described.
  • FIG. 2 is an optional flowchart of a gesture analysis method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
  • Step S201 Perform feature extraction on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features.
  • Here, the image to be analyzed contains a hand, and therefore the feature extraction performed on the acquired image to be analyzed may be hand feature extraction.
  • When performing hand position recognition, a pre-trained hand detection model can be used to detect and output, for each position (that is, for any sub-region of the entire area of the image to be analyzed, also called a bounding box), the probability value that a hand exists there; the sub-region with the largest probability value is then determined as the region where the hand is located.
  • hand feature extraction is performed on the sub-region to obtain a first number of first key point features and a second number of second key point features, where:
  • the first key point feature may be a finger key point feature
  • the second key point feature may be a palm key point feature, where the first number and the second number can be any positive integers.
  • Hand feature extraction can be achieved by using a pre-trained hand feature extraction model.
  • the depth image with the hand can be input into the hand feature extraction model.
  • the depth image is recognized to determine at least one key point of the hand in the depth image, and these key points include not only finger key points, but also palm key points.
  • In some embodiments, artificial intelligence technology may also be used to implement the method of the embodiments of the present application, that is, artificial intelligence technology is used to identify the sub-region where the hand is located, and artificial intelligence technology is used to identify the finger key points and the palm key points.
  • the key point feature of the finger is the image feature obtained by image feature extraction of the key point of the finger
  • the key point feature of the palm is the image feature obtained by the image feature extraction of the key point of the palm.
  • Step S202 Perform UV coordinate regression processing on each of the first key point features and each of the second key point features, to correspondingly obtain the first UV coordinates of each finger key point and the second UV coordinates of each palm key point.
  • Here, the UV coordinate regression processing is used to determine the UV coordinates of the finger key points and the palm key points; the UV coordinates are two-dimensional image coordinates, as distinguished from the XYZ coordinates.
  • Step S203 Perform depth regression processing on each of the first key point features and each of the second key point features, to correspondingly obtain the first depth coordinates of each finger key point and the second depth coordinates of each palm key point.
  • the depth regression processing is used to determine the depth coordinates of the finger key points and the palm key points.
  • The depth coordinates are likewise defined with respect to the camera, as distinguished from the XYZ coordinates.
  • the UV coordinates and the depth coordinates together form the UVD coordinates of the finger key points and the palm key points.
  • Step S204 Perform a gesture analysis on the image to be analyzed according to the first UV coordinate, the first depth coordinate, the second UV coordinate, and the second depth coordinate to obtain a gesture analysis result.
  • the first UV coordinate and the first depth coordinate form the UVD coordinate of the key point of the finger
  • the second UV coordinate and the second depth coordinate form the UVD coordinate of the key point of the palm.
  • That is, the embodiments of the present application use UVD coordinates to represent the positions of the fingers and the palm, in order to realize hand gesture recognition and analysis.
  • Here, the gesture analysis result includes the UVD coordinates of each finger key point and the UVD coordinates of each palm key point, or the gesture analysis result further includes a gesture structure diagram of the hand determined by the UVD coordinates of each finger key point and each palm key point.
  • The gesture analysis method provided by the embodiments of the present application separates the gesture estimation tasks of the fingers and the palm: feature extraction is performed on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features; each first key point feature and each second key point feature is then subjected to UV coordinate regression processing and depth regression processing; and, according to the results of the UV coordinate regression processing and the depth regression processing, gesture analysis is performed on the image to be analyzed to obtain the gesture analysis result. In this way, the accuracy of the gesture analysis can be greatly improved.
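  • To make this separated architecture concrete, the following is a minimal PyTorch sketch of the flow of steps S201 to S204. It is an illustrative assumption rather than the patent's implementation: the class name, layer widths, and the split into 14 finger key points and 6 palm key points (taken from the positions listed for Figure 11 later in this description) are chosen only for the example.

```python
import torch
import torch.nn as nn

class SeparatedHandPoseNet(nn.Module):
    """Sketch: separate finger / palm branches, each with its own
    UV-coordinate regression head and depth regression head."""

    def __init__(self, num_finger_kpts=14, num_palm_kpts=6):
        super().__init__()
        # shared backbone over the cropped single-channel depth image
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        feat_dim = 64 * 8 * 8
        # separate key point feature extractors for fingers and palm
        self.finger_feat = nn.Linear(feat_dim, 256)
        self.palm_feat = nn.Linear(feat_dim, 256)
        # UV regression heads (2 values per key point) and depth heads (1 value)
        self.finger_uv = nn.Linear(256, num_finger_kpts * 2)
        self.palm_uv = nn.Linear(256, num_palm_kpts * 2)
        self.finger_d = nn.Linear(256, num_finger_kpts)
        self.palm_d = nn.Linear(256, num_palm_kpts)

    def forward(self, depth_image):
        x = self.backbone(depth_image).flatten(1)
        f_feat = torch.relu(self.finger_feat(x))
        p_feat = torch.relu(self.palm_feat(x))
        return {
            "finger_uv": self.finger_uv(f_feat),   # first UV coordinates
            "palm_uv": self.palm_uv(p_feat),       # second UV coordinates
            "finger_d": self.finger_d(f_feat),     # first depth coordinates
            "palm_d": self.palm_d(p_feat),         # second depth coordinates
        }

# usage: one 96x96 depth crop -> UVD predictions for all hand key points
net = SeparatedHandPoseNet()
out = net(torch.randn(1, 1, 96, 96))
```
  • In this sketch the two branches share a backbone but keep separate key point feature extractors and separate UV/depth regression heads, which is the essence of the separated architecture described above.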
  • the gesture analysis system includes at least a terminal and a server, and a video playback application is running on the terminal.
  • The method in the embodiments of this application can be used to perform gesture analysis on the hand in each video frame of the video played by the video playback application; or, the terminal has a video recording unit, a video is recorded in real time through the video recording unit, and the method of this embodiment is used to perform gesture analysis on the hand in each video frame of the real-time recorded video; or, the terminal has an image capture unit, an image is captured by the image capture unit, and the method of the embodiments of this application is used to perform gesture analysis on the hand in the captured image; or, an image display application is running on the terminal, and the method of the embodiments of this application is used to perform gesture analysis on the hand in the image displayed by the image display application.
  • FIG. 3 is an optional flowchart of the gesture analysis method provided by the embodiment of the application. As shown in FIG. 3, the method includes the following steps:
  • step S301 the terminal obtains an image to be analyzed.
  • the terminal may download the image to be analyzed on the network, or may use an image capturing unit to capture the image to be analyzed in real time, or may also use the received image as the image to be analyzed.
  • Step S302 It is judged whether there is a hand on the image to be analyzed.
  • Here, a pre-trained hand recognition model can be used to recognize the image to be analyzed.
  • If the recognition result shows that the probability of a hand being present in any sub-region of the image to be analyzed is greater than a threshold, it indicates that there is a hand in that sub-region, so that it is determined that there is a hand in the image to be analyzed; if the probability value of a hand being present in every sub-region of the image is less than the threshold, it indicates that there is no hand in the image to be analyzed.
  • If the judgment result is yes, step S303 is executed; if the judgment result is no, the process returns to step S301.
  • Step S303 The terminal sends the image to be analyzed to the server.
  • step S304 the server performs hand feature extraction on the image to be analyzed to obtain the first number of finger key point features and the second number of palm key point features.
  • step S305 the server respectively performs UV coordinate regression processing on each finger key point feature and each palm key point feature, and correspondingly obtains the first UV coordinate of each finger key point and the second UV coordinate of each palm key point.
  • step S306 the server respectively performs depth regression processing on each finger key point feature and each palm key point feature, and correspondingly obtains the first depth coordinate of each finger key point and the second depth coordinate of each palm key point.
  • step S307 the server performs gesture analysis on the image to be analyzed according to the first UV coordinate, the first depth coordinate, the second UV coordinate, and the second depth coordinate to obtain the gesture analysis result.
  • steps S304 to S307 are the same as the above-mentioned steps S201 to S204, which will not be repeated in this embodiment of the application.
  • step S308 the server sends the gesture analysis result to the terminal.
  • step S309 the terminal displays the gesture analysis result on the current interface.
  • In the gesture analysis method provided by the embodiments of the present application, the terminal obtains the image to be analyzed and sends it to the server for analysis and recognition; after the hand gesture in the image to be analyzed has been analyzed, the gesture analysis result is fed back to the terminal and displayed on the current interface of the terminal.
  • When the server performs the gesture analysis, the gesture estimation tasks of the fingers and the palm are separated. In this separated architecture, the key points of the fingers and the palm are processed separately to realize the gesture analysis of the entire hand. In this way, the accuracy of the gesture analysis can be greatly improved.
  • FIG. 4 is an optional flowchart of the gesture analysis method provided by an embodiment of the present application. As shown in FIG. 4, step S202 can be implemented through the following steps:
  • In step S401, UV encoding is performed on each of the finger key point features and each of the palm key point features, to correspondingly obtain the first UV encoding feature of each finger key point and the second UV encoding feature of each palm key point.
  • The UV encoding processing in the embodiments of the present application is performed separately for the finger key point features and the palm key point features, and the UV encoding process for the finger key point features is the same as that for the palm key point features.
  • step S401 UV encoding is performed on each key point feature of the finger to obtain the first UV encoding feature of each key point of the finger, which can be implemented through the following steps:
  • Step S4011 using the first convolution layer to perform convolution processing on each of the finger key point features to obtain the first convolution feature.
  • the first convolutional layer has a specific convolution kernel, and the number of convolution kernels of the first convolutional layer may be a preset value, or may be obtained through training.
  • Step S4012 sequentially performing jump connection processing for the first preset number of times on the first convolution feature through the first convolution layer to obtain the first jump connection feature.
  • Jump connection (i.e., skip connection) processing can alleviate the problem of vanishing gradients in deeper networks, and at the same time it facilitates back-propagation of the gradient and speeds up the image processing.
  • step S4012 can be implemented through the following steps:
  • Step S4012a Determine the first convolution feature as the input feature of the first convolution layer during the first jump connection process.
  • Here, the first convolution feature obtained by the first convolutional layer performing convolution processing on the finger key point features is determined as the input feature of the first jump connection process; that is, the jump connection processing is connected after the first convolutional layer, so that jump connection processing is performed after the first convolutional layer completes its convolution processing.
  • Step S4012b Determine the output feature of the first convolutional layer at the N-th time as the input feature of the first convolutional layer at the N-th jump connection process, where N is an integer greater than 1.
  • That is, the output of the first convolutional layer is jump-connected to the input position of the first convolutional layer; then, in the N-th jump connection process, the input feature is the output feature of the first convolutional layer at the N-th time.
  • Step S4012c input the determined input feature each time into the first convolutional layer, and sequentially perform the first preset number of jump connection processing to obtain the first jump connection feature.
  • The entire jump connection process is as follows: after the first convolutional layer performs convolution processing on the finger key point features to obtain the first convolution feature, the first convolution feature is input into the first convolutional layer for the first jump connection process to obtain the output feature of the first jump connection process; the output feature of the first jump connection process is then used as the input feature of the second jump connection process and input into the first convolutional layer for the second jump connection process to obtain the output feature of the second jump connection process; the output feature of the second jump connection process is then input, as the input feature of the third jump connection process, into the first convolutional layer for the third jump connection process; and so on, until the jump connection processing has been performed the first preset number of times and the first jump connection feature is obtained.
  • Step S4013 Perform pooling processing on the first jump connection feature to reduce the spatial size of the first jump connection feature, and obtain the first UV encoding feature of each key point of the finger.
  • the first jump connection feature may be pooled through a preset first pooling layer. Pooling processing is down-sampling processing, and pooling processing is used to reduce the spatial size of the first jump connection feature.
  • step S401 UV encoding is performed on each of the palm key point features to obtain the second UV encoding feature of each palm key point, which can be implemented through the following steps:
  • Step S4014 using a second convolution layer to perform convolution processing on each of the palm key point features to obtain a second convolution feature.
  • the second convolutional layer has a specific convolution kernel, and the number of convolution kernels of the second convolutional layer may be a preset value, or may be obtained through training.
  • Step S4015 sequentially performing a second preset number of skip connection processing on the second convolution feature through the second convolution layer to obtain a second skip connection feature.
  • step S4015 can be implemented through the following steps:
  • Step S4015a Determine the second convolution feature as the input feature of the second convolution layer during the first jump connection process.
  • Step S4015b Determine the Kth output feature of the second convolutional layer as the input feature of the Kth jump connection process of the second convolutional layer, where K is an integer greater than 1.
  • Step S4015c input the determined input features each time into the second convolutional layer, and sequentially perform the jump connection processing for the second preset number of times to obtain the second jump connection feature.
  • The processing procedure of steps S4015a to S4015c is the same as that of the jump connection processing performed the first preset number of times described above; please refer to the explanation of steps S4012a to S4012c, and the details are not repeated in the embodiments of this application.
  • the first preset number of times and the second preset number of times may be the same or different, and the first preset number of times and the second preset number of times may be determined according to data processing requirements and data processing volume.
  • Step S4016 Perform pooling processing on the second jump connection feature to reduce the spatial size of the second jump connection feature to obtain the second UV encoding feature of each key point of the palm.
  • the second jump connection feature may be pooled through a preset second pooling layer.
  • Step S402 Perform full connection processing on each of the first UV encoding features and each of the second UV encoding features, and correspondingly obtain the first UV coordinates of each of the finger key points and each of the palms. The second UV coordinate of the key point.
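  • As an illustration of steps S4011 to S4013 and the full connection processing of step S402 (the depth branch of steps S403 and S404 mirrors the same structure), the sketch below repeatedly feeds a convolutional layer's output back through it with a jump (skip) connection, pools the result to reduce its spatial size, and regresses UV coordinates with a fully connected layer. The layer widths, the number of repetitions, and the class name are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class UVEncoderHead(nn.Module):
    """Sketch of the UV encoding branch: one convolutional layer applied
    repeatedly with jump (skip) connections, followed by pooling and a
    fully connected regression layer."""

    def __init__(self, in_ch=64, num_kpts=14, repeats=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))           # reduces spatial size
        self.fc = nn.Linear(in_ch * 4 * 4, num_kpts * 2)   # (u, v) per key point
        self.repeats = repeats

    def forward(self, kpt_features):
        # first convolution (step S4011)
        x = torch.relu(self.conv(kpt_features))
        # jump connection processing, repeated a preset number of times (step S4012):
        # each pass adds the layer's input back onto its output
        for _ in range(self.repeats):
            x = torch.relu(self.conv(x)) + x
        # pooling to reduce the spatial size (step S4013)
        x = self.pool(x).flatten(1)
        # full connection processing to obtain UV coordinates (step S402)
        return self.fc(x).view(-1, self.fc.out_features // 2, 2)

uv_head = UVEncoderHead()
uv = uv_head(torch.randn(1, 64, 16, 16))   # -> (1, 14, 2) first UV coordinates
```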
  • step S203 can be implemented through the following steps:
  • Step S403 Perform depth encoding processing on each of the finger key point features and each of the palm key point features, to correspondingly obtain the first depth encoding feature of each finger key point and the second depth encoding feature of each palm key point.
  • The depth encoding processing in the embodiments of the present application is performed separately for the finger key point features and the palm key point features, and the depth encoding process for the finger key point features is the same as that for the palm key point features.
  • step S403 the depth encoding process is performed on each of the finger key point features, and the first depth encoding feature of each finger key point is correspondingly obtained, which can be implemented by the following steps:
  • Step S4031 using a third convolution layer to perform convolution processing on each of the finger key point features to obtain a third convolution feature.
  • the third convolutional layer has a specific convolution kernel, and the number of convolution kernels of the third convolutional layer may be a preset value, or may be obtained through training.
  • step S4032 the third convolutional layer is used to sequentially perform jump connection processing for a third preset number of times on the third convolution feature to obtain a third jump connection feature.
  • step S4032 can be implemented through the following steps:
  • Step S4032a Determine the third convolutional feature as the input feature of the third convolutional layer during the first jump connection process.
  • Step S4032b Determine the M-th output feature of the third convolutional layer as the input feature of the M-th jump connection process of the third convolutional layer, where M is an integer greater than 1.
  • Step S4032c input the determined input feature each time into the third convolutional layer, and sequentially perform the jump connection processing for the third preset number of times to obtain the third jump connection feature.
  • The processing procedure of steps S4032a to S4032c is the same as that of the jump connection processing performed the first preset number of times and the second preset number of times; please refer to the explanation of steps S4012a to S4012c, and the details are not repeated in the embodiments of the present application.
  • Step S4033 Perform pooling processing on the third jump connection feature to reduce the spatial size of the third jump connection feature to obtain the first depth coding feature of each key point of the finger.
  • the third jump connection feature may be pooled through a preset third pooling layer.
  • step S403 the depth encoding process is performed on each of the palm key point features, and the second depth encoding feature of each palm key point is correspondingly obtained, which can be implemented through the following steps:
  • Step S4034 using a fourth convolution layer to perform convolution processing on each of the palm key point features to obtain a fourth convolution feature.
  • the fourth convolution layer has a specific convolution kernel.
  • step S4035 the fourth convolutional layer is used to sequentially perform jump connection processing for a fourth preset number of times on the fourth convolution feature to obtain a fourth jump connection feature.
  • step S4035 can be implemented through the following steps:
  • Step S4035a Determine the fourth convolution feature as the input feature of the fourth convolution layer during the first jump connection process.
  • Step S4035b Determine the output feature of the fourth convolutional layer at the Lth time as the input feature of the fourth convolutional layer at the Lth jump connection process, where L is an integer greater than 1.
  • Step S4035c input the determined input features each time into the fourth convolutional layer, and sequentially perform the jump connection processing for the fourth preset number of times to obtain the fourth jump connection feature.
  • The processing procedure of the jump connection processing performed the fourth preset number of times in steps S4035a to S4035c is the same as that of the jump connection processing performed the first, second, and third preset numbers of times; please refer to the explanation of steps S4012a to S4012c, and the details are not repeated in this embodiment of the application.
  • the third preset number of times and the fourth preset number of times may be the same or different, and the third preset number of times and the fourth preset number of times may be determined according to data processing requirements and data processing volume.
  • Step S4036 Perform pooling processing on the fourth jump connection feature to reduce the spatial size of the fourth jump connection feature to obtain the second depth coding feature of each key point of the palm.
  • the fourth jump connection feature may be pooled through a preset fourth pooling layer.
  • Step S404 Perform full connection processing on each of the first depth encoding features and each of the second depth encoding features, and correspondingly obtain the first depth coordinates of each of the finger key points and each of the palms The second depth coordinate of the key point.
  • FIG. 5 is an optional flowchart of the gesture analysis method provided by an embodiment of the present application. As shown in FIG. 5, step S204 can be implemented by the following steps:
  • Step S501 Perform coordinate conversion on the first UV coordinate and the first depth coordinate of each key point of the finger to obtain the first space coordinate corresponding to the key point of the finger.
  • Here, coordinate conversion refers to the conversion of UVD coordinates to XYZ coordinates, where the UVD coordinates of a finger key point are determined by the first UV coordinates and the first depth coordinates; that is, the first UV coordinates and the first depth coordinates together form the UVD coordinates of the finger key point, and the first space coordinate is the representation of the finger key point in XYZ coordinates.
  • (x, y, z) are the coordinates in XYZ format and (u, v, d) are the coordinates in UVD format, where u and v correspond to the pixel position in the two-dimensional image and d represents the depth value, that is, the distance of the coordinate point from the camera.
  • Cx and Cy denote the principal point, i.e., the projection of the camera's optical center into the image coordinate system, which ideally lies at the center of the image.
  • fx and fy are the focal lengths in the x direction and the y direction, respectively.
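  • The exact form of the conversion formula (1-1) referred to below is not reproduced here; assuming the standard pinhole back-projection implied by these definitions, it can be written as:

$$x = \frac{(u - C_x)\,d}{f_x}, \qquad y = \frac{(v - C_y)\,d}{f_y}, \qquad z = d \tag{1-1}$$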
  • Step S502 Perform coordinate conversion on the second UV coordinate and the second depth coordinate of each palm key point to obtain a second space coordinate corresponding to the palm key point.
  • the UVD coordinates of the key points of the palm are determined by the second UV coordinates and the second depth coordinates, that is, the second UV coordinates and the second depth coordinates together form the UVD coordinates of the key points of the palm.
  • the coordinate conversion between the second UV coordinate and the second depth coordinate of each key point of the palm can be realized through the above formula (1-1).
  • the second space coordinate is the representation of the key points of the palm in XYZ coordinates.
  • Step S503 Perform a gesture analysis on the image to be analyzed according to the first spatial coordinates and the second spatial coordinates to obtain a gesture analysis result.
  • the XYZ coordinate representation is used for gesture analysis, and the position of each key point of the hand on the three-dimensional coordinate can be obtained, so as to obtain an accurate gesture analysis result.
  • step S503 can be implemented through the following steps:
  • Step S5031 Determine a first relative position relationship between every two key points of the fingers and a second relative position relationship between every two key points of the palm.
  • Here, the first relative position relationship is the relative position relationship between every two finger key points: the first relative position relationship between two adjacent finger key points on the same finger is that the two finger key points are adjacent and can be directly connected, while the first relative position relationship between two finger key points located on different fingers is that the two finger key points cannot be directly connected.
  • The second relative position relationship is the relative position relationship between every two palm key points: palm key points at two adjacent positions on the palm can be directly connected, while palm key points at two non-adjacent positions cannot be directly connected.
  • Step S5032 according to the first relative position relationship and the second relative position relationship, sequentially connect the first number of finger key points and the second number of palm key points to form a hand key point connection diagram.
  • the key point connection diagram of the hand includes the XYZ coordinates of each key point.
  • Step S5033 Perform a gesture analysis on the image to be analyzed according to the hand key point connection diagram to obtain a gesture analysis result.
  • the shape of each finger and the shape of the palm can be determined through the key point connection diagram of the hand, thereby determining the result of the hand gesture analysis.
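  • As a sketch of steps S501 to S503, the snippet below converts each key point's UVD coordinates to XYZ space and then connects key points according to a relative position relationship. The edge list is a hypothetical placeholder (the text does not enumerate which key points of Figure 11 are adjacent), and the camera intrinsics fx, fy, Cx, Cy are assumed to be known.

```python
import numpy as np

def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
    """Convert one UVD coordinate to XYZ using the pinhole back-projection
    assumed above for formula (1-1)."""
    return np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])

# Hypothetical connectivity: index pairs that may be directly connected
# (adjacent key points on the same finger, or adjacent palm key points).
# The actual skeleton of Figure 11 is not spelled out in the text, so this
# edge list is only an illustrative placeholder.
HAND_EDGES = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]

def build_connection_diagram(uvd_coords, fx, fy, cx, cy, edges=HAND_EDGES):
    """Return the hand key point connection diagram as line segments in XYZ."""
    xyz = [uvd_to_xyz(u, v, d, fx, fy, cx, cy) for (u, v, d) in uvd_coords]
    return [(xyz[i], xyz[j]) for (i, j) in edges]
```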
  • FIG. 6 is an optional flowchart of the gesture analysis method provided by an embodiment of the present application. As shown in FIG. 6, step S201 can be implemented through the following steps:
  • Step S601 Perform target recognition on the image to be analyzed, so as to realize the identification of a target sub-region with a target object in at least two sub-regions of the image to be analyzed.
  • step S601 can be implemented through the following steps:
  • Step S6011 Obtain a scan frame with a preset size, and the size of the image to be analyzed is larger than the preset size.
  • the area corresponding to the image to be analyzed includes multiple sub-areas, and the size of the sub-areas is the same as the size of the scan frame, that is, every time the scan frame scans a position, the position corresponds to a sub-area.
  • Step S6012 Slide the scan frame over the area of the image to be analyzed to determine the probability value of the target object being present in each of the sub-regions.
  • the target object may be a hand.
  • a pre-trained target recognition model can be used to perform target recognition on sub-regions to determine the probability value of each sub-region with a target object.
  • Step S6013 Determine the sub-region with the highest probability value as the target sub-region.
  • In step S602, the target sub-region is intercepted (cropped) to obtain an intercepted image.
  • the target sub-region is intercepted to exclude other regions that do not contain the hand, so that the amount of data processing in the subsequent gesture analysis process can be reduced.
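  • A minimal sketch of the scan-frame procedure of steps S6011 to S6013 and the interception of step S602 is shown below. The hand_probability callable stands in for the pre-trained target recognition model, and the window size and stride are arbitrary assumptions; the image is assumed to be at least as large as the scan frame.

```python
import numpy as np

def detect_and_crop(depth_image, hand_probability, win=96, stride=32):
    """Slide a fixed-size scan frame over the image, score each sub-region
    with the detector, and crop the sub-region with the highest probability."""
    h, w = depth_image.shape
    best_prob, best_box = -1.0, None
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            crop = depth_image[top:top + win, left:left + win]
            prob = hand_probability(crop)      # probability that a hand is present
            if prob > best_prob:
                best_prob, best_box = prob, (top, left)
    top, left = best_box
    return depth_image[top:top + win, left:left + win], best_prob, best_box
```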
  • Step S603 Perform the hand feature extraction on the captured image to obtain the first number of finger key point features and the second number of palm key point features.
  • step S603 can be implemented through the following steps:
  • Step S6031 Perform RoI matching feature extraction on the intercepted image to obtain at least two image RoI matching features on pixels with floating-point coordinates.
  • Step S6032 Determine the RoI matching feature map according to the RoI matching features of the at least two images.
  • Here, the RoI matching feature map is determined according to the extracted image RoI matching features; that is, the extracted image RoI matching features are embedded into a feature map to form the RoI matching feature map. In this way, in the subsequent gesture analysis process, the feature extraction of the fingers and the palm can start from the RoI matching feature map rather than from the original image.
  • Step S6033 Perform two-dimensional hand posture estimation on the RoI matching feature map to determine the first number of finger key point features and the second number of palm key point features.
  • FIG. 7 is an optional flowchart of the gesture analysis method provided by the embodiment of the present application. As shown in FIG. 7, step S6033 can be implemented through the following steps:
  • In step S701, a fifth convolutional layer is used to perform convolution processing on the image RoI matching features in the RoI matching feature map to obtain a RoI matching convolution feature.
  • the fifth convolutional layer has a specific convolution kernel.
  • step S702 the sixth convolutional layer is used to perform a fifth preset number of jump connection processing on the RoI matching convolution feature to obtain a fifth jump connection feature.
  • the sixth convolutional layer has a specific convolution kernel.
  • Step S703 Perform pooling processing on the fifth jump connection feature to reduce the spatial size of the fifth jump connection feature, and determine the first number of finger key point features and the second number of palm key point features.
  • the fifth jump connection feature may be pooled through a preset fifth pooling layer.
  • The gesture analysis method provided in the embodiments of the present application can also be implemented by using a gesture analysis model; that is, a gesture analysis model is used to perform the hand feature extraction, the UV coordinate regression processing, the depth regression processing, and the gesture analysis to obtain the gesture analysis result.
  • FIG. 8 is an optional flowchart of a gesture analysis model training method provided by an embodiment of the present application. As shown in FIG. 8, the training method includes the following steps:
  • Step S801 Input the sample image into the gesture analysis model.
  • Step S802 Perform feature extraction on the sample image through the hand feature extraction network in the gesture analysis model to obtain a third number of sample first key point features and a fourth number of sample second key point features.
  • the first key point feature of the sample may be a key point feature of the sample finger
  • the second key point feature of the sample may be a key point feature of the sample palm.
  • the hand feature extraction network can include two branches, one is the finger feature extraction branch and the other is the palm feature extraction branch. Finger feature extraction is performed on the sample image through the finger feature extraction branch to obtain the third number of sample finger key point features. The palm feature extraction is performed on the sample image through the palm feature extraction branch, and the fourth number of key point features of the sample palm is obtained.
  • Step S803 Through the UV coordinate regression network in the gesture analysis model, respectively perform UV coordinate regression processing on the first key point feature and the second key point feature of each sample, to correspondingly obtain the first sample UV coordinates of each sample finger key point and the second sample UV coordinates of each sample palm key point.
  • Here, the UV coordinate regression network is used to perform UV coordinate regression processing on the sample finger key point features and the sample palm key point features to determine the UV coordinates of each sample key point (including the sample finger key points and the sample palm key points).
  • Step S804 Through the depth regression network in the gesture analysis model, the first key point feature and the second key point feature of each sample are respectively subjected to depth regression processing, to correspondingly obtain the first sample depth coordinates of each sample finger key point and the second sample depth coordinates of each sample palm key point.
  • Here, the depth regression network is used to perform depth regression processing on the sample finger key point features and the sample palm key point features to determine the depth coordinates of each sample key point.
  • In step S805, gesture analysis is performed on the first sample UV coordinates, the second sample UV coordinates, the first sample depth coordinates, and the second sample depth coordinates through the gesture analysis network in the gesture analysis model, to obtain the sample gesture analysis result.
  • Step S806 Input the sample gesture analysis result into the preset loss model to obtain the loss result.
  • the preset loss model is used to compare the sample gesture analysis result with the preset gesture analysis result to obtain the loss result, where the preset gesture analysis result may be the gesture analysis result corresponding to the sample image preset by the user.
  • the preset loss model includes a loss function, and the similarity between the sample gesture analysis result and the preset gesture analysis result can be calculated through the loss function.
  • For example, the distance between the sample gesture analysis result and the preset gesture analysis result can be calculated as a measure of their similarity, and the above loss result is determined according to the distance. When the distance between the sample gesture analysis result and the preset gesture analysis result is larger, it indicates that the training result of the model is far from the true value and further training is required; when the distance between the sample gesture analysis result and the preset gesture analysis result is smaller, it indicates that the training result of the model is closer to the true value.
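  • As an example of such a distance, the sketch below scores a sample gesture analysis result against the preset (ground-truth) result with a mean squared error over the UV and depth predictions of the finger and palm branches. Using a single MSE term for all four groups is an assumption for illustration, not the patent's loss function.

```python
import torch
import torch.nn.functional as F

def gesture_loss(pred, target):
    """Distance between the sample gesture analysis result and the preset
    result: here, the mean squared error summed over the finger/palm UV and
    depth predictions. pred and target are dicts of tensors with keys
    'finger_uv', 'palm_uv', 'finger_d', 'palm_d'."""
    return sum(F.mse_loss(pred[k], target[k])
               for k in ("finger_uv", "palm_uv", "finger_d", "palm_d"))
```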
  • Step S807 According to the loss result, correct the parameters in the hand feature extraction network, the UV coordinate regression network, the depth regression network, and the gesture analysis network to obtain a corrected gesture analysis model.
  • Here, the loss result indicates that the hand feature extraction network in the current gesture analysis model cannot accurately extract the hand features of the sample image to obtain accurate sample finger key point features and sample palm key point features, and/or that the UV coordinate regression network cannot accurately perform UV coordinate regression processing on the sample finger key point features and the sample palm key point features to obtain accurate first sample UV coordinates of the sample finger key points and second sample UV coordinates of the sample palm key points, and/or that the depth regression network cannot accurately perform depth regression processing on the sample finger key point features and the sample palm key point features to obtain accurate first sample depth coordinates of the sample finger key points and second sample depth coordinates of the sample palm key points, and/or that the gesture analysis network cannot accurately perform gesture analysis on the first sample UV coordinates, the second sample UV coordinates, the first sample depth coordinates, and the second sample depth coordinates to obtain an accurate sample gesture analysis result corresponding to the sample image. Therefore, the current gesture analysis model needs to be corrected: the parameters in at least one of the hand feature extraction network, the UV coordinate regression network, the depth regression network, and the gesture analysis network are corrected according to the above distance, until the distance between the sample gesture analysis result output by the gesture analysis model and the preset gesture analysis result meets the preset condition, at which point the corresponding gesture analysis model is determined as the trained gesture analysis model.
  • In the training method for the gesture analysis model provided by the embodiments of the present application, the sample image is input into the gesture analysis model and processed in sequence through the hand feature extraction network, the UV coordinate regression network, the depth regression network, and the gesture analysis network to obtain the sample gesture analysis result, and the sample gesture analysis result is input into the preset loss model to obtain the loss result. The parameters in at least one of the hand feature extraction network, the UV coordinate regression network, the depth regression network, and the gesture analysis network can then be corrected according to the loss result, so that the resulting gesture analysis model can accurately determine the gesture in the image to be analyzed, improving the user experience.
  • The embodiments of the present application provide a gesture analysis method. Gesture estimation for the fingers is more difficult than for the palm, because the fingers deform strongly during movement while the palm usually remains a rigid surface. Based on this observation, the embodiments of the present application separate the gesture estimation tasks of the fingers and the palm. In this separated architecture, finger features or palm features are extracted specifically for the fingers or the palm, thereby obtaining better gesture estimation performance.
  • A TOF (time-of-flight) camera is a range imaging camera system that uses time-of-flight technology to measure the round-trip time of an artificial light signal emitted by a laser or an LED, so as to resolve the distance between the camera and each point of the captured subject.
  • the TOF camera outputs an image with a frame size of HxW, and each pixel value on the two-dimensional image represents the depth value of the object (that is, the pixel value ranges from 0mm to 3000mm).
  • Fig. 9 is an example image 901 captured by a TOF camera provided by an embodiment of the present application. In the following, the image 901 captured by the TOF camera is taken as the depth image (ie, the image to be analyzed).
  • Hand detection is the process of taking the depth image as input and outputting the probability that a hand exists (for example, a number from 0 to 1, where a larger value indicates a greater probability that a hand is present, that is, a higher confidence) together with the bounding box of the hand (the bounding box represents the position and size of the hand).
  • FIG. 10 is a hand detection result including a prediction range 1001 and a hand existence probability 1002 (ie, confidence) provided by an embodiment of the present application.
  • The prediction range of the hand is expressed as (x_min, y_min, x_max, y_max), where (x_min, y_min) is the upper-left corner of the prediction range and (x_max, y_max) is the lower-right corner of the prediction range.
  • Two-dimensional gesture estimation: the depth image is input, and the two-dimensional key point positions of the hand skeleton are output.
  • An example diagram of the hand key point positions is shown in Figure 11, where positions 0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, and 17 indicate the key points of the fingers, and positions 3, 7, 11, 15, 18, and 19 indicate the key points of the palm.
  • Each key point is a two-dimensional coordinate representing a position (for example, x, y, where x is on the horizontal image axis and y is on the vertical image axis).
  • FIG. 12 includes a plurality of estimated hand key points 121.
  • Three-dimensional gesture estimation: the depth image is input, and the 3D key point positions of the hand skeleton are output; an example image of the hand key point positions is shown in Figure 11.
  • Each key point position is a three-dimensional coordinate (such as x, y, z, where x is on the horizontal image axis, y is on the vertical image axis, and z is on the depth direction).
  • The embodiments of this application address the problem of three-dimensional hand pose estimation.
  • A typical hand posture detection pipeline includes a hand detection stage and a hand posture estimation stage. As shown in Figure 13, hand detection 131 includes a backbone feature extractor 1311 and a prediction range detection head 1312, and hand pose estimation 132 includes a backbone feature extractor 1321 and a pose estimation head 1322. It should be noted that the tasks of hand detection 131 and hand posture estimation 132 are completely separated. In order to connect the two tasks, the output prediction range position is adjusted to the centroid of the pixels within the prediction range, and the size of the prediction range is slightly enlarged to include all hand pixels; that is, the size of the prediction range is adjusted by the bounding box adjustment module 133.
  • the adjusted prediction range is used to crop the original depth image, that is, the adjusted prediction range is cropped by the image cropping module 134.
  • the cropped image 135 is input into the hand pose estimation 132 task. It should be noted that when the backbone feature extractor is used to extract the image features of the initial image 130, repeated calculations will occur.
  • RoI Align eliminates the harsh quantization of RoIPool and correctly aligns the extracted features with the input.
  • The improvement proposed in the embodiments of this application is simple: it avoids any quantization of the RoI boundaries or bins (for example, x/16 can be used instead of [x/16], where x/16 represents a floating-point number and [x/16] means rounding).
  • Bilinear interpolation is used to compute the exact values of the input features at four regularly spaced sampling positions in each RoI bin, and the results are aggregated (using the maximum or the average value). Figure 14, provided by an embodiment of the application, is a schematic diagram of the principle of RoI Align: the dashed grid represents a feature map, the solid line represents an RoI (with 2×2 bins in this example), and the dots represent the 4 sampling points in each bin 141. RoI Align computes the value of each sampling point from the adjacent grid points on the feature map through bilinear interpolation, and performs no quantization on the RoI, its bins, or any coordinates involved in the sampling. It should be noted that, as long as no quantization is performed, the result is not sensitive to the exact sampling positions or to the number of sampling points.
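  • The following NumPy sketch illustrates the bilinear sampling and aggregation just described for a single-channel feature map. The function names, bin count, and sample count are illustrative; a real implementation would operate over all channels and batches.

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Value of a 2-D feature map at a floating-point location (y, x),
    bilinearly interpolated from the four neighbouring grid points."""
    h, w = fmap.shape
    y0 = min(max(int(np.floor(y)), 0), h - 1)
    x0 = min(max(int(np.floor(x)), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bot = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    return (1 - dy) * top + dy * bot

def roi_align(fmap, roi, out_size=2, samples=2, reduce="mean"):
    """RoI Align over one channel: each of the out_size x out_size bins of
    the floating-point RoI (y_min, x_min, y_max, x_max) is sampled at
    samples x samples regularly spaced points, aggregated by mean or max,
    with no quantization of any coordinate."""
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    out = np.zeros((out_size, out_size), dtype=np.float32)
    for i in range(out_size):
        for j in range(out_size):
            vals = [bilinear_sample(fmap,
                                    y_min + (i + (si + 0.5) / samples) * bin_h,
                                    x_min + (j + (sj + 0.5) / samples) * bin_w)
                    for si in range(samples) for sj in range(samples)]
            out[i, j] = np.mean(vals) if reduce == "mean" else np.max(vals)
    return out
```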
  • Non-maximum suppression (NMS): NMS is widely used in several key areas of computer vision and is a component of many detection methods, whether for edge, corner, or object detection. It is needed because detection algorithms localize the concept of interest imprecisely, producing multiple groups of detections near the true position.
  • In object detection, sliding window-based methods usually produce multiple high-scoring windows close to the correct position of the target. This is a result of the generalization ability of the object detector, the smoothness of the response function, and the visual correlation of nearby windows. Such relatively dense output is generally unsatisfactory for understanding the content of the image; in fact, the number of window hypotheses at this step is unrelated to the actual number of objects in the image. The goal of NMS is therefore to keep only one window per group, corresponding to the precise local maximum of the response function; ideally, each object is detected only once.
  • Figure 15 is a schematic diagram of NMS results provided by the embodiments of the present application and shows an example of NMS.
  • The left figure is the detection result without NMS: multiple sets of detection results 151 (i.e., the detection boxes in the figure) appear near the true position (i.e., the position of the face).
  • The right figure is the detection result with NMS: only one detection result 152 is kept at the true position.
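  • A minimal greedy NMS sketch matching the description above (keep the highest-scoring window of each group and suppress windows that overlap it too much); the 0.5 overlap threshold is an assumption:

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression over (x_min, y_min, x_max, y_max) boxes."""
        boxes = np.asarray(boxes, dtype=float)
        order = np.argsort(scores)[::-1]            # highest score first
        keep = []
        while order.size > 0:
            best = order[0]
            keep.append(int(best))
            rest = order[1:]
            x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
            inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
            area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
            area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            overlap = inter / (area_best + area_rest - inter)
            order = rest[overlap < iou_threshold]   # drop windows that overlap the kept one
        return keep
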
  • Prediction range operations: the embodiments of this application define two simple prediction range operations. As shown in Figure 16, given two prediction ranges BB1 and BB2, the intersection of BB1 and BB2, denoted BB1∩BB2, is defined as the overlapping area 161 of BB1 and BB2, and BB1∪BB2 is defined as the union area 162 of BB1 and BB2. The Intersection over Union (IoU) is illustrated in Figure 16 as the ratio between the overlapping area 161 (the dark area in Figure 16) and the union area 162.
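  • Those two box operations and the IoU ratio can be sketched directly; boxes are written as (x_min, y_min, x_max, y_max), the prediction range format used earlier:

    def intersection_area(bb1, bb2):
        """Area of BB1 ∩ BB2, the overlapping region of the two prediction ranges."""
        w = min(bb1[2], bb2[2]) - max(bb1[0], bb2[0])
        h = min(bb1[3], bb2[3]) - max(bb1[1], bb2[1])
        return max(w, 0) * max(h, 0)

    def iou(bb1, bb2):
        """Intersection over Union: |BB1 ∩ BB2| / |BB1 ∪ BB2|."""
        inter = intersection_area(bb1, bb2)
        area1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
        area2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
        return inter / (area1 + area2 - inter)
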
  • Relationship between UVD coordinates and XYZ coordinates: (x, y, z) are the coordinates in XYZ format and (u, v, d) are the coordinates in UVD format, where u and v correspond to the pixel position in the two-dimensional image.
  • d represents the depth value, that is, the depth of the coordinate point from the camera.
  • Cx and Cy denote the principal point, which ideally is located at the center of the image; the principal point is the optical center of the camera, is generally at the image center, and is expressed in the image coordinate system.
  • fx and fy are the focal lengths in the x direction and the y direction, respectively.
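  • The UVD-to-XYZ conversion formula itself is rendered as an image in the source text, so the sketch below uses the standard pinhole back-projection, which is consistent with the quantities defined above but should be read as an assumption rather than a quotation of the document's formula:

    def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
        """Back-project a UVD point (pixel position plus depth) to XYZ camera coordinates."""
        x = (u - cx) * d / fx
        y = (v - cy) * d / fy
        z = d
        return x, y, z
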
  • Classification and regression: the classification predictive modeling problem differs from the regression predictive modeling problem. Classification is the task of predicting a discrete class label; regression is the task of predicting a continuous quantity.
  • There is some overlap between classification and regression algorithms: a classification algorithm can predict continuous values, but they appear in the form of class label probabilities, and a regression algorithm can predict a discrete value, but it exists in integer form.
  • Convolutional neural network (CNN): a convolutional neural network consists of an input layer, an output layer, and multiple hidden layers.
  • The hidden layers of a CNN usually consist of a series of convolutional layers, which convolve their input by multiplication or other dot products.
  • The activation function is usually a ReLU layer, and the activation layer is followed by additional layers such as pooling layers, fully connected layers, and normalization layers; because their inputs and outputs are masked by the activation function and the final convolution, they are referred to as hidden layers.
  • The final convolution, in turn, usually involves back propagation in order to compute the weights of the final product more accurately. Although these layers are commonly called convolutions, this is only a convention; mathematically, the operation is a sliding dot product or cross-correlation. This matters for the indices in the matrix, because it affects how the weight at a specific index point is determined.
  • When designing a CNN, each convolutional layer in the network should have the following properties:
  • The input is a tensor with shape (number of images) × (image width) × (image height) × (image depth). The kernel width and height are hyperparameters, and the kernel depth must equal the image depth.
  • The convolutional layer convolves the input and passes the result to the next layer. This is similar to how neurons in the visual cortex respond to specific stimuli.
  • Each convolutional neuron processes data only for its receptive field.
  • Although fully connected feedforward neural networks can be used for feature learning and data classification, it is impractical to apply this structure to images. Even in a shallow (as opposed to deep) architecture, a very large number of neurons is required, because the input size associated with an image is very large and every pixel is a relevant variable. For example, for a (small) image of size 100x100, a fully connected layer has 10,000 weights for each neuron of the second layer.
  • The convolution operation solves this problem because it reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For example, regardless of the image size, tiling the image with 5x5 regions that all share the same weights requires only 25 learnable parameters. In this way, training with back propagation avoids the vanishing or exploding gradient problem of traditional multi-layer neural network training.
  • Pooling layer: convolutional neural networks may include local or global pooling layers to simplify the underlying computation.
  • A pooling layer reduces the dimensionality of the data by merging the outputs of a cluster of neurons in one layer into a single neuron of the next layer.
  • Local pooling combines small clusters, typically 2x2.
  • Global pooling acts on all neurons of the convolutional layer.
  • Pooling may compute a maximum or an average value.
  • Max pooling uses the maximum value of each neuron cluster in the previous layer.
  • Average pooling uses the average value of each neuron cluster in the previous layer.
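  • A small NumPy illustration of 2x2 max pooling and average pooling on a single-channel map (even height and width are assumed):

    import numpy as np

    def pool2x2(x, mode="max"):
        """2x2 pooling with stride 2 on an (H, W) array."""
        h, w = x.shape
        blocks = x.reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(pool2x2(x, "max"))    # maximum of each 2x2 neuron cluster
    print(pool2x2(x, "mean"))   # average of each 2x2 neuron cluster
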
  • Fully connected layer: a fully connected layer connects every neuron of one layer to every neuron of another layer. It is the same in principle as a traditional Multi-Layer Perceptron (MLP).
  • The flattened feature matrix is passed through a fully connected layer to classify the image.
  • The gesture analysis method provided by the embodiments of this application is similar to the Pose-REN work; the framework of the Pose guided structured Region Ensemble Network (Pose-REN) is shown in FIG. 17.
  • A simple CNN (denoted Init-CNN in the figure) predicts an initial hand pose pose0, which is used to initialize the cascaded structure.
  • Under the guidance of pose t-1, feature regions are extracted from the feature map 171 generated by the CNN and fused hierarchically using a tree structure.
  • Pose t is the refined hand pose obtained by Pose-REN and serves as the guidance for the next stage.
  • fc in the figure denotes a fully connected layer.
  • concate in the figure denotes array concatenation, used to join two or more arrays.
  • The method of the embodiments of the present application belongs to the category that uses a fully connected layer as the last layer, as Pose-REN does, to regress the coordinates. However, there are several differences.
  • First, it starts from the RoI features rather than from the original image.
  • Second, the architecture of the regression head is different (apart from the final regression layer, convolutional layers are mainly used instead of fully connected layers).
  • Third, UVD coordinates are regressed instead of XYZ coordinates.
  • The main inventive point of the embodiments of this application is a regression module for the three-dimensional hand pose estimation task that is placed after the RoIAlign feature extractor.
  • The proposed regression module reuses the feature map obtained from the hand detection task: it starts from the RoIAlign feature map instead of the original image.
  • The position of the method of the embodiments of the present application is shown in FIG. 18: the gesture estimation module 181 that implements gesture analysis is located after the RoIAlign feature extractor 182, where the backbone feature extractor 183 performs backbone feature extraction on the input initial image 180.
  • The bounding box detection module 184 performs bounding box detection on the initial image, and the bounding box selection module 185 selects a bounding box; after the bounding box is selected, the RoIAlign feature extractor 182 performs RoIAlign feature extraction.
  • FIG. 19 is a network architecture diagram of the gesture estimation module 181 provided in an embodiment of the present application.
  • The entire network includes a basic feature extractor 191, a first UV encoder 192, a first depth encoder 193, a second UV encoder 194, and a second depth encoder 195.
  • The basic feature extractor 191 extracts key point features from a 7x7x256 (height*width*channel) image feature map.
  • A 3x3x128 convolutional layer Conv1 is first applied to the image feature map to reduce the channels from 256 to 128 (saving computation).
  • The 7x7x128 feature map is then convolved with the convolutional layer Conv2 (3x3x128) to further extract basic key point features; Conv2 has a skip connection that adds the input of Conv2 to its output, and this Conv2 with its skip connection is repeated 4 times.
  • Afterwards, the 7x7x128 key point feature map is down-sampled by a factor of 2 through a pooling layer with a 3x3 kernel, namely Pool1, to a size of 3x3x128.
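  • A PyTorch sketch of the basic feature extractor as described above; the layer sizes follow the text (7x7x256 input, Conv1 reducing 256 to 128 channels, Conv2 with a skip connection applied 4 times, then a 3x3 pooling down to 3x3x128), while the padding, stride, choice of max pooling, ReLU placement, and whether the 4 repetitions share weights are all assumptions:

    import torch
    from torch import nn

    class BasicFeatureExtractor(nn.Module):
        """Sketch: 7x7x256 RoIAlign feature map -> 3x3x128 basic key point feature map."""

        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(256, 128, kernel_size=3, padding=1)  # Conv1: 256 -> 128 channels
            self.conv2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)  # Conv2, reused for the 4 repetitions
            self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)          # Pool1: 7x7 -> 3x3
            self.relu = nn.ReLU()

        def forward(self, x):                     # x: (N, 256, 7, 7)
            x = self.relu(self.conv1(x))          # (N, 128, 7, 7)
            for _ in range(4):                    # Conv2 plus its skip connection, repeated 4 times
                x = x + self.relu(self.conv2(x))
            return self.pool1(x)                  # (N, 128, 3, 3)

    print(BasicFeatureExtractor()(torch.randn(1, 256, 7, 7)).shape)     # torch.Size([1, 128, 3, 3])
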
  • The gesture estimation module 181 is divided into two branches, a finger branch and a palm branch. The finger branch has 14 key points and the palm branch has 6 key points, as shown in Figure 11, where the finger key points are 0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, and 17, and the palm key points are 3, 7, 11, 15, 18, and 19.
  • In the finger branch, the first UV encoder 192 extracts key point features for UV coordinate regression.
  • The first UV encoder 192 takes a 3x3x128 key point feature map as input; the convolutional layer Conv3 outputs a key point feature map of the same size, and a skip connection adds the input of Conv3 to the output of Conv3.
  • This Conv3 with its corresponding skip connection is repeated 4 times.
  • Afterwards, the 3x3x128 key point feature map is down-sampled by a factor of 2 through a pooling layer with a 3x3 kernel, namely Pool2, to a size of 1x1x128.
  • The fully connected layer FC1 is then used to regress the UV coordinates of the 14 finger key points.
  • In the finger branch, the first depth encoder 193 extracts key point features for depth regression.
  • The first depth encoder 193 takes a 3x3x128 key point feature map as input; the convolutional layer Conv4 outputs a key point feature map of the same size, and a skip connection adds the input of Conv4 to the output of Conv4; this Conv4 with its corresponding skip connection is repeated 4 times.
  • Afterwards, the 3x3x128 key point feature map is down-sampled by a factor of 2 through a pooling layer with a 3x3 kernel, namely Pool3, to a size of 1x1x128.
  • The fully connected layer FC2 is then used to regress the depth coordinates of the 14 finger key points.
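  • The UV encoders and depth encoders all follow the same pattern (a 3x3x128 convolution with a skip connection repeated 4 times, a 3x3 pooling down to 1x1x128, and a fully connected layer); a parameterized PyTorch sketch, where ReLU, max pooling, padding, and weight sharing across the 4 repetitions are assumptions:

    import torch
    from torch import nn

    class KeypointEncoder(nn.Module):
        """Shared pattern of the UV / depth encoders: 3x3x128 feature map -> regressed coordinates."""

        def __init__(self, num_keypoints, coords_per_point):
            super().__init__()
            self.conv = nn.Conv2d(128, 128, kernel_size=3, padding=1)   # Conv3 / Conv4 / Conv5 / Conv6 analogue
            self.pool = nn.MaxPool2d(kernel_size=3)                     # Pool2-Pool5 analogue: 3x3 -> 1x1
            self.fc = nn.Linear(128, num_keypoints * coords_per_point)  # FC1-FC4 analogue
            self.relu = nn.ReLU()

        def forward(self, x):                     # x: (N, 128, 3, 3)
            for _ in range(4):                    # conv plus its skip connection, repeated 4 times
                x = x + self.relu(self.conv(x))
            x = self.pool(x).flatten(1)           # (N, 128)
            return self.fc(x)

    finger_uv_encoder = KeypointEncoder(num_keypoints=14, coords_per_point=2)   # UV of 14 finger key points
    finger_d_encoder = KeypointEncoder(num_keypoints=14, coords_per_point=1)    # depth of 14 finger key points
    print(finger_uv_encoder(torch.randn(1, 128, 3, 3)).shape)                   # torch.Size([1, 28])
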
  • In the palm branch, the second UV encoder 194 extracts key point features for UV coordinate regression.
  • The second UV encoder 194 takes a 3x3x128 key point feature map as input.
  • The convolutional layer Conv5 outputs a key point feature map of the same size, and a skip connection adds the input of Conv5 to the output of Conv5; this Conv5 with its corresponding skip connection is repeated 4 times.
  • Afterwards, the 3x3x128 key point feature map is down-sampled by a factor of 2 through a pooling layer with a 3x3 kernel, namely Pool4, to a size of 1x1x128.
  • The fully connected layer FC3 is then used to regress the UV coordinates of the 6 palm key points.
  • In the palm branch, the second depth encoder 195 extracts key point features for depth regression.
  • The second depth encoder 195 takes a 3x3x128 key point feature map as input; the convolutional layer Conv6 outputs a key point feature map of the same size, and a skip connection adds the input of Conv6 to the output of Conv6.
  • This Conv6 with its corresponding skip connection is repeated 4 times.
  • Afterwards, the 3x3x128 key point feature map is down-sampled by a factor of 2 through a pooling layer with a 3x3 kernel, namely Pool5, to a size of 1x1x128.
  • The fully connected layer FC4 is then used to regress the depth coordinates of the 6 palm key points.
  • Through the above computation, the UVD coordinates of each finger key point and the UVD coordinates of each palm key point are obtained.
  • The UV coordinates together with the depth are then used to compute the XYZ coordinates, that is, the UVD coordinates are converted into XYZ coordinates, which completes the gesture estimation.
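  • Putting the branch outputs together as described: the 14 finger UVD triples and the 6 palm UVD triples form the 20 hand key points, each of which can then be converted to XYZ (for example with the pinhole back-projection sketched earlier); apart from the counts, the array layout below is illustrative:

    import numpy as np

    def assemble_pose(finger_uv, finger_d, palm_uv, palm_d):
        """Combine the four encoder outputs into a (20, 3) array of (u, v, d) key points."""
        finger = np.concatenate([finger_uv.reshape(14, 2), finger_d.reshape(14, 1)], axis=1)
        palm = np.concatenate([palm_uv.reshape(6, 2), palm_d.reshape(6, 1)], axis=1)
        return np.concatenate([finger, palm], axis=0)   # 14 finger points followed by 6 palm points
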
  • Based on the foregoing embodiments, an embodiment of the present application provides a gesture analysis device, which includes various modules and the units included in each module; these can be implemented by a processor in the receiving end, or by specific logic circuits. In implementation, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), or a field programmable gate array (FPGA), etc.
  • FIG. 20 is a schematic structural diagram of a gesture analysis device provided by an embodiment of the present application. As shown in FIG. 20, the gesture analysis device 200 includes:
  • the feature extraction module 201 is configured to perform feature extraction on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features;
  • The UV coordinate regression processing module 202 is configured to perform UV coordinate regression processing on each of the first key point features and each of the second key point features, and correspondingly obtain the first UV coordinates of each finger key point and the second UV coordinates of each palm key point;
  • The depth regression processing module 203 is configured to perform depth regression processing on each of the first key point features and each of the second key point features, and correspondingly obtain the first depth coordinates of each finger key point and the second depth coordinates of each palm key point;
  • the gesture analysis module 204 is configured to perform gesture analysis on the image to be analyzed according to the first UV coordinates, the first depth coordinates, the second UV coordinates, and the second depth coordinates to obtain a gesture analysis result .
  • The UV coordinate regression processing module is further configured to: separately perform UV encoding processing on each of the first key point features and each of the second key point features, and correspondingly obtain the first UV encoding feature of each finger key point and the second UV encoding feature of each palm key point; and perform full connection processing on each of the first UV encoding features and each of the second UV encoding features, and correspondingly obtain the first UV coordinates of each finger key point and the second UV coordinates of each palm key point.
  • The UV coordinate regression processing module is further configured to: use a first convolutional layer to perform convolution processing on each of the first key point features to obtain a first convolution feature; sequentially perform, through the first convolutional layer, a first preset number of skip connection processings on the first convolution feature to obtain a first skip connection feature; and perform pooling processing on the first skip connection feature to reduce its spatial size and obtain the first UV encoding feature of each finger key point.
  • The UV coordinate regression processing module is further configured to: determine the first convolution feature as the input feature of the first convolutional layer in the first skip connection processing; determine the output feature of the first convolutional layer at the N-th time as the input feature of the first convolutional layer in the N-th skip connection processing, where N is an integer greater than 1; and input each determined input feature into the first convolutional layer and sequentially perform the skip connection processing the first preset number of times to obtain the first skip connection feature.
  • The UV coordinate regression processing module is further configured to: use a second convolutional layer to perform convolution processing on each of the palm key point features to obtain a second convolution feature; sequentially perform, through the second convolutional layer, a second preset number of skip connection processings on the second convolution feature to obtain a second skip connection feature; and perform pooling processing on the second skip connection feature to reduce its spatial size and obtain the second UV encoding feature of each palm key point.
  • The depth regression processing module is further configured to: perform depth encoding processing on each of the first key point features and each of the second key point features, and correspondingly obtain the first depth encoding feature of each finger key point and the second depth encoding feature of each palm key point; and perform full connection processing on each of the first depth encoding features and each of the second depth encoding features, and correspondingly obtain the first depth coordinates of each finger key point and the second depth coordinates of each palm key point.
  • The depth regression processing module is further configured to: use a third convolutional layer to perform convolution processing on each of the finger key point features to obtain a third convolution feature; sequentially perform, through the third convolutional layer, a third preset number of skip connection processings on the third convolution feature to obtain a third skip connection feature; and perform pooling processing on the third skip connection feature to reduce its spatial size and obtain the first depth encoding feature of each finger key point.
  • The depth regression processing module is further configured to: determine the third convolution feature as the input feature of the third convolutional layer in the first skip connection processing; determine the output feature of the third convolutional layer at the M-th time as the input feature of the third convolutional layer in the M-th skip connection processing, where M is an integer greater than 1; and input each determined input feature into the third convolutional layer and sequentially perform the skip connection processing the third preset number of times to obtain the third skip connection feature.
  • The depth regression processing module is further configured to: use a fourth convolutional layer to perform convolution processing on each of the palm key point features to obtain a fourth convolution feature; sequentially perform, through the fourth convolutional layer, a fourth preset number of skip connection processings on the fourth convolution feature to obtain a fourth skip connection feature; and perform pooling processing on the fourth skip connection feature to reduce its spatial size and obtain the second depth encoding feature of each palm key point.
  • The gesture analysis module is further configured to: perform coordinate conversion on the first UV coordinates and the first depth coordinates of each finger key point to obtain the first spatial coordinates of the corresponding finger key point; perform coordinate conversion on the second UV coordinates and the second depth coordinates of each palm key point to obtain the second spatial coordinates of the corresponding palm key point; and perform gesture analysis on the image to be analyzed according to the first spatial coordinates and the second spatial coordinates to obtain a gesture analysis result.
  • The gesture analysis module is further configured to: determine a first relative positional relationship between every two finger key points and a second relative positional relationship between every two palm key points; connect the first number of finger key points and the second number of palm key points in sequence according to the first relative positional relationship and the second relative positional relationship to form a hand key point connection graph; and perform gesture analysis on the image to be analyzed according to the hand key point connection graph to obtain a gesture analysis result.
  • The feature extraction module is further configured to: perform target recognition on the image to be analyzed, so as to identify, among at least two sub-regions of the image to be analyzed, a target sub-region containing a target object;
  • crop the target sub-region to obtain a cropped image;
  • and perform the feature extraction on the cropped image to obtain the first number of first key point features and the second number of second key point features.
  • The feature extraction module is further configured to: obtain a scanning frame with a preset size, the size of the image to be analyzed being larger than the preset size; slide the scanning frame over the region of the image to be analyzed to determine the probability value that each sub-region contains the target object; and determine the sub-region with the highest probability value as the target sub-region.
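  • A sketch of the scanning-frame search described above; the window size, stride, and score_fn (whatever assigns each sub-region a probability of containing the target) are placeholders, not values specified in the document:

    import numpy as np

    def find_target_subregion(image, score_fn, window=(96, 96), stride=16):
        """Slide a fixed-size scanning frame over the image and keep the highest-scoring sub-region."""
        h, w = window
        best_score, best_box = -np.inf, None
        for y in range(0, image.shape[0] - h + 1, stride):
            for x in range(0, image.shape[1] - w + 1, stride):
                score = score_fn(image[y:y + h, x:x + w])  # probability that this sub-region holds the target
                if score > best_score:
                    best_score, best_box = score, (x, y, x + w, y + h)
        return best_box, best_score
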
  • The feature extraction module is further configured to: perform RoI matching feature extraction on the cropped image to obtain at least two image RoI matching features at pixels with floating-point coordinates; determine a RoI matching feature map according to the at least two image RoI matching features; and perform two-dimensional hand pose estimation on the RoI matching feature map to determine the first number of first key point features and the second number of second key point features.
  • The feature extraction module is further configured to: use a fifth convolutional layer to perform convolution processing on the image RoI matching features in the RoI matching feature map to obtain a RoI matching convolution feature; use a sixth convolutional layer to perform a fifth preset number of skip connection processings on the RoI matching convolution feature to obtain a fifth skip connection feature; and perform pooling processing on the fifth skip connection feature to reduce its spatial size and determine the first number of finger key point features and the second number of palm key point features.
  • The device further includes a processing module configured to use a gesture analysis model to perform the feature extraction, the UV coordinate regression processing, the depth regression processing, and the gesture analysis, so as to obtain the gesture analysis result.
  • The gesture analysis model is trained through the following steps: a sample image is input into the gesture analysis model; feature extraction is performed on the sample image through the hand feature extraction network in the gesture analysis model to obtain a third number of sample first key point features and a fourth number of sample second key point features; UV coordinate regression processing is performed on each sample first key point feature and each sample second key point feature through the UV coordinate regression network in the gesture analysis model to correspondingly obtain the first sample UV coordinates of each sample finger key point and the second sample UV coordinates of each sample palm key point; depth regression processing is performed on each sample first key point feature and each sample second key point feature through the depth regression network in the gesture analysis model to correspondingly obtain the first sample depth coordinates of each sample finger key point and the second sample depth coordinates of each sample palm key point; gesture analysis is performed on the first sample UV coordinates, the second sample UV coordinates, the first sample depth coordinates, and the second sample depth coordinates through the gesture analysis network in the gesture analysis model to obtain a sample gesture analysis result; the sample gesture analysis result is input into a preset loss model to obtain a loss result; and the parameters in the hand feature extraction network, the UV coordinate regression network, the depth regression network, and the gesture analysis network are corrected according to the loss result to obtain a corrected gesture analysis model.
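  • A minimal training-step sketch of the procedure above; the document only states that the loss model compares the sample gesture analysis result with a preset ground-truth result and that the distance between them drives the correction, so the distance-based loss and the module interface below are assumptions:

    import torch

    def train_step(model, optimizer, sample_image, target_uvd):
        """One parameter-correction step of the gesture analysis model against a preset ground truth."""
        optimizer.zero_grad()
        predicted_uvd = model(sample_image)        # (N, 20, 3) UVD coordinates of the 20 hand key points
        loss = torch.norm(predicted_uvd - target_uvd, dim=-1).mean()  # distance-based loss result
        loss.backward()                            # back-propagate to correct the network parameters
        optimizer.step()
        return loss.item()
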
  • the embodiments of the present application provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method described in the embodiment of the present application.
  • The embodiments of the present application provide a storage medium storing executable instructions; when the executable instructions are executed by a processor, they cause the processor to perform the method provided in the embodiments of the present application, for example, the method shown in FIG. 2.
  • The storage medium may be a computer-readable storage medium, for example, a Ferromagnetic Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a Compact Disk-Read Only Memory (CD-ROM); it may also be any device including one of the foregoing memories or any combination thereof.
  • The executable instructions may be in the form of a program, software, a software module, a script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as an independent program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • The executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines, or code parts).
  • The executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • In the embodiments of the present application, feature extraction is first performed on the acquired image to be analyzed to obtain a first number of first key point features and a second number of second key point features; UV coordinate regression processing is performed on each first key point feature and each second key point feature to correspondingly obtain the first UV coordinates of each finger key point and the second UV coordinates of each palm key point; depth regression processing is performed on each first key point feature and each second key point feature to correspondingly obtain the first depth coordinates of each finger key point and the second depth coordinates of each palm key point; and finally, gesture analysis is performed on the image to be analyzed according to the first UV coordinates, the first depth coordinates, the second UV coordinates, and the second depth coordinates to obtain a gesture analysis result. In this way, the accuracy of gesture analysis can be greatly improved, and the method has industrial applicability.

Abstract

本申请实施例提供一种目标对象跟踪方法、装置、设备及计算机可读存储介质,其中方法包括:对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。

Description

手势分析方法、装置、设备及计算机可读存储介质
相关申请的交叉引用
本申请基于申请号为62/938,189、申请日为2019年11月20日、申请名称为“SEPARATE FINGER AND PALM PROCESSES FOR EFFICIENT 3D HAND POSE ESTIMATION FOR A MOBILE TOF CAMERA”的在先美国临时专利申请提出,并要求该在先美国临时专利申请的优先权,该在先美国临时专利申请的全部内容在此以全文引入的方式引入本申请作为参考。
技术领域
本申请实施例涉及互联网技术领域,涉及但不限于一种手势分析方法、装置、设备及计算机可读存储介质。
背景技术
手势识别和手势分析技术应用于诸多领域,其目的是通过对图像进行分析,以估计出手部若干个关节点的坐标。由于基于图像能够准确、有效地重建人手的运动,因此有望在沉浸式虚拟现实和增强现实、机器人控制和手语识别中获得令人兴奋的新应用。
近年来,尤其是随着消费者深度相机的到来,这些应用取得了长足的进步。但是,由于不受约束的全局和局部姿势变化、频繁的遮挡、局部自相似性以及高度的关节运动,使得手势分析仍然是一项艰巨的任务,相关技术中手势分析方法的准确性有待提高。
发明内容
本申请实施例提供一种手势分析方法、装置、设备及计算机可读存储介质,将手指和手掌的手势估计任务分离开,在这种分离的架构中,分别针对于手指关键点和手掌关键点进行处理,以实现对整个手部的手势分析,如此,能够极大的提高手势分析的准确率。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种手势分析方法,包括:
对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;
分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;
分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;
根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。
本申请实施例提供一种手势分析装置,包括:
特征提取模块,用于对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;
UV坐标回归处理模块,用于分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点 的第二UV坐标;
深度回归处理模块,用于分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;
手势分析模块,用于根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。
本申请实施例提供一种手势分析设备,包括:
存储器,用于存储可执行指令;处理器,用于执行所述存储器中存储的可执行指令时,实现上述的手势分析方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行所述可执行指令时,实现上述的手势分析方法。
本申请实施例具有以下有益效果:将手指和手掌的手势估计任务分离开,在这种分离的架构中,对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;然后分别对每一第一关键点特征和每一第二关键点特征进行UV坐标回归处理和深度回归处理,并根据UV坐标回归处理和深度回归处理之后的结果对待分析图像进行手势分析,得到手势分析结果,如此,能够极大的提高手势分析的准确率。
附图说明
图1是本申请实施例提供的手势分析系统的一个可选的架构示意图;
图2是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图3是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图4是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图5是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图6是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图7是本申请实施例提供的手势分析方法的一个可选的流程示意图;
图8是本申请实施例提供的手势分析模型训练方法的一个可选的流程示意图;
图9是本申请实施例提供的由TOF摄像机捕获的一个示例图像;
图10是本申请实施例提供的包括预测范围和手存在概率的手检测结果;
图11是本申请实施例提供的手部关键点位置示例图;
图12是本申请实施例提供的二维手部姿态估计结果示例图;
图13是本申请实施例提供的手部检测和手部姿态估计过程示意图;
图14是本申请实施例提供的RoI Align的原理图;
图15是本申请实施例提供的NMS的结果示意图;
图16是本申请实施例提供的IoU的原理图;
图17是本申请实施例提供的位姿引导结构区域集成网络的框架图;
图18是本申请实施例提供的手势分析方法的流程图;
图19是本申请实施例提供的手势估计模块的网络体系结构图;
图20是本申请实施例提供的手势分析装置的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可 以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。除非另有定义,本申请实施例所使用的所有的技术和科学术语与属于本申请实施例的技术领域的技术人员通常理解的含义相同。本申请实施例所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
为了更好地理解本申请实施例中提供的目标对象跟踪方法,首先对本申请实施例提供的手势分析系统进行说明:
参见图1,图1是本申请实施例提供的手势分析系统10的一个可选的架构示意图。为实现对待分析图像中的手部进行手势分析,本申请实施例提供的手势分析系统10中包括终端100、网络200和服务器300,其中,终端100上运行有视频播放应用或者具有视频录制单元或者运行有图像显示应用,通过视频播放应用播放视频录制单元实时录制的视频或者预先录制的视频,并通过本申请实施例的方法,将视频中的每一帧视频帧作为待分析图像,对待分析图像中的手部进行手势分析,或者,对图像显示应用所显示的待分析图像进行手势分析。
本申请实施例的方法中,在获取到待分析图像之后,终端100通过网络200向服务器300发送待分析图像;服务器300对待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;然后分别对每一第一关键点特征和每一第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;并分别对每一第一关键点特征和每一第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;最后,根据第一UV坐标、第一深度坐标、第二UV坐标和第二深度坐标,对待分析图像进行手势分析,得到手势分析结果。服务器300在得到手势分析结果之后,将手势分析结果发送给终端100,终端100在当前界面100-1上显示标记有手势分析结果的标记图像或直接显示手势分析结果。通过本申请实施例的方法,能够极大的提高手势分析的准确率。
下面说明本申请实施例的手势分析设备的示例性应用,在一种实现方式中,本申请实施例提供的手势分析设备可以实施为笔记本电脑,平板电脑,台式计算机,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、智能机器人、智能视频监控等任意的终端,在另一种实现方式中,本申请实施例提供的手势分析设备还可以实施为服务器。下面,将说明手势分析设备实施为服务器时的示例性应用。
图2是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图2所示,方法包括以下步骤:
步骤S201,对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征。
这里,待分析图像中具有手部图像,对获取的待分析图像进行特征提取可以是进行手部特征提取。在对待分析图像进行手部特征提取之前,首先进行手部位置识别,以确定手部所在的区域,从而对该区域的图像进行分析和识别,以确定该区域的手部的手势。
在一些实施例中,在进行手部位置识别时,可以采用预先训练好的手部检测模型来实现,通过手部检测模型检测并输出手部在每一位置(可以是待分析图像的整个区域中的任一子区域,也称包围盒或边界框)存在的概率值,并将具有最大概率值的子区域确定为手部所在的区域。
本申请实施例中,在识别出手部所在的子区域之后,对该子区域进行手部特征提取,以得到第一数量的第一关键点特征和第二数量的第二关键点特征,其中,第一关键点特征可以是手指关键点特征,第二关键点特征可以是手掌关键点特征,其中,第一数量和第二数量可以是任意的正整数。手部特征提取可以采用预先训练好的手部特征提取模型来实现,其中,手部特征提取模型在使用过程中,可以将具有手部的深度图像输入至手部特征 提取模型中,模型内部对深度图像进行识别,以确定出深度图像中的手部的至少一个关键点,且这些关键点不仅包括手指关键点,还包括手掌关键点。
在一些实施例中,还可以采用人工智能技术实现本申请实施例的方法,即采用人工智能技术识别手部所在的子区域,以及采用人工智能技术识别手指关键点和手掌关键点。
手指关键点特征是对手指关键点进行图像特征提取所得到的图像特征,手掌关键点特征是对手掌关键点进行图像特征提取所得到的图像特征。
步骤S202,分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标。
这里,UV坐标回归处理用于确定手指关键点和手掌关键点的UV坐标,UV坐标是相对于XYZ坐标的坐标。
步骤S203,分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标。
这里,深度回归处理用于确定手指关键点和手掌关键点的深度坐标,深度坐标也是相对于XYZ坐标的坐标,UV坐标与深度坐标共同形成手指关键点和手掌关键点的UVD坐标。
步骤S204,根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。
这里,第一UV坐标和第一深度坐标形成手指关键点的UVD坐标,第二UV坐标和第二深度坐标形成手掌关键点的UVD坐标,本申请实施例采用UVD坐标来表征手指和手掌的位置,以实现对手部进行手势识别和分析。
在一些实施例中,手势分析结果包括每一手指关键点的UVD坐标和每一手掌关键点UVD坐标,或者,手势分析结果中还包括根据每一手指关键点的UVD坐标和每一手掌关键点UVD坐标所确定出的手部的手势结构图。
本申请实施例提供的手势分析方法,将手指和手掌的手势估计任务分离开,在这种分离的架构中,对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;然后分别对每一第一关键点特征和每一第二关键点特征进行UV坐标回归处理和深度回归处理,并根据UV坐标回归处理和深度回归处理之后的结果对待分析图像进行手势分析,得到手势分析结果,如此,能够极大的提高手势分析的准确率。
在一些实施例中,手势分析系统中至少包括终端和服务器,终端上运行有视频播放应用,可以采用本申请实施例的方法,对视频播放应用所播放的视频中的每一视频帧中的手部进行手势分析,或者,终端上具有视频录制单元,通过视频录制单元实时录制视频,并采用本申请实施例的方法对实时录制的视频中的每一视频帧中的手部进行手势分析,或者,终端上具有图像拍摄单元,通过图像拍摄单元拍摄图像,并采用本申请实施例的方法,对拍摄的图像中的手部进行手势分析,或者,终端上运行有图像显示应用,可以采用本申请实施例的方法对图像显示应用所显示的图像中的手部进行手势分析。
下面以对终端上的图像进行手势分析,且以手势分析时的特征提取过程为手势特征提取,手势特征提取得到第一数量的手指关键点特征和第二数量的手掌关键点特征为例,对本申请实施例的方法进行说明,图3是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图3所示,方法包括以下步骤:
步骤S301,终端获取待分析图像。
这里,终端可以在网络上下载待分析图像,也可以采用图像拍摄单元实时拍摄待分析图像,或者还可以将接收到的图像作为待分析图像。
步骤S302,判断待分析图像上是否具有手部。
这里,可以采用预先训练好的手部识别模型对待分析图像进行识别。当识别结果显示待分析图像上的任一子区域中具有手部的概率值大于阈值时,表明该子区域中具有手部, 从而确定出待分析图像上具有手部;当识别结果显示待分析图像上的每一子区域中具有手部的概率值均小于阈值时,表明待分析图像上不具有手部。
如果判断结果为是,则执行步骤S303,如果判断结果为否,则返回继续执行步骤S301。
步骤S303,终端将待分析图像发送给服务器。
步骤S304,服务器对待分析图像进行手部特征提取,得到第一数量的手指关键点特征和第二数量的手掌关键点特征。
步骤S305,服务器分别对每一手指关键点特征和每一手掌关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标。
步骤S306,服务器分别对每一手指关键点特征和每一手掌关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标。
步骤S307,服务器根据第一UV坐标、第一深度坐标、第二UV坐标和第二深度坐标,对待分析图像进行手势分析,得到手势分析结果。
需要说明的是,步骤S304至步骤S307与上述的步骤S201至步骤S204相同,本申请实施例不再赘述。
步骤S308,服务器将手势分析结果发送给终端。
步骤S309,终端在当前界面上显示手势分析结果。
本申请实施例提供的手势分析方法,终端获取待分析图像,并将待分析图像发送给服务器进行分析和识别,当分析出待分析图像中手部的手势后,将手势分析结果反馈给终端,并在终端的当前界面上显示,如此,通过终端与服务器之间的交互,能够实现对终端实时获取的图像进行实时的手势分析,提高了用户的使用体验,并且,由于服务器在进行手势分析是,是将手指和手掌的手势估计任务分离开,在这种分离的架构中,分别针对于手指关键点和手掌关键点进行处理,以实现对整个手部的手势分析,如此,能够极大的提高手势分析的准确率。
基于图2,图4是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图4所示,步骤S202可以通过以下步骤实现:
步骤S401,分别对每一所述手指关键点特征和每一所述手掌关键点特征,进行UV编码处理,对应得到每一所述手指关键点的第一UV编码特征和每一所述手掌关键点的第二UV编码特征。
需要说明的是,本申请实施例的UV编码处理是分别针对于手指关键点特征和手掌关键点特征进行的,且对手指关键点特征进行UV编码处理的处理过程与对手掌关键点特征进行UV编码处理的处理过程相同。
在一些实施例中,步骤S401中对每一所述手指关键点特征进行UV编码处理,得到每一所述手指关键点的第一UV编码特征,可以通过以下步骤实现:
步骤S4011,采用第一卷积层对每一所述手指关键点特征进行卷积处理,得到第一卷积特征。
这里,第一卷积层具有特定卷积核,第一卷积层卷积核的数量可以是预设值,也可以是通过训练得到的。
步骤S4012,通过所述第一卷积层对所述第一卷积特征依次进行第一预设次数的跳跃连接处理,得到第一跳跃连接特征。
这里,跳跃连接处理可以解决网络层数较深的情况下梯度消失的问题,同时有助于梯度的反向传播,加快图像处理过程。
在一些实施例中,步骤S4012可以通过以下步骤实现:
步骤S4012a,将所述第一卷积特征确定为所述第一卷积层在第一次跳跃连接处理时的输入特征。
这里,将第一卷积层对手指关键点特征进行卷积处理所得到第一卷积特征确定为第一 次跳跃连接处理时的输入特征,即跳跃连接处理连接在第一卷积层之后,在第一卷积层进行卷积处理之后,即进行跳跃连接处理。
步骤S4012b,将所述第一卷积层在第N次的输出特征,确定为所述第一卷积层在第N次跳跃连接处理的输入特征,其中,N为大于1的整数。
在跳跃连接处理过程中,是将第一卷积层的输出跳跃连接至第一卷积层的输入位置,那么,在第N次跳跃连接处理时,输入特征即第一卷积层在第N次的输出特征。
步骤S4012c,将所确定出的每一次的所述输入特征,输入至所述第一卷积层中,依次进行所述第一预设次数的所述跳跃连接处理,得到所述第一跳跃连接特征。
整个跳跃连接处理过程是:在第一卷积层对手指关键点特征进行卷积处理得到第一卷积特征之后,将第一卷积特征输入至第一卷积层中进行第一次跳跃连接处理,得到第一次跳跃连接处理的输出特征,然后,将第一次跳跃连接处理的输出特征作为第二次跳跃连接处理的输入特征,输入至第一卷积层中进行第二次跳跃连接处理,得到第二次跳跃连接处理的输出特征,然后,将第二次跳跃连接处理的输出特征,作为第三次跳跃连接处理的输入特征输入至第一卷积层中进行第三次跳跃连接处理……以此类推,直至完成第一预设次数的跳跃连接处理,得到第一跳跃连接特征。
步骤S4013,对所述第一跳跃连接特征进行池化处理,以降低所述第一跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一UV编码特征。
这里,可以通过预设的第一池化层对所述第一跳跃连接特征进行池化处理。池化处理即下采样处理,池化处理用于降低第一跳跃连接特征的空间尺寸。
在一些实施例中,步骤S401中对每一所述手掌关键点特征,进行UV编码处理,得到每一所述手掌关键点的第二UV编码特征,可以通过以下步骤实现:
步骤S4014,采用第二卷积层对每一所述手掌关键点特征进行卷积处理,得到第二卷积特征。
这里,第二卷积层具有特定卷积核,第二卷积层卷积核的数量可以是预设值,也可以是通过训练得到的。
步骤S4015,通过所述第二卷积层对所述第二卷积特征依次进行第二预设次数的跳跃连接处理,得到第二跳跃连接特征。
在一些实施例中,步骤S4015可以通过以下步骤实现:
步骤S4015a,将所述第二卷积特征确定为所述第二卷积层在第一次跳跃连接处理时的输入特征。
步骤S4015b,将所述第二卷积层在第K次的输出特征,确定为所述第二卷积层在第K次跳跃连接处理的输入特征,其中,K为大于1的整数。
步骤S4015c,将所确定出的每一次的所述输入特征,输入至所述第二卷积层中,依次进行所述第二预设次数的所述跳跃连接处理,得到所述第二跳跃连接特征。
需要说明的是,步骤S4015a至步骤S4015c中的第二预设次数的跳跃连接处理的处理过程,与上述第一预设次数的跳跃连接处理的处理过程相同,请参照上述步骤S4012a至步骤S4012c的解释,本申请实施例不再赘述。第一预设次数与第二预设次数可以相同也可以不同,第一预设次数和第二预设次数可以根据数据处理需求和数据处理量来确定。
步骤S4016,对所述第二跳跃连接特征进行池化处理,以降低所述第二跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二UV编码特征。
这里,可以通过预设的第二池化层对所述第二跳跃连接特征进行池化处理。
步骤S402,分别对每一所述第一UV编码特征和每一所述第二UV编码特征,进行全连接处理,对应得到每一所述手指关键点的第一UV坐标和每一所述手掌关键点的第二UV坐标。
请继续参照图4,步骤S203可以通过以下步骤实现:
步骤S403,分别对每一所述手指关键点特征和每一所述手掌关键点特征,进行深度编码处理,对应得到每一所述手指关键点的第一深度编码特征和每一所述手掌关键点的第二深度编码特征。
需要说明的是,本申请实施例的深度编码处理是分别针对于手指关键点特征和手掌关键点特征进行的,且对手指关键点特征进行深度编码处理的处理过程与对手掌关键点特征进行深度编码处理的处理过程相同。
在一些实施例中,步骤S403中对每一所述手指关键点特征进行深度编码处理,对应得到每一所述手指关键点的第一深度编码特征,可以通过以下步骤实现:
步骤S4031,采用第三卷积层对每一所述手指关键点特征进行卷积处理,得到第三卷积特征。
这里,第三卷积层具有特定卷积核,第三卷积层卷积核的数量可以是预设值,也可以是通过训练得到的。
步骤S4032,通过所述第三卷积层对所述第三卷积特征依次进行第三预设次数的跳跃连接处理,得到第三跳跃连接特征。
在一些实施例中,步骤S4032可以通过以下步骤实现:
步骤S4032a,将所述第三卷积特征确定为所述第三卷积层在第一次跳跃连接处理时的输入特征。
步骤S4032b,将所述第三卷积层在第M次的输出特征,确定为所述第三卷积层在第M次跳跃连接处理的输入特征,其中,M为大于1的整数。
步骤S4032c,将所确定出的每一次的所述输入特征,输入至所述第三卷积层中,依次进行所述第三预设次数的所述跳跃连接处理,得到所述第三跳跃连接特征。
需要说明的是,步骤S4032a至步骤S4032c中的第三预设次数的跳跃连接处理的处理过程,与上述第一预设次数的跳跃连接处理的处理过程和第二预设次数的跳跃连接处理的处理过程均相同,请参照上述步骤S4012a至步骤S4012c的解释,本申请实施例不再赘述。
步骤S4033,对所述第三跳跃连接特征进行池化处理,以降低所述第三跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一深度编码特征。
这里,可以通过预设的第三池化层对所述第三跳跃连接特征进行池化处理。
在一些实施例中,步骤S403中对每一所述手掌关键点特征进行深度编码处理,对应得到每一所述手掌关键点的第二深度编码特征,可以通过以下步骤实现:
步骤S4034,采用第四卷积层对每一所述手掌关键点特征进行卷积处理,得到第四卷积特征。其中,第四卷积层具有特定卷积核。
步骤S4035,通过所述第四卷积层对所述第四卷积特征依次进行第四预设次数的跳跃连接处理,得到第四跳跃连接特征。
在一些实施例中,步骤S4035可以通过以下步骤实现:
步骤S4035a,将所述第四卷积特征确定为所述第四卷积层在第一次跳跃连接处理时的输入特征。
步骤S4035b,将所述第四卷积层在第L次的输出特征,确定为所述第四卷积层在第L次跳跃连接处理的输入特征,其中,L为大于1的整数。
步骤S4035c,将所确定出的每一次的所述输入特征,输入至所述第四卷积层中,依次进行所述第四预设次数的所述跳跃连接处理,得到所述第四跳跃连接特征。
需要说明的是,步骤S4035a至步骤S4035c中的第四预设次数的跳跃连接处理的处理过程,与上述第一预设次数的跳跃连接处理的处理过程和第二预设次数的跳跃连接处理的处理过程和第三预设次数的跳跃连接处理的处理过程均相同,请参照上述步骤S4012a至步骤S4012c的解释,本申请实施例不再赘述。第三预设次数与第四预设次数可以相同也可以不同,第三预设次数和第四预设次数可以根据数据处理需求和数据处理量来确定。
步骤S4036,对所述第四跳跃连接特征进行池化处理,以降低所述第四跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二深度编码特征。
这里,可以通过预设的第四池化层对所述第四跳跃连接特征进行池化处理。
步骤S404,分别对每一所述第一深度编码特征和每一所述第二深度编码特征,进行全连接处理,对应得到每一所述手指关键点的第一深度坐标和每一所述手掌关键点的第二深度坐标。
基于图2,图5是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图5所示,步骤S204可以通过以下步骤实现:
步骤S501,对每一所述手指关键点的所述第一UV坐标和所述第一深度坐标进行坐标转换,得到对应手指关键点的第一空间坐标。
这里,坐标转换是指将UVD坐标转换为XYZ坐标,其中,手指关键点的UVD坐标是由第一UV坐标和第一深度坐标确定的,即,第一UV坐标和第一深度坐标共同形成手指关键点的UVD坐标。第一空间坐标是手指关键点在XYZ坐标的表示。
本申请实施例中,将UVD坐标转换为XYZ坐标,可以通过以下公式(1-1)实现:
Figure PCTCN2020128469-appb-000001
其中,(x,y,z)是XYZ格式的坐标,(u,v,d)是UVD格式的坐标,其中,u和v对应的是二维图像的像素值,d表示深度值(depth),即该坐标点距离相机的深度值。Cx和Cy代表主点,理想情况下应该位于图像的中心,其中主点是相机的光心,一般位于图像的中心,是在图像坐标系下。fx和fy分别是x方向和y方向上的焦距。
步骤S502,对每一所述手掌关键点的所述第二UV坐标和所述第二深度坐标进行坐标转换,得到对应手掌关键点的第二空间坐标。
这里,手掌关键点的UVD坐标是由第二UV坐标和第二深度坐标确定的,即,第二UV坐标和第二深度坐标共同形成手掌关键点的UVD坐标。本申请实施例中,可以通过上述公式(1-1)实现对每一手掌关键点的第二UV坐标和第二深度坐标进行坐标转换。第二空间坐标是手掌关键点在XYZ坐标的表示。
步骤S503,根据所述第一空间坐标和所述第二空间坐标,对所述待分析图像进行手势分析,得到手势分析结果。
这里,采用XYZ坐标表示来进行手势分析,能够得到手部各个关键点在三维坐标上的位置,从而得到准确的手势分析结果。
在一些实施例中,步骤S503可以通过以下步骤实现:
步骤S5031,确定每两个手指关键点之间的第一相对位置关系、和每两个手掌关键点之间的第二相对位置关系。
这里,第一相对位置关系是每两个手指关键点之间的相对位置关系,例如,同一手指上相邻的两个手指关键点之间的第一相对位置关系是这两个手指关键点相邻且可以直接连接;分别位于两个手指上的两个手指关键点之间的第一相对位置关系是这两个手指关键点不可以直接连接。
第二手掌关键点之间的第二相对位置关系是每两个手掌关键点之间的相对位置关系,例如,手掌上相邻的两个位置的手掌关键点可以直接连接,手掌上不相邻的两个位置的手掌关键点不可以直接连接。
步骤S5032,根据所述第一相对位置关系和所述第二相对位置关系,依次连接所述第一数量的手指关键点和所述第二数量的手掌关键点,形成手部关键点连接图。
这里,手部关键点连接图中包括每一关键点的XYZ坐标。
步骤S5033,根据所述手部关键点连接图对所述待分析图像进行手势分析,得到手势分析结果。
这里,通过手部关键点连接图可以确定出每一手指的形状和手掌的形状,从而确定出手部的手势分析结果。
基于图2,图6是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图6所示,步骤S201可以通过以下步骤实现:
步骤S601,对所述待分析图像进行目标识别,以实现在所述待分析图像的至少两个子区域中识别出具有目标对象的目标子区域。
在一些实施例中,步骤S601可以通过以下步骤实现:
步骤S6011,获取具有预设尺寸的扫描框,所述待分析图像的尺寸大于所述预设尺寸。
这里,待分析图像对应的区域包括多个子区域,子区域的尺寸与扫描框的尺寸相同,即扫描框每扫描到一个位置,该位置对应一个子区域。
步骤S6012,通过在所述待分析图像的区域上滑动所述扫描框,以确定出每一所述子区域中具有所述目标对象的概率值。
这里,对于扫描框滑动到的子区域,确定该子区域中是否具有目标对象,本申请实施例中,目标对象可以是手部。可以通过预先训练好的目标识别模型对子区域进行目标识别,以确定出每一子区域具有目标对象的概率值。
步骤S6013,将具有最高概率值的子区域确定为所述目标子区域。
步骤S602,对所述目标子区域进行截取,得到截取后的图像。
这里,将目标子区域截取到,以剔除不包含手部的其他区域,从而能够减小后续手势分析过程的数据处理量。
步骤S603,对所述截取后的图像进行所述手部特征提取,得到所述第一数量的所述手指关键点特征和所述第二数量的所述手掌关键点特征。
在一些实施例中,步骤S603可以通过以下步骤实现:
步骤S6031,对所述截取后的图像进行RoI匹配特征提取,以获得坐标为浮点数的像素点上的至少两个图像RoI匹配特征。
步骤S6032,根据所述至少两个图像RoI匹配特征,确定RoI匹配特征图。
这里,根据所提取到的图像RoI匹配特征确定RoI匹配特征图,即将提取到的图像RoI匹配特征嵌入至一特征图中,形成RoI匹配特征图,如此,在后续的手势分析过程中,可以从RoI匹配特征图开始进行手指和手掌的特征提取,而无需从原始图像开始。
步骤S6033,对所述RoI匹配特征图进行二维手部姿态估计,以确定出所述第一数量的所述手指关键点特征和所述第二数量的所述手掌关键点特征。
图7是本申请实施例提供的手势分析方法的一个可选的流程示意图,如图7所示,步骤S6033可以通过以下步骤实现:
步骤S701,采用第五卷积层,对所述RoI匹配特征图中的所述图像RoI匹配特征进行卷积处理,得到RoI匹配卷积特征。其中,第五卷积层具有特定卷积核。
步骤S702,采用第六卷积层,对所述RoI匹配卷积特征进行第五预设次数的跳跃连接处理,得到第五跳跃连接特征。其中,第六卷积层具有特定卷积核。
步骤S703,对所述第五跳跃连接特征进行池化处理,以降低所述第五跳跃连接特征的空间尺寸,确定出所述第一数量的所述手指关键点特征和所述第二数量的所述手掌关键点特征。这里,可以通过预设的第五池化层对所述第五跳跃连接特征进行池化处理。
在一些实施例中,本申请实施例提供的手势分析方法还可以采用手势分析模型来实现,即,采用手势分析模型进行所述手部特征提取、所述UV坐标回归处理、所述深度回归处理和所述手势分析,以得到所述手势分析结果。
图8是本申请实施例提供的手势分析模型训练方法的一个可选的流程示意图,如图8所示,训练方法包括以下步骤:
步骤S801,将样本图像输入至所述手势分析模型中。
步骤S802,通过所述手势分析模型中的手部特征提取网络,对所述样本图像进行特征提取,得到第三数量的样本第一关键点特征和第四数量的样本第二关键点特征。
这里,样本第一关键点特征可以是样本手指关键点特征,样本第二关键点特征可以是样本手掌关键点特征。手部特征提取网络中可以包括两个分支,一个为手指特征提取分支,一个为手掌特征提取分支,通过手指特征提取分支对样本图像进行手指特征提取,得到第三数量的样本手指关键点特征,通过手掌特征提取分支对样本图像进行手掌特征提取,得到第四数量的样本手掌关键点特征。
步骤S803,通过所述手势分析模型中的UV坐标回归网络,分别对每一所述样本第一关键点特征和每一所述样本第二关键点特征进行UV坐标回归处理,对应得到每一样本手指关键点的第一样本UV坐标和每一样本手掌关键点的第二样本UV坐标。
UV坐标回归网络用于对样本手指关键点特征和样本手指关键点特征进行UV坐标回归处理,以确定出每一样本关键点(包括样本手指关键点和样本手掌关键点)的UV坐标。
步骤S804,通过所述手势分析模型中的深度回归网络,分别对每一所述样本第一关键点特征和每一所述样本第二关键点特征进行深度回归处理,对应得到每一样本手指关键点的第一样本深度坐标和每一样本手掌关键点的第二样本深度坐标。
深度回归网络用于对样本手指关键点特征和样本手指关键点特征进行深度回归处理,以确定出每一样本关键点的深度坐标。
步骤S805,通过所述手势分析模型中的手势分析网络,对所述第一样本UV坐标、所述第二样本UV坐标、所述第一样本深度坐标和所述第二样本深度坐标进行手势分析,得到样本手势分析结果。
步骤S806,将样本手势分析结果输入至预设损失模型中,得到损失结果。
这里,预设损失模型用于将样本手势分析结果与预设的手势分析结果进行比较,得到损失结果,其中,预设的手势分析结果可以是用户预先设置的与样本图像对应的手势分析结果。
本申请实施例中,预设损失模型中包括损失函数,通过损失函数可以计算样本手势分析结果与预设的手势分析结果之间的相似度,在计算过程中,可以通过计算样本手势分析结果与预设的手势分析结果之间的距离,并根据距离确定上述损失结果。当样本手势分析结果与预设的手势分析结果之间的距离越大时,表明模型的训练结果与真实值的差距较大,需要进行进一步的训练;当样本手势分析结果与预设的手势分析结果之间的距离越小时,表明模型的训练结果更加接近真实值。
步骤S807,根据所述损失结果,对所述手部特征提取网络、所述UV坐标回归网络、所述深度回归网络和所述手势分析网络中的参数进行修正,得到修正后的手势分析模型。
这里,当上述距离大于预设距离阈值时,则损失结果表明当前的手势分析模型中的手部特征提取网络,不能准确的对样本图像进行手部特征提取,得到样本图像的准确的样本手指关键点特征和样本手掌关键点特征,和/或,UV坐标回归网络不能准确的对样本手指关键点特征和样本手掌关键点特征进行UV坐标回归处理,得到准确的手指关键点的第一样本UV坐标和样本手掌关键点的第二样本UV坐标,和/或,深度回归网络不能准确的对样本手指关键点特征和样本手掌关键点特征进行深度回归处理,得到准确的样本手指关键点的第一样本深度坐标和样本手掌关键点的第二样本深度坐标,和/或,手势分析网络不能准确的对第一样本UV坐标、第二样本UV坐标、第一样本深度坐标和第二样本深度坐标进行手势分析,得到样本图像对应的准确的样本手势分析结果。因此,需要对当前的手势分析模型进行修正。那么,可以根据上述距离,对手部特征提取网络、UV坐标回归网络、深度回归网络和手势分析网络中的至少一个中的参数进行修正,直至手势分析模型输出的样本手势分析结果与预设的手势分析结果之间的距离满足预设条件时,将对应的手势分析模型确定为训练好的手势分析模型。
本申请实施例提供的手势分析模型的训练方法,由于将样本图像输入至手势分析模型中,依次通过手部特征提取网络、UV坐标回归网络、深度回归网络和手势分析网络对样本图像进行处理,得到样本手势分析结果,并将样本手势分析结果输入至预设损失模型中,得到损失结果。因此,能够根据损失结果对手部特征提取网络、UV坐标回归网络、深度回归网络和手势分析网络中的至少一个中的参数进行修正,所得到的手势分析模型能够准确的确定出待分析图像的手势,提高用户的使用体验。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
本申请实施例提供一种手势分析方法,由于手指的姿态估计比手掌的姿态估计困难,因为手指在移动过程中高度变形,而手掌通常保持一个刚性表面。通过这样的发现,本申请实施例将手指和手掌的姿态估计任务分离开来。在这种分离的架构中,专门针对手指或手掌提取了手指特征或手掌特征,从而获得了更好的手势估计性能。
在解释本申请实施例的方法之前,首先对本申请实施例所涉及的技术进行说明。
1)飞行时间(TOF,Time-of-Flight)相机:TOF相机是一种范围成像摄像机系统,采用飞行时间技术,通过测量激光器或者LED发出的人造光信号的往返时间,从而解析得到图像上拍摄主体的每一点与相机之间的距离。TOF相机输出一帧大小为HxW的图像,二维图像上的每个像素值代表对象的深度值(即像素值范围为0mm~3000mm)。图9是本申请实施例提供的由TOF摄像机捕获的一个示例图像901。下面,将TOF相机捕捉到的图像901作为深度图像(即待分析图像)。
2)手的检测:手的检测是这样一个过程:输入深度图像,然后输出手存在的概率(例如,概率可以是一个数值从0到1的数字,数值越大表示手存在的概率越大,即置信度越大),和一个手的预测范围(bounding box)(例如,该预测范围表示了手的位置和大小)。图10是本申请实施例提供的包括预测范围1001和手存在概率1002(即置信度)的手检测结果。本申请实施例中,手的预测范围表示为(x min,y min,x max,y max),其中(x min,y min)为预测范围的左上角,(x max,y max)为预测范围的右下角。
3)二维手势估计:输入深度图像,然后输出手部骨架的二维关键点位置,手部关键点位置示例图如图11所示,其中,位置0、1、2、4、5、6、8、9、10、12、13、14、16、17表示手指的关键点,位置3、7、11、15、18、19表示手掌的关键点。每个关键点都是一个表示位置的二维坐标(例如x、y,其中x在水平图像轴上,y在垂直图像轴上)。二维手部姿态估计结果如图12所示,图中包括估计出的多个手部关键点121。
4)三维手势估计:输入深度图像,输出手部骨架的3D关键点位置,手部关键点位置示例图像如图11所示。每个关键点位置都是一个三维坐标(如x、y、z,其中x在水平图像轴上,y在垂直图像轴上,z在深度方向上)。本申请实施例即研究三维手位姿估计问题。
5)手势检测流程:典型的手部姿态检测流程包括:手部检测和手部姿态估计过程,如图13所示,手部检测131包括骨干特征提取器1311和预测范围检测头1312,手部姿态估计132包括骨干特征提取器1321和姿态估计头1322。需要说明的是,手部检测131和手部姿态估计132的任务是完全分离的。为了连接两个任务,将输出的预测范围位置调整为预测范围内像素的质心,并将预测范围的大小稍微放大,以包含所有的手像素,即通过边界框调整模块133对预测范围的大小进行调整。调整后的预测范围用于裁剪原始深度图像,即通过图像裁剪模块134对调整后的预测范围进行裁剪。将裁剪后的图像135输入到手部姿态估计132任务中。需要说明的是,当使用骨干特征提取器提取初始图像130的图像特征时,会出现重复计算现象。
6)注意力区域匹配(RoI Align,Rage of Interesting Alignment):RoI Align层消除了RoIPool的苛刻量化,正确地将提取的特征与输入对齐。本申请实施例提议的改进很简单:避免了对RoI边界或箱子(bins)进行任何量化(例如,可以使用x/16而不是[x/16],这里的x/16表示浮点数,[x/16]表示取整)。使用双线性插值计算法来计算每个RoI bin中四个 定期采样位置的输入特征的精确值,并汇总结果(使用最大值或平均值),如图14所示,是本申请实施例提供的RoI Align的原理图,虚线网格表示一个特征图,实线表示一个ROI(在本例中有2×2个箱子),图中的点表示每个箱子中的4个采样点141,RoI Align从特征图上邻近的网格点通过双线性插值计算每个采样点的值,对RoI、它的容器或采样点所涉及的任何坐标都不执行量化。需要说明的是,只要不执行量化,结果对精确的采样位置或采样的点数不敏感。
7)非极大值抑制(NMS,Non-maximum suppression):NMS在计算机视觉的几个关键方面得到了广泛的应用,它是许多被提出的检测方法的一个组成部分,可能是边缘、角或目标检测。它的必要性是由于检测算法对感兴趣的概念进行定位的能力不强,导致在真实位置附近出现多组检测结果。
在目标检测中,基于滑动窗口的方法通常会产生多个靠近目标正确位置的高分窗口。这是物体探测器的泛化能力、响应函数的平滑性和近处窗口的视觉相关性的结果。这种相对密集的输出对于理解图像的内容通常不能令人满意。事实上,这一步中窗口假设的数量与图像中物体的真实数量是不相关的。因此,NMS的目标是每个组只保留一个窗口,对应于响应函数的精确局部最大值,理想情况下每个对象只获得一次检测。图15是本申请实施例提供的NMS的结果示意图,图15显示了NMS的一个示例,其中,左图中是不采用NMS技术进行检测的结果,会导致在真实位置(即人脸位置)附近出现多组检测结果151(即图中的检测框);右图是采用NMS技术进行检测的结果,在真实位置只保留一个检测结果152。
8)预测范围操作:本申请实施例定义了两个简单的预测范围操作,如图16所示,给定两个预测范围BB1和BB2,其中,BB1和BB2的交集表示为BB1∩BB2,被定义为BB1和BB2的重叠区域161;BB1∪BB2被定义为BB1和BB2的统一区域162,交并比(IoU,Intersection over Union)在图16中表示,即图16中深色区域的重叠区域161与统一区域162之间的比值
9)UVD坐标和XYZ坐标之间的关系:UVD坐标和XYZ坐标之间的关系采用以下公式(2-1)进行UVD到XYZ的转换:
Figure PCTCN2020128469-appb-000002
其中,(x,y,z)是XYZ格式的坐标,(u,v,d)是UVD格式的坐标,其中,u和v对应的是二维图像的像素值,d表示深度值(depth),即该坐标点距离相机的深度值。Cx和Cy代表主点,理想情况下应该位于图像的中心,其中主点是相机的光心,一般位于图像的中心,是在图像坐标系下。fx和fy分别是x方向和y方向上的焦距
10)分类和回归:分类预测建模问题不同于回归预测建模问题。分类是预测一个离散类标签的任务;回归是预测连续数量的任务。
分类和回归算法之间有一些重叠,例如,分类算法可以预测连续值,但连续值是以类标签概率的形式出现的;回归算法可以预测一个离散值,但离散值以整数形式存在。
11)卷积神经网络(CNN,Convolutional neural network):卷积神经网络由输入层、输出层和多个隐藏层组成。CNN的隐藏层通常由一系列卷积层组成,这些层通过乘法或其他点积进行卷积。激活函数通常是一个RELU层,在激活函数层之后是附加的卷积层,如池化层、全连接层和归一化层,由于它们的输入和输出都被激活函数和最终的卷积掩盖了,所以称为隐藏层。最后的卷积反过来,通常包括反向传播,以便更准确地计算最终产物的权重。尽管这些层通常被称为卷积,但这只是惯例。从数学上讲,它是一个滑动点积或交叉相关。这对矩阵中的指数有重要意义,因为它影响在一个特定的指数点如何确定权重。
卷积层:在对CNN进行设计时,神经网络中的每个卷积层都应该具备以下属性:输入是一个张量,其形状为(图像数量)×(图像宽度)×(图像高度)×(图像深度)。 宽度和高度为超参数,深度必须等于图像深度的卷积核。卷积层对输入进行卷积,并将结果传递给下一层。这类似于视觉皮层中的神经元对特定刺激的反应。
每个卷积神经元仅为其接收域处理数据。虽然全连接前馈神经网络可以用于特征学习和数据分类,但将这种结构应用于图像是不实际的。即使在浅层(与深层相对)结构中,也需要非常多的神经元,因为与图像相关的输入尺寸非常大,其中每个像素都是一个相关变量。例如,对于大小为100x100的(小)图像,一个完全连接的层对第二层的每个神经元有10000个权重。卷积操作解决了这个问题,因为它减少了自由参数的数量,使得网络可以用更少的参数更深入。例如,不管图像大小如何,大小为5x5的平摊区域,每个区域具有相同的共享权值,只需要25个可学习的参数。通过这种方法,利用反向传播的方法,解决了传统多层神经网络训练中梯度消失或爆炸的问题。
池化层:卷积神经网络可以包括本地或全局池化层来简化底层的计算。池化层通过将一层神经元簇的输出合并为下一层的单个神经元来减少数据的维数。本地池结合了小的集群,通常是2x2。全局池作用于卷积层的所有神经元。此外,池可以计算最大值或平均值。最大池使用前一层的每个神经元簇的最大值。平均池使用前一层每个神经元簇的平均值。
全连接层:全连接层将一层的每个神经元连接到另一层的每个神经元。它在原理上与传统的多层感知器神经网络(MLP,Multi-Layer Perceptron)相同。扁平矩阵通过一个全连通层对图像进行分类。
本申请实施例提供的手势分析方法类似于Pose-REN的工作,位姿引导结构区域集成网络(Pose-REN,Pose guided structured Region Ensemble Network)的框架如图17所示。用一个简单的CNN网络(图中用Init-CNN来表示)预测一个初始的手部姿态pose0(用来作为级联结构的初始化)。在pose t-1的指导下,从CNN生成的特征图谱171中提取特征区域,并采用树状结构进行分层融合。Pose t是由Pose-REN获得的精制的手姿态,将作为下一阶段的指导。其中,图中的fc表示全连接层(Fully Connected),图中的concate表示合并数组,用于连接两个或多个数组
本申请实施例的方法属于使用完全连接层作为Pose-REN的最后一层来回归坐标的范畴。但是,首先是从RoI特征出发,而不是从原始图像出发,其次,回归头的架构是不同的(即除了最终的回归层,主要使用卷积层,而不是全连接层)。最后,返回UVD坐标,而不是XYZ坐标。
本申请实施例的主要发明点被置于RoiAlign特征提取器之后,它是用于三维手部姿态估计任务的回归模块。所提出的回归模块复用了从手部检测任务中得到的特征图,它从RoiAlign特征图开始,而不是从原始图像开始。本申请实施例方法的位置如图18所示,用于实现手势分析的手势估计模块181位于RoiAlign特征提取器182之后,其中,骨干特征提取器183用于对输入的初始图像180进行骨干特征提取,边界框检测模块184用于对初始图像进行边界框检测,边界框选择模块185用于对边界框进行选择,在对边界框选择之后,采用RoiAlign特征提取器182进行RoiAlign特征提取。
基于图18所示的手势估计模块181在整个框架中的位置,图19是本申请实施例提供的手势估计模块181的网络体系结构图,如图19所示,整个网络体系包括基础特征提取器191、第一UV编码器192、第一深度编码器193、第二UV编码器194、第二深度编码器195。
基础特征提取器191提取7x7x256(高*宽*通道)的图像特征图上的关键点特征,图像特征图首先应用3x3x128的卷积层Conv1将通道从256缩小到128(即节省计算)。将7x7x128的特征图与卷积层Conv2(3x3x128)卷积,进一步提取基本关键点特征,且Conv2有跳跃连接,将Conv2的输入与Conv2的输出相加,这个Conv2和它的跳跃连接重复4次。之后,对7x7x128的关键点特征映射,使用3x3内核的池化层,即Pool1,向下采样2次,大小为3x3x128。
本申请实施例中,手势估计模块181部分分为手指和手掌两个分支。手指分支有14个关键点,而手掌有6个关键点。如图11所示的手势关键点和手掌关键点,其中,手指关键点为0、1、2、4、5、6、8、9、10、12、13、14、16、17,手掌关键点为3、7、11、15、18、19。
在手指分支中,第一UV编码器192提取关键点特征,用于UV坐标回归。第一UV编码器192输入3x3x128的关键点特征图,卷积层Conv3输出相同大小的关键点特征图,并通过跳跃连接将Conv3的输入与Conv3的输出相加,这个Conv3与对应的跳跃连接重复4次。之后,通过内核为3x3的池化层,即Pool2,将3x3x128的关键点特征映射向下采样2次,大小为1x1x128。
在手指分支中,使用全连接层FC1来还原14个关键点的UV坐标。
在手指分支中,第一深度编码器193提取关键点特征用于深度回归。第一深度编码器193输入3x3x128的关键点特征图,卷积层Conv4输出相同大小的关键点特征图,并通过跳转连接将Conv4的输入与Conv4的输出相加,这个Conv4与对应的跳跃连接重复4次。之后,通过内核为3x3的池化层,即Pool3,将3x3x128的关键点特征映射向下采样2次,大小为1x1x128。
在手指分支中,使用完全连接的层FC2来返回14个关键点的深度坐标。
在手掌分支中,第二UV编码器194提取关键点特征,用于UV坐标回归。第二UV编码器194输入3x3x128的关键点特征图,卷积层Conv5输出相同大小的关键点特征图,并通过跳转连接将Conv5的输入与Conv5的输出相加,这个Conv5与对应的跳跃连接重复4次。之后,利用内核为3x3的池化层,即Pool4,将3x3x128的关键点特征映射向下采样2次,大小为1x1x128。
在手掌分支中,使用全连接层FC3来对6个关键点的UV坐标进行回归。
在手掌分支中,第二深度编码器195提取关键点特性,用于深度回归。第二深度编码器195输入3x3x128的关键点特征图,卷积层Conv6输出相同大小的关键点特征图,并通过跳转连接将Conv6的输入与Conv6的输出相加。此Conv6与相应的跳跃连接重复4次。之后,利用内核为3x3的池化层,即Pool5,将3x3x128的关键点特征映射向下采样2次,大小为1x1x128。
在手掌分支中,使用完全连接的层FC4来返回6个关键点的深度坐标。
通过上述计算,分别得到每一手指关键点的UVD坐标和每一手掌关键点的UVD坐标,然后,UV坐标加上深度,被用来计算XYZ坐标,即将UVD坐标转化为XYZ坐标,即完成对手势的估计。
基于前述的实施例,本申请实施例提供一种手势分析装置,该装置包括所包括的各模块、以及各模块所包括的各单元,可以通过接收端中的处理器来实现;当然也可通过具体的逻辑电路实现;在实施的过程中,处理器可以为中央处理器(CPU)、微处理器(MPU)、数字信号处理器(DSP)或现场可编程门阵列(FPGA)等。
图20是本申请实施例提供的手势分析装置的结构示意图,如图20所示,所述手势分析装置200包括:
特征提取模块201,用于对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;
UV坐标回归处理模块202,用于分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;
深度回归处理模块203,用于分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;
手势分析模块204,用于根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。
在一些实施例中,所述UV坐标回归处理模块还用于:分别对每一所述第一关键点特征和每一所述第二关键点特征,进行UV编码处理,对应得到每一所述手指关键点的第一UV编码特征和每一所述手掌关键点的第二UV编码特征;分别对每一所述第一UV编码特征和每一所述第二UV编码特征,进行全连接处理,对应得到每一所述手指关键点的第一UV坐标和每一所述手掌关键点的第二UV坐标。
在一些实施例中,所述UV坐标回归处理模块还用于:采用第一卷积层对每一所述第一关键点特征进行卷积处理,得到第一卷积特征;通过所述第一卷积层对所述第一卷积特征依次进行第一预设次数的跳跃连接处理,得到第一跳跃连接特征;对所述第一跳跃连接特征进行池化处理,以降低所述第一跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一UV编码特征。
在一些实施例中,所述UV坐标回归处理模块还用于:将所述第一卷积特征确定为所述第一卷积层在第一次跳跃连接处理时的输入特征;并且,将所述第一卷积层在第N次的输出特征,确定为所述第一卷积层在第N次跳跃连接处理的输入特征,其中,N为大于1的整数;将所确定出的每一次的所述输入特征,输入至所述第一卷积层中,依次进行所述第一预设次数的所述跳跃连接处理,得到所述第一跳跃连接特征。
在一些实施例中,所述UV坐标回归处理模块还用于:采用第二卷积层对每一所述手掌关键点特征进行卷积处理,得到第二卷积特征;通过所述第二卷积层对所述第二卷积特征依次进行第二预设次数的跳跃连接处理,得到第二跳跃连接特征;对所述第二跳跃连接特征进行池化处理,以降低所述第二跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二UV编码特征。
在一些实施例中,所述深度回归处理模块还用于:分别对每一所述第一关键点特征和每一所述第二关键点特征,进行深度编码处理,对应得到每一所述手指关键点的第一深度编码特征和每一所述手掌关键点的第二深度编码特征;分别对每一所述第一深度编码特征和每一所述第二深度编码特征,进行全连接处理,对应得到每一所述手指关键点的第一深度坐标和每一所述手掌关键点的第二深度坐标。
在一些实施例中,所述深度回归处理模块还用于:采用第三卷积层对每一所述手指关键点特征进行卷积处理,得到第三卷积特征;通过所述第三卷积层对所述第三卷积特征依次进行第三预设次数的跳跃连接处理,得到第三跳跃连接特征;对所述第三跳跃连接特征进行池化处理,以降低所述第三跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一深度编码特征。
在一些实施例中,所述深度回归处理模块还用于:将所述第三卷积特征确定为所述第三卷积层在第一次跳跃连接处理时的输入特征;并且,将所述第三卷积层在第M次的输出特征,确定为所述第三卷积层在第M次跳跃连接处理的输入特征,其中,M为大于1的整数;将所确定出的每一次的所述输入特征,输入至所述第三卷积层中,依次进行所述第三预设次数的所述跳跃连接处理,得到所述第三跳跃连接特征。
在一些实施例中,所述深度回归处理模块还用于:采用第四卷积层对每一所述手掌关键点特征进行卷积处理,得到第四卷积特征;通过所述第四卷积层对所述第四卷积特征依次进行第四预设次数的跳跃连接处理,得到第四跳跃连接特征;对所述第四跳跃连接特征进行池化处理,以降低所述第四跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二深度编码特征。
在一些实施例中,所述手势分析模块还用于:对每一所述手指关键点的所述第一UV坐标和所述第一深度坐标进行坐标转换,得到对应手指关键点的第一空间坐标;对每一所述手掌关键点的所述第二UV坐标和所述第二深度坐标进行坐标转换,得到对应手掌关键 点的第二空间坐标;根据所述第一空间坐标和所述第二空间坐标,对所述待分析图像进行手势分析,得到手势分析结果。
在一些实施例中,所述手势分析模块还用于:确定每两个手指关键点之间的第一相对位置关系、和每两个手掌关键点之间的第二相对位置关系;根据所述第一相对位置关系和所述第二相对位置关系,依次连接所述第一数量的手指关键点和所述第二数量的手掌关键点,形成手部关键点连接图;根据所述手部关键点连接图对所述待分析图像进行手势分析,得到手势分析结果。
在一些实施例中,所述特征提取模块还用于:对所述待分析图像进行目标识别,以实现在所述待分析图像的至少两个子区域中识别出具有目标对象的目标子区域;对所述目标子区域进行截取,得到截取后的图像;对所述截取后的图像进行所述特征提取,得到所述第一数量的所述第一关键点特征和所述第二数量的所述第二关键点特征。
在一些实施例中,所述特征提取模块还用于:获取具有预设尺寸的扫描框,所述待分析图像的尺寸大于所述预设尺寸;通过在所述待分析图像的区域上滑动所述扫描框,以确定出每一所述子区域中具有所述目标对象的概率值;将具有最高概率值的子区域确定为所述目标子区域。
在一些实施例中,所述特征提取模块还用于:对所述截取后的图像进行RoI匹配特征提取,以获得坐标为浮点数的像素点上的至少两个图像RoI匹配特征;根据所述至少两个图像RoI匹配特征,确定RoI匹配特征图;对所述RoI匹配特征图进行二维手部姿态估计,以确定出所述第一数量的所述第一关键点特征和所述第二数量的所述第二关键点特征。
在一些实施例中,所述特征提取模块还用于:采用第五卷积层对所述RoI匹配特征图中的所述图像RoI匹配特征进行卷积处理,得到RoI匹配卷积特征;采用第六卷积层对所述RoI匹配卷积特征进行第五预设次数的跳跃连接处理,得到第五跳跃连接特征;对所述第五跳跃连接特征进行池化处理,以降低所述第五跳跃连接特征的空间尺寸,确定出所述第一数量的所述手指关键点特征和所述第二数量的所述手掌关键点特征。
在一些实施例中,所述装置还包括:处理模块,用于采用手势分析模型进行所述特征提取、所述UV坐标回归处理、所述深度回归处理和所述手势分析,以得到所述手势分析结果。
在一些实施例中,所述手势分析模型通过以下步骤进行训练:将样本图像输入至所述手势分析模型中;通过所述手势分析模型中的手部特征提取网络,对所述样本图像进行特征提取,得到第三数量的样本第一关键点特征和第四数量的样本第二关键点特征;通过所述手势分析模型中的UV坐标回归网络,分别对每一所述样本第一关键点特征和每一所述样本第二关键点特征进行UV坐标回归处理,对应得到每一样本手指关键点的第一样本UV坐标和每一样本手掌关键点的第二样本UV坐标;通过所述手势分析模型中的深度回归网络,分别对每一所述样本第一关键点特征和每一所述样本第二关键点特征进行深度回归处理,对应得到每一样本手指关键点的第一样本深度坐标和每一样本手掌关键点的第二样本深度坐标;通过所述手势分析模型中的手势分析网络,对所述第一样本UV坐标、所述第二样本UV坐标、所述第一样本深度坐标和所述第二样本深度坐标进行手势分析,得到样本手势分析结果;将所述样本手势分析结果输入至预设损失模型中,得到损失结果;根据所述损失结果,对所述手部特征提取网络、所述UV坐标回归网络、所述深度回归网络和所述手势分析网络中的参数进行修正,得到修正后的手势分析模型。
需要说明的是,本申请实施例装置的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果,因此不做赘述。对于本装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器 从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的方法。
本申请实施例提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的方法,例如,如图2示出的方法。
在一些实施例中,存储介质可以是计算机可读存储介质,例如,铁电存储器(FRAM,Ferromagnetic Random Access Memory)、只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read Only Memory)、带电可擦可编程只读存储器(EEPROM,Electrically Erasable Programmable Read Only Memory)、闪存、磁表面存储器、光盘、或光盘只读存储器(CD-ROM,Compact Disk-Read Only Memory)等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。
工业实用性
本申请实施例中,首先对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;然后分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;最后根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。如此,能够能够极大的提高手势分析的准确率,具有一定的工业实用性。

Claims (20)

  1. 一种手势分析方法,所述方法包括:
    对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征;
    分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标;
    分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐标和每一手掌关键点的第二深度坐标;
    根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标,对所述待分析图像进行手势分析,得到手势分析结果。
  2. 根据权利要求1所述的方法,其中,所述分别对每一所述第一关键点特征和每一所述第二关键点特征进行UV坐标回归处理,对应得到每一手指关键点的第一UV坐标和每一手掌关键点的第二UV坐标,包括:
    分别对每一所述第一关键点特征和每一所述第二关键点特征,进行UV编码处理,对应得到每一所述手指关键点的第一UV编码特征和每一所述手掌关键点的第二UV编码特征;
    分别对每一所述第一UV编码特征和每一所述第二UV编码特征,进行全连接处理,对应得到每一所述手指关键点的第一UV坐标和每一所述手掌关键点的第二UV坐标。
  3. 根据权利要求2所述的方法,其中,对每一所述第一关键点特征进行UV编码处理,得到每一所述手指关键点的第一UV编码特征,包括:
    采用第一卷积层对每一所述第一关键点特征进行卷积处理,得到第一卷积特征;
    通过所述第一卷积层对所述第一卷积特征依次进行第一预设次数的跳跃连接处理,得到第一跳跃连接特征;
    对所述第一跳跃连接特征进行池化处理,以降低所述第一跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一UV编码特征。
  4. 根据权利要求3所述的方法,其中,所述通过所述第一卷积层对所述第一卷积特征依次进行第一预设次数的跳跃连接处理,得到第一跳跃连接特征,包括:
    将所述第一卷积特征确定为所述第一卷积层在第一次跳跃连接处理时的输入特征;并且,
    将所述第一卷积层在第N次的输出特征,确定为所述第一卷积层在第N次跳跃连接处理的输入特征,其中,N为大于1的整数;
    将所确定出的每一次的所述输入特征,输入至所述第一卷积层中,依次进行所述第一预设次数的所述跳跃连接处理,得到所述第一跳跃连接特征。
  5. 根据权利要求2所述的方法,其中,对每一所述第二关键点特征,进行UV编码处理,得到每一所述手掌关键点的第二UV编码特征,包括:
    采用第二卷积层对每一所述手掌关键点特征进行卷积处理,得到第二卷积特征;
    通过所述第二卷积层对所述第二卷积特征依次进行第二预设次数的跳跃连接处理,得到第二跳跃连接特征;
    对所述第二跳跃连接特征进行池化处理,以降低所述第二跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二UV编码特征。
  6. 根据权利要求1所述的方法,其中,所述分别对每一所述第一关键点特征和每一所述第二关键点特征进行深度回归处理,对应得到每一手指关键点的第一深度坐 标和每一手掌关键点的第二深度坐标,包括:
    分别对每一所述第一关键点特征和每一所述第二关键点特征,进行深度编码处理,对应得到每一所述手指关键点的第一深度编码特征和每一所述手掌关键点的第二深度编码特征;
    分别对每一所述第一深度编码特征和每一所述第二深度编码特征,进行全连接处理,对应得到每一所述手指关键点的第一深度坐标和每一所述手掌关键点的第二深度坐标。
  7. 根据权利要求6所述的方法,其中,对每一所述第一关键点特征进行深度编码处理,对应得到每一所述手指关键点的第一深度编码特征,包括:
    采用第三卷积层对每一所述手指关键点特征进行卷积处理,得到第三卷积特征;
    通过所述第三卷积层对所述第三卷积特征依次进行第三预设次数的跳跃连接处理,得到第三跳跃连接特征;
    对所述第三跳跃连接特征进行池化处理,以降低所述第三跳跃连接特征的空间尺寸,得到每一所述手指关键点的所述第一深度编码特征。
  8. 根据权利要求7所述的方法,其中,所述通过所述第三卷积层对所述第三卷积特征依次进行第三预设次数的跳跃连接处理,得到第三跳跃连接特征,包括:
    将所述第三卷积特征确定为所述第三卷积层在第一次跳跃连接处理时的输入特征;并且,
    将所述第三卷积层在第M次的输出特征,确定为所述第三卷积层在第M次跳跃连接处理的输入特征,其中,M为大于1的整数;
    将所确定出的每一次的所述输入特征,输入至所述第三卷积层中,依次进行所述第三预设次数的所述跳跃连接处理,得到所述第三跳跃连接特征。
  9. 根据权利要求6所述的方法,其中,对每一所述第二关键点特征进行深度编码处理,对应得到每一所述手掌关键点的第二深度编码特征,包括:
    采用第四卷积层对每一所述手掌关键点特征进行卷积处理,得到第四卷积特征;
    通过所述第四卷积层对所述第四卷积特征依次进行第四预设次数的跳跃连接处理,得到第四跳跃连接特征;
    对所述第四跳跃连接特征进行池化处理,以降低所述第四跳跃连接特征的空间尺寸,得到每一所述手掌关键点的所述第二深度编码特征。
  10. 根据权利要求1所述的方法,其中,所述根据所述第一UV坐标、所述第一深度坐标、所述第二UV坐标和所述第二深度坐标对所述待分析图像进行手势分析,得到手势分析结果,包括:
    对每一所述手指关键点的所述第一UV坐标和所述第一深度坐标进行坐标转换,得到对应手指关键点的第一空间坐标;
    对每一所述手掌关键点的所述第二UV坐标和所述第二深度坐标进行坐标转换,得到对应手掌关键点的第二空间坐标;
    根据所述第一空间坐标和所述第二空间坐标,对所述待分析图像进行手势分析,得到手势分析结果。
  11. 根据权利要求10所述的方法,其中,所述根据所述第一空间坐标和所述第二空间坐标,对所述待分析图像进行手势分析,得到手势分析结果,包括:
    确定每两个手指关键点之间的第一相对位置关系、和每两个手掌关键点之间的第二相对位置关系;
    根据所述第一相对位置关系和所述第二相对位置关系,依次连接所述第一数量的手指关键点和所述第二数量的手掌关键点,形成手部关键点连接图;
    根据所述手部关键点连接图对所述待分析图像进行手势分析,得到手势分析结果。
  12. 根据权利要求1所述的方法,其中,所述对获取的待分析图像进行特征提取,得到第一数量的第一关键点特征和第二数量的第二关键点特征,包括:
    对所述待分析图像进行目标识别,以实现在所述待分析图像的至少两个子区域中识别出具有目标对象的目标子区域;
    对所述目标子区域进行截取,得到截取后的图像;
    对所述截取后的图像进行所述特征提取,得到所述第一数量的所述第一关键点特征和所述第二数量的所述第二关键点特征。
  13. 根据权利要求12所述的方法,其中,所述对所述待分析图像进行目标识别,以实现在所述待分析图像的至少两个子区域中识别出具有目标对象的目标子区域,包括:
    获取具有预设尺寸的扫描框,所述待分析图像的尺寸大于所述预设尺寸;
    通过在所述待分析图像的区域上滑动所述扫描框,以确定出每一所述子区域中具有所述目标对象的概率值;
    将具有最高概率值的子区域确定为所述目标子区域。
  14. The method according to claim 12, wherein said performing the feature extraction on the cropped image, to obtain the first number of first key point features and the second number of second key point features, comprises:
    performing RoI matching feature extraction on the cropped image, to obtain at least two image RoI matching features at pixel points whose coordinates are floating-point numbers;
    determining an RoI matching feature map according to the at least two image RoI matching features; and
    performing two-dimensional hand pose estimation on the RoI matching feature map, to determine the first number of first key point features and the second number of second key point features.
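The "RoI matching features at pixel points whose coordinates are floating-point numbers" in claim 14 suggests RoIAlign-style bilinear sampling rather than rounding to integer pixel locations. The sketch below samples a feature map at fractional coordinates; the feature map size and the sample point are placeholders, and the claim does not prescribe this exact interpolation.

```python
# Bilinear sampling sketch for floating-point pixel coordinates (RoIAlign-style, assumed detail).
import numpy as np

def bilinear_sample(feature_map, x, y):
    """feature_map: (H, W, C); x, y: fractional pixel coordinates inside the map."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

fmap = np.random.rand(56, 56, 64)
print(bilinear_sample(fmap, 12.3, 40.7).shape)  # (64,)
```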
  15. The method according to claim 14, wherein said performing the two-dimensional hand pose estimation on the RoI matching feature map, to determine the first number of first key point features and the second number of second key point features, comprises:
    performing convolution on the image RoI matching features in the RoI matching feature map by using a fifth convolutional layer, to obtain an RoI matching convolutional feature;
    performing skip connection processing on the RoI matching convolutional feature a fifth preset number of times by using a sixth convolutional layer, to obtain a fifth skip connection feature; and
    performing pooling on the fifth skip connection feature to reduce a spatial size of the fifth skip connection feature, to determine the first number of finger key point features and the second number of palm key point features.
  16. The method according to any one of claims 1 to 15, wherein the method further comprises:
    performing the feature extraction, the UV coordinate regression, the depth regression and the gesture analysis by using a gesture analysis model, to obtain the gesture analysis result.
  17. The method according to claim 16, wherein the gesture analysis model is trained through the following steps:
    inputting a sample image into the gesture analysis model;
    performing feature extraction on the sample image through a hand feature extraction network in the gesture analysis model, to obtain a third number of sample first key point features and a fourth number of sample second key point features;
    performing UV coordinate regression on each of the sample first key point features and each of the sample second key point features respectively through a UV coordinate regression network in the gesture analysis model, to correspondingly obtain a first sample UV coordinate of each sample finger key point and a second sample UV coordinate of each sample palm key point;
    performing depth regression on each of the sample first key point features and each of the sample second key point features respectively through a depth regression network in the gesture analysis model, to correspondingly obtain a first sample depth coordinate of each sample finger key point and a second sample depth coordinate of each sample palm key point;
    performing gesture analysis on the first sample UV coordinates, the second sample UV coordinates, the first sample depth coordinates and the second sample depth coordinates through a gesture analysis network in the gesture analysis model, to obtain a sample gesture analysis result;
    inputting the sample gesture analysis result into a preset loss model, to obtain a loss result; and
    correcting parameters of the hand feature extraction network, the UV coordinate regression network, the depth regression network and the gesture analysis network according to the loss result, to obtain a corrected gesture analysis model.
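Claim 17 trains the feature extraction, UV regression, depth regression and gesture analysis networks jointly against a preset loss model. The training step below reuses the GestureAnalysisSketch module from the sketch after the summary paragraph and substitutes a plain mean-squared-error loss; the loss model, optimiser and label layout are all assumptions, not the claimed training procedure.

```python
# Training-step sketch for claim 17 (loss model, optimiser and label layout are assumptions).
# Requires GestureAnalysisSketch, NUM_FINGER_KPTS and NUM_PALM_KPTS from the earlier sketch.
import torch
import torch.nn.functional as F

model = GestureAnalysisSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(sample_image, labels):
    out = model(sample_image)                                # feature extraction + UV and depth regression
    loss = sum(F.mse_loss(out[k], labels[k]) for k in out)   # stand-in for the "preset loss model"
    optimizer.zero_grad()
    loss.backward()                                          # correct the parameters of all sub-networks
    optimizer.step()
    return loss.item()

labels = {"finger_uv": torch.rand(1, NUM_FINGER_KPTS, 2),
          "palm_uv": torch.rand(1, NUM_PALM_KPTS, 2),
          "finger_depth": torch.rand(1, NUM_FINGER_KPTS, 1),
          "palm_depth": torch.rand(1, NUM_PALM_KPTS, 1)}
print(train_step(torch.randn(1, 3, 224, 224), labels))
```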
  18. A gesture analysis apparatus, the apparatus comprising:
    a feature extraction module, configured to perform feature extraction on an acquired image to be analyzed, to obtain a first number of first key point features and a second number of second key point features;
    a UV coordinate regression module, configured to perform UV coordinate regression on each of the first key point features and each of the second key point features respectively, to correspondingly obtain a first UV coordinate of each finger key point and a second UV coordinate of each palm key point;
    a depth regression module, configured to perform depth regression on each of the first key point features and each of the second key point features respectively, to correspondingly obtain a first depth coordinate of each finger key point and a second depth coordinate of each palm key point; and
    a gesture analysis module, configured to perform gesture analysis on the image to be analyzed according to the first UV coordinates, the first depth coordinates, the second UV coordinates and the second depth coordinates, to obtain a gesture analysis result.
  19. A gesture analysis device, comprising:
    a memory configured to store executable instructions; and a processor configured to, when executing the executable instructions stored in the memory, implement the gesture analysis method according to any one of claims 1 to 17.
  20. A computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the gesture analysis method according to any one of claims 1 to 17.
PCT/CN2020/128469 2019-11-20 2020-11-12 Gesture analysis method, apparatus, device and computer-readable storage medium WO2021098587A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/746,956 US20220351547A1 (en) 2019-11-20 2022-05-17 Gesture analysis method and device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962938189P 2019-11-20 2019-11-20
US62/938,189 2019-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/746,956 Continuation US20220351547A1 (en) 2019-11-20 2022-05-17 Gesture analysis method and device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021098587A1 true WO2021098587A1 (zh) 2021-05-27

Family

ID=75980831

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128469 WO2021098587A1 (zh) Gesture analysis method, apparatus, device and computer-readable storage medium

Country Status (2)

Country Link
US (1) US20220351547A1 (zh)
WO (1) WO2021098587A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116784837A (zh) * 2023-08-07 2023-09-22 北京工业大学 一种上肢运动障碍评估方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017164979A1 (en) * 2016-03-22 2017-09-28 Intel Corporation Identifying local coordinate system for gesture recognition
US20180373927A1 (en) * 2017-06-21 2018-12-27 Hon Hai Precision Industry Co., Ltd. Electronic device and gesture recognition method applied therein
CN109657537A (zh) * 2018-11-05 2019-04-19 北京达佳互联信息技术有限公司 基于目标检测的图像识别方法、系统和电子设备
CN109800676A (zh) * 2018-12-29 2019-05-24 上海易维视科技股份有限公司 基于深度信息的手势识别方法及系统
CN109858524A (zh) * 2019-01-04 2019-06-07 北京达佳互联信息技术有限公司 手势识别方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHI DING, LIN JUN, YOU JUN, YUAN HAO: "A Gesture Recognition Method Based on Deep Learning", CONTROL AND INFORMATION TECHNOLOGY, 1 June 2018 (2018-06-01), pages 96 - 99, XP055814607, [retrieved on 20210616], DOI: 10.13889/j.issn.2096-5427.2018.06.016 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378773A (zh) * 2021-06-29 2021-09-10 北京百度网讯科技有限公司 手势识别方法、装置、设备、存储介质以及程序产品
CN113378773B (zh) * 2021-06-29 2023-08-08 北京百度网讯科技有限公司 手势识别方法、装置、设备、存储介质以及程序产品
CN116766213A (zh) * 2023-08-24 2023-09-19 烟台大学 一种基于图像处理的仿生手控制方法、系统和设备
CN116766213B (zh) * 2023-08-24 2023-11-03 烟台大学 一种基于图像处理的仿生手控制方法、系统和设备

Also Published As

Publication number Publication date
US20220351547A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
US10366313B2 (en) Activation layers for deep learning networks
US11783496B2 (en) Scalable real-time hand tracking
Nunez et al. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition
WO2021017606A1 (zh) 视频处理方法、装置、电子设备及存储介质
WO2021098587A1 (zh) 手势分析方法、装置、设备及计算机可读存储介质
US11205086B2 (en) Determining associations between objects and persons using machine learning models
US7308112B2 (en) Sign based human-machine interaction
JP6571108B2 (ja) モバイル機器用三次元ジェスチャのリアルタイム認識及び追跡システム
CN111062263B (zh) 手部姿态估计的方法、设备、计算机设备和存储介质
WO2021218786A1 (zh) 一种数据处理系统、物体检测方法及其装置
CN112651292A (zh) 基于视频的人体动作识别方法、装置、介质及电子设备
WO2021190296A1 (zh) 一种动态手势识别方法及设备
WO2021047587A1 (zh) 手势识别方法、电子设备、计算机可读存储介质和芯片
US20220343687A1 (en) Gesture recognition method and apparatus, and storage medium
CN111444764A (zh) 一种基于深度残差网络的手势识别方法
US20220351405A1 (en) Pose determination method and device and non-transitory storage medium
CN116997941A (zh) 用于姿态估计的基于关键点的采样
WO2023083030A1 (zh) 一种姿态识别方法及其相关设备
KR102143034B1 (ko) 객체의 미래 움직임 예측을 통한 동영상에서의 객체 추적을 위한 방법 및 시스템
CN114641799A (zh) 对象检测设备、方法和系统
Hussain et al. Intelligent sign language recognition system for e-learning context
KR20210061839A (ko) 전자 장치 및 그 제어 방법
CN114897039A (zh) 一种数据处理方法及相关设备
WO2024022301A1 (zh) 视角路径获取方法、装置、电子设备及介质
US11961249B2 (en) Generating stereo-based dense depth images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20889084

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20889084

Country of ref document: EP

Kind code of ref document: A1