WO2022170896A9 - Key point detection method, system, intelligent terminal and storage medium - Google Patents

Key point detection method, system, intelligent terminal and storage medium

Info

Publication number
WO2022170896A9
WO2022170896A9 (PCT/CN2022/070537, CN2022070537W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
detection
target
key point
Prior art date
Application number
PCT/CN2022/070537
Other languages
English (en)
French (fr)
Other versions
WO2022170896A1 (zh)
Inventor
张夏杰
蔚栋
史培元
安山
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2022170896A1 publication Critical patent/WO2022170896A1/zh
Publication of WO2022170896A9 publication Critical patent/WO2022170896A9/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The present application is based on CN application number 202110175751.7, filed on February 9, 2021, and claims its priority.
  • The disclosure of that CN application is hereby incorporated into the present application in its entirety.
  • the present disclosure relates to the technical field of image detection, in particular to a key point detection method, system, intelligent terminal and storage medium.
  • Key point positioning refers to locating the target position from the input data. Specifically, in the key point positioning of the hand, it refers to locating the position of each joint point of the hand from the input data, and the number of joint points is 21.
  • The input data takes various forms depending on the sensor. Taking monocular RGB data (such as the image captured by the front camera of a mobile phone) as an example: since the collection of monocular RGB data is widely available, obtaining stable and accurate key points of the human hand from a monocular RGB image can be applied in all kinds of equipment and improve operation efficiency.
  • top-down
  • bottom-up
  • The top-down method is well suited to detecting individuals, but as the number of target objects increases, the key point positioning model must be run repeatedly, and the overall time consumption grows multiplicatively. The bottom-up method detects local key points over the entire image and completes the positioning task end-to-end, so it has a clear speed advantage, but it struggles with targets such as human hands, which are small, have high degrees of freedom, and have highly similar local features.
  • An object of the present disclosure is to propose a solution for improving detection efficiency while ensuring the accuracy of key point detection.
  • a method for detecting key points, including: detecting an image of a target in an image to be processed and obtaining a first detection frame of the target; obtaining a second detection frame through frame stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames; extracting an image of the target area from the image to be processed according to the second detection frame; obtaining a heat map through a deep learning network according to the image of the target area, wherein the number of channels in the heat map matches the number of target key points; and determining the position information of the target key points by obtaining the peak points of the heat map.
  • the key point detection method further includes: rendering the target key point in the image to be processed according to the position information of the target key point, and obtaining the key point detection image.
  • acquiring the second detection frame through frame stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames includes: obtaining the second detection frame by a weighted average of the first detection frame in the current image frame and the first detection frames in a predetermined number of historical image frames preceding the current image frame; wherein the first detection frame of the current image frame has the largest weight, and the shorter the time from the current image frame, the greater the weight of a historical frame's first detection frame.
  • extracting the image of the target area from the image to be processed according to the second detection frame includes: enlarging the second detection frame at a predetermined ratio; and capturing an image in the enlarged second detection frame as the image of the target area.
  • acquiring the heat map through the deep learning network according to the image of the target area includes: processing the image of the target area to a first resolution; inputting the image of the target area at the first resolution into the encoding module of the deep learning algorithm to obtain high-level features; and inputting the output of the encoding module into the decoding module of the deep learning algorithm to raise the resolution of the feature map, obtaining a heat map at a second resolution, where the second resolution is smaller than the first resolution.
  • inputting the image of the target area at the first resolution into the encoding module of the deep learning algorithm to obtain high-level features includes: extracting low-level features of the image of the target area at the first resolution; performing feature fusion; and performing continuous downsampling on the fused features to obtain high-level features.
  • inputting the output of the encoding module into the decoding module to raise the resolution of the feature map and obtain the heat map at the second resolution includes: performing a three-layer transposed convolution operation to upscale the resolution of the feature map output by the encoding module to the second resolution.
  • obtaining the location information of the target key points by extracting the peak points of the heat map includes: extracting a peak point from the image of each channel of the heat map and determining the corresponding target key point; and obtaining the location information of the determined target key points.
  • the keypoint detection method satisfies at least one of the following: the first resolution is 256*256; the second resolution is 64*64; or the resolution of the feature map is 8*8.
  • the target image is a hand image of a human body
  • the target key points are key points of the hand.
  • a key point detection system, including: a target detection unit, configured to detect an image of a target in the image to be processed and obtain a first detection frame; a frame stabilization unit, configured to obtain a second detection frame through frame stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames; a target area extraction unit, configured to extract the image of the target area from the image to be processed according to the second detection frame; a heat map acquisition unit, configured to obtain a heat map through the deep learning network according to the image of the target area, wherein the number of channels of the heat map matches the number of target key points; and a key point extraction unit, configured to obtain the location information of the target key points from the peak points of the heat map.
  • the key point detection system further includes: a rendering unit configured to render the target key points in the image to be processed according to the position information of the target key points, and obtain a key point detection image.
  • a keypoint detection system comprising: a memory; and a processor coupled to the memory, the processor being configured to execute any one of the above key point detection methods based on instructions stored in the memory.
  • a non-transitory computer-readable storage medium on which computer program instructions are stored, and which, when the instructions are executed by a processor, implements the steps of any of the above key point detection methods.
  • an intelligent terminal including: an image acquisition device configured to acquire images; and any one of the above key point detection systems.
  • FIG. 1A is a flowchart of some embodiments of keypoint detection methods of the present disclosure.
  • FIG. 1B is a schematic diagram of some embodiments of the keypoint detection method of the present disclosure.
  • FIG. 2 is a flowchart of other embodiments of the keypoint detection method of the present disclosure.
  • FIG. 3 is a schematic diagram of some embodiments of obtaining a heat map in the keypoint detection method of the present disclosure.
  • FIG. 4 is a schematic diagram of some embodiments of the keypoint detection system of the present disclosure.
  • FIG. 5 is a schematic diagram of other embodiments of the keypoint detection system of the present disclosure.
  • FIG. 6 is a schematic diagram of further embodiments of the keypoint detection system of the present disclosure.
  • FIG. 7 is a schematic diagram of some embodiments of the disclosed smart terminal.
  • A flowchart of some embodiments of the keypoint detection method of the present disclosure is shown in FIG. 1A.
  • In step 101, an image of a target is detected in the image to be processed, and a first detection frame of the target is obtained.
  • The target detection algorithm in the related art can be used to extract the target; or, based on a target detection model in the related art, target extraction can be realized by training on object images of the same type as the target.
  • the mobilenet-ssd model may be used.
  • The model uses the lightweight convolutional neural network mobilenetv2 as the backbone, and adopts the detection head of the SSD model architecture.
  • The SSD detection architecture is a single-stage, end-to-end method, which has a speed advantage over two-stage methods; moreover, SSD uses feature maps at 6 different resolutions, which helps detect objects of different sizes. This is beneficial for objects like the hand, whose position relative to the camera changes frequently and whose shape varies with movement (for example, the image changes between a fist and an outstretched palm), and thus improves detection accuracy.
  • In step 102, a second detection frame is acquired through frame stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames. Because key point positioning is carried out top-down, the problem of detection frame jitter is introduced. Obtaining the second detection frame based on the first detection frames detected in multiple frames can reduce the positioning error caused by jitter and further improve accuracy.
  • In step 103, an image of the target area is extracted from the image to be processed according to the second detection frame.
  • the image in the second detection frame area may be captured as the image of the target area.
  • The shape and size of the detection frame of a non-rigid target change with its motion; for example, the detection frame sizes of an open palm and a fist differ greatly, and the aspect ratio of the detection frame fluctuates strongly as the palm rotates.
  • The frame is enlarged by a predetermined ratio, such as a 1.5× outward expansion, so that the image of the target is always kept in the central area of the image, improving the robustness of detection.
  • In step 104, a heat map is obtained through a deep learning network according to the image of the target area, wherein the number of channels of the heat map matches the number of target key points.
  • Using the heat map form instead of regression retains more spatial position information, which helps ensure accurate localization of key points on high-degree-of-freedom targets.
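As an illustration of the heat-map representation described above, each channel can encode one key point as a 2D Gaussian centered on its position, so the network predicts a spatial map rather than regressing coordinates directly. A minimal sketch (the `sigma` spread and the map size here are illustrative assumptions, not values from the disclosure):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """One heat-map channel: a 2D Gaussian centered on a key point,
    so position is encoded spatially instead of regressed directly."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 64, cx=20, cy=30)
# the peak of the channel sits exactly at the key point (row 30, col 20)
```

The key point is later recovered as the peak of the channel, which is what preserves the spatial information the passage refers to.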
  • In step 105, the position information of the target key points is determined by acquiring the peak points of the heat map.
  • position information of 21 hand key points can be obtained by extracting peak points channel by channel.
  • In this way, key points can be detected in a top-down manner: a heat map is generated from the image to be processed, and key points are located by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point positioning. At the same time, because key point detection is carried out on the image cropped by the accurate detection frame extraction process, detection efficiency and accuracy are both improved.
  • The key point detection method further includes step 106: rendering the target key points in the image to be processed according to their position information, obtaining a key point detection image.
  • the stable detection frame may also be rendered in the image to be processed, so that the keypoint detection image includes the target keypoint and the second detection frame.
  • The target key points and the second detection frame may be presented with markers of different colors and shapes to make the two easier to distinguish.
  • A schematic diagram of some embodiments of the keypoint detection method of the present disclosure is shown in FIG. 1B.
  • As shown in FIG. 1B, the original RGB image is preprocessed and sent to the hand detection network to obtain the first detection frame of the hand; the current first detection frame and the first detection frames of historical image frames are sent to the frame stabilization module to obtain the second detection frame. The ROI (Region of Interest) defined by the second detection frame is then cropped from the original image, preprocessed for positioning, and sent to the key point positioning network to obtain the heat map of the hand, from which the 2D information of the 21 hand key points is extracted. Finally, the key points and detection frame information are rendered onto the original image.
  • The top-down hand key point detection method with heat-map-based localization improves the accuracy of key point positioning; and because hand key point detection is carried out on the image cropped by the accurate detection frame extraction process, detection efficiency and accuracy are improved.
  • A flowchart of other embodiments of the keypoint detection method of the present disclosure is shown in FIG. 2.
  • In step 201, an image of a target is detected in the image to be processed, and a first detection frame of the target is obtained.
  • In step 202, the second detection frame is obtained by a weighted average of the first detection frame of the current image frame and the first detection frames in a predetermined number of historical image frames preceding the current image frame; the first detection frame of the current image frame has the largest weight, and the shorter the time from the current image frame, the greater the weight of a historical frame's first detection frame.
  • an exponentially weighted average is used to stabilize the detection frames, and for detection frames from near to far in time, a weighted average is performed according to an exponentially decreasing weight.
  • This method utilizes the correlation between the current frame and past frames in the video stream: the closer a frame is in time, the higher its correlation and the greater its weight; the farther, the lower the correlation and the smaller its weight.
  • the detection frame stabilization formula may be:
  • P_cur = ( Σ_{k=0}^{n-1} e^{-αk} · P_k ) / ( Σ_{k=0}^{n-1} e^{-αk} )
  • where k is the frame offset before the current frame (0 denotes the current frame, 1 the previous frame); n is the total number of frames over which the detection frames are averaged (for example, it can be set to 6); P_k is the detection frame coordinates of the k-th frame before the current frame; α is the exponential decay coefficient and e is Euler's constant; P_cur is the position of the detection frame after stabilization, that is, the position of the stable detection frame.
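The exponentially weighted frame stabilization above can be sketched as follows (the decay coefficient `alpha`, the (x1, y1, x2, y2) box format, and the weight normalization are illustrative assumptions; the disclosure's exact constants may differ):

```python
import math

def stabilize_box(boxes, alpha=0.5):
    """Exponentially weighted average of detection frames.

    boxes: list of (x1, y1, x2, y2), most recent first, so boxes[0]
    is the current frame (k = 0).  The weight e^{-alpha * k} decays
    with frame distance k; weights are normalized to sum to 1, so
    the current frame always gets the largest weight.
    """
    weights = [math.exp(-alpha * k) for k in range(len(boxes))]
    total = sum(weights)
    return tuple(
        sum(w * box[i] for w, box in zip(weights, boxes)) / total
        for i in range(4)
    )

current = (100.0, 100.0, 200.0, 200.0)
previous = (110.0, 110.0, 210.0, 210.0)
smoothed = stabilize_box([current, previous])
# smoothed lies between the two frames, closer to the current one
```

With a single frame the weighted average degenerates to that frame, so the very first frame of a video stream needs no special casing.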
  • In step 203, the second detection frame is enlarged by a predetermined ratio (e.g., 1.5 times), and the image inside the enlarged second detection frame is cropped as the image of the target area.
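The expansion-and-crop of step 203 can be sketched as follows (the clamping to image bounds is an assumption added for safety; the disclosure only specifies the 1.5× ratio):

```python
def expand_box(box, img_w, img_h, ratio=1.5):
    """Scale a detection frame about its center by `ratio`, clamped
    to the image bounds, so the hand stays in the central area of
    the crop even as its shape (fist vs. open palm) changes."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * ratio / 2
    half_h = (y2 - y1) * ratio / 2
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(img_w, cx + half_w), min(img_h, cy + half_h))

roi = expand_box((100, 100, 200, 200), img_w=640, img_h=480)
# the 100x100 frame grows to 150x150 around the same center
```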
  • In step 204, the image of the target area is processed to a first resolution.
  • The image of the target area obtained by cropping may be resampled to the first resolution.
  • a resolution of 256*256 may be selected as the first resolution.
  • a heatmap is generated based on a deep learning network.
  • In step 205, the image of the target area at the first resolution is input into the encoding module (Encoder) of the deep learning algorithm to obtain high-level features.
  • The Encoder may use multiple 1×1 convolutions, residual connections, and depthwise separable convolutions, so that the network keeps a small parameter count while having deeper layers to learn higher-level information.
  • the Encoder may include a first submodule (marked as a Low part), a second submodule (marked as a Middle part), and a third submodule (marked as a High part).
  • the Low part extracts low-level features; the Middle part performs feature fusion to improve the utilization of parameters; the High part obtains more high-level information through continuous downsampling.
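To see why the depthwise separable convolutions mentioned above keep the parameter count small, compare parameter counts for a standard 3×3 convolution and its depthwise separable counterpart (the 64→128 channel sizes are illustrative, not taken from the disclosure):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

standard = conv_params(64, 128, 3)           # 73,728 parameters
separable = dw_separable_params(64, 128, 3)  # 576 + 8,192 = 8,768
# roughly 8.4x fewer parameters for the same channel counts
```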
  • the image input to the Encoder in the figure is an RGB image.
  • Part B of FIG. 3 shows the convolution module used for resolution reduction in the Encoder of part A, which consists of consecutive residual blocks and a module fusing convolution and pooled downsampling.
  • In the Low and Middle parts, the residual block can be configured to repeat twice; in the High part, the residual block can be configured to repeat four times.
  • Part C of FIG. 3 shows the convolution module with which the Middle part in A implements upsampling, which consists of residual blocks, convolutions, and an image resize operation.
  • In step 206, the output of the encoding module is input into the decoding module (Decoder) of the deep learning algorithm to raise the resolution of the feature map and obtain a heat map at a second resolution, where the second resolution is smaller than the first resolution.
  • the output resolution of the Encoder part may be 8*8, and the resolution is increased to 64*64 by the Decoder part to retain more position-related information.
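The three-layer transposed convolution that raises the 8×8 Encoder output to the 64×64 heat map can be checked with the standard output-size formula (kernel 4, stride 2, padding 1 are assumed values that produce exact doubling; the disclosure does not specify them):

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a square 2D transposed convolution:
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel

size = 8                    # Encoder output resolution
for _ in range(3):          # three transposed-conv layers
    size = tconv_out(size)  # 8 -> 16 -> 32 -> 64
```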
  • GT (Ground Truth): correctly labeled data
  • a 64*64*21 heatmap may be output, where 21 is the number of channels.
  • In step 207, for the image of each channel of the heat map, a peak point is extracted, the corresponding target key point is determined, and the position information of the determined target key points is obtained.
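The channel-by-channel peak extraction of step 207 can be sketched as a per-channel argmax (a minimal sketch; real systems often add sub-pixel refinement on top):

```python
import numpy as np

def heatmap_peaks(heatmap):
    """heatmap: (H, W, C) array with one channel per key point.
    Returns the (row, col) peak location of every channel."""
    h, w, c = heatmap.shape
    flat = heatmap.reshape(h * w, c)
    idx = flat.argmax(axis=0)          # flat peak index per channel
    return [(int(i // w), int(i % w)) for i in idx]

# a 64x64x21 heat map with one synthetic peak per channel
hm = np.zeros((64, 64, 21))
for ch in range(21):
    hm[ch, 2 * ch, ch] = 1.0           # peak at row=ch, col=2*ch
points = heatmap_peaks(hm)
```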
  • The amount of computation can be reduced by extracting the target image and resizing it, while at the same time the accuracy of key point positioning is improved, achieving a double optimization of accuracy and efficiency, which benefits the expansion of application scenarios and application devices.
  • the key point detection system needs to be trained, so that the image detection of the target and the neural network generated by the heatmap both meet the accuracy requirements.
  • datasets may be prepared based on a generic video website.
  • The training set is composed of YouTube2D and GANeratedHand. YouTube2D was augmented 10-fold (size scaling and random cropping), reaching 471,250 images; from GANeratedHand, only data without objects is used, totaling 141,449 images. The training set totals 612,699 images, with real data to generated data maintained at a 10:3 ratio.
  • the test set consists of only YouTube2D, with a total of 1525 images.
  • A schematic diagram of some embodiments of the keypoint detection system of the present disclosure is shown in FIG. 4.
  • the target detection unit 401 can detect the image of the target in the image to be processed, and obtain the first detection frame of the target.
  • The target detection algorithm in the related art can be used to extract the target; or, based on a target detection model in the related art, target extraction can be realized by training on object images of the same type as the target.
  • The frame stabilization unit 402 can obtain the second detection frame through frame stabilization according to the first detection frame in the current image frame and the first detection frames of historical image frames. Since key point positioning is carried out top-down, detection frame jitter is introduced; obtaining the second detection frame based on the first detection frames detected in multiple frames can reduce the positioning error caused by jitter and further improve accuracy.
  • the target area extraction unit 403 can extract the image of the target area in the image to be processed according to the second detection frame.
  • the image in the second detection frame area may be captured as the image of the target area.
  • The second detection frame may be enlarged by a predetermined ratio, such as an outward expansion of 1.5×, so that the image of the target is always kept in the central area of the image, thereby improving the robustness of detection.
  • the heat map obtaining unit 404 can obtain a heat map through a deep learning network according to the image of the target area, wherein the number of channels of the heat map matches the number of target key points.
  • Using the heat map form instead of regression retains more spatial position information, which helps ensure accurate localization of key points on high-degree-of-freedom targets.
  • the key point extraction unit 405 can determine the position information of the target key point by acquiring the peak point of the heat map.
  • position information of 21 hand key points can be obtained by extracting peak points channel by channel.
  • Such a key point detection system can detect key points in a top-down manner, generate a heat map from the image to be processed, and locate key points by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point positioning; at the same time, based on the accurate detection frame extraction process, key point detection is carried out on the cropped image, which improves detection efficiency and accuracy.
  • the keypoint detection system may further include a rendering unit 406 that renders the target keypoint in the image to be processed according to the position information of the target keypoint, and obtains the keypoint detection image.
  • the stable detection frame may also be rendered in the image to be processed, so that the keypoint detection image includes target key points and stable detection frames.
  • The target key points and the stable detection frame may be presented with markers of different colors and shapes to make the two easier to distinguish.
  • Such a key point detection system can improve the intuitiveness of the presentation of target key points, and facilitate the identification of testers and users, no matter in the process of testing or use.
  • The specific structure of the heat map obtaining unit 404 may be as shown in FIG. 3. Based on the operations in steps 204 and 205 above, the amount of computation is reduced while the accuracy of key point positioning is improved, achieving a double optimization of accuracy and efficiency.
  • Such a key point detection system can detect key points in a top-down manner, generate a heat map from the image to be processed, and locate key points by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point positioning; at the same time, based on the accurate detection frame extraction process, key point detection is carried out on the cropped image, which improves detection efficiency and accuracy.
  • the keypoint detection system includes a memory 501 and a processor 502 .
  • the memory 501 may be a magnetic disk, a flash memory or any other non-volatile storage medium.
  • the memory is used to store the instructions in the corresponding embodiments of the above keypoint detection method.
  • the processor 502 is coupled to the memory 501 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller.
  • the processor 502 is configured to execute the instructions stored in the memory, which can improve the efficiency and accuracy of key point detection.
  • the keypoint detection system 600 includes a memory 601 and a processor 602 .
  • Processor 602 is coupled to memory 601 through a bus 603.
  • the keypoint detection system 600 can also be connected to an external storage device 605 through a storage interface 604 for recalling external data, and can also be connected to a network or another computer system (not shown) through a network interface 606 . It will not be described in detail here.
  • the data instructions are stored in the memory, and the above instructions are processed by the processor, which can improve the efficiency and accuracy of key point detection.
  • a non-transitory computer-readable storage medium stores computer program instructions thereon, and when the instructions are executed by a processor, implements the steps of the method in the corresponding embodiment of the key point detection method.
  • Embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • SSE describes the sum of squared errors between the predicted key points and the corresponding key points of the GT data; the closer the result is to 0, the better the model fit and the more successful the prediction. With coordinates normalized by the image width and height, it can be calculated as:
  • SSE = Σ_{s=1}^{D} Σ_{i=1}^{21} ‖ŷ_si − y_si‖²
  • where y_si is the GT key point and ŷ_si is the predicted key point; i is the index over the 21 joint points; s is the index of the hand sample; D is the number of samples in the data set; w and h are the width and height of the original image, used for normalization.
  • EPE describes the average Euclidean distance between the predicted key points and the GT key points after aligning the root node (that is, the wrist point); the smaller the value, the more successful the prediction. It can be calculated as:
  • EPE = (1 / (21·D)) Σ_{s=1}^{D} Σ_{i=1}^{21} ‖ŷ_si − y_si‖₂
  • where y_si is the GT key point and ŷ_si is the predicted key point; i is the index over the 21 joint points; s is the index of the hand sample; D is the number of samples in the data set; w and h are the width and height of the original image, used for normalization.
  • PCK describes the proportion of correctly located points among all predicted points; the closer to 100%, the better the result.
  • A predicted point is counted as correct when, after normalizing the predicted and GT key points, their Euclidean distance is below a threshold σ; otherwise it is counted as wrong. It can be calculated as:
  • PCK_σ^i = (1/D) Σ_{s=1}^{D} 1(‖ŷ_si − y_si‖₂ < σ),  PCK_σ = (1/21) Σ_{i=1}^{21} PCK_σ^i
  • where y_si is the GT key point and ŷ_si is the predicted value; i is the index over the 21 joint points; s is the index of the hand sample; D is the number of samples in the data set; w and h are the width and height of the original image; 1(·) is the indicator function, equal to 1 when the L2 distance of the key point is less than σ and 0 otherwise; PCK_σ^i is the PCK index of the i-th key point at threshold σ; and PCK_σ is the average PCK index over all key points at threshold σ.
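Under the definitions above, the metrics can be sketched as follows (assuming key points are already root-aligned and normalized by the image width and height, as the text describes; the array shapes are illustrative):

```python
import numpy as np

def sse(pred, gt):
    """Sum of squared errors over all samples and key points."""
    return float(((pred - gt) ** 2).sum())

def epe(pred, gt):
    """Mean Euclidean distance per key point (after root alignment)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck(pred, gt, sigma):
    """Fraction of key points whose L2 error is below the threshold sigma."""
    return float((np.linalg.norm(pred - gt, axis=-1) < sigma).mean())

# D=2 hand samples, 21 key points, 2D coordinates
gt = np.zeros((2, 21, 2))
pred = np.zeros((2, 21, 2))
pred[0, 0] = (3.0, 4.0)   # one key point off by Euclidean distance 5

print(sse(pred, gt))      # 25.0
print(epe(pred, gt))      # 5 / 42
print(pck(pred, gt, 1.0)) # 41 / 42
```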
  • Table 1 shows a comparison of the performance indicators of three hand key point localization methods — SRHand, NSRMHand, and InterHand — on the STB (Stereo Hand Pose Tracking Benchmark) dataset.
  • the performance indicators of the present disclosure are second only to InterHand trained on STB.
  • Table 2 shows the comparison of the running metrics of the three methods: SRHand, NSRMHand, and InterHand on RHD (Rendered Hand Dataset).
  • The parameter size of the key point localization network proposed in this disclosure is only 3.7 MB, and the model volume is only 13 MB; the running speed on a PC-side NVIDIA GeForce 940MX reaches 31.9134 ms per frame, i.e., about 31 fps. The whole system takes about 60 ms per run on the PC side, achieving a quasi-real-time effect.
  • Table 3 shows the forward time of the disclosed method compared with SRHand, NSRMHand, and InterHand on GeForce 940MX and Jetson TX2.
  • Table 4 shows the comparison results of the model sizes of the key point localization network of the present disclosure and methods such as SRHand, NSRMHand, and InterHand.
  • The visualization results of the proposed solution in real shooting environments include gestures for the digits 0 to 9, rock, love, thumbs-up, and claw under different viewing angles; both stretched and self-occluded gestures can be accurately positioned. Compared with the results of SRHand, NSRMHand, and InterHand on the public RHD dataset, the accuracy is significantly higher than that of SRHand and NSRMHand; and although the disclosed method is not trained on RHD, it achieves accuracy almost consistent with InterHand trained on RHD.
  • the smart terminal 700 includes one or more image capturing devices 71 .
  • the intelligent terminal 700 further includes any one of the key point detection systems mentioned above, and executes any one of the key point detection methods mentioned above.
  • the smart terminal may be a device such as a mobile phone, a camera, or a computer.
  • Such an intelligent terminal can detect key points in a top-down manner based on the captured images to be processed, generate a heat map based on the image to be processed, locate key points by processing the heat map, and retain more spatial position information, thereby improving the accuracy of key point positioning; at the same time, based on an accurate detection-frame extraction process, key point detection is performed on the cropped image, which improves detection efficiency and accuracy.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described order of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise.
  • the present disclosure can also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure proposes a key point detection method, system, intelligent terminal, and storage medium, relating to the technical field of image detection. A key point detection method of the present disclosure includes: detecting an image of a target in an image to be processed, and acquiring a first detection frame of the target; acquiring a second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames; extracting an image of a target region from the image to be processed according to the second detection frame; acquiring a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points; and determining position information of the target key points by acquiring peak points of the heat map. With such a method, key point detection can be performed on the cropped image based on an accurate detection-frame extraction process, improving detection efficiency and accuracy.

Description

Key point detection method, system, intelligent terminal, and storage medium
Cross-reference to related applications
The present application is based on and claims priority to CN application No. 202110175751.7, filed on February 9, 2021, the disclosure of which is hereby incorporated into the present application in its entirety.
Technical field
The present disclosure relates to the technical field of image detection, and in particular to a key point detection method, system, intelligent terminal, and storage medium.
Background
Key point localization refers to locating target positions from input data. For hand key point localization specifically, it means locating the position of each joint point of the hand from the input data, where the number of joint points is 21.
The input data takes various forms depending on the sensor. Taking monocular RGB data (such as images captured by the front camera of a mobile phone) as an example: since the acquisition of monocular RGB data is very widespread, obtaining stable and accurate hand key point positions from monocular RGB images can be applied to various devices to improve operation efficiency.
Current hand pose estimation methods mainly follow two approaches: first, top-down, i.e., first detect the individual, then locate the positions of the key points within the individual; second, bottom-up, i.e., first detect the key points, then cluster the key points to form each individual.
Top-down methods are good at detecting individuals, but as the number of target objects increases, the key point localization model is run repeatedly and the overall time cost grows geometrically. Bottom-up methods detect local key points from the whole image and complete the localization task end to end, so they have a clear speed advantage; however, they struggle with targets like human hands, which are small, have high degrees of freedom, and have highly similar local features.
Summary
One object of the present disclosure is to propose a solution that improves detection efficiency while ensuring the accuracy of key point detection.
According to an aspect of some embodiments of the present disclosure, a key point detection method is proposed, including: detecting an image of a target in an image to be processed, and acquiring a first detection frame of the target; acquiring a second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames; extracting an image of a target region from the image to be processed according to the second detection frame; acquiring a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points; and determining position information of the target key points by acquiring peak points of the heat map.
In some embodiments, the key point detection method further includes: rendering the target key points in the image to be processed according to the position information of the target key points, to acquire a key point detection image.
In some embodiments, acquiring the second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in the historical image frames includes: acquiring the second detection frame through a weighted average of the first detection frame of the current image frame and the first detection frames in the historical image frames, from the first frame before the current image frame to a predetermined number of frames before it; wherein the first detection frame corresponding to the current image frame has the largest weight, and the shorter the time from the current image frame, the larger the weight of the first detection frame in a historical image frame.
In some embodiments, extracting the image of the target region from the image to be processed according to the second detection frame includes: enlarging the second detection frame by a predetermined ratio; and cropping the image within the enlarged second detection frame as the image of the target region.
In some embodiments, acquiring the heat map through the deep learning network according to the image of the target region includes: processing the image of the target region to a first resolution; inputting the image of the target region at the first resolution into an encoding module of a deep learning algorithm to acquire high-level features; and inputting the output information of the encoding module into a decoding module of the deep learning algorithm to increase the resolution of the feature map, to acquire a heat map at a second resolution, wherein the second resolution is smaller than the first resolution.
In some embodiments, inputting the image of the target region at the first resolution into the encoding module of the deep learning algorithm to acquire high-level features includes: extracting low-level features of the image of the target region at the first resolution; performing feature fusion based on the low-level features; and performing successive downsampling on the fused features to acquire the high-level features.
In some embodiments, inputting the output information of the encoding module into the decoding module to increase the resolution of the feature map, to acquire the heat map at the second resolution, includes: increasing the resolution of the feature map output by the encoding module to the second resolution through three transposed-convolution layers.
In some embodiments, acquiring the position information of the target key points by extracting the peak points of the heat map includes: extracting a peak point for the image of each channel of the heat map respectively, and determining the corresponding target key point; and acquiring the position information of the determined target key points.
In some embodiments, the key point detection method satisfies at least one of the following: the first resolution is 256*256; the second resolution is 64*64; or the resolution of the feature map is 8*8.
In some embodiments, the target image is an image of a human hand, and the target key points are key points of the hand.
According to an aspect of some embodiments of the present disclosure, a key point detection system is proposed, including: a target detection unit configured to detect an image of a target in an image to be processed and acquire a first detection frame; a frame stabilization unit configured to acquire a second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames; a target region extraction unit configured to extract an image of a target region from the image to be processed according to the second detection frame; a heat map acquisition unit configured to acquire a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points; and a key point extraction unit configured to acquire position information of the target key points from the peak points of the heat map.
In some embodiments, the key point detection system further includes: a rendering unit configured to render the target key points in the image to be processed according to the position information of the target key points, to acquire a key point detection image.
According to an aspect of some embodiments of the present disclosure, a key point detection system is proposed, including: a memory; and a processor coupled to the memory, the processor being configured to execute any one of the above key point detection methods based on instructions stored in the memory.
According to an aspect of some embodiments of the present disclosure, a non-transitory computer-readable storage medium is proposed, on which computer program instructions are stored, which, when executed by a processor, implement the steps of any one of the above key point detection methods.
According to an aspect of some embodiments of the present disclosure, an intelligent terminal is proposed, including: an image capturing device configured to capture images; and any one of the above key point detection systems.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present disclosure and constitute a part of the present disclosure. The illustrative embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation of it. In the drawings:
Fig. 1A is a flowchart of some embodiments of the key point detection method of the present disclosure.
Fig. 1B is a schematic diagram of some embodiments of the key point detection method of the present disclosure.
Fig. 2 is a flowchart of other embodiments of the key point detection method of the present disclosure.
Fig. 3 is a schematic diagram of some embodiments of acquiring a heat map in the key point detection method of the present disclosure.
Fig. 4 is a schematic diagram of some embodiments of the key point detection system of the present disclosure.
Fig. 5 is a schematic diagram of other embodiments of the key point detection system of the present disclosure.
Fig. 6 is a schematic diagram of still other embodiments of the key point detection system of the present disclosure.
Fig. 7 is a schematic diagram of some embodiments of the intelligent terminal of the present disclosure.
Detailed description
The technical solution of the present disclosure is further described in detail below through the drawings and embodiments.
A flowchart of some embodiments of the key point detection method of the present disclosure is shown in Fig. 1A.
In step 101, an image of a target is detected in an image to be processed, and a first detection frame of the target is acquired. In some embodiments, a target detection algorithm in the related art may be used to extract the target; or the target may be extracted based on a target detection model in the related art, trained on images of objects of the same kind as the target.
In some embodiments, a mobilenet-ssd model may be used. This model uses the lightweight convolutional neural network mobilenetv2 as the backbone, followed by the detection head of the SSD model architecture. The SSD detection architecture is a single-stage, end-to-end method, which has a speed advantage over two-stage methods; moreover, SSD exploits feature maps at 6 different resolutions, which benefits the detection of targets of different sizes. This helps detect objects like hands, whose distance from the camera changes frequently and whose shape also changes (for example, the image changes between a fist and an outstretched hand), and can improve detection accuracy.
In step 102, a second detection frame is acquired through image stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames. Since key point localization is performed in a top-down manner, the problem of detection-frame jitter is introduced; obtaining the second detection frame based on the first detection frames detected in multiple image frames can reduce the localization error caused by jitter and further improve accuracy.
In step 103, an image of a target region is extracted from the image to be processed according to the second detection frame. In some embodiments, the image within the second detection frame may be cropped as the image of the target region.
Since the shape and size of a non-rigid body's frame change with motion (for example, the detection frames of a palm and a fist differ greatly in size, and the aspect ratio of the frame fluctuates greatly while the palm rotates), in some embodiments the detection frame may be enlarged by a predetermined ratio, such as an outward expansion of 1.5 times, so that the image of the target always stays in the central region of the image, improving the robustness of detection.
In step 104, a heat map is acquired through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points. In some embodiments, the heatmap form replaces the regression form and retains more spatial position information, which helps ensure accurate localization of the key points of targets with high degrees of freedom.
In step 105, the position information of the target key points is determined by acquiring the peak points of the heat map. In some embodiments, for example for a 21-channel heat map of a hand image, the position information of the 21 hand key points can be obtained by extracting the peak point channel by channel.
With such a method, key point detection can be performed in a top-down manner: a heat map is generated based on the image to be processed, and key points are located by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point localization. At the same time, based on an accurate detection-frame extraction process, key point detection is performed on the cropped image, improving detection efficiency and accuracy.
In some embodiments, the key point detection method further includes step 106: rendering the target key points in the image to be processed according to their position information, to acquire a key point detection image. In some embodiments, the stabilized detection frame may also be rendered in the image to be processed, so that the key point detection image includes both the target key points and the second detection frame. In some implementations, the target key points and the second detection frame may be presented with markers of different colors and shapes to improve their distinguishability.
With such a method, whether during testing or during use, the intuitiveness of the presentation of the target key points can be improved, making them easy for testers and users to identify.
A schematic diagram of some embodiments of the key point detection method of the present disclosure is shown in Fig. 1B. The original image (an RGB image) is fed into the hand detection network after detection preprocessing to obtain the first detection frame of the hand; the current first detection frame and the first detection frames in historical image frames are fed into the frame stabilization module to obtain the second detection frame; the ROI (Region of Interest) is cropped from the original image and, after localization preprocessing, fed into the key point localization network to obtain the heatmap of the hand, from which the 2D information of the 21 hand key points is extracted; finally, the key point and detection-frame information is rendered onto the original image.
With such a method, the top-down hand key point detection approach and the heat-map-based key point localization improve the accuracy of key point localization; based on an accurate detection-frame extraction process, hand key point detection is performed on the cropped image, improving detection efficiency and accuracy.
A flowchart of other embodiments of the key point detection method of the present disclosure is shown in Fig. 2.
In step 201, an image of a target is detected in an image to be processed, and a first detection frame of the target is acquired.
In step 202, a second detection frame is acquired through a weighted average of the first detection frame of the current image frame and the first detection frames in the historical image frames, from the first frame before the current image frame to a predetermined number of frames before it; wherein the first detection frame corresponding to the current image frame has the largest weight, and the shorter the time from the current image frame, the larger the weight of the first detection frame in a historical image frame.
In some embodiments, the detection frame is stabilized by an exponentially weighted average: detection frames from near to far in time are averaged with exponentially decreasing weights. This exploits and reflects the correlation between the current frame and past frames in the video stream: the closer in time, the higher the correlation and the larger the weight; the farther in time, the lower the correlation and the smaller the weight.
In some embodiments, the detection-frame stabilization formula may be:

$$P_{cur} = \frac{\sum_{k=0}^{n-1} \alpha_k P_k}{\sum_{k=0}^{n-1} \alpha_k}, \qquad \alpha_k = e^{-k}$$

In the above formula, k is the frame index counted back from the current frame; for example, 0 refers to the current frame and 1 to the frame before it. n is the total number of frames whose initial detection frames are included in the calculation (for example, it may be set to 6); $P_k$ is the detection-frame coordinates of the k-th frame before the current frame; $\alpha_k$ is the exponential decay coefficient, with e a constant; and $P_{cur}$ is the stabilized detection-frame position, i.e., the position of the stabilized detection frame.
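The exponentially weighted stabilization above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function name `stabilize_box`, the `[x1, y1, x2, y2]` box layout, and the default of n = 6 (the example value mentioned in the text) are assumptions.

```python
import numpy as np

def stabilize_box(boxes, n=6):
    """Exponentially weighted average over the most recent detection boxes.

    boxes: list of box coordinates, most recent first (index 0 = current
    frame), each an array-like of [x1, y1, x2, y2].
    n: number of frames to average over (6 is the example value in the text).
    """
    recent = np.asarray(boxes[:n], dtype=float)
    k = np.arange(len(recent))   # 0 = current frame, 1 = previous frame, ...
    weights = np.exp(-k)         # exponential decay: newer frames weigh more
    return (weights[:, None] * recent).sum(axis=0) / weights.sum()
```

With a single frame in the history the function returns that frame's box unchanged; with more frames, the result is pulled only slightly toward older boxes, since the current frame carries the largest weight.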
In step 203, the second detection frame is enlarged by a predetermined ratio (e.g., 1.5 times), and the image within the enlarged second detection frame is cropped as the image of the target region.
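The enlarge-and-crop operation of step 203 can be sketched as follows. The function name, the `[x1, y1, x2, y2]` box layout, and the clamping to image bounds are illustrative assumptions; the patent only specifies enlargement about the box by a predetermined ratio such as 1.5.

```python
import numpy as np

def crop_target(image, box, scale=1.5):
    """Enlarge a [x1, y1, x2, y2] box about its center by `scale`
    (1.5x in the text) and crop that region, clamped to image bounds."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    nx1 = max(0, int(round(cx - half_w)))
    ny1 = max(0, int(round(cy - half_h)))
    nx2 = min(w, int(round(cx + half_w)))
    ny2 = min(h, int(round(cy + half_h)))
    return image[ny1:ny2, nx1:nx2]
```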
In step 204, the image of the target region is processed to a first resolution. In some embodiments, sampling may be used to bring the cropped image of the target region to the first resolution. In some embodiments, considering that too high a resolution increases the computational load while too low a resolution reduces accuracy, a resolution of 256*256 may be chosen as the first resolution.
In some embodiments, in steps 205 and 206, the heat map is generated based on the deep learning network.
In step 205, the image of the target region at the first resolution is input into the encoding module (Encoder) of the deep learning algorithm to acquire high-level features.
In some embodiments, the Encoder may use multiple 1x1 convolutions, residual connections, and depthwise separable convolutions, so that the network keeps a small parameter count while having deep layers to learn higher-level information. The Encoder may, as shown in part A of Fig. 3, include a first sub-module (labeled Low), a second sub-module (labeled Middle), and a third sub-module (labeled High). The Low part extracts low-level features; the Middle part performs feature fusion to improve parameter utilization; the High part obtains more high-level information through successive downsampling. The image input into the Encoder in the figure is an RGB image.
In some embodiments, part B of Fig. 3 shows the convolution module used to reduce resolution in the Encoder of part A, composed of successive residual blocks and a downsampling module that fuses convolution and pooling. In some embodiments, in the Low and Middle parts the residual block may be configured to repeat 2 times, and in the High part 4 times. Part C of Fig. 3 shows the convolution module that implements upsampling in the Middle part of A, composed of a residual block, convolution, and an image resize operation.
In step 206, the output information of the encoding module is input into the decoding module (Decoder) of the deep learning algorithm to increase the resolution of the feature map, to acquire a heat map at a second resolution, wherein the second resolution is smaller than the first resolution.
In some embodiments, the output resolution of the Encoder may be 8*8, and the Decoder raises the resolution to 64*64 to retain more position-related information. In some embodiments, when heatmaps are used as GT (Ground Truth) to predict key points, there is no need for skip connections, multi-resolution fusion, or similar techniques; only 3 transposed-convolution layers are used for upsampling to restore the resolution of the feature map, which reduces computation and improves processing efficiency. In some embodiments, a 64*64*21 heat map may be output, where 21 is the number of channels.
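The claim that three transposed-convolution layers take the 8x8 encoder output to the 64x64 heat map can be checked with the standard transposed-convolution output-size formula. The kernel/stride/padding values (4, 2, 1) below are assumptions chosen to double the resolution per layer; the patent does not state them.

```python
def deconv_out_size(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a transposed convolution:
    out = (in - 1) * stride - 2 * padding + kernel.
    With kernel=4, stride=2, padding=1 this doubles the resolution."""
    return (size - 1) * stride - 2 * padding + kernel

size = 8                  # encoder output resolution from the text
for _ in range(3):        # three transposed-convolution layers
    size = deconv_out_size(size)
# size is now 64, matching the 64x64 heat-map resolution in the text
```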
In step 207, for the image of each channel of the heat map, a peak point is extracted respectively, the corresponding target key point is determined, and the position information of the determined target key points is acquired.
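The channel-wise peak extraction of step 207 can be sketched as below. The `(H, W, C)` heat-map layout and the function name are assumptions; a practical system might also refine the peak to sub-pixel precision, which this sketch omits.

```python
import numpy as np

def heatmap_peaks(heatmaps):
    """Extract one (x, y) peak per channel from an (H, W, C) heat map.

    Returns an array of shape (C, 2) with the column/row index of each
    channel's maximum response; C = 21 for the hand case in the text."""
    h, w, c = heatmaps.shape
    flat = heatmaps.reshape(h * w, c)      # merge spatial dims, keep channels
    idx = flat.argmax(axis=0)              # flat index of the peak per channel
    ys, xs = np.unravel_index(idx, (h, w)) # back to row/column coordinates
    return np.stack([xs, ys], axis=1)
```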
With such a method, the amount of computation can be reduced through target-image extraction and resizing while the accuracy of key point localization is improved, achieving a double optimization of accuracy and efficiency; this favors the expansion of application scenarios and devices and facilitates wide deployment.
In some embodiments, before actual use, the key point detection system needs to be trained so that both the neural network performing target image detection and the one performing heatmap generation meet the accuracy requirements.
In some embodiments, the dataset, including a training set and a test set, may be prepared from common video websites. For example, the training set is composed of YouTube2D and GANeratedHand, where YouTube2D is augmented 10 times (including scaling and random cropping) to reach 471,250 images; 141,449 object-free images from GANeratedHand are used; the training set totals 612,699 images, with real data to generated data kept at a ratio of 10:3. The test set consists of YouTube2D only, with 1,525 images.
Further, the training parameters are configured. For example, with the batch size set to batch=64, there are 9,574 iterations per epoch (one pass over all samples in the training set); training uses the Adam optimizer with an initial learning rate of 0.001, held for the first 3 epochs and then decayed exponentially per epoch; the maximum number of epochs (i.e., the number of rounds over the whole training set) is 50.
Finally, the software framework and hardware are selected: the TensorFlow 2.0 framework, trained on 4 Tesla P40 GPUs simultaneously.
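The learning-rate schedule described above (held for the first 3 epochs, then decayed exponentially per epoch) can be sketched as a plain function. The decay factor 0.9 is an assumption, since the text does not give the exact rate; in TensorFlow 2.0 the same schedule could be attached to the Adam optimizer via a learning-rate callback.

```python
def learning_rate(epoch, base_lr=1e-3, hold_epochs=3, decay=0.9):
    """Hold base_lr for the first `hold_epochs` epochs, then decay
    exponentially each epoch. decay=0.9 is an illustrative assumption."""
    if epoch < hold_epochs:
        return base_lr
    return base_lr * decay ** (epoch - hold_epochs)
```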
A schematic diagram of some embodiments of the key point detection system of the present disclosure is shown in Fig. 4.
The target detection unit 401 can detect an image of a target in an image to be processed and acquire a first detection frame of the target. In some embodiments, a target detection algorithm in the related art may be used to extract the target; or the target may be extracted based on a target detection model in the related art, trained on images of objects of the same kind as the target.
The frame stabilization unit 402 can acquire a second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in historical image frames. Since key point localization is performed in a top-down manner, the problem of input-frame jitter is introduced; obtaining the second detection frame based on the first detection frames detected in multiple image frames can reduce the localization error caused by jitter and further improve accuracy.
The target region extraction unit 403 can extract an image of a target region from the image to be processed according to the second detection frame. In some embodiments, the image within the second detection frame may be cropped as the image of the target region. In some embodiments, the second detection frame may be enlarged by a predetermined ratio, such as an outward expansion of 1.5 times, so that the image of the target always stays in the central region of the image, improving the robustness of detection.
The heat map acquisition unit 404 can acquire a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points. In some embodiments, the heatmap form replaces the regression form and retains more spatial position information, which helps ensure accurate localization of the key points of targets with high degrees of freedom.
The key point extraction unit 405 can determine the position information of the target key points from the peak points of the heat map. In some embodiments, for example for a 21-channel heat map of a hand image, the position information of the 21 hand key points can be obtained by extracting the peak point channel by channel.
Such a key point detection system can perform key point detection in a top-down manner, generating a heat map based on the image to be processed and locating key points by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point localization; at the same time, based on an accurate detection-frame extraction process, key point detection is performed on the cropped image, improving detection efficiency and accuracy.
In some embodiments, as shown in Fig. 4, the key point detection system may further include a rendering unit 406 that renders the target key points in the image to be processed according to their position information, to acquire a key point detection image. In some embodiments, the stabilized detection frame may also be rendered in the image to be processed, so that the key point detection image includes both the target key points and the stabilized detection frame. In some implementations, the target key points and the stabilized detection frame may be presented with markers of different colors and shapes to improve their distinguishability.
Such a key point detection system can improve the intuitiveness of the presentation of the target key points both during testing and during use, making them easy for testers and users to identify.
In some embodiments, the specific structure of the heat map acquisition unit 404 may be as shown in Fig. 3; based on the operations in steps 204 and 205 above, it reduces computation while improving key point localization accuracy, achieving a double optimization of accuracy and efficiency.
A structural schematic diagram of an embodiment of the key point detection system of the present disclosure is shown in Fig. 5. The key point detection system includes a memory 501 and a processor 502. The memory 501 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and is used to store the instructions of the corresponding embodiments of the key point detection method above. The processor 502 is coupled to the memory 501 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 502 is used to execute the instructions stored in the memory, and can improve the efficiency and accuracy of key point detection.
In one embodiment, as shown in Fig. 6, the key point detection system 600 includes a memory 601 and a processor 602. The processor 602 is coupled to the memory 601 through a BUS 603. The key point detection system 600 may also be connected to an external storage device 605 through a storage interface 604 to call external data, and may be connected to a network or another computer system (not shown) through a network interface 606. No further details are given here.
In this embodiment, data instructions are stored by the memory and processed by the processor, which can improve the efficiency and accuracy of key point detection.
In another embodiment, a non-transitory computer-readable storage medium stores computer program instructions which, when executed by a processor, implement the steps of the methods in the corresponding embodiments of the key point detection method. As will be appreciated by those skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The following is an evaluation of the key point detection method and system of the present disclosure. Three quantitative metrics are used to demonstrate the effect of the algorithm: SSE (Sum of Squared Errors), EPE (End Point Error), and PCK (Percentage of Correct Keypoints).
SSE describes the sum of squared errors between corresponding key points of the predicted data and the GT data; the closer the result is to 0, the better the model fits and the more successful the prediction. It can be calculated by the following formula, where key point coordinates are normalized by the original image width w and height h:

$$\mathrm{SSE} = \sum_{s=1}^{D}\sum_{i=1}^{21} \left\| y_{si} - \hat{y}_{si} \right\|_2^2$$

In the above formula, $y_{si}$ is the GT, $\hat{y}_{si}$ is the predicted value, i is the index over the 21 knuckles, s is the index of the hand sample, D is the number of samples in the dataset, and w and h are the width and height of the original image, respectively.
EPE describes the average Euclidean distance between the predicted key points and the GT key points after aligning the root joint (i.e., the wrist point); the smaller the value, the more successful the prediction. It can be calculated by the following formula:

$$\mathrm{EPE} = \frac{1}{21D}\sum_{s=1}^{D}\sum_{i=1}^{21} \left\| y_{si} - \hat{y}_{si} \right\|_2$$

In the above formula, $y_{si}$ is the GT, $\hat{y}_{si}$ is the predicted value, i is the index over the 21 knuckles, s is the index of the hand sample, D is the number of samples in the dataset, and w and h are the width and height of the original image, respectively.
PCK describes the proportion of correctly located points among all predicted points; the closer to 100%, the better the result. A correctly predicted point means that, after normalizing the predicted and GT key points, the prediction is counted as correct (adding 1 to the count of correct points) if their Euclidean distance is smaller than a certain threshold, and as incorrect otherwise. It can be calculated by the following formulas:

$$\mathrm{PCK}_{i}^{\sigma} = \frac{1}{D}\sum_{s=1}^{D} \mathbb{1}\left(\left\| y_{si} - \hat{y}_{si} \right\|_2 < \sigma\right)$$

$$\mathrm{PCK}^{\sigma} = \frac{1}{21}\sum_{i=1}^{21} \mathrm{PCK}_{i}^{\sigma}$$

In the above formulas, $y_{si}$ is the GT, $\hat{y}_{si}$ is the predicted value, i is the index over the 21 knuckles, s is the index of the hand sample, D is the number of samples in the dataset, and w and h are the width and height of the original image, respectively; $\mathbb{1}(\cdot)$ is the indicator function and σ is the threshold, set to 1 when the L2 distance of a key point is smaller than σ and 0 otherwise; $\mathrm{PCK}_{i}^{\sigma}$ denotes the PCK of the i-th key point at threshold σ, and $\mathrm{PCK}^{\sigma}$ denotes the average PCK over all key points at threshold σ.
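Under the definitions above, EPE and PCK can be sketched as follows, assuming coordinates are already root-aligned (for EPE) and normalized (for PCK); the array shapes and function names are illustrative, not from the patent.

```python
import numpy as np

def epe(pred, gt):
    """Mean Euclidean distance between predicted and GT key points.
    pred, gt: arrays of shape (S, 21, 2), root-aligned beforehand."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, sigma):
    """Fraction of key points whose L2 error is below threshold sigma
    (coordinates assumed already normalized by image width/height)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return (dist < sigma).mean()
```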
Table 1 shows a comparison of the running metrics on the STB (Stereo Hand Pose Tracking Benchmark) dataset against three hand key point localization methods in the related art: SRHand, NSRMHand, and InterHand. On the STB dataset, the running metrics of the present disclosure are second only to those of InterHand trained on STB.
Table 1 Comparison metrics on the STB dataset
Table 2 shows a comparison of the running metrics on RHD (Rendered Hand Dataset) against the three methods SRHand, NSRMHand, and InterHand. On the RHD dataset, the running metrics of the present disclosure are second only to those of InterHand trained on RHD, demonstrating the good generalization of the present disclosure across datasets.
Table 2 Comparison metrics on the RHD dataset
The key point localization network proposed in the present disclosure has only 3.7 MB of parameters and a model size of only 13 MB; a forward pass on a PC-side NVIDIA GeForce 940MX takes 31.9134 ms, i.e., about 31 fps. The whole system takes about 60 ms per run on the PC side, achieving a quasi-real-time effect. Table 3 shows the forward time of the method of the present disclosure compared with SRHand, NSRMHand, and InterHand on GeForce 940MX and Jetson TX2.
Table 3 Forward speed comparison
Table 4 shows a comparison of the model size of the key point localization network of the present disclosure with methods such as SRHand, NSRMHand, and InterHand.
Table 4 Model size comparison
The above two tables show that the solution of the present disclosure has clear advantages over other methods in forward speed and model size.
In addition, the visualization results of the proposed solution in real shooting environments include gestures such as the digits 0 to 9, rock, love, thumbs-up, and claw under different viewing angles; both outstretched and self-occluded gestures can be accurately located. In the comparison with SRHand, NSRMHand, and InterHand on the public dataset RHD, the accuracy is clearly higher than that of SRHand and NSRMHand; without being trained on RHD, the method of the present disclosure achieves accuracy almost consistent with InterHand, which was trained on RHD.
A schematic diagram of some embodiments of the intelligent terminal of the present disclosure is shown in Fig. 7. The intelligent terminal 700 includes one or more image capturing devices 71. The intelligent terminal 700 further includes any one of the key point detection systems mentioned above and executes any one of the key point detection methods mentioned above. In some embodiments, the intelligent terminal may be a device such as a mobile phone, a camera, or a computer.
Such an intelligent terminal can perform key point detection in a top-down manner based on the captured images to be processed, generating a heat map based on the image to be processed and locating key points by processing the heat map, retaining more spatial position information and thereby improving the accuracy of key point localization; at the same time, based on an accurate detection-frame extraction process, key point detection is performed on the cropped image, improving detection efficiency and accuracy.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The present disclosure has thus been described in detail. Some details well known in the art are not described in order to avoid obscuring the concept of the present disclosure. Based on the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed here.
The methods and apparatus of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for executing the methods according to the present disclosure.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present disclosure and not to limit them. Although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific implementations of the present disclosure may still be modified, or some technical features may be equivalently replaced, without departing from the spirit of the technical solutions of the present disclosure; all such modifications shall fall within the scope of the technical solutions claimed by the present disclosure.

Claims (15)

  1. A key point detection method, comprising:
    detecting an image of a target in an image to be processed, and acquiring a first detection frame of the target;
    acquiring a second detection frame through image stabilization according to the first detection frame in a current image frame and the first detection frames in historical image frames;
    extracting an image of a target region from the image to be processed according to the second detection frame;
    acquiring a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points; and
    determining position information of target key points by acquiring peak points of the heat map.
  2. The key point detection method according to claim 1, further comprising:
    rendering the target key points in the image to be processed according to the position information of the target key points, to acquire a key point detection image.
  3. The key point detection method according to claim 1 or 2, wherein acquiring the second detection frame through image stabilization according to the first detection frame in the current image frame and the first detection frames in the historical image frames comprises:
    acquiring the second detection frame through a weighted average of the first detection frame of the current image frame and the first detection frames in the historical image frames, from the first frame before the current image frame to a predetermined number of frames before it;
    wherein the first detection frame corresponding to the current image frame has the largest weight, and the shorter the time from the current image frame, the larger the weight of the first detection frame in a historical image frame.
  4. The key point detection method according to claim 1 or 2, wherein extracting the image of the target region from the image to be processed according to the second detection frame comprises:
    enlarging the second detection frame by a predetermined ratio; and
    cropping the image within the enlarged second detection frame as the image of the target region.
  5. The key point detection method according to claim 1 or 2, wherein acquiring the heat map through the deep learning network according to the image of the target region comprises:
    processing the image of the target region to a first resolution;
    inputting the image of the target region at the first resolution into an encoding module of a deep learning algorithm to acquire high-level features; and
    inputting output information of the encoding module into a decoding module of the deep learning algorithm to increase the resolution of the feature map, to acquire a heat map at a second resolution, wherein the second resolution is smaller than the first resolution.
  6. The key point detection method according to claim 5, wherein
    inputting the image of the target region at the first resolution into the encoding module of the deep learning algorithm to acquire high-level features comprises:
    extracting low-level features of the image of the target region at the first resolution;
    performing feature fusion based on the low-level features; and
    performing successive downsampling on the fused features to acquire the high-level features.
  7. The key point detection method according to claim 5, wherein inputting the output information of the encoding module into the decoding module to increase the resolution of the feature map, to acquire the heat map at the second resolution, comprises:
    increasing the resolution of the feature map output by the encoding module to the second resolution through three transposed-convolution layers.
  8. The key point detection method according to claim 1, wherein acquiring the position information of the target key points by extracting the peak points of the heat map comprises:
    extracting a peak point for the image of each channel of the heat map respectively, and determining the corresponding target key point; and
    acquiring the position information of the determined target key points.
  9. The key point detection method according to claim 5, satisfying at least one of the following:
    the first resolution is 256*256;
    the second resolution is 64*64; or
    the resolution of the feature map is 8*8.
  10. The key point detection method according to claim 1, wherein
    the target image is an image of a human hand, and the target key points are key points of the hand.
  11. A key point detection system, comprising:
    a target detection unit configured to detect an image of a target in an image to be processed and acquire a first detection frame;
    a frame stabilization unit configured to acquire a second detection frame through image stabilization according to the first detection frame in a current image frame and the first detection frames in historical image frames;
    a target region extraction unit configured to extract an image of a target region from the image to be processed according to the second detection frame;
    a heat map acquisition unit configured to acquire a heat map through a deep learning network according to the image of the target region, wherein the number of channels of the heat map matches the number of target key points; and
    a key point extraction unit configured to acquire position information of target key points from peak points of the heat map.
  12. The key point detection system according to claim 11, further comprising:
    a rendering unit configured to render the target key points in the image to be processed according to the position information of the target key points, to acquire a key point detection image.
  13. A key point detection system, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute the method according to any one of claims 1 to 10 based on instructions stored in the memory.
  14. A non-transitory computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
  15. An intelligent terminal, comprising:
    an image capturing device configured to capture images; and
    the key point detection system according to any one of claims 11 to 13.
PCT/CN2022/070537 2021-02-09 2022-01-06 关键点检测方法、系统、智能终端和存储介质 WO2022170896A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110175751.7A CN113743177A (zh) 2021-02-09 2021-02-09 关键点检测方法、系统、智能终端和存储介质
CN202110175751.7 2021-02-09

Publications (2)

Publication Number Publication Date
WO2022170896A1 WO2022170896A1 (zh) 2022-08-18
WO2022170896A9 true WO2022170896A9 (zh) 2022-09-15

Family

ID=78728166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070537 WO2022170896A1 (zh) 2021-02-09 2022-01-06 关键点检测方法、系统、智能终端和存储介质

Country Status (2)

Country Link
CN (1) CN113743177A (zh)
WO (1) WO2022170896A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743177A (zh) * 2021-02-09 2021-12-03 北京沃东天骏信息技术有限公司 关键点检测方法、系统、智能终端和存储介质
CN114373191A (zh) * 2022-01-04 2022-04-19 北京沃东天骏信息技术有限公司 一种手部骨节定位方法和装置
CN114782449B (zh) * 2022-06-23 2022-11-22 中国科学技术大学 下肢x光影像中关键点提取方法、系统、设备及存储介质
CN118550401A (zh) * 2023-02-24 2024-08-27 腾讯科技(深圳)有限公司 手部姿态识别方法、装置、设备、存储介质和程序产品
CN117789256A (zh) * 2024-02-27 2024-03-29 湖北星纪魅族集团有限公司 手势识别方法、装置、设备及计算机可读介质

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229488B (zh) * 2016-12-27 2021-01-01 北京市商汤科技开发有限公司 用于检测物体关键点的方法、装置及电子设备
CN110059522B (zh) * 2018-01-19 2021-06-25 北京市商汤科技开发有限公司 人体轮廓关键点检测方法、图像处理方法、装置及设备
CN109684920B (zh) * 2018-11-19 2020-12-11 腾讯科技(深圳)有限公司 物体关键点的定位方法、图像处理方法、装置及存储介质
CN110211211B (zh) * 2019-04-25 2024-01-26 北京达佳互联信息技术有限公司 图像处理方法、装置、电子设备及存储介质
CN110276316B (zh) * 2019-06-26 2022-05-24 电子科技大学 一种基于深度学习的人体关键点检测方法
CN111160111B (zh) * 2019-12-09 2021-04-30 电子科技大学 一种基于深度学习的人体关键点检测方法
CN111402294B (zh) * 2020-03-10 2022-10-18 腾讯科技(深圳)有限公司 目标跟踪方法、装置、计算机可读存储介质和计算机设备
CN111402228B (zh) * 2020-03-13 2021-05-07 腾讯科技(深圳)有限公司 图像检测方法、装置和计算机可读存储介质
CN112330589A (zh) * 2020-09-18 2021-02-05 北京沃东天骏信息技术有限公司 估计位姿的方法、装置及计算机可读存储介质
CN113743177A (zh) * 2021-02-09 2021-12-03 北京沃东天骏信息技术有限公司 关键点检测方法、系统、智能终端和存储介质

Also Published As

Publication number Publication date
CN113743177A (zh) 2021-12-03
WO2022170896A1 (zh) 2022-08-18

Similar Documents

Publication Publication Date Title
WO2022170896A9 (zh) 关键点检测方法、系统、智能终端和存储介质
CN109657631B (zh) 人体姿态识别方法及装置
CN110147717B (zh) 一种人体动作的识别方法及设备
CN111191622B (zh) 基于热力图和偏移向量的姿态识别方法、系统及存储介质
US11200424B2 (en) Space-time memory network for locating target object in video content
Ji et al. Interactive body part contrast mining for human interaction recognition
JP5554984B2 (ja) パターン認識方法およびパターン認識装置
Zhu et al. Saliency optimization from robust background detection
CN107067413B (zh) 一种时空域统计匹配局部特征的运动目标检测方法
US20130028517A1 (en) Apparatus, method, and medium detecting object pose
CN110135246A (zh) 一种人体动作的识别方法及设备
JP4951498B2 (ja) 顔画像認識装置、顔画像認識方法、顔画像認識プログラムおよびそのプログラムを記録した記録媒体
CN113807361B (zh) 神经网络、目标检测方法、神经网络训练方法及相关产品
CN107766864B (zh) 提取特征的方法和装置、物体识别的方法和装置
CN116453067B (zh) 基于动态视觉识别的短跑计时方法
CN110458235B (zh) 一种视频中运动姿势相似度比对方法
CN111027555B (zh) 一种车牌识别方法、装置及电子设备
Gouidis et al. Accurate hand keypoint localization on mobile devices
CN112733767B (zh) 一种人体关键点检测方法、装置、存储介质及终端设备
Yao et al. Poserac: Pose saliency transformer for repetitive action counting
CN113557546B (zh) 图像中关联对象的检测方法、装置、设备和存储介质
CN110633630B (zh) 一种行为识别方法、装置及终端设备
Radwan et al. Regression based pose estimation with automatic occlusion detection and rectification
JP6393495B2 (ja) 画像処理装置および物体認識方法
JP7479031B2 (ja) 顔解析のための画像正規化

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752058

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.11.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 22752058

Country of ref document: EP

Kind code of ref document: A1