CN115359265A - Key point extraction method, device, equipment and storage medium - Google Patents

Key point extraction method, device, equipment and storage medium

Info

Publication number
CN115359265A
Authority
CN
China
Prior art keywords: feature, image, characteristic, processed, target
Prior art date
Legal status: Pending
Application number
CN202210995374.6A
Other languages
Chinese (zh)
Inventor
张映艺
赵凯
姜鹏涛
张睿欣
丁守鸿
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210995374.6A priority Critical patent/CN115359265A/en
Publication of CN115359265A publication Critical patent/CN115359265A/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements using pattern recognition or machine learning
    • G06V 10/82: Arrangements using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107: Static hand or arm
    • G06V 40/11: Hand-related biometrics; Hand pose recognition
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a key point extraction method, apparatus, device and storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring an image to be processed, in which a target object is displayed; performing feature extraction on the image to be processed to obtain a feature map of the image, where the feature map characterizes the feature information of at least one key point related to the target object; processing the feature map to obtain a transverse feature and a longitudinal feature, where the transverse feature characterizes the feature information of the image to be processed in the horizontal direction and the longitudinal feature characterizes the feature information in the vertical direction; determining, according to the transverse feature, the horizontal position information corresponding to each of the at least one key point in the image to be processed; and determining, according to the longitudinal feature, the vertical position information corresponding to each of the at least one key point. With this method, the time and cost of locating key points in the image to be processed are low.

Description

Key point extraction method, device, equipment and storage medium
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a key point extraction method, apparatus, device and storage medium.
Background
In the field of computer vision, information such as the identity and posture of a target object in an image can be predicted by processing the image with a machine learning model.
In the related art, key points are extracted from an image with a heatmap-based method: the image is convolved to obtain a corresponding feature map, and the feature map is processed to obtain a plurality of heatmaps, where each heatmap corresponds to the position information of one key point. The key points of the target object in the image are then predicted from the heatmaps.
However, since each key point requires its own heatmap in the related art, a large number of operations are needed, much memory is occupied, and locating the key points in the image takes a long time.
Disclosure of Invention
The embodiments of the present application provide a key point extraction method, apparatus, device and storage medium. The technical solution is as follows:
According to an aspect of the embodiments of the present application, a key point extraction method is provided, the method including:
acquiring an image to be processed, in which a target object is displayed;
performing feature extraction on the image to be processed to obtain a feature map of the image to be processed, where the feature map characterizes the feature information of at least one key point related to the target object;
processing the feature map to obtain a transverse feature and a longitudinal feature, where the transverse feature characterizes the feature information of the image to be processed in the horizontal direction and the longitudinal feature characterizes the feature information of the image to be processed in the vertical direction;
determining, according to the transverse feature, the horizontal position information corresponding to each of the at least one key point in the image to be processed; and
determining, according to the longitudinal feature, the vertical position information corresponding to each of the at least one key point in the image to be processed.
According to an aspect of the embodiments of the present application, a key point extraction apparatus is provided, including:
an image acquisition module, configured to acquire an image to be processed in which a target object is displayed;
a feature extraction module, configured to perform feature extraction on the image to be processed to obtain a feature map of the image to be processed, where the feature map characterizes the feature information of at least one key point related to the target object;
a direction processing module, configured to process the feature map to obtain a transverse feature and a longitudinal feature, where the transverse feature characterizes the feature information of the image to be processed in the horizontal direction and the longitudinal feature characterizes the feature information of the image to be processed in the vertical direction; and
a position determining module, configured to determine, according to the transverse feature, the horizontal position information corresponding to each of the at least one key point in the image to be processed, and to determine, according to the longitudinal feature, the vertical position information corresponding to each of the at least one key point in the image to be processed.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein a computer program, the computer program being loaded and executed by the processor to implement the above-mentioned method.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored therein, the computer program being loaded and executed by a processor to implement the above method.
According to an aspect of an embodiment of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the above-described method.
The technical solution provided by the embodiments of the present application can have the following beneficial effects. In the process of locating the key points, the position information of the key points is decoupled, and the horizontal position information and the vertical position information are determined separately by regression, which improves the accuracy of the located key points. On the one hand, whereas the regression algorithms of the related art do not retain the spatial information in the feature map, the transverse and longitudinal features obtained from the feature map in this method retain part of that spatial information, providing more information for locating the key points. On the other hand, determining the transverse and longitudinal features separately prevents spatial information in different directions from interfering with each other, so that the horizontal position information extracted from the transverse feature and the vertical position information extracted from the longitudinal feature are more accurate.
In addition, when the target object has a plurality of key points, the horizontal position information of all of them can be determined from a single transverse feature and the vertical position information from a single longitudinal feature. Compared with determining the position information of one key point per heatmap in turn, this reduces the amount of computation needed to locate key points in the image to be processed and speeds up the locating. The method provided by the application therefore has a low time cost for extracting key points from an image, and is suitable for scenarios in which key points must be located from images in real time, such as human body posture estimation.
Drawings
FIG. 1 is a schematic diagram of an environment for implementing an embodiment provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of a key point extraction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an application scenario of a key point extraction method according to another embodiment of the present application;
FIG. 4 is a flowchart of a key point extraction method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an image to be processed provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of an image to be processed provided by another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a key point extraction method provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a target area provided by an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a key point extraction apparatus according to an embodiment of the present application;
FIG. 10 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before the technical solutions of the present application are introduced, some background knowledge related to the present application will be described. The following related art, as an optional part, can be combined arbitrarily with the technical solutions of the embodiments of the present application, and all such combinations belong to the scope of the embodiments of the present application. The embodiments of the present application include at least part of the following contents.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to identify and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of capturing information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, video content/behavior recognition, three-dimensional object reconstruction, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric identification technologies such as face recognition, fingerprint recognition and palm print recognition.
Machine Learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning.
With the research and progress of artificial intelligence technology, artificial intelligence has been developed and applied in many fields, for example in common computer vision applications such as intelligent motion recognition and identity verification.
The solution provided by the embodiments of the present application relates to the computer vision technology of artificial intelligence: the key points of a target object are located in an image to obtain their position information in the image. The posture of the target object can then be analyzed through the key points, a detection frame can be determined through the key points, the image in the detection frame can be verified, and so on.
Before the embodiments of the present application are described, terms appearing in the present embodiment will be explained below in order to facilitate understanding of the present embodiment.
Human Keypoint Detection (HKD) is a pre-task preceding human action recognition, behavior analysis, human-computer interaction and other tasks, and is a basic task in computer vision. By detecting human body key points, the coordinates of the key points of human skeletons in various postures can be estimated. In the related art, human body key point detection can be divided into single-person/multi-person key point detection, 2D/3D key point detection, whole-body/local posture estimation, and so on. After the key points have been detected, an algorithm can also continue to track them.
Human body detection refers to positioning a human body by using a target detection technology so as to extract a human body region picture from a picture.
Hand detection refers to positioning the hand using object detection techniques to extract a hand region picture from the picture.
Hand posture estimation: estimating the coordinates of the key points of hand bones in various postures.
Heatmap-based pose estimation refers to a method of determining key point coordinates by computing at least one heatmap for an input image and analyzing each heatmap.
Regression-based pose estimation refers to a method of determining the coordinates of the output key points in a regression manner by processing an input image.
The backbone network is used to extract a feature map from an input image. The feature map extracted by the backbone network can be passed to a prediction head to determine the coordinates of the key points.
The prediction head is a network structure that takes the feature map extracted by the backbone network as input and outputs the predicted key point coordinates.
Global average pooling refers to computing a global average over a feature map to obtain an averaged, simplified feature. Global average pooling can reduce computational cost.
The Argmax function returns the array index of the maximum element in an input array.
A linear layer is a neural network layer that applies a linear transformation to its input.
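As an illustration of the last two terms (this sketch is not part of the patent; the array values and layer sizes are arbitrary assumptions), in PyTorch:
    import torch
    import torch.nn as nn

    # Argmax: returns the index of the maximum element of the input array.
    scores = torch.tensor([0.1, 0.7, 0.2])
    index = torch.argmax(scores)    # tensor(1): the maximum 0.7 sits at index 1

    # Linear layer: applies the linear transformation y = x W^T + b.
    linear = nn.Linear(in_features=64, out_features=21)
    y = linear(torch.randn(1, 64))  # output shape (1, 21)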
Please refer to FIG. 1, which illustrates a schematic diagram of an environment for implementing an embodiment of the present application. The implementation environment may include: a terminal device 10 and a server 20.
The terminal device 10 includes, but is not limited to, a mobile phone, a tablet Computer, a smart voice interaction device, a game console, a wearable device, a multimedia player, a PC (Personal Computer), a vehicle-mounted terminal, a smart home appliance, and other electronic devices. A client of the target application may be installed in the terminal device 10.
In the embodiments of the present application, the target application may be any application capable of providing an image processing function, typically an image processing application. Such applications provide the ability to analyze the content of an input image. Of course, image processing services may also be provided in other types of applications, for example news applications, shopping applications, social applications, interactive entertainment applications, browser applications, content sharing applications, Virtual Reality (VR) applications, Augmented Reality (AR) applications and the like, which is not limited in the embodiments of the present application. In addition, different applications may process different types of images and provide different corresponding functions, which can be configured in advance according to actual requirements and is also not limited in the embodiments of the present application. Optionally, the terminal device 10 runs a client of the application.
The server 20 is used for providing background services for clients of target applications in the terminal device 10. For example, the server 20 may be an independent physical server, a server cluster or a distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, a cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform, but is not limited thereto.
The server 20 has at least data receiving and processing capabilities, so that the terminal device 10 and the server 20 can communicate with each other via a network, which may be wired or wireless. The server 20 receives the image to be processed sent by the terminal device 10 and processes it to obtain the position information of at least one key point corresponding to the target object in the image. Optionally, after obtaining the position information of the key points, the server may also analyze and identify the target object in the image to be processed through the key points and feed the identification result back to the terminal device 10.
In one example, the key point extraction process is performed by a key point extraction model. The key point extraction model runs on a computer device; that is, for the method provided by the present application, the execution subject of each step may be a computer device, which can be any electronic device with data storage and processing capabilities. For example, the computer device may be the server 20 in FIG. 1, the terminal device 10 in FIG. 1, or another device other than the terminal device 10 and the server 20.
In one embodiment, the terminal device 10 is the computer device. The terminal device 10 performs key point identification on the image to be processed and determines a Region of Interest (ROI) according to the identified key points. The terminal device 10 crops the image to be processed according to the region of interest to obtain an image block within it, and sends the image block to the server 20. The server 20 receives the image block sent by the terminal device 10 and verifies the identity of the user according to the information (e.g. palm print information) contained in the image block.
In other embodiments, the computer device is a device other than the terminal device 10 and the server 20. The terminal device 10 sends the image to be processed to the server 20, and the server 20 instructs the computer device to locate the key points in the image to be processed, and identifies the gesture of the target object (such as a human body) in the image to be processed according to the located key points.
Please refer to FIG. 2, which illustrates a schematic diagram of an application scenario of the key point extraction method according to an embodiment of the present application.
With society's increasing demand for privacy, palm print recognition has broad application prospects in practical scenarios such as payment and identity verification. The human body posture estimation technology provided by the present application can be applied to detecting human hands. By detecting the key points of the hand in a hand image acquired in real time, the palm area in the image to be processed can be located, so that the information of the palm area can be further verified.
As shown in FIG. 2, the palm recognition component 210 can be obtained by combining a palm key point extraction model, which applies the human posture estimation technology, with a palm detection model. This component can be used in the palm recognition process.
The terminal device obtains an image to be verified by taking a picture; in some embodiments, the image to be verified belongs to a dynamically changing sequence of image frames. Palm detection is performed on the image frame to be verified. After the palm detection passes, at least one key point associated with the palm in the current image frame is determined using a hand key point detection model. On the one hand, a region of interest can be extracted from the current image frame through the determined key points; the image in the region of interest (referred to as the identification photo) is then transferred to the background, which performs palm print identification on it. On the other hand, because determining the key points with the palm key point detection model takes a period of time, when a plurality of consecutive image frames need to be detected, the key points corresponding to the i-th image frame can serve as reference information for predicting the key points of the (i + k)-th image frame, where i and k are positive integers and i + k is less than or equal to the total number of video frames in the video.
In some embodiments, the computer device determines a detection frame in the ith image frame according to the key point corresponding to the ith image frame, and estimates a detection frame in the (i + k) th image frame according to the detection frame of the ith image frame, so that the key point in the ith image frame can provide a reference for the key point in the (i + k) th image frame.
In the process of palm print recognition, the background compares the user's identification photo with at least one photo in the registry to determine the identity information of the user. If no photo in the registry has the same features as the identification photo, the palm verification fails. If a photo matching the identification photo exists in the registry, the identity of the user can be further determined according to the identity information corresponding to that photo.
Please refer to FIG. 3, which illustrates a schematic diagram of an application scenario of a key point extraction method according to another embodiment of the present application.
Besides determining a region of interest, the extracted key points can also be used for human body posture estimation, which covers the recognition of the motion, gestures, gait and other aspects of the target object in an image. For example, fall conditions and disease signals of the target object can be judged through gait recognition in order to monitor its health. Motion or gesture recognition based on the extracted key points can be applied to automatic teaching of fitness, sports, dancing and the like: gesture recognition assists in detecting whether the movements of the target object are correct, helping the target object learn and exercise.
In FIG. 3, the terminal device obtains an image displaying a target object by taking a picture, performs target detection on the image, and locates the human body or hand region of the target object with a detection model. The key points of the target object are then located through human body posture estimation. Subsequently, motion recognition, gesture recognition, gait recognition and the like can be performed on the image with reference to the located key points.
Please refer to FIG. 4, which illustrates a flowchart of a key point extraction method according to an embodiment of the present application. The execution subject of each step of the method may be the terminal device 10 in the implementation environment shown in FIG. 1 (for example, the client of the target application), or the server 20 in that environment. In the following method embodiments, for convenience of description, the execution subject of each step is simply referred to as the "computer device". The method may include at least one of the following steps (410-450):
step 410, acquiring an image to be processed, wherein a target object is displayed in the image to be processed.
In some embodiments, the target object refers to an object whose posture needs to be recognized. For example, the target object may be a movable individual such as a human or an animal, and it may be a complete individual or a partial limb of an individual, such as a person's palm region. The type of the target object is determined by the actual application scenario, and the present application is not limited in this respect.
In some embodiments, the image to be processed refers to an image in which the target object is displayed. The image to be processed may be an RGB (Red Green Blue) color image, which contains the position information and color information of each pixel. Optionally, the image to be processed includes three color channels, and the numerical value of a pixel in each color channel is that pixel's value in the channel.
The image to be processed may also be a depth image, which contains the position information and depth information of each pixel. In this case, the image to be processed may be obtained by processing a color image with a machine learning model (e.g., a pix2pix model).
Optionally, the information of the pixels in the image to be processed can be represented with coordinates. When the image to be processed is a two-dimensional image (such as an RGB color image), a pixel can be represented with two-dimensional coordinates; when the image to be processed is a depth image (containing two-dimensional information and depth information), a pixel can also be represented with three-dimensional coordinates, which is not limited in the embodiments of the present application.
In some embodiments, the computer device obtains the image to be processed through a capture component. For example, when palm print recognition is required, the computer device captures an image of the palm to be recognized through its camera component as the image to be processed. In other embodiments, the computer device receives the image from another device and processes it; for example, the computer device receives a video sent by another device and extracts at least one video frame from the video, each serving as an image to be processed.
FIG. 5 is a schematic diagram of an image to be processed according to an exemplary embodiment of the present application. In FIG. 5, the target object 511 is located to the right of the image center, and the image can be cropped through the region of interest 515 to obtain the image to be processed within the region of interest 515.
FIG. 6 is a schematic diagram of an image to be processed according to another exemplary embodiment of the present application. The target object 610 in the image to be processed of FIG. 6 is a hand.
Step 420, extracting the features of the image to be processed to obtain a feature map of the image to be processed; wherein the feature map is used for characterizing feature information of at least one key point related to the target object.
In some embodiments, the feature map of the image to be processed has more than one channel, where different channels focus on different features in the image to be processed. In some embodiments, the feature maps of different channels have the same size, i.e. the Height and Width of the feature map are the same in every channel.
After obtaining the image to be processed, the computer device performs feature extraction on it to obtain its feature map. The feature map can characterize feature information related to the target object; e.g., the feature map characterizes information related to at least one key point of the target object.
In some embodiments, the keypoints of the target object can reflect the contour or action of the target object. The keypoint may correspond to a location in the target object entity. For example, the keypoints belong to skeletal keypoints, used to characterize the joints of the target object. By connecting the plurality of key points according to a certain sequence, the limb connection sequence of the target object can be obtained, and the posture of the target object can be estimated. For another example, the key point may be an identification point provided in the target object. The key points may represent the eyebrow position points, the canthus position points or the mouth angle position points corresponding to the target object.
For another example, the key points may be edge contour points of the target object, and the contour of the target object may be traced through the key points. The correspondence between the key point and the target object entity (for example, the key point and the joint in the target object entity) may be determined according to actual needs, and the present application is not limited herein.
In some embodiments, the computer device performs feature extraction on the image to be processed through a backbone network to obtain the corresponding feature map: the backbone network takes the image to be processed as input and outputs the feature map of the image.
In some embodiments, the backbone network is a Convolutional Neural Network (CNN), including but not limited to at least one of: AlexNet, VGGNet (Visual Geometry Group Network), GoogLeNet, ResNet and the like. Different backbone networks generate feature maps with different accuracies, and an appropriate backbone network can be selected to extract the feature map of the image to be processed according to the requirements on accuracy and speed.
Different application scenarios may use different backbone networks; for example, backbone network 1 is used in an action/gesture recognition scenario, while backbone network 2 is used in a palm print recognition scenario. Configuring different backbone networks for different application scenarios improves the accuracy of the extracted key points.
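As a sketch of this step (the patent does not prescribe a particular backbone; the ResNet-18, the input size and the resulting shapes below are assumptions for illustration):
    import torch
    import torchvision.models as models

    # A ResNet-18 truncated before its global pooling and classification
    # layers, so it outputs a spatial feature map instead of a class vector.
    resnet = models.resnet18(weights=None)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    image = torch.randn(1, 3, 256, 256)  # image to be processed, (N, C, H, W)
    feature_map = backbone(image)        # (1, 512, 8, 8): 512 channels, 8 x 8 spatial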
Step 430, processing the feature map to respectively obtain a transverse feature and a longitudinal feature; the transverse features are used for representing feature information of the image to be processed in the horizontal direction, and the longitudinal features are used for representing feature information of the image to be processed in the vertical direction.
In some embodiments, the feature map extracted from the image to be processed carries a large amount of data, and the feature information in it needs to be reduced in order to compress the representation of the image, reduce the parameters used in subsequent processing, and so on.
In some embodiments, processing the feature map according to a target direction means establishing, in that direction, long-range dependence between different areas of the feature map, which is equivalent to establishing the relevance of different areas of the image to be processed in the target direction. The target direction includes, but is not limited to, at least one of: the horizontal direction and the vertical direction.
In some embodiments, the computer device processes the feature map according to at least one target direction to obtain the target direction feature corresponding to each target direction; the target direction feature characterizes the feature information of the image to be processed in that direction. In some embodiments, the computer device processes the feature map along at least two different target directions to obtain the corresponding target direction features. Optionally, the two target directions are orthogonal to each other, e.g. the horizontal direction and the vertical direction.
For example, the computer device processes the feature map according to target direction 1 to obtain target direction feature 1, and according to target direction 2 to obtain target direction feature 2.
Because the features of the image to be processed are not exactly the same in different directions, compressing the feature map by direction helps adapt to the feature distribution of the feature map in different directions, improving the accuracy of the located key points. Moreover, when the feature map is compressed along a target direction, part of the spatial information (the spatial distribution of the feature values) is kept in the resulting target direction feature, which provides more reference information for the subsequent determination of key point positions, improves the accuracy of the determined position information, and further improves the accuracy of posture recognition.
In some embodiments, the target direction feature has the same number of channels as the feature map. For example, if the feature map has 50 channels, the target direction feature also has 50 channels.
In one example, the computer device processes the feature map according to two different target directions: the horizontal direction and the vertical direction. Processing the feature map in the horizontal direction yields the transverse feature, and processing it in the vertical direction yields the longitudinal feature. Details of this process are given in the embodiments below.
It should be noted that the present application does not limit the order of acquiring the lateral features and the longitudinal features. The computer device can acquire the transverse features and the longitudinal features in parallel according to the feature map, or acquire the transverse features before acquiring the longitudinal features, or acquire the longitudinal features before acquiring the transverse features.
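A minimal sketch of this step, assuming the simplest configuration described later in this document (division step 1 and mean pooling; per that description, each row is one strip for the horizontal target direction and each column is one strip for the vertical target direction):
    import torch

    def directional_features(feature_map: torch.Tensor):
        """feature_map: (N, C, H, W). Pooling every row gives the transverse
        feature and pooling every column gives the longitudinal feature, so
        part of the spatial layout of the feature map is preserved."""
        transverse = feature_map.mean(dim=3)    # pool each row    -> (N, C, H)
        longitudinal = feature_map.mean(dim=2)  # pool each column -> (N, C, W)
        return transverse, longitudinal

    # The two features can be computed in parallel or in either order.
    transverse, longitudinal = directional_features(torch.randn(1, 512, 8, 8))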
And step 440, determining horizontal position information respectively corresponding to at least one key point in the image to be processed according to the transverse features.
In some embodiments, the computer device determines the horizontal position information of the at least one key point in the image to be processed according to the transverse feature.
In some embodiments, the horizontal position information characterizes the position of a key point in the horizontal direction (i.e. the x-axis direction) of the image to be processed. The horizontal position information may be the directed distance of the key point relative to an anchor point; for example, if the anchor point is the center point of the image to be processed, the horizontal position information of a key point may be the directed distance from the key point to the center point, and the horizontal position of the key point can be located through the anchor point and the horizontal position information. The horizontal position information may also be the abscissa of the key point; for example, horizontal position information of 3 indicates that the abscissa of the key point in the image to be processed is 3.
In some embodiments, the computer device calculates the horizontal position information of the keypoints by means of regression.
In some embodiments, when the target object has more than one key point, the computer device may determine the horizontal position information of each key point from the transverse feature by regression. Optionally, the horizontal position information of the key points is arranged in an output order.
For example, suppose the output order is: shoulder key point, wrist key point. The computer device determines the horizontal position information of these 2 key points according to the transverse feature, and then outputs a horizontal position sequence in that order, namely [horizontal position information of the shoulder key point, horizontal position information of the wrist key point].
Optionally, the output order is preconfigured. For example, the output order is determined according to the distribution positions of the key points in the target object. As another example, the output order may be randomly generated.
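One possible shape of such a regression output, sketched with a linear layer as the prediction head (the 21 key points, the feature sizes and the flattening scheme are assumptions, not fixed by the patent):
    import torch
    import torch.nn as nn

    NUM_KEYPOINTS = 21  # assumed, e.g. one entry per hand skeleton key point

    class HorizontalHead(nn.Module):
        """Regresses one horizontal coordinate per key point from the
        transverse feature, in a fixed, preconfigured output order."""
        def __init__(self, channels: int, length: int):
            super().__init__()
            self.fc = nn.Linear(channels * length, NUM_KEYPOINTS)

        def forward(self, transverse: torch.Tensor) -> torch.Tensor:
            # transverse: (N, C, L) -> flattened -> (N, K) horizontal positions
            return self.fc(transverse.flatten(start_dim=1))

    head = HorizontalHead(channels=512, length=8)
    x_sequence = head(torch.randn(1, 512, 8))  # (1, 21): horizontal position sequence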
And step 450, determining the vertical position information respectively corresponding to at least one key point in the image to be processed according to the longitudinal features.
In some embodiments, the vertical position information characterizes the position of a key point in the vertical direction (i.e. the y-axis direction) of the image to be processed. The vertical position information may be the directed distance of the key point relative to an anchor point; for example, if the anchor point is the center point of the image to be processed, the vertical position information of a key point may be the directed distance from the key point to the center point, and the vertical position of the key point can be located through the anchor point and the vertical position information. The vertical position information may also be the ordinate of the key point; for example, vertical position information of 0 indicates that the ordinate of the key point in the image to be processed is 0.
In some embodiments, the computer device determines the vertical position information of the key points from the longitudinal feature by regression. When the target object has more than one key point, the computer device may determine the vertical position information of each key point from the longitudinal feature by regression. Optionally, the vertical position information of the key points is arranged in the output order.
In some embodiments, the horizontal position information and the vertical position information of the plurality of key points are output in the same order, and the computer device may determine the complete position information of any one key point from the horizontal position sequence and the vertical position sequence.
For example, when the horizontal and vertical position information are expressed as coordinates, assume the horizontal position sequence is [30, 12, 47] and the vertical position sequence is [46, 15, 6]. Then the coordinates of key point 1 in the image to be processed are (30, 46), the coordinates of key point 2 are (12, 15), and the coordinates of key point 3 are (47, 6).
In some embodiments, a mapping exists between the key points and the target object entity: different key points correspond to different positions on the target object (for example, its joints). The position sequences output by the computer device indicate the position information of each key point in a fixed order, so the key points can be connected according to the mapping relation between the key points and the target object entity to obtain a skeleton action image of the target object, from which the posture of the target object can be further identified and analyzed.
That is, with this method, the computer device determines the position information of the different key points in each target direction in a fixed order from the target direction features, and can directly connect the key points through the associations between them, without having to establish the connection relations of the key points after determining their position information.
It should be noted that, the present application does not limit the execution sequence of step 440 and step 450, and step 440 and step 450 may be executed in parallel or executed sequentially. For example, step 440 is performed first and then step 450 is performed, but step 450 may be performed first and then step 440 is performed.
In summary, in the process of locating the key points, the position information of the key points is decoupled, and the horizontal and vertical position information are determined separately by regression, which helps improve the accuracy of the located key points. On the one hand, whereas the regression algorithms of the related art do not retain the spatial information in the feature map, the transverse and longitudinal features obtained from the feature map in this method retain part of that spatial information, providing more information for locating the key points. On the other hand, determining the transverse and longitudinal features separately helps prevent spatial information in different directions from interfering with each other, so that the horizontal position information extracted from the transverse feature and the vertical position information extracted from the longitudinal feature are more accurate.
In addition, when the target object has a plurality of key points, the horizontal position information of all of them can be determined from a single transverse feature and the vertical position information from a single longitudinal feature. Compared with determining the position information of one key point per heatmap in turn, this reduces the amount of computation needed to locate key points in the image to be processed and speeds up the locating.
The process of acquiring the transverse and longitudinal features is described in the embodiments below.
In some embodiments, processing the feature map to obtain the transverse feature and the longitudinal feature includes the following. For the target direction feature among the transverse feature and the longitudinal feature, the computer device divides the feature map according to the direction corresponding to that target direction feature to obtain a plurality of feature strips, where a feature strip contains at least two feature values aligned along the target direction. For each feature strip, the computer device pools the strip to obtain a corresponding pooling result, and then arranges the pooling results of the strips according to the order of the strips in the feature map to obtain the target direction feature.
The target direction feature may be either the transverse feature or the longitudinal feature, as determined by actual need: when the computer device needs to determine the transverse feature, the target direction feature is the transverse feature; when it needs to determine the longitudinal feature, the target direction feature is the longitudinal feature.
The direction corresponding to the target direction feature indicates the direction in which the feature map is processed, and is referred to simply as the target direction.
The feature strip is part of a feature map. In the case where the feature map has a plurality of channels, the number of channels of one feature band is 1. That is, in the process of dividing the feature map, each channel of the feature map needs to be divided, and one channel of the feature map corresponds to a plurality of feature strips.
In some embodiments, the feature strip includes a plurality of feature values from the feature map. Optionally, the positional relationship of the feature values within the strip is the same as their relative positions in the feature map.
In some embodiments, the number of feature values along the target direction in a feature strip is greater than the number perpendicular to the target direction. For example, if the target direction is the horizontal direction, the number of feature values in the horizontal direction of the strip is greater than the number in the vertical direction; that is, the feature strip may be an i × j matrix, where i and j are positive integers and i is greater than j.
In some embodiments, the number of feature values included in each feature strip of one channel of the feature map is the same. In some embodiments, dividing the feature map according to the target direction to obtain a plurality of feature strips includes: for any channel of the feature map, the computer device divides the feature matrix corresponding to the channel along the target direction with a division step, obtaining a plurality of feature strips.
In some embodiments, there are no overlapping feature values in adjacent feature strips. In some embodiments, the division step size is less than half of the smallest dimension. The smallest dimension refers to the minimum of the height and width of the feature map.
In some embodiments, the division step size is equal to 1, i.e. for the feature matrix of any one channel, the feature values in any row (column) in the feature matrix belong to the same feature strip. Assuming that the target direction is the horizontal direction, any one row in the feature matrix is divided into one feature strip.
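The division itself can be sketched with torch.Tensor.unfold; the strip thickness and step of 1 below match the special case just described, and the shapes are assumptions:
    import torch

    feature_map = torch.randn(512, 8, 8)  # (C, H, W) for one image

    # Divide along the height axis with strip thickness 1 and step 1: each of
    # the 8 windows is a full row of width 8, i.e. an elongated strip along
    # the horizontal target direction, with no overlap between strips.
    strips = feature_map.unfold(dimension=1, size=1, step=1)  # (512, 8, 8, 1)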
After the feature strips are obtained by division, the computer device pools each strip to obtain the corresponding pooling results.
For any channel of the feature map, after obtaining the pooling results, the computer device arranges them according to the order of the feature strips to obtain the target direction feature of the feature map in the target direction.
In some embodiments, pooling a feature strip to obtain its pooling result includes one of the following: the computer device determines the maximum value in the feature strip and takes it as the pooling result; or the computer device determines the average of the feature values in the strip and takes it as the pooling result; or the computer device determines both the average and the maximum of the feature values in the strip and determines the pooling result from them.
In some embodiments, to avoid introducing processing errors, the computer device processes all feature strips belonging to the same feature map with the same pooling method.
In one example, the computer device determines the maximum feature value in the feature strip as its pooling result. For example, if feature strip A is [1, 3, 5, 7, 13], the computer device determines that the pooling result of strip A is 13. This can be implemented with a Max() function.
In another example, the computer device computes the average of the feature values in the feature strip and takes it as the strip's pooling result. For example, if feature strip B is [12, 8, 6, 4], the computer device determines that the pooling result of strip B is 7.5. This can be implemented with a Mean() function.
In another example, the computer device determines the average and the maximum of the feature values in the feature strip, and takes their sum as the strip's pooling result. For example, if feature strip C is [2, 0, 2, 8, 4], its average is 3.2 and its maximum is 8, so the computer device determines that the pooling result of strip C is 11.2. This can be implemented as Max() + Mean().
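The three pooling options, written out as a sketch (the function and mode names are hypothetical; the example strips are the ones from the text above):
    import torch

    def pool_strip(strip: torch.Tensor, mode: str = "mean") -> torch.Tensor:
        """Pools one feature strip into a single value."""
        if mode == "max":        # maximum feature value
            return strip.max()
        if mode == "mean":       # average feature value
            return strip.mean()
        if mode == "max+mean":   # sum of the maximum and the average
            return strip.max() + strip.mean()
        raise ValueError(f"unknown pooling mode: {mode}")

    print(pool_strip(torch.tensor([1., 3., 5., 7., 13.]), "max"))      # tensor(13.)
    print(pool_strip(torch.tensor([12., 8., 6., 4.]), "mean"))         # tensor(7.5000)
    print(pool_strip(torch.tensor([2., 0., 2., 8., 4.]), "max+mean"))  # tensor(11.2000)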
If the feature map were pooled globally and directly, all of its spatial information would be lost. With the above method, the feature strips are divided along the coordinate-axis directions (horizontal and vertical) and pooled separately, so that not all spatial information in the feature map is lost: part of it is still retained in the transverse and longitudinal features. This also helps prevent spatially unrelated information from interfering with each other.
Moreover, the elongated shape of a feature strip means that pooling it can establish long-range dependence between discretely distributed regions of the image to be recognized, while its narrowness along the other dimension helps capture local details in that dimension, thereby improving the accuracy of the key points located in the image to be processed.
In order to further improve the accuracy of locating the obtained key points, the following operations may also be performed before determining the position information of the key points.
In some embodiments, after the computer device processes the feature map to obtain the transverse feature and the longitudinal feature, the method further includes: for a target direction feature among the transverse feature and the longitudinal feature, the computer device extracts the effective information in the target direction feature to obtain a target effective feature, the number of channels of the target effective feature being smaller than that of the target direction feature; the computer device performs expansion processing on the target effective feature to obtain a target expansion feature, where expansion processing refers to merging at least two dimensions of the target effective feature; and the computer device performs feature refinement on the target expansion feature to obtain a refined target direction feature. The refined target direction feature is used in place of the target direction feature to determine the horizontal or vertical position information, and feature refinement refers to determining the correlations among the internal components of the target expansion feature.
The target direction feature may be the transverse feature or the longitudinal feature: when the computer device needs to spatially correlate the transverse feature, the target direction feature is the transverse feature; when it needs to spatially correlate the longitudinal feature, the target direction feature is the longitudinal feature.
In some embodiments, extracting the effective information in the target direction feature means reducing its dimensionality, i.e., reducing the data volume of the target direction feature. In some embodiments, the computer device applies a convolution to the target direction feature to obtain the target effective feature. For example, the computer device convolves the target direction feature with at least one 1 × 1 convolution kernel, compressing the number of channels without changing the height and width of the target direction feature, to obtain the target effective feature.
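As an illustration of this channel-compression step, the following PyTorch sketch applies a 1 × 1 convolution to a transverse feature; all shapes and the layer name squeeze are assumptions made for the example.

    import torch
    import torch.nn as nn

    C, C1, W = 256, 64, 48                     # assumed channel counts and width
    squeeze = nn.Conv2d(C, C1, kernel_size=1)  # 1 x 1 kernel: height and width unchanged

    F_x = torch.randn(1, C, 1, W)              # a transverse feature (N, C, 1, W)
    V_x = squeeze(F_x)                         # target effective feature (N, C1, 1, W)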
In some embodiments, the expansion processing of the target effective feature means reducing its dimensionality without changing its data volume. Specifically, the computer device expands multiple dimensions of the target effective feature into one dimension; it may expand along the channel direction. Optionally, after the expansion processing, the number of channels of the target expansion feature equals 1. In some embodiments, the expansion processing merges the width dimension and the height dimension of the target effective feature into one dimension.
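A minimal sketch of this expansion (flatten) step, again with assumed shapes: the flatten merges dimensions without changing the number of values.

    import torch

    V_x = torch.randn(1, 64, 1, 48)            # effective feature (N, C1, 1, W)
    v_x = V_x.flatten(start_dim=1)             # merge C1, height and width into one dimension
    assert v_x.shape == (1, 64 * 1 * 48)       # data volume unchanged, dimensionality reduced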
In some embodiments, the computer device performs feature refinement on the target expansion features to obtain refined target direction features, and determines horizontal position information or vertical position information of the key points using the refined target direction features.
In some embodiments, feature refinement is used to improve the spatial correlation between at least two feature values of the target expansion feature, which may be accomplished by an attention mechanism.
In some embodiments, the computer device performs feature refinement on the target expansion feature to obtain the refined target direction feature as follows: the computer device divides the target expansion feature into a plurality of feature blocks, where two adjacent feature blocks do not overlap; it determines the combination information corresponding to each feature block, the combination information being obtained from the feature block and the position information corresponding to that block; it performs self-attention processing on the plurality of combination information to obtain an intermediate matrix, the self-attention processing capturing the associations among the combination information; it expands the intermediate matrix to obtain an expanded intermediate matrix; and it applies at least one full-connection pass to the expanded intermediate matrix to obtain the refined target direction feature. Each feature block contains at least one feature value; when a feature block contains several feature values, these values occupy consecutive positions in the target expansion feature. In one example, a feature block contains a single feature value, i.e., the computer device treats each feature value of the target expansion feature as one feature block.
Subsequently, the computer device determines the position information corresponding to each of the feature blocks. The position information may be referred to as a position embedding. The encoding method of the position information may be chosen according to actual needs and is not limited in this application.
The computer device obtains the combination information from the feature block and its corresponding position information. In some embodiments, the computer device adds the feature block and the position information corresponding to the feature block to obtain the combination information for that block. For example, if each feature block contains x feature values (x a positive integer), the computer device applies the position information to each of the x values to obtain the combination information: for a feature value a, it adds the position information to a to obtain a new feature value a.
In some embodiments, a feature block contains one feature value of the target expansion feature; the computer device determines the position information corresponding to each feature block and adds the feature value in the block to the corresponding position information to obtain the combination information for that block. In this case, the computer device may derive the position information of a block from the position of its feature value within the target expansion feature. For example, if the index of a feature value within the target expansion feature is 10, the position information of the corresponding feature block is 10. In some embodiments, performing self-attention processing on the plurality of combination information to obtain the intermediate matrix includes: for a target combination information among the plurality, determining association scores between the target combination information and the other combination information; and determining the intermediate matrix from the target combination information, the association scores, and the other combination information. Here, the other combination information means any combination information other than the target one. Optionally, this process is accomplished with a multi-head attention mechanism.
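The combination-plus-self-attention step might look like the following PyTorch sketch; the block count, embedding size, head count, and the learned position embedding are illustrative choices, since the patent leaves the encoding of the position information open.

    import torch
    import torch.nn as nn

    L, D = 48, 64                              # L feature blocks, D values per block (assumed)
    blocks = torch.randn(1, L, D)              # feature blocks cut from the expansion feature
    pos = torch.zeros(1, L, D)                 # position embedding (learned in practice)

    combined = blocks + pos                    # combination information per feature block
    attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
    intermediate, scores = attn(combined, combined, combined)  # intermediate matrix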
The intermediate matrix includes a transverse intermediate matrix and a longitudinal intermediate matrix: when the target direction is the horizontal direction, the intermediate matrix is the transverse intermediate matrix; when the target direction is the vertical direction, it is the longitudinal intermediate matrix.
In some embodiments, expanding the intermediate matrix to obtain the expanded intermediate matrix includes expanding the channel dimension of the intermediate matrix. For example, if the intermediate matrix A has 5 channels of 10 feature values each, then after expansion the expanded intermediate matrix has 1 channel containing 50 feature values.
By this method, the refined target direction feature attends to the keypoint-related information within the target direction feature, and using the refined target direction feature to determine the position information of the keypoints improves the accuracy of the determined position information.
In some embodiments, the keypoints are extracted by a keypoint extraction model that includes a feature extraction network, a direction pooling layer, and a decoupling regression network: the feature extraction network is used for extracting features of the image to be processed to obtain a feature map of the image to be processed; the direction pooling layer is used for processing the feature map to obtain a transverse feature and a longitudinal feature respectively, where the transverse and longitudinal features are processed by different branches in the direction pooling layer; the decoupling regression network is used for determining, according to the transverse feature, the horizontal position information corresponding to the at least one keypoint in the image to be processed, and determining, according to the longitudinal feature, the vertical position information corresponding to the at least one keypoint; the horizontal and vertical position information are obtained through different branches in the decoupling regression network.
In some embodiments, the keypoint extraction model further includes a feature refining network, which performs feature refinement on the target direction feature among the transverse feature and the longitudinal feature to obtain the refined target direction feature. Optionally, the feature refining network is designed after the Vision Transformer (ViT).
Fig. 7 is a schematic diagram of a keypoint extraction method according to an exemplary embodiment of the present application.
As can be seen in fig. 7, the keypoint extraction network includes a feature extraction network, which may be built on a convolutional neural network; it takes the image to be processed as input and outputs a feature map. The feature map is input into the horizontal pooling branch of the direction pooling network to obtain the transverse feature, and into the vertical pooling branch to obtain the longitudinal feature. Optionally, the parameters of the horizontal and vertical pooling branches are not identical, so as to adapt to the feature distributions along different directions of the feature map, which helps improve the accuracy of the located keypoints. The transverse feature is then input into the horizontal refining branch of the feature refining network, which outputs the refined transverse feature. Optionally, the horizontal refining branch includes a 1 × 1 convolution layer, an unfolding layer, a self-attention layer, a feedforward layer, and the like.
The refined transverse feature is input into the transverse regression branch (containing at least one fully connected layer) of the decoupling regression network to obtain the horizontal position information (e.g., an abscissa) of at least one keypoint output by that branch; the horizontal position information of the keypoints is arranged in the output in a preset order.
The longitudinal feature is input into the vertical refining branch of the feature refining network, which outputs the refined longitudinal feature. The refined longitudinal feature is input into the longitudinal regression branch (containing at least one fully connected layer) of the decoupling regression network to obtain the vertical position information (e.g., an ordinate) of at least one keypoint output by that branch; the vertical position information of the keypoints is arranged in the output in a preset order. Optionally, the parameters of the horizontal and vertical refining branches may differ, and the parameters of the transverse and longitudinal regression branches may likewise differ.
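Putting the pieces of fig. 7 together, a simplified sketch of such a model could look as follows; the class name XYPose, the single-attention-layer refining branches, and all layer sizes are our assumptions, and only the overall branch structure follows the description above.

    import torch
    import torch.nn as nn

    class XYPose(nn.Module):
        def __init__(self, backbone: nn.Module, c: int, c1: int,
                     h: int, w: int, num_keypoints: int):
            super().__init__()
            self.backbone = backbone                      # feature extraction network
            self.squeeze_x = nn.Conv2d(c, c1, 1)          # 1x1 conv of the horizontal refining branch
            self.squeeze_y = nn.Conv2d(c, c1, 1)          # 1x1 conv of the vertical refining branch
            self.attn_x = nn.MultiheadAttention(c1, 4, batch_first=True)
            self.attn_y = nn.MultiheadAttention(c1, 4, batch_first=True)
            self.fc_x = nn.Linear(c1 * w, num_keypoints)  # transverse regression branch
            self.fc_y = nn.Linear(c1 * h, num_keypoints)  # longitudinal regression branch

        def forward(self, image: torch.Tensor):
            f = self.backbone(image)                      # feature map (N, C, H, W)
            f_x = f.amax(dim=2, keepdim=True)             # horizontal pooling branch: (N, C, 1, W)
            f_y = f.amax(dim=3, keepdim=True)             # vertical pooling branch:   (N, C, H, 1)
            v_x = self.squeeze_x(f_x).flatten(2).transpose(1, 2)  # (N, W, C1)
            v_y = self.squeeze_y(f_y).flatten(2).transpose(1, 2)  # (N, H, C1)
            v_x, _ = self.attn_x(v_x, v_x, v_x)           # refined transverse feature
            v_y, _ = self.attn_y(v_y, v_y, v_y)           # refined longitudinal feature
            x = self.fc_x(v_x.flatten(1))                 # abscissas of the keypoints
            y = self.fc_y(v_y.flatten(1))                 # ordinates of the keypoints
            return x, y

    # e.g. XYPose(some_cnn, c=256, c1=64, h=8, w=8, num_keypoints=21) regresses
    # 21 abscissas and 21 ordinates through two independent branches.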
In some embodiments, the keypoint extraction process may include several steps:
1. The computer device determines the feature map of the image to be processed using a backbone network, which may be any backbone capable of feature extraction, for example a CNN backbone. The computer device takes the image to be processed I as the input of the CNN backbone f(θ) and obtains the feature map F corresponding to I through f(θ):
F=f(θ,I)
where the feature map F ∈ R^(C×H×W), C is the number of channels of the feature map, H is its height, and W is its width.
2. On the basis of the obtained feature map F ∈ R^(C×H×W), the computer device pools the feature map along the target direction to obtain the target direction feature, using, for example, a Max(·) operation, a Mean(·) operation, or a combination of Max(·) and Mean(·). Taking the Max(·) operation as an example, the strip-pooling computation can be expressed as follows:
F_x(c, 1, w) = Max_{h ∈ [1, H]} F(c, h, w)
F_y(c, h, 1) = Max_{w ∈ [1, W]} F(c, h, w)
where F_x denotes the transverse feature, F_y denotes the longitudinal feature, F_x ∈ R^(C×1×W), and F_y ∈ R^(C×H×1).
In this step, the feature information in the feature map is condensed along the horizontal direction (the row direction of the feature matrix) or the vertical direction (the column direction of the feature matrix), and part of the spatial information of the feature map is still retained in the resulting transverse and longitudinal features, which helps improve the accuracy of the keypoint position information located in the image to be processed.
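In code, the strip pooling of step 2 reduces to a max (or mean) over one spatial axis; the shapes below mirror the definitions above and are otherwise illustrative.

    import torch

    F = torch.randn(256, 32, 32)              # feature map (C, H, W)
    F_x = F.amax(dim=1, keepdim=True)         # Max(.) over H: transverse feature (C, 1, W)
    F_y = F.amax(dim=2, keepdim=True)         # Max(.) over W: longitudinal feature (C, H, 1)
    # The Mean(.) and Max(.)+Mean(.) variants follow the same pattern:
    F_x_mean = F.mean(dim=1, keepdim=True)
    F_x_both = F_x + F_x_mean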
3. The computer device further refines the target direction features using a convolution layer and a Transformer layer. After obtaining the transverse feature F_x and the longitudinal feature F_y, the computer device passes them through a convolution layer f_1×1(·|θ), which has at least one 1 × 1 convolution kernel, to extract the useful information in the transverse and longitudinal features:
V_x = f_1×1(F_x | θ_x)
V_y = f_1×1(F_y | θ_y)

where V_x is the transverse effective feature, V_y is the longitudinal effective feature, V_x ∈ R^(C1×1×W), V_y ∈ R^(C1×H×1), and C1 is the number of channels of the transverse (or longitudinal) effective feature.
Subsequently, the computer device performs expansion processing on the transverse effective feature V_x and the longitudinal effective feature V_y to obtain the transverse expansion feature v′_x and the longitudinal expansion feature v′_y:

v′_x = Flatten(V_x)
v′_y = Flatten(V_y)

where Flatten(·) is the expansion function.
To better refine the effective information in the features, the computer device uses two Transformer networks T_x and T_y, designed after the Vision Transformer (ViT), to process the transverse expansion feature v′_x and the longitudinal expansion feature v′_y respectively, obtaining the transverse intermediate matrix v″_x and the longitudinal intermediate matrix v″_y. Optionally, T_x and T_y contain an attention module, so that the processed features can focus on the feature-map responses corresponding to the keypoints.
v″_x = T_x(v′_x)
v″_y = T_y(v′_y)

where v″_x and v″_y denote the transverse and longitudinal intermediate matrices, respectively.
Subsequently, the computer device performs expansion processing on the transverse intermediate matrix v″_x and the longitudinal intermediate matrix v″_y to obtain the refined transverse feature v‴_x and the refined longitudinal feature v‴_y:

v‴_x = Flatten(v″_x)
v‴_y = Flatten(v″_y)

4. The computer device decouples the coordinate regression over the refined transverse feature v‴_x and the refined longitudinal feature v‴_y. After obtaining these refined features, which retain spatial information, the computer device passes v‴_x into the fully connected layer FC_x(·) to obtain the horizontal position information of at least one keypoint, and passes v‴_y into the fully connected layer FC_y(·) to obtain the vertical position information of at least one keypoint. Optionally, the horizontal and vertical position information are represented as coordinates:

x = FC_x(v‴_x)
y = FC_y(v‴_y)

where x denotes the abscissa of a keypoint in the image to be processed and y denotes its ordinate.
In some embodiments, the method of keypoint extraction further comprises: the computer equipment determines a target area in the image to be processed according to at least one key point; wherein the target region comprises at least one keypoint; the computer equipment carries out recognition analysis on the target object in the target area, and the recognition analysis comprises at least one of the following steps: palm print recognition, gesture recognition and expression recognition.
In some embodiments, determining the target region in the image to be processed according to the at least one keypoint includes: when the target object has a plurality of keypoints, the computer device determines one or more boundary points from them based on the position information; it then determines the position of the center point of the target region and the size of the target region from these boundary points; and it determines the target region in the image to be processed from the center-point position and the region size.
A boundary point is a keypoint lying at the edge of the distribution of keypoints. Optionally, the position information is represented by coordinates. In some embodiments, the abscissa of a boundary point is greater (or smaller) than the abscissas of the other keypoints, i.e., it is the maximum (or minimum) of the abscissas of the keypoints. In some embodiments, the ordinate of a boundary point is greater (or smaller) than the ordinates of the other keypoints, i.e., it is the maximum (or minimum) of the ordinates of the keypoints.
In some embodiments, the computer device takes as boundary points the keypoints with the maximum abscissa, the minimum abscissa, the maximum ordinate, and the minimum ordinate. It determines the abscissa of the center point from the maximum and minimum abscissas, and the ordinate of the center point from the maximum and minimum ordinates. Taking the abscissa of the center point as an example: it equals (maximum abscissa + minimum abscissa) × 0.5; the ordinate of the center point is determined in the same way.
In some embodiments, the computer device determines the length of the target region from the maximum and minimum abscissas: the length equals p times the difference between the maximum and minimum abscissas, where p is a positive number greater than 1.
In some embodiments, the computer device determines the height of the target region from the maximum and minimum ordinates: the height equals q times the difference between the maximum and minimum ordinates, where q is a positive number greater than 1.
In some embodiments, p and q may be equal or different. Their values can be set according to actual needs and are not limited in this application.
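The center-point and size computation described above can be sketched directly; the helper name target_region and the default p, q values are illustrative.

    def target_region(keypoints, p=1.1, q=1.1):
        # keypoints: iterable of (x, y) coordinate pairs; p, q > 1 enlarge the box
        xs = [x for x, _ in keypoints]
        ys = [y for _, y in keypoints]
        cx = (max(xs) + min(xs)) * 0.5        # abscissa of the center point
        cy = (max(ys) + min(ys)) * 0.5        # ordinate of the center point
        length = p * (max(xs) - min(xs))      # length of the target region
        height = q * (max(ys) - min(ys))      # height of the target region
        return cx, cy, length, height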
FIG. 8 is a schematic illustration of a target region provided by an exemplary embodiment of the present application. The target region is computed from keypoints 0, 1, 2, 5, 9, 13, and 17; 810 is the center point, the first target region 820 is obtained with p = q = 1, and the second target region 830 is obtained with p = q = 1.1.
Locating keypoints in the image to be processed with this method is fast and inexpensive, so the method can be applied to real-time recognition scenarios. Furthermore, when extracting keypoints from a sequence of continuously changing images to be processed, the keypoints of frame a can be used, via detection-box prediction, to predict the keypoints of frame a + b, further reducing the time consumed by keypoint prediction (a sketch of this reuse follows this paragraph).
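A hedged sketch of that frame-to-frame reuse, building on the target_region helper above; crop_to and the model interface are hypothetical placeholders, not part of the patent.

    def track(frames, model, p=1.2, q=1.2):
        region = None
        results = []
        for frame in frames:
            # crop_to is a hypothetical helper that crops the frame to the region
            crop = frame if region is None else crop_to(frame, region)
            keypoints = model(crop)                  # keypoint extraction on the crop
            # (crop coordinates would be mapped back to full-frame space here)
            region = target_region(keypoints, p, q)  # detection box for the next frame
            results.append(keypoints)
        return results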
Model effect verification

To verify the effect of the keypoint extraction method provided by this application (which may be referred to as XYPose), it is compared with DeepPose, a regression-based human pose estimation method in the related art. Specifically, the two methods extract feature maps with the same backbone network, keypoints are located in the images to be processed contained in public data sets with both methods, and the accuracy of the located keypoints is compared; Table 1 reports the results on the MSCOCO test-dev data set. Comparing the rows of Table 1, the performance of this method is clearly superior to that of DeepPose.
Table 1: Results on the MSCOCO test-dev data set

Method            AP    AP50  AP75  AP_M  AP_L
DeepPose          64.2  88.8  72.3  61.1  70.6
This application  70.6  90.2  78.3  67.3  77.3
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 9, a block diagram of a keypoint extraction apparatus according to an embodiment of the present application is shown. The apparatus implements the functions of the above method examples; these functions may be realized by hardware or by hardware executing the corresponding software. The apparatus may be the computer device described above or may be provided in a computer device. As shown in fig. 9, the apparatus 900 may include: an image acquisition module 910, a feature extraction module 920, a direction processing module 930, and a position determination module 940.
An image obtaining module 910, configured to obtain an image to be processed, where a target object is displayed in the image to be processed;
a feature extraction module 920, configured to perform feature extraction on the image to be processed to obtain a feature map of the image to be processed; wherein the feature map is used for characterizing feature information of at least one key point related to the target object;
a direction processing module 930, configured to process the feature map to obtain a horizontal feature and a vertical feature respectively; the transverse features are used for representing the feature information of the image to be processed in the horizontal direction, and the longitudinal features are used for representing the feature information of the image to be processed in the vertical direction;
a position determining module 940, configured to determine, according to the lateral features, horizontal position information corresponding to the at least one key point in the image to be processed, respectively; and determining the vertical position information respectively corresponding to the at least one key point in the image to be processed according to the longitudinal features.
In some embodiments, the direction processing module 930 comprises: the strip dividing unit is used for dividing the characteristic diagram according to the direction corresponding to the target direction characteristic to obtain a plurality of characteristic strips for the target direction characteristic in the transverse characteristic and the longitudinal characteristic; wherein the characteristic strip comprises at least two characteristic values with the same position in the target direction; the pooling processing unit is used for pooling the characteristic strips for each characteristic strip to obtain a pooling result corresponding to the characteristic strip; and the characteristic generating unit is used for arranging the pooling results corresponding to the plurality of characteristic strips according to the arrangement sequence of the plurality of characteristic strips in the characteristic diagram to obtain the target direction characteristics.
In some embodiments, the pooling processing unit is configured to determine a maximum value of the feature values in the feature strip, and determine the maximum value as a pooling result corresponding to the feature strip; or determining an average value of characteristic values included in the characteristic strip, and taking the average value as a pooling result corresponding to the characteristic strip; or determining an average value and a maximum value of the characteristic values included in the characteristic strip, and determining the pooling result of the characteristic strip according to the average value and the maximum value.
In some embodiments, the apparatus 900 further includes: a channel processing module (not shown in fig. 9) configured to, for a target direction feature among the transverse feature and the longitudinal feature, extract the effective information in the target direction feature to obtain a target effective feature, the number of channels of the target effective feature being smaller than that of the target direction feature; a feature expansion module (not shown in fig. 9) configured to perform expansion processing on the target effective feature to obtain a target expansion feature, where expansion processing refers to merging at least two dimensions of the target effective feature; and a feature refinement module (not shown in fig. 9) configured to perform feature refinement on the target expansion feature to obtain a refined target direction feature, where the refined target direction feature is used in place of the target direction feature to determine the horizontal or vertical position information, and feature refinement refers to determining the correlations among the internal components of the target expansion feature.
In some embodiments, the feature refinement module is configured to divide the target expansion feature into a plurality of feature blocks, where two adjacent feature blocks do not overlap; determine the combination information corresponding to each of the feature blocks, the combination information being obtained according to the feature block and the position information corresponding to the feature block; perform self-attention processing on the plurality of combination information to obtain an intermediate matrix, where the self-attention processing captures the associations among the combination information; expand the intermediate matrix to obtain an expanded intermediate matrix; and apply at least one full-connection pass to the expanded intermediate matrix to obtain the refined target direction feature.
In some embodiments, the keypoints are extracted by a keypoint extraction model that includes a feature extraction network, a direction pooling layer, and a decoupling regression network: the feature extraction network is used for extracting features of the image to be processed to obtain a feature map of the image to be processed; the direction pooling layer is used for processing the feature map to obtain the transverse feature and the longitudinal feature respectively, where the transverse and longitudinal features are processed by different branches in the direction pooling layer; the decoupling regression network is used for determining, according to the transverse feature, the horizontal position information respectively corresponding to the at least one keypoint in the image to be processed, and determining, according to the longitudinal feature, the vertical position information respectively corresponding to the at least one keypoint; the horizontal and vertical position information are obtained through different branches in the decoupling regression network.
In some embodiments, the apparatus 900 further comprises: a target identification module (not shown in fig. 9) for determining a target region in the image to be processed according to the at least one key point; wherein the target region includes the at least one keypoint; performing recognition analysis on the target object in the target area, the recognition analysis including at least one of: palm print recognition, gesture recognition and expression recognition.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 10, a block diagram of a computer device 1000 according to an embodiment of the present application is shown.
Generally, the computer device 1000 includes: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1002 is used to store a computer program configured to be executed by one or more processors to implement the above-described keypoint extraction method.
Those skilled in the art will appreciate that the architecture illustrated in FIG. 10 does not constitute a limitation of the computer device 1000, and may include more or fewer components than those illustrated, or some of the components may be combined, or a different arrangement of components may be employed.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above keypoint extraction method.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State drive), or optical disc. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory), among others.
In an exemplary embodiment, a computer program product is also provided that includes computer instructions stored in a computer readable storage medium. And a processor of the computer device reads the computer instruction from the computer readable storage medium, and executes the computer instruction, so that the terminal device executes the key point extraction method.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only show an exemplary possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the illustrated sequence, which is not limited in this application.
It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in this application are authorized by the user or fully authorized by various parties, and the collection, use and processing of the relevant data are subject to relevant laws and regulations and standards in relevant countries and regions. For example, the palm print information, the image to be processed, etc. referred to in this application are obtained under sufficient authorization.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for extracting key points, the method comprising:
acquiring an image to be processed, wherein a target object is displayed in the image to be processed;
performing feature extraction on the image to be processed to obtain a feature map of the image to be processed; wherein the feature map is used for characterizing feature information of at least one key point related to the target object;
processing the characteristic graph to respectively obtain a transverse characteristic and a longitudinal characteristic; the transverse features are used for representing the feature information of the image to be processed in the horizontal direction, and the longitudinal features are used for representing the feature information of the image to be processed in the vertical direction;
determining horizontal position information respectively corresponding to the at least one key point in the image to be processed according to the transverse features;
and determining the vertical position information respectively corresponding to the at least one key point in the image to be processed according to the longitudinal features.
2. The method of claim 1, wherein the processing the feature map to obtain the horizontal feature and the vertical feature respectively comprises:
for target direction features in the transverse features and the longitudinal features, dividing the feature graph according to the direction corresponding to the target direction features to obtain a plurality of feature strips; wherein the characteristic strip comprises at least two characteristic values with the same position in the target direction;
for each characteristic strip, performing pooling treatment on the characteristic strip to obtain a pooling result corresponding to the characteristic strip;
and arranging the pooling results corresponding to the plurality of characteristic strips according to the arrangement sequence of the plurality of characteristic strips in the characteristic diagram to obtain the target direction characteristics.
3. The method of claim 2, wherein the pooling the feature strips to obtain pooling results corresponding to the feature strips comprises:
determining the maximum value of the characteristic values included in the characteristic strip, and determining the maximum value as a pooling result corresponding to the characteristic strip; or,
determining an average value of characteristic values included in the characteristic strip, and taking the average value as a pooling result corresponding to the characteristic strip; or,
determining an average value and a maximum value of the characteristic values included in the characteristic strip, and determining a pooling result of the characteristic strip according to the average value and the maximum value.
4. The method of claim 1, wherein after processing the feature map to obtain the transverse features and the longitudinal features, respectively, further comprising:
for a target direction feature among the transverse features and the longitudinal features, extracting effective information in the target direction feature to obtain a target effective feature; wherein the number of channels of the target effective feature is smaller than that of the channels of the target direction feature;
unfolding the target effective characteristics to obtain target unfolding characteristics; wherein, the unfolding processing refers to merging at least two dimensions in the target effective characteristics;
carrying out characteristic refinement on the target expansion characteristic to obtain a refined target direction characteristic; wherein the refined target direction feature is used to replace the target direction feature to determine the horizontal position information or the vertical position information, and the feature refinement is to determine a correlation between internal components of the target expanded feature.
5. The method of claim 4, wherein the performing feature refinement on the target expansion feature to obtain the refined target direction feature comprises:
dividing the target expansion characteristics to obtain a plurality of characteristic blocks; wherein, the adjacent two feature blocks are not overlapped;
determining combination information corresponding to the plurality of feature blocks respectively; wherein the combination information is obtained according to the feature block and the position information corresponding to the feature block;
performing self-attention processing according to the plurality of combined information to obtain an intermediate matrix; wherein the self-attention processing is for processing an association between the plurality of combined information; unfolding the intermediate matrix to obtain an unfolded intermediate matrix;
and carrying out at least one time of full connection processing on the expanded intermediate matrix to obtain the refined target direction characteristic.
6. The method according to any one of claims 1 to 5, wherein the keypoints are extracted by means of a keypoint extraction model comprising a feature extraction network, a direction pooling layer, and a decoupling regression network:
the feature extraction network is used for extracting features of the image to be processed to obtain a feature map of the image to be processed;
the direction pooling layer is used for processing the characteristic graph to respectively obtain the transverse characteristic and the longitudinal characteristic; wherein the lateral features and the longitudinal features are processed by different branches in the directional pooling layer;
the decoupling regression network is used for determining horizontal position information of the at least one key point in the image to be processed according to the transverse features, and determining vertical position information of the at least one key point in the image to be processed according to the longitudinal features; wherein the horizontal position information and the vertical position information are obtained through different branches in the decoupling regression network.
7. The method according to any one of claims 1 to 5, further comprising:
determining a target area in the image to be processed according to the at least one key point; wherein the target region includes the at least one keypoint;
performing a recognition analysis on the target object in the target region, the recognition analysis including at least one of: palm print recognition, gesture recognition and expression recognition.
8. A keypoint extraction apparatus, characterized in that it comprises:
the image acquisition module is used for acquiring an image to be processed, and a target object is displayed in the image to be processed;
the characteristic extraction module is used for extracting the characteristics of the image to be processed to obtain a characteristic diagram of the image to be processed; wherein the feature map is used for characterizing feature information of at least one key point related to the target object;
the direction processing module is used for processing the characteristic diagram to respectively obtain a transverse characteristic and a longitudinal characteristic; the transverse features are used for representing the feature information of the image to be processed in the horizontal direction, and the longitudinal features are used for representing the feature information of the image to be processed in the vertical direction;
the position determining module is used for determining horizontal position information respectively corresponding to the at least one key point in the image to be processed according to the transverse features; and determining the vertical position information respectively corresponding to the at least one key point in the image to be processed according to the longitudinal features.
9. A computer device, characterized in that the computer device comprises a processor and a memory, in which a computer program is stored, which computer program is loaded and executed by the processor to implement the method according to any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which is loaded and executed by a processor to implement the method according to any one of claims 1 to 7.
CN202210995374.6A 2022-08-18 2022-08-18 Key point extraction method, device, equipment and storage medium Pending CN115359265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210995374.6A CN115359265A (en) 2022-08-18 2022-08-18 Key point extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210995374.6A CN115359265A (en) 2022-08-18 2022-08-18 Key point extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115359265A true CN115359265A (en) 2022-11-18

Family

ID=84003508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210995374.6A Pending CN115359265A (en) 2022-08-18 2022-08-18 Key point extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115359265A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636078A (en) * 2024-01-25 2024-03-01 华南理工大学 Target detection method, target detection system, computer equipment and storage medium
CN117636078B (en) * 2024-01-25 2024-04-19 华南理工大学 Target detection method, target detection system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111126272B (en) Posture acquisition method, and training method and device of key point coordinate positioning model
US20210342990A1 (en) Image coordinate system transformation method and apparatus, device, and storage medium
US20220237829A1 (en) Artificial intelligence-based image generation method, device and apparatus, and storage medium
CN111240476B (en) Interaction method and device based on augmented reality, storage medium and computer equipment
CN111553267B (en) Image processing method, image processing model training method and device
JP2023548921A (en) Image line-of-sight correction method, device, electronic device, computer-readable storage medium, and computer program
CN111680672B (en) Face living body detection method, system, device, computer equipment and storage medium
CN111401216A (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN113570684A (en) Image processing method, image processing device, computer equipment and storage medium
CN114758362B (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual shielding
CN109978077B (en) Visual recognition method, device and system and storage medium
CN111652974A (en) Method, device and equipment for constructing three-dimensional face model and storage medium
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN111754396A (en) Face image processing method and device, computer equipment and storage medium
CN109684969A (en) Stare location estimation method, computer equipment and storage medium
CN111583399A (en) Image processing method, device, equipment, medium and electronic equipment
CN114998934A (en) Clothes-changing pedestrian re-identification and retrieval method based on multi-mode intelligent perception and fusion
WO2023184817A1 (en) Image processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN112699857A (en) Living body verification method and device based on human face posture and electronic equipment
CN115359265A (en) Key point extraction method, device, equipment and storage medium
Li et al. Global co-occurrence feature learning and active coordinate system conversion for skeleton-based action recognition
CN111626212B (en) Method and device for identifying object in picture, storage medium and electronic device
CN113822114A (en) Image processing method, related equipment and computer readable storage medium
CN116129473A (en) Identity-guide-based combined learning clothing changing pedestrian re-identification method and system
CN117011449A (en) Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination