WO2022237688A1 - Method and apparatus for pose estimation, computer device, and storage medium - Google Patents

Method and apparatus for pose estimation, computer device, and storage medium

Info

Publication number
WO2022237688A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature
position information
features
Application number
PCT/CN2022/091484
Other languages
French (fr)
Chinese (zh)
Inventor
贾配洋
侯俊
Original Assignee
影石创新科技股份有限公司
Application filed by 影石创新科技股份有限公司
Publication of WO2022237688A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition

Definitions

  • the present application relates to the technical field of computer vision, and in particular to a pose estimation method and apparatus, a computer device, and a storage medium.
  • pose estimation, as one of the important applications of computer vision, has also developed rapidly and is widely used in fields such as object activity analysis, video surveillance, and object interaction.
  • through human body pose estimation, various key points of a human body can be detected in an image containing a human body.
  • for example, the facial features, limbs, or joints of the human body can be obtained through human body pose estimation. Because of these capabilities, it is widely used in scenes such as stop-motion animation, collage dance, transparent-person effects, walking stitching, or action classification.
  • a pose estimation method is provided, comprising: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.
  • in some embodiments, the image feature expansion network includes a plurality of feature convolution channels, and performing feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image feature includes: inputting the first extracted feature into the plurality of feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted feature with a feature-dimension-preserving convolution kernel to obtain the convolution feature output by that channel; and combining the convolution features output by the feature convolution channels to obtain the expanded image feature.
  • in some embodiments, determining the key point position information corresponding to the target object in the target image based on the compressed image features includes: amplifying the compressed image features to obtain enlarged image features; convolving the enlarged image features to obtain third extracted features; and determining, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • in some embodiments, acquiring the target image to be subjected to pose estimation includes: acquiring an initial image; performing object detection on the initial image to obtain the probability that each of multiple candidate image areas in the initial image includes the target object; selecting, based on the probability that each candidate image area includes the target object, an object image area including the target object from the candidate image areas; and extracting an intercepted image area from the initial image according to the object image area, with the intercepted image serving as the target image for pose estimation.
  • in some embodiments, extracting the intercepted image area from the initial image according to the object image area and using the intercepted image as the target image for pose estimation includes: obtaining the center coordinates of the object image area; obtaining the area size corresponding to the object image area, and obtaining an area extension value based on the area size and a size expansion coefficient; extending, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain extension coordinates; and taking the image area located within the extension coordinates as the intercepted image area, with the intercepted image serving as the target image to be subjected to pose estimation.
  • in some embodiments, the method further includes: converting each key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information, the target point position information being the position information of the key point in the initial image; and performing pose estimation on the target object based on each target point position information to obtain the target pose corresponding to the target image.
  • a method for generating a target video is also provided, the method further comprising: acquiring a target action and determining a pose sequence corresponding to the target action, the poses in the pose sequence, executed in order, producing the target action; performing the above pose estimation method to obtain the target pose corresponding to each target image in a target image set; obtaining, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arranging the obtained video frame images according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • a pose estimation device is provided, comprising: a target image acquisition module, configured to acquire a target image to be subjected to pose estimation, the target image including a target object to be processed; a first feature extraction module, configured to perform feature extraction based on the target image to obtain a first extracted feature; an expanded image feature obtaining module, configured to perform feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; a second extracted feature obtaining module, configured to perform feature extraction on the expanded image feature to obtain a second extracted feature; a compressed image feature obtaining module, configured to perform feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and a key point position information determination module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image feature, and to perform pose estimation on the target object based on the key point position information.
  • the expanded image feature obtaining module is configured to input the first extracted features into the multiple feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted features with a feature-dimension-preserving convolution kernel to obtain the convolution features output by that channel, and to combine the convolution features output by the feature convolution channels to obtain the expanded image features.
  • the key point position information determination module is configured to amplify the compressed image features to obtain enlarged image features, convolve the enlarged image features to obtain third extracted features, and determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • the target image acquisition module is configured to acquire an initial image; perform object detection on the initial image to obtain the probability that each of multiple candidate image areas in the initial image includes the target object; select, based on the probability that each candidate image area includes the target object, an object image area including the target object from the candidate image areas; and extract an intercepted image area from the initial image according to the object image area, with the intercepted image serving as the target image to be subjected to pose estimation.
  • the target image acquisition module is configured to acquire the center coordinates of the object image area; acquire the area size corresponding to the object image area and obtain the area extension value based on the area size and the size expansion coefficient; extend, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain the extension coordinates; and use the image area within the extension coordinates as the intercepted image area, with the intercepted image serving as the target image to be subjected to pose estimation.
  • the target image acquisition module is configured to convert each key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information, the target point position information being the position information of the key point in the initial image, and to perform pose estimation on the target object based on each target point position information to obtain the target pose corresponding to the target image.
  • a target video generation device is also provided, the device being configured to acquire a target action and determine the pose sequence corresponding to the target action, the poses in the pose sequence, executed in order, producing the target action; acquire the target pose corresponding to each target image in a target image set; obtain, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arrange the obtained video frame images according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.
  • a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.
  • the above pose estimation method, device, computer device, and storage medium acquire the target image to be subjected to pose estimation, the target image including the target object to be processed; perform feature extraction based on the target image to obtain the first extracted feature; perform feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image feature; perform feature extraction on the expanded image feature to obtain the second extracted feature; perform feature compression on the second extracted feature through the image feature compression network to obtain the compressed image feature; determine the key point position information corresponding to the target object in the target image based on the compressed image feature; and perform pose estimation on the target object based on the key point position information.
  • the image feature expansion network expands the extracted first extracted feature so that as many image features as possible are available at the input of the pose estimation network, and the image feature compression network then compresses the second extracted feature; combining these operations improves the efficiency and accuracy of pose estimation.
  • Fig. 1 is an application environment diagram of a pose estimation method in an embodiment
  • Fig. 2 is a schematic flow chart of a pose estimation method in an embodiment
  • Fig. 3 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 4 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 5 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 6 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 7 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 8 is a schematic flow chart of a method for generating a target video in an embodiment
  • Fig. 9 is a schematic diagram of a panoramic image including objects in an embodiment
  • Fig. 10 is a schematic diagram of detecting an object in an embodiment
  • Fig. 11 is a schematic diagram of intercepting object sub-images in an embodiment
  • Fig. 12 is a schematic diagram of object key points in an embodiment
  • Fig. 13 is a schematic diagram of a human body pose model in an embodiment
  • Fig. 14 is a structural block diagram of a pose estimation device in an embodiment
  • Fig. 15 is a diagram of the internal structure of a computer device in an embodiment.
  • the pose estimation method provided in this application can be applied to the application environment shown in Fig. 1, and specifically to a pose estimation system.
  • the pose estimation system includes an image acquisition device 102 and a terminal 104, wherein the image acquisition device 102 and the terminal 104 are connected in communication.
  • the terminal 104 executes a pose estimation method.
  • the terminal 104 acquires a target image to be subjected to pose estimation transmitted from the image acquisition device 102, the target image including a target object to be processed; the terminal 104 performs feature extraction based on the target image to obtain a first extracted feature; performs feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performs feature extraction on the expanded image feature to obtain a second extracted feature; performs feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; determines key point position information corresponding to the target object in the target image based on the compressed image feature; and performs pose estimation on the target object based on the key point position information.
  • the image acquisition device 102 may be, but is not limited to, various devices with an image acquisition function, and may be distributed outside the terminal 104 or inside the terminal 104.
  • the terminal 104 can be, but is not limited to, various cameras, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. It can be understood that the method provided in the embodiment of the present application may also be executed by a server.
  • a pose estimation method is provided.
  • the method is described by taking its application to the terminal in Fig. 1 as an example, and includes the following steps:
  • Step 202: acquire a target image to be subjected to pose estimation; the target image includes a target object to be processed.
  • pose estimation refers to the process of detecting key points of the target object and describing one or more of those key points in order to estimate the pose of the target object.
  • Key points refer to feature points that can describe the structural features of the target object. For example, the facial features, leg joints, or hand joints of the target object.
  • the target object refers to the object whose pose is to be estimated, for example, a human body or an animal.
  • the terminal can directly or indirectly obtain the target image to be subjected to pose estimation.
  • the terminal takes the image received from the image acquisition device, which includes the target object to be processed, as the target image.
  • the terminal preprocesses the received image transmitted from the image acquisition device, and uses the preprocessed image as the target image.
  • the image acquisition device is a panoramic camera. After the panoramic camera collects a panoramic image, the panoramic image is used as the target image to be subjected to pose estimation, and the target image includes the target object to be processed.
  • the target object can be complete, only partially contained or occluded.
  • the terminal acquires a panoramic image by extracting frames from a panoramic video, and obtains the target image to be subjected to pose estimation either directly or after preprocessing the panoramic image.
  • the preprocessing includes normalizing the panoramic image or cropping the target object in the panoramic image.
  • Step 204: perform feature extraction based on the target image to obtain the first extracted features.
  • a feature is information representing a specific attribute of a target image, through which an object in the target image can be identified or the target image can be classified.
  • feature extraction may be performed on the target image through a feature extraction network to obtain the first extracted feature.
  • the feature extraction of the target image may be performed through a lightweight deep neural network to obtain the first extracted feature.
  • Step 206: perform feature expansion on the first extracted features through the image feature expansion network to obtain the expanded image features.
  • the image feature expansion network refers to a network that can increase the number of image features.
  • the expanded image feature refers to the image feature after the image feature is expanded.
  • the channels of the acquired first extracted features are expanded by point-by-point convolution, which enriches the number of features and yields the expanded image features.
  • for example, the image feature expansion network uses 1*1 point-by-point convolution to perform feature expansion on the first extracted features to obtain the expanded image features.
  • Step 208: perform feature extraction on the expanded image features to obtain the second extracted features.
  • feature extraction may be performed on the expanded image features through convolution with fewer parameters to obtain the second extracted features.
  • the expanded image features may be down-sampled by preset convolution and activation functions, and feature extraction may be performed on the expanded image features to obtain the second extracted features.
  • for example, the preset convolution can be a 3*3 convolution with a ReLU (Rectified Linear Unit) activation function, which performs feature extraction on the expanded image features to obtain the second extracted features.
  • the above activation function can be replaced with a Sigmoid function (S-shaped growth curve), an ELU (Exponential Linear Unit), a GELU (Gaussian Error Linear Unit), or the like.
  • Step 210: perform feature compression on the second extracted features through the image feature compression network to obtain the compressed image features.
  • the image feature compression network refers to a network that can reduce the number of image features.
  • Compressed image features refer to image features after image features are compressed.
  • the terminal may perform feature compression on the second extracted features, so as to increase the speed of the terminal's pose estimation.
  • the image feature compression network uses 1*1 point-by-point convolution to perform feature compression on the second extracted features, obtaining the compressed image features after a linear transformation.
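  • as an illustration only (not the patented network itself), the expand, extract, and compress steps above can be sketched in PyTorch as follows; the channel counts are assumptions chosen for the example:

    import torch
    import torch.nn as nn

    class ExpandExtractCompress(nn.Module):
        # sketch of the expand -> extract -> compress sequence described above
        def __init__(self, in_ch=64, expanded_ch=256, out_ch=64):
            super().__init__()
            # image feature expansion network: 1*1 point-by-point convolution
            self.expand = nn.Conv2d(in_ch, expanded_ch, kernel_size=1, bias=False)
            # feature extraction on the expanded features: 3*3 convolution + ReLU
            self.extract = nn.Sequential(
                nn.Conv2d(expanded_ch, expanded_ch, kernel_size=3, padding=1, bias=False),
                nn.ReLU(inplace=True),
            )
            # image feature compression network: 1*1 point-by-point convolution,
            # left linear (no activation), matching the linear transformation above
            self.compress = nn.Conv2d(expanded_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x):
            return self.compress(self.extract(self.expand(x)))

    # first_extracted = torch.randn(1, 64, 64, 64)  # N, C, H, W
    # compressed = ExpandExtractCompress()(first_extracted)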
  • Step 212: determine the key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.
  • the key point position information refers to information capable of determining the position of the key point in the target image. For example, information such as the coordinates, names or directions of the key points in the target image.
  • the terminal may obtain the key point position information corresponding to the target object through the correspondence between the compressed image feature and the key point position information corresponding to the target object.
  • the terminal stores a matching relationship table between compressed image features and key point position information. After obtaining the compressed image features, the terminal obtains the corresponding key point position information by traversing the matching relationship table, and performs pose estimation on the target object according to the position coordinates and names in the key point position information. For example, if the key point position information obtained by the terminal after traversing the matching relationship table is (200, 200, wrist joint), the position coordinates of the key point are (200, 200) and the key point is the wrist joint; the pose is estimated by describing the position information of multiple such key points.
  • in this method, the target image including the target object to be processed is acquired; feature extraction is performed based on the target image to obtain the first extracted feature; feature expansion is performed on the first extracted feature through the image feature expansion network to obtain the expanded image features; feature extraction is performed on the expanded image features to obtain the second extracted features; feature compression is performed on the second extracted features through the image feature compression network to obtain the compressed image features; the key point position information corresponding to the target object in the target image is determined based on the compressed image features; and pose estimation is performed on the target object based on the key point position information.
  • feature expansion of the extracted first extracted features through the image feature expansion network allows as many image features as possible to be available at the input of the pose estimation network, and the image feature compression network then compresses the second extracted features; combining these operations improves the efficiency and accuracy of pose estimation.
  • in some embodiments, the image feature expansion network includes a plurality of feature convolution channels, and performing feature expansion on the first extracted features through the image feature expansion network to obtain the expanded image features includes:
  • Step 302: input the first extracted features into the multiple feature convolution channels of the image feature expansion network; each feature convolution channel convolves the first extracted features with a feature-dimension-preserving convolution kernel to obtain the convolution features output by that channel.
  • the feature-dimension-preserving convolution kernel refers to a convolution kernel that keeps the dimension of the image unchanged, where the dimension of the image refers to the number of channels of the image; for example, a convolution kernel of size 1*1.
  • by using the feature-dimension-preserving convolution kernel, the terminal can convolve the first extracted features in each feature convolution channel and obtain the convolution features output by the feature convolution channels with fewer parameters, while keeping the scale of the first extracted features unchanged.
  • the terminal convolves the first extracted features in each feature convolution channel with feature-dimension-preserving convolution kernels of a set number and size, and obtains the convolution features output by the feature convolution channels. For example, adding a convolution kernel with a size of 1*1 and 256 channels behind a 3*3 convolutional network with 64 channels introduces 64*256 parameters and expands the number of channels from 64 to 256.
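  • as a quick check of that parameter count (a sketch, assuming no bias term), a 1*1 convolution expanding 64 channels to 256 channels holds exactly 64*256 weights:

    import torch.nn as nn

    expand = nn.Conv2d(64, 256, kernel_size=1, bias=False)
    print(expand.weight.numel())  # 16384 == 64 * 256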
  • Step 304: combine the convolution features output by the feature convolution channels to obtain the expanded image features.
  • the feature-dimension-preserving convolution kernel can linearly combine each pixel of the first extracted features across different channels to obtain the expanded image features.
  • for example, the network is composed by adding a 1*1, 28-channel convolution kernel behind the 3*3, 64-channel convolution kernel, which is equivalent to a 3*3, 28-channel convolution: the original 64 channels are linearly combined across channels into 28 channels, realizing information interaction between channels, and the expanded image features are obtained from the convolution features output by each feature convolution channel.
  • in this way, the convolution features output by the feature convolution channels are obtained and combined into the expanded image features, which achieves the purpose of obtaining the expanded image features with fewer parameters, thereby improving the efficiency of pose estimation.
  • determining the key point position information corresponding to the target object in the target image based on the compressed image features includes:
  • Step 402: amplify the compressed image features to obtain enlarged image features.
  • the enlarged image features are obtained by upsampling the features.
  • the terminal amplifies the compressed image features and sets the numbers of input and output channels of the three-layer upsampling network to (256, 128), (128, 64), and (64, 64), which reduces the amount of network parameters and computation.
  • the terminal performs interpolation on the compressed image features to obtain the enlarged image features. For example, on the basis of the compressed image features, new elements are inserted between pixels using a suitable interpolation algorithm such as linear interpolation or bilinear interpolation.
  • Step 404: perform convolution on the enlarged image features to obtain the third extracted features.
  • after the enlarged image features are obtained, they are convolved to obtain the third extracted features, in order to compensate for the reduction of nonlinear units during the enlargement of the compressed image features.
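  • a minimal sketch of such an upsampling head, assuming bilinear interpolation and the (256, 128), (128, 64), (64, 64) channel configuration mentioned above, with a 3*3 convolution and ReLU after each enlargement to restore nonlinearity:

    import torch.nn as nn

    def upsample_block(in_ch, out_ch):
        # double the spatial resolution by bilinear interpolation, then
        # convolve to compensate for the reduced nonlinear units
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True),
        )

    upsampling_head = nn.Sequential(
        upsample_block(256, 128),
        upsample_block(128, 64),
        upsample_block(64, 64),
    )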
  • Step 406: determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • searching and filtering are performed on the third extracted features to determine key point position information corresponding to the target object in the target image.
  • the terminal stores a matching list of image features and key point position information. After the third extracted features are obtained, the matching list is traversed to obtain the key point position information corresponding to the third extracted features, that is, the key point position information corresponding to the target object in the target image.
  • in this way, the image output quality is improved, so that the key point position information corresponding to the target object is obtained more accurately.
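  • the embodiment below outputs a heat map per key point (see Fig. 12); one common way to decode coordinates from such heat maps, shown here as an assumed sketch rather than the patent's own decoding rule, is a per-channel argmax:

    import torch

    def decode_keypoints(heatmaps):
        # heatmaps: (num_keypoints, H, W), one channel per key point;
        # returns (x, y) positions in heat-map coordinates
        k, h, w = heatmaps.shape
        flat_idx = heatmaps.reshape(k, -1).argmax(dim=1)
        ys, xs = flat_idx // w, flat_idx % w
        return [(int(x), int(y)) for x, y in zip(xs, ys)]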
  • acquiring a target image to be subjected to pose estimation includes:
  • Step 502: acquire an initial image.
  • the initial image refers to the unprocessed original image.
  • the original image is an image obtained directly by an image acquisition device or a terminal.
  • the terminal can collect the initial image through a connected image acquisition device, and the acquisition device transmits the acquired initial image to the terminal in real time; or the acquisition device temporarily stores the acquired initial image locally and, when receiving an image acquisition instruction from the terminal, transmits the locally stored initial image to the terminal. Accordingly, the terminal acquires the initial image.
  • the terminal collects the initial image through an internal image collection module, stores the collected image in the terminal memory, and obtains the initial image from the memory when the terminal needs to obtain the initial image.
  • Step 504: perform object detection on the initial image to obtain the probability that each of the multiple candidate image areas in the initial image includes the target object.
  • the initial image is divided into multiple image sub-areas as candidate image areas, and the probability of the target object appearing in each candidate image area is detected. For example, if the image is divided into sub-areas A, B, and C, the probability of the target object in sub-area A is 0%, in sub-area B 10%, and in sub-area C 90%.
  • the terminal obtains the probability that each image sub-region includes the target object by gradually reducing the size of the image sub-regions.
  • Step 506: select, based on the probability that each candidate image area includes the target object, an object image area including the target object from the candidate image areas.
  • the terminal may compare the probabilities of the candidate image areas, obtain the candidate image areas whose probabilities fall within a preset probability threshold range, and use such a candidate image area as the object image area including the target object.
  • the terminal traverses the probabilities that the candidate image areas include the target object, obtains the maximum probability value among them, and uses the candidate image area corresponding to the maximum probability value as the object image area of the target object.
  • Step 508: extract the intercepted image area from the initial image according to the object image area, and use the intercepted image as the target image to be subjected to pose estimation.
  • the image of the obtained object image area can be intercepted as the target image to be subjected to pose estimation, so as to reduce the amount of computation during pose estimation and improve its efficiency.
  • the terminal can extract the coordinate information of the object image area, and use the coordinate information to intercept and obtain the target image for pose estimation.
  • the frame-selected image area may be used as the object image area, and the frame-selected image area may be intercepted from the initial image as the target image to be subjected to pose estimation.
  • in this way, the probabilities that the multiple candidate image areas in the initial image include the target object are obtained; based on these probabilities, the object image area including the target object is selected from the candidate image areas; the intercepted image area is extracted from the initial image according to the object image area; and the intercepted image is used as the target image for pose estimation, achieving the purpose of accurately obtaining the target image from the initial image.
  • in some embodiments, extracting the intercepted image area from the initial image and using the intercepted image as the target image to be subjected to pose estimation includes:
  • Step 602: acquire the center coordinates of the object image area.
  • the center coordinates refer to the coordinates of the pixel at the center point of the object image area.
  • these are the coordinates, in the initial image, of the pixel at the center of the object image area of the target object, and can be determined according to the length and width of the initial image.
  • the terminal obtains the pixel point at the center of the object image area, and obtains the coordinates of the pixel point through a pixel coordinate acquisition tool.
  • Step 604: obtain the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient.
  • the area size refers to the area length and area width of the object image area. For example, if the area length of the object image area is h and the area width is w, the area size is w*h. The size expansion coefficient refers to a coefficient that can increase the area size.
  • the area extension value refers to the growth value of the area size obtained by correcting the area size with the size expansion coefficient.
  • the terminal may then acquire the area size corresponding to the object image area, and obtain the area extension value through the functional relationship between the area size and the size expansion coefficient.
  • the terminal obtains the area size corresponding to the object image area through the image size measurement tool, and obtains the area extension value by using the product relationship between the area size and the size expansion coefficient.
  • for example, if the area width in the area size is w and the size expansion coefficient is exp_ratio, the area extension value of the area width is w*exp_ratio*1.2/2; similarly, the area extension value of the area length in the area size can be obtained through the corresponding size expansion coefficient.
  • Step 606: extend, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain the extension coordinates.
  • the extension direction refers to the direction in which the width and length increase corresponding to the area extension value.
  • the extension coordinates refer to the coordinates of the object image area obtained by extending the object image area in the extension direction with the center coordinates as a reference point.
  • the coordinates may be represented by the coordinates of the upper left corner and the lower right corner of the object image area.
  • the terminal may use the center coordinates as a reference point to expand the target image area by using the area extension value to obtain the extension coordinates corresponding to the target image area.
  • the center coordinates are expressed as (x, y), and the extension coordinates are (x0, y0) and (x1, y1), where x0 and x1 are the coordinates obtained from the extension value of the object image area in the width direction of the image, and y0 and y1 are the coordinates obtained from the extension value in the length direction; the extension coordinates can then be expressed by the formulas:
  • x0 = x - w*exp_ratio*1.2/2, x1 = x + w*exp_ratio*1.2/2
  • y0 = y - h*exp_ratio*1.2/2, y1 = y + h*exp_ratio*1.2/2
  • when the extension coordinate in the width extension direction is less than or equal to 0, it is set to zero; when it is greater than or equal to the width of the initial image, the width of the initial image is used instead.
  • likewise, when the extension coordinate in the length extension direction is less than or equal to zero, it is set to zero; when it is greater than or equal to the height of the initial image, the height of the initial image is used instead.
  • Step 608: take the image area located within the extension coordinates as the intercepted image area, and use the intercepted image as the target image to be subjected to pose estimation.
  • the terminal may use the image area within the extended coordinates as the intercepted image area, and use the intercepted image as the target image for pose estimation.
  • in this way, the area extension value is obtained based on the area size and the size expansion coefficient; based on the center coordinates and the area extension value, the area is extended in the extension direction corresponding to the area extension value to obtain the extension coordinates; the image area located within the extension coordinates is used as the intercepted image area; and the intercepted image is used as the target image for pose estimation. This achieves the purpose of accurately intercepting the target image, thereby improving the efficiency of pose estimation.
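  • a minimal sketch of this expansion-and-crop step, assuming the w*exp_ratio*1.2/2 half-extension given above and an image stored as a NumPy array (the function and variable names are illustrative, not from the patent):

    import numpy as np

    def crop_expanded_box(image, center, size, exp_ratio):
        # image: H x W x C array; center: (x, y); size: (w, h) of the object box
        x, y = center
        w, h = size
        dx = w * exp_ratio * 1.2 / 2  # half-extension in the width direction
        dy = h * exp_ratio * 1.2 / 2  # half-extension in the length direction
        img_h, img_w = image.shape[:2]
        # clamp the extension coordinates to the bounds of the initial image
        x0, x1 = max(0, int(x - dx)), min(img_w, int(x + dx))
        y0, y1 = max(0, int(y - dy)), min(img_h, int(y + dy))
        return image[y0:y1, x0:x1]  # the intercepted (target) image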
  • the method also includes:
  • Step 702: convert each key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information; the target point position information is the position information of the key point in the initial image.
  • position information refers to location-related information that can reflect a certain position point.
  • the position point may be a key point, or other points having the same structure or function as the key point.
  • the position-related information may be coordinate information of a key point or information describing the position of the key point.
  • for example, key point position information can be expressed as (100, 100, eye).
  • there is a one-to-one correspondence between the key point position information in the target image and the target point position information in the initial image, and they can be converted into each other. Once the key point position information is known, the position of the key point in the initial image can be obtained correspondingly, so that the position information obtained in the target image is accurately reflected in the initial image.
  • for example, the key point position information of the j-th key point is expressed as (x_keypoints_j, y_keypoints_j), the upper-left vertex coordinates of the i-th target image in the initial image are expressed as (x_person_i, y_person_i), and the coordinates of the key point in the initial image are expressed as (x_original_keypoints, y_original_keypoints); the coordinates of the key point in the initial image are then given by the formulas:
  • x_original_keypoints = x_person_i + x_keypoints_j
  • y_original_keypoints = y_person_i + y_keypoints_j
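  • as a one-line sketch of this mapping (the function name is illustrative):

    def to_initial_image(keypoint_xy, crop_origin_xy):
        # add the crop's upper-left vertex offset to the key point coordinates
        (xk, yk), (xp, yp) = keypoint_xy, crop_origin_xy
        return (xp + xk, yp + yk)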
  • Step 704: perform pose estimation on the target object based on each target point position information to obtain the target pose corresponding to the target image.
  • after determining the target point position information corresponding to each of the multiple key points, the terminal performs pose estimation through the correspondence between the specific type of each key point and its target point position information, and obtains the target pose corresponding to the target image.
  • in this way, each key point position information is converted into the corresponding target point position information, and pose estimation is performed on the target object through each target point position information to obtain the target pose corresponding to the target image, achieving the purpose of obtaining the target pose in the target image.
  • the target video generation method includes:
  • Step 802: acquire the target action and determine the pose sequence corresponding to the target action, where the poses in the pose sequence, executed in order, produce the target action.
  • the target action refers to the action obtained after each pose is executed in sequence.
  • a pose refers to an individual sub-action that makes up an action.
  • for example, the target action is an arm-stretching movement composed of multiple sub-actions such as placing the arms flat, straightening the arms, and turning the arms together sideways.
  • multiple poses can form a pose sequence according to their order, and when the pose sequence is executed in order, the target action is obtained.
  • Step 804: acquire the target pose corresponding to each target image in the target image set.
  • the terminal can obtain the poses based on the above pose estimation method; each pose constituting the target action is in a different target image, and the corresponding pose can be obtained from each target image.
  • for example, the target pose F exists in the target image E, the target pose H exists in the target image G, and so on.
  • in this way, the target pose corresponding to each target image in the target image set is obtained.
  • Step 806: acquire, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image.
  • after acquiring the target poses, the terminal obtains the image corresponding to each target pose, and uses each obtained image as a video frame image.
  • the timestamps corresponding to the images of the target poses are obtained, and the images carrying their respective timestamps are used as the video frame images.
  • Step 808: arrange the obtained video frame images according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • the corresponding video frame images are arranged according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • the poses in the pose sequence are ordered; that is, after the video frame images are obtained, they are arranged according to their timestamps, and the target video is obtained from the arranged video frame images.
  • in this way, the image corresponding to each target pose in the pose sequence is obtained from the target image set as a video frame image, and the obtained video frame images are arranged according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action (see the sketch below), which achieves the purpose of obtaining the target video corresponding to the target action through pose estimation, so that pose estimation can be realized and applied in practice.
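  • a minimal sketch of this assembly step, assuming pose estimation has already produced a mapping from each pose to its image (the names are illustrative):

    def build_target_video(pose_sequence, pose_to_image):
        # arrange the frame images in pose-sequence order to form the video
        return [pose_to_image[pose] for pose in pose_sequence]

    # frames = build_target_video(["flat", "straight", "sideways"], images_by_pose)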
  • the following takes the case where the terminal is a panoramic camera and the target object is a human body as an example.
  • the image needs to include a human target object for human pose estimation.
  • the human target object may be complete, partially contained, or occluded.
  • the coordinate values of the human body bounding box B1 are obtained through a human body tracking or detection algorithm, and the coordinate values of the bounding box B2, obtained by expanding the human body bounding box, are then determined.
  • the panoramic image is cropped to obtain a sub-panoramic image bounded by B2.
  • the sub-panoramic image is input into the trained human pose estimation model and, as shown in Fig. 12, the heat map of the first key point C is obtained; the heat map of the second key point is obtained in the same way, and so on, until a preset number of key point heat maps is obtained.
  • the coordinates of the key points in the heat maps are mapped to the sub-panoramic image, and the positions of the key points in the sub-panoramic image are mapped to the panoramic image, so as to obtain the positions of the key points in the panoramic image, thereby estimating the pose of the human body.
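  • a sketch of this two-stage mapping, assuming (as an illustration, not from the patent) that the heat maps are smaller than the sub-image by an integer stride and that B2's upper-left corner is known:

    def heatmap_to_panorama(heatmap_xy, stride, b2_origin_xy):
        hx, hy = heatmap_xy
        # heat map -> sub-panoramic image: undo the network's downsampling stride
        sx, sy = hx * stride, hy * stride
        # sub-panoramic image -> panorama: add the crop's upper-left offset (B2)
        ox, oy = b2_origin_xy
        return (sx + ox, sy + oy)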
  • when the terminal performs normalization on the panoramic image or the sub-panoramic image, the pixel value of a pixel in the normalized image is obtained from the proportional relationship between the difference of the pixel value of the pixel in the original image and the average pixel value.
  • if the pixel value of a certain pixel in the normalized image is expressed as X_normalization, the pixel value of the corresponding pixel in the panoramic image or sub-panoramic image is expressed as X, the average value of all pixel values in the panoramic image or sub-panoramic image is expressed as mean, and the proportional coefficient is expressed as std, then X_normalization is expressed as the formula:
  • X_normalization = (X - mean) / std
  • std can be the variance of all pixels in the panoramic image or sub-panoramic image; a certain pixel in the panoramic image or sub-panoramic image can be a pixel of the RGB (red, green, blue) three channels.
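  • a one-line sketch of this normalization for an image stored as a NumPy array (here std is taken as the standard deviation over all pixels; as noted above, a variance-based coefficient may also be used):

    import numpy as np

    def normalize(image):
        # X_normalization = (X - mean) / std, computed over all pixels
        mean, std = image.mean(), image.std()
        return (image - mean) / std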
  • the terminal may obtain the coordinate value of the bounding box of the human body through a human body detection algorithm.
  • for example, detection algorithms such as Faster R-CNN (Faster Region-CNN), the YOLO (You Only Look Once) series, or the SSD (Single Shot MultiBox Detector) series may be used, or tracking algorithms such as Siamese-network tracking algorithms.
  • the human pose estimation model can be lightened by reducing the number of image feature blocks between stages in HRNet, for example, changing the number of down-sampled image feature blocks in the second stage to 1, so that the model reduces its amount of parameters and computation, thereby improving the efficiency of human pose estimation.
  • a pose estimation device 1400 is provided, including: a target image acquisition module 1402, a first feature extraction module 1404, an expanded image feature obtaining module 1406, a second extracted feature obtaining module 1408, a compressed image feature obtaining module 1410, and a key point position information determination module 1412, wherein: the target image acquisition module 1402 is configured to acquire the target image to be subjected to pose estimation, the target image including the target object to be processed; the first feature extraction module 1404 is configured to perform feature extraction based on the target image to obtain the first extracted feature; the expanded image feature obtaining module 1406 is configured to perform feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image features; the second extracted feature obtaining module 1408 is configured to perform feature extraction on the expanded image features to obtain the second extracted features; the compressed image feature obtaining module 1410 is configured to perform feature compression on the second extracted features through the image feature compression network to obtain the compressed image features; and the key point position information determination module 1412 is configured to determine the key point position information corresponding to the target object in the target image based on the compressed image features, and to perform pose estimation on the target object based on the key point position information.
  • the expanded image feature obtaining module 1406 is configured to input the first extracted features into the multiple feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted features with a feature-dimension-preserving convolution kernel to obtain the convolution features output by that channel, and to combine the convolution features output by the feature convolution channels to obtain the expanded image features.
  • the key point position information determination module 1412 is configured to amplify the compressed image features to obtain enlarged image features, convolve the enlarged image features to obtain the third extracted features, and determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • the target image acquisition module 1402 is configured to acquire an initial image; perform object detection on the initial image to obtain the probability that each of multiple candidate image areas in the initial image includes the target object; select, based on the probability that each candidate image area includes the target object, an object image area including the target object from the candidate image areas; and extract an intercepted image area from the initial image according to the object image area, with the intercepted image serving as the target image to be subjected to pose estimation.
  • the target image acquisition module 1402 is configured to acquire the center coordinates of the object image area; acquire the area size corresponding to the object image area and obtain the area extension value based on the area size and the size expansion coefficient; extend, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain the extension coordinates; and use the image area within the extension coordinates as the intercepted image area, with the intercepted image serving as the target image for pose estimation.
  • the target image acquisition module 1402 is configured to convert each key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information, the target point position information being the position information of the key point in the initial image, and to perform pose estimation on the target object based on each target point position information to obtain the target pose corresponding to the target image.
  • the target video generation device is configured to acquire the target action and determine the pose sequence corresponding to the target action, the poses in the pose sequence, executed in order, producing the target action; acquire the target pose corresponding to each target image in the target image set; obtain, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arrange the obtained video frame images according to the ordering of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • Each module in the above-mentioned attitude estimation device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 15 .
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner; the wireless manner can be realized through Wi-Fi, a carrier network, NFC (Near Field Communication), or other technologies.
  • when the computer program is executed by the processor, a pose estimation method is implemented.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer device may be a touch layer covering the display screen, a button, a trackball, or a touchpad provided on the casing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 15 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the above method embodiments when executing the computer program.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the foregoing method embodiments are implemented.
  • Non-volatile memory can include read-only memory (Read-Only Memory, ROM), tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include Random Access Memory (Random Access Memory, RAM) or external cache memory.
  • RAM can come in many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Abstract

The present application relates to a method and apparatus for pose estimation, a computer device, and a storage medium. The method comprises: acquiring a target image to undergo pose estimation, the target image comprising a target subject to be processed; performing feature extraction on the basis of the target image, and acquiring a first extracted feature; performing feature expansion on the first extracted feature by means of an image feature expansion network, and obtaining an expanded image feature; performing feature extraction on the expanded image feature, and obtaining a second extracted feature; performing feature compression on the second extracted feature by means of an image feature compression network, and obtaining a compressed image feature; determining, on the basis of the compressed image feature, key point location information corresponding to the target subject in the target image, and performing pose estimation on the target subject on the basis of the key point location information. The present method can improve the efficiency of pose estimation.

Description

Method and Apparatus for Pose Estimation, Computer Device, and Storage Medium

Technical Field

The present application relates to the field of computer vision technology, and in particular to a pose estimation method and apparatus, a computer device, and a storage medium.

Background

With the development of computer vision technology, pose estimation, as one of the important applications of computer vision, has also developed rapidly and is widely used in fields such as object activity analysis, video surveillance, and object interaction. For example, human body pose estimation can detect the key points of a human body in an image containing one, such as the facial features, limbs, or joints. Because of these capabilities, pose estimation is widely applied in scenarios such as stop-motion animation, collage dance, transparent-person effects, walking stitching, and action classification.

Technical Problem

However, current pose estimation methods suffer from low efficiency.

Technical Solution

In view of the above technical problem, it is necessary to provide a pose estimation method, apparatus, computer device, and storage medium capable of improving the efficiency of pose estimation.
A pose estimation method, the method including: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.

In one embodiment, the image feature expansion network includes a plurality of feature convolution channels, and performing feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image feature includes: inputting the first extracted feature into the plurality of feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted feature with a feature-dimension-preserving convolution kernel to obtain the convolution feature output by that channel; and combining the convolution features output by the feature convolution channels to obtain the expanded image feature.

In one embodiment, determining the key point position information corresponding to the target object in the target image based on the compressed image feature includes: amplifying the compressed image feature to obtain an enlarged image feature; convolving the enlarged image feature to obtain a third extracted feature; and determining, based on the third extracted feature, the key point position information corresponding to the target object in the target image.

In one embodiment, acquiring the target image to be subjected to pose estimation includes: acquiring an initial image; performing object detection on the initial image to obtain the probabilities that a plurality of candidate image regions in the initial image each include the target object; selecting, based on these probabilities, an object image region including the target object from the candidate image regions; and extracting an intercepted image region from the initial image according to the object image region, the intercepted image serving as the target image to be subjected to pose estimation.

In one embodiment, extracting the intercepted image region from the initial image according to the object image region and using the intercepted image as the target image to be subjected to pose estimation includes: acquiring the center coordinates of the object image region; acquiring the region size corresponding to the object image region, and obtaining a region extension value based on the region size and a size expansion coefficient; extending from the center coordinates in the extension direction corresponding to the region extension value to obtain extension coordinates; and using the image region located within the extension coordinates as the intercepted image region, the intercepted image serving as the target image to be subjected to pose estimation.

In one embodiment, there are a plurality of pieces of key point position information, and the method further includes: converting each piece of key point position information into corresponding target point position information according to the mapping relationship between key point position information and target point position information, the target point position information being the position of the key point in the initial image; and performing pose estimation on the target object based on each piece of target point position information to obtain a target pose corresponding to the target image.

A target video generation method, the method including: acquiring a target action and determining a pose sequence corresponding to the target action, the poses in the pose sequence being executed in order to produce the target action; performing the above pose estimation method to obtain the target pose corresponding to each target image in a target image set; acquiring, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arranging the obtained video frame images according to the ordering of the poses in the pose sequence to obtain a target video corresponding to the target action. A minimal sketch of this method appears below.
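As an illustrative sketch of this target video generation method (all names here are hypothetical, and the image-to-pose mapping is assumed to come from the pose estimation method above):

```python
def generate_target_video(pose_sequence, image_to_pose):
    """pose_sequence: the target poses in execution order for the target action.
    image_to_pose: {image: estimated target pose} for the target image set."""
    frames = []
    for pose in pose_sequence:
        # Pick an image from the target image set whose estimated pose matches
        # the current pose in the sequence; a real implementation would likely
        # use a tolerance-based match rather than strict equality.
        frame = next(img for img, p in image_to_pose.items() if p == pose)
        frames.append(frame)
    # The frames, arranged in pose-sequence order, form the target video.
    return frames
```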
A pose estimation apparatus, the apparatus including: a target image acquisition module, configured to acquire a target image to be subjected to pose estimation, the target image including a target object to be processed; a first feature extraction module, configured to perform feature extraction based on the target image to obtain a first extracted feature; an expanded image feature obtaining module, configured to perform feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; a second extracted feature obtaining module, configured to perform feature extraction on the expanded image feature to obtain a second extracted feature; a compressed image feature obtaining module, configured to perform feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and a key point position information determination module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image feature, and to perform pose estimation on the target object based on the key point position information.

In one embodiment, the expanded image feature obtaining module is configured to input the first extracted feature into the plurality of feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted feature with a feature-dimension-preserving convolution kernel to obtain the convolution feature output by that channel, and to combine the convolution features output by the feature convolution channels to obtain the expanded image feature.

In one embodiment, the key point position information determination module is configured to amplify the compressed image feature to obtain an enlarged image feature, convolve the enlarged image feature to obtain a third extracted feature, and determine, based on the third extracted feature, the key point position information corresponding to the target object in the target image.

In one embodiment, the target image acquisition module is configured to acquire an initial image; perform object detection on the initial image to obtain the probabilities that a plurality of candidate image regions in the initial image each include the target object; select, based on these probabilities, an object image region including the target object from the candidate image regions; and extract an intercepted image region from the initial image according to the object image region, the intercepted image serving as the target image to be subjected to pose estimation.

In one embodiment, the target image acquisition module is configured to acquire the center coordinates of the object image region; acquire the region size corresponding to the object image region and obtain a region extension value based on the region size and a size expansion coefficient; extend from the center coordinates in the extension direction corresponding to the region extension value to obtain extension coordinates; and use the image region located within the extension coordinates as the intercepted image region, the intercepted image serving as the target image to be subjected to pose estimation.

In one embodiment, the target image acquisition module is configured to convert each piece of key point position information into corresponding target point position information according to the mapping relationship between key point position information and target point position information, the target point position information being the position of the key point in the initial image, and to perform pose estimation on the target object based on each piece of target point position information to obtain a target pose corresponding to the target image.

A target video generation apparatus, configured to acquire a target action and determine a pose sequence corresponding to the target action, the poses in the pose sequence being executed in order to produce the target action; acquire the target pose corresponding to each target image in a target image set; acquire, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arrange the obtained video frame images according to the ordering of the poses in the pose sequence to obtain a target video corresponding to the target action.

A computer device, including a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implementing the following steps: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.

A computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; and determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.

Technical Effect

In the above pose estimation method, apparatus, computer device, and storage medium, a target image to be subjected to pose estimation is acquired, the target image including a target object to be processed; feature extraction is performed based on the target image to obtain a first extracted feature; feature expansion is performed on the first extracted feature through an image feature expansion network to obtain an expanded image feature; feature extraction is performed on the expanded image feature to obtain a second extracted feature; feature compression is performed on the second extracted feature through an image feature compression network to obtain a compressed image feature; key point position information corresponding to the target object in the target image is determined based on the compressed image feature, and pose estimation is performed on the target object based on the key point position information. The image feature expansion network first expands the extracted first extracted feature so that as many image features as possible are fed into the input of the pose estimation network, and the image feature compression network then compresses the second extracted feature; together, these features improve both the efficiency and the accuracy of pose estimation.

Brief Description of the Drawings
FIG. 1 is a diagram of an application environment of a pose estimation method in one embodiment;

FIG. 2 is a schematic flowchart of a pose estimation method in one embodiment;

FIG. 3 is a schematic flowchart of a pose estimation method in another embodiment;

FIG. 4 is a schematic flowchart of a pose estimation method in another embodiment;

FIG. 5 is a schematic flowchart of a pose estimation method in another embodiment;

FIG. 6 is a schematic flowchart of a pose estimation method in another embodiment;

FIG. 7 is a schematic flowchart of a pose estimation method in another embodiment;

FIG. 8 is a schematic flowchart of a target video generation method in one embodiment;

FIG. 9 is a schematic diagram of a panoramic image containing an object in one embodiment;

FIG. 10 is a schematic diagram of detecting an object in one embodiment;

FIG. 11 is a schematic diagram of intercepting a sub-image containing an object in one embodiment;

FIG. 12 is a schematic diagram of object key points in one embodiment;

FIG. 13 is a schematic diagram of a human body pose model in one embodiment;

FIG. 14 is a structural block diagram of a pose estimation apparatus in one embodiment;

FIG. 15 is a diagram of the internal structure of a computer device in one embodiment.
Embodiments of the Present Invention

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.

The pose estimation method provided by the present application can be applied in the application environment shown in FIG. 1, and specifically in a pose estimation system. The pose estimation system includes an image acquisition device 102 and a terminal 104, where the image acquisition device 102 and the terminal 104 are communicatively connected. The terminal 104 executes a pose estimation method. Specifically, the terminal 104 acquires a target image to be subjected to pose estimation transmitted from the image acquisition device 102, the target image including a target object to be processed; the terminal 104 performs feature extraction based on the target image to obtain a first extracted feature; performs feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performs feature extraction on the expanded image feature to obtain a second extracted feature; performs feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; determines key point position information corresponding to the target object in the target image based on the compressed image feature; and performs pose estimation on the target object based on the key point position information. The image acquisition device 102 can be, but is not limited to, any device with an image acquisition function, and can be located outside or inside the terminal 104, for example, various cameras, scanners, or image capture cards distributed outside the terminal 104. The terminal 104 can be, but is not limited to, various cameras, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. It can be understood that the method provided by the embodiments of the present application can also be executed by a server.
In one embodiment, as shown in FIG. 2, a pose estimation method is provided. The method is described by taking its application to the terminal in FIG. 1 as an example, and includes the following steps:

Step 202: acquire a target image to be subjected to pose estimation, the target image including a target object to be processed.

Here, pose estimation refers to the process of detecting the key points of a target object and estimating the target object's pose from the description of one or more of these key points. A key point is a feature point that can describe the structural characteristics of the target object, for example, the facial features, leg joints, or hand joints. The target object is the object whose pose is to be estimated, for example, a human body or an animal.

Specifically, the terminal can obtain the target image to be subjected to pose estimation directly or indirectly.

In one embodiment, the terminal uses the image received from the image acquisition device, which includes the target object to be processed, as the target image.

In one embodiment, the terminal preprocesses the image received from the image acquisition device and uses the preprocessed image as the target image.

In one embodiment, the image acquisition device is a panoramic camera. After the panoramic camera captures a panoramic image, the panoramic image is used as the target image to be subjected to pose estimation, and the target image includes the target object to be processed. The target object may be complete, partially visible, or occluded.

In one embodiment, the terminal obtains a panoramic image by extracting frames from a panoramic video, and obtains the target image to be subjected to pose estimation either directly from the panoramic image or after preprocessing it. The preprocessing includes operations such as normalizing the panoramic image or cropping the target object out of it.
Step 204: perform feature extraction based on the target image to obtain a first extracted feature.

Here, a feature is information representing attributes specific to the target image; through this information, an object in the target image can be recognized, or the target image can be classified.

Specifically, feature extraction can be performed on the target image through a feature extraction network to obtain the first extracted feature.

In one embodiment, feature extraction can be performed on the target image through a lightweight deep neural network to obtain the first extracted feature.
Step 206: perform feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature.

Here, the image feature expansion network is a network that increases the number of image features, and the expanded image feature is the image feature obtained after this expansion.

Specifically, the channels of the acquired first extracted feature are expanded through pointwise convolution, enriching the number of features and producing the expanded image feature.

In one embodiment, after the terminal acquires the first extracted feature, the image feature expansion network performs feature expansion on it using a 1*1 pointwise convolution to obtain the expanded image feature.
Step 208: perform feature extraction on the expanded image feature to obtain a second extracted feature.

Specifically, after the expanded image feature is obtained, feature extraction can be performed on it through a convolution with relatively few parameters to obtain the second extracted feature.

In one embodiment, the expanded image feature can be down-sampled and feature-extracted through a preset convolution and an activation function to obtain the second extracted feature. For example, the preset convolution can be a 3*3 convolution paired with a ReLU (Rectified Linear Unit) activation function. Depending on the application scenario, the activation function can be replaced with a Sigmoid function (S-shaped growth curve), an ELU (Exponential Linear Unit), a GELU (Gaussian Error Linear Unit), or the like.
Step 210: perform feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature.

Here, the image feature compression network is a network that reduces the number of image features, and the compressed image feature is the image feature obtained after this compression.

Specifically, after obtaining the second extracted feature, the terminal can compress it in order to increase the speed at which the terminal performs pose estimation.

In one embodiment, after the terminal obtains the second extracted feature, the image feature compression network performs feature compression on it using a 1*1 pointwise convolution, and the compressed image feature is obtained after a linear transformation.
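The expand-extract-compress sequence of steps 206 to 210 can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the patent's exact network: the channel counts and the input shape are invented for the example, while the 1*1 expansion, the 3*3 convolution with ReLU, and the linear 1*1 compression follow the description above.

```python
import torch
import torch.nn as nn

class ExpandExtractCompress(nn.Module):
    """Sketch of the feature expansion / extraction / compression block."""
    def __init__(self, in_ch=64, expanded_ch=256, out_ch=64):
        super().__init__()
        # Step 206: 1*1 pointwise convolution expands the channel dimension.
        self.expand = nn.Conv2d(in_ch, expanded_ch, kernel_size=1)
        # Step 208: 3*3 convolution + ReLU extracts features; stride=2 is one
        # possible reading of the down-sampling mentioned above.
        self.extract = nn.Sequential(
            nn.Conv2d(expanded_ch, expanded_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Step 210: 1*1 pointwise convolution compresses the channels with no
        # following activation, i.e. a linear transformation.
        self.compress = nn.Conv2d(expanded_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.compress(self.extract(self.expand(x)))

first_feature = torch.randn(1, 64, 64, 48)           # assumed input shape
compressed = ExpandExtractCompress()(first_feature)  # -> [1, 64, 32, 24]
```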
Step 212: determine key point position information corresponding to the target object in the target image based on the compressed image feature, and perform pose estimation on the target object based on the key point position information.

Here, key point position information is information that can determine the position of a key point in the target image, for example, the coordinates, name, or direction of the key point in the target image.

Specifically, after obtaining the compressed image feature, the terminal can obtain the key point position information corresponding to the target object through the correspondence between compressed image features and key point position information.

In one embodiment, the terminal stores a matching relationship table between compressed image features and key point position information. After obtaining the compressed image feature, the terminal obtains the corresponding key point position information by traversing this table, and performs pose estimation on the target object according to the position coordinates and names in the key point position information. For example, if the key point position information obtained after traversing the table is (200, 200, wrist joint), the key point is located at coordinates (200, 200) and is a wrist joint; the pose is estimated from the descriptions of multiple such pieces of key point position information.

In the above pose estimation method, a target image to be subjected to pose estimation is acquired, the target image including a target object to be processed; feature extraction is performed based on the target image to obtain a first extracted feature; feature expansion is performed on the first extracted feature through an image feature expansion network to obtain an expanded image feature; feature extraction is performed on the expanded image feature to obtain a second extracted feature; feature compression is performed on the second extracted feature through an image feature compression network to obtain a compressed image feature; key point position information corresponding to the target object in the target image is determined based on the compressed image feature, and pose estimation is performed on the target object based on the key point position information. The image feature expansion network first expands the extracted first extracted feature so that as many image features as possible are fed into the input of the pose estimation network, and the image feature compression network then compresses the second extracted feature; together, these improve both the efficiency and the accuracy of pose estimation.
In one embodiment, as shown in FIG. 3, the image feature expansion network includes a plurality of feature convolution channels, and performing feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image feature includes:

Step 302: input the first extracted feature into the plurality of feature convolution channels of the image feature expansion network, each feature convolution channel convolving the first extracted feature with a feature-dimension-preserving convolution kernel to obtain the convolution feature output by that channel.

Here, a feature-dimension-preserving convolution kernel is a convolution kernel that keeps the dimensionality of the image unchanged, where the dimensionality of an image refers to its number of channels; an example is a kernel of size 1*1.

Specifically, by setting a feature-dimension-preserving convolution kernel and convolving the first extracted feature in each feature convolution channel, the terminal can obtain the convolution features output by the feature convolution channels using fewer parameters, while keeping the scale of the first extracted feature unchanged.

In one embodiment, the terminal sets the number and size of the feature-dimension-preserving convolution kernels and convolves the features extracted in each feature convolution channel to obtain the convolution features output by the channels. For example, on a 64-channel network of 3*3 convolutions, adding a convolution kernel of size 1*1 with 256 channels expands the channel count of the original network from 64 to 256 using only 64*256 parameters, as checked in the sketch below.
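The parameter count in this example can be verified directly; a minimal sketch (the bias is omitted to match the 64*256 figure above):

```python
import torch.nn as nn

# A 1*1 kernel leaves the spatial size unchanged and only changes the
# channel count, here expanding 64 channels to 256.
expand = nn.Conv2d(64, 256, kernel_size=1, bias=False)

# 64 * 256 = 16384 weights, exactly the parameter count stated above.
assert sum(p.numel() for p in expand.parameters()) == 64 * 256
```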
Step 304: combine the convolution features output by the feature convolution channels to obtain the expanded image feature.

Specifically, after the convolution features output by the feature convolution channels are obtained, the feature-dimension-preserving convolution kernel can linearly combine each pixel of the first extracted feature across the different channels to obtain the expanded image feature. For example, the expansion network can be composed by adding a 1*1, 28-channel convolution kernel after a 3*3, 64-channel convolution kernel, turning it into a 3*3, 28-channel convolution: the original 64 channels are linearly combined across channels into 28 channels, realizing information interaction between channels, and the expanded image feature is obtained from the convolution features output by the feature convolution channels.

In this embodiment, in the image feature expansion network, the first extracted feature in each feature convolution channel is convolved with a feature-dimension-preserving convolution kernel to obtain the convolution feature output by that channel, and the convolution features output by the channels are combined to obtain the expanded image feature. This achieves the goal of obtaining the expanded image feature with a relatively small number of parameters, thereby improving the efficiency of pose estimation.
In one embodiment, as shown in FIG. 4, determining the key point position information corresponding to the target object in the target image based on the compressed image feature includes:

Step 402: amplify the compressed image feature to obtain an enlarged image feature.

Specifically, after the compressed image feature is obtained, the enlarged image feature is obtained by up-sampling the feature.

In one embodiment, the terminal amplifies the compressed image feature through a three-layer up-sampling network whose input/output channel counts are set to (256, 128), (128, 64), and (64, 64) respectively, which reduces the number of network parameters and the amount of computation.

In one embodiment, the terminal performs interpolation on the compressed image feature to obtain the enlarged image feature. For example, on the basis of the compressed image feature, new elements are inserted between pixels using a suitable interpolation algorithm such as linear or bilinear interpolation.
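One way to realize the three-layer up-sampling described above, shown here as a sketch rather than the patent's fixed structure, is bilinear interpolation followed by a convolution at each stage, using the stated input/output channel pairs (256, 128), (128, 64), (64, 64); the 2x scale factor, the 3*3 kernel, and the ReLU are assumptions.

```python
import torch.nn as nn

def upsample_stage(in_ch, out_ch):
    # Bilinear 2x up-sampling followed by a 3*3 convolution; the kernel size
    # and scale factor are illustrative assumptions, not fixed by the text.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

# Input/output channels per stage as stated: (256,128), (128,64), (64,64).
upsampler = nn.Sequential(
    upsample_stage(256, 128),
    upsample_stage(128, 64),
    upsample_stage(64, 64),
)
```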
Step 404: convolve the enlarged image feature to obtain a third extracted feature.

Specifically, after the enlarged image feature is obtained, it is convolved to obtain the third extracted feature, in order to compensate for the reduction of nonlinear units during the amplification of the compressed image feature.

Step 406: determine, based on the third extracted feature, the key point position information corresponding to the target object in the target image.

Specifically, after the third extracted feature is obtained, operations such as lookup and filtering are performed on it to determine the key point position information corresponding to the target object in the target image.

In one embodiment, the terminal stores a matching list of image features and key point position information. After the third extracted feature is obtained, this list is traversed to obtain the key point position information corresponding to the third extracted feature, that is, the key point position information corresponding to the target object in the target image.
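The text above decodes key points by traversing a stored matching list. A widely used concrete alternative, shown purely as an illustrative sketch and not as the patent's own method, treats the third extracted feature as one heatmap per key point and takes each channel's argmax as that key point's position:

```python
import torch

def decode_keypoints(heatmaps):
    """heatmaps: [K, H, W] tensor, one channel per key point (assumed layout).
    Returns one (x, y) position per key point from each channel's maximum."""
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(dim=1)    # [K] flat indices
    ys = torch.div(flat_idx, w, rounding_mode="floor")  # row index
    xs = flat_idx % w                                   # column index
    return [(int(x), int(y)) for x, y in zip(xs, ys)]
```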
In this embodiment, the compressed image feature is amplified to obtain an enlarged image feature, the enlarged image feature is convolved to obtain a third extracted feature, and the key point position information corresponding to the target object in the target image is determined based on the third extracted feature. This improves the quality of the image output and thus allows the key point position information corresponding to the target object to be obtained more accurately.
In one embodiment, as shown in FIG. 5, acquiring the target image to be subjected to pose estimation includes:

Step 502: acquire an initial image.

Here, the initial image is an unprocessed original image, that is, an image obtained directly by the image acquisition device or the terminal.

In one embodiment, the terminal can capture the initial image through a connected image acquisition device, which transmits the captured initial image to the terminal in real time; alternatively, the acquisition device temporarily stores the captured initial image locally and transmits it to the terminal when it receives an image acquisition instruction from the terminal, so that the terminal obtains the initial image.

In one embodiment, the terminal captures the initial image through an internal image acquisition module and stores the captured image in the terminal's memory; when the terminal needs the initial image, it retrieves it from the memory.
Step 504: perform object detection on the initial image to obtain the probabilities that a plurality of candidate image regions in the initial image each include the target object.

Specifically, after the initial image is acquired, it is divided into multiple image sub-regions as candidate image regions, and the probability that the target object appears in each candidate image region is detected. For example, if the image is divided into sub-regions A, B, and C, the probability that the target object is in sub-region A may be 0%, in sub-region B 10%, and in sub-region C 90%.

In one embodiment, the terminal obtains the probability that each image sub-region includes the target object by gradually reducing the size of the image sub-regions.
Step 506: select, based on the probabilities that the candidate image regions include the target object, an object image region including the target object from the candidate image regions.

Specifically, after obtaining the probabilities that the candidate image regions in the initial image each include the target object, the terminal can compare the probabilities of the candidate image regions, find a candidate image region whose probability falls within a preset probability threshold range, and use that candidate image region as the object image region including the target object.

In one embodiment, the terminal traverses the probabilities that the candidate image regions include the target object, finds the maximum probability value among them, and uses the candidate image region corresponding to the maximum probability value as the object image region of the target object.
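A minimal sketch of this selection step (names are hypothetical): pick the candidate region whose detection probability is highest.

```python
def select_object_region(candidates):
    """candidates: list of (region, probability) pairs from the detector."""
    region, _prob = max(candidates, key=lambda rp: rp[1])
    return region
```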
Step 508: extract an intercepted image region from the initial image according to the object image region, and use the intercepted image as the target image to be subjected to pose estimation.

Specifically, after obtaining the object image region including the target object, the terminal can, based on the position information of this region, intercept the image of the region as the target image to be subjected to pose estimation, reducing the amount of computation during pose estimation and improving its efficiency.

In one embodiment, the terminal can extract the coordinate information of the object image region and use this coordinate information to intercept the target image for pose estimation.

In one embodiment, a frame-selection operation performed by the user on the target object can be received, the framed image region is used as the object image region, and the framed region is intercepted from the initial image as the target image to be subjected to pose estimation.

In this embodiment, an initial image is acquired; object detection is performed on the initial image to obtain the probabilities that a plurality of candidate image regions each include the target object; an object image region including the target object is selected from the candidate image regions based on these probabilities; the object image region is extracted from the initial image; and the intercepted image is used as the target image to be subjected to pose estimation. This achieves the goal of accurately obtaining the target image from the initial image.
In one embodiment, as shown in FIG. 6, extracting the intercepted image region from the initial image according to the object image region and using the intercepted image as the target image to be subjected to pose estimation includes:

Step 602: acquire the center coordinates of the object image region.

Here, the center coordinates are the coordinates of the pixel located at the center point of the object image region. They are the coordinates, in the initial image, of the pixel at the center of the target object's object image region, and can be determined from the length and width of the initial image.

Specifically, after the object image region is determined, the terminal takes the pixel at the center of the object image region and obtains its coordinates through a pixel coordinate acquisition tool.

Step 604: acquire the region size corresponding to the object image region, and obtain a region extension value based on the region size and a size expansion coefficient.

Here, the region size refers to the length and width of the object image region; for example, if the region length is h and the region width is w, the region size is w*h. The size expansion coefficient is a coefficient that increases the region size, and the region extension value is the increment of the region size obtained by correcting the region size with the size expansion coefficient.

Specifically, after obtaining the center coordinates of the object image region, the terminal can acquire the region size corresponding to the object image region and obtain the region extension value through the functional relationship between the region size and the size expansion coefficient.

In one embodiment, the terminal acquires the region size corresponding to the object image region through an image size measurement tool and obtains the region extension value from the product of the region size and the size expansion coefficient. For example, if the region width is w and the size expansion coefficient is exp_ratio, the region extension value for the width is w*exp_ratio*1.2/2; similarly, the region extension value for the region length is obtained through the corresponding size expansion coefficient.
Step 606: extend from the center coordinates in the extension direction corresponding to the region extension value to obtain extension coordinates.

Here, the extension direction is the direction, corresponding to the region extension value, in which the width and length increase. The extension coordinates are the coordinates of the object image region obtained by extending it in the extension direction with the center coordinates as the reference point; they can be represented by the upper-left and lower-right corner coordinates of the object image region.

Specifically, after obtaining the region extension value, the terminal can take the center coordinates as the reference point and expand the object image region by the region extension value to obtain the extension coordinates corresponding to the object image region, so that the object image region can include a more complete target object.

In one embodiment, the center coordinates are denoted (x, y) and the extension coordinates are (x0, y0) and (x1, y1), where x0 and x1 are the coordinates of the region extension in the width direction of the image, and y0 and y1 are the coordinates of the region extension in the length direction. The extension coordinates can then be expressed by the formulas:
x0 = int(x - w * exp_ratio * 1.2 / 2)

x1 = int(x + w * exp_ratio * 1.2 / 2)

y0 = int(y - h * exp_ratio * 0.8 / 2)

y1 = int(y + h * exp_ratio * 0.8 / 2)
In one embodiment, when the region extension value in the width direction is less than or equal to 0, the extension value is taken as zero; when it is greater than or equal to the width of the initial image, the width of the initial image is used as the region extension value. Similarly, when the region extension value in the length direction is less than or equal to zero, it is taken as zero; when it is greater than or equal to the height of the initial image, the height of the initial image is used as the region extension value.
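Putting the formulas and the boundary handling together, a sketch of the region extension (the 1.2 and 0.8 factors and exp_ratio come from the formulas above; the clamping follows the boundary rules just described):

```python
def extend_region(x, y, w, h, exp_ratio, img_w, img_h):
    """(x, y): center coordinates of the object image region; (w, h): region size."""
    x0 = int(x - w * exp_ratio * 1.2 / 2)
    x1 = int(x + w * exp_ratio * 1.2 / 2)
    y0 = int(y - h * exp_ratio * 0.8 / 2)
    y1 = int(y + h * exp_ratio * 0.8 / 2)
    # Boundary handling: values at or below 0 become 0, and values at or
    # beyond the initial image's width/height are truncated to it.
    x0, x1 = max(x0, 0), min(x1, img_w)
    y0, y1 = max(y0, 0), min(y1, img_h)
    return x0, y0, x1, y1  # crop initial_image[y0:y1, x0:x1] as the target image
```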
Step 608: use the image region located within the extension coordinates as the intercepted image region, and use the intercepted image as the target image to be subjected to pose estimation.

Specifically, after obtaining the extension coordinates, the terminal can use the image region within the extension coordinates as the intercepted image region and the intercepted image as the target image to be subjected to pose estimation.

In this embodiment, the center coordinates of the object image region and the region size corresponding to the object image region are acquired; a region extension value is obtained based on the region size and the size expansion coefficient; extension is performed from the center coordinates in the extension direction corresponding to the region extension value to obtain extension coordinates; the image region within the extension coordinates is used as the intercepted image region; and the intercepted image is used as the target image to be subjected to pose estimation. This achieves the goal of accurately intercepting the target image and thus improves the efficiency of pose estimation.
In one embodiment, as shown in FIG. 7, there are multiple pieces of key point position information, and the method further includes:

Step 702: according to the mapping relationship between the key point position information and the target point position information, convert each piece of key point position information into the corresponding target point position information; the target point position information is the position information of the key point in the initial image.

Here, position information refers to information reflecting the location of a position point. The position point may be a key point, or another point with the same structure or function as a key point. The position-related information may be the coordinates of the key point or a description of its position. For example, key point position information may be expressed as (100, 100, eye).

Specifically, there is a one-to-one correspondence between the key point position information in the target image and the target point position information in the initial image, and the two can be converted into each other. Once the key point position information is known, the corresponding position of that key point in the initial image can be obtained, so that the position information derived from the target image is accurately reflected in the initial image.

In one embodiment, the key point position information of the j-th key point is denoted (x_keypoints_j, y_keypoints_j), the top-left vertex of the i-th target image in the initial image is denoted (x_person_i, y_person_i), and the coordinates of the key point in the initial image are denoted (x_original_keypoints, y_original_keypoints); the coordinates of the key point in the initial image are then given by:
x_original_keypoints = x_person_i + x_keypoints_j
y_original_keypoints = y_person_i + y_keypoints_j
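As a minimal sketch of this mapping (the variable names follow the formulas above; the helper itself is illustrative):

def to_original_coords(x_keypoints_j, y_keypoints_j, x_person_i, y_person_i):
    # Shift a key point found in the cropped target image by the crop's
    # top-left vertex to obtain its position in the initial image.
    x_original_keypoints = x_person_i + x_keypoints_j
    y_original_keypoints = y_person_i + y_keypoints_j
    return x_original_keypoints, y_original_keypoints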
Step 704: perform pose estimation on the target object based on each piece of target point position information to obtain the target pose corresponding to the target image.

Specifically, after determining the target point position information corresponding to the multiple key points, the terminal performs pose estimation through the correspondence between the specific type of each key point and its target point position information, obtaining the target pose corresponding to the target image.

In this embodiment, each piece of key point position information is converted into the corresponding target point position information through the mapping relationship between the two, and pose estimation is performed on the target object based on each piece of target point position information, yielding the target pose corresponding to the target image. This achieves the purpose of obtaining the target pose in the target image.
In one embodiment, as shown in FIG. 8, the target video generation method includes:

Step 802: acquire a target action and determine the pose sequence corresponding to the target action, where executing the poses in the pose sequence in order produces the target action.

Here, the target action refers to the action obtained after the individual poses are executed in sequence, and a pose is one of the sub-actions that make up an action. For example, if the target action is an arm-stretching exercise, it is composed of sub-actions such as holding the arms flat, straightening the arms, and turning sideways with the arms together. Multiple poses form a pose sequence according to their order, and executing the sequence in order yields the target action.
Step 804: according to the steps in the method embodiments above, acquire the target pose corresponding to each target image in the target image set.

Specifically, after acquiring the target action, the terminal can obtain the individual poses based on the pose estimation methods above. The poses that make up the target action lie in different target images, and the corresponding pose can be obtained from each target image. For example, target pose F exists in target image E, target pose H exists in target image G, and so on.

In one embodiment, according to the correspondence between poses and images, the target pose corresponding to each target image is obtained from the target image set.
Step 806: acquire, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image.

Specifically, after acquiring the target poses, the terminal obtains the image corresponding to each target pose and uses each obtained image as a video frame image.

In one embodiment, the timestamp of the image corresponding to each target pose is acquired, and the images carrying their respective timestamps are used as the video frame images.
Step 808: arrange the obtained video frame images according to the order of the poses in the pose sequence to obtain the target video corresponding to the target action.

Specifically, there is a one-to-one correspondence between the poses in the pose sequence and the video frame images; arranging the video frame images according to the order of the poses in the pose sequence yields the target video corresponding to the target action.

In one embodiment, there is a binding correspondence between the order of the poses in the pose sequence and the timestamps of the video frame images. Sorting the poses in the pose sequence therefore means that, once the video frame images are obtained, they are arranged according to their timestamps, and the target video is obtained from the arranged video frame images.
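A minimal sketch of steps 802 to 808, assuming a pose-to-frame mapping built from the target image set (the function names and the dictionary structure are illustrative, not taken from the embodiment):

def build_target_video(pose_sequence, pose_to_frame):
    # Arrange the video frame images in the order the poses appear in the
    # pose sequence.
    return [pose_to_frame[pose] for pose in pose_sequence]

def build_target_video_by_timestamp(frames):
    # Equivalent arrangement when each video frame image carries a timestamp
    # bound to the pose order, as in the embodiment above.
    return sorted(frames, key=lambda frame: frame["timestamp"])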
In this embodiment, the target action and the target pose corresponding to each target image in the target image set are acquired, the image corresponding to each target pose in the pose sequence is obtained from the target image set as a video frame image, and the video frame images are arranged according to the order of the poses in the pose sequence to obtain the target video corresponding to the target action. This achieves the purpose of obtaining the target video of the target action through pose estimation, giving pose estimation a practical application.
In one embodiment, taking the terminal being a panoramic camera and the target object being a human body as an example, as shown in FIG. 9, a panoramic image is obtained either directly from the panoramic camera or by extracting video frames from a panoramic video captured by the camera; the image generally needs to contain the human target object on which human pose estimation is to be performed. The human target object may be complete, may contain only part of the body, or may be partially occluded. As shown in FIG. 10, after the panoramic image is normalized, the coordinates of the human body bounding box B1 are obtained through a human body tracking or detection algorithm, and the coordinates of the expanded bounding box B2 are obtained from the coordinates of B1 or by expanding B1. As shown in FIG. 11, the panoramic image is cropped to obtain a sub-panoramic image bounded by B2. After the sub-panoramic image is normalized, it is input into the trained human pose estimation model; as shown in FIG. 12, the heat map of the first key point C is obtained, the sub-panoramic image is input into the model again to obtain the heat map of the second key point, and so on, until the preset number of heat maps with key points is obtained. The coordinates of each key point in its heat map are mapped to the sub-panoramic image, and the key point's position in the sub-panoramic image is then mapped into the panoramic image, giving the position of the key point in the panoramic image and thereby estimating the pose of the human body.
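To make the final coordinate mapping of this pipeline concrete, here is a minimal sketch; taking the heat map argmax as the key point and rescaling linearly to the crop size are assumptions, since the embodiment does not fix how coordinates are read off the heat maps:

import numpy as np

def heatmaps_to_panorama_keypoints(heatmaps, crop_origin, crop_size):
    # crop_origin: top-left corner (x0, y0) of box B2 in the panoramic image;
    # crop_size: (width, height) of the cropped sub-panoramic image.
    x0, y0 = crop_origin
    crop_w, crop_h = crop_size
    keypoints = []
    for heatmap in heatmaps:  # one 2-D heat map per key point
        hm_h, hm_w = heatmap.shape
        # Take the hottest location as the key point in heat map coordinates.
        iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        # Rescale to sub-panorama coordinates, then shift into the panorama.
        keypoints.append((x0 + ix * crop_w / hm_w, y0 + iy * crop_h / hm_h))
    return keypoints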
In one embodiment, whether the terminal normalizes the panoramic image or the sub-panoramic image, the pixel value of each pixel in the normalized image can be obtained from the proportional relationship between the normalized pixel value and the difference between the pixel value in the original image and the average pixel value. Suppose the pixel value of a pixel in the normalized image is denoted X_normalization, the pixel value of a pixel in the panoramic or sub-panoramic image is denoted X, the average pixel value over all pixels in the panoramic or sub-panoramic image is denoted mean, and the proportionality factor is denoted std; then X_normalization is given by:
X_normalization = (X - mean) / std
It can be understood that std may be the variance of all pixels in the panoramic or sub-panoramic image, and that a pixel in the panoramic or sub-panoramic image may be a three-channel RGB (red, green, and blue) pixel.
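A minimal sketch of this normalization for a three-channel RGB image; computing mean and std from the image itself is an assumption (fixed dataset statistics would satisfy the formula equally well):

import numpy as np

def normalize(image):
    # X_normalization = (X - mean) / std, applied per RGB channel. Here std is
    # the standard deviation; the text also allows the variance as the scale.
    image = image.astype(np.float32)
    mean = image.mean(axis=(0, 1))  # average over all pixels, per channel
    std = image.std(axis=(0, 1))    # scale factor, per channel
    return (image - mean) / std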
In one embodiment, the terminal may obtain the coordinates of the human body bounding box through a human body detection algorithm, for example Faster RCNN (Faster Region-CNN), algorithms of the YOLO (You Only Look Once) family, or algorithms of the SSD (Single Shot MultiBox Detector) family, or through a tracking algorithm such as a Siamese-network tracking algorithm.

In one embodiment, as shown in FIG. 13, the human pose estimation model can be obtained by reducing the number of image feature blocks between stages in HRNet, for example by reducing the number of down-sampled image feature blocks in the second stage to 1, so that the human pose estimation model needs fewer parameters and less computation, improving the efficiency of human pose estimation.
It should be understood that, although the steps in the flowcharts of FIGS. 2-8 are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-8 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 14, a pose estimation apparatus 1400 is provided, including a target image acquisition module 1402, a first feature extraction module 1404, an expanded image feature obtaining module 1406, a second feature extraction module 1408, a compressed image feature obtaining module 1410, and a key point position information determination module 1412. The target image acquisition module 1402 is configured to acquire the target image to be subjected to pose estimation, the target image including the target object to be processed. The first feature extraction module 1404 is configured to perform feature extraction based on the target image to obtain first extracted features. The expanded image feature obtaining module 1406 is configured to perform feature expansion on the first extracted features through an image feature expansion network to obtain expanded image features. The second feature extraction module 1408 is configured to perform feature extraction on the expanded image features to obtain second extracted features. The compressed image feature obtaining module 1410 is configured to perform feature compression on the second extracted features through an image feature compression network to obtain compressed image features. The key point position information determination module 1412 is configured to determine the key point position information corresponding to the target object in the target image based on the compressed image features, and to perform pose estimation on the target object based on the key point position information.
In one embodiment, the expanded image feature obtaining module 1406 is configured to input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network; each feature convolution channel convolves the first extracted features with a feature-dimension-preserving convolution kernel to obtain the convolution features output by that channel, and the convolution features output by the channels are combined to obtain the expanded image features.

In one embodiment, the key point position information determination module 1412 is configured to enlarge the compressed image features to obtain enlarged image features, convolve the enlarged image features to obtain third extracted features, and determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
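A minimal PyTorch sketch of the expand/extract/compress/enlarge flow these modules describe; the channel counts, the use of 1x1 convolutions as feature-dimension-preserving kernels, and bilinear enlargement are assumptions for illustration, not the patented network:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHead(nn.Module):
    def __init__(self, in_ch=64, branches=4, num_keypoints=17):
        super().__init__()
        # Feature expansion: parallel feature convolution channels whose
        # outputs are combined, widening the feature dimension.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, in_ch, kernel_size=1) for _ in range(branches)])
        # Further feature extraction on the expanded features.
        self.extract = nn.Conv2d(in_ch * branches, in_ch * branches,
                                 kernel_size=3, padding=1)
        # Feature compression: reduce the channel count back down.
        self.compress = nn.Conv2d(in_ch * branches, in_ch, kernel_size=1)
        # Convolution over the enlarged features yields one heat map per
        # key point (the third extracted features).
        self.head = nn.Conv2d(in_ch, num_keypoints, kernel_size=3, padding=1)

    def forward(self, x):
        expanded = torch.cat([branch(x) for branch in self.branches], dim=1)
        extracted = self.extract(expanded)
        compressed = self.compress(extracted)
        enlarged = F.interpolate(compressed, scale_factor=2, mode="bilinear",
                                 align_corners=False)
        return self.head(enlarged)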
In one embodiment, the target image acquisition module 1402 is configured to acquire an initial image; perform object detection on the initial image to obtain the probability that each of multiple candidate image areas in the initial image includes the target object; select, based on these probabilities, an object image area including the target object from the candidate image areas; extract an intercepted image area from the initial image according to the object image area; and use the intercepted image as the target image for pose estimation.

In one embodiment, the target image acquisition module 1402 is configured to acquire the center coordinates of the object image area; acquire the area size corresponding to the object image area and obtain the area extension value based on the area size and the size expansion coefficient; extend from the center coordinates in the direction corresponding to the area extension value to obtain the extension coordinates; take the image area within the extension coordinates as the intercepted image area; and use the intercepted image as the target image for pose estimation.

In one embodiment, the target image acquisition module 1402 is configured to convert each piece of key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information, the target point position information being the position of the key point in the initial image, and to perform pose estimation on the target object based on each piece of target point position information to obtain the target pose corresponding to the target image.

In one embodiment, a target video generation apparatus is configured to acquire a target action and determine the pose sequence corresponding to the target action, the poses in the pose sequence being executed in order to produce the target action; acquire the target pose corresponding to each target image in the target image set; acquire, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image; and arrange the obtained video frame images according to the order of the poses in the pose sequence to obtain the target video corresponding to the target action.
For specific limitations on the pose estimation apparatus, refer to the limitations on the pose estimation method above, which are not repeated here. Each module in the pose estimation apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 15. The computer device includes a processor, a memory, a communication interface, a display screen, and an input apparatus connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, a carrier network, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, a pose estimation method is implemented. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input apparatus may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 15 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is also provided, including a memory and a processor, with a computer program stored in the memory; the processor implements the steps in the above method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps in the above method embodiments are implemented.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (16)

1. A pose estimation method, characterized in that the method comprises:
    acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed;
    performing feature extraction based on the target image to obtain first extracted features;
    performing feature expansion on the first extracted features through an image feature expansion network to obtain expanded image features;
    performing feature extraction on the expanded image features to obtain second extracted features;
    performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features;
    determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
2. The method according to claim 1, characterized in that the image feature expansion network comprises a plurality of feature convolution channels, and performing feature expansion on the first extracted features through the image feature expansion network to obtain the expanded image features comprises:
    inputting the first extracted features respectively into the plurality of feature convolution channels corresponding to the image feature expansion network, each of the feature convolution channels convolving the first extracted features with a feature-dimension-preserving convolution kernel, to obtain the convolution features output by each of the feature convolution channels;
    combining the convolution features output by the feature convolution channels to obtain the expanded image features.
3. The method according to claim 1, characterized in that determining the key point position information corresponding to the target object in the target image based on the compressed image features comprises:
    enlarging the compressed image features to obtain enlarged image features;
    convolving the enlarged image features to obtain third extracted features;
    determining, based on the third extracted features, the key point position information corresponding to the target object in the target image.
4. The method according to claim 1, characterized in that acquiring the target image to be subjected to pose estimation comprises:
    acquiring an initial image;
    performing object detection on the initial image to obtain probabilities that a plurality of candidate image areas in the initial image respectively include a target object;
    selecting an object image area including the target object from the candidate image areas based on the probabilities that the candidate image areas include the target object;
    extracting an intercepted image area from the initial image according to the object image area, and using the intercepted image as the target image to be subjected to pose estimation.
5. The method according to claim 4, characterized in that extracting the intercepted image area from the initial image according to the object image area and using the intercepted image as the target image to be subjected to pose estimation comprises:
    acquiring center coordinates of the object image area;
    acquiring an area size corresponding to the object image area, and obtaining an area extension value based on the area size and a size expansion coefficient;
    extending, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value, to obtain extension coordinates;
    taking the image area within the extension coordinates as the intercepted image area, and using the intercepted image as the target image to be subjected to pose estimation.
6. The method according to claim 4, characterized in that there are multiple pieces of key point position information, and the method further comprises:
    converting each piece of key point position information into corresponding target point position information according to a mapping relationship between the key point position information and the target point position information, the target point position information being the position information of the key point position information in the initial image;
    performing pose estimation on the target object based on each piece of target point position information to obtain a target pose corresponding to the target image.
7. A target video generation method, characterized in that the method comprises:
    acquiring a target action and determining a pose sequence corresponding to the target action, the poses in the pose sequence being executed in order to obtain the target action;
    processing target images to acquire a target pose corresponding to each target image in a target image set;
    acquiring, from the target image set, the image corresponding to each target pose in the pose sequence as a video frame image;
    arranging the obtained video frame images according to the order of the poses in the pose sequence to obtain a target video corresponding to the target action.
8. A target video generation method, characterized in that processing the target images to acquire the target pose corresponding to each target image in the target image set comprises:
    acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed;
    performing feature extraction based on the target image to obtain first extracted features;
    performing feature expansion on the first extracted features through an image feature expansion network to obtain expanded image features;
    performing feature extraction on the expanded image features to obtain second extracted features;
    performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features;
    determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
9. The method according to claim 8, characterized in that the image feature expansion network comprises a plurality of feature convolution channels, and performing feature expansion on the first extracted features through the image feature expansion network to obtain the expanded image features comprises:
    inputting the first extracted features respectively into the plurality of feature convolution channels corresponding to the image feature expansion network, each of the feature convolution channels convolving the first extracted features with a feature-dimension-preserving convolution kernel, to obtain the convolution features output by each of the feature convolution channels;
    combining the convolution features output by the feature convolution channels to obtain the expanded image features.
10. The method according to claim 8, characterized in that determining the key point position information corresponding to the target object in the target image based on the compressed image features comprises:
    enlarging the compressed image features to obtain enlarged image features;
    convolving the enlarged image features to obtain third extracted features;
    determining, based on the third extracted features, the key point position information corresponding to the target object in the target image.
11. The method according to claim 8, characterized in that acquiring the target image to be subjected to pose estimation comprises:
    acquiring an initial image;
    performing object detection on the initial image to obtain probabilities that a plurality of candidate image areas in the initial image respectively include a target object;
    selecting an object image area including the target object from the candidate image areas based on the probabilities that the candidate image areas include the target object;
    extracting an intercepted image area from the initial image according to the object image area, and using the intercepted image as the target image to be subjected to pose estimation.
12. The method according to claim 11, characterized in that extracting the intercepted image area from the initial image according to the object image area and using the intercepted image as the target image to be subjected to pose estimation comprises:
    acquiring center coordinates of the object image area;
    acquiring an area size corresponding to the object image area, and obtaining an area extension value based on the area size and a size expansion coefficient;
    extending, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value, to obtain extension coordinates;
    taking the image area within the extension coordinates as the intercepted image area, and using the intercepted image as the target image to be subjected to pose estimation.
13. The method according to claim 11, characterized in that there are multiple pieces of key point position information, and the method further comprises:
    converting each piece of key point position information into corresponding target point position information according to a mapping relationship between the key point position information and the target point position information, the target point position information being the position information of the key point position information in the initial image;
    performing pose estimation on the target object based on each piece of target point position information to obtain a target pose corresponding to the target image.
14. A pose estimation apparatus, characterized in that the apparatus comprises:
    a target image acquisition module, configured to acquire a target image to be subjected to pose estimation, the target image including a target object to be processed;
    a first feature extraction module, configured to perform feature extraction based on the target image to obtain first extracted features;
    an expanded image feature obtaining module, configured to perform feature expansion on the first extracted features through an image feature expansion network to obtain expanded image features;
    a second feature extraction module, configured to perform feature extraction on the expanded image features to obtain second extracted features;
    a compressed image feature obtaining module, configured to perform feature compression on the second extracted features through an image feature compression network to obtain compressed image features;
    a key point position information determination module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image features, and to perform pose estimation on the target object based on the key point position information.
15. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6 or claims 7 to 13.
16. A computer-readable storage medium, on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 6 or claims 7 to 13 are implemented.
PCT/CN2022/091484 2021-05-12 2022-05-07 Method and apparatus for pose estimation, computer device, and storage medium WO2022237688A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110517805.3 2021-05-12
CN202110517805.3A CN113158974A (en) 2021-05-12 2021-05-12 Attitude estimation method, attitude estimation device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022237688A1

Family

ID=76874942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091484 WO2022237688A1 (en) 2021-05-12 2022-05-07 Method and apparatus for pose estimation, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN113158974A (en)
WO (1) WO2022237688A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071785A (en) * 2023-03-06 2023-05-05 合肥工业大学 Human body posture estimation method based on multidimensional space interaction

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158974A (en) * 2021-05-12 2021-07-23 影石创新科技股份有限公司 Attitude estimation method, attitude estimation device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
US20190012790A1 (en) * 2017-07-05 2019-01-10 Canon Kabushiki Kaisha Image processing apparatus, training apparatus, image processing method, training method, and storage medium
CN112241731A (en) * 2020-12-03 2021-01-19 北京沃东天骏信息技术有限公司 Attitude determination method, device, equipment and storage medium
CN112308950A (en) * 2020-08-25 2021-02-02 北京沃东天骏信息技术有限公司 Video generation method and device
CN112614184A (en) * 2020-12-28 2021-04-06 清华大学 Object 6D attitude estimation method and device based on 2D detection and computer equipment
CN113158974A (en) * 2021-05-12 2021-07-23 影石创新科技股份有限公司 Attitude estimation method, attitude estimation device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626218B (en) * 2020-05-28 2023-12-26 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium based on artificial intelligence
CN112347861B (en) * 2020-10-16 2023-12-05 浙江工商大学 Human body posture estimation method based on motion feature constraint


Also Published As

Publication number Publication date
CN113158974A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
JP6961749B2 (en) A configurable convolution engine for interleaved channel data
JP6789402B2 (en) Method of determining the appearance of an object in an image, equipment, equipment and storage medium
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
JP6560480B2 (en) Image processing system, image processing method, and program
WO2020010979A1 (en) Method and apparatus for training model for recognizing key points of hand, and method and apparatus for recognizing key points of hand
WO2020164270A1 (en) Deep-learning-based pedestrian detection method, system and apparatus, and storage medium
US8861800B2 (en) Rapid 3D face reconstruction from a 2D image and methods using such rapid 3D face reconstruction
WO2022237688A1 (en) Method and apparatus for pose estimation, computer device, and storage medium
WO2020134528A1 (en) Target detection method and related product
WO2020134818A1 (en) Image processing method and related product
US20230085468A1 (en) Advanced Automatic Rig Creation Processes
US20240153213A1 (en) Data acquisition and reconstruction method and system for human body three-dimensional modeling based on single mobile phone
CN112506340A (en) Device control method, device, electronic device and storage medium
TW202011284A (en) Eye state detection system and method for operating an eye state detection system
WO2020223940A1 (en) Posture prediction method, computer device and storage medium
CN114640833A (en) Projection picture adjusting method and device, electronic equipment and storage medium
Chang et al. Salgaze: Personalizing gaze estimation using visual saliency
WO2022063321A1 (en) Image processing method and apparatus, device and storage medium
WO2019000464A1 (en) Image display method and device, storage medium, and terminal
CN116452745A (en) Hand modeling, hand model processing method, device and medium
EP4135317A2 (en) Stereoscopic image acquisition method, electronic device and storage medium
JP6467994B2 (en) Image processing program, image processing apparatus, and image processing method
US10013736B1 (en) Image perspective transformation system
WO2023160072A1 (en) Human-computer interaction method and apparatus in augmented reality (ar) scene, and electronic device
WO2022099492A1 (en) Image processing method, apparatus and device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22806654

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22806654

Country of ref document: EP

Kind code of ref document: A1