CN116934853A - Single-target pose estimation method and device, and electronic device - Google Patents

Single-target pose estimation method and device, and electronic device

Info

Publication number
CN116934853A
CN116934853A
Authority
CN
China
Prior art keywords
image
target
processed
feature map
map
Prior art date
Legal status
Pending
Application number
CN202310781993.XA
Other languages
Chinese (zh)
Inventor
唐国令
韩亚宁
蔚鹏飞
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202310781993.XA priority Critical patent/CN116934853A/en
Publication of CN116934853A publication Critical patent/CN116934853A/en
Pending legal-status Critical Current

Classifications

    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The application provides a single-target pose estimation method and device, and an electronic device, relating to the technical field of computer vision. The single-target pose estimation method comprises the following steps: acquiring an image to be processed of a target; learning image features within the receptive field range of the image to be processed to obtain at least one local feature map, the local feature map reflecting local features of the image to be processed at different positions; based on the at least one local feature map, establishing relationships between image features beyond the receptive field range in the image to be processed to obtain a global feature map; obtaining a keypoint information map through feature fusion of each local feature map and the global feature map; and performing pose estimation of the target based on the keypoint information map to obtain a pose estimation result, the pose estimation result comprising the keypoint positions of the target in the image to be processed. The application addresses the technical problems in the related art of low pose estimation accuracy and high model training cost.

Description

Single-target pose estimation method and device, and electronic device
Technical Field
The application relates to the technical field of computer vision, and in particular to a single-target pose estimation method and device, and an electronic device.
Background
Target pose estimation refers to locating the key points of a target in a two-dimensional image or video in order to accurately understand the target's behavior. Target pose estimation is widely applied in many fields and can help us better understand and control objects and motion; its accuracy has an important influence on the effectiveness of downstream tasks. For example, one important subject of modern biology is clarifying the relationship between neural activity and behavior, and pose estimation is an important step in this: accurate pose estimation guarantees the fineness of behavior-neural activity analysis.
Pose estimation therefore needs to be sufficiently accurate. However, current pose estimation methods are implemented based on convolutional neural networks, and the convolution operation has an inherent limitation: it can only capture image feature relationships within a certain range and cannot establish relationships between pixels beyond the range of the convolution kernel. As a result, a point-drift phenomenon occurs, the detection accuracy of target key points is low, and the accuracy of pose estimation is low. In addition, some models used for pose estimation have many parameters and are costly to train.
It can be seen from the above that the prior art has the technical problems of low pose estimation accuracy and high model training cost.
Disclosure of Invention
The application provides a single-target pose estimation method and device, and an electronic device, which can solve the technical problems of low pose estimation accuracy and high model training cost in the related art. The technical scheme is as follows:
according to one aspect of the present application, a single target pose estimation method includes: acquiring an image to be processed of a target; learning image features in a receptive field range in the image to be processed to obtain at least one local feature map; the local feature map reflects local features of the image to be processed at different positions; based on at least one local feature map, establishing a relation between image features exceeding a receptive field range in the image to be processed, and obtaining a global feature map; obtaining a key point information graph through feature fusion of each local feature graph and the global feature graph; carrying out attitude estimation of the target based on the key point information graph to obtain an attitude estimation result; the gesture estimation result comprises the key point position of the target in the image to be processed.
According to one aspect of the present application, a single-target pose estimation apparatus includes: an image acquisition module for acquiring an image to be processed of a target; a local feature map acquisition module for learning image features within the receptive field range of the image to be processed to obtain at least one local feature map, the local feature map reflecting local features of the image to be processed at different positions; a global feature map acquisition module for establishing, based on the at least one local feature map, relationships between image features beyond the receptive field range in the image to be processed to obtain a global feature map; a feature fusion module for obtaining a keypoint information map through feature fusion of each local feature map and the global feature map; and a pose estimation module for performing pose estimation of the target based on the keypoint information map to obtain a pose estimation result, the pose estimation result comprising the keypoint positions of the target in the image to be processed.
In an exemplary embodiment, the local feature map is obtained by calling a CNN network; the CNN network includes at least one residual module, at least one convolutional layer, and at least one pooling layer. The local feature map acquisition module includes: an initial feature extraction unit for performing initial feature extraction on the image to be processed through the residual module to obtain an initial feature map; and a secondary feature extraction unit for performing secondary feature extraction on the initial feature map through the convolutional layer and/or the pooling layer to obtain a plurality of local feature maps.
In an exemplary embodiment, the global feature map is obtained by calling a Transformer network. The global feature map acquisition module includes: a flattening unit for flattening at least one of the local feature maps into a one-dimensional vector; and a position embedding unit for applying position embedding to the one-dimensional vector and inputting it into the Transformer network to learn the dependency relationships of long-distance pixels in the image to be processed, obtaining the global feature map.
In an exemplary embodiment, the position embedding unit includes: a fusion subunit for allocating a corresponding position embedding vector to each element in the one-dimensional vector and fusing the position embedding vector with the one-dimensional vector to obtain a one-dimensional vector containing position information; and a Transformer subunit for inputting the one-dimensional vector containing position information into the Transformer network, so that the Transformer network learns the relationships among pixels at different positions in the image to be processed using the position information, obtaining the global feature map.
In an exemplary embodiment, the feature fusion module includes: an image joining unit for joining the local feature map and the global feature map to obtain a target feature map; and a keypoint information map acquisition unit for performing a convolution operation on the target feature map to obtain the keypoint information map.
In an exemplary embodiment, the keypoint information map comprises a keypoint heat map and a keypoint position optimization map. The pose estimation module includes: a keypoint position identification unit for identifying the keypoint positions of the target in the image to be processed according to the keypoint heat map and the keypoint position optimization map, obtaining the pose estimation result.
In an exemplary embodiment, the keypoint position identification unit includes: a maximum index determination subunit for determining the maximum index of each keypoint heat map; a searching subunit for looking up the value of the keypoint position optimization map at the maximum index; and a keypoint position acquisition subunit for obtaining the keypoint positions of the target according to the maximum index and the value of the keypoint position optimization map at the maximum index.
In an exemplary embodiment, the keypoint information map further includes a skeleton map formed by connecting the key points according to a skeleton relationship. The single-target pose estimation device further includes: a category prediction module for predicting the category of the target according to the skeleton map to obtain the category of the target.
According to one aspect of the application, an electronic device comprises at least one processor and at least one memory, wherein the memory has computer readable instructions stored thereon; the computer readable instructions are executed by one or more of the processors to cause an electronic device to implement a single target pose estimation method as described above.
According to one aspect of the application, a storage medium has stored thereon computer readable instructions that are executed by one or more processors to implement the single target pose estimation method as described above.
According to one aspect of the application, a computer program product includes computer readable instructions stored in a storage medium, one or more processors of an electronic device reading the computer readable instructions from the storage medium, loading and executing the computer readable instructions, causing the electronic device to implement a single target pose estimation method as described above.
The technical scheme provided by the application has the following beneficial effects. An image of a target is acquired; image features within the receptive field range of the image are learned to obtain a local feature map (which reflects local features of the image to be processed at different positions); relationships between image features beyond the receptive field range in the image are then learned to obtain a global feature map; a keypoint information map is obtained from the local feature map and the global feature map; and finally pose estimation is performed based on the keypoint information map. In this way, different characteristics of the image to be processed are learned separately to obtain multiple feature maps with different emphases, and these feature maps are combined into the keypoint information map to realize pose estimation of the target. This overcomes the defect of the prior art that only image feature relationships within a certain range of the image to be processed can be captured, establishes connections between long-distance image features beyond the receptive field range, mitigates the point-drift phenomenon, and improves the accuracy of pose estimation; moreover, the scheme is simple, the amount of model training is low, and the training cost is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the application and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present application;
FIG. 2 is a flowchart illustrating a single target pose estimation method according to an exemplary embodiment;
FIG. 3 is a flowchart of step 220 in an embodiment corresponding to FIG. 2;
FIG. 4 is a flowchart of step 240 in an embodiment corresponding to FIG. 2;
FIG. 5 is a flowchart of step 420 in an embodiment corresponding to FIG. 4;
FIG. 6 is a flowchart of step 260 in an embodiment corresponding to FIG. 2;
FIG. 7 is a flowchart of step 280 in an embodiment corresponding to FIG. 2;
FIG. 8 is a flowchart of step 700 in an embodiment corresponding to FIG. 7;
FIG. 9 is a flowchart of steps following step 280 in an embodiment corresponding to FIG. 2;
FIG. 10 is a schematic diagram of a single-target pose estimation network, according to an exemplary embodiment;
FIG. 11 is a schematic diagram showing a specific implementation of the single-target pose estimation method in an application scenario;
FIG. 12 is a schematic diagram showing experimental results of the present application;
FIG. 13 is a block diagram illustrating a single-target pose estimation device according to an exemplary embodiment;
FIG. 14 is a hardware block diagram of a server shown in accordance with an exemplary embodiment;
fig. 15 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The following is an introduction and explanation of several terms involved in the present application:
The ImageNet dataset is a large-scale image dataset containing more than 15 million labeled images covering more than 22,000 categories. The dataset is widely used for research and evaluation of tasks such as image classification, target detection, and image segmentation in the field of deep learning.
CNN (convolutional neural network) is a deep learning model that can perform feature extraction on images. Through convolution and pooling operations, the CNN can extract important features from the image and generate a feature map. These feature maps may be used for classification, detection, segmentation, etc. tasks. Therefore, feature extraction of images using CNN is a common method.
ResNet50 is a deep residual network with deeper layers and more complex structure. Compared with a general convolutional neural network, the ResNet50 can better solve the problems of gradient disappearance, gradient explosion and the like, so that high-level features can be better extracted. In addition, resNet50 also uses a shortcut connection to help information transfer faster, thereby improving the efficiency and accuracy of feature extraction. Thus, feature extraction using Stack1 and Stack2 of ResNet50 generally results in better feature representation, thereby improving model performance.
As described above, the prior art still has the technical problems of low accuracy of pose estimation and high model training cost.
In the prior art, although the adopted network structures differ, the pose estimation methods essentially belong to the category of traditional convolutional neural networks and are limited by the size of the convolution kernel's receptive field: around a given pixel, only pixels within a certain range can be covered, while pixels farther away cannot. Therefore, the convolution operation can only capture local image features and cannot establish relationships between pixels beyond the convolution kernel's range, so point drift easily occurs, the detection accuracy of target key points is low, and the effect of pose estimation is poor.
Some solutions make technical improvements to strengthen the network's learning of the correlations between key points, thereby improving network accuracy, but this often results in too many model parameters and increases the training cost.
From the above, the technical problems of low accuracy of pose estimation and high model training cost in the related art still remain to be solved.
Therefore, the single-target pose estimation method provided by the application can effectively improve the accuracy of pose estimation. It is correspondingly suitable for a single-target pose estimation device that can be deployed in an electronic device, where the electronic device may be a computer device with a von Neumann architecture, for example a desktop computer, a notebook computer, or a server; the electronic device may also be a portable mobile electronic device, for example a smartphone or a tablet computer.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment involved in a single target pose estimation method. It should be noted that this implementation environment is only one example adapted to the present application and should not be considered as providing any limitation to the scope of use of the present application.
The implementation environment includes an acquisition side 110 and a server side 130.
Specifically, the capturing end 110 may also be considered as an image capturing device, including but not limited to, a video camera, and other electronic devices having a photographing function. The acquisition end 110 is used for acquiring image data for a target to be estimated in a posture. For example, the capturing end 110 captures an image for one person, resulting in an RGB image including the single person.
The server 130 may be a desktop computer, a notebook computer, a server, or other electronic devices, or may be a computer cluster formed by multiple servers, or even a cloud computing center formed by multiple servers. The server 130 is configured to provide a background service, for example, the background service includes, but is not limited to, single-target pose estimation, and the like.
The server 130 and the acquisition end 110 are pre-connected by wired or wireless network communication, and data transmission between the server 130 and the acquisition end 110 is realized through the network communication. The data transmitted includes, but is not limited to: including images of objects, etc.
In an application scenario, through interaction between the acquisition end 110 and the server 130, the acquisition end 110 captures an image of the target to be processed and uploads the image to the server 130, so as to request the server 130 to provide the single-target pose estimation service.
For the server 130, after receiving the target image uploaded by the acquisition end 110, the target pose estimation service is called: feature extraction is performed on the image to be processed at different scales, the resulting multi-scale feature maps are combined to obtain a keypoint information map of the target, and pose estimation is performed on the target based on the keypoint information map. Furthermore, the skeleton map in the keypoint information map can be used as a semantic segmentation map of the target for detecting the category of the target.
Referring to fig. 2, an embodiment of the present application provides a single target pose estimation method, which is suitable for an electronic device, and the electronic device may be the server 130 in the implementation environment shown in fig. 1.
In the following method embodiments, for convenience of description, the execution subject of each step of the method is described as an electronic device, but this configuration is not particularly limited.
As shown in fig. 2, the method may include the steps of:
step 200, obtaining a to-be-processed image of the target.
The target refers to a single object whose pose is to be estimated; the target may be a single person, or a single animal such as a mouse, which is not specifically limited here.
Alternatively, the image to be processed may be an RGB image, which is not particularly limited herein.
In one possible implementation manner, the original image of the target is preprocessed to obtain the to-be-processed image of the target, and the preprocessing manner is not limited in detail herein.
In one possible implementation, different preprocessing modes can be flexibly adopted according to the actual situation. For example, during model training, preprocessing includes image enhancement, ResNet preprocessing, and scaling; when the model is used for pose estimation, preprocessing only includes ResNet preprocessing.
Image enhancement can include rotation (±25°), scaling (0.75-1.25), cropping and padding (±0.15), motion blur, color enhancement, and so on. ResNet preprocessing includes image resizing and image normalization. Scaling may scale the image of the target to a size suitable for GPU processing.
Preprocessing makes the image of the target easier to process, improves image quality, reduces noise, and enhances image features, so that the image can be better recognized and handled by subsequent processing algorithms; this facilitates feature extraction in the subsequent steps and thereby improves the effectiveness and accuracy of pose estimation.
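As an illustration only, the following sketch (not the patented implementation; the augmentation parameters simply follow the ranges listed above, and the particular imgaug/Keras calls are assumptions about one reasonable way to realize them) shows how such a preprocessing pipeline could look in Python with imgaug and TensorFlow:

    import imgaug.augmenters as iaa
    import numpy as np
    import tensorflow as tf

    # Training-time augmentation roughly matching the ranges described above.
    augmenter = iaa.Sequential([
        iaa.Affine(rotate=(-25, 25), scale=(0.75, 1.25)),   # rotation and scaling
        iaa.CropAndPad(percent=(-0.15, 0.15)),              # cropping and padding
        iaa.Sometimes(0.3, iaa.MotionBlur(k=5)),            # occasional motion blur
        iaa.MultiplyHueAndSaturation((0.8, 1.2)),           # simple color enhancement
    ])

    def preprocess(image: np.ndarray, training: bool) -> tf.Tensor:
        if training:
            image = augmenter(image=image)
        # Resize to a GPU-friendly size and apply ResNet preprocessing (normalization).
        image = tf.image.resize(image, (480, 640))
        return tf.keras.applications.resnet50.preprocess_input(image)

At inference time only the resizing and ResNet normalization branch would be used, mirroring the two preprocessing modes described above.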
Step 220, learning image features within the receptive field range of the image to be processed to obtain at least one local feature map.
The local feature map reflects local features of the image to be processed at different positions.
In one possible implementation, the image region within the receptive field is convolved by sliding a convolution kernel over the image to be processed, and an activation function is applied to generate local feature maps reflecting the local features of the image to be processed at different locations.
In one possible implementation, a convolutional neural network may be used to perform feature extraction on an image to be processed, where image features within a receptive field range in the image to be processed may be learned through a convolutional layer, a pooling layer, a fully-connected layer, and so on, to obtain a local feature map. For example, multi-scale, multi-level local feature information of the image to be processed is extracted and represented for subsequent tasks by layer-by-layer convolution and downsampling operations.
Small range features can be extracted using smaller convolution kernels, while larger range features can be extracted using larger convolution kernels, and the pooling operation can help extract the main features in the image. In one possible implementation, in the convolutional neural network, by performing different convolution operations and/or pooling operations on the image to be processed, multiple local feature maps with different feature expression capabilities are obtained, for example, the multiple local feature maps may be 4 times, 8 times, and 16 times of the downsampled feature maps, respectively.
Acquisition of multiple feature maps retaining different multiples of features may be implemented using Convolutional Neural Networks (CNNs). In particular, multiple convolution layers may be added to the CNN, each using a different convolution kernel size and step size to extract features of different multiples. For example, the first convolution layer uses a convolution kernel of 3x3 and a step size of 1, the second convolution layer uses a convolution kernel of 5x5 and a step size of 2, the third convolution layer uses a convolution kernel of 7x7 and a step size of 3, and so on. Thus, each convolution layer can extract features of different multiples, thereby obtaining a plurality of feature maps. In addition, a pooling layer may be added after each convolution layer to further compress the feature map size.
The multiple local feature maps may be obtained by alternating convolution and pooling operations, or by using a particular convolutional neural network model (e.g., ResNet50), which is not limited here.
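For illustration, a minimal sketch of producing 4-times, 8-times, and 16-times downsampled local feature maps with plain Keras layers is given below; the kernel sizes and channel counts are assumptions and do not reproduce the network described later in this document:

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = layers.Input((480, 640, 3))
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inputs)   # 2x downsampling
    f4 = layers.MaxPool2D(2)(x)                                                       # 4x local feature map
    f8 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(f4)      # 8x local feature map
    f16 = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(f8)     # 16x local feature map
    multi_scale = tf.keras.Model(inputs, [f4, f8, f16])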
In one possible implementation, the local feature map is obtained by calling a CNN (convolutional neural network); the CNN includes at least one residual module, at least one convolutional layer, and at least one pooling layer. Specifically, as shown in FIG. 3, step 220 may include the following steps:
and 300, extracting initial characteristics of the image to be processed through a residual error module to obtain an initial characteristic diagram.
The residual module is an important component of the deep residual neural network, a special type of convolutional neural network.
The basic structure of the residual module consists of one or more convolution layers, a batch normalization layer (Batch Normalization), and a skip connection. Specifically, the residual module first performs a series of convolution and batch normalization operations on the input, then performs an element-level addition of the resulting feature map and the input, and finally applies a nonlinear transformation through an activation function.
Alternatively, the residual module may be part of a ResNet (a type of deep residual neural network), such as the Stack1 and Stack2 parts of ResNet50, which are not specifically limited herein.
Introducing the residual module can effectively alleviate the problems of vanishing and exploding gradients in deep network training, making the network easier to train and optimize. In traditional convolutional neural networks, information is processed through unidirectional layer-to-layer transfer. However, as network depth increases, the transfer of information from layer to layer becomes difficult, resulting in reduced network performance. The residual module introduces a skip connection that adds the input directly to the output of the block, so that the network can more easily learn the residual information.
Using the residual module for initial feature extraction on the image to be processed improves the efficiency and accuracy of feature extraction, yields a better feature representation, and improves the performance of the model.
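A minimal Keras sketch of such a residual block (convolution, batch normalization, and a skip connection; the filter counts are assumptions) could look as follows:

    from tensorflow.keras import layers

    def residual_block(x, filters):
        """Basic residual block: conv + batch norm twice, with a skip connection (sketch)."""
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same")(x)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        if shortcut.shape[-1] != filters:               # match channel count if needed
            shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
        y = layers.Add()([y, shortcut])                 # skip connection: add input to output
        return layers.Activation("relu")(y)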
Step 320, performing secondary feature extraction on the initial feature map through the convolution layer and/or the pooling layer to obtain a plurality of local feature maps.
Based on the initial feature map, a plurality of different local feature maps are obtained through convolutional and/or pooling layers; each local feature map can retain features at a different downsampling multiple, improving the effect of subsequent pose estimation.
Step 240, establishing relationships between image features beyond the receptive field range in the image to be processed based on at least one local feature map, to obtain a global feature map.
In one possible implementation, the global feature map is obtained by invoking a Transformer network.
The Transformer network is a sequence-to-sequence model based on the attention mechanism. It consists of an encoder and a decoder, each stacked from several identical modules. The Transformer maps the input features into different feature spaces through a multi-head attention mechanism, computes multiple attention distributions, and finally performs a weighted summation of the multiple output features to obtain the final output features. This structure can effectively capture related information in a sequence and improve model performance. Possibly, the Transformer block is composed of LN (Layer Normalization), multi-head attention, and Dense (fully-connected) layers.
Notably, the Transformer can extract global features through the self-attention mechanism. The self-attention mechanism performs an attention calculation for each position in the input sequence to determine the relationship between that position and the other positions. This allows the Transformer to capture global dependencies in the input sequence and thus extract global features. In addition, the Transformer uses a multi-head attention mechanism to further enhance its global feature extraction capability.
Specifically, as shown in fig. 4, step 240 may include the steps of:
Step 400, flattening at least one local feature map into a one-dimensional vector.
The local feature map requires a shape transformation before entering the Transformer network in order to meet the Transformer's input requirements. For example, the local feature map may be flattened into a one-dimensional vector using a convolution operation and then input into the Transformer for processing.
Specifically, the flattening operation takes each pixel point in the feature map as a feature vector, and then connects the feature vectors to obtain a long vector.
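A minimal sketch of this flattening, assuming a (batch, H, W, C) feature map, is:

    import tensorflow as tf

    def flatten_to_sequence(feature_map):
        # (batch, H, W, C) -> (batch, H*W, C): every pixel becomes one feature vector of length C
        shape = tf.shape(feature_map)
        return tf.reshape(feature_map, (shape[0], shape[1] * shape[2], shape[3]))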
Step 420, after position embedding of the one-dimensional vector, inputting it into the Transformer network to learn the dependency relationships of long-distance pixels in the image to be processed, obtaining the global feature map.
After converting the image into a one-dimensional vector, in order to introduce position information, a method of position embedding (Positional Embedding) may be used.
By integrating the position information into the one-dimensional vector through position embedding, the Transformer network can use the position embedding to understand the relationships between pixels at different positions in the image, enabling better feature extraction and representation learning.
As shown in fig. 5, step 420 may include the steps of:
Step 500, allocating a corresponding position embedding vector to each element in the one-dimensional vector, and fusing the position embedding vector with the one-dimensional vector to obtain a one-dimensional vector containing position information.
The elements in the one-dimensional vector refer to the pixel values at each position after the image is converted into the one-dimensional vector. In the process of converting an image into a one-dimensional vector, each pixel value is arranged into a one-dimensional vector in a certain order.
The position embedding vector may be a fixed length vector containing information about the position of the element in a one-dimensional vector. Thus, each element in a one-dimensional vector has a unique position-embedded vector corresponding thereto.
The position embedding vector and the one-dimensional vector may be fused by element-wise addition or concatenation. Specifically, let a one-dimensional vector be denoted as x and a position embedding vector be denoted as p. Their dimensions should be the same in order to enable element-by-element operation.
Element-by-element addition: the position embedding vector p may be added element by element with the one-dimensional vector x to obtain a processed vector. Thus, the value of the position embedding vector is added to the one-dimensional vector element of the corresponding position, so that the position information is integrated into the one-dimensional vector, and the processed vector is expressed as: x' =x+p.
Element-by-element splicing: the position embedding vector p may be concatenated with the one-dimensional vector x along the feature dimension. The position embedding vector is thus added as an extra dimension to the one-dimensional vector, so the position information is passed to the Transformer model together with the one-dimensional vector; the processed vector is expressed as: x' = [x, p].
The processed one-dimensional vector x', obtained by fusing the position embedding vector with the one-dimensional vector, thus contains position information.
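For illustration, a patch-encoder sketch using a learned position embedding and element-wise addition (x' = x + p) is shown below; the use of tf.keras layers and a learned Embedding table is an assumption about one possible realization, not a statement of the patented implementation:

    import tensorflow as tf
    from tensorflow.keras import layers

    class PatchEncoder(layers.Layer):
        """Adds a learned position embedding to each element of the flattened sequence (sketch)."""
        def __init__(self, num_patches, dim):
            super().__init__()
            self.proj = layers.Dense(dim)                                    # project pixel vectors to model dim
            self.pos_emb = layers.Embedding(input_dim=num_patches, output_dim=dim)

        def call(self, x):                                                   # x: (batch, num_patches, feat)
            positions = tf.range(start=0, limit=tf.shape(x)[1], delta=1)
            return self.proj(x) + self.pos_emb(positions)                    # element-wise addition x' = x + p

Element-wise concatenation would instead replace the addition with tf.concat along the last axis, at the cost of a larger feature dimension.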
Step 520, inputting the one-dimensional vector containing position information into the Transformer network, so that the Transformer network learns the relationships between pixels at different positions in the image to be processed using the position information, obtaining the global feature map.
The one-dimensional vector containing position information is passed as input to the Transformer network for feature extraction and representation learning; the Transformer network uses the position embedding vectors to understand the relationships between elements at different positions in the one-dimensional vector, thereby learning the relationships between pixels at different positions in the image to be processed and obtaining the global feature map.
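A sketch of one such Transformer encoder block built from LayerNormalization, MultiHeadAttention, and Dense layers is given below; the head count and hidden sizes are assumptions:

    from tensorflow.keras import layers

    def transformer_block(x, num_heads=4, key_dim=64, mlp_dim=256):
        """One Transformer encoder block: LayerNorm -> multi-head self-attention -> LayerNorm -> Dense (sketch)."""
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)   # self-attention
        x = layers.Add()([x, attn])                                                    # residual connection
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.Dense(mlp_dim, activation="gelu")(h)
        h = layers.Dense(x.shape[-1])(h)
        return layers.Add()([x, h])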
Step 260, obtaining a keypoint information map through feature fusion of each local feature map and the global feature map.
The purpose of feature fusion is to combine different feature information to provide a more comprehensive, accurate representation of the features. By fusing the characteristics of different levels or sources, the image information of different levels and angles can be captured, and the perceptibility and performance of the model are improved. Feature fusion can also improve the robustness and generalization capability of the model, and reduce redundancy and noise among features.
The feature fusion method can be feature stitching, feature addition, feature averaging, feature weighted fusion, attention mechanism and the like, and can also be feature graph connection.
As shown in fig. 6, step 260 may include the steps of:
Step 600, joining the local feature map and the global feature map to obtain a target feature map.
The local feature map and the global feature map are features extracted using different methods; they include features from low level to high level, retain features at different downsampling multiples, and each focuses on different characteristics. Joining the local feature map and the global feature map enriches the feature representation, promotes information transfer, improves the robustness of the model, and makes effective use of computing resources, thereby improving the performance and effect of the model.
It should be noted that before image joining, it is ensured that the images to be joined have the same size or are scaled and cut accordingly to be uniform in size.
Step 620, performing a convolution operation on the target feature map to obtain the keypoint information map.
Alternatively, the keypoint information map may include a keypoint heat map, a keypoint position optimization map, and a target skeleton map.
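A rough sketch of this fusion step is shown below, assuming the local and global feature maps share the same spatial size and assuming 16 key points as in the later application scenario; the channel counts and activations are illustrative only:

    from tensorflow.keras import layers

    def fuse_and_predict(local_maps, global_map, num_keypoints=16):
        """Concatenate local and global feature maps (same spatial size assumed) and
        convolve the result into the keypoint information maps (sketch)."""
        fused = layers.Concatenate(axis=-1)(list(local_maps) + [global_map])
        fused = layers.Conv2D(128, 3, padding="same", activation="relu")(fused)
        heatmaps = layers.Conv2D(num_keypoints, 1, activation="sigmoid", name="heatmaps")(fused)
        offsets = layers.Conv2D(2 * num_keypoints, 1, name="offsets")(fused)         # x/y position optimization maps
        skeleton = layers.Conv2D(1, 1, activation="sigmoid", name="skeleton")(fused)  # skeleton / segmentation map
        return heatmaps, offsets, skeleton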
Step 280, estimating the pose of the target based on the keypoint information map to obtain a pose estimation result.
The pose estimation result comprises the key point position of the target in the image to be processed.
The keypoint information map includes a keypoint heat map and a keypoint position optimization map. As shown in FIG. 7, step 280 may include the following step:
and 700, identifying the position of the key point of the target in the image to be processed according to the key point heat map and the key point position optimization map, and obtaining a gesture estimation result.
Under the effect of this embodiment, an image of a target is acquired; image features within the receptive field range of the image are learned to obtain a local feature map; relationships between image features beyond the receptive field range in the image are then learned to obtain a global feature map; a keypoint information map is obtained from the local feature map and the global feature map; and finally pose estimation is performed based on the keypoint information map. Different characteristics of the image to be processed are thus learned separately to obtain multiple feature maps with different emphases, and these feature maps are combined into the keypoint information map to realize pose estimation of the target. This overcomes the defect of the prior art that only image feature relationships within a certain range of the image to be processed can be captured, establishes connections between long-distance image features beyond the receptive field range, mitigates the point-drift phenomenon, and improves the accuracy of pose estimation; moreover, the scheme is simple, the amount of model training is low, and the training cost is reduced.
In an exemplary embodiment, as in fig. 8, step 700 may include the steps of:
step 800, determining a maximum index for each keypoint heat map.
Maximum value index refers to the location or index in which the maximum value is found in an array or matrix. It is understood that the maximum value index is positioning information (i.e., position information).
Step 820, searching for the value of the keypoint position optimization map at the maximum index.
The keypoint position optimization map comprises position optimization maps of the key point in the horizontal and vertical directions; the values of these maps at the maximum index are looked up.
Step 840, obtaining the keypoint positions of the target according to the maximum index and the value of the keypoint position optimization map at the maximum index.
The maximum index of the keypoint heat map and the value of the keypoint position optimization map at the maximum index are added, and the result is multiplied by a specific multiple to obtain the keypoint coordinates.
The specific multiple depends on the global and local feature maps that are fused; for example, if the global and local feature maps are all 8-times downsampled feature maps, the specific multiple is 8.
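A minimal decoding sketch following this rule (maximum index plus the looked-up offset, multiplied by the downsampling factor) is:

    import numpy as np

    def decode_keypoint(heatmap, offset_x, offset_y, factor=8):
        """Recover one keypoint position from its heat map and its horizontal/vertical
        position optimization maps (sketch; factor matches the fused downsampling multiple)."""
        idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)   # (row, col) of the maximum
        dy, dx = offset_y[idx], offset_x[idx]                       # offsets looked up at the maximum index
        y = (idx[0] + dy) * factor
        x = (idx[1] + dx) * factor
        return x, y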
In an exemplary embodiment, the keypoint information graph further includes a skeleton graph formed by connecting keypoints according to a skeleton relationship, as shown in fig. 9, and step 280 may further include the following steps:
Step 900, predicting the category of the target according to the skeleton map to obtain the category of the target.
The skeleton diagram can be used as a semantic segmentation diagram of the target, and the category of the target can be detected according to the semantic segmentation diagram.
Faster R-CNN can be combined with the semantic segmentation map to detect the target category. In particular, the semantic segmentation map can provide more accurate target boundary information, helping Faster R-CNN locate and classify the target more accurately.
In one possible implementation, the detection of the object in the image may be performed by:
extracting the pixel coordinates of the target according to the semantic segmentation map; performing connected-region analysis on the extracted pixel coordinates to obtain the bounding box of the target; screening and optimizing the bounding boxes, for example removing overlapping bounding boxes and adjusting bounding box sizes; and performing target detection on the image based on the bounding box of the target, for which a detection algorithm such as Faster R-CNN or YOLO may be used.
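As an illustration only (OpenCV's connected-component analysis is used here as an assumed tool; the threshold values are arbitrary), the bounding-box extraction from the segmentation map could be sketched as:

    import numpy as np
    import cv2

    def boxes_from_segmentation(seg_map, threshold=0.5, min_area=50):
        """Derive candidate target bounding boxes from the skeleton/segmentation map by
        connected-component analysis (sketch); boxes can then be passed to a detector."""
        mask = (seg_map > threshold).astype(np.uint8)
        num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        boxes = []
        for i in range(1, num):                                # label 0 is the background
            x, y, w, h, area = stats[i]
            if area >= min_area:                               # drop tiny noisy components
                boxes.append((x, y, x + w, y + h))
        return boxes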
Under the effect of this embodiment, the skeleton map obtained by the single-target pose estimation method is used as a semantic segmentation map to predict the category of the target, providing an application mode and further broadening the application scenarios.
Referring to FIG. 10, the embodiment of the present application further provides a single-target pose estimation network for implementing the single-target pose estimation method provided by the application. The single-target pose estimation network includes: CNN layer 1, MaxPool layer 2, C1 layer 3, C2 layer 4, C3 layer 5, C4 layer 6, Reshape layer 7, Patch encoder layer 8, Transformer layer 9, Concat and Conv layer 12, Detail convolution layer 11, and MLP layer 10.
CNN layer 1 includes stack1 and stack2 of ResNet50. ResNet50 is a deep residual convolutional neural network consisting of multiple residual blocks, and Stack1 and Stack2 are two residual block stacks in ResNet50 (stacked residual blocks). Stack1 performs preliminary feature extraction on the input feature map and reduces its spatial size; Stack2 further extracts features and further reduces the spatial size of the feature map. The preprocessed image undergoes multi-level feature extraction through Stack1 and Stack2 of ResNet50 while the spatial size of the feature map is gradually reduced, which effectively improves model performance while reducing the number of parameters and the amount of computation.
MaxPool layer 2 includes MaxPooling2D with stride=2. MaxPool layer 2 performs 2-times downsampling, preserving important features while reducing the feature map size to reduce computation.
The C1 layer 3, C2 layer 4, C3 layer 5, and C4 layer 6 are convolutions with stride 1. As shown in FIG. 10, C1 layer 3 and C2 layer 4 are cascaded together and located after MaxPool layer 2; that is, after the max pooling, two convolution operations with stride 1 are performed. The two stride-1 convolutions increase the depth of the convolution, improving the expressive power and accuracy of the model; at the same time, they better capture the detail information in the image and strengthen the model's feature extraction capability. C3 layer 5 and C4 layer 6 are likewise cascaded together, in the same manner as C1 layer 3 and C2 layer 4.
The Reshape layer 7 performs a shape transformation on the feature map; located before the Patch encoder layer 8, it flattens the feature map into a one-dimensional vector.
The Patch encoder layer 8 performs position embedding on the one-dimensional vector. Before the one-dimensional vectors enter the Transformer layer 9, position embedding is required, which attaches position information to each vector.
The Transformer layer 9 processes the position-embedded one-dimensional vector to extract global features. Based on the position information of the vectors, the Transformer layer 9 can distinguish their different roles in the sequence and thus extract global features.
The MLP layer 10 further processes the global features, restores them into a feature map, and deconvolves it to obtain the global feature map. The MLP layer 10 includes LN (Layer Normalization), Dense (fully-connected) layers, a reshape unit, and deconvolution.
Detail convolution layer 11 comprises: Conv2D with stride=2 and padding=same; Conv2D with stride=1 and padding=same; BN (Batch Normalization); and a ReLU activation. The Detail convolution layer 11 performs 2-times downsampling.
The Concat and Conv layer 12 splices the second class of feature maps with the global feature map and performs a convolution operation on the target feature map obtained by the splicing. The convolution operations in the Concat and Conv layer 12 include: Conv2D with padding=same, and Conv2D with stride=1.
In this embodiment, the second class of feature maps includes the feature map output by the Detail convolution layer 11 and the feature map output by the C2 layer 4, and the global feature map includes the feature map output by the MLP layer 10.
Possibly, the Transformer layer 9 is composed of LN (Layer Normalization), multi-head attention, and Dense (fully-connected) layers.
In one possible implementation, the single-target pose estimation network further comprises an intermediate supervision unit 13 for adding additional supervisory signals to the single-target pose estimation network to help it learn and adjust better. The supervisory signal may be an intermediate-layer output of the predicted result or the output of another related task.
Through this embodiment, the network structure of the scheme is simple, the number of parameters is not large, and the training cost can be reduced. Three groups of feature maps to be spliced are obtained through three branches of feature computation, and the single-branch computation helps preserve features at high resolution: the feature map output by the Detail convolution layer 11 retains 4-times features, the feature map output by the C2 layer 4 retains 8-times features, and the feature map output by the MLP layer 10 retains 16-times features. Finally, the 4-times, 8-times, and 16-times features are connected to obtain the keypoint heat map, the keypoint position optimization map, and the skeleton map of the target, and effective pose estimation can be performed for a single target.
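A rough end-to-end sketch of this three-branch structure is given below, reusing the PatchEncoder and transformer_block helpers sketched earlier (assumed to be in scope); the ResNet50 cut point, channel counts, and layer parameters are assumptions and only approximate the structure described above:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_pose_net(input_shape=(480, 640, 3), dim=256):
        """Sketch of the three-branch backbone; not the exact patented network."""
        inputs = layers.Input(input_shape)
        # CNN layer 1: early ResNet50 stages -> 4x downsampled map F_4x
        backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", input_tensor=inputs)
        f4x = backbone.get_layer("conv2_block3_out").output                  # roughly stack1+stack2, 4x downsampled
        # Branch 1: Detail convolution (stride 2) keeps high-resolution detail -> F_8x_1
        f8x_1 = layers.Conv2D(dim, 3, strides=2, padding="same", activation="relu")(f4x)
        # Branch 2: MaxPool + two stride-1 convolutions (C1, C2) -> F_8x_2
        f8x_0 = layers.MaxPool2D(2)(f4x)
        f8x_2 = layers.Conv2D(dim, 3, padding="same", activation="relu")(f8x_0)
        f8x_2 = layers.Conv2D(dim, 3, padding="same", activation="relu")(f8x_2)
        # Branch 3: pooling + convs (C3, C4), flatten, position-embed, Transformer, deconv -> F_8x_3
        t = layers.MaxPool2D(2)(f8x_2)
        t = layers.Conv2D(dim, 3, padding="same", activation="relu")(t)
        t = layers.Conv2D(dim, 3, padding="same", activation="relu")(t)
        h, w = t.shape[1], t.shape[2]
        seq = layers.Reshape((h * w, dim))(t)
        seq = PatchEncoder(h * w, dim)(seq)                                   # position embedding (earlier sketch)
        for _ in range(6):
            seq = transformer_block(seq)                                      # Transformer blocks (earlier sketch)
        g = layers.Reshape((h, w, dim))(seq)
        f8x_3 = layers.Conv2DTranspose(dim, 3, strides=2, padding="same", activation="relu")(g)
        # Concat + Conv: fuse the three 8x maps; the keypoint heads from the earlier fusion sketch attach here
        fused = layers.Concatenate()([f8x_1, f8x_2, f8x_3])
        fused = layers.Conv2D(dim, 3, padding="same", activation="relu")(fused)
        return tf.keras.Model(inputs, fused)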
FIG. 11 is a schematic diagram showing a specific implementation of a single target pose estimation method in an application scenario. In this application scenario, the pose estimation is performed with a mouse as a target, that is, the pose information b, c, d, and e of the mouse are obtained according to the image a of the mouse in fig. 11.
The hardware configuration of the application scenario is: an i9-12900K CPU, a 4090 GPU, 4x32 GB of 5200 MHz DDR5 memory, and a 2 TB Samsung 980 PRO solid state drive. The environment is configured as: Windows 11 Professional 22H2, programmed in Python 3.9, with the network built on TensorFlow 2.9.3 and image data enhancement performed using imgaug.
After the mouse image is acquired using an acquisition device (such as a CMOS camera), image enhancement is applied to it, specifically rotation (±25°), scaling (0.75-1.25), cropping and padding (±0.15), motion blur, and color enhancement, to obtain the image a of the mouse.
The mouse image a is preprocessed according to the ResNet preprocessing mode and scaled to a picture size suitable for GPU processing (640x480 resolution is used in this application scenario), and then fed into the stack1 and stack2 parts of a ResNet50 pre-trained on an image dataset to obtain a 4-times downsampled intermediate feature map F_4x (160x120). F_4x is max-pooled to obtain an 8-times downsampled feature map F_8x_0 (80x60); on the other hand, a convolution with stride 2 (downsampling) is applied to F_4x to obtain the high-resolution detail feature map F_8x_1. Two convolutions with stride 1 are applied to F_8x_0 to obtain F_8x_2. F_8x_2 is max-pooled and passed through two stride-1 convolutions, reshaped into a one-dimensional vector, position-embedded, and then fed into 6 Transformer blocks for computation; the resulting features are restored into a feature map and deconvolved to obtain the feature map F_8x_3. Finally, F_8x_1, F_8x_2, and F_8x_3 are joined to obtain F_8x (80x60), and a convolution operation on F_8x yields the pose information of the mouse. As shown in FIG. 11, b is the sum of the 16 keypoint heat maps of the mouse; c and d are the sums of the horizontal and vertical position optimization maps of the 16 key points, respectively; and e is the animal skeleton map.
In the above-described procedure, the processing procedure from the scaled mouse image a to the posture information b, c, d, and e of the mouse may be implemented by a single-object posture estimation network as shown in fig. 10.
In FIG. 11, pose information b is the sum of the 16 keypoint heat maps of the mouse; pose information c and d are the sums of the position optimization maps of the 16 key points in the horizontal and vertical directions, respectively; and pose information e shows the skeleton map of the target (mouse) obtained by connecting its key points.
Further, the maximum index of each keypoint heat map is found and multiplied by 8 to obtain a keypoint position (x, y); the values of the horizontal and vertical position optimization maps of each key point at the maximum index are looked up, multiplied by 8, and added to x and y respectively to obtain the optimized (x, y), i.e., the final keypoint position.
Compared with the prior art, the scheme of the application has better effects, and the effects of the scheme of the application are described below.
DeepLabCut and SLEAP are techniques with similar pose estimation accuracy (mean average precision, mAP), whereas the present application achieves an mAP on the fly32 dataset that is 0.5% higher than both techniques. Performance differences were also evaluated as root mean square error (RMSE) using a self-established mouse dataset; the RMSE of the present application is 10% lower than DeepLabCut, the better-performing of the two techniques. The point-drift situation of the application is greatly reduced.
Referring to fig. 12, a schematic diagram of a result display of the implementation of the present application is shown.
Where a is a graph comparing the accuracy of the scheme of the present application and the other schemes (DeepLabCut and SLEAP) on single-fruit-fly pose estimation; it can be seen that the mAP of the present scheme on the drosophila dataset is 93.5±0.5%, higher than the 92.8% of DeepLabCut and the 92.7% of SLEAP.
b is a graph comparing the RMSE on single-mouse pose estimation of the scheme of the present application (3.44±0.19 and 3.73±0.20) and the other schemes, DeepLabCut (3.83±0.41) and SLEAP (4.40±0.52); it can be seen that the RMSE of the present scheme on the mouse dataset is lower than that of the other schemes.
c is a graph comparing the inference speed on single-mouse pose estimation of the scheme of the present application and the other schemes (DeepLabCut and SLEAP); the inference speed of the present scheme is 90±4 fps, lower than the 106±4 fps of SLEAP but higher than the 44±2 fps of DeepLabCut.
d and e are performance plots of the present scheme under different amounts of training data; when the amount of training data is about 350 images, the pose estimation network already achieves a good prediction effect.
f is a schematic diagram of the key point connections used as the skeleton label for the mouse.
g shows the key point inference performed on a mouse video by the scheme of the present application and the other schemes (DeepLabCut and SLEAP) respectively; the results show that, compared with DeepLabCut, the model of the present application greatly reduces point jitter, and compared with SLEAP it retains more key point predictions (pcutoff = 0.6), as shown in the box.
h compares the model with and without the skeleton label (baseline); as shown in the box, after the skeleton label is added, the slight jitter of the key point predictions is mitigated.
In the experimental results shown above, the accuracy, inference speed and stability of the pose estimation of the present scheme are all better than those of the prior art. In addition, when the amount of training data is about 350 images, the pose estimation network already achieves a good prediction effect, which indicates that the amount of training data required by the pose estimation network provided by the present application is relatively small, i.e. the training cost is low; the scheme of the present application therefore effectively solves the technical problem of high model training cost.
The following is an embodiment of the apparatus of the present application, which may be used to execute the single target pose estimation method and the target class detection method according to the present application. For details not disclosed in the embodiment of the apparatus of the present application, please refer to the method embodiment of the single target pose estimation method and the target class detection method according to the present application.
Referring to fig. 13, in an embodiment of the present application, a single target pose estimation apparatus 1300 is provided, including but not limited to: an image acquisition module 1310, a local feature map acquisition module 1330, a global feature map acquisition module 1350, a feature fusion module 1370, and a pose estimation module 1390.
The image acquisition module 1310 is configured to acquire a to-be-processed image of the target.
The local feature map obtaining module 1330 is configured to learn image features in a receptive field range in the image to be processed, so as to obtain at least one local feature map.
The global feature map obtaining module 1350 is configured to establish a relationship between image features exceeding a receptive field range in the image to be processed based on at least one local feature map, so as to obtain a global feature map.
The feature fusion module 1370 is configured to obtain a key point information map through feature fusion between each local feature map and the global feature map.
The pose estimation module 1390 is configured to perform pose estimation of the target based on the key point information map, so as to obtain a pose estimation result; the pose estimation result comprises the key point position of the target in the image to be processed.
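Purely as an illustration of how the module division above could be wired together, the following sketch composes the five modules into one estimator; the class name, method names and call interfaces are assumptions and are not part of the disclosed apparatus.

class SingleTargetPoseEstimator:
    """Illustrative wiring of the five modules; names and interfaces are assumptions."""
    def __init__(self, image_acquirer, local_extractor, global_extractor, fuser, estimator):
        self.image_acquirer = image_acquirer      # image acquisition module 1310
        self.local_extractor = local_extractor    # local feature map acquisition module 1330
        self.global_extractor = global_extractor  # global feature map acquisition module 1350
        self.fuser = fuser                        # feature fusion module 1370
        self.estimator = estimator                # pose estimation module 1390

    def estimate(self, source):
        image = self.image_acquirer(source)                 # image to be processed
        local_maps = self.local_extractor(image)            # at least one local feature map
        global_map = self.global_extractor(local_maps)      # global feature map
        keypoint_maps = self.fuser(local_maps, global_map)  # key point information map
        return self.estimator(keypoint_maps)                # pose estimation result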
It should be noted that, when the single-target posture estimation device provided in the foregoing embodiment performs single-target posture estimation, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the single-target posture estimation device is divided into different functional modules to complete all or part of the functions described above.
In addition, the single-target posture estimation device and the single-target posture estimation method provided in the foregoing embodiments belong to the same concept, and the specific manner in which each module performs the operation has been described in detail in the method embodiment, which is not described herein.
Fig. 14 shows a structural schematic of a server according to an exemplary embodiment. This server is suitable for use as the server 130 in the implementation environment shown in fig. 1.
It should be noted that this server is only an example adapted to the present application, and should not be construed as providing any limitation on the scope of use of the present application. Nor should the server be construed as necessarily relying on or necessarily having one or more of the components of the exemplary server 2000 illustrated in fig. 14.
The hardware structure of the server 2000 may vary widely depending on its configuration or performance. As shown in fig. 14, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU, Central Processing Unit) 270.
Specifically, the power supply 210 is configured to provide an operating voltage for each hardware device on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices. For example, interactions between acquisition side 110 and server side 130 in the implementation environment shown in FIG. 1 are performed.
Of course, in other examples of the adaptation of the present application, the interface 230 may further include at least one serial-parallel conversion interface 233, at least one input-output interface 235, at least one USB interface 237, and the like, as shown in fig. 14, which is not particularly limited herein.
The memory 250 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, where the resources stored include an operating system 251, application programs 253, and data 255, and the storage mode may be transient storage or permanent storage.
The operating system 251 is used for managing and controlling the hardware devices and the applications 253 on the server 2000, so that the central processing unit 270 can operate on and process the massive data 255 in the memory 250; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The application 253 is a computer program that, based on computer readable instructions, performs at least one specific task on top of the operating system 251; it may include at least one module (not shown in fig. 14), and each module may contain a series of computer readable instructions for the server 2000. For example, the single target pose estimation apparatus may be regarded as an application 253 deployed on the server 2000.
The data 255 may be photographs, pictures, and the like stored on a disk, or may be an image to be processed of a target for pose estimation, a pose estimation result, and the like, stored in the memory 250.
The central processor 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus so as to read the computer readable instructions stored in the memory 250, thereby operating on and processing the massive data 255 in the memory 250. For example, the single target pose estimation method may be accomplished by the central processor 270 reading a series of computer readable instructions stored in the memory 250.
Furthermore, the present application can be realized by hardware circuitry or by a combination of hardware circuitry and software, and thus, the implementation of the present application is not limited to any specific hardware circuitry, software, or combination of the two.
Referring to fig. 15, in an embodiment of the present application, an electronic device 4000 is provided, where the electronic device 4000 may include: desktop computers, notebook computers, servers, etc.
In fig. 15, the electronic device 4000 includes at least one processor 4001 and at least one memory 4003.
Data interaction between the processor 4001 and the memory 4003 may be achieved through at least one communication bus 4002. The communication bus 4002 may include a path for transferring data between the processor 4001 and the memory 4003. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 15, but this does not mean that there is only one bus or only one type of bus.
Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, and so on.
The memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program instructions or code in the form of instructions or data structures and that can be accessed by the electronic device 4000.
The memory 4003 has computer readable instructions stored thereon, and the processor 4001 can read the computer readable instructions stored in the memory 4003 through the communication bus 4002.
The computer readable instructions are executed by the one or more processors 4001 to implement the single target pose estimation method in the embodiments described above.
Further, in an embodiment of the present application, a storage medium having stored thereon computer readable instructions that are executed by one or more processors to implement the single target pose estimation method as described above is provided.
In an embodiment of the present application, a computer program product is provided, the computer program product including computer readable instructions stored in a storage medium, one or more processors of an electronic device reading the computer readable instructions from the storage medium, loading and executing the computer readable instructions, causing the electronic device to implement a single target pose estimation method as described above.
Compared with the related art, the application has the following beneficial effects:
1. An image to be processed of a target is acquired; image features within the receptive field range of the image are learned to obtain local feature maps; the relationships between image features beyond the receptive field range of the image are learned to obtain a global feature map; a key point information map is obtained from the local feature maps and the global feature map; and finally the pose is estimated based on the key point information map. In this way, different characteristics of the image to be processed are learned separately to obtain several feature maps with different emphases, and these feature maps are combined into a key point information map to realize pose estimation of the target. This overcomes the limitation of the prior art that only image feature relationships within a certain range of the image to be processed can be captured, establishes connections between long-distance image features beyond the receptive field range, alleviates the point drift phenomenon, and improves the accuracy of pose estimation; moreover, the scheme is simple and easy to implement, the model training amount is small, and the training cost is reduced.
2. The skeleton diagram obtained by the single-target pose estimation method is used as a semantic segmentation map to predict the category of the target, which provides an additional application mode and further broadens the application scenarios.
3. The preprocessed image is subjected to multi-level feature extraction through stack1 and stack2 of ResNet50, and the spatial size of the feature map is gradually reduced, which can effectively improve the performance of the model while reducing the number of parameters and the amount of computation. Performing two convolution operations with a stride of 1 increases the convolution depth and improves the expressive power and accuracy of the model; at the same time, the two stride-1 convolutions better capture detail information in the image and strengthen the feature extraction ability of the model.
4. An image to be processed of a target is acquired; feature extraction is performed on the image using different convolution operations and/or pooling operations to obtain several feature maps with different emphases; a Transformer is then used to obtain global features; and pose information such as the key point heat maps, the key point position optimization maps and the skeleton map is obtained from the feature maps produced by the convolutional neural network and the Transformer. The technical scheme of the present application combines a convolutional neural network with a Transformer in one network structure: the convolutional neural network is responsible for extracting features of the target image at high resolution and reducing the image dimension, while the Transformer is responsible for processing the image features output by the convolutional neural network. The advantage of the feature map obtained by the Transformer is that the feature of each pixel is associated with the features of the other pixels in the sequence, so that a relationship between any pixel and all other pixels can be established, thereby obtaining global features. Finally, the different feature map information obtained by the convolutional neural network and the Transformer is combined, and the pose key point heat maps, the key point position optimization maps and the body skeleton map are generated after convolution. Combining the convolutional neural network with the Transformer in this way allows global-local context to be learned to strengthen the feature representation, overcomes the limitation that convolution operations can only capture local image features, alleviates point drift, and improves the accuracy of single-target pose estimation. The technical scheme of the present application is simple and easy to implement, the corresponding single-target pose estimation network has relatively few parameters and an uncomplicated structure, and a high-accuracy result can be achieved with fewer parameters and lower training cost; therefore, the technical problems of low pose estimation accuracy and high model training cost in the related art can be effectively solved.
5. In the experimental results shown in fig. 12, the accuracy, inference speed and stability of the pose estimation of the present scheme are better than those of the prior art. In addition, fig. 12 shows that when the amount of training data is about 350 images, the pose estimation network already achieves a good prediction effect, which indicates that the training amount of the pose estimation network provided by the present application is relatively low, i.e. the training cost is low; the scheme of the present application thus effectively solves the technical problem of high model training cost.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (10)

1. A single target pose estimation method, comprising:
acquiring an image to be processed of a target;
learning image features in a receptive field range in the image to be processed to obtain at least one local feature map; the local feature map reflects local features of the image to be processed at different positions;
based on at least one local feature map, establishing a relation between image features exceeding a receptive field range in the image to be processed, and obtaining a global feature map;
obtaining a key point information graph through feature fusion of each local feature graph and the global feature graph;
carrying out pose estimation of the target based on the key point information graph to obtain a pose estimation result; the pose estimation result comprises the key point position of the target in the image to be processed.
2. The method of claim 1, wherein the local feature map is derived from invoking a CNN network; the CNN network includes at least one residual module, at least one convolutional layer, and at least one pooling layer;
The learning of the image features in the receptive field range in the image to be processed to obtain at least one local feature map comprises the following steps:
extracting initial characteristics of the image to be processed through the residual error module to obtain an initial characteristic diagram;
and carrying out secondary feature extraction on the initial feature map through the convolution layer and/or the pooling layer to obtain a plurality of local feature maps.
3. The method of claim 1, wherein the global feature map is derived by invoking a Transformer network;
based on at least one local feature map, establishing a relation between image features exceeding a receptive field range in the image to be processed to obtain a global feature map, wherein the method comprises the following steps:
flattening at least one of the local feature maps into a one-dimensional vector;
and after the one-dimensional vector is subjected to position embedding, inputting it into the Transformer network to learn the dependency relationship of the long-distance pixel points in the image to be processed, so as to obtain the global feature map.
4. The method of claim 3, wherein the subjecting the one-dimensional vector to position embedding and inputting it into the Transformer network to learn the dependency relationship of the long-distance pixel points in the image to be processed to obtain the global feature map comprises:
Distributing a corresponding position embedding vector for each element in the one-dimensional vector, and fusing the position embedding vector with the one-dimensional vector to obtain the one-dimensional vector containing position information;
and inputting the one-dimensional vector containing the position information into the Transformer network, so that the Transformer network learns the relationship between pixel points at different positions in the image to be processed by using the position information, to obtain the global feature map.
5. The method according to claim 1, wherein the obtaining the key point information map through feature fusion between each of the local feature maps and the global feature map includes:
image coupling is carried out on the local feature map and the global feature map, and a target feature map is obtained;
and carrying out convolution operation on the target feature map to obtain the key point information map.
6. The method of claim 1, wherein the keypoint information map comprises a keypoint heat map and a keypoint optimization map;
the step of estimating the gesture of the target based on the key point information graph to obtain a gesture estimation result comprises the following steps:
and identifying the position of the key point of the target in the image to be processed according to the key point heat map and the key point position optimization map, and obtaining the attitude estimation result.
7. The method of claim 6, wherein the identifying the keypoint location of the target in the image to be processed based on the keypoint heat map and the keypoint location optimization map, to obtain the pose estimation result, comprises:
determining a maximum value index of each key point heat map;
searching the numerical value of the key point position optimization graph at the maximum value index;
and obtaining the key point position of the target according to the maximum value index and the numerical value of the key point position optimization graph at the maximum value index.
8. The method of any one of claims 1 to 7, wherein the key point information map further comprises a skeleton map formed by connecting the key points according to a skeleton relationship;
the method further comprises the steps of:
and predicting the category of the target according to the skeleton diagram to obtain the category of the target.
9. A single-target attitude estimation apparatus, characterized by comprising:
the image acquisition module is used for acquiring an image to be processed of the target;
the local feature map acquisition module is used for learning image features in a receptive field range in the image to be processed to obtain at least one local feature map; the local feature map reflects local features of the image to be processed at different positions;
The global feature map acquisition module is used for establishing a connection between image features exceeding the receptive field range in the image to be processed based on at least one local feature map to obtain a global feature map;
the feature fusion module is used for obtaining a key point information graph through feature fusion of each local feature graph and the global feature graph;
the pose estimation module is used for carrying out pose estimation on the target based on the key point information graph to obtain a pose estimation result; the pose estimation result comprises the key point position of the target in the image to be processed.
10. An electronic device, comprising: at least one processor, and at least one memory, wherein,
the memory has computer readable instructions stored thereon;
the computer readable instructions are executed by one or more of the processors to cause an electronic device to implement the single-target pose estimation method according to any of claims 1 to 8.
CN202310781993.XA 2023-06-28 2023-06-28 Single-target attitude estimation method and device and electronic equipment Pending CN116934853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310781993.XA CN116934853A (en) 2023-06-28 2023-06-28 Single-target attitude estimation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310781993.XA CN116934853A (en) 2023-06-28 2023-06-28 Single-target attitude estimation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116934853A true CN116934853A (en) 2023-10-24

Family

ID=88376591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310781993.XA Pending CN116934853A (en) 2023-06-28 2023-06-28 Single-target attitude estimation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116934853A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination