CN116912315A - Human body posture estimation method and device, intelligent terminal and storage medium
- Publication number: CN116912315A
- Application number: CN202310825971.9A
- Authority: CN (China)
- Prior art keywords: event, two-dimensional heat map, preset number, target
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045: Combinations of networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/766: Image or video recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/82: Image or video recognition using neural networks
- G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30196: Human being; person
Abstract
The invention discloses a human body posture estimation method and apparatus, an intelligent terminal, and a storage medium. The method comprises the following steps: building multi-frame event images according to a received event stream; for each frame of event image, processing the event image through a feature extraction model to obtain a preset number of two-dimensional heat maps; processing the preset number of two-dimensional heat maps through a temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map; and determining three-dimensional key points corresponding to the event image according to a regression model and the target two-dimensional heat maps, and determining the human body posture corresponding to the event image according to the three-dimensional key points and their preset attributes. In this method, event images are built from the event stream, a two-dimensional heat map is determined for each frame of event image, and temporal fusion is used to optimize the estimation of the three-dimensional key points, reducing the influence of poor imaging on human body posture estimation and thereby improving its accuracy.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a human body posture estimation method, apparatus, intelligent terminal, and storage medium.
Background
Human body posture estimation (Human Pose Estimation, HPE) is a foundation of computer vision tasks and has broad application prospects in human-computer interaction, mixed reality, and other areas. For each frame of an image in a video sequence, the positions of the human body's joint points can be predicted automatically and accurately by a posture estimation method, so human body posture estimation is one of the key technologies for understanding actions and behaviors in video.
In the related art, human body posture estimation methods are generally based on conventional frame-imaging cameras, such as RGB (Red-Green-Blue) cameras. Specifically, for each frame of the conventional camera's imaging result, the key points of the human body in the image and the position of each key point are estimated, and a skeleton representation of the corresponding body is obtained according to preset key point attributes.
However, in real scenes, insufficient illumination or high-speed motion of the target object degrades the imaging of conventional cameras, which results in poor accuracy of the human body posture estimation methods in the related art.
Disclosure of Invention
The invention provides a human body posture estimation method, a device, an intelligent terminal and a storage medium, which are used for improving the accuracy of human body posture estimation.
In a first aspect, an embodiment of the present invention provides a human body posture estimation method, including:
establishing a multi-frame event image according to a received event stream comprising a plurality of events, wherein the event stream is generated by an event camera;
for each frame of event image, the following steps are performed:
processing the event image through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on a convolutional neural network;
processing the preset number of two-dimensional heat maps through a temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map, wherein the temporal fusion model is built on a long short-term memory network;
and determining three-dimensional key points corresponding to the event image according to a regression model and the target two-dimensional heat maps, and determining the human body posture corresponding to the event image according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is built on a high-resolution network.
In the human body posture estimation method provided by the embodiment of the invention, event images are built from the event stream generated by an event camera; a preset number of two-dimensional heat maps corresponding to each frame of event image is determined through a feature extraction model; the preset number of two-dimensional heat maps is corrected with a temporal fusion model to determine a preset number of target two-dimensional heat maps; and the feature information in the target two-dimensional heat maps is converted into three-dimensional key points by a regression model. This optimizes the estimation of the three-dimensional key points, reduces the influence of environmentally degraded images on human body posture estimation, and thereby improves its accuracy.
In an alternative embodiment, the event includes first information and second information, where the first information is used to characterize coordinates of a pixel corresponding to the event, and the second information is used to characterize a polarity of brightness change of the pixel corresponding to the event;
the establishing a multi-frame event image according to the received event stream comprising a plurality of events comprises:
each time a first preset number of events are received, forming the first preset number of events into a first event set or forming the events received in the same period into the first event set, wherein the period length is a preset duration;
For any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
According to the method, a first event set is determined according to a preset duration or a preset number of events. For each event in a first event set, the target position of the pixel point corresponding to the event in the event image corresponding to that set is determined according to the event's first information, and the pixel value of that pixel point is determined according to the event's second information; a pixel point whose value is the target pixel value is then created at the target position in the event image. The pixel points corresponding to all events in the first event set form the single-frame event image corresponding to that set. In this way, the reliability of the event images built from the event stream is ensured even when the ambient light is too dim or the target object moves at high speed.
In an optional embodiment, the event further includes third information, where the third information is used to characterize a time corresponding to the event;
after forming a first preset number of received events into a first event set, or forming the events received in the same period into the first event set, and before determining, for any one first event set, a target pixel value of a pixel point corresponding to an event in the event image, the method further includes:
determining a second event set taking any one event in the first event set as a starting event based on third information of the events, wherein the second event set comprises a second preset number of events;
for any event in the second event set, calculating a product of a weighting coefficient and the second information to obtain weighted second information corresponding to the event, wherein the weighting coefficient is a larger value in a first preset coefficient and a target coefficient, the target coefficient is a difference value between the second preset coefficient and a time difference, the time difference is an absolute value of a difference value between third information corresponding to the initial event and fourth information corresponding to the event, and the fourth information is determined after regularization processing is performed on the third information corresponding to the event;
and updating the start event in the second event set according to the sum of the weighted second information corresponding to each event in the second event set.
In an alternative embodiment, the second preset number is less than or equal to the first preset number.
According to the method, based on the third information of the events, a second event set is determined with any one event in the first event set as the start event; a weighting coefficient is determined from the first preset coefficient, the second preset coefficient, the third information of the start event, and the fourth information of each event, and the weighted second information of each event is determined from its weighting coefficient. The second information of the start event is then updated according to the sum of the weighted second information of all events in the second event set. Weighting the second information of events according to the distribution of the times corresponding to different events improves the quality of the event image built from them.
In an optional implementation, if the preset number of two-dimensional heat maps correspond to an event image that is neither the first frame nor the second frame, processing the preset number of two-dimensional heat maps through the temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map includes:
based on the preset number of fused two-dimensional heat maps corresponding to the previous frame event image, correcting each of the preset number of two-dimensional heat maps through the temporal fusion model to obtain a target two-dimensional heat map and a fused two-dimensional heat map corresponding one-to-one to each two-dimensional heat map;
wherein the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the current frame event image.
According to the method, for the two-dimensional heat maps corresponding to event images other than the first two frames, the temporal fusion model corrects each two-dimensional heat map according to the preset number of fused two-dimensional heat maps corresponding to the previous frame event image, so as to complement the feature information in the current two-dimensional heat maps and thereby optimize the estimation of the three-dimensional key points.
In an optional implementation, if the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the second frame event image, processing the preset number of two-dimensional heat maps through the temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map includes:
based on the preset number of two-dimensional heat maps corresponding to the first frame event image, correcting each of the preset number of two-dimensional heat maps through the temporal fusion model to obtain a target two-dimensional heat map and a fused two-dimensional heat map corresponding one-to-one to each two-dimensional heat map;
if the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the first frame event image, the method further includes:
taking the preset number of two-dimensional heat maps as the target two-dimensional heat maps.
According to the method, if the preset number of two-dimensional heat maps correspond to the second frame event image, the temporal fusion model corrects each of them according to the preset number of two-dimensional heat maps corresponding to the first frame event image, improving the reliability of the feature information in the resulting target two-dimensional heat maps.
In an alternative embodiment, the determining the three-dimensional keypoints corresponding to the event image according to the regression model and the target two-dimensional heat map includes:
constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
According to the method, human body posture estimation is performed, based on the regression model, on the target vector constructed from the preset number of target two-dimensional heat maps, obtaining the three-dimensional key points corresponding to the event image. Because the feature information in the target two-dimensional heat maps is highly reliable and the regression model maintains high-resolution feature input throughout the estimation process, the accuracy of human body posture estimation can be improved.
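As a non-authoritative illustration of this step, here is a short Python (PyTorch) sketch that stacks the preset number of target heat maps into a target vector and regresses three-dimensional key points. The flatten-and-regress head, the joint count, and the map size are assumptions standing in for the high-resolution-network-based regression model named in the text:

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    """Stack target heat maps (XY, XZ, YZ) and regress 3D key points."""

    def __init__(self, num_joints: int = 13, map_size: int = 64):
        super().__init__()
        in_feats = 3 * num_joints * map_size * map_size  # 3 = preset number
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_feats, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_joints * 3),  # (x, y, z) per key point
        )
        self.num_joints = num_joints

    def forward(self, hm_xy, hm_xz, hm_yz):
        # Target vector: the three target heat maps stacked along a new
        # dimension whose size equals the preset number (here 3).
        target_vec = torch.stack([hm_xy, hm_xz, hm_yz], dim=1)
        return self.head(target_vec).view(-1, self.num_joints, 3)
```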
In a second aspect, an embodiment of the present invention provides a human body posture estimating apparatus, including:
an event processing unit, configured to establish a multi-frame event image according to a received event stream including a plurality of events, where the event stream is generated by an event camera;
the key point estimation unit is used for executing the following steps for each frame of event image: processing the event image through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on a convolutional neural network; processing the preset number of two-dimensional heat maps through a temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map, wherein the temporal fusion model is built on a long short-term memory network; and determining three-dimensional key points corresponding to the event image according to a regression model and the target two-dimensional heat maps, and determining the human body posture corresponding to the event image according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is built on a high-resolution network.
In an alternative embodiment, the event includes first information and second information, where the first information is used to characterize coordinates of a pixel corresponding to the event, and the second information is used to characterize a polarity of brightness change of the pixel corresponding to the event;
The event processing unit is specifically configured to:
each time a first preset number of events are received, forming the first preset number of events into a first event set or forming the events received in the same period into the first event set, wherein the period length is a preset duration;
for any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
In an optional embodiment, the event further includes third information, where the third information is used to characterize a time corresponding to the event;
the event processing unit is further configured to:
determining a second event set taking any one event in the first event set as a starting event based on third information of the events, wherein the second event set comprises a second preset number of events;
for any event in the second event set, calculating the product of a weighting coefficient and the second information to obtain weighted second information corresponding to the event, wherein the weighting coefficient is the larger of a first preset coefficient and a target coefficient, the target coefficient is the difference between the second preset coefficient and a time difference, the time difference is the absolute value of the difference between the third information corresponding to the start event and the fourth information corresponding to the event, and the fourth information is determined by regularizing the third information corresponding to the event;
and updating the start event in the second event set according to the sum of the weighted second information corresponding to each event in the second event set.
In an alternative embodiment, the second preset number is less than or equal to the first preset number.
In an optional implementation, if the preset number of two-dimensional heat maps correspond to an event image that is neither the first frame nor the second frame, the key point estimation unit is specifically configured to:
correct, based on the preset number of fused two-dimensional heat maps corresponding to the previous frame event image, each of the preset number of two-dimensional heat maps through the temporal fusion model, obtaining a target two-dimensional heat map and a fused two-dimensional heat map corresponding one-to-one to each two-dimensional heat map;
wherein the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the current frame event image.
In an optional implementation, if the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the second frame event image, the key point estimation unit is specifically configured to:
correct, based on the preset number of two-dimensional heat maps corresponding to the first frame event image, each of the preset number of two-dimensional heat maps through the temporal fusion model, obtaining a target two-dimensional heat map and a fused two-dimensional heat map corresponding one-to-one to each two-dimensional heat map;
if the preset number of two-dimensional heat maps are the two-dimensional heat maps corresponding to the first frame event image, the key point estimation unit is specifically configured to:
take the preset number of two-dimensional heat maps as the target two-dimensional heat maps.
In an alternative embodiment, the keypoint estimation unit is specifically configured to:
constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
In a third aspect, an embodiment of the present invention provides an intelligent terminal, including:
a memory for storing executable instructions;
a processor, configured to read and execute the executable instructions stored in the memory, so as to implement the steps of the human body posture estimation method according to any one of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer instructions which, when executed by a processor, cause the processor to perform the steps of the human body posture estimation method according to any one of the embodiments of the first aspect.
For the technical effects achievable by the human body posture estimation apparatus disclosed in the second aspect, the intelligent terminal disclosed in the third aspect, and the computer-readable storage medium disclosed in the fourth aspect, refer to the technical effects achievable by the first aspect or its various possible implementations; the details are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a human body posture estimation method based on a frame camera provided in the related art;
Fig. 2 is a workflow diagram of a human body posture estimation method according to an embodiment of the present invention;
Fig. 3 is a workflow diagram for creating an event image from an event stream according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of generating an event image based on an event camera according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of converting an event image into a preset number of two-dimensional heat maps through a feature extraction model according to an embodiment of the present invention;
Fig. 6 is another schematic structural diagram of converting an event image into a preset number of two-dimensional heat maps through a feature extraction model according to an embodiment of the present invention;
Fig. 7 is another schematic structural diagram of converting an event image into a preset number of two-dimensional heat maps through a feature extraction model according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of correcting a two-dimensional heat map through a temporal fusion model according to an embodiment of the present invention;
Fig. 9 is another schematic structural diagram of correcting a two-dimensional heat map through a temporal fusion model according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of converting a two-dimensional heat map into three-dimensional key points through a regression model according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a regression model according to an embodiment of the present invention;
Fig. 12 is a schematic block diagram of a human body posture estimation apparatus according to an embodiment of the present invention;
Fig. 13 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present invention;
Fig. 14 is a schematic diagram of a program product of a human body posture estimation method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Also, in the description of the embodiments of the present invention, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association between associated objects and indicates that three relations may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone. Furthermore, in the description of the embodiments of the present invention, "plural" means two or more.
The terms "first", "second", and the like below are used for descriptive purposes only and shall not be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features.
In the related art, human body posture estimation methods are generally based on conventional frame-imaging cameras; for example, the conventional camera may be an RGB camera and, correspondingly, the image it generates may be an RGB image. Fig. 1 is a schematic structural diagram of a three-dimensional human body posture estimation method based on an RGB camera. First, an RGB image generated by the RGB camera is input into a pretrained convolutional neural network (Convolutional Neural Networks, CNN) that performs two-dimensional posture estimation, yielding a two-dimensional feature map corresponding to the RGB image that contains a plurality of two-dimensional key points. Then, the two-dimensional feature map is input into a regression network, and each two-dimensional key point is mapped to a corresponding three-dimensional key point, so that the human body posture can be determined from the three-dimensional key points.
However, in practical applications, insufficient illumination or high-speed motion of the target object commonly occurs, so the imaging of conventional cameras degrades and, in turn, the accuracy of the human body posture estimation methods in the related art suffers.
Based on the above, the embodiment of the invention provides a human body posture estimation method, a device, an intelligent terminal and a storage medium, so as to improve the accuracy and precision of human body posture estimation.
The following describes a human body posture estimation method provided by the embodiment of the present invention through a specific embodiment, as shown in fig. 2, the method includes the following steps:
step S201, a multi-frame event image is built according to a received event stream comprising a plurality of events, wherein the event stream is generated by an event camera;
The event camera is a bio-inspired camera sensor with microsecond response time. It captures brightness changes caused by the motion of a target object within its field of view: by detecting the brightness change of each pixel point in the receptive field, it generates an event representing the brightness change of that pixel point at the current moment.
Because the event camera generates the corresponding event stream from brightness changes of the target object within its field of view, it can capture sufficient information even when the target object moves at high speed or the ambient illumination is poor, improving the accuracy of three-dimensional key point estimation. In addition, since the event camera responds only to image positions where brightness changes, the event stream it generates is sparse; in application scenarios such as augmented reality (Augmented Reality, AR) and mixed reality (Mixed Reality, MR), this allows the accuracy of human body posture estimation to be improved even with limited computing resources. In summary, the event camera has low power consumption, a wide dynamic range, and high real-time performance, and can obtain information about the target object efficiently and robustly, so the human body posture can be effectively estimated from this information.
In an alternative embodiment, the event includes first information and second information, wherein the first information is used for representing coordinates of a pixel corresponding to the event, and the second information is used for representing a polarity of brightness change of the pixel corresponding to the event;
establishing a multi-frame event image according to a received event stream comprising a plurality of events, comprising:
each time a first preset number of events are received, forming a first event set by the first preset number of events or forming a first event set by the events received in the same period, wherein the period length is a preset duration;
for any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
Illustratively, a first event set may be generated in either of two ways:
mode one:
A first event set is determined according to a fixed number of events; that is, each time a first preset number of events are received, they form a first event set. Specifically, assuming the first preset number is 10, every 10 received events form a first event set: the 1st to 10th received events form one first event set, the 11th to 20th form another, and so on.
It should be noted that, in the embodiment of the present invention, the first preset number may be an empirical value, or may be set according to an actual service requirement, for example, 5000, 7000, etc., which is not limited in the embodiment of the present invention.
Mode two:
A first event set is determined according to a fixed duration; that is, the events received in the same period form a first event set, where the period length is a preset duration. Specifically, assuming the preset duration is 2 ns, all events received within each 2 ns period form a first event set: the 1st to 7th events, received during the first 2 ns period, form one first event set; the 8th to 20th events, received during the second 2 ns period, form another; and so on.
It should be noted that, in the embodiment of the present invention, the preset duration may be an empirical value, or may be set according to an actual service requirement, which is not limited in the embodiment of the present invention.
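To make the two modes concrete, here is a minimal Python sketch of both windowing schemes; the Event structure and the generator interfaces are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class Event:
    x: int    # first information: pixel column
    y: int    # first information: pixel row
    t: float  # third information: timestamp of the event
    p: int    # second information: polarity (+1 brighter, -1 darker)

def sets_by_count(events: Iterable[Event], n: int) -> Iterator[List[Event]]:
    """Mode one: every n received events form one first event set."""
    batch: List[Event] = []
    for e in events:
        batch.append(e)
        if len(batch) == n:
            yield batch
            batch = []

def sets_by_duration(events: Iterable[Event], dt: float) -> Iterator[List[Event]]:
    """Mode two: all events received within one period of length dt form one set."""
    batch: List[Event] = []
    start = None
    for e in events:
        if start is None:
            start = e.t
        if e.t - start >= dt and batch:
            yield batch
            batch, start = [], e.t
        batch.append(e)
    if batch:
        yield batch
```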
In the embodiment of the invention, an event generated by the event camera includes first information representing the coordinates of the pixel point corresponding to the event and second information representing the polarity of the brightness change of that pixel point. For example, assume event 1 corresponds to pixel 25 and represents that the brightness of pixel 25 becomes brighter; the coordinates of pixel 25 are (x, y) = (3, 5) and the polarity of its brightness change is p = +1, so the first information of event 1 is (x1, y1) = (3, 5) and the second information of event 1 is p1 = +1. Assume event 2 corresponds to pixel 31 and represents that the brightness of pixel 31 becomes darker; the coordinates of pixel 31 are (x, y) = (4, 1) and the polarity of its brightness change is p = -1, so the first information of event 2 is (x2, y2) = (4, 1) and the second information of event 2 is p2 = -1.
For example, assume the generated first event set 1 includes event 1, event 2, event 3, and event 4. Taking event 1 as an example: according to its first information (x1, y1) = (3, 5), the target position of the pixel point corresponding to event 1 in the event image 1 corresponding to first event set 1 is determined as (3, 5); based on the second information p1 = +1 of event 1, the target pixel value of that pixel point in event image 1 is determined to be 1; then, at (3, 5) in event image 1, a pixel point with target pixel value 1 is constructed for event 1. Taking event 2 as an example: according to its first information (x2, y2) = (4, 1), the target position of the pixel point corresponding to event 2 in event image 1 is (4, 1); according to the second information p2 = -1 of event 2, the target pixel value of that pixel point is determined to be 0; then, at (4, 1) in event image 1, a pixel point with target pixel value 0 is constructed for event 2. Finally, the pixel points corresponding to all events in first event set 1 are placed on event image 1 to form the single-frame event image 1 corresponding to first event set 1.
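A minimal sketch of this construction, under the assumption (taken from the example above) that a brightening event writes pixel value 1 and a darkening event writes 0; the neutral background value 0.5 and the image size are illustrative choices, not specified by the patent:

```python
import numpy as np

def build_event_image(event_set, height: int, width: int) -> np.ndarray:
    """Build the single-frame event image for one first event set.

    Each event contributes one pixel: its target position comes from the
    first information (x, y), and its target pixel value from the second
    information (1 for polarity +1, 0 for polarity -1). Positions with
    no event keep an assumed neutral background value of 0.5.
    """
    image = np.full((height, width), 0.5, dtype=np.float32)
    for e in event_set:
        image[e.y, e.x] = 1.0 if e.p > 0 else 0.0
    return image
```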
Since an event also includes third information characterizing the time corresponding to the event, in an alternative embodiment, as shown in fig. 3, the following processing is also performed after the first event sets are composed:
Step 301, determining, for any one of the first event sets, a second event set with any one of the first event sets as a start event based on third information of the events, where the second event set includes a second preset number of events;
For example, assuming that the generated first event set 1 includes event 1, event 2, event 3, and event 4, and the second preset number is 2: if event 1 is taken as the start event, then according to the third information of each event in first event set 1, i.e., in order of the times corresponding to the events, event 1 and event 2 form second event set 1; if event 2 is taken as the start event, event 2 and event 3 form second event set 2; and so on.
Step 302: for any event in the second event set, calculating the product of a weighting coefficient and the second information to obtain weighted second information corresponding to the event, where the weighting coefficient is the larger of a first preset coefficient and a target coefficient, the target coefficient is the difference between the second preset coefficient and a time difference, the time difference is the absolute value of the difference between the third information corresponding to the start event and the fourth information corresponding to the event, and the fourth information is determined by regularizing the third information corresponding to the event;
For example, suppose second event set 1 includes event 1 and event 2, where event 1 is the start event, the first preset coefficient is 0, and the second preset coefficient is 1. Let the third information of event 1 be t1 and that of event 2 be t2; let the second information of event 1 be p1 and that of event 2 be p2; let the first information of event 1 be (x1, y1) and that of event 2 be (x2, y2). Taking event 2 as an example: regularizing its third information t2 yields its fourth information t'2. The time difference determined from the third information t1 of the start event (event 1) and the fourth information t'2 of event 2 is |t1 - t'2|, and the target coefficient determined from this time difference and the second preset coefficient 1 is 1 - |t1 - t'2|. The weighting coefficient of event 2 is therefore max[0, (1 - |t1 - t'2|)], i.e., the larger of the first preset coefficient 0 and the target coefficient (1 - |t1 - t'2|). The product of this weighting coefficient and the second information of event 2 is taken as the weighted second information of event 2: p2 × max[0, (1 - |t1 - t'2|)]. Correspondingly, the weighted second information of event 1 is p1 × max[0, (1 - |t1 - t'1|)].
In the embodiment of the present invention, the regularization of the third information may be a spatio-temporal voxel regularization that yields the fourth information corresponding to the event, where a voxel (volume element) is the unit of volume corresponding to a pixel. For example, the fourth information corresponding to event k may be t'k = (B - 1) × (tk - t0) / (tN - t0), where B is a preset voxel interval; tN is the third information of the Nth event in the second event set, i.e., the time corresponding to the Nth event; and t0 is the third information of the start event in the second event set, i.e., the time corresponding to the start event. This is only one possible implementation, and the embodiments of the present invention are not limited to it.
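A small sketch of this regularization step; since the patent's exact formula is only partially legible here, the normalization below is a reconstruction from the stated variables (B, t0, tN) and should be read as an assumption:

```python
def regularize_time(t_k: float, t_0: float, t_N: float, B: int) -> float:
    """Spatio-temporal voxel regularization of an event timestamp.

    Maps the third information t_k of event k into the voxel index range
    [0, B - 1], relative to the start event time t_0 and the time t_N of
    the Nth (last) event in the second event set. Reconstructed formula;
    treat it as an assumption.
    """
    if t_N == t_0:
        return 0.0  # degenerate window: all events share one timestamp
    return (B - 1) * (t_k - t_0) / (t_N - t_0)
```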
Step 303: updating the start event in the second event set according to the sum of the weighted second information corresponding to each event in the second event set;
In a specific implementation, the start event in the second event set is updated according to the sum of the weighted second information corresponding to each event in the second event set; that is, the updated start event may be expressed as V(x, y, t, p'), where p' is the sum, over the N events of the second event set, of their weighted second information: p' = p1 × max[0, (1 - |t - t'1|)] + ... + pN × max[0, (1 - |t - t'N|)]. Here V(x, y, t, p') is the information corresponding to the updated start event: its first information is (x, y), its second information is p', and its third information is t; N is the second preset number.
For example, suppose second event set 1 includes event 1 and event 2, where event 1 is the start event. The weighted second information of event 1 is p1 × max[0, (1 - |t1 - t'1|)], where p1 is the second information of event 1, t1 is its third information, t'1 is its fourth information, and max[0, (1 - |t1 - t'1|)] is its weighting coefficient. The weighted second information of event 2 is p2 × max[0, (1 - |t1 - t'2|)], where p2 is the second information of event 2, t'2 is its fourth information, and max[0, (1 - |t1 - t'2|)] is its weighting coefficient. The updated event 1 can then be expressed as V1(x1, y1, t1, p'1), with p'1 = (p1 × max[0, (1 - |t1 - t'1|)]) + (p2 × max[0, (1 - |t1 - t'2|)]), where (x1, y1) is the first information of the updated event 1, p'1 is its updated second information, and t1 is its third information. A corresponding event image is then built from the updated events.
After this processing is finished, the target pixel value of the pixel point corresponding to the (updated) event in the event image is determined.
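Putting steps 301 to 303 together, a hedged Python sketch of the weighting and update, reusing the Event and regularize_time helpers sketched earlier; the coefficient defaults 0 and 1 follow the example in the text:

```python
def update_start_event(first_set, i: int, n2: int, B: int,
                       c1: float = 0.0, c2: float = 1.0) -> float:
    """Update the second information of the event at index i of a first
    event set, using the second event set of the next n2 events.

    c1 is the first preset coefficient, c2 the second preset coefficient.
    Returns the updated polarity p' of the start event.
    """
    second_set = first_set[i:i + n2]      # step 301: start event plus successors
    t_start = second_set[0].t             # third information of the start event
    t_last = second_set[-1].t             # third information of the Nth event
    p_updated = 0.0
    for e in second_set:                  # step 302: weight each event
        t_reg = regularize_time(e.t, t_start, t_last, B)   # fourth information
        weight = max(c1, c2 - abs(t_start - t_reg))        # weighting coefficient
        p_updated += weight * e.p         # weighted second information
    return p_updated                      # step 303: the sum is the new p'
```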
Optionally, the second preset number is less than or equal to the first preset number.
By setting the second preset number to be less than or equal to the first preset number, all or most of the events forming a second event set belong to the same first event set. Since a single-frame event image is constructed from the events in one first event set, each event used to build the image is processed according to the time distribution of the events in that set, which improves the accuracy of the constructed single-frame event image. Fig. 4 shows a schematic structure of creating an event image from an event stream. As the target object moves, the event camera generates an event stream comprising a plurality of events from the brightness changes caused by the motion, and a corresponding single-frame event image is constructed according to the time distribution information of the events in the stream. In this way, even when the lighting conditions are poor or the target object is in a high-speed motion state, the event camera can still capture enough information about the target object to generate effective event images.
Step S202: for each frame of event image, the following steps are performed: processing the event image through a feature extraction model to obtain a preset number of two-dimensional heat maps, where the feature extraction model is built on a convolutional neural network; processing the preset number of two-dimensional heat maps through a temporal fusion model to obtain a target two-dimensional heat map corresponding one-to-one to each two-dimensional heat map, where the temporal fusion model is built on a long short-term memory network; and determining three-dimensional key points corresponding to the event image according to a regression model and the target two-dimensional heat maps, and determining the human body posture corresponding to the event image according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, where the regression model is built on a high-resolution network.
The embodiment of the invention provides a human body posture estimation method in which event images are built from the event stream generated by an event camera, a preset number of two-dimensional heat maps is determined for each frame of event image, and temporal fusion is used to optimize the estimation of the three-dimensional key points according to the relations between the two-dimensional heat maps. This reduces the influence of environmentally degraded images on the accuracy of human body posture estimation and thereby improves its accuracy and precision.
In a specific implementation, the preset number of two-dimensional heat maps corresponding to each frame of event image is determined by a feature extraction model, which can be built in the following ways:
alternatively, as shown in fig. 5, which is a schematic structural diagram of a feature extraction model, the feature extraction model 50 includes a first feature extraction unit 51, a replacement unit 52, and a second feature extraction unit 53, and any one frame of event image is input into the feature extraction model 50 to perform feature extraction processing, and a preset number of two-dimensional heat maps are output, where the preset number may be 3. Illustratively, in the embodiment of the present invention, the feature extraction model 50 outputs 3 two-dimensional heat maps corresponding to the event image, which are the two-dimensional heat map in the XY direction, the two-dimensional heat map in the XZ direction, and the two-dimensional heat map in the YZ direction, respectively.
Optionally, as shown in fig. 6, which is another schematic structural diagram of the feature extraction model, the first feature extraction unit 51 may be built on a residual network (ResNet); illustratively, the first feature extraction unit 51 is determined based on the backbone network (BackboneNet) module in ResNet-34. The permutation unit 52 may be built from a permutation (Permute) module. The second feature extraction unit 53 may be built on an hourglass network (HourglassNet); illustratively, the second feature extraction unit 53 is determined based on the branch network (BranchNet) module in the hourglass network. Both the residual network and the hourglass network are convolutional neural networks.
Optionally, as shown in fig. 7, which is another schematic structural diagram of the feature extraction model, the first feature extraction unit 51, determined based on the BackboneNet module, includes a first convolution layer, a pooling layer, and second through fifth convolution layers connected in sequence, where the output of the pooling layer is also connected to the input of the fourth convolution layer and the output of the third convolution layer is also connected to the input of the fifth convolution layer. The first feature extraction unit 51 extracts surface features of the event image.

The permutation unit 52, built from permutation (Permute) layers, includes a first permutation layer and a second permutation layer. The inputs of both are connected to the output of the fifth convolution layer; the output of the first permutation layer is connected to the input of the second feature extraction branch in the second feature extraction unit 53, and the output of the second permutation layer is connected to the input of the third feature extraction branch. The permutation unit 52 transposes the output of the first feature extraction unit 51 to switch its dimensions.

The second feature extraction unit 53, determined based on the BranchNet module, includes 3 feature extraction branches with the same structure, each outputting a corresponding two-dimensional heat map. Taking the first feature extraction branch as an example: it includes a sixth convolution layer, a first residual block layer, a seventh convolution layer, a first adder, a second residual block layer, an eighth convolution layer, a second adder, a third residual block layer, and a first integration layer connected in sequence, where the output of the sixth convolution layer is also connected to the input of the first adder, the output of the first adder is also connected to the input of the second adder, the outputs of the first and second residual block layers are also connected to the input of the first integration layer, and the first integration layer outputs the two-dimensional heat map in the XY direction. The second and third feature extraction branches are similar in structure to the first; the second outputs the two-dimensional heat map in the XZ direction and the third outputs the two-dimensional heat map in the YZ direction.
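As a rough, non-authoritative sketch of this backbone-permute-branches layout (PyTorch-style): the channel widths, the joint count, the 3-channel input assumption, and the greatly simplified branch internals are all illustrative stand-ins for the richer structure the text describes:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class FeatureExtractionModel(nn.Module):
    """Backbone + permutation + three branches -> XY / XZ / YZ heat maps."""

    def __init__(self, num_joints: int = 13, feat_ch: int = 512):
        super().__init__()
        # Stand-in for the BackboneNet module: a ResNet-34 trunk (expects a
        # 3-channel input; a single-channel event image can be repeated).
        trunk = resnet34(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])

        def branch() -> nn.Sequential:
            # Stand-in for one BranchNet feature extraction branch.
            return nn.Sequential(
                nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, num_joints, 1),  # one heat map per key point
            )
        self.branch_xy, self.branch_xz, self.branch_yz = branch(), branch(), branch()

    def forward(self, event_image: torch.Tensor):
        f = self.backbone(event_image)  # surface features of the event image
        # Permutation unit: transpose the backbone output before the second
        # and third branches (the patent's two permutation layers apply
        # different dimension switches; this sketch uses one transpose).
        f_t = f.permute(0, 1, 3, 2).contiguous()
        return self.branch_xy(f), self.branch_xz(f_t), self.branch_yz(f_t)
```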
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the first frame event image, the method further includes:
taking a preset number of two-dimensional heat maps as target two-dimensional heat maps;
In a specific implementation, as shown in fig. 8, which is a schematic structural diagram of processing a preset number of two-dimensional heat maps corresponding to different frame event images through a time sequence fusion model, the preset number is set to 3; that is, the preset number of two-dimensional heat maps are the two-dimensional heat map in the XY direction, the two-dimensional heat map in the XZ direction and the two-dimensional heat map in the YZ direction. The two-dimensional heat maps in the XY, XZ and YZ directions corresponding to the first frame event image are directly used as the three target two-dimensional heat maps corresponding to the first frame event image, without being processed by the time sequence fusion model.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the second frame event image, the preset number of two-dimensional heat maps are processed through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, including:
and correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model based on the preset number of two-dimensional heat maps corresponding to the first frame event image to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one.
In a specific implementation, as shown in fig. 8, for the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the second frame event image, the two-dimensional heat maps in the (XY, XZ, YZ) directions are input into the time sequence fusion model; at the same time, the time sequence fusion model receives the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the first frame event image. The model then corrects the two-dimensional heat map in the XY direction corresponding to the second frame event image according to the feature information contained in the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the first frame event image, corrects the two-dimensional heat map in the XZ direction corresponding to the second frame event image in the same way, and likewise corrects the two-dimensional heat map in the YZ direction corresponding to the second frame event image, finally obtaining a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fused two-dimensional heat map corresponding to each two-dimensional heat map one by one. The obtained target two-dimensional heat maps in the (XY, XZ, YZ) directions are input into the later-stage regression model, and the obtained fused two-dimensional heat maps in the (XY, XZ, YZ) directions are used for correcting the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the third frame event image.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to an event image that is neither the first frame nor the second frame, processing the preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one includes:
based on a preset number of fusion two-dimensional heat maps corresponding to the event image of the previous frame, correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the event images of the current frame.
In a specific implementation, as shown in fig. 8, taking the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the third frame event image as an example: the two-dimensional heat maps in the (XY, XZ, YZ) directions are input into the time sequence fusion model; at the same time, the time sequence fusion model receives the fused two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the second frame event image. The two-dimensional heat map in the XY direction corresponding to the third frame event image is corrected according to the feature information contained in the fused two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the second frame event image, and the two-dimensional heat maps in the XZ and YZ directions corresponding to the third frame event image are corrected in the same way, finally obtaining a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fused two-dimensional heat map corresponding to each two-dimensional heat map one by one. The obtained target two-dimensional heat maps in the (XY, XZ, YZ) directions are input into the later-stage regression model, and the obtained fused two-dimensional heat maps in the (XY, XZ, YZ) directions are used for correcting the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the fourth frame event image. The two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the other frame event images are processed similarly.
Optionally, the time sequence fusion model is built based on a long short-term memory network (Long Short-Term Memory, LSTM). Illustratively, fig. 9 shows a schematic structure of processing two-dimensional heat maps by a time sequence fusion model established according to an LSTM. Taking the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the fourth frame event image as an example: as can be seen from fig. 9, Xt denotes the two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the fourth frame event image, Ht denotes the fused two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the third frame event image, Ht+1 denotes the fused two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the fourth frame event image, and Ot denotes the target two-dimensional heat maps in the (XY, XZ, YZ) directions corresponding to the fourth frame event image. Specifically, using the fused two-dimensional heat maps in the (XY, XZ, YZ) directions denoted by Ht, time sequence fusion processing and correction processing are performed on the two-dimensional heat map in the XY direction, the two-dimensional heat map in the XZ direction and the two-dimensional heat map in the YZ direction denoted by Xt, generating the fused two-dimensional heat maps in the (XY, XZ, YZ) directions denoted by Ht+1 and the target two-dimensional heat maps in the (XY, XZ, YZ) directions denoted by Ot.
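As a rough illustration of this recurrence, the sketch below replaces the patent's exact LSTM internals with a simple convolutional gate; the cell structure, the channel counts and the gate shared across directions are assumptions. What it does preserve is the data flow of figs. 8-9: the first frame passes through unchanged, and each later frame's heat map in every direction is corrected using all three fused maps of the previous frame, producing both Ot and Ht+1.

```python
import torch
import torch.nn as nn

class TimeSequenceFusion(nn.Module):
    """Sketch of the temporal fusion recurrence: (Xt, Ht) -> (Ot, Ht+1)."""
    DIRS = ("XY", "XZ", "YZ")

    def __init__(self, num_joints: int = 13):
        super().__init__()
        c = num_joints
        # Input per direction: the current heat map plus the three previous fused maps.
        self.gate = nn.Conv2d(4 * c, c, kernel_size=3, padding=1)
        self.update = nn.Conv2d(4 * c, c, kernel_size=3, padding=1)

    def step(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        """x_t: current-frame map for one direction; h_prev: the three fused maps
        of the previous frame, concatenated on channels. Returns (Ot, Ht+1)."""
        z = torch.cat([x_t, h_prev], dim=1)
        g = torch.sigmoid(self.gate(z))               # how much of the current map to keep
        h_next = g * x_t + (1 - g) * torch.tanh(self.update(z))
        o_t = x_t + h_next                            # corrected (target) heat map
        return o_t, h_next

    def forward(self, frames):
        """frames: list over time of dicts {"XY": map, "XZ": map, "YZ": map}."""
        targets, fused = [], None
        for t, maps in enumerate(frames):
            if t == 0:
                # First frame: its heat maps are used directly as targets, and they
                # also seed the "fused" state used to correct the second frame.
                targets.append(dict(maps))
                fused = dict(maps)
                continue
            h_prev = torch.cat([fused[d] for d in self.DIRS], dim=1)
            out, new_fused = {}, {}
            for d in self.DIRS:
                out[d], new_fused[d] = self.step(maps[d], h_prev)
            targets.append(out)
            fused = new_fused
        return targets
```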
In the embodiment of the invention, the time sequence fusion model corrects the features in the two-dimensional heat maps corresponding to the current frame event image according to the feature information corresponding to the previous frame event image, so as to fill in information missing from the two-dimensional heat maps corresponding to the current frame event image. In this way, the feature information in the three-direction fused two-dimensional heat maps corresponding to the previous frame event image is used to correct the two-dimensional heat map of any direction corresponding to the current frame event image, so that the feature information in the output target two-dimensional heat maps is more accurate.
In an alternative embodiment, determining three-dimensional keypoints corresponding to event images from regression models and target two-dimensional heat maps includes:
constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
In a specific implementation, as shown in fig. 10, the preset number of target two-dimensional heat maps are input into a pre-trained regression model for human body posture estimation, so that a group of three-dimensional key points corresponding to the event image is determined according to the preset number of target two-dimensional heat maps; the human body posture corresponding to the event image is then determined according to the three-dimensional key points and the preset attributes corresponding to the three-dimensional key points, where the preset attributes corresponding to the three-dimensional key points may be the connection relations between the three-dimensional key points, which is not limited in any way in the embodiment of the present invention.
Alternatively, the regression model may be built based on a high-resolution network (High-Resolution Network, HRNet). As shown in fig. 11, which is a structure diagram of the regression model established according to HRNet, each module in the diagram is a convolution layer: the first row contains convolution layers at 1× resolution, the second row contains convolution layers at 2× downsampled resolution, and the third row contains convolution layers at 4× downsampled resolution. Information exchange is repeatedly carried out among the parallel convolution networks of different resolutions by up-sampling/down-sampling, so as to repeatedly fuse features of different scales, and the three-dimensional key points are estimated from the high-resolution representation output by HRNet. In this method, a high-resolution feature representation can be maintained throughout by the HRNet, so that estimation accuracy can be effectively improved.
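The exchange step can be pictured with the toy module below, which keeps two parallel streams (1× and 2×-downsampled) and lets each receive the other, resampled to its own resolution; the channel widths and the use of only two streams instead of fig. 11's three are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """One HRNet-style multi-resolution information exchange (two streams)."""
    def __init__(self, c_high: int = 32, c_low: int = 64):
        super().__init__()
        self.high = nn.Conv2d(c_high, c_high, 3, padding=1)           # 1x-resolution stream
        self.low = nn.Conv2d(c_low, c_low, 3, padding=1)              # 2x-downsampled stream
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)  # high -> low
        self.up = nn.Conv2d(c_low, c_high, 1)                         # low -> high (channels)

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor):
        h, l = self.high(x_high), self.low(x_low)
        # Each stream is fused with the other stream resampled to its resolution,
        # so the high-resolution representation is maintained throughout.
        h = h + F.interpolate(self.up(l), size=h.shape[-2:], mode="bilinear",
                              align_corners=False)
        l = l + self.down(x_high)
        return F.relu(h), F.relu(l)
```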
Illustratively, the target two-dimensional heat maps of the three directions (XY, XZ, YZ) are stacked into a target vector, where the target vector is a three-channel target vector; that is, the target vector can be expressed as $V = (\hat{H}_{XY}, \hat{H}_{XZ}, \hat{H}_{YZ})$. The target vector is input into the regression model established based on HRNet, and features are extracted from the target vector through the three rows of convolution layers with different resolutions to obtain a plurality of features. The regression model in the embodiment of the present invention may perform feature extraction on the target vector through a soft-argmax function, obtaining a plurality of features, as follows:

$c_{XY} = \operatorname{soft-argmax}(\hat{H}_{XY}), \quad c_{XZ} = \operatorname{soft-argmax}(\hat{H}_{XZ}), \quad c_{YZ} = \operatorname{soft-argmax}(\hat{H}_{YZ})$

where $\hat{H}_{XY}$, $\hat{H}_{XZ}$ and $\hat{H}_{YZ}$ are the three channels of the target vector, determined from the target two-dimensional heat maps of the three directions (XY, XZ, YZ) respectively; $c_{XY} = (x_1, y_1)$ is the feature coordinate determined according to $\hat{H}_{XY}$, $c_{XZ} = (x_2, z_1)$ is another feature coordinate determined according to $\hat{H}_{XZ}$, and $c_{YZ} = (y_2, z_2)$ is the other feature coordinate determined according to $\hat{H}_{YZ}$.

Then, the obtained features are processed to obtain the three-dimensional key points used to represent the human body posture; that is, the three-dimensional key point $P = (P_x, P_y, P_z)$ is determined according to the three feature coordinates $c_{XY}$, $c_{XZ}$ and $c_{YZ}$, for example by averaging the two estimates of each axis:

$P_x = \tfrac{1}{2}(x_1 + x_2), \quad P_y = \tfrac{1}{2}(y_1 + y_2), \quad P_z = \tfrac{1}{2}(z_1 + z_2)$

where $P_x$ is the X-axis coordinate value of the determined three-dimensional key point $P$, $P_y$ is its Y-axis coordinate value, and $P_z$ is its Z-axis coordinate value.
The regression model performs human body posture estimation on the target vector constructed from the preset number of target two-dimensional heat maps to obtain the three-dimensional key points corresponding to the event image. Because the regression model maintains high-resolution features throughout the human body posture estimation process, the reliability of the features determined according to the target vector is high, so the accuracy of human body posture estimation can be improved.
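A compact sketch of this readout is given below. The soft-argmax itself is standard; the rule for merging the two estimates of each axis by averaging follows the reconstruction above and is an assumption, as are the heat-map orientations (the horizontal image axis taken as the first coordinate of each plane).

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor):
    """heatmap: (H, W) scores for one joint in one plane; returns sub-pixel (u, v),
    where u indexes the width axis and v the height axis."""
    H, W = heatmap.shape
    p = torch.softmax(heatmap.reshape(-1), dim=0).reshape(H, W)
    v = (p.sum(dim=1) * torch.arange(H, dtype=p.dtype)).sum()  # expected row index
    u = (p.sum(dim=0) * torch.arange(W, dtype=p.dtype)).sum()  # expected column index
    return u, v

def keypoint_from_planes(h_xy, h_xz, h_yz):
    """Combine the three per-plane feature coordinates into one 3D key point."""
    x1, y1 = soft_argmax_2d(h_xy)   # XY plane observes (x, y)
    x2, z1 = soft_argmax_2d(h_xz)   # XZ plane observes (x, z)
    y2, z2 = soft_argmax_2d(h_yz)   # YZ plane observes (y, z)
    # Each axis appears in two planes; average the two estimates.
    return (x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2
```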
Based on the same conception, the embodiment of the invention further provides a human body posture estimation device. Since the device corresponds to the method in the embodiment of the invention, and the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated description is omitted.
As shown in fig. 12, the above device includes the following modules:
an event processing module 1201, configured to establish a multi-frame event image according to a received event stream including a plurality of events, where the event stream is generated by an event camera;
a keypoint estimation module 1202 for performing the following steps for each frame of event image: processing the event images through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on the basis of a convolutional neural network; processing a preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, wherein the time sequence fusion model is established based on a long short-term memory network; and determining three-dimensional key points corresponding to the event images according to a regression model and the target two-dimensional heat map, and determining human body postures corresponding to the event images according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is established based on a high-resolution network.
In an alternative embodiment, the event includes first information and second information, wherein the first information is used for representing coordinates of a pixel corresponding to the event, and the second information is used for representing a polarity of brightness change of the pixel corresponding to the event;
the event processing module 1201 is specifically configured to:
each time a first preset number of events are received, forming a first event set by the first preset number of events or forming a first event set by the events received in the same period, wherein the period length is a preset duration;
for any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
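As a concrete illustration of these steps, the following sketch accumulates one first event set into a single-frame event image. The sensor resolution, the neutral background value and the mapping of polarity to pixel values are assumptions made for the example.

```python
import numpy as np

def build_event_image(events, height: int = 260, width: int = 346) -> np.ndarray:
    """events: iterable of (x, y, polarity) tuples, i.e. the first information
    (pixel coordinates) and second information (polarity of the brightness
    change) of each event in one first event set."""
    image = np.full((height, width), 128, dtype=np.uint8)  # neutral background
    for x, y, polarity in events:
        # First information -> target position; second information -> target pixel value.
        image[y, x] = 255 if polarity > 0 else 0
    return image
```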
In an alternative embodiment, the event further comprises third information, wherein the third information is used for representing the time corresponding to the event;
The event processing module 1201 is further configured to:
determining a second event set taking any one event in the first event set as a starting event based on third information of the events, wherein the second event set comprises a second preset number of events;
for any event in the second event set, calculating the product of a weighting coefficient and second information to obtain weighted second information corresponding to the event, wherein the weighting coefficient is a larger value in a first preset coefficient and a target coefficient, the target coefficient is the difference value between the second preset coefficient and a time difference, the time difference is the absolute value of the difference value between third information corresponding to the initial event and fourth information corresponding to the event, and the fourth information is determined after regularization processing is carried out on the third information corresponding to the event;
and updating the initial event in the second event set according to the sum value of the weighted second information corresponding to each event in the second event set.
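A small sketch of this weighting scheme follows; the coefficient values c1 and c2, the min-max normalization used as the "regularization" of timestamps, and taking the starting event's normalized timestamp on the left side of the difference are all assumptions made to keep the example self-consistent.

```python
import numpy as np

def updated_polarity_of_start_event(events, c1: float = 0.1, c2: float = 1.0) -> float:
    """events: list of (x, y, polarity, t) rows forming one second event set,
    whose first row is the starting event. Returns the sum of the weighted
    second information, used to update the starting event's polarity."""
    ts = np.array([e[3] for e in events], dtype=np.float64)
    # Fourth information: regularized (here min-max normalized) timestamps.
    ts_norm = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)
    t_start = ts_norm[0]
    total = 0.0
    for (_, _, polarity, _), t_n in zip(events, ts_norm):
        time_diff = abs(t_start - t_n)   # time difference to the starting event
        target_coeff = c2 - time_diff    # second preset coefficient minus time difference
        weight = max(c1, target_coeff)   # larger of first preset and target coefficient
        total += weight * polarity       # weighted second information
    return total
```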
In an alternative embodiment, the second predetermined number is less than or equal to the first predetermined number.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to an event image that is neither the first frame nor the second frame, the keypoint estimation module 1202 is specifically configured to:
Based on a preset number of fusion two-dimensional heat maps corresponding to the event image of the previous frame, correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the event images of the current frame.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the second frame event image, the keypoint estimation module 1202 is specifically configured to:
based on a preset number of two-dimensional heat maps corresponding to the first frame event image, respectively correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the first frame event image, the key point estimation unit 1202 is specifically configured to:
taking a preset number of two-dimensional heat maps as target two-dimensional heat maps.
In an alternative embodiment, the keypoint estimation module 1202 is specifically configured to:
Constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
Based on the same conception, the embodiment of the invention further provides an intelligent terminal. Since the intelligent terminal corresponds to the method in the embodiment of the invention, and the principle by which the intelligent terminal solves the problem is similar to that of the method, the implementation of the intelligent terminal may refer to the implementation of the method, and repeated description is omitted.
The intelligent terminal 130 according to this embodiment of the present invention is described below with reference to fig. 13. The intelligent terminal 130 shown in fig. 13 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 13, the intelligent terminal 130 may be in the form of a general purpose computing device, which may be a terminal device, for example. The components of the intelligent terminal 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132 storing instructions executable by the processor 131, and a bus 133 connecting the various system components, including the memory 132 and the processor 131, the processor 131 being a processor of a smart device.
The processor 131 executes executable instructions to implement the following steps:
establishing a multi-frame event image according to a received event stream comprising a plurality of events, wherein the event stream is generated by an event camera;
for each frame of event image, the following steps are performed: processing the event images through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on the basis of a convolutional neural network; processing a preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, wherein the time sequence fusion model is established based on a long short-term memory network; and determining three-dimensional key points corresponding to the event images according to a regression model and the target two-dimensional heat map, and determining human body postures corresponding to the event images according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is established based on a high-resolution network.
In an alternative embodiment, the event includes first information and second information, wherein the first information is used for representing coordinates of a pixel corresponding to the event, and the second information is used for representing a polarity of brightness change of the pixel corresponding to the event;
The processor 131 is specifically configured to:
each time a first preset number of events are received, forming a first event set by the first preset number of events or forming a first event set by the events received in the same period, wherein the period length is a preset duration;
for any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
In an alternative embodiment, the event further comprises third information, wherein the third information is used for representing the time corresponding to the event;
the processor 131 is further configured to:
determining a second event set taking any one event in the first event set as a starting event based on third information of the events, wherein the second event set comprises a second preset number of events;
For any event in the second event set, calculating the product of a weighting coefficient and second information to obtain weighted second information corresponding to the event, wherein the weighting coefficient is a larger value in a first preset coefficient and a target coefficient, the target coefficient is the difference value between the second preset coefficient and a time difference, the time difference is the absolute value of the difference value between third information corresponding to the initial event and fourth information corresponding to the event, and the fourth information is determined after regularization processing is carried out on the third information corresponding to the event;
and updating the initial event in the second event set according to the sum value of the weighted second information corresponding to each event in the second event set.
In an alternative embodiment, the second predetermined number is less than or equal to the first predetermined number.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to an event image that is neither the first frame nor the second frame, the processor 131 is specifically configured to:
based on a preset number of fusion two-dimensional heat maps corresponding to the event image of the previous frame, correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
The preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the event images of the current frame.
In an alternative embodiment, if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the second frame event image, the processor 131 is specifically configured to:
based on a preset number of two-dimensional heat maps corresponding to the first frame event image, respectively correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the first frame event image, the processor 131 is specifically configured to:
taking a preset number of two-dimensional heat maps as target two-dimensional heat maps.
In an alternative embodiment, processor 131 is specifically configured to:
constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
Memory 132 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The intelligent terminal 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the intelligent terminal 130, and/or any device (e.g., router, modem, etc.) that enables the intelligent terminal 130 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 135. Also, the intelligent terminal 130 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 136. As shown, network adapter 136 communicates with other modules of intelligent terminal 130 over bus 133. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with intelligent terminal 130, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to carry out the steps of the modules in the human body posture estimation apparatus according to the various exemplary embodiments of the present disclosure described in the "exemplary method" section above, for example: creating a multi-frame event image from a received event stream comprising a plurality of events, wherein the event stream is generated by an event camera; and, for each frame of event image, performing the following steps: processing the event images through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on the basis of a convolutional neural network; processing a preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, wherein the time sequence fusion model is established based on a long short-term memory network; and determining three-dimensional key points corresponding to the event images according to a regression model and the target two-dimensional heat map, and determining human body postures corresponding to the event images according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is established based on a high-resolution network.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 14, a program product 140 for a human body posture estimation method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that while several modules or sub-modules of the system are mentioned in the detailed description above, such partitioning is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present application. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Furthermore, while the operations of the various modules of the inventive system are depicted in a particular order in the drawings, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain operations may be omitted, multiple operations combined into one operation execution, and/or one operation decomposed into multiple operation executions.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the present application may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Still further, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of the present application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A human body posture estimation method, characterized by comprising:
establishing a multi-frame event image according to a received event stream comprising a plurality of events, wherein the event stream is generated by an event camera;
For each frame of event image, the following steps are performed:
processing the event images through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on the basis of a convolutional neural network;
processing the preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, wherein the time sequence fusion model is established based on a long short-term memory network;
and determining three-dimensional key points corresponding to the event images according to a regression model and the target two-dimensional heat map, and determining human body postures corresponding to the event images according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is established based on a high-resolution network.
2. The method of claim 1, wherein the event comprises first information and second information, wherein the first information is used to characterize coordinates of a pixel corresponding to the event, and the second information is used to characterize a polarity of a brightness change of the pixel corresponding to the event;
the establishing a multi-frame event image according to the received event stream comprising a plurality of events comprises:
Each time a first preset number of events are received, forming the first preset number of events into a first event set or forming the events received in the same period into the first event set, wherein the period length is a preset duration;
for any one first event set, determining a target position of a pixel point corresponding to an event in an event image corresponding to the first event set according to first information of any one event in the first event set, and determining a target pixel value of the pixel point corresponding to the event in the event image according to second information of the event; constructing a pixel point corresponding to the event based on the target pixel value at the target position of the event image;
and according to the pixel points corresponding to all the events in the first event set, establishing a single-frame event image corresponding to the first event set.
3. The method of claim 2, wherein the event further comprises third information, wherein the third information is used to characterize a time corresponding to the event;
after the first preset number of events are formed into a first event set after each time a first preset number of events are received, or the events received in the same period are formed into the first event set, before determining a target pixel value of a pixel point corresponding to the event in the event image for any one of the first event sets, the method further includes:
Determining a second event set taking any one event in the first event set as a starting event based on third information of the events, wherein the second event set comprises a second preset number of events;
for any event in the second event set, calculating a product of a weighting coefficient and the second information to obtain weighted second information corresponding to the event, wherein the weighting coefficient is a larger value in a first preset coefficient and a target coefficient, the target coefficient is a difference value between the second preset coefficient and a time difference, the time difference is an absolute value of a difference value between third information corresponding to the initial event and fourth information corresponding to the event, and the fourth information is determined after regularization processing is performed on the third information corresponding to the event;
and updating the initial event in the second event set according to the sum value of the weighted second information corresponding to each event in the second event set.
4. The method of claim 3, wherein the second predetermined number is less than or equal to the first predetermined number.
5. The method of claim 1, wherein if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the event image of the non-first frame and two-dimensional heat maps corresponding to the event image of the non-second frame, processing the preset number of two-dimensional heat maps through a time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one, comprising:
Based on a preset number of fusion two-dimensional heat maps corresponding to the event image of the previous frame, respectively correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through the time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the event images of the current frame.
6. The method according to any one of claims 1 to 5, wherein if the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the second frame event image, processing the preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, including:
based on a preset number of two-dimensional heat maps corresponding to the first frame event image, respectively correcting each two-dimensional heat map in the preset number of two-dimensional heat maps through the time sequence fusion model to obtain a target two-dimensional heat map corresponding to each two-dimensional heat map one by one and a fusion two-dimensional heat map corresponding to each two-dimensional heat map one by one;
If the preset number of two-dimensional heat maps are two-dimensional heat maps corresponding to the first frame event image, the method further comprises:
and taking the preset number of two-dimensional heat maps as the target two-dimensional heat map.
7. The method of claim 1, wherein the determining three-dimensional keypoints corresponding to the event images from a regression model and the target two-dimensional heat map comprises:
constructing a preset number of target two-dimensional heat maps into target vectors, wherein the dimension of the target vectors is equal to the preset number;
and estimating the human body posture of the target vector based on the regression model to obtain three-dimensional key points corresponding to the event images.
8. A human body posture estimation apparatus, characterized by comprising:
the system comprises an event image construction module, a display module and a display module, wherein the event image construction module is used for establishing multi-frame event images according to received event streams comprising a plurality of events, and the event streams are generated by an event camera;
the keypoint estimation module is used for executing the following steps for each frame of event image: processing the event images through a feature extraction model to obtain a preset number of two-dimensional heat maps, wherein the feature extraction model is built on the basis of a convolutional neural network; processing the preset number of two-dimensional heat maps through a time sequence fusion model to obtain target two-dimensional heat maps corresponding to each two-dimensional heat map one by one, wherein the time sequence fusion model is established based on a long short-term memory network; and determining three-dimensional key points corresponding to the event images according to a regression model and the target two-dimensional heat map, and determining human body postures corresponding to the event images according to the three-dimensional key points and preset attributes corresponding to the three-dimensional key points, wherein the regression model is established based on a high-resolution network.
9. An intelligent terminal, characterized by comprising:
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the steps of the human body posture estimation method according to any one of claims 1-7.
10. A computer readable storage medium, characterized in that instructions in the storage medium, when executed by a processor, enable the processor to perform the steps of the human body posture estimation method according to any one of claims 1-7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310825971.9A | 2023-07-06 | 2023-07-06 | Human body posture estimation method and device, intelligent terminal and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116912315A | 2023-10-20 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |