CN114529983A - Event and video fusion action identification method and device - Google Patents

Event and video fusion action identification method and device

Info

Publication number
CN114529983A
Authority
CN
China
Prior art keywords
event
key point
dimensional
dimensional image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210044281.5A
Other languages
Chinese (zh)
Other versions
CN114529983B (en)
Inventor
高跃
卢嘉轩
万海
赵曦滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210044281.5A priority Critical patent/CN114529983B/en
Priority claimed from CN202210044281.5A external-priority patent/CN114529983B/en
Publication of CN114529983A publication Critical patent/CN114529983A/en
Application granted granted Critical
Publication of CN114529983B publication Critical patent/CN114529983B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an event and video fusion action identification method and device. The method includes: tracking event data to generate an event stream, and acquiring a continuous event track based on the event stream; sampling the event track at equal time intervals and accumulating the samples to obtain an event frame; uniformly sampling an input video to obtain a two-dimensional image, and obtaining a sharpened two-dimensional image based on it; inputting the event frame and the sharpened two-dimensional image into a preset key point prediction network to obtain two-dimensional key point coordinates; converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and inputting the three-dimensional key point coordinates into a preset action recognition network to obtain the action category. This addresses the problems in the related art that the sparse, asynchronous nature of event data makes it difficult to apply deep-learning-based convolutional neural networks directly to feature extraction from event data, and that complex reasoning such as action recognition is difficult to achieve with event data alone.

Description

Event and video fusion action identification method and device
Technical Field
The present application relates to the field of computer vision and action recognition technologies, and in particular, to an event and video fusion action recognition method and apparatus.
Background
Action recognition is the task of determining human actions from a video sequence, and it is widely studied and applied in fields such as intelligent monitoring and behavior detection. Most video-based action recognition methods in the related art consume substantial resources and struggle to protect privacy; to address these problems, event cameras and event data processing methods have been studied in recent years. Compared with a conventional camera, an event camera records data only when the light intensity changes, which greatly reduces data redundancy and improves the privacy of the observed user.
However, in the related art, unlike dense, synchronous video frame data, event data is sparse and asynchronous, which makes it difficult to apply current deep-learning-based convolutional neural networks directly to feature extraction from event data. On the other hand, because event data contains only light-intensity change information, it is difficult to perform complex reasoning, including action recognition, using event data alone.
Summary of the application
The application provides an event and video fusion action recognition method and device, which aim to solve the problems in the related art that the sparse, asynchronous nature of event data makes deep-learning-based convolutional neural networks difficult to apply directly to feature extraction from event data, that event data carries only limited information so that complex reasoning such as action recognition is hard to achieve, and that action recognition struggles to balance the privacy of the observed user with the accuracy of data acquisition.
An embodiment of a first aspect of the present application provides an event and video fusion action identification method, including the following steps: tracking event data, generating an event stream, and acquiring a continuous event track based on the event stream; sampling at equal time intervals on the event track, and accumulating to obtain an event frame; uniformly sampling an input video to obtain a two-dimensional image, and obtaining a sharpened two-dimensional image based on the two-dimensional image; inputting the event frame and the sharpened two-dimensional image into a preset key point prediction network to obtain two-dimensional key point coordinates; converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and inputting the three-dimensional key point coordinates into a preset action recognition network to obtain action categories.
Optionally, in an embodiment of the present application, the sampling and accumulating at equal time intervals on the event track to obtain an event frame includes: cutting the event track at the equal time interval, and dividing to obtain a plurality of event track areas; and accumulating the event data in the event track areas to obtain the event frame with the frame rate greater than a preset frame rate.
Optionally, in an embodiment of the present application, before obtaining the sharpened two-dimensional image based on the two-dimensional image, the method further includes: sharpening the two-dimensional image by using an image sharpening formula, wherein the image sharpening formula is as follows:
[Image sharpening formula, presented as an equation image (BDA0003471537790000021) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
Optionally, in an embodiment of the application, the inputting the event frame and the sharpened two-dimensional image into a preset keypoint prediction network to obtain two-dimensional keypoint coordinates includes: inputting the event frame into a front-end frame of the key point prediction network to obtain a feature vector of the event frame; inputting the sharpened two-dimensional image into a front-end framework of the key point prediction network to obtain a feature vector of the sharpened two-dimensional image; splicing the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector; and inputting the global feature vector into a rear-end framework of the key point prediction network to obtain a key point heat map, and predicting the two-dimensional key point coordinate based on the key point heat map.
Optionally, in an embodiment of the present application, the converting the two-dimensional keypoint coordinates into three-dimensional keypoint coordinates by a coordinate conversion method includes: converting the two-dimensional key point coordinates into initial three-dimensional key point coordinates in a world coordinate system by using a coordinate conversion method; and calculating the average value of the initial three-dimensional key point coordinates to obtain the optimal three-dimensional key point coordinates.
Optionally, in an embodiment of the present application, the inputting the three-dimensional key point coordinates into a preset action recognition network to obtain an action category includes: acquiring speed information and posture information of the optimal three-dimensional key point coordinates according to the optimal three-dimensional key point coordinates and the time of the video frames; inputting the speed information and the posture information of the optimal three-dimensional key point coordinates into the action recognition network to obtain confidence degrees of all candidate actions; and selecting the candidate action with the maximum confidence and identifying the action category accordingly.
An embodiment of a second aspect of the present application provides an event and video fusion action recognition apparatus, including: the fitting module is used for tracking event data, generating an event stream and acquiring a continuous event track based on the event stream; the sampling module is used for sampling on the event track at equal time intervals and accumulating to obtain an event frame; the sharpening module is used for uniformly sampling an input video to obtain a two-dimensional image and obtaining a sharpened two-dimensional image based on the two-dimensional image; the two-dimensional key point prediction module is used for inputting the event frame and the sharpened two-dimensional image into a preset key point prediction network to obtain two-dimensional key point coordinates; the three-dimensional key point prediction module is used for converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and the identification module is used for inputting the three-dimensional key point coordinates into a preset action identification network to obtain action categories.
Optionally, in an embodiment of the present application, the sampling module includes: the partition unit is used for cutting the event track at the equal-time-length interval and dividing the event track into a plurality of event track areas; and the accumulation unit is used for accumulating the event data in the event track areas to obtain the event frames with the frame rate higher than a preset frame rate.
Optionally, in an embodiment of the present application, the sharpening module is further configured to sharpen the two-dimensional image by using an image sharpening formula, where the image sharpening formula is:
[Image sharpening formula, presented as an equation image (BDA0003471537790000031) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
Optionally, in an embodiment of the present application, the two-dimensional keypoint prediction module includes: a first obtaining unit, configured to input the event frame into a front-end frame of the keypoint prediction network, and obtain a feature vector of the event frame; a second obtaining unit, configured to input the sharpened two-dimensional image into a front end framework of the keypoint prediction network, and obtain a feature vector of the sharpened two-dimensional image; the splicing unit is used for splicing the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector; and the prediction unit is used for inputting the global feature vector into a rear-end framework of the key point prediction network to obtain a key point heat map and predicting the two-dimensional key point coordinate based on the key point heat map.
Optionally, in an embodiment of the present application, the three-dimensional keypoint prediction module includes: the conversion unit is used for converting the two-dimensional key point coordinates into initial three-dimensional key point coordinates in a world coordinate system by using a coordinate conversion method; and the calculating unit is used for calculating the average value of the initial three-dimensional key point coordinates to obtain the optimal three-dimensional key point coordinates.
Optionally, in an embodiment of the present application, the identification module includes: the third acquisition unit is used for acquiring the speed information and the posture information of the optimal three-dimensional key point coordinate according to the optimal three-dimensional key point coordinate and the time of the video frame; the input unit is used for inputting the speed information and the posture information of the optimal three-dimensional key point coordinate into the action recognition network to obtain the confidence degrees of all candidate actions; and the identification unit is used for selecting the candidate action with the maximum confidence coefficient and identifying the action type.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the event and video fusion action recognition method according to the foregoing embodiments.
A fourth aspect of the present application provides a computer-readable storage medium storing computer instructions for causing a computer to execute the event and video fusion action recognition method according to the foregoing embodiments.
According to the embodiments of the application, event data features are extracted and transformed to obtain the action category, and event data and video data are fused for joint action recognition. The intrinsic information of both kinds of data can be fully exploited and complementarily fused, so that high-precision, low-energy-consumption human action recognition can be achieved while the privacy of the observed user is protected. This solves the problems in the related art that the sparse, asynchronous nature of event data makes deep-learning-based convolutional neural networks difficult to apply directly to feature extraction from event data, that event data carries only limited information so that complex reasoning such as action recognition is hard to achieve, and that action recognition struggles to balance the privacy of the observed user with the accuracy of data acquisition.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an event and video fusion action recognition method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating an event and video fusion action recognition method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an event and video fusion motion recognition apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application.
The following describes an event and video fusion action recognition method and apparatus according to embodiments of the present application with reference to the drawings. In the event and video fusion action recognition method, event data features are extracted and transformed to obtain the action category, and event data and video data are fused for joint action recognition. The intrinsic information of both kinds of data can be fully exploited and complementarily fused, so that high-precision, low-energy-consumption human action recognition can be achieved while the privacy of the observed user is protected. This solves the problems in the related art that the sparse, asynchronous nature of event data makes deep-learning-based convolutional neural networks difficult to apply directly to feature extraction from event data, that event data carries only limited information so that complex reasoning such as action recognition is hard to achieve, and that action recognition struggles to balance the privacy of the observed user with the accuracy of data acquisition.
Specifically, fig. 1 is a schematic flowchart of an event and video fusion action identification method according to an embodiment of the present disclosure.
As shown in fig. 1, the method for recognizing the event and video fusion action includes the following steps:
in step S101, tracking event data, generating an event stream, and acquiring a continuous event track based on the event stream; sampling at equal time intervals on the event track, and accumulating to obtain an event frame.
Specifically, according to the embodiment of the application, event data can be tracked by using an event tracking algorithm to generate an event stream, a continuous event track is obtained by using a fitting algorithm, sampling is performed on the event track at equal time intervals, and event frames with high frame rates are obtained through accumulation, so that the accuracy of the obtained event information is improved, and the subsequent action identification judgment result is more accurate.
The event tracking algorithm is an event tracking algorithm based on asynchronous photometric features, the fitting algorithm is a fitting algorithm based on a B-spline curve, the event trajectory is a curve set reflecting the motion of each event point, and the sampling duration interval on the event trajectory in the embodiment of the present application can be adaptively adjusted by a person skilled in the art according to actual conditions, which is not specifically limited herein.
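To make the trajectory-fitting step concrete, the sketch below fits a parametric B-spline to the (x, y, t) points of one tracked event feature using SciPy. The smoothing factor, the time-based parameterization, and the function names are illustrative assumptions rather than the patent's exact tracking or fitting algorithm.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def fit_event_trajectory(events, smooth=1.0):
    """Fit a B-spline curve to one tracked event stream.

    events: (N, 3) array of (x, y, t) points for a single tracked feature,
    sorted by timestamp (duplicate timestamps removed so the parameter is
    strictly increasing). Returns a callable mapping timestamps to (x, y)
    positions on the fitted, continuous event trajectory.
    """
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    # Parameterize the spline by normalized time so that later
    # equal-duration sampling maps directly onto the curve parameter.
    u = (t - t[0]) / (t[-1] - t[0])
    tck, _ = splprep([x, y], u=u, s=smooth)

    def trajectory(query_t):
        uq = (np.asarray(query_t, dtype=float) - t[0]) / (t[-1] - t[0])
        xq, yq = splev(np.clip(uq, 0.0, 1.0), tck)
        return np.stack([xq, yq], axis=-1)

    return trajectory
```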
Optionally, in an embodiment of the present application, sampling at equal time intervals on an event track, and accumulating to obtain an event frame includes: cutting event tracks at equal-time-length intervals, and dividing to obtain a plurality of event track areas; and accumulating the event data in the event track areas to obtain event frames with the frame rate greater than the preset frame rate.
It can be understood that, in the embodiment of the present application, the event track can be cut at equal-duration intervals to divide it into a plurality of event track areas, and the event data in these areas can be accumulated to obtain a high-frame-rate event frame. The event frame in the embodiment of the present application can be a two-dimensional, image-like feature representation that can serve as the input of a mainstream convolutional neural network, thereby avoiding the difficulty of processing asynchronous data directly.
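As an illustration of the equal-duration accumulation described above, the following sketch bins an event stream into a fixed number of windows and sums event polarities per pixel. The array layout and the signed-polarity handling are assumptions made for the example, not the patent's exact event-frame representation.

```python
import numpy as np

def accumulate_event_frames(events, t_start, t_end, num_frames, height, width):
    """Cut the event stream into equal-duration windows and accumulate each
    window into a two-dimensional, image-like event frame.

    events: (N, 4) array of (x, y, t, polarity) with polarity in {-1, +1}.
    Returns an array of shape (num_frames, height, width).
    """
    frames = np.zeros((num_frames, height, width), dtype=np.float32)
    dt = (t_end - t_start) / num_frames
    # Index of the window each event falls into.
    win = np.clip(((events[:, 2] - t_start) / dt).astype(int), 0, num_frames - 1)
    # Accumulate signed polarities at each (window, row, column) position.
    np.add.at(frames,
              (win, events[:, 1].astype(int), events[:, 0].astype(int)),
              events[:, 3])
    return frames
```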
In step S102, an input video is uniformly sampled to obtain a two-dimensional image, and a sharpened two-dimensional image is obtained based on the two-dimensional image.
Specifically, in the embodiment of the application, the group of two-dimensional images obtained by uniformly sampling the input video can be further processed by an image sharpening algorithm to obtain sharpened two-dimensional images. The sharpened images make the human actions clearer and lay a foundation for accurate subsequent action recognition. By fusing the event data and the video data for joint action recognition, the intrinsic information of both kinds of data can be fully exploited and complementarily fused, realizing high-precision, low-energy-consumption human action recognition.
Optionally, in an embodiment of the present application, before obtaining the sharpened two-dimensional image based on the two-dimensional image, the method further includes: sharpening the two-dimensional image by utilizing an image sharpening formula, wherein the image sharpening formula is as follows:
[Image sharpening formula, presented as an equation image (BDA0003471537790000051) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
It should be understood by those skilled in the art that the image sharpening algorithm in the embodiment of the present application may be an algorithm for eliminating image blur, and is not particularly limited, so as to sharpen a two-dimensional image by an image sharpening formula; wherein the image sharpening formula may be:
[Image sharpening formula, presented as an equation image (BDA0003471537790000061) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
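Because the sharpening formula itself appears only as an equation image, the sketch below follows the widely used event-based double-integral (EDI) style of deblurring, which is consistent with the variables described above (I(k), T, and E(t)). The contrast threshold c and the exponential form are assumptions and may differ from the patent's exact formula.

```python
import numpy as np

def sharpen_frame(blurry_frame, event_frames_in_exposure, c=0.2):
    """Event-guided sharpening of one uniformly sampled video frame.

    blurry_frame: (H, W) intensity image I(k).
    event_frames_in_exposure: (M, H, W) event frames E(t) accumulated inside
        this frame's exposure window of length T, in temporal order.
    c: assumed contrast threshold of the event camera.
    """
    # Running event sum approximates the log-intensity change over the exposure.
    cumulative = np.cumsum(event_frames_in_exposure, axis=0)
    # Average of exp(c * E(t)) over the exposure (the "double integral" term);
    # dividing the blurry frame by it yields the sharpened image L(t_k).
    denom = np.mean(np.exp(c * cumulative), axis=0)
    return blurry_frame / np.maximum(denom, 1e-6)
```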
In step S103, the event frame and the sharpened two-dimensional image are input to a preset key point prediction network, so as to obtain two-dimensional key point coordinates.
As a possible implementation manner, the keypoint prediction network in the embodiment of the present application is a convolutional neural network, and the convolutional neural network includes: a front-end framework of the keypoint prediction network and a back-end framework of the keypoint prediction network.
The front-end framework of the key point prediction network specifically includes a convolutional layer, a pooling layer, and an activation layer; the back-end framework of the key point prediction network, which outputs the key point heat map, specifically includes a transposed convolutional layer and an activation layer.
Optionally, in an embodiment of the present application, inputting the event frame and the sharpened two-dimensional image into a preset keypoint prediction network to obtain two-dimensional keypoint coordinates, where the method includes: inputting the event frame into a front-end frame of a key point prediction network to obtain a feature vector of the event frame; inputting the sharpened two-dimensional image into a front-end framework of a key point prediction network to obtain a feature vector of the sharpened two-dimensional image; splicing the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector; and inputting the global feature vector into a rear-end framework of the key point prediction network to obtain a key point heat map, and predicting two-dimensional key point coordinates based on the key point heat map.
The following examples are set forth to illustrate: firstly, the embodiment of the application can input an event frame into a front-end frame of a key point prediction network so as to obtain a feature vector of the event frame; secondly, the sharpened two-dimensional image can be input into a front-end framework of a key point prediction network, and then a feature vector of the sharpened two-dimensional image is obtained; thirdly, splicing the obtained feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector; finally, the global feature vector is input into a rear-end framework of the key point prediction network to obtain a key point heat map, and the predicted two-dimensional key point coordinate is obtained. Furthermore, the feature vector of the event frame and the feature vector of the two-dimensional image are spliced to form a global feature vector, so that complementary fusion of event data and video data is guaranteed.
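The PyTorch sketch below mirrors the structure just described: two convolutional front ends (one for the event frame, one for the sharpened image), concatenation into a global feature vector, and a transposed-convolution back end that outputs keypoint heatmaps from which two-dimensional coordinates are read off. The channel counts, layer depths, and number of keypoints are illustrative assumptions, not the patent's actual configuration.

```python
import torch
import torch.nn as nn

class FusionKeypointNet(nn.Module):
    """Two-branch keypoint prediction sketch: event branch + image branch,
    feature splicing, transposed-convolution back end producing heatmaps."""

    def __init__(self, num_keypoints=17):
        super().__init__()

        def front_end(in_ch):
            # Convolution + pooling + activation, as described in the text.
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )

        self.event_branch = front_end(in_ch=1)   # event frame input
        self.image_branch = front_end(in_ch=3)   # sharpened RGB frame input
        # Transposed convolution + activation back end producing heatmaps.
        self.back_end = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, num_keypoints, 4, stride=2, padding=1),
        )

    def forward(self, event_frame, sharp_image):
        f_event = self.event_branch(event_frame)
        f_image = self.image_branch(sharp_image)
        fused = torch.cat([f_event, f_image], dim=1)   # global feature vector
        heatmaps = self.back_end(fused)                # (B, K, H, W) heatmaps
        # 2D keypoint coordinates: location of each heatmap's maximum.
        b, k, h, w = heatmaps.shape
        flat_idx = heatmaps.flatten(2).argmax(dim=2)
        coords = torch.stack([flat_idx % w, flat_idx // w], dim=-1)
        return heatmaps, coords
```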
In step S104, converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and inputting the three-dimensional key point coordinates into a preset action recognition network to obtain the action category.
It can be understood that, in the embodiment of the present application, the two-dimensional key point coordinates may be converted into the three-dimensional key point coordinates by a coordinate conversion method, and the three-dimensional key point coordinates obtained through the conversion are input to the action recognition network, so as to obtain the action category. According to the embodiment of the application, the two-dimensional key points are converted into the three-dimensional key points, so that the acquired data information is converted into a three-dimensional key point from a plane, and the action judgment is more convenient.
Optionally, in an embodiment of the present application, converting the two-dimensional keypoint coordinates into three-dimensional keypoint coordinates by a coordinate conversion method includes: converting the two-dimensional key point coordinates into initial three-dimensional key point coordinates in a world coordinate system by using a coordinate conversion method; and calculating the average value of the initial three-dimensional key point coordinates to obtain the optimal three-dimensional key point coordinates.
For example, in the coordinate transformation method according to the embodiment of the present application, the two-dimensional keypoint coordinates may be transformed into three-dimensional keypoint coordinates in a world coordinate system, and an average value of the calculated three-dimensional keypoint coordinates may be used as an optimal three-dimensional keypoint coordinate.
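A minimal sketch of one possible coordinate conversion is shown below, assuming a standard pinhole camera with known intrinsics and extrinsics and a per-keypoint depth estimate; the patent does not disclose its exact conversion, so these inputs and the averaging over several initial estimates are assumptions made for illustration.

```python
import numpy as np

def to_world_3d(kp2d, depth, K, R, t):
    """Back-project one set of 2D keypoints into world coordinates.

    kp2d:  (J, 2) pixel coordinates of J keypoints.
    depth: (J,)   assumed depth of each keypoint along the camera z-axis.
    K, R, t: camera intrinsics, rotation, and translation (X_cam = R X_world + t).
    """
    ones = np.ones((kp2d.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([kp2d, ones]).T).T   # normalized rays
    cam_pts = rays * depth[:, None]                           # camera frame
    world_pts = (R.T @ (cam_pts - t).T).T                     # world frame
    return world_pts

def optimal_3d(candidate_world_pts):
    """Average several initial 3D estimates into the 'optimal' 3D keypoints."""
    return np.mean(np.stack(candidate_world_pts, axis=0), axis=0)
```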
Optionally, in an embodiment of the present application, inputting the three-dimensional key point coordinates into a preset action recognition network to obtain an action category, where the action category includes: acquiring speed information and posture information of the optimal three-dimensional key point coordinate according to the optimal three-dimensional key point coordinate and the time of the video frame; inputting the speed information and the posture information of the optimal three-dimensional key point coordinate into an action recognition network to obtain confidence coefficients of all candidate actions; and selecting the candidate action with the maximum confidence coefficient, and identifying to obtain the action category.
According to the method and the device, the speed information and the posture information of the optimal three-dimensional key point coordinate can be obtained according to the optimal three-dimensional key point coordinate obtained through calculation and the time of the video frame, the speed information and the posture information of the optimal three-dimensional key point coordinate are input into the action recognition network, the confidence degrees of all candidate actions are further obtained, and the candidate action with the maximum confidence degree is selected from all candidate actions to carry out action type judgment. According to the method and the device, the candidate action with the maximum confidence coefficient is obtained and selected through the optimal three-dimensional key point coordinate, the speed information and the posture information of the three-dimensional key point coordinate, and the final action recognition result is enabled to be more accurate.
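The following sketch shows how the velocity and posture features might be assembled from the optimal three-dimensional keypoints and the video frame times and passed to an action recognition network that outputs one confidence per candidate action. The feature layout, the network interface, and the softmax over logits are assumptions made for the example.

```python
import numpy as np
import torch

def classify_action(kp3d_sequence, frame_times, recognition_net, action_names):
    """Pick the candidate action with the highest confidence.

    kp3d_sequence: (F, J, 3) optimal 3D keypoints over F video frames.
    frame_times:   (F,) timestamps of those frames.
    recognition_net: assumed torch module mapping features to action logits.
    """
    dt = np.diff(frame_times)[:, None, None]               # (F-1, 1, 1)
    velocity = np.diff(kp3d_sequence, axis=0) / dt          # speed information
    pose = kp3d_sequence[1:]                                # posture information
    features = np.concatenate([pose, velocity], axis=-1)    # (F-1, J, 6)

    with torch.no_grad():
        logits = recognition_net(torch.from_numpy(features).float().unsqueeze(0))
        confidences = torch.softmax(logits, dim=-1).squeeze(0)  # per-action score
    best = int(confidences.argmax())
    return action_names[best], confidences[best].item()
```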
The principle of the event and video fusion action recognition method according to a specific embodiment of the present application is described in detail below with reference to fig. 2.
As shown in fig. 2, the embodiment of the present application includes the following steps:
step S201: and fitting the event track. According to the method and the device, the event data are tracked by using the event tracking algorithm through the fitting module, the event stream is generated, and the continuous event track is obtained by using the fitting algorithm. The event tracking algorithm is based on asynchronous photometric characteristics; the fitting algorithm is based on a B spline curve; an event trajectory is a set of curves that reflect the motion of each event point.
Step S202: event frame sampling. In the embodiment of the application, the sampling module samples the acquired event track at equal time intervals and accumulates the samples to obtain a high-frame-rate event frame. Specifically, the event track is cut at equal-duration intervals, a series of event track areas are obtained by division, and the event data in these areas are accumulated to obtain the high-frame-rate event frame, where the event frame is a two-dimensional, image-like feature representation. In addition, the group of two-dimensional images obtained by uniformly sampling the input video can be further sharpened by an image sharpening algorithm; the sharpened two-dimensional images make the human actions clearer and lay a foundation for accurate subsequent action recognition.
Step S203: and (5) video sampling and processing. According to the embodiment of the application, after a group of two-dimensional images are obtained by uniformly sampling an input video through a sharpening module, a group of sharpened two-dimensional images are obtained by using an image sharpening algorithm. The image sharpening algorithm used in the embodiment of the application is an algorithm for eliminating image blur, the two-dimensional image is sharpened through an image sharpening formula, and the image sharpening formula is as follows:
[Image sharpening formula, presented as an equation image (BDA0003471537790000081) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame. According to the embodiment of the application, the event data and the video data are fused to perform action recognition together, the internal information of the event data and the video data can be fully utilized, and the two data are subjected to complementary fusion, so that the human body action recognition with high precision and low energy consumption is realized.
Step S204: two-dimensional key point prediction. In the embodiment of the application, the obtained event frame and the sharpened two-dimensional image are input into the key point prediction network through the two-dimensional key point prediction module to obtain the coordinates of the two-dimensional key points. Specifically, the key point prediction network used in the embodiment of the present application is a convolutional neural network, which specifically includes a front-end framework and a back-end framework. The front-end framework of the key point prediction network specifically includes a convolutional layer, a pooling layer, and an activation layer; the back-end framework, which outputs the key point heat map, specifically includes a transposed convolutional layer and an activation layer.
Further, the two-dimensional keypoint prediction module of the embodiment of the application first inputs the event frame into the front-end framework of the keypoint prediction network to obtain the feature vector of the event frame; second, it inputs the sharpened two-dimensional image into the front-end framework of the keypoint prediction network to obtain the feature vector of the sharpened two-dimensional image; third, it splices the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector; finally, it inputs the global feature vector into the back-end framework of the keypoint prediction network to obtain a keypoint heat map and thereby the predicted two-dimensional keypoint coordinates. Splicing the feature vector of the event frame with that of the sharpened two-dimensional image to form the global feature vector guarantees the complementary fusion of event data and video data.
Step S205: and (4) predicting three-dimensional key points. The three-dimensional key point prediction module in the embodiment of the present application converts a two-dimensional key point coordinate into a three-dimensional key point coordinate by using a coordinate conversion method, and specifically, in the embodiment of the present application, the two-dimensional key point coordinate is converted into a three-dimensional key point coordinate in a world coordinate system by using the coordinate conversion method, and an average value of the three-dimensional key point coordinate is calculated as an optimal three-dimensional key point coordinate. According to the embodiment of the application, the two-dimensional key points are converted into the three-dimensional key points, the acquired data information is converted into a three-dimensional key point from a plane, and the action judgment is facilitated.
Step S206: and judging the action type. The identification module of the embodiment of the application acquires the speed information and the posture information of the optimal three-dimensional key point coordinate according to the optimal three-dimensional key point coordinate and the time of the video frame, inputs the speed information and the posture information of the optimal three-dimensional key point coordinate into the action identification network, acquires the confidence degrees of all candidate actions, and selects the action with the maximum confidence degree of all candidate actions to acquire the identified action category. And obtaining and selecting the candidate action with the maximum confidence coefficient according to the optimal three-dimensional key point coordinate, the speed information and the posture information thereof, so that the final action recognition result has higher accuracy.
According to the event and video fusion action recognition method provided by the embodiment of the application, event data features are extracted and transformed to obtain the action category, and event data and video data are fused for joint recognition. The intrinsic information of both kinds of data can be fully exploited and complementarily fused, so that high-precision, low-energy-consumption human action recognition can be achieved while the privacy of the observed user is protected. This solves the problems in the related art that the sparse, asynchronous nature of event data makes deep-learning-based convolutional neural networks difficult to apply directly to feature extraction from event data, that event data carries only limited information so that complex reasoning such as action recognition is hard to achieve, and that action recognition struggles to balance the privacy of the observed user with the accuracy of data acquisition.
Next, an event and video fusion motion recognition apparatus according to an embodiment of the present application will be described with reference to the drawings.
Fig. 3 is a block diagram illustrating an event and video fusion motion recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 3, the event and video fusion motion recognition apparatus 10 includes: fitting module 100, sampling module 200, sharpening module 300, two-dimensional keypoint prediction module 400, three-dimensional keypoint prediction module 500, and identification module 600.
Specifically, the fitting module 100 is configured to track event data, generate an event stream, and obtain a continuous event trajectory based on the event stream.
And the sampling module 200 is configured to sample the event track at equal time intervals and accumulate the sample to obtain an event frame.
The sharpening module 300 is configured to uniformly sample an input video to obtain a two-dimensional image, and obtain a sharpened two-dimensional image based on the two-dimensional image.
And a two-dimensional key point prediction module 400, configured to input the event frame and the sharpened two-dimensional image into a preset key point prediction network, so as to obtain two-dimensional key point coordinates.
And a three-dimensional key point prediction module 500, configured to convert the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method.
The recognition module 600 is configured to input the three-dimensional key point coordinates into a preset action recognition network to obtain an action category.
Optionally, in an embodiment of the present application, the sampling module 200 includes: a partition unit and an accumulation unit.
The partitioning unit is used for cutting the event track at equal-time-length intervals and partitioning the event track into a plurality of event track areas.
And the accumulation unit is used for accumulating the event data in the event track areas to obtain an event frame with a frame rate greater than a preset frame rate.
Optionally, in an embodiment of the present application, the sharpening module 300 is further configured to sharpen the two-dimensional image by using an image sharpening formula, where the image sharpening formula is:
[Image sharpening formula, presented as an equation image (BDA0003471537790000091) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
Optionally, in an embodiment of the present application, the two-dimensional keypoint prediction module 400 includes: a first acquisition unit, a second acquisition unit, a splicing unit, and a prediction unit.
The first obtaining unit is used for inputting the event frame into a front-end frame of the key point prediction network to obtain a feature vector of the event frame.
And the second acquisition unit is used for inputting the sharpened two-dimensional image into a front-end framework of the key point prediction network to obtain a feature vector of the sharpened two-dimensional image.
And the splicing unit is used for splicing the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector.
And the prediction unit is used for inputting the global feature vector into a rear-end framework of the key point prediction network to obtain a key point heat map and predicting the two-dimensional key point coordinates based on the key point heat map.
Optionally, in an embodiment of the present application, the three-dimensional keypoint prediction module 500 includes: a conversion unit and a calculation unit.
The conversion unit is used for converting the two-dimensional key point coordinates into initial three-dimensional key point coordinates in a world coordinate system by using a coordinate conversion method.
And the calculating unit is used for calculating the average value of the initial three-dimensional key point coordinates to obtain the optimal three-dimensional key point coordinates.
Optionally, in an embodiment of the present application, the identifying module 600 includes: the device comprises a third acquisition unit, an input unit and a recognition unit.
And the third acquisition unit is used for acquiring the speed information and the posture information of the optimal three-dimensional key point coordinate according to the optimal three-dimensional key point coordinate and the time of the video frame.
And the input unit is used for inputting the speed information and the posture information of the optimal three-dimensional key point coordinate into the action recognition network to obtain the confidence degrees of all candidate actions.
And the identification unit is used for selecting the candidate action with the maximum confidence and identifying the action category accordingly.
It should be noted that the above explanation of the embodiment of the method for identifying an event and video fusion action is also applicable to the device for identifying an event and video fusion action according to this embodiment, and is not repeated herein.
According to the event and video fusion action recognition device provided by the embodiment of the application, event data features are extracted and transformed to obtain the action category, and event data and video data are fused for joint recognition. The intrinsic information of both kinds of data can be fully exploited and complementarily fused, so that high-precision, low-energy-consumption human action recognition can be achieved while the privacy of the observed user is protected. This solves the problems in the related art that the sparse, asynchronous nature of event data makes deep-learning-based convolutional neural networks difficult to apply directly to feature extraction from event data, that event data carries only limited information so that complex reasoning such as action recognition is hard to achieve, and that action recognition struggles to balance the privacy of the observed user with the accuracy of data acquisition.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
memory 401, processor 402, and computer programs stored on memory 401 and executable on processor 402.
The processor 402 executes the program to implement the event and video fusion motion recognition method provided in the above embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing computer programs executable on the processor 402.
The memory 401 may comprise a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402 and the communication interface 403 are implemented independently, the communication interface 403, the memory 401 and the processor 402 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Alternatively, in practical implementation, if the memory 401, the processor 402 and the communication interface 403 are integrated on a chip, the memory 401, the processor 402 and the communication interface 403 may complete communication with each other through an internal interface.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the above method for recognizing actions of event and video fusion.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a programmable gate array (PGA), a field programmable gate array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. An event and video fusion action recognition method is characterized by comprising the following steps:
tracking event data, generating an event stream, and acquiring a continuous event track based on the event stream;
sampling at equal time intervals on the event track, and accumulating to obtain an event frame;
uniformly sampling an input video to obtain a two-dimensional image, and obtaining a sharpened two-dimensional image based on the two-dimensional image;
inputting the event frame and the sharpened two-dimensional image into a preset key point prediction network to obtain two-dimensional key point coordinates;
converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and
and inputting the three-dimensional key point coordinates into a preset action recognition network to obtain action categories.
2. The method of claim 1, wherein sampling at equal time intervals on the event trace and accumulating to obtain an event frame comprises:
cutting the event track at the equal time interval, and dividing to obtain a plurality of event track areas;
and accumulating the event data in the event track areas to obtain the event frame with the frame rate greater than a preset frame rate.
3. The method of claim 1, further comprising, prior to obtaining a sharpened two-dimensional image based on the two-dimensional image:
sharpening the two-dimensional image by using an image sharpening formula, wherein the image sharpening formula is as follows:
[Image sharpening formula, presented as an equation image (FDA0003471537780000011) in the original publication]
wherein L(t_k) is the sharpened two-dimensional image obtained for the time interval [t_k, t_{k+1}], I(k) is the k-th two-dimensional image, T is the exposure time of the camera, and E(t) is the event frame.
4. The method according to claim 1 or 3, wherein the inputting the event frame and the sharpened two-dimensional image into a preset keypoint prediction network to obtain two-dimensional keypoint coordinates comprises:
inputting the event frame into a front-end frame of the key point prediction network to obtain a feature vector of the event frame;
inputting the sharpened two-dimensional image into a front-end framework of the key point prediction network to obtain a feature vector of the sharpened two-dimensional image;
splicing the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector;
and inputting the global feature vector into a rear-end frame of the key point prediction network to obtain a key point heat map, and predicting the two-dimensional key point coordinate based on the key point heat map.
5. The method of claim 1, wherein said converting the two-dimensional keypoint coordinates to three-dimensional keypoint coordinates by a coordinate conversion method comprises:
converting the two-dimensional key point coordinates into initial three-dimensional key point coordinates in a world coordinate system by using a coordinate conversion method;
and calculating the average value of the initial three-dimensional key point coordinates to obtain the optimal three-dimensional key point coordinates.
6. The method according to claim 5, wherein the inputting the three-dimensional key point coordinates into a preset action recognition network to obtain an action category comprises:
acquiring speed information and posture information of the optimal three-dimensional key point coordinate according to the optimal three-dimensional key point coordinate and the time of the video frame;
inputting the speed information and the posture information of the optimal three-dimensional key point coordinate into the action recognition network to obtain confidence degrees of all candidate actions;
and selecting the candidate action with the maximum confidence among all candidate actions, and identifying the action category accordingly.
7. An event and video fusion action recognition device, comprising:
the fitting module is used for tracking event data, generating an event stream and acquiring a continuous event track based on the event stream;
the sampling module is used for sampling on the event track at equal time intervals and accumulating to obtain an event frame;
the sharpening module is used for uniformly sampling an input video to obtain a two-dimensional image and obtaining a sharpened two-dimensional image based on the two-dimensional image;
the two-dimensional key point prediction module is used for inputting the event frame and the sharpened two-dimensional image into a preset key point prediction network to obtain two-dimensional key point coordinates;
the three-dimensional key point prediction module is used for converting the two-dimensional key point coordinates into three-dimensional key point coordinates by a coordinate conversion method; and
and the identification module is used for inputting the three-dimensional key point coordinates into a preset action identification network to obtain the action category.
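Mapping the modules of claim 7 onto code, one possible composition (with hypothetical module objects) simply chains the six units in order; the class and attribute names below are illustrative, not part of the patent.

```python
class EventVideoActionRecognizer:
    """Minimal sketch of the claimed device: six modules chained in sequence."""

    def __init__(self, fitting, sampling, sharpening,
                 keypoint2d, keypoint3d, recognition):
        self.fitting = fitting          # event data -> continuous event track
        self.sampling = sampling        # event track -> event frames
        self.sharpening = sharpening    # video -> sharpened 2-D images
        self.keypoint2d = keypoint2d    # (event frames, images) -> 2-D key points
        self.keypoint3d = keypoint3d    # 2-D key points -> 3-D key points
        self.recognition = recognition  # 3-D key points -> action category

    def __call__(self, event_data, video):
        track = self.fitting(event_data)
        event_frames = self.sampling(track)
        sharp_images = self.sharpening(video)
        kp2d = self.keypoint2d(event_frames, sharp_images)
        kp3d = self.keypoint3d(kp2d)
        return self.recognition(kp3d)
```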
8. The apparatus of claim 7, wherein the two-dimensional key point prediction module comprises:
a first obtaining unit, configured to input the event frame into a front-end framework of the key point prediction network and obtain a feature vector of the event frame;
a second obtaining unit, configured to input the sharpened two-dimensional image into the front-end framework of the key point prediction network and obtain a feature vector of the sharpened two-dimensional image;
the concatenating unit is used for concatenating the feature vector of the event frame and the feature vector of the sharpened two-dimensional image to form a global feature vector;
and the prediction unit is used for inputting the global feature vector into a back-end framework of the key point prediction network to obtain a key point heat map and predicting the two-dimensional key point coordinates based on the key point heat map.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the event and video fusion action recognition method according to any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the event and video fusion action recognition method according to any one of claims 1 to 6.
CN202210044281.5A 2022-01-14 Event and video fusion action recognition method and device Active CN114529983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044281.5A CN114529983B (en) 2022-01-14 Event and video fusion action recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210044281.5A CN114529983B (en) 2022-01-14 Event and video fusion action recognition method and device

Publications (2)

Publication Number Publication Date
CN114529983A true CN114529983A (en) 2022-05-24
CN114529983B CN114529983B (en) 2024-06-28

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354765A1 (en) * 2014-02-28 2019-11-21 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US20180218203A1 (en) * 2017-02-01 2018-08-02 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Recognition Actions on Event Based Cameras with Motion Event Features
CN109146929A (en) * 2018-07-05 2019-01-04 中山大学 A kind of object identification and method for registering based under event triggering camera and three-dimensional laser radar emerging system
CN109858407A (en) * 2019-01-17 2019-06-07 西北大学 A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion
CN113705445A (en) * 2021-08-27 2021-11-26 深圳龙岗智能视听研究院 Human body posture recognition method and device based on event camera

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"事件相机与传统相机融合的相关研究", pages 1 - 6, Retrieved from the Internet <URL:https://blog.csdn.net/tfb760/article/details/119987776> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data

Similar Documents

Publication Publication Date Title
JP5235691B2 (en) Information processing apparatus and information processing method
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
Mei et al. Hdinet: Hierarchical dual-sensor interaction network for rgbt tracking
CN111507226B (en) Road image recognition model modeling method, image recognition method and electronic equipment
Choi et al. Attention-based multimodal image feature fusion module for transmission line detection
CN111582232A (en) SLAM method based on pixel-level semantic information
CN114519881A (en) Face pose estimation method and device, electronic equipment and storage medium
WO2024077935A1 (en) Visual-slam-based vehicle positioning method and apparatus
CN111914756A (en) Video data processing method and device
CN113255429B (en) Method and system for estimating and tracking human body posture in video
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN113393385A (en) Unsupervised rain removal method, system, device and medium based on multi-scale fusion
CN114529983B (en) Event and video fusion action recognition method and device
CN116740145A (en) Multi-target tracking method, device, vehicle and storage medium
WO2024082602A1 (en) End-to-end visual odometry method and apparatus
CN117593762A (en) Human body posture estimation method, device and medium integrating vision and pressure
CN114529983A (en) Event and video fusion action identification method and device
CN116734834A (en) Positioning and mapping method and device applied to dynamic scene and intelligent equipment
CN113609948B (en) Method, device and equipment for detecting video time sequence action
CN116168132A (en) Street view reconstruction model acquisition method, device, equipment and medium
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
CN112634331A (en) Optical flow prediction method and device
CN113807354A (en) Image semantic segmentation method, device, equipment and storage medium
CN112287906A (en) Template matching tracking method and system based on depth feature fusion
Vashisth et al. An Efficient Traffic Sign Classification and Recognition with Deep Convolutional Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant