CN113762017B - Action recognition method, device, equipment and storage medium - Google Patents

Action recognition method, device, equipment and storage medium Download PDF

Info

Publication number
CN113762017B
CN113762017B (application CN202110042177.8A)
Authority
CN
China
Prior art keywords
optical flow
video data
weight matrix
video
flow weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110042177.8A
Other languages
Chinese (zh)
Other versions
CN113762017A (en)
Inventor
朱博
姜婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd filed Critical Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority to CN202110042177.8A priority Critical patent/CN113762017B/en
Publication of CN113762017A publication Critical patent/CN113762017A/en
Application granted granted Critical
Publication of CN113762017B publication Critical patent/CN113762017B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an action recognition method, apparatus, device, and storage medium. The method comprises: acquiring video data to be recognized, wherein the video data comprises at least two video frames; determining an optical flow weight matrix corresponding to the video data according to the pixel positions of preset feature points in each video frame, wherein the optical flow weight matrix characterizes the temporal features and spatial features of the action area in the video data; and inputting the video data and the optical flow weight matrix into an action recognition network model to obtain the output action recognition result corresponding to the video data. By determining the optical flow weight matrix of the video data and inputting it into the action recognition network model, the embodiment of the invention addresses the poor feature extraction capability of existing action recognition network models, improves the accuracy of the recognition result of the action recognition network model, and thereby helps ensure safety in the production process.

Description

Action recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of video recognition, in particular to a method, a device, equipment and a storage medium for motion recognition.
Background
With the development of networks and the rapid popularization of video acquisition devices, video surveillance is widely used in various production scenarios. By carrying out real-time monitoring and abnormal behavior early warning on the behaviors of the staff in the video, various safety production risks can be effectively reduced.
Whether the working behaviors and actions of staff comply with regulations is a key concern of production safety. For example, in the sorting scenario of express logistics, the sorting actions of sorters in the video need to be analyzed and warned about in real time.
In the process of implementing the present invention, the inventor finds that at least the following technical problems exist in the prior art:
Existing action recognition methods have poor capability of extracting temporal-domain and spatial-domain features, so the accuracy of the resulting action recognition is low and safety in the production process cannot be effectively ensured.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for identifying actions, which are used for improving the accuracy of an identification result of an action identification network model and further ensuring the safety in the production process.
In a first aspect, an embodiment of the present invention provides an action recognition method, including:
acquiring video data to be identified; wherein the video data comprises at least two video frames;
determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame respectively; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of an action area in the video data;
and inputting the video data and the optical flow weight matrix into an action recognition network model to obtain an output action recognition result corresponding to the video data.
In a second aspect, an embodiment of the present invention further provides an action recognition apparatus, where the apparatus includes:
the video data acquisition module is used for acquiring video data to be identified; wherein the video data comprises at least two video frames;
the optical flow weight matrix module is used for determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of an action area in the video data;
and the motion recognition result output module is used for inputting the video data and the optical flow weight matrix into a motion recognition network model to obtain an output motion recognition result corresponding to the video data.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the action recognition methods described above.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform any of the above-mentioned action recognition methods.
The embodiments of the above invention have the following advantages or benefits:
according to the embodiment of the invention, the optical flow weight matrix of the video data is determined according to the pixel positions of the preset feature points corresponding to each video frame in the video data, and the video data and the optical flow weight matrix are simultaneously input into the motion recognition network model, wherein the optical flow weight matrix characterizes the time feature and the space feature of the motion region in the video data, the problem of poor feature extraction capability of the conventional motion recognition network model is solved, the accuracy of the recognition result of the motion recognition network model is improved, and the safety in the production process is further ensured.
Drawings
Fig. 1 is a flowchart of a method for identifying actions according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for determining an optical flow weight matrix according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an action recognition network model according to a second embodiment of the present invention.
Fig. 4A is a schematic diagram of an attention module according to a second embodiment of the present invention.
Fig. 4B is a flowchart of a specific example of an action recognition method according to the second embodiment of the present invention.
Fig. 5 is a schematic diagram of an action recognition device according to a third embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of an action recognition method provided in the first embodiment of the present invention. This embodiment is applicable to recognizing actions in video. The method may be performed by an action recognition apparatus, which may be implemented in software and/or hardware and configured in a terminal device; exemplary terminal devices include intelligent terminals such as mobile terminals, notebook computers, desktop computers, servers, and tablet computers. The method specifically comprises the following steps:
s110, acquiring video data to be identified.
The video data may be, for example, video collected by a video recording device in real time, or video input by a user. In this embodiment, the video data includes at least two video frames. Wherein video frames may be used to describe still pictures that make up video data.
S120, determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame.
Specifically, each video frame in the video data has corresponding preset feature points, and the preset feature points corresponding to different video frames may be the same or different. In one embodiment, optionally, the pixel points formed by a target object in the video frame are used as the preset feature points. For example, if the video frame contains a person A, all pixels occupied by person A in the current video frame are used as preset feature points. In another embodiment, optionally, all pixel points in the video frame are used as preset feature points.
On the basis of the above embodiment, optionally, determining an optical flow weight matrix corresponding to video data according to pixel positions of preset feature points corresponding to each video frame, includes: aiming at a preset feature point corresponding to a current video frame, determining the optical flow speed of the preset feature point based on the current pixel position of the preset feature point in the current video frame and the next pixel position of the preset feature point in the next video frame; and determining an optical flow weight matrix corresponding to the video data based on the optical flow speed of the preset feature points in each video frame.
The optical flow velocity characterizes the instantaneous velocity of pixel motion of a moving object on the imaging plane; the motion information of the moving object between adjacent frames is determined from the change of the pixel positions of the preset feature points over time in the image sequence and from the correlation between adjacent frames. Specifically, the optical flow velocity includes an optical flow rate and an optical flow direction. For example, if the current pixel position of a preset feature point in the current video frame is (x1, y1), its next pixel position in the next video frame is (x2, y2), and the time interval between the two frames is dt, then the movement distance of the preset feature point is (dx, dy) = (x2, y2) - (x1, y1), the velocity component of the optical flow velocity along the x-axis is ux = dx/dt, and the velocity component along the y-axis is uy = dy/dt.
Exemplary methods of calculating the optical flow velocity include, but are not limited to, pyramidal Lucas-Kanade (L-K) optical flow, the Horn-Schunck algorithm, the FlowNetSimple model, and the FlowNetCorr model. In one embodiment, optionally, each video frame of the video data is input into a FlowNetCorr model to obtain the output optical flow velocity corresponding to each video frame.
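Because the patent does not fix a particular optical flow implementation, the following is a minimal sketch, assuming OpenCV's classical Farneback dense optical flow as a stand-in for the FlowNet models named above; the helper name dense_flow and its parameter values are illustrative assumptions.

```python
import cv2
import numpy as np

def dense_flow(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Return an H x W x 2 array of per-pixel (ux, uy) between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

The returned flow field plays the role of the per-frame optical flow velocity (horizontal and vertical components) used in the following steps.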
Based on the above embodiment, optionally, the optical flow speed includes a horizontal optical flow speed and a vertical optical flow speed, and correspondingly, determining an optical flow weight matrix corresponding to the video data based on the optical flow speed of the preset feature point in each video frame includes: for each video frame, determining a thermodynamic diagram matrix corresponding to the video frame based on a horizontal optical flow speed and a vertical optical flow speed corresponding to preset feature points in the video frame; and respectively carrying out normalization processing on thermodynamic diagram matrixes corresponding to the video frames to obtain optical flow weight matrixes corresponding to the video data.
In one embodiment, optionally, for each preset feature point, a sum-of-squares operation is performed on the horizontal optical flow velocity and the vertical optical flow velocity corresponding to the preset feature point to obtain the thermal value corresponding to that feature point. For example, given a horizontal optical flow velocity ux and a vertical optical flow velocity uy, the thermal value h satisfies the formula: h = ux^2 + uy^2. In one embodiment, when the preset feature points in the video frame are all of its pixel points, the thermodynamic diagram matrix corresponding to the video frame contains a thermal value for each pixel point. For example, if the resolution of the video frame is a×b, the thermodynamic diagram matrix has a rows and b columns. Specifically, the thermodynamic diagram (heatmap) matrices corresponding to the video data are [h1, h2, ..., hn], where hn represents the thermodynamic diagram matrix of the nth video frame.
Specifically, each thermal value in the thermodynamic diagram matrices corresponding to the video data is subjected to a sigmoid (normalization) operation to obtain the optical flow weight matrix corresponding to the video data. The calculation is performed per video frame, and the optical flow weight matrix corresponding to a single video frame satisfies the formula:

wn(xm, ym) = sigmoid(hn(xm, ym)) = 1 / (1 + e^(-hn(xm, ym)))

where wn represents the optical flow weight matrix corresponding to the nth video frame, (xm, ym) represents the pixel point in row m and column m of the video frame, hn represents the thermodynamic diagram matrix corresponding to the nth video frame, and hn(xm, ym) represents the thermal value in row m and column m of the thermodynamic diagram matrix.
Specifically, the optical flow weight matrices corresponding to the video data are [w1, w2, ..., wn].
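As a concrete illustration of the two steps above, the following is a minimal sketch, assuming flows is a list of H x W x 2 per-frame flow fields (for example, those produced by dense_flow above); the function name is an illustrative assumption.

```python
import numpy as np

def optical_flow_weight_matrices(flows):
    """Return [w1, ..., wn]: per-frame sigmoid-normalized heat maps of the squared flow magnitude."""
    weights = []
    for flow in flows:
        ux, uy = flow[..., 0], flow[..., 1]
        h = ux ** 2 + uy ** 2          # thermal value h = ux^2 + uy^2 for every pixel
        w = 1.0 / (1.0 + np.exp(-h))   # sigmoid (normalization) of each thermal value
        weights.append(w)
    return weights
```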
Fig. 2 is a flowchart of a method for determining an optical flow weight matrix according to an embodiment of the present invention. The dashed box in Fig. 2 represents the model architecture of the FlowNetSimple model. Specifically, the FlowNetSimple model comprises a contracting part and an expanding part, where the contracting part is mainly composed of convolution layers and the expanding part is mainly composed of deconvolution layers. The video data is input into the FlowNetSimple model to obtain the output optical flow velocity corresponding to each video frame. A thermodynamic diagram matrix of the video data is then determined based on the optical flow velocities, and a normalization operation is performed on the thermodynamic diagram matrix to obtain the optical flow weight matrix of the video data.
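The FlowNetSimple internals (layer counts, channel widths, refinement stages) are not reproduced in the patent text, so the following is only a drastically simplified sketch of the contracting/expanding idea described above: a stacked frame pair (6 input channels) is mapped through convolutions and then deconvolutions to a 2-channel flow field. All layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyFlowNetSimple(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                        # contracting part: convolution layers
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(                        # expanding part: deconvolution layers
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # 2 output channels: (ux, uy)
        )

    def forward(self, frame_pair: torch.Tensor) -> torch.Tensor:  # frame_pair: (N, 6, H, W)
        return self.decoder(self.encoder(frame_pair))
```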
In this embodiment, an optical flow weight matrix is used to characterize the temporal and spatial features of the action region in the video data. The motion area is specifically used for describing an image area where a moving object is located in the video data. Since the optical flow velocity itself is a pixel motion that describes a moving object, an optical flow weight matrix can characterize the spatial characteristics of the action region. In the present embodiment, an optical flow weight matrix of the entire video data is determined based on the optical flow velocity of each video frame, and thus the optical flow weight matrix can characterize the temporal characteristics of the action region.
S130, inputting the video data and the optical flow weight matrix into the motion recognition network model to obtain an output motion recognition result corresponding to the video data.
Exemplary motion recognition network models include, but are not limited to, a dual stream convolutional network model, a three-dimensional convolutional network model, or a long-term memory network model, among others. In this embodiment, the motion recognition network model may be any network model that can be used for motion recognition, and the specific type of the motion recognition network model is not limited herein.
The action recognition result may be, for example, a specific behavior such as picking up, putting down, or moving an item. Of course, the action recognition result may also be the discrimination result for a preset action; for example, if the preset action is a violent action, the action recognition result indicates whether the video data contains that violent action. The specific output content of the action recognition result is not limited here.
Existing action recognition models mainly fall into three categories. The first is the dual-stream convolutional network approach, which classifies by fusing extracted temporal-domain and spatial-domain features; the second is the action recognition approach based on long short-term memory networks; the third is the three-dimensional convolutional network approach with an added time-dimension channel. Because the dual-stream convolutional network only attends to the convolution mapping of the current step, it can capture only short-term spatio-temporal features and cannot represent long-term ones. The long short-term memory network model can alleviate the long-term temporal modeling problem to some extent, but because its input is the semantic feature extracted directly from a fully connected layer, it lacks the ability to extract spatio-temporal feature detail. Although the three-dimensional convolutional network adds a time-dimension channel, its ability to extract spatio-temporal feature detail still needs improvement.
According to the technical scheme of the embodiment, the optical flow weight matrix of the video data is determined according to the pixel positions of the preset feature points corresponding to each video frame in the video data, and the video data and the optical flow weight matrix are simultaneously input into the motion recognition network model, wherein the optical flow weight matrix characterizes the time characteristics and the space characteristics of the motion area in the video data, the problem that the feature extraction capability of the existing motion recognition network model is poor is solved, the accuracy of the recognition result of the motion recognition network model is improved, and the safety in the production process is further guaranteed.
Example 2
Fig. 3 is a schematic diagram of an action recognition network model according to a second embodiment of the present invention, where the technical solution of the present embodiment is further elaborated on the basis of the foregoing embodiment, and in this embodiment, optionally, the action recognition network model includes an intermediate network module 210, an output module 230, and at least one attention module 220; the intermediate network module 210 is configured to perform preset processing on an input video frame to obtain intermediate image data; the attention module 220 is configured to perform a proportional fusion process based on the intermediate image data and the optical flow weight matrix output by the intermediate network module 210 to obtain an attention feature map; an output module 230 for determining an action recognition result corresponding to the video data based on the attention profile outputted by the attention module 220.
In an exemplary embodiment, when the preset process is a maximum pooling process, the intermediate network module 210 is a maximum pooling layer. When the preset process is an average pooling process, the intermediate network module 210 is an average pooling layer. When the preset process is a convolution process, the intermediate network module 210 is a convolution layer. The specific type of intermediate network module 210 is not limited herein.
It should be noted that Fig. 3 only shows one possible connection relationship between the modules of the motion recognition network model. When the motion recognition network model includes a plurality of intermediate network modules 210 and a plurality of attention modules 220, taking two intermediate network modules and two attention modules as an example, the connection order may be intermediate network module A, attention module A, intermediate network module B, attention module B; or intermediate network module A, intermediate network module B, attention module A, attention module B; or intermediate network module A, attention module A, attention module B, intermediate network module B. The connection relationship between the modules of the motion recognition network model is not limited here.
In one embodiment, the intermediate network module 210 optionally includes a correction processing module, configured to perform correction processing on each input video frame to obtain corrected video frames, where the correction processing includes subtract-mean processing and/or scaling processing. Specifically, the input of the correction processing module is the video frames of the video data input to the motion recognition network model. In subtract-mean processing, the pixel values at corresponding pixel positions across the video frames are averaged to obtain a mean image corresponding to the video data, and the mean image is then subtracted from each video frame to obtain the corrected video frame. The advantage of this is that subtracting the smooth pixel data from the video data emphasizes the spatial features of the action region in each video frame. In scaling processing, the resolution of the video frames is adjusted so that it meets the resolution requirement the motion recognition network model places on its input data, thereby helping the model output a stable and accurate action recognition result. In this embodiment, the intermediate image data output by the intermediate network module 210 are the corrected video frames.
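The following is a minimal sketch of the correction processing described above, assuming frames is a list of H x W x 3 uint8 video frames; the helper name and the target resolution of 224 x 224 are illustrative assumptions, since the patent only requires that the result match the model's input resolution.

```python
import cv2
import numpy as np

def correct_frames(frames, target_size=(224, 224)):
    """Subtract the clip's mean image from every frame, then rescale to the model's input size."""
    stack = np.stack(frames).astype(np.float32)
    mean_image = stack.mean(axis=0)              # mean image over the whole clip
    corrected = []
    for frame in stack:
        frame = frame - mean_image               # subtract-mean processing
        frame = cv2.resize(frame, target_size)   # scaling processing; target_size is (width, height)
        corrected.append(frame)
    return corrected
```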
In one embodiment, optionally, the attention module 220 includes a network node unit and an attention unit; a network node unit for determining a node feature map based on the input intermediate image data; and the attention unit is used for carrying out proportional fusion processing on the node characteristic graphs output by the network node unit based on the optical flow weight matrix to obtain attention characteristic graphs.
The network node unit may be configured to perform feature extraction on the intermediate image data to obtain a node feature map. The network node elements include at least one network layer, illustratively including but not limited to at least one of a convolution layer, a deconvolution layer, a pooling layer, an activation function layer, a normalization layer, and a full connection layer. The specific type, number and manner of connection of the network layers in the network node unit are not limited here.
When the video data includes n video frames, the node feature map output by the network node unit is U = [F1, F2, ..., Fn], where Fn represents the feature map corresponding to the nth video frame. In one embodiment, optionally, the node feature map includes at least one channel feature map corresponding to each video frame. For example, the size of Fn is a×b×c, where a×b is the resolution of a channel feature map and c is the number of channel feature maps.
On the basis of the above embodiment, when the node feature map includes at least one channel feature map corresponding to each video frame, the attention unit is specifically configured to: for each video frame, based on a frame optical flow weight matrix corresponding to the video frame in the optical flow weight matrix, respectively performing proportional fusion operation on each channel feature map corresponding to the video frame to obtain a frame attention feature map corresponding to the video frame; based on the attention feature maps of each frame, an attention feature map corresponding to the video data is generated.
The frame optical flow weight matrix denotes the optical flow weight matrix corresponding to a single video frame; specifically, the optical flow weight matrices corresponding to the video data are [w1, w2, ..., wn]. A proportional fusion (Scale) calculation is performed between the frame optical flow weight matrix and each channel feature map to obtain the frame attention feature map, which satisfies the formula:
sn[1] = wn * Fn[1], ..., sn[c] = wn * Fn[c]

where sn[1] represents the frame attention feature map corresponding to the 1st channel, Fn[1] represents the 1st channel feature map, sn[c] represents the frame attention feature map corresponding to the c-th channel, and Fn[c] represents the c-th channel feature map.
The frame attention feature map corresponding to the nth video frame is [sn[1], sn[2], ..., sn[c]], and the attention feature map corresponding to the video data is [s1, s2, ..., sn].
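The Scale operation above reduces to an element-wise multiplication of every channel feature map of a frame by that frame's optical flow weight matrix. A minimal sketch, assuming each Fn is stored as a (c, a, b) array and each wn as an (a, b) array:

```python
import numpy as np

def scale_fusion(node_feature_maps, flow_weight_matrices):
    """node_feature_maps: list of (c, a, b) arrays Fn; flow_weight_matrices: list of (a, b) arrays wn."""
    attention_maps = []
    for Fn, wn in zip(node_feature_maps, flow_weight_matrices):
        sn = Fn * wn[None, :, :]        # sn[c] = wn * Fn[c] for every channel c
        attention_maps.append(sn)
    return attention_maps               # [s1, s2, ..., sn]
```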
Fig. 4A is a schematic diagram of an attention module according to the second embodiment of the present invention. As shown in Fig. 4A, the leftmost three rectangles represent the intermediate image data output by the intermediate network module and input into the attention module; the three rectangles in the middle represent the node feature maps output by the network node unit U, which include the feature maps Fn corresponding to the n video frames, each of size a×b×c. The three rectangles above the network node unit U represent the input optical flow weight matrix, which includes the frame optical flow weight matrices corresponding to the n video frames, each of size a×b. A Scale (proportional fusion) operation between the optical flow weight matrix and the node feature maps yields the attention feature map corresponding to the video data.
Fig. 4B is a flowchart of a specific example of an action recognition method according to the second embodiment of the present invention; the leftmost three matrices in Fig. 4B represent video frames of the video data. On the one hand, the video data is input into the FlowNetCorr model, and the optical flow weight matrix is determined based on the output optical flow velocity. On the other hand, the video data and the optical flow weight matrix are input into the action recognition network model: the lower dashed box in Fig. 4B represents the action recognition network model, and the two upper dashed boxes each represent an attention module within it. Specifically, in this example the network node units in the two attention modules are a 7×7 convolution layer and a 3×3 convolution layer respectively, and the intermediate network module includes a correction processing module and a 1×3×3 max pooling layer. The repeated block in the action recognition network model may represent a repeated attention module followed by a 1×3×3 max pooling layer. In this example, the action recognition network model also includes a 2×7×7 average pooling layer, and the output module is a 1×1×1 convolution layer. The three-dimensional convolutional network model is used here as an example of the action recognition network model, but the specific network architecture of the action recognition network model is not limited.
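To make the structure sketched in Fig. 4B concrete, the following is a rough PyTorch sketch, assuming a three-dimensional convolutional backbone; the class names, channel widths, and number of action classes are illustrative assumptions, and the patent figure, not this code, is authoritative for the exact layer arrangement.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """A network node unit (3D convolution) followed by proportional fusion with the flow weights."""
    def __init__(self, in_ch, out_ch, kernel):
        super().__init__()
        self.node = nn.Conv3d(in_ch, out_ch, kernel, padding=tuple(k // 2 for k in kernel))

    def forward(self, x, flow_weights):      # x: (N, C, T, H, W); flow_weights: (N, 1, T, H, W)
        features = self.node(x)
        weights = nn.functional.interpolate(flow_weights, size=features.shape[2:])
        return features * weights            # Scale: element-wise proportional fusion

class ActionRecognitionNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.att1 = AttentionModule(3, 32, (1, 7, 7))
        self.pool1 = nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        self.att2 = AttentionModule(32, 64, (1, 3, 3))
        self.pool2 = nn.AdaptiveAvgPool3d(1)           # stands in for the average pooling layer
        self.head = nn.Conv3d(64, num_classes, 1)      # 1x1x1 convolution output module

    def forward(self, video, flow_weights):
        x = self.pool1(self.att1(video, flow_weights))
        x = self.att2(x, flow_weights)
        return self.head(self.pool2(x)).flatten(1)     # logits, one per action class
```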
On the basis of the above embodiment, optionally, the attention module further includes a resolution unit, configured to sample the optical flow weight matrix if the image resolution of the video frame corresponding to the optical flow weight matrix is different from the image resolution of the node feature map, so that the image resolution corresponding to the optical flow weight matrix is the same as the image resolution of the node feature map; wherein the sampling process includes an upsampling process or a downsampling process.
For example, assuming the FlowNetCorr model predicts the optical flow velocity of every pixel in a video frame and the image resolution of the video frame is a×b, the image resolution corresponding to the optical flow weight matrix is also a×b. Because the network node unit performs feature extraction on the input intermediate image data, the extracted node feature map may not match the image resolution of the original video frame, so the optical flow weight matrix is sampled before the proportional fusion processing. Specifically, if the image resolution corresponding to the optical flow weight matrix is greater than that of the node feature map, the optical flow weight matrix is down-sampled; if it is smaller, the optical flow weight matrix is up-sampled. The advantage of this arrangement is that it ensures the subsequent proportional fusion processing does not fail because of differing image resolutions.
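A minimal sketch of the resolution unit, assuming the weight matrices and feature maps are held as PyTorch tensors; bilinear interpolation covers both the up-sampling and the down-sampling case.

```python
import torch
import torch.nn.functional as F

def match_resolution(flow_weights: torch.Tensor, node_features: torch.Tensor) -> torch.Tensor:
    """flow_weights: (N, 1, H, W); node_features: (N, C, H', W'). Resample weights to H' x W'."""
    if flow_weights.shape[-2:] == node_features.shape[-2:]:
        return flow_weights
    return F.interpolate(flow_weights, size=node_features.shape[-2:],
                         mode="bilinear", align_corners=False)
```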
On the basis of the above embodiment, optionally, the method further includes: acquiring video data to be trained, wherein the video data to be trained comprises at least two video frames to be trained; determining an optical flow weight matrix corresponding to the video data to be trained according to pixel positions of preset feature points corresponding to each video frame to be trained; inputting the video data to be trained and the optical flow weight matrix into an initial motion recognition network model, and adjusting model parameters of the initial motion recognition network model according to the output motion recognition result and the standard recognition result until the motion recognition network model after training is completed is obtained.
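The patent does not name a loss function or optimizer, so the following is a minimal sketch of one training step under the usual assumptions (cross-entropy loss, a gradient-based optimizer, and a model that accepts the video data together with the optical flow weight matrix, like the sketch above).

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, video, flow_weights, standard_label):
    """One parameter update: compare the output recognition result with the standard result."""
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    logits = model(video, flow_weights)        # video data to be trained + optical flow weight matrix
    loss = criterion(logits, standard_label)   # deviation from the standard recognition result
    loss.backward()
    optimizer.step()                           # adjust the model parameters of the initial model
    return loss.item()
```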
In the technical solution of this embodiment, an attention module is arranged in the action recognition network model, which solves the problem of how the action recognition network model processes the optical flow weight matrix: the optical flow weight matrix is combined with the existing network node units of the model to produce the output attention feature map, improving the accuracy of the model's recognition result. In addition, the existing action recognition network model does not need to be modified extensively; unsupervised focusing on the action region can be achieved simply by combining the output of an existing network node with the optical flow weight matrix, which improves the portability of the attention module.
Example 3
Fig. 5 is a schematic diagram of an action recognition apparatus according to the third embodiment of the present invention. This embodiment is applicable to recognizing behaviors in video; the apparatus may be implemented in software and/or hardware and configured in a terminal device. The action recognition apparatus includes: a video data acquisition module 310, an optical flow weight matrix determination module 320, and an action recognition result output module 330.
The video data obtaining module 310 is configured to obtain video data to be identified; wherein the video data comprises at least two video frames;
the optical flow weight matrix determining module 320 is configured to determine an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of the action area in the video data;
the motion recognition result output module 330 is configured to input the video data and the optical flow weight matrix into the motion recognition network model, and obtain an output motion recognition result corresponding to the video data.
According to the technical scheme of the embodiment, the optical flow weight matrix of the video data is determined according to the pixel positions of the preset feature points corresponding to each video frame in the video data, and the video data and the optical flow weight matrix are simultaneously input into the motion recognition network model, wherein the optical flow weight matrix characterizes the time characteristics and the space characteristics of the motion area in the video data, the problem that the feature extraction capability of the existing motion recognition network model is poor is solved, the accuracy of the recognition result of the motion recognition network model is improved, and the safety in the production process is further guaranteed.
Based on the above technical solution, optionally, the optical flow weight matrix determining module 320 includes:
the optical flow speed determining unit is used for determining the optical flow speed of the preset feature point according to the preset feature point corresponding to the current video frame and based on the current pixel position of the preset feature point in the current video frame and the next pixel position of the preset feature point in the next video frame;
and the optical flow weight matrix determining unit is used for determining an optical flow weight matrix corresponding to the video data based on the optical flow speed of the preset feature points in each video frame.
On the basis of the above technical solution, optionally, the optical flow speed includes a horizontal optical flow speed and a vertical optical flow speed, and the corresponding optical flow weight matrix determining unit is specifically configured to:
for each video frame, determining a thermodynamic diagram matrix corresponding to the video frame based on a horizontal optical flow speed and a vertical optical flow speed corresponding to preset feature points in the video frame;
and respectively carrying out normalization processing on thermodynamic diagram matrixes corresponding to the video frames to obtain optical flow weight matrixes corresponding to the video data.
On the basis of the technical scheme, optionally, the action recognition network model comprises an intermediate network module, an output module and at least one attention module;
the intermediate network module is used for carrying out preset processing on the input video frames to obtain intermediate image data;
the attention module is used for carrying out proportion fusion processing based on the intermediate image data and the optical flow weight matrix output by the intermediate network module to obtain an attention characteristic diagram;
and the output module is used for determining an action recognition result corresponding to the video data based on the attention characteristic diagram output by the attention module.
On the basis of the technical scheme, optionally, the attention module comprises a network node unit and an attention unit;
a network node unit for determining a node feature map based on the input intermediate image data;
and the attention unit is used for carrying out proportional fusion processing on the node characteristic graphs output by the network node unit based on the optical flow weight matrix to obtain attention characteristic graphs.
On the basis of the technical scheme, optionally, the node characteristic diagram comprises at least one channel characteristic diagram corresponding to each video frame respectively;
accordingly, the attention unit is specifically configured to: for each video frame, based on a frame optical flow weight matrix corresponding to the video frame in the optical flow weight matrix, respectively performing proportional fusion operation on each channel feature map corresponding to the video frame to obtain a frame attention feature map corresponding to the video frame; based on the attention feature maps of each frame, an attention feature map corresponding to the video data is generated.
On the basis of the above technical solution, optionally, the attention module further includes a resolution unit, configured to sample the optical flow weight matrix if the image resolution of the video frame corresponding to the optical flow weight matrix is different from the image resolution of the node feature map, so that the image resolution corresponding to the optical flow weight matrix is the same as the image resolution of the node feature map; wherein the sampling process includes an upsampling process or a downsampling process.
On the basis of the technical scheme, the intermediate network module comprises a correction processing module, a correction processing module and a processing module, wherein the correction processing module is used for respectively carrying out correction processing on each input video frame to obtain corrected video frames; wherein the correction process includes a subtraction mean process and/or a scaling process.
The action recognition device provided by the embodiment of the invention can be used for executing the action recognition method provided by the embodiment of the invention, and has the corresponding functions and beneficial effects of the execution method.
It should be noted that, in the embodiment of the motion recognition apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
Example 4
Fig. 6 is a schematic structural diagram of an electronic device according to the fourth embodiment of the present invention. The electronic device serves to implement the action recognition method of the foregoing embodiments, and the action recognition apparatus of the foregoing embodiments may be configured in it. Fig. 6 illustrates a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in Fig. 6 is merely an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.
The electronic device 12 may be, for example, a mobile terminal, a notebook computer, a desktop computer, a server, a tablet computer, or the like. In one embodiment, the electronic device 12 is optionally a camera.
As shown in fig. 6, the electronic device 12 is in the form of a general purpose computing device. Components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the electronic device 12, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown in fig. 6, the network adapter 20 communicates with other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the action recognition method provided by the embodiment of the present invention.
By the electronic equipment, the problem of poor feature extraction capability of the existing motion recognition network model is solved, the accuracy of the recognition result of the motion recognition network model is improved, and the safety in the production process is further ensured.
Example 5
A fifth embodiment of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method of action recognition, the method comprising:
acquiring video data to be identified; wherein the video data comprises at least two video frames;
determining an optical flow weight matrix corresponding to video data according to pixel positions of preset feature points corresponding to each video frame respectively; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of the action area in the video data;
and inputting the video data and the optical flow weight matrix into the motion recognition network model to obtain an output motion recognition result corresponding to the video data.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above method operations, and may also perform the related operations in the action recognition method provided in any embodiment of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method of motion recognition, comprising:
acquiring video data to be identified; wherein the video data comprises at least two video frames;
determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame respectively; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of an action area in the video data;
inputting the video data and the optical flow weight matrix into an action recognition network model to obtain an output action recognition result corresponding to the video data;
the action recognition network model comprises at least two intermediate network modules, at least two attention modules and an output module, wherein the intermediate network modules are used for carrying out preset processing on input data, the attention modules comprise network node units and attention units, and the attention units are used for carrying out proportional fusion processing on node feature graphs output by the network node units based on the optical flow weight matrix to obtain attention feature graphs.
2. The method of claim 1, wherein determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each of the video frames, comprises:
aiming at a preset feature point corresponding to a current video frame, determining the optical flow speed of the preset feature point based on the current pixel position of the preset feature point in the current video frame and the next pixel position of the preset feature point in the next video frame;
and determining an optical flow weight matrix corresponding to the video data based on the optical flow speed of the preset feature points in each video frame.
3. The method of claim 2, wherein the optical flow velocity comprises a horizontal optical flow velocity and a vertical optical flow velocity, and wherein the determining the optical flow weight matrix corresponding to the video data based on the optical flow velocity of the preset feature point in each video frame comprises:
for each video frame, determining a thermodynamic diagram matrix corresponding to the video frame based on a horizontal optical flow speed and a vertical optical flow speed corresponding to preset feature points in the video frame;
and respectively carrying out normalization processing on thermodynamic diagram matrixes corresponding to the video frames to obtain optical flow weight matrixes corresponding to the video data.
4. The method of claim 1, wherein the node profile comprises at least one channel profile corresponding to each of the video frames;
correspondingly, the attention unit is specifically configured to: for each video frame, based on a frame optical flow weight matrix corresponding to the video frame in the optical flow weight matrix, respectively performing proportional fusion operation on each channel feature map corresponding to the video frame to obtain a frame attention feature map corresponding to the video frame;
and generating an attention characteristic diagram corresponding to the video data based on each frame attention characteristic diagram.
5. The method of claim 1, wherein the attention module further comprises a resolution unit configured to sample the optical flow weight matrix to make the image resolution corresponding to the optical flow weight matrix identical to the image resolution of the node feature map if the image resolution of the video frame corresponding to the optical flow weight matrix is different from the image resolution of the node feature map; wherein the sampling process includes an upsampling process or a downsampling process.
6. The method according to claim 1, wherein the intermediate network module includes a correction processing module, configured to perform correction processing on each of the input video frames, to obtain corrected video frames; wherein the correction process includes a subtraction mean process and/or a scaling process.
7. An action recognition device, comprising:
the video data acquisition module is used for acquiring video data to be identified; wherein the video data comprises at least two video frames;
the optical flow weight matrix determining module is used for determining an optical flow weight matrix corresponding to the video data according to pixel positions of preset feature points corresponding to each video frame; the optical flow weight matrix is used for representing the time characteristics and the space characteristics of an action area in the video data;
the motion recognition result output module is used for inputting the video data and the optical flow weight matrix into a motion recognition network model to obtain an output motion recognition result corresponding to the video data;
the action recognition network model comprises at least two intermediate network modules, at least two attention modules and an output module, wherein the intermediate network modules are used for carrying out preset processing on input data, the attention modules comprise network node units and attention units, and the attention units are used for carrying out proportional fusion processing on node feature graphs output by the network node units based on the optical flow weight matrix to obtain attention feature graphs.
8. An electronic device, the electronic device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action recognition method of any one of claims 1-6.
9. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the action recognition method of any one of claims 1-6.
CN202110042177.8A 2021-01-13 2021-01-13 Action recognition method, device, equipment and storage medium Active CN113762017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110042177.8A CN113762017B (en) 2021-01-13 2021-01-13 Action recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113762017A (en) 2021-12-07
CN113762017B (en) 2024-04-16

Family

ID=78786303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110042177.8A Active CN113762017B (en) 2021-01-13 2021-01-13 Action recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113762017B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071825B (en) * 2023-01-31 2024-04-19 天翼爱音乐文化科技有限公司 Action behavior recognition method, system, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830709B2 (en) * 2016-03-11 2017-11-28 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN110287820A (en) * 2019-06-06 2019-09-27 北京清微智能科技有限公司 Activity recognition method, apparatus, equipment and medium based on LRCN network
CN110458077A (en) * 2019-08-05 2019-11-15 高新兴科技集团股份有限公司 A kind of vehicle color identification method and system
CN111259751A (en) * 2020-01-10 2020-06-09 北京百度网讯科技有限公司 Video-based human behavior recognition method, device, equipment and storage medium
CN111274978A (en) * 2020-01-22 2020-06-12 广东工业大学 Micro-expression recognition method and device
CN111709351A (en) * 2020-06-11 2020-09-25 江南大学 Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion
CN111797753A (en) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 Training method, device, equipment and medium of image driving model, and image generation method, device and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Action Recognition in the Dark via Deep Representation Learning; Anwaar Ulhaq et al.; 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS); 2019-05-09; full text *
Research on analysis technology for abnormal behavior of airport personnel based on image detection; Sun Baocong; Digital Communication World (Issue 01); full text *
Attention-based two-stream CNN for action recognition; Ma Cuihong; Wang Yi; Mao Zhiqiang; Computer Engineering and Design; 2020-10-16 (Issue 10); full text *

Also Published As

Publication number Publication date
CN113762017A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
WO2020228405A1 (en) Image processing method and apparatus, and electronic device
CN112669344A (en) Method and device for positioning moving object, electronic equipment and storage medium
US11915447B2 (en) Audio acquisition device positioning method and apparatus, and speaker recognition method and system
EP3951741B1 (en) Method for acquiring traffic state, relevant apparatus, roadside device and cloud control platform
CN113891072B (en) Video monitoring and anomaly analysis system and method based on hundred million-level pixel data
CN112149615A (en) Face living body detection method, device, medium and electronic equipment
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
CN113762017B (en) Action recognition method, device, equipment and storage medium
CN113869144A (en) Target detection method, target detection device, electronic equipment and computer-readable storage medium
CN113239883A (en) Method and device for training classification model, electronic equipment and storage medium
CN115861891B (en) Video target detection method, device, equipment and medium
CN116797973A (en) Data mining method and system applied to sanitation intelligent management platform
CN111063011A (en) Face image processing method, device, equipment and medium
CN114461078B (en) Man-machine interaction method based on artificial intelligence
CN111862142B (en) Motion trail generation method, device, equipment and medium
CN113763468B (en) Positioning method, device, system and storage medium
CN113989334A (en) Method, device and equipment for tracking video moving object and storage medium
CN111124862B (en) Intelligent device performance testing method and device and intelligent device
CN113640897A (en) Method and device for detecting human body in space, electronic equipment and storage medium
CN114595758A (en) Flame detection method, device, equipment and storage medium
CN113721240A (en) Target association method and device, electronic equipment and storage medium
CN112801960A (en) Image processing method and device, storage medium and electronic equipment
CN111353464B (en) Object detection model training and object detection method and device
CN112288774B (en) Mobile detection method, mobile detection device, electronic equipment and storage medium
CN111062337B (en) People stream direction detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant