CN110929637A

CN110929637A - Image identification method and device, electronic equipment and storage medium

Info

Publication number: CN110929637A
Application number: CN201911139594.3A
Authority: CN
Inventors: 谷宇章; 杨洪业; 张晓林
Original assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Current assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2020-03-27
Anticipated expiration: 2039-11-20
Also published as: CN110929637B

Abstract

The application relates to an image identification method, an image identification device, electronic equipment and a storage medium, wherein a human skeleton image sequence is obtained; determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image; determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the frame number of the images in the human body skeleton image sequence; determining a plurality of sets of inter-frame differential values; determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points and the number of frames of images in the human body skeleton image sequence; determining an input tensor based on the relative coordinate tensor and the temporal difference tensor; and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence. According to the method and the device, the input tensor of the motion recognition model based on the graph convolution network is constructed by utilizing the human body skeleton joint point information, the motion recognition is carried out, and the accuracy of human motion recognition can be improved.

Description

Image identification method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer vision technologies, and in particular, to an image recognition method and apparatus, an electronic device, and a storage medium.

Background

Understanding human behavior is one of the most important tasks in computer vision, as it can facilitate a wide range of applications such as human-computer interaction, robotics, and game control. The skeleton consisting of three-dimensional joint positions provides a good expression for describing human behavior.

In recent years, with the rapid development of three-dimensional data acquisition equipment such as microsoft kinect, the acquisition of skeleton data becomes easier. Furthermore, the skeleton itself is a high-level feature of the human body, with invariance to appearance or appearance, which eliminates the difficulty of representing and understanding different action classes. Most importantly, the skeleton is robust to noise and efficient in both computation and storage. Therefore, skeleton-based motion recognition has received increasing attention in recent years.

Most of the previous researches input the joint coordinate vector into a Recurrent Neural Network (RNN) directly or encode the skeleton sequence into a pseudo image, and model the time-space dynamics by using a Convolutional Neural Network (CNN).

However, these approaches rarely explore the intrinsic dependencies between joints. To capture such dependencies, the framework data should be fully understood. In terms of data structures, a skeleton is a special graph whose vertices are joints and edges are skeletons. Therefore, the Graph Convolutional Network (GCN) is applied to mining the structure information of the human body, so that the performance better than that of a non-Graph Convolutional Network can be obtained, and the accuracy of human action recognition can be improved.

Disclosure of Invention

The embodiment of the application provides an image identification method and device, electronic equipment and a storage medium, which can improve the accuracy of human action identification.

In one aspect, an embodiment of the present application provides an image recognition method, including:

acquiring a human skeleton image sequence; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; skeleton joint points of each frame of human skeleton image are consistent;

determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one;

determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the frame number of the images in the human body skeleton image sequence;

determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;

determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points and the number of frames of images in the human body skeleton image sequence;

determining an input tensor based on the relative coordinate tensor and the temporal difference tensor;

and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence.

In another aspect, an embodiment of the present application provides an image recognition apparatus, including:

the first acquisition module is used for acquiring a human skeleton image sequence; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; skeleton joint points of each frame of human skeleton image are consistent;

the first determining module is used for determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one;

the second determination module is used for determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the frame number of the images in the human body skeleton image sequence;

the third determining module is used for determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;

the fourth determination module is used for determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of the skeleton joint points and the number of frames of the images in the human body skeleton image sequence;

a fifth determining module for determining an input tensor based on the relative coordinate tensor and the temporal difference tensor;

and the action recognition module is used for carrying out action recognition on the input tensor based on the trained action recognition model to obtain the action category corresponding to the human skeleton image sequence.

In another aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the image recognition method described above.

In another aspect, an embodiment of the present application provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the image recognition method described above.

The image identification method, the image identification device, the electronic equipment and the storage medium have the following beneficial effects:

obtaining a human skeleton image sequence; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; skeleton joint points of each frame of human skeleton image are consistent; determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one; determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the frame number of the images in the human body skeleton image sequence; determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence; determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points and the number of frames of images in the human body skeleton image sequence; determining an input tensor based on the relative coordinate tensor and the temporal difference tensor; and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence. According to the method and the device, the input tensor of the motion recognition model based on the graph convolution network is constructed by utilizing the human body skeleton joint point information, the motion recognition is carried out, and the accuracy of human motion recognition can be improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;

fig. 2 is a schematic flowchart of an image recognition method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a human skeleton data set according to an embodiment of the present application

FIG. 4 is a schematic view of a human skeleton according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of relative coordinate tensors according to an embodiment of the present disclosure;

fig. 6 is a schematic structural diagram of an input tensor provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a motion recognition model provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a convolutional layer provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of a trained adjacency matrix provided by an embodiment of the present application;

FIG. 10 is a flow diagram illustrating a spatiotemporal attention extraction operation provided by an embodiment of the present application;

fig. 11 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application, and the application scenario includes a data processing module 101 and a motion recognition model 102, and after an overall image sequence of a human skeleton is obtained by the data processing module 101 and the motion recognition model 102, motion categories corresponding to the image sequence of the human skeleton are output sequentially through the data processing module 101 and the motion recognition model 102.

A human skeleton image sequence is input into a data processing module 101; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; the skeleton joint points of each frame of human skeleton image are consistent. The data processing module 101 determines a corresponding relative coordinate set in a skeleton joint point set of each frame of human skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one; the data processing module 101 determines a relative coordinate tensor based on the set of relative coordinates, the number of skeletal joint points, and the number of frames of images in the sequence of human skeleton images. The data processing module 101 determines a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human body skeleton image sequence, and determines a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points and the number of frames of images in the human body skeleton image sequence; the data processing module 101 connects the relative coordinate tensor and the time difference tensor in series to serve as an input tensor, the input tensor is input into the trained action recognition model 102, and the action recognition model 102 performs action recognition on the input tensor to obtain an action type corresponding to the human skeleton image sequence.

Optionally, in another application scenario, the data processing module 101 may also serve as a part of the motion recognition model 102, the human skeleton image sequence serves as an input of the motion recognition model 102, and the motion category corresponding to the human skeleton image sequence is output through the motion recognition model 102.

Alternatively, the data processing module 101 and the motion recognition model 102 may be disposed in the same device, such as a mobile terminal, a computer terminal, a server, or a similar computing device; alternatively, the data processing module 101 and the motion recognition model 102 may be provided in a plurality of devices, which are in one system; alternatively, the data processing module 101 and the motion recognition model 102 may be provided on one platform. Therefore, the execution subject of the embodiment of the present application may be a mobile terminal, a computer terminal, a server, or a similar operation device; may be a system or a platform.

While a specific embodiment of an image recognition method according to the present application is described below, fig. 2 is a schematic flow chart of an image recognition method according to an embodiment of the present application, and the present specification provides the method operation steps according to the embodiment or the flow chart, but more or less operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In practice, the system or server product may be implemented in a sequential or parallel manner (e.g., parallel processor or multi-threaded environment) according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 2, the method may include:

s201: acquiring a human skeleton image sequence; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; the skeleton joint points of each frame of human skeleton image are consistent.

In the embodiment of the application, the human body skeleton image sequence can be acquired by a depth sensor (such as microsoft Kinect), and the data acquired by the depth sensor further includes three-dimensional coordinate information of skeleton joint points in each frame of human body skeleton image.

At present, a large number of open source human skeleton data sets are used for experimental verification. For example, the NTU RGB + D dataset was captured simultaneously by 3 microsoft Kinect cameras, involving 60 motion classes, containing 50,000 motion samples, and video, depth image sequences and three-dimensional skeleton data for each motion sample. Referring to fig. 3, fig. 3 is a schematic diagram of a human skeleton data set according to an embodiment of the present application, and fig. 3(a) is three-dimensional skeleton data of an NTU RGB + D data set, which includes three-dimensional coordinate information of 25 joint points. The three-dimensional coordinate information is obtained by skeletal tracking techniques in the Kinect camera, which by processing depth data establishes the coordinates of the various joints of the human body, it is possible to determine which parts of the human body are, for example, the hands, the head, and the body, as well as their spatial positions. Similarly, in addition to the NTU RGB + D dataset, there is an HDM05 dataset, which contains three-dimensional skeleton data including three-dimensional coordinate information for 31 joint points, as shown in fig. 3 (b).

S203: determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image; and the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one.

In the embodiment of the application, the number of skeleton joint point sets of each frame of human skeleton image can be determined according to a specific algorithm, the number of skeleton joint point sets of each frame of human skeleton image acquired based on an NTU RGB + D data set is 25, and the data set further includes three-dimensional coordinate information sets corresponding to the 25 joint points.

An implementation mode of optionally determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image is to determine a root node from the skeleton joint point set; and determining the relative coordinates of each skeleton joint point in the skeleton joint point set in each frame of human body skeleton image based on the root node in the skeleton joint point set to obtain a relative coordinate set.

The following description is given by way of a specific example, please refer to fig. 4, fig. 4 is a schematic diagram of a human skeleton according to an embodiment of the present application, and for convenience of description, it is assumed that an algorithm determines 5 joint points. The method comprises the steps of obtaining 10 continuous frames of human body skeleton images through a depth sensor, and obtaining three-dimensional coordinate information sets of 5 skeleton joint points in each frame of human body skeleton image. For example, the three-dimensional coordinate information of 5 skeletal joint points in the 1 st frame is: head joint point A₁(90,90,90), hand joint point B₁(100,80,60), hand joint point C₁(80,100,60), a leg joint point D₁(100,80,0), leg joint point E₁(80,100,0), and for example, the three-dimensional coordinate information of 5 skeletal joint points in the 2 nd frame is: a. the₂(90,90,92)，B₂(100,80,62)，C₂(80,100,62)，D₂(100,80,10)，E₂(80,100,10), and for example, the three-dimensional coordinate information of 5 skeletal joint points in the 10 th frame is: a. the₁₀(90,90,110)，B₁₀(100,80,80)，C₁₀(80,100,80)，D₁₀(100,80,50)，E₁₀(80,100, 50); determining a root node as a head joint point A from the 5 skeleton joint points A, B, C, D and E, and then determining relative coordinates of the 5 skeleton joint points in each frame relative to the head joint point A to obtain a relative coordinate set; wherein, the head joint point A in each frame of the human skeleton image is the origin (0,0, 0). For example, the set of relative coordinates of frame 1 includes A'₁(0,0,0)，B’₁(10，-10,-30)，C’₁(-10,10,-30)，D’₁(10,-10,-90)，E’₁(-10,10, -90), the set of relative coordinates of frame 2 comprises A'₂(0,0,0)，B’₂(10，-10,-30)，C’₂(-10,10,-30)，D’₂(10,-10,-82)，E’₂(-10,10, -82), the set of relative coordinates of frame 10 includes A'₁₀(0,0,0)，B’₁₀(10，-10,-30)，C’₁₀(-10,10,-30)，D’₁₀(10,-10,-70)，E’₁₀(-10,10,-70)。

S205: a relative coordinate tensor is determined based on the set of relative coordinates, the number of skeletal joint points, and the number of frames of images in the sequence of human skeletal images.

In the embodiment of the present application, the human skeleton data is converted into tensor C × T × V, please refer to fig. 5, where fig. 5 is a schematic structural diagram of a relative coordinate tensor provided in the embodiment of the present application, where C represents a channel number (in the present application, a human skeleton joint point is represented by three-dimensional coordinate information, that is, the channel number is 3), and three channels x, y, and z respectively represent a relative coordinate set of a skeleton joint point set of each frame of image in the human skeleton image on the channel; t represents a frame number sequence of the human skeleton image; v represents a sequence of joint points of the human skeleton.

S207: and determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence.

In the embodiment of the application, after the three-dimensional coordinate information set corresponding to the skeleton joint point set is converted into the relative coordinate set according to the root joint in each frame, a plurality of inter-frame differential value sets are determined channel by channel (x channel, y channel and z channel) according to a plurality of relative coordinate sets corresponding to the human body skeleton image sequence, and the inter-frame differential value is the relative displacement of a certain skeleton joint point between two adjacent frames of images.

The description is continued based on the above example. For example, the interframe difference value of the joint point D between the 2 nd frame image and the 1 st frame image is calculated channel by channel, the interframe difference value of the joint point D on the x channel is 0, the interframe difference value of the joint point D on the y channel is 0, and the interframe difference value of the joint point D on the Z channel is 8, which indicates that the leg movement of the human body is changed only in the Z-axis direction; for another example, the inter-frame difference values between the 2 nd frame image and the 1 st frame image of the joint point B are calculated channel by channel, and the inter-frame difference values on the x channel, the y channel and the z channel are all 0, which indicates that the human hand motion is not changed in any direction.

S209: and determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of the skeleton joint points and the number of frames of the images in the human body skeleton image sequence.

S211: an input tensor is determined based on the relative coordinate tensor and the temporal difference tensor.

In the embodiment of the present application, in addition to constructing the above-mentioned relative coordinate tensor to obtain the features of the human motion in the spatial domain, a time difference tensor is also constructed based on the above-mentioned inter-frame difference value set to extract the features of the human motion in the temporal domain, and the time difference tensor and the relative coordinate tensor are connected in series to be used as an input tensor C × T × v, please refer to fig. 6, where fig. 6 is a schematic structural diagram of an input tensor provided in the embodiment of the present application, where three channels x, y, and z respectively represent relative coordinate sets of a skeleton joint set of each frame image in the human skeleton image on the channel, and the other three channels △ x, △ y, and △ z respectively represent relative coordinate sets corresponding to the skeleton joint set in each frame image based on the inter-frame difference value sets of the skeleton joint set in the previous frame image on the corresponding x channel, y channel, and z channel, respectively.

S213: and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence.

In this embodiment of the application, the motion recognition model may be a network model improved based on a graph convolution network model, and the motion recognition model may include: an input layer, 1 Batch Normalization layer (BN), 10 convolution layers, 1 Global Average Pooling layer (GAP), 1 fully connected layer (FC), and an output layer. Wherein each convolutional layer comprises 1 pseudo-map convolution module, 1 spatiotemporal attention extraction module and 1 time convolution module. The acquired human skeleton image sequence is input to an input layer of the action recognition model, an input tensor is determined through the input layer and then transmitted to a batch normalization layer, and the purpose of the batch normalization layer is to normalize the input of the action recognition model through a batch processing normalization layer, so that gradient disappearance and explosion can be avoided, and the training speed is improved. And then, extracting feature tensors from the input tensors processed by the batch normalization layer through 10 convolutional layers in sequence, and feeding the feature tensors output by the final convolutional layers into a global average pooling layer, wherein the purpose of the global average pooling layer is to reduce feature dimensions. And then, the tensor output by the global average pooling layer is fed into the full connection layer to obtain the classification score of the human skeleton image sequence, and finally, the classification and identification of the human body actions are completed through a Softmax classification module of the output layer.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a motion recognition model according to an embodiment of the present application, which sequentially includes an input layer, a batch normalization layer, a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth convolution layer, a ninth convolution layer, a tenth convolution layer, a global average pooling layer, a full connection layer, and an output layer. In one specific example:

the role of the input layer can be the role of data processing, and steps S201 to S211 are executed to acquire a human skeleton image sequence and determine an input tensor.

The function of the batch normalization layer is to realize normalization of data, and the technology is common knowledge of those skilled in the art and will not be described in detail here.

Referring to fig. 8, fig. 8 is a schematic structural diagram of convolutional layers provided in an embodiment of the present application, and the structure of 10 convolutional layers in the embodiment of the present application can be referred to, and includes 1 pseudo-map convolution module, 1 spatiotemporal attention extraction module, and 1 time convolution module. Since the input tensor is 6 channels, the first convolution layer pseudo-map convolution module has 6 input channels and also 64 channels for output. The pseudo-map convolution modules of the second, third and fourth convolution layers have 64 input channels and 64 output channels, respectively. The pseudo-map convolution module of the fifth convolution layer has 64 input channels and 128 output channels. The pseudo-map convolution modules of the sixth convolution layer and the seventh convolution layer have 128 input channels and 128 output channels, respectively. The pseudo-map convolution module of the eighth convolution layer has 128 input channels and 256 output channels. The ninth convolutional layer and the tenth convolutional layer have 256 input channels and 256 output channels, respectively. Wherein the step size of the fifth convolutional layer and the eighth convolutional layer may be set to 2.

And feeding the output tensor of the preorder tenth layer of the convolutional layer into a global pooling layer, wherein the global pooling layer acquires 256-dimensional characteristic vectors from the human skeleton image sequence.

And after the global pooling layer, feeding the output tensor into the full connection layer to obtain the action classification score corresponding to the human skeleton image sequence, and then finishing the human action classification identification through a Softmax classification module of the output layer.

In the embodiment of the application, each convolution layer comprises 1 pseudo-map convolution module, 1 space-time attention extraction module and 1 time convolution module. The pseudo-map convolution modules of the 10 convolutional layers are all used for acquiring a trained adjacent matrix, and pseudo-map convolution operation is carried out on the basis of the product of the tensor output by the last convolutional layer and the adjacent matrix to obtain a spatial feature tensor; then, a space-time attention extraction module performs space-time attention extraction operation based on the space feature tensor to obtain a space-time calibration feature tensor; the spatio-temporal alignment feature tensor comprises a plurality of feature planes with different weights; and finally, the time convolution module performs time convolution operation on the time-space calibration feature tensor to obtain an output tensor of the convolution layer.

In an optional embodiment of performing a pseudo-image convolution operation based on a product of a tensor output by a previous convolutional layer and an adjacent matrix to obtain a spatial feature tensor, the spatial feature tensor may be determined according to formula (1):

wherein f is_outA tensor representing spatial features; w_iRepresenting a weight; f. of_inA tensor representing an output of a previous convolutional layer;

representing the trained adjacency matrix; n represents the number of adjacency matrices in each layer; i denotes the ith adjacency matrix in each pseudo-graph convolution module.

In the embodiment of the present application, the trained adjacency matrices corresponding to the pseudo-map convolution modules in each convolution layer are different from each other. Referring to fig. 9, fig. 9 is a schematic diagram of a trained adjacency matrix according to an embodiment of the present application. The first row in fig. 9 shows original adjacency matrices obtained by training based on NTU-RGB + D Cross-Subject in the prior art, these adjacency matrices only represent the connection relationships between the joint points with physical direct connection relationships in the human skeleton joint points, and are trained and kept fixed. Second row and third row in fig. 9, in the embodiment of the present application, 10 adjacent matrices obtained by training based on NTU-RGB + D Cross-Subject, a first column in the second row may be a matrix in a pseudo-map convolution module of a first layer of convolutional layer, and a last column in the third row may be a matrix in a pseudo-map convolution module of a tenth layer of convolutional layer. Here, since it is a learnable adjacency matrix, it is not related to the predefined graph and the normalized adjacency matrix in the related art, and therefore, it is called a pseudo graph convolution module.

In the embodiment of the application, the positions and the motion states of the skeletal joint points in the three-dimensional directions are considered to contribute to the classification of the actions, and certain frames containing prominent features play an important role in distinguishing the action types. Therefore, in the convolutional layer provided in the embodiment of the present application, a spatiotemporal attention extraction module performs spatiotemporal attention extraction operation to obtain a spatiotemporal alignment feature tensor.

An alternative implementation of the spatiotemporal attention extraction operation by the spatiotemporal attention extraction module to obtain the spatiotemporal alignment feature tensor is described below. Referring to fig. 10, fig. 10 is a flowchart illustrating a spatiotemporal attention extraction operation according to an embodiment of the present disclosure. Firstly, extracting information channel by using global average pooling, then reducing the number of channels through a full connection layer and a ReLU nonlinear operation layer, and then recovering the number of channels through one full connection layer and the ReLU nonlinear operation layer, so that the spatial characteristics can be calibrated; in order to recalibrate the temporal features, the channel axis and the time axis are exchanged to obtain the tensor T × C × V, then the same operation is applied, and after the temporal features are recalibrated, the feature tensor is changed back to the original shape. The Hadamard product is used for mixing spatial features and temporal features, after the mixed tensor is changed into V multiplied by T multiplied by C, the space-time attention tensor is extracted by adopting 1 multiplied by 1 convolution, and the original input tensor is multiplied by the space-time attention tensor to obtain the space-time calibration feature tensor.

Optionally, the time alignment operation may perform time alignment on the features obtained after the previous convolutional layer time convolution operation; the temporal alignment operation may also increase the weight of some important frames in the convolutional layer before the temporal convolution, and then perform the temporal convolution operation, thus contributing to high-quality feature extraction.

In the embodiment of the application, all experiments in the training process of the motion recognition model can be performed based on a PyTorch deep learning framework. Using Stochastic Gradient Descent (SGD) with Nesterov momentum for optimization, the learning rate, momentum, and weight decay may be set to 0.1, 0.9, and 0.0001, respectively. Dropout with a probability of 0.2 is used to mitigate overfitting during training.

All elements in (1) are initialized to 1. The cross entropy is chosen as a loss function of the counter-propagating gradient.

The method provided by the embodiment of the application is compared with other skeleton-based action recognition methods based on an NTU-RGB + D data set and an HDM05 data set respectively.

For the NTU-RGB + D dataset, there are a maximum of two people in each sample of the dataset, and the maximum number of frames per sample is 300. For samples less than 300 frames, these samples are repeated until 300 frames are reached. The batch size is set to 32. The learning rate is set to 0.1 and divided by 10 at the 20 th epoch and the 40 th epoch. The training process ends at the 60 th epoch. In the embodiment of the present application, the method proposed in the present application is trained based on two common benchmarks: Cross-Subject and Cross-View. The accuracy of the top-1 classification identification of the other 16 methods and the method of the present application (PGCN-TCA) was calculated in the test stage. Table 1 shows the results of a comparison in which the accuracy of the method provided in the examples of the present application (PGCN-TCA) on a Cross-Subject basis was 88.0% for the top-1 classification, second only to 2s-AGCN, but superior to most existing methods; the accuracy of the method provided by the embodiment of the application (PGCN-TCA) in Cross-View benchmark is 93.6% for top-1 classification, which is second only to 2s-AGCN, but the method is superior to most existing methods and has higher accuracy.

Table 1: accuracy comparison of Cross-Subject and Cross-View Top-1 classifications on NTU-RGB + D datasets

NO.	Methods	Cross-Subject(％)	Cross-View(％)
				1	Lie Group	50.1	52.8
2	H-RNN	59.1	64.0
				3	Deep LSTM	60.7	67.3
4	ST-LSTM+TS	69.2	77.7
				5	Temporal Conv	74.3	83.1
6	Visualize CNN	76.0	82.6
				7	Visualize CNN	79.6	84.8
8	ST-GCN	81.5	88.3
				9	MANs	82.7	93.2
10	DPRL	83.5	89.8
				11	SR-TSL	84.8	92.4
12	HCN	86.5	91.1
				13	PB-GCN	87.5	93.2
14	RA-GCN	85.9	93.5
				15	AS-GCN	86.8	94.2
16	2s-AGCN	88.5	95.1
				17	PGCN-TCA	88.0	93.6

For the HDM05 data set, the maximum number of frames in each sample was 901. For samples less than 901 frames, repeat until it reaches 901 frames. The batch size is set to 16. The learning rate is also set to 0.1 and divided by 10 at the 100 th epoch. The training process ends at the 120 th epoch. The evaluation was performed 10 times in a random evaluation mode, half of the sequences in the data set were randomly selected for training and the rest of the sequences were used for testing each evaluation. In each evaluation, the accuracy of the top-1 classification of the 7 other methods and the method provided in the present embodiment (PGCN-TCA) was calculated. Table 2 shows the results of comparison, wherein the method provided in the examples of the present application is inferior to PB-GCN, and is superior to most of the existing methods.

Table 2: top-1 classification accuracy comparison on HDM05 dataset

According to the experimental results, the motion classification accuracy determined based on the human skeleton image sequence is high in the image identification method provided by the embodiment of the application.

An embodiment of the present application further provides an image recognition apparatus, and fig. 11 is a schematic structural diagram of the image recognition apparatus provided in the embodiment of the present application, and as shown in fig. 11, the apparatus includes:

a first obtaining module 1101, configured to obtain a human skeleton image sequence; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; skeleton joint points of each frame of human skeleton image are consistent;

a first determining module 1102, configured to determine a corresponding set of relative coordinates in a set of skeleton joint points of each frame of human skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one;

a second determining module 1103, configured to determine a relative coordinate tensor based on the relative coordinate set, the number of skeletal joint points, and the number of frames of images in the human body skeleton image sequence;

a third determining module 1104, configured to determine a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;

a fourth determining module 1105, configured to determine a temporal difference tensor based on the inter-frame difference value sets, the number of skeleton joint points, and the number of frames of the images in the human skeleton image sequence;

a fifth determining module 1106 for determining an input tensor based on the relative coordinate tensor and the temporal difference tensor;

and the action recognition module 1107 is configured to perform action recognition on the input tensor based on the trained action recognition model to obtain an action category corresponding to the human skeleton image sequence.

The device and method embodiments in the embodiments of the present application are based on the same application concept.

An embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the image recognition method.

Embodiments of the present application further provide a storage medium, which may be disposed in a server to store at least one instruction, at least one program, a set of codes, or a set of instructions related to implementing an image recognition method in the method embodiments, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the image recognition method.

Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

As can be seen from the embodiments of the image recognition method, the image recognition device, the electronic device, or the storage medium provided in the present application, a human skeleton image sequence is obtained; the human body skeleton image sequence comprises continuous multiframe human body skeleton images; skeleton joint points of each frame of human skeleton image are consistent; determining a corresponding relative coordinate set in a skeleton joint point set of each frame of human body skeleton image; the relative coordinates in the relative coordinate set correspond to the skeleton joint points in the skeleton joint point set one by one; determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the frame number of the images in the human body skeleton image sequence; determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence; determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points and the number of frames of images in the human body skeleton image sequence; determining an input tensor based on the relative coordinate tensor and the temporal difference tensor; and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence. According to the method and the device, the input tensor of the motion recognition model based on the graph convolution network is constructed by utilizing the human body skeleton joint point information, the motion recognition is carried out, and the accuracy of human motion recognition can be improved.

It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. An image recognition method, comprising:

determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joint points and the number of frames of the images in the human body skeleton image sequence;

determining a temporal difference tensor based on the plurality of sets of interframe difference values, the number of the skeletal joint points, and the number of frames of images in the human body skeleton image sequence;

and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human body skeleton image sequence.

2. The method of claim 1, wherein determining a corresponding set of relative coordinates in the set of skeletal joint points for each frame of the human skeletal image comprises:

determining a coordinate information set of a skeleton joint point set of each frame of human skeleton image in the human skeleton image sequence;

determining a root node from the skeleton joint point set;

and determining the relative coordinates of each skeleton joint point in the skeleton joint point set in each frame of human body skeleton image based on the root node in the skeleton joint point set to obtain the relative coordinate set.

3. The method of claim 1, wherein the motion recognition model comprises:

the system comprises an input layer, 1 batch normalization layer, 10 convolution layers, 1 global average pooling layer, 1 full-connection layer and an output layer;

the convolutional layer includes a pseudo-graph convolution module, a spatiotemporal attention extraction module, and a temporal convolution module.

4. The method of claim 3, wherein the motion recognition of the input tensor based on the trained motion recognition model comprises:

acquiring a trained adjacency matrix;

performing pseudo-graph convolution operation based on the product of the input tensor and the adjacent matrix, and outputting a spatial feature tensor;

performing space-time attention extraction operation based on the spatial feature tensor, and outputting a space-time calibration feature tensor; the spatiotemporal alignment feature tensor comprises a plurality of differently weighted feature planes;

performing time convolution operation on the space-time calibration characteristic tensor to obtain an output tensor;

wherein the trained adjacency matrices corresponding to each of the 10 convolutional layers are different from each other.

5. An image recognition apparatus, comprising:

a second determination module, configured to determine a relative coordinate tensor based on the set of relative coordinates, the number of skeletal joint points, and the number of frames of images in the human body skeleton image sequence;

a fourth determining module, configured to determine a temporal difference tensor based on the plurality of inter-frame difference value sets, the number of the skeleton joint points, and the number of frames of images in the human skeleton image sequence;

and the action recognition module is used for carrying out action recognition on the input tensor based on the trained action recognition model to obtain an action category corresponding to the human body skeleton image sequence.

6. The apparatus of claim 5,

the first determining module is further configured to determine a coordinate information set of a skeleton joint point set of each frame of human skeleton image in the human skeleton image sequence; determining a root node from the skeleton joint point set; and determining the relative coordinates of each skeleton joint point in the skeleton joint point set in each frame of human body skeleton image based on the root node in the skeleton joint point set to obtain the relative coordinate set.

7. The apparatus of claim 5, wherein the motion recognition model comprises:

8. The apparatus of claim 7,

the action recognition module is also used for acquiring a trained adjacency matrix; performing pseudo-graph convolution operation based on the product of the input tensor and the adjacent matrix, and outputting a spatial feature tensor; performing space-time attention extraction operation based on the spatial feature tensor, and outputting a space-time calibration feature tensor; the spatiotemporal alignment feature tensor comprises a plurality of differently weighted feature planes; performing time convolution operation on the space-time calibration characteristic tensor to obtain an output tensor; wherein the trained adjacency matrices corresponding to each of the 10 convolutional layers are different from each other.

9. An electronic device, characterized in that the device comprises a processor and a memory in which at least one instruction, at least one program, set of codes or set of instructions is stored, which is loaded and executed by the processor to implement the image recognition method according to any one of claims 1 to 4.

10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the image recognition method according to any one of claims 1 to 4.