WO2022073282A1 - Motion recognition method based on feature interactive learning, and terminal device - Google Patents

Motion recognition method based on feature interactive learning, and terminal device Download PDF

Info

Publication number
WO2022073282A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
video sequence
network model
module
convolution
Prior art date
Application number
PCT/CN2020/129550
Other languages
French (fr)
Chinese (zh)
Inventor
任子良
程俊
张锲石
高向阳
康宇航
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022073282A1 publication Critical patent/WO2022073282A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression

Definitions

  • the present application belongs to the technical field of computer vision, and in particular, relates to an action recognition method and terminal device based on feature interaction learning.
  • human action recognition has become one of the research hotspots in the field of computer vision.
  • computers can automatically understand and describe human actions in videos, which has great application value in many fields, such as video surveillance, human-computer interaction, motion analysis, content-based video retrieval, and autonomous driving.
  • the methods of human action recognition mainly include methods based on artificially designed features and methods based on neural network deep learning features.
  • the methods based on neural network deep learning features have achieved certain success in the recognition of human actions.
  • when the current human action recognition methods based on neural network deep learning process the action classification and recognition of long video sequences, a certain number of video frames are obtained through sparse sampling as the input of the neural network, and the features of the video frames are extracted layer by layer through the neural network to identify and classify human actions.
  • due to the complexity and variability of video shooting angles, shooting dimensions and shooting backgrounds, and the differences and similarities of actions, sparse sampling of a single modality yields a low accuracy rate of action recognition.
  • the embodiments of the present application provide an action recognition method and terminal device based on feature interaction learning, which can solve the problem of low accuracy of action recognition due to the sparse sampling method of a single modality.
  • an embodiment of the present application provides an action recognition method based on feature interaction learning, the method including: acquiring video data of an action to be recognized, the video data including a first video sequence and a second video sequence; performing compression processing on the first video sequence and the second video sequence respectively to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence;
  • inputting the first motion map and the second motion map into a trained dual-stream neural network model, and interactively learning the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence output by the trained dual-stream neural network model; and determining a classification result of the action to be recognized based on the first prediction result and the second prediction result.
  • performing compression processing on the first video sequence to obtain a first motion picture corresponding to the first video sequence including:
  • performing compression processing on the second video sequence to obtain a second motion image corresponding to the second video sequence includes:
  • the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, the routing module being set between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and its output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and its output is the second prediction result of the second video sequence; the routing module is used for performing interactive learning on the features of the first motion map and the features of the second motion map between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
  • the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers
  • the intermediate convolution module of the second neural network model includes a second convolution module corresponding to the first convolution module
  • the first motion map and the second motion map are input into the trained dual-stream neural network model, and the trained dual-stream neural network model performs interactive learning on the features of the first motion map and the features of the second motion map to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence, including:
  • the output of the first convolution module of the first layer and the output of the second convolution module of the first layer are used as the input of the routing module of the first layer, and feature interactive learning is performed by the routing module of the first layer to obtain a first route output;
  • the superposition result of the output of the first convolution module of the first layer and the first route output is used as the input of the first convolution module of the second layer, and the first convolution module of the second layer performs feature learning to obtain the output of the first convolution module of the second layer; the superposition result of the output of the second convolution module of the first layer and the first route output is used as the input of the second convolution module of the second layer;
  • the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the first neural network model;
  • the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the second neural network model;
  • the routing module of the first layer and the routing module of the second layer are two adjacent computing modules.
  • the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit; the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit and the second activation unit of the routing module in turn perform interactive learning on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, to obtain the feature matrix output by the routing module.
  • the determining a classification result of the to-be-recognized action based on the first prediction result and the second prediction result includes:
  • Feature fusion is performed on the first prediction result and the second prediction result to obtain a probability distribution of action categories; the action category with the highest probability in the probability distribution is used as the classification result of the action to be recognized.
  • the first neural network model includes a first loss function
  • the second neural network model includes a second loss function
  • the first neural network model, the second neural network model and the routing module are trained, and the parameters of the first neural network model, the parameters of the second neural network model and the parameters of the routing module are adjusted according to the first loss function and the second loss function respectively; if the first loss function and the second loss function meet a preset threshold, the training of the parameters of the first neural network model, the parameters of the second neural network model and the parameters of the routing module is stopped to obtain the trained dual-stream neural network model.
  • an embodiment of the present application provides an action recognition device based on feature interactive learning, including:
  • an acquisition unit configured to acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
  • a processing unit configured to compress the first video sequence and the second video sequence respectively, to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence;
  • a computing unit used for inputting the first motion map and the second motion map into a trained dual-stream neural network model, and interactively learning the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;
  • An output unit configured to determine a classification result of the action to be recognized based on the first prediction result and the second prediction result.
  • an embodiment of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program When implementing the method described in the first aspect and possible implementation manners of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the first aspect and possible implementations of the first aspect are implemented method described.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the action recognition method described in any one of the first aspects above.
  • the terminal device can obtain video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
  • the first video sequence and the second video sequence are respectively compressed to obtain the first motion map corresponding to the first video sequence and the second motion map corresponding to the second video sequence;
  • the first motion map and the second motion map are input into the trained dual-stream neural network model, and through the trained dual-stream neural network model, the features of the first motion map and the features of the second motion map are interactively learned to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence; based on the first prediction result and the second prediction result, the classification result of the action to be recognized is determined;
  • compressing the first video sequence and the second video sequence into the first motion map and the second motion map respectively enriches the spatial and temporal representation of the video data, making the information representation more complete and the features richer; by using the first motion map and the second motion map as the input of the dual-stream neural network model and interactively learning the multi-modal image features through the neural network model, the accuracy of action recognition is improved, with strong ease of use and practicality.
  • FIG. 1 is a schematic flowchart of an action recognition method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of video data compression processing provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a network architecture of a dual-stream neural network model provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of the architecture of a routing module of a dual-stream neural network provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of the architecture of a middle-level feature interaction learning unit of a dual-stream neural network provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a motion recognition device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the term “if” may be contextually interpreted as “when”, “once”, “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined”, “in response to the determination”, “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
  • the terms “comprising”, “including”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • models based on convolutional neural networks, such as 2D convolutional networks (2D-ConvNets), 3D convolutional networks (3D-ConvNets), and recurrent neural networks (RNN), are mainly used.
  • in human action recognition methods based on convolutional neural networks, given a certain number of RGB or depth video sequences, a certain number of video frames are obtained as the input of the network through sparse sampling, the convolutional neural network extracts the features in the video frames layer by layer, and human actions are classified and recognized through a classifier or a normalization (Softmax) layer.
  • human action recognition methods based on convolutional neural networks can be divided into the following two categories. First, the 2D end-to-end network training structure: a deep network is trained with supervision on large-scale labeled datasets, and the trained model for the actual task is then obtained through parameter fine-tuning. For video sequences, this method mainly uses sparse sampling to obtain certain frames of the entire video sequence as the network input, and cannot learn the temporal dimension features of human actions well. Second, the 3D end-to-end network training structure: a few frames of images are obtained through sparse sampling as the input of the network model, and the classification model is obtained through supervised training and parameter fine-tuning. This method can obtain a better recognition effect, but its huge amount of calculation restricts its application in practical scenarios.
  • methods based on the RNN recurrent neural network: the RNN network memorizes previous information and applies it to the calculation of the current output. It can process sequence data of any length and realizes feature learning, recognition and classification through ordered loop learning of the input sequence. It has been applied with good results in the field of natural language processing, but needs to be further improved for human action recognition.
  • action recognition methods based on convolutional neural networks or other deep networks lack the representation of multi-modal spatiotemporal information of long video sequences and the mutual learning of multi-modal features, so the accuracy of action recognition still needs to be improved.
  • this application realizes the processing and recognition of user action classification based on the representation of multi-modal spatiotemporal information of long video sequences and the mutual learning of multi-modal features, which further improves the accuracy of user action recognition.
  • the technical solutions of the present application will be described in detail below with reference to the drawings and specific embodiments.
  • FIG. 1 is a schematic flowchart of a motion recognition method provided by an embodiment of the present application.
  • the execution body of the method may be an independent terminal device, such as a mobile phone, a computer, a multimedia device, a streaming media device, or a monitoring device; it may also be an integrated module in the terminal device, implemented as a certain function of the terminal device.
  • the following describes an example of the method being executed by a terminal device, but the embodiment of the present application is not limited to this. As shown in Figure 1, the method includes:
  • Step S101 acquiring video data of the action to be recognized, where the video data includes a first video sequence and a second video sequence.
  • the video data is a multi-frame image sequence combined in temporal order, that is, the sequence of all the image frames of the entire video in which the action is to be identified.
  • the terminal device can acquire the video data of the action to be recognized in real time through the RGB-D camera device; the video data of the action to be recognized can also be the video data pre-stored in the terminal device.
  • the first video sequence and the second video sequence are video frame sequences of two different modalities, that is, different feature representations of the same piece of video data, for example, a color video sequence in RGB format, a video sequence represented by depth information, a video sequence in optical flow map format, or a skeleton sequence, etc.
  • the first video sequence is a color video sequence in RGB format and the second video sequence is a video sequence represented by depth information.
  • a color video sequence is a multi-frame image sequence in RGB format, that is, a color image sequence in which the pixel information in each frame of image is represented by the three colors red (R), green (G) and blue (B); a video sequence represented by depth information is a depth image sequence in which the pixel information in each frame of image is represented by depth values, and the image depth of each frame determines the possible number of colors or possible gray levels of each pixel of the image.
  • the acquired sequence frames of the first video sequence and the second video sequence can be in one-to-one correspondence in time sequence, that is, the video frames of the first video sequence and of the second video sequence at the same moment correspond to each other;
  • the first video sequence and the second video sequence of the same video can include the same number of video frames, so that the spatiotemporal information of the video frames at the same moment can be represented by different feature quantities.
  • the spatiotemporal information includes the temporal dimension information and the spatial dimension information of the video frame sequence; the temporal dimension information is reflected in the different time points corresponding to each video frame, and the continuous video frame sequence constitutes a dynamic effect through the continuity of time; the spatial dimension information can be expressed as texture information or color information of each video frame.
  • a video frame of a color video sequence in RGB format is represented by a 3-channel * width W * height H matrix, and the elements in the three channels represent the color information of each pixel in the video frame; for a video frame represented by depth information, the depth information is measured in length units (such as millimeters), and the distance information representing depth is correspondingly converted into grayscale information to obtain a grayscale image represented by a matrix in the form of 1 channel * width W * height H.
  • sequence frames of the first video sequence and the sequence frames of the second video sequence correspond one-to-one in time sequence, and are the same piece of video data.
  • for example, if the first video sequence includes 50 frames of images, the second video sequence also includes 50 frames of images.
  • if the first video sequence is color video data in RGB format, it can be acquired by a camera in RGB format; if the second video sequence is a video sequence represented by depth information, it can be acquired by a depth camera; the shooting parameters set for the two cameras may be the same, the same target is shot in the same time period, and there is no specific limitation here.
  • the terminal device can be a device integrated with the camera device, and the terminal device can directly obtain the video data of the action to be recognized through the camera device; the terminal device can also be a device separate from the camera device, and the terminal device communicates with the camera device through a wired or wireless connection to obtain the video data of the action to be recognized.
  • the action to be identified may be a human action or activity, or an animal action or activity, which is not specifically limited here.
  • the terminal device obtains the image frame sequence of the entire video data of the action to be recognized and records the multi-modal underlying features of the video data, making good use of the spatiotemporal information features of the different modalities of the first video sequence and the second video sequence; this provides a basis for the subsequent neural network model to learn various possibilities of the features and enhances the neural network model's ability to express and recognize image features.
  • Step S102 Compress the first video sequence and the second video sequence respectively to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence.
  • the terminal device compresses the multi-frame images of the first video sequence and the multi-frame images of the second video sequence respectively, so as to obtain rich spatial and temporal feature representations.
  • the feature representation of the first motion picture is different from the feature representation of the second motion picture; they are representations of different underlying features of the same video, that is, the image features of the video frames in the first video sequence and the second video sequence are represented by different image information.
  • the spatiotemporal information of the first motion picture includes the spatiotemporal information of all video frames of the first video sequence
  • the spatiotemporal information of the second motion picture includes the spatiotemporal information of all the video frames of the second video sequence; for example, for the RGB video sequence and the depth video sequence, the temporal dimension information and spatial dimension information of the images are compressed and expressed as a single three-channel image and a single single-channel image, respectively, showing dynamic effects and information such as color and texture.
  • each video frame of the first video sequence corresponds to a feature matrix
  • each video frame of the second video sequence corresponds to a feature matrix
  • the first video sequence or the second video sequence may respectively include T frames of images, and the feature matrix corresponding to each frame of image is I_t; the feature matrix set of the first video sequence or the feature matrix set of the second video sequence can then be expressed as <I_1, I_2, I_3, ..., I_T>, where I_1 is the feature matrix of the first frame image in the video sequence arranged in time series, and so on, and I_T is the feature matrix of the T-th frame image in the video sequence arranged in time series.
  • the first video sequence and the second video sequence are respectively subjected to compression processing, and the multiple frames of images of each video sequence are compressed and synthesized into one image, which contains feature information representing the action over time and space and can be called a motion map; a paired first motion map and second motion map containing the spatiotemporal information of the entire video sequence are thus obtained; that is, the feature matrices of multiple frames of images are combined into one image for representation, so that the spatiotemporal information of all the video frames in the video sequence can be obtained.
  • the first motion picture may be an image synthesized by compressing the frames of a video sequence in RGB format;
  • the second motion picture may be an image synthesized by compressing a video sequence represented by depth information;
  • the first motion picture and the second motion picture may also be compressed and synthesized images of video sequences corresponding to other modalities respectively.
  • the first video sequence is compressed to obtain a first motion map corresponding to the first video sequence, including:
  • the first video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the first video sequence is color video data in RGB format, the feature matrix of each frame of the first video sequence is a matrix of 3 channels * width W * height H, where width W and height H are in pixels, and the elements in the feature matrix correspond to pixels.
  • the value of each element in the feature matrix represents the feature of the pixel at the corresponding position, such as a color image in RGB format, and each element represents the feature value of each pixel in the three channels of red R, green G, and blue B respectively.
  • each image frame of the first video sequence corresponds to a feature matrix
  • the elements at the same position in the feature matrices of all video frames are added and then divided by the total number of video frames of the first video sequence to obtain the element value at each position in the feature matrix, and each element value is rounded (for example, 2.6 is rounded down to obtain 2) to obtain the feature matrix of the first motion map corresponding to the first video sequence.
  • as shown in FIG. 2, a schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is color video data in RGB format, the RGB video sequence is compressed to obtain the corresponding RGB motion picture, and the spatiotemporal information of multiple frames of images is synthesized into the spatiotemporal information of one motion picture.
  • the feature matrix of the motion map corresponding to the RGB video sequence may be a 3*W*H matrix, which can be calculated by the following formula: MI = (1/T) · ∑_{τ=1}^{T} I_τ, with each element value rounded;
  • where MI is the feature matrix of the motion map corresponding to the first video sequence,
  • T is the total number of frames of the first video sequence,
  • I_τ is the feature matrix of the τ-th frame image in the first video sequence,
  • and the value of τ is an integer in the range [1, T].
  • the value range of an element in the feature matrix of each frame of image of the first video sequence may be an integer in [0, 255], and the value range of each element in the feature matrix of the motion map MI obtained after the compression processing of the first video sequence is also an integer in [0, 255].
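  • As a minimal illustrative sketch (not part of the patent text) of the compression described above, the following NumPy code averages the per-frame feature matrices of an RGB sequence element by element and rounds the result down; the array shapes, frame count and function name are assumptions chosen for the example.
```python
import numpy as np

def compress_to_motion_map(frames: np.ndarray) -> np.ndarray:
    """Compress a video sequence into a single motion map.

    frames: array of shape (T, H, W, 3) with uint8 RGB values in [0, 255].
    Returns an (H, W, 3) uint8 motion map: the element-wise average of all
    frames, rounded down (e.g. 2.6 -> 2), as described for the first motion map.
    """
    # Sum the feature matrices of all frames at each pixel position,
    # then divide by the total number of frames T (integer division = round down).
    summed = frames.astype(np.int64).sum(axis=0)
    motion_map = summed // frames.shape[0]
    return motion_map.astype(np.uint8)

# Example: 50 RGB frames of size 224x224 compressed into one motion map.
rgb_frames = np.random.randint(0, 256, size=(50, 224, 224, 3), dtype=np.uint8)
rgb_motion_map = compress_to_motion_map(rgb_frames)
print(rgb_motion_map.shape, rgb_motion_map.dtype)  # (224, 224, 3) uint8
```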
  • the second video sequence is compressed to obtain a second motion image corresponding to the second video sequence, including:
  • the second video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the second video sequence is an image sequence in which each video frame is represented by depth information, the feature matrix of each frame image is a matrix of 1 channel * width W * height H, where width W and height H are in pixels, the elements in the feature matrix correspond to pixels, and the value of each element in the feature matrix represents the feature of the pixel at the corresponding position.
  • the depth map of each frame in the second video sequence can be converted to grayscale, and the depth information of each pixel in the depth map is mapped to [0, 255] to obtain the grayscale image of the video frame; the value of each element in the feature matrix of the grayscale image is an integer in the range [0, 255].
  • the value of a video sequence represented by depth information may range from 0 to 10000 mm, while the representation range of images in computer vision is [0, 255]; therefore, the video sequence represented by depth information needs to be scaled to match the visual representation range, that is, mapped and converted to a grayscale image.
  • the compression process of the second video sequence is similar to that of the first video sequence.
  • a feature matrix of the grayscale image of each frame is obtained; the elements at the same position in the feature matrices of the grayscale images corresponding to all video frames in the second video sequence are added and divided by the total number of video frames of the second video sequence to obtain the element value at each position, and each element value is rounded to obtain the feature matrix of the motion map corresponding to the second video sequence.
  • as shown in FIG. 2, a schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is a video sequence represented by depth information, the depth video sequence is subjected to grayscale processing to obtain the grayscale images corresponding to the depth video sequence, the grayscale images are compressed to obtain the corresponding depth motion map, and the spatiotemporal information of multiple frames of images is synthesized into the spatiotemporal information of one motion map.
  • the feature matrix of the motion map corresponding to the depth video sequence may be a 1*W*H matrix, which can be calculated by the following formula: MJ = (1/N) · ∑_{n=1}^{N} I_n, with each element value rounded;
  • where MJ is the feature matrix of the motion map corresponding to the second video sequence,
  • N is the total number of frames of the second video sequence,
  • I_n is the feature matrix of the grayscale image of the n-th frame in the second video sequence,
  • and the value of n is an integer in the range [1, N].
  • N and T may be equal, and the values of n and ⁇ may be equal, that is, the video frames of the first video sequence and the video frames of the second video sequence correspond one-to-one in time sequence.
  • the value range of an element in the feature matrix of each frame of grayscale image corresponding to the second video sequence may be an integer in [0, 255], and the value range of each element in the feature matrix of the motion map MJ corresponding to the second video sequence can also be an integer in [0, 255].
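  • A complementary sketch for the depth stream, again purely illustrative: each depth frame is first mapped to a [0, 255] grayscale image and the same averaging compression is then applied; the 10000 mm maximum depth and the helper names are assumptions taken from the example range mentioned above.
```python
import numpy as np

def depth_to_grayscale(depth_frames: np.ndarray, max_depth_mm: float = 10000.0) -> np.ndarray:
    """Map depth values (e.g. 0..10000 mm) to uint8 grayscale values in [0, 255]."""
    scaled = np.clip(depth_frames, 0, max_depth_mm) * (255.0 / max_depth_mm)
    return scaled.astype(np.uint8)

def compress_depth_sequence(depth_frames: np.ndarray) -> np.ndarray:
    """Compress an (N, H, W) depth sequence into a single-channel motion map MJ."""
    gray = depth_to_grayscale(depth_frames)             # (N, H, W) grayscale frames
    summed = gray.astype(np.int64).sum(axis=0)          # element-wise sum over all frames
    return (summed // gray.shape[0]).astype(np.uint8)   # divide by N and round down

# Example: 50 depth frames of size 224x224 compressed into one 1-channel motion map.
depth_frames = np.random.randint(0, 10001, size=(50, 224, 224))
depth_motion_map = compress_depth_sequence(depth_frames)
print(depth_motion_map.shape)  # (224, 224)
```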
  • the video frames in the first video sequence in the RGB format and the video frames in the second video sequence represented by depth information may be in one-to-one correspondence.
  • the grayscale image sequence obtained by performing grayscale processing on the video frames in the second video sequence represented by the depth information is also in one-to-one correspondence with the video frames in the first video sequence in the RGB format.
  • Step S103 inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence.
  • the dual-stream neural network model is an overall model including two independent convolutional neural network models and a routing module.
  • the dual-stream neural network model includes two inputs and two outputs. Among them, the two inputs correspond to the feature information of the two modalities of the video data respectively, and the two outputs correspond to the prediction results of the input information of the two modalities respectively.
  • the dual-stream neural network model includes two independent convolutional neural network models and routing modules, and the inputs of the two convolutional neural network models are the first motion map and the second motion map respectively;
  • the convolutional neural network model of each channel includes multiple convolutional layers, such as the convolutional module Conv1, the convolutional module Conv2_x, the convolutional module Conv5_x and the fully connected layer, wherein the convolutional module Conv2_x and convolution module Conv5_x respectively represent a total convolution module, and a total convolution module may include a number of convolution layers or convolution calculation units.
  • the output of the previous convolution module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the input of the next convolution module.
  • the mid-level interaction features of different modalities in the dual-stream neural network model are learned through the routing module.
  • the basic network of the two-way convolutional neural network model can be a residual network (ResNet); due to the high modularity of the residual network, each module in the residual network can be used as a basic module, and the feature information of the different modalities of the first motion map and the second motion map is used for model training and interactive learning of features.
  • the dual-stream neural network model is optimized and trained through dual loss functions.
  • the basic network model of the dual-stream neural network model can be a deep network model such as Inception, ImageNet, TSN, or a dual-stream network; the parameters of the basic network model are trained and adjusted by fine-tuning; the network model can also be designed as needed, and its parameters adjusted on the training set.
  • joint optimization training is performed through the dual loss functions to obtain the dual-stream high-level features of the modalities corresponding to the input image features of the different modalities; for example, if the input modalities are motion maps in RGB format and motion maps represented by depth information, dual-stream high-level features of the two modalities of RGB format and depth information can be obtained.
  • the two inputs may each include multiple channel inputs; for example, if one of the inputs is an RGB motion map, the input may include three channel inputs, corresponding respectively to the feature matrix of the red R channel, the feature matrix of the green G channel, and the feature matrix of the blue B channel of the input RGB motion map.
  • the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, and the routing module is disposed between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and its output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and its output is the second prediction result of the second video sequence; the routing module is used for interactively learning the output features of each layer of convolution modules of the dual-stream neural network model between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
  • in the architecture diagram of the dual-stream neural network model, the first neural network model corresponds to one channel of input and output, and the second neural network model corresponds to the other channel of input and output.
  • the first motion picture input into the first neural network model can be an RGB motion picture
  • the first prediction result output by the first neural network model is the identification result corresponding to the first video sequence
  • the first video sequence can be an RGB video sequence in RGB format
  • RGB motion pictures are obtained by compressing RGB video sequences in RGB format.
  • the second motion map input into the second neural network model can be a depth motion map
  • the second prediction result output by the second neural network model is the recognition result corresponding to the second video sequence
  • the second video sequence can be the depth video sequence represented by the depth information;
  • the depth motion map is obtained by compressing the depth video sequence represented by depth information.
  • the middle layer of the dual-stream neural network includes multiple convolution modules and multiple routing modules, such as the convolution module Conv1, the convolution module Conv2_x and the convolution module Conv5_x shown in Figure 4; a routing module is set after each convolution module of the two-way convolutional neural network model, the output of the previous convolution module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the input of the next convolution module; the routing module learns the mid-level interaction features of the different modalities in the dual-stream neural network model.
  • the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers
  • the intermediate convolution module of the second neural network model includes a second convolution module corresponding to the first convolution module.
  • the first motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the features of the second motion map are interactively learned through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, including: taking the output of the first convolution module of the first layer and the output of the second convolution module of the first layer as the input of the routing module of the first layer, and performing feature interactive learning by the routing module of the first layer to obtain a first route output; taking the superposition result of the output of the first convolution module of the first layer and the first route output as the input of the first convolution module of the second layer, and performing feature learning by the first convolution module of the second layer to obtain the output of the first convolution module of the second layer; and taking the superposition result of the output of the second convolution module of the first layer and the first route output as the input of the second convolution module of the second layer.
  • the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the first neural network model;
  • the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the second neural network model;
  • the routing module of the first layer and the routing module of the second layer are two adjacent computing modules.
  • a convolution module includes a number of convolution layers or convolution calculation units; a convolution layer can be regarded as a set of parallel feature maps, obtained by sliding different convolution kernels over the input image: at each sliding position, an element-wise product-and-sum operation is performed between the convolution kernel and the input image, projecting the information in the receptive field to an element of the feature map.
  • the size of the convolution kernel is smaller than the size of the input image, and it overlaps or acts on the input image in parallel; all the elements in the feature maps output by the middle convolution modules of each layer of the dual-stream neural network model are calculated through a convolution kernel.
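  • As a small, generic illustration of the sliding product-and-sum operation described above (not code from the patent), the following sketch computes a feature map of a 2D convolution without padding; the input sizes are arbitrary assumptions.
```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 2D convolution (no padding, stride 1).

    At every sliding position, the kernel and the covered image patch are
    multiplied element-wise and summed, projecting that receptive field
    onto a single element of the output feature map.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

feature_map = conv2d_valid(np.random.rand(8, 8), np.random.rand(3, 3))
print(feature_map.shape)  # (6, 6)
```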
  • the dual-stream neural network model further includes a fully connected layer, a first loss function and a second loss function.
  • the features output by the convolution module Conv5_x are used as the input of a fully connected layer, and the output features of the routing module of the last layer are used as the input of a fully connected layer.
  • the results of the two fully connected layers are added as the output of the total fully connected layer, and the first prediction result and the second prediction result are obtained.
  • the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit;
  • the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit sequentially perform interactive learning on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, to obtain the feature matrix output by the routing module.
  • the routing module includes two layers of convolution units, two layers of normalization units, and two layers of activation units; these can be the first convolution unit Conv1D, the first normalization unit Batch Normalization, the first activation unit ReLU, the second convolution unit Conv1D, the second normalization unit Batch Normalization, and the second activation unit ReLU.
  • the output of the two-way convolution module of each layer of the intermediate convolution module of the dual-stream neural network model is used as the input of the corresponding routing module, and the output of each layer of routing module is used as the input of the next layer of convolution module or of the fully connected layer.
  • the routing module can be a computing unit based on 1*1 convolution; the output of the two-way convolution modules of the previous layer is output to the convolution modules of the subsequent layer after learning and redirection by the 1*1 convolution module.
  • the output of the two-way convolution module can be the information flow of multi-modal image features, such as the information flow of RGB format and the information flow of depth image features.
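  • A hedged PyTorch sketch of how such a 1*1-convolution routing unit and the layer-wise superposition could be assembled is given below; it is not the patented implementation. In particular, fusing the two streams by channel concatenation, using 2D convolutions for the 1*1 units, and adding the route output element-wise to each stream are assumptions made for illustration, as are all class, variable and channel choices.
```python
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    """1x1-convolution routing unit: Conv -> BN -> ReLU -> Conv -> BN -> ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # fuse the two streams (assumed concatenation)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
        # Interactive learning on the features output by the two streams at the same layer.
        return self.block(torch.cat([feat_rgb, feat_depth], dim=1))

# One mid-level interaction step between two adjacent convolution layers:
channels = 64
conv_rgb_l2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)    # stand-in for a stream-1 module
conv_depth_l2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # stand-in for a stream-2 module
routing_l1 = RoutingModule(channels)

feat_rgb_l1 = torch.randn(1, channels, 56, 56)    # output of the first-layer module, RGB stream
feat_depth_l1 = torch.randn(1, channels, 56, 56)  # output of the first-layer module, depth stream

route_l1 = routing_l1(feat_rgb_l1, feat_depth_l1)      # first route output
feat_rgb_l2 = conv_rgb_l2(feat_rgb_l1 + route_l1)      # superposition feeds the next layer of stream 1
feat_depth_l2 = conv_depth_l2(feat_depth_l1 + route_l1)
```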
  • the first neural network model includes a first loss function
  • the second neural network model includes a second loss function
  • the first neural network model, the second neural network model and the routing module are trained through video sample data, and the parameters of the first neural network model, the parameters of the second neural network model and the parameters of the routing module are adjusted according to the first loss function and the second loss function respectively; if the first loss function and the second loss function meet a preset threshold, the training of the parameters of the first neural network model, the parameters of the second neural network model and the parameters of the routing module is stopped, and the trained dual-stream neural network model is obtained.
  • the dual-stream neural network model is optimized and trained with dual loss functions: according to the output result of the fully connected layer of the convolutional neural network of the first channel, the parameters of the convolutional neural network of the first channel are trained and adjusted through the first loss function; according to the output result of the fully connected layer of the convolutional neural network of the second channel, the parameters of the convolutional neural network of the second channel are trained and adjusted through the second loss function; at the same time, the parameters of the routing module are trained and adjusted through both the first loss function and the second loss function.
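  • As an illustrative sketch only (not the patented training procedure), a joint optimization step with two loss functions could look as follows in PyTorch; the tiny stand-in model omits the routing modules for brevity, and summing the two cross-entropy losses is an assumption.
```python
import torch
import torch.nn as nn

class TinyDualStream(nn.Module):
    """Minimal stand-in for the dual-stream model (two small CNN streams, routing omitted)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stream_rgb = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, num_classes))
        self.stream_depth = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, num_classes))

    def forward(self, rgb, depth):
        return self.stream_rgb(rgb), self.stream_depth(depth)

dual_stream = TinyDualStream()
criterion_rgb = nn.CrossEntropyLoss()    # first loss function
criterion_depth = nn.CrossEntropyLoss()  # second loss function
optimizer = torch.optim.SGD(dual_stream.parameters(), lr=0.01, momentum=0.9)

def train_step(rgb_motion_map, depth_motion_map, labels):
    optimizer.zero_grad()
    pred_rgb, pred_depth = dual_stream(rgb_motion_map, depth_motion_map)
    loss_rgb = criterion_rgb(pred_rgb, labels)        # first loss on the RGB-stream prediction
    loss_depth = criterion_depth(pred_depth, labels)  # second loss on the depth-stream prediction
    (loss_rgb + loss_depth).backward()                # shared routing modules would receive both gradients
    optimizer.step()
    return loss_rgb.item(), loss_depth.item()

# One training step on dummy data.
losses = train_step(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224),
                    torch.randint(0, 10, (2,)))
print(losses)
```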
  • Step S104 based on the first prediction result and the second prediction result, determine the classification result of the action to be recognized.
  • the first prediction result and the second prediction result are multimodal dual-stream high-level features output by the trained neural network model.
  • Feature fusion is performed on the dual-stream high-level features to obtain the final output result in the network architecture of the dual-stream neural network model.
  • the final output result is a one-dimensional score vector (probability), and the final classification result is determined according to the highest probability in the score vector; that is, the category corresponding to the highest score is the classification result of the action to be recognized.
  • determining the classification result of the action to be recognized includes:
  • Feature fusion is performed on the first prediction result and the second prediction result to obtain the probability distribution of action categories; the action category with the highest probability in the probability distribution is used as the classification result of the action to be recognized.
  • feature fusion is a calculation process in the network architecture of the dual-stream neural network model, that is, after the dual-stream neural network model obtains the feature information of the RGB-format information stream and the depth information stream, the two are fused, probability mapping is performed after the fusion, and the category judgment is finally made.
  • the final output result is a one-dimensional score vector (probability)
  • for example, if the score vector is a one-dimensional vector containing 10 elements, each element is a probability between 0 and 1, and the sum of the 10 elements is 1; assuming that the second element has the maximum value of 0.3, the classification result of the action to be recognized is determined to be the second category.
  • the feature fusion process can perform the fusion calculation by point multiplication, weighted addition, or taking the maximum value of the two matrices finally output by the network architecture to obtain the final probability distribution, and the type of the action to be recognized is determined according to the category corresponding to the maximum value in the probability distribution.
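  • A minimal NumPy sketch of the fusion-and-argmax step follows (illustrative only; equal weighting of the two streams and the softmax mapping are assumptions):
```python
import numpy as np

def fuse_and_classify(pred_rgb: np.ndarray, pred_depth: np.ndarray,
                      w_rgb: float = 0.5, w_depth: float = 0.5) -> int:
    """Weighted-addition fusion of the two stream score vectors, then argmax."""
    fused = w_rgb * pred_rgb + w_depth * pred_depth
    probs = np.exp(fused) / np.exp(fused).sum()  # map fused scores to a probability distribution
    return int(np.argmax(probs))                 # index of the predicted action category

pred_rgb = np.array([0.1, 0.3, 0.2, 0.4])
pred_depth = np.array([0.2, 0.5, 0.1, 0.2])
print(fuse_and_classify(pred_rgb, pred_depth))   # 1, i.e. the second category
```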
  • the terminal device can obtain the video data of the action to be recognized, the video data including the first video sequence and the second video sequence; the first video sequence and the second video sequence are respectively compressed to obtain the first motion map and the second motion map, which provides a richer spatiotemporal representation of the video data and makes the information representation more complete and the features richer; the first motion map and the second motion map are then used as the input of the dual-stream neural network model, and the interactive learning of the multi-modal image features by the neural network model improves the accuracy of action recognition.
  • FIG. 6 shows a structural block diagram of the motion recognition apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.
  • the device includes:
  • an acquisition unit 61 configured to acquire video data of the action to be identified, the video data including a first video sequence and a second video sequence;
  • a processing unit 62 configured to perform compression processing on the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
  • a computing unit 63 used for inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;
  • An output unit 64 configured to determine a classification result of the to-be-recognized action based on the first prediction result and the second prediction result.
  • the terminal device can obtain the video data of the action to be recognized, the video data including the first video sequence and the second video sequence; the first video sequence and the second video sequence are respectively compressed to obtain the first motion map and the second motion map, which provides a richer spatiotemporal representation of the video data and makes the information representation more complete and the features richer; the first motion map and the second motion map are then used as the input of the dual-stream neural network model, and the interactive learning of the multi-modal image features by the neural network model improves the accuracy of action recognition.
  • FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • the terminal device 7 in this embodiment includes: at least one processor 70 (only one is shown in FIG. 7), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70; when the processor 70 executes the computer program 72, the steps in any of the foregoing embodiments of the action recognition method are implemented.
  • the terminal device 7 may include, but is not limited to, a processor 70 and a memory 71 .
  • FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7, which may include more or fewer components than shown, or combine some components, or have different components; for example, it may also include input and output devices, network access devices, and the like.
  • the so-called processor 70 may be a central processing unit (Central Processing Unit, CPU), and the processor 70 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 71 may be an internal storage unit of the terminal device 7 in some embodiments, such as a hard disk or a memory of the terminal device 7 . In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk equipped on the terminal device 7, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 71 may also include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used to store operating systems, application programs, bootloaders (BootLoader), data, and other programs, such as program codes of the computer programs, and the like. The memory 71 may also be used to temporarily store data that has been output or will be output.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • the embodiments of the present application provide a computer program product, when the computer program product runs on a mobile terminal, the steps in the foregoing method embodiments can be implemented when the mobile terminal executes the computer program product.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the present application realizes all or part of the processes in the methods of the above embodiments, which can be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
  • the computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media.
  • software distribution media For example, U disk, mobile hard disk, disk or CD, etc.
  • computer readable media may not be electrical carrier signals and telecommunications signals.
  • the disclosed apparatus/network device and method may be implemented in other manners.
  • the apparatus/network device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces or through indirect coupling or communication connection between devices or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

Abstract

A motion recognition method based on feature interactive learning, and a terminal device, which are applicable to the technical field of computer vision. The method comprises: acquiring video data of a motion to be recognized, the video data comprising a first video sequence and a second video sequence (S101); respectively performing compression processing on the first video sequence and the second video sequence, so as to obtain a first motion graph and a second motion graph (S102); inputting the first motion graph and the second motion graph into a trained two-stream neural network model, and performing interactive learning on features of the first motion graph and features of the second motion graph by means of the trained two-stream neural network model, so as to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence (S103); and on the basis of the first prediction result and the second prediction result, determining a classification result of the motion to be recognized (S104). According to the method, the problem of low motion recognition accuracy of sparse sampling is solved. By means of interactive learning of multi-modal input features, the accuracy of motion category recognition is improved.

Description

一种基于特征交互学习的动作识别方法及终端设备An action recognition method and terminal device based on feature interactive learning 技术领域technical field
本申请属于计算机视觉技术领域,尤其涉及一种基于特征交互学习的动作识别方法及终端设备。The present application belongs to the technical field of computer vision, and in particular, relates to an action recognition method and terminal device based on feature interaction learning.
背景技术Background technique
近年来,人体动作识别已成为计算机视觉领域的研究热点之一。通过动作识别技术,计算机可以自动理解和描述视频中的人体动作,在诸多领域具有巨大的应用价值,例如:视频监控、人机交互、运动分析、基于内容的视频检索以及自动驾驶等领域。人体动作识别的方法主要包括基于人工设计特征的方法和基于神经网络深度学习特征的方法。In recent years, human action recognition has become one of the research hotspots in the field of computer vision. Through action recognition technology, computers can automatically understand and describe human actions in videos, which has great application value in many fields, such as video surveillance, human-computer interaction, motion analysis, content-based video retrieval, and autonomous driving. The methods of human action recognition mainly include methods based on artificially designed features and methods based on neural network deep learning features.
Compared with traditional methods based on hand-crafted features, methods based on deep features learned by neural networks have achieved some success in human action recognition. However, when existing neural-network-based methods handle action classification for long video sequences, a certain number of video frames are obtained by sparse sampling as the network input, features are extracted from these frames layer by layer, and human actions are then recognized and classified. Because shooting viewpoints, scales, and backgrounds vary widely, and because actions themselves both differ from and resemble one another, such single-modality sparse sampling yields low action recognition accuracy.
SUMMARY OF THE INVENTION
The embodiments of the present application provide an action recognition method and terminal device based on feature interaction learning, which can solve the problem of low action recognition accuracy caused by single-modality sparse sampling.
In a first aspect, an embodiment of the present application provides an action recognition method based on feature interaction learning. The method includes: acquiring video data of an action to be recognized, the video data including a first video sequence and a second video sequence; compressing the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence; inputting the first motion map and the second motion map into a trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence output by the trained dual-stream neural network model; and determining a classification result of the action to be recognized based on the first prediction result and the second prediction result.
In a possible implementation of the first aspect, compressing the first video sequence to obtain the first motion map corresponding to the first video sequence includes:
obtaining a feature matrix of each video frame in the first video sequence; and compressing the feature matrices of the video frames according to their temporal order in the first video sequence to obtain a feature matrix representing the first motion map.
In a possible implementation of the first aspect, compressing the second video sequence to obtain the second motion map corresponding to the second video sequence includes:
performing grayscale processing on the second video sequence to obtain grayscale sequence frames corresponding to the second video sequence; and compressing the feature matrices of the grayscale sequence frames according to the temporal order of the video frames in the second video sequence to obtain a feature matrix representing the second motion map.
In a possible implementation of the first aspect, the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, where the routing module is arranged between an intermediate convolution module of the first neural network model and an intermediate convolution module of the second neural network model. The input of the first neural network model is the first motion map and its output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map and its output is the second prediction result of the second video sequence. The routing module is used to perform interactive learning on the features of the first motion map and the features of the second motion map between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
In a possible implementation of the first aspect, the intermediate convolution module of the first neural network model includes first convolution modules with a preset number of layers, and the intermediate convolution module of the second neural network model includes second convolution modules corresponding to the first convolution modules. Inputting the first motion map and the second motion map into the trained dual-stream neural network model and performing interactive learning on their features through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence includes:
taking the output of the first convolution module of the first layer and the output of the second convolution module of the first layer as the input of the routing module of the first layer, and performing feature interactive learning by the routing module of the first layer to obtain a first routing output; taking the superposition of the output of the first convolution module of the first layer and the first routing output as the input of the first convolution module of the second layer, and performing feature learning by the first convolution module of the second layer to obtain the output of the first convolution module of the second layer; taking the superposition of the output of the second convolution module of the first layer and the first routing output as the input of the second convolution module of the second layer, and performing feature learning by the second convolution module of the second layer to obtain the output of the second convolution module of the second layer; and taking the output of the first convolution module of the second layer and the output of the second convolution module of the second layer as the input of the routing module of the second layer, and performing feature interactive learning by the routing module of the second layer to obtain a second routing output;
where the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the first neural network model; the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the second neural network model; and the routing module of the first layer and the routing module of the second layer are two adjacent computing modules.
In a possible implementation of the first aspect, the routing module includes a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit. Through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit in sequence, the routing module performs interactive learning on the feature matrix output by the convolution computing module of the first neural network model and the feature matrix output by the convolution computing module of the second neural network model to obtain the feature matrix output by the routing module.
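A minimal sketch of such a routing module is given below, assuming a PyTorch implementation. The two convolution units, normalization units, and activation units follow the order described above; the 1×1/3×3 kernel sizes, the concatenation of the two streams' feature maps, and the channel widths are illustrative assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    """Interactive learning between the feature matrices of the two streams."""
    def __init__(self, channels):
        super().__init__()
        self.interact = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),          # first convolution unit
            nn.BatchNorm2d(channels),                                  # first normalization unit
            nn.ReLU(inplace=True),                                     # first activation unit
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),   # second convolution unit
            nn.BatchNorm2d(channels),                                  # second normalization unit
            nn.ReLU(inplace=True),                                     # second activation unit
        )

    def forward(self, feat_a, feat_b):
        # feat_a / feat_b: feature matrices output by the corresponding
        # convolution modules of the first and second neural network models
        return self.interact(torch.cat([feat_a, feat_b], dim=1))
```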
In a possible implementation of the first aspect, determining the classification result of the action to be recognized based on the first prediction result and the second prediction result includes:
performing feature fusion on the first prediction result and the second prediction result to obtain a probability distribution over action categories, and taking the action category with the highest probability in the probability distribution as the classification result of the action to be recognized.
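The fusion step can be sketched as follows. Averaging the two prediction vectors is one assumed fusion choice (the text only specifies "feature fusion"); the argmax then selects the most probable category.

```python
import numpy as np

def classify(first_prediction, second_prediction, class_names):
    # fuse the two streams' prediction vectors into one probability distribution
    probs = (np.asarray(first_prediction) + np.asarray(second_prediction)) / 2.0
    # the category with the highest probability is the classification result
    return class_names[int(np.argmax(probs))]

# example: classify([0.1, 0.7, 0.2], [0.2, 0.6, 0.2], ["walk", "run", "jump"]) -> "run"
```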
In a possible implementation of the first aspect, the first neural network model includes a first loss function and the second neural network model includes a second loss function. The first neural network model, the second neural network model, and the routing module are trained on sample video data, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module are adjusted according to the first loss function and the second loss function respectively. If the first loss function and the second loss function satisfy a preset threshold, the training of the parameters of the first neural network model, the second neural network model, and the routing module is stopped, and the trained dual-stream neural network model is obtained.
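A hedged sketch of this joint training, assuming PyTorch: each stream keeps its own loss, both losses update the two networks and the routing modules through a shared optimizer, and training stops once both losses fall below a preset threshold. The cross-entropy losses, the SGD optimizer, and the concrete threshold are assumptions for illustration, not requirements of the method.

```python
import torch
import torch.nn as nn

def train(dual_stream_model, data_loader, threshold=0.05, max_epochs=100):
    criterion1 = nn.CrossEntropyLoss()   # first loss function (first stream)
    criterion2 = nn.CrossEntropyLoss()   # second loss function (second stream)
    optimizer = torch.optim.SGD(dual_stream_model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(max_epochs):
        for rgb_map, depth_map, labels in data_loader:
            pred1, pred2 = dual_stream_model(rgb_map, depth_map)
            loss1 = criterion1(pred1, labels)
            loss2 = criterion2(pred2, labels)
            optimizer.zero_grad()
            (loss1 + loss2).backward()   # both losses adjust all parameters, including the routing modules
            optimizer.step()
        if loss1.item() < threshold and loss2.item() < threshold:
            break                        # preset threshold satisfied: stop training
    return dual_stream_model
```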
In a second aspect, an embodiment of the present application provides an action recognition apparatus based on feature interaction learning, including:
an acquisition unit, configured to acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
a processing unit, configured to compress the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
a computing unit, configured to input the first motion map and the second motion map into a trained dual-stream neural network model, and to perform interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence; and
an output unit, configured to determine a classification result of the action to be recognized based on the first prediction result and the second prediction result.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method described in the first aspect and its possible implementations.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the method described in the first aspect and its possible implementations.
In a fifth aspect, an embodiment of the present application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the action recognition method described in any one of the first aspect.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the relevant description of the first aspect, which is not repeated here.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: through the embodiments of the present application, a terminal device can acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence; compress the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence; input the first motion map and the second motion map into a trained dual-stream neural network model, which performs interactive learning on the features of the first motion map and the features of the second motion map to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence; and determine the classification result of the action to be recognized based on the first prediction result and the second prediction result. Compressing the first video sequence and the second video sequence into the first motion map and the second motion map gives a richer spatio-temporal representation of the video data, so that the information is more complete and the features are richer; taking the first motion map and the second motion map as the inputs of the dual-stream neural network model and interactively learning the multi-modal image features through the neural network model improves the accuracy of action category recognition, and the method has strong ease of use and practicality.
Description of Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of video data compression processing provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the network architecture of a dual-stream neural network model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the architecture of a routing module of a dual-stream neural network provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the architecture of a mid-level feature interaction learning unit of a dual-stream neural network provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an action recognition apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in the specification and the appended claims of the present application, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in the specification and the appended claims of the present application, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", and the like are only used to distinguish the descriptions and should not be construed as indicating or implying relative importance.
References to "one embodiment" or "some embodiments" in the specification of the present application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in other embodiments", "in still other embodiments", and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variants mean "including but not limited to", unless otherwise specifically emphasized.
At present, computer vision tasks mainly use models based on convolutional neural networks, such as two-dimensional convolutional networks (2D-ConvNets), three-dimensional convolutional networks (3D-ConvNets), and recurrent neural networks (RNNs). The feature learning ability and action recognition performance of convolutional neural networks have achieved certain results, but many challenges remain in the processing and recognition of action categories in long video sequences.
In a human action recognition method based on a convolutional neural network, given a number of RGB or depth video sequences, a certain number of video frames are obtained by sparse sampling as the network input; the convolutional neural network extracts features from the video frames layer by layer, and human actions are classified and recognized through a classifier or a normalization (Softmax) layer.
Human action recognition methods based on convolutional neural networks can be divided into the following two categories. First, 2D end-to-end network training structures: a deep network is trained with supervision on a large-scale labeled dataset, and a training model for the actual task is then obtained by fine-tuning the parameters. For video sequences, such methods mainly use sparse sampling to take a certain frame of the entire video sequence as the network input, and therefore cannot learn the temporal features of human actions well. Second, 3D end-to-end network training structures: several frames are obtained by sparse sampling as the input of the network model, and a classification model is obtained through supervised training and parameter fine-tuning. This approach can achieve good recognition performance, but its huge computational cost restricts its application in practical scenarios.
In addition, among human action recognition methods based on other deep networks, recurrent neural networks (RNNs) and similar architectures have also been applied to human action recognition. An RNN memorizes previous information and applies it to the computation of the current output; it can process sequence data of any length and realizes feature learning, recognition, and classification through ordered recurrent learning of the input sequence. RNNs have been applied successfully in natural language processing, but their performance on human action recognition still needs further improvement.
In summary, existing action recognition methods in computer vision that are based on convolutional neural networks or other deep networks lack a representation of the multi-modal spatio-temporal information of long video sequences and mutual learning between multi-modal features, so the accuracy of action recognition still needs to be improved.
Based on the representation of multi-modal spatio-temporal information of long video sequences and mutual learning between multi-modal features, the present application realizes the processing and recognition of user action categories and further improves the accuracy of user action recognition. The technical solutions of the present application are described in detail below with reference to the drawings and specific embodiments.
Please refer to FIG. 1, which is a schematic flowchart of the action recognition method provided by an embodiment of the present application. The method may be executed by an independent terminal, for example a terminal device such as a mobile phone, a computer, a multimedia device, a streaming media device, or a monitoring device; it may also be an integrated module in a terminal device, implemented as a function of the terminal device. The following takes execution of the method by a terminal device as an example, but the embodiments of the present application are not limited thereto. As shown in FIG. 1, the method includes:
Step S101: acquiring video data of an action to be recognized, the video data including a first video sequence and a second video sequence.
In some embodiments, the video data is a continuous multi-frame image sequence combined in temporal order, i.e., the sequence of all image frames of the entire video of the action to be recognized. The terminal device may acquire the video data of the action to be recognized in real time through an RGB-D camera; the video data of the action to be recognized may also be video data pre-stored in the terminal device. The first video sequence and the second video sequence are video frame sequences of two different modalities, i.e., different feature representations of the same piece of video data; for example, they may respectively be a color video sequence in RGB format, a video sequence represented by depth information, a video sequence in optical-flow format, a skeleton sequence, or the like.
In the following, a detailed description is given by taking the first video sequence being a color video sequence in RGB format and the second video sequence being a video sequence represented by depth information as an example. A color video sequence is a multi-frame image sequence in RGB format, i.e., a color image sequence in which the pixel information of each frame is represented by the three colors red, green, and blue; a video sequence represented by depth information is a depth image sequence in which the pixel information of each frame is represented by depth values, and the image depth of each frame determines the possible number of colors or possible gray levels of each pixel of the image.
It should be noted that, by setting the shooting parameters of the camera, the acquired sequence frames of the first video sequence and the second video sequence can correspond to each other one-to-one in temporal order, i.e., each video frame of the first video sequence corresponds to a video frame of the second video sequence at the same moment. For example, if the camera is set to capture 20 frames per second, the first video sequence and the second video sequence of the same piece of video can include the same number of video frames, so that the spatio-temporal information of the video frames at the same moment can be represented by different feature quantities.
The spatio-temporal information includes temporal-dimension information and spatial-dimension information of the video frame sequence. The temporal-dimension information is reflected in that each time point corresponds to one video frame, and the continuity of time makes the continuous sequence of video frames produce a dynamic effect. The spatial-dimension information can be expressed as the texture information or color information of each video frame. For example, a video frame of a color video sequence in RGB format is represented as a 3-channel × width W × height H matrix, and the elements of the three channels represent the color information of each pixel in the frame; for a video frame of a video sequence represented by depth information, the depth information is measured in length units (such as millimeters), and, to facilitate computer processing, the distance information representing depth is converted to grayscale information, yielding a grayscale image represented as a 1-channel × width W × height H matrix.
In addition, the sequence frames of the first video sequence and the sequence frames of the second video sequence correspond one-to-one in temporal order and represent the same piece of video data. For example, by setting the shooting parameters of the camera, within the same shooting time the first video sequence includes 50 frames and the second video sequence also includes 50 frames.
In some embodiments, if the first video sequence is color video data in RGB format, it may be acquired by an RGB camera; if the second video sequence is a video sequence represented by depth information, it may be acquired by a depth camera. The shooting parameters of the two cameras may be the same, and the two cameras capture the same target during the same period of time, which is not specifically limited here.
The terminal device may be a device integrated with the camera, in which case the terminal device directly acquires the video data of the action to be recognized through the camera; the terminal device may also be a device separate from the camera, in which case the terminal device communicates with the camera in a wired or wireless manner to acquire the video data of the action to be recognized. The action to be recognized may be a human behavior or activity, or an animal behavior or activity, which is not specifically limited.
In the above embodiments, the terminal device acquires the image frame sequence of the entire video data of the action to be recognized, records the multi-modal low-level features of the video data, and makes good use of the spatio-temporal information of the two different modalities of the first video sequence and the second video sequence, which provides a basis for the various possibilities of feature learning by the subsequent neural network model and enhances the neural network model's ability to express and recognize image features.
Step S102: compressing the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence.
In some embodiments, in order to perform feature learning on the entire video data of the action to be recognized, the terminal device compresses the multi-frame images of the first video sequence and the multi-frame images of the second video sequence respectively to obtain a first motion map and a second motion map containing rich spatio-temporal information. The feature representation of the first motion map is different from that of the second motion map; they are representations of different low-level features of the same piece of video, i.e., the image features of the video frames in the first video sequence and in the second video sequence are represented by different kinds of image information.
The spatio-temporal information of the first motion map contains the spatio-temporal information of all video frames of the first video sequence, and the spatio-temporal information of the second motion map contains the spatio-temporal information of all video frames of the second video sequence. For example, the temporal-dimension and spatial-dimension information of an RGB video sequence and of a depth video sequence are compressed into a single three-channel image and a single-channel image respectively, presenting a dynamic effect together with color, texture, and other information.
In the actual computation, each video frame of the first video sequence corresponds to a feature matrix, and each video frame of the second video sequence corresponds to a feature matrix. For example, if the first video sequence or the second video sequence includes T frames and the feature matrix corresponding to each frame is I_t, the set of feature matrices of the first or second video sequence can be expressed as <I_1, I_2, I_3, ..., I_T>, where I_1 is the feature matrix of the first frame in temporal order, and so on, and I_T is the feature matrix of the T-th frame in temporal order.
In some embodiments, the first video sequence and the second video sequence are compressed separately, and the multiple frames of a video sequence are compressed and synthesized into one image. This image contains feature information that represents the action through time and space and may be called a motion map, so that a pair consisting of a first motion map and a second motion map containing the spatio-temporal information of the entire video sequence is obtained; that is, the feature matrices of the multiple frames are merged into one image for representation, so that the features of all video frames in the video sequence can be captured.
Exemplarily, the first motion map may be an image synthesized by compressing the frames of a video sequence in RGB format, and the second motion map may be an image synthesized by compressing a video sequence represented by depth information; the first motion map and the second motion map may also be images obtained by separately compressing one-to-one corresponding video sequences of other modalities.
In some embodiments, compressing the first video sequence to obtain the first motion map corresponding to the first video sequence includes:
A1. obtaining the feature matrix of each video frame in the first video sequence;
A2. compressing the feature matrices according to the temporal order of the video frames in the first video sequence to obtain a feature matrix representing the first motion map.
In some embodiments, the first video sequence includes multiple frames, each corresponding to a feature matrix. If the first video sequence is color video data in RGB format, the feature matrix of each frame is a 3-channel × width W × height H matrix, where the width W and height H are in pixels and the elements of the feature matrix correspond to pixels. The value of each element of the feature matrix represents the feature of the pixel at the corresponding position; for example, for a color image in RGB format, each element represents the feature value of a pixel in one of the three channels red R, green G, and blue B.
In some embodiments, each frame of the first video sequence corresponds to a feature matrix. The elements at the same position in the feature matrices of all video frames are summed and divided by the total number of frames of the first video sequence to obtain the element value at each position of the feature matrix, and each element value is rounded (for example, 2.6 rounded down gives 2), yielding the feature matrix of the first motion map corresponding to the first video sequence.
As shown in FIG. 2, which is a schematic diagram of the video data compression processing provided by an embodiment of the present application, when the video sequence is color video data in RGB format, the RGB video sequence is compressed to obtain the corresponding RGB motion map, and the spatio-temporal information of the multiple frames is synthesized into the spatio-temporal information of one motion map. The feature matrix of the motion map corresponding to the RGB video sequence may be a 3×W×H matrix and can be calculated by the following formula:
MI = (1/T) · Σ_{τ=1}^{T} I_τ
where MI is the feature matrix of the motion map corresponding to the first video sequence, T is the total number of frames of the first video sequence, I_τ is the feature matrix of the τ-th frame of the first video sequence, and τ is an integer in the range [1, T].
In addition, the elements of the feature matrix of each frame of the first video sequence may be integers in the range [0, 255], and each element of the feature matrix of the motion map MI obtained after compressing the first video sequence is also an integer in the range [0, 255].
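A minimal numpy sketch of this compression step follows: the per-frame feature matrices are averaged over the whole sequence and rounded down, so the motion map keeps the same 3×W×H shape and [0, 255] integer range as the input frames. The function name and the floor rounding are illustrative choices consistent with the rounding example above.

```python
import numpy as np

def compress_rgb_sequence(frames):
    """frames: sequence of T feature matrices, each of shape (3, W, H), values in [0, 255]."""
    stack = np.asarray(frames, dtype=np.float64)          # shape (T, 3, W, H)
    motion_map = np.floor(stack.sum(axis=0) / stack.shape[0])  # MI = (1/T) * sum of I_tau, rounded down
    return motion_map.astype(np.uint8)                    # feature matrix of the first motion map
```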
In some embodiments, compressing the second video sequence to obtain the second motion map corresponding to the second video sequence includes:
B1. performing grayscale processing on the second video sequence to obtain grayscale sequence frames corresponding to the second video sequence;
B2. compressing the feature matrices of the grayscale sequence frames according to their temporal order to obtain a feature matrix representing the second motion map.
In some embodiments, the second video sequence includes multiple frames, each corresponding to a feature matrix. If the second video sequence is an image sequence in which each video frame is represented by depth information, the feature matrix of each frame is a 1-channel × width W × height H matrix, where the width W and height H are in pixels and the elements of the feature matrix correspond to pixels. The value of each element of the feature matrix represents the feature of the pixel at the corresponding position. Since the second video sequence is an image sequence represented by depth information, each depth frame of the second video sequence can be converted to grayscale by mapping the depth information of each pixel to [0, 255], yielding a grayscale image of the video frame whose feature matrix elements are integers in the range [0, 255].
Exemplarily, the values of a video sequence represented by depth information may range from 0 to 10000 mm, while the representation range of an image in computer vision is [0, 255], so the depth video sequence needs to be scaled to a value range that matches the visual representation, i.e., mapped to a grayscale image. There are many ways to perform the scaling. Suppose the video sequence represented by depth information is a 1×W×H matrix and the difference between the maximum and minimum values of all elements is max−min; the elements of the matrix of each depth image in the video sequence are scaled and rounded. For example, assuming maximum depth value max − minimum depth value min = 10000 and a certain element has the value 7580, the corresponding value after the operation is (7580/10000)×255 = 193.29, which is rounded to 193, i.e., the corresponding element value is 193, realizing the conversion to a grayscale image.
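A small sketch of this scaling, assuming (as in the worked example) that the minimum depth is 0 and the depth range is 10000 mm; the clipping is an added safeguard for values outside the assumed range.

```python
import numpy as np

def depth_to_gray(depth_frame, depth_range=10000.0):
    """depth_frame: (W, H) array of depth values in millimetres."""
    gray = np.floor(depth_frame / depth_range * 255.0)    # e.g. 7580 -> 193
    return np.clip(gray, 0, 255).astype(np.uint8)         # grayscale feature matrix in [0, 255]
```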
In some embodiments, the compression of the second video sequence is similar to that of the first video sequence. After each frame of the second video sequence is converted to grayscale, the feature matrix of the grayscale image is obtained; the elements at the same position in the feature matrices of the grayscale images corresponding to all video frames of the second video sequence are summed and divided by the total number of frames of the second video sequence to obtain the element value at each position, and each element value is rounded to obtain the feature matrix of the motion map corresponding to the second video sequence.
As shown in FIG. 2, which is a schematic diagram of the video data compression processing provided by an embodiment of the present application, when the video sequence is a video sequence represented by depth information, the depth video sequence is converted to grayscale to obtain the grayscale images corresponding to the depth video sequence, the grayscale images are compressed to obtain the corresponding depth motion map, and the spatio-temporal information of the multiple frames is synthesized into the spatio-temporal information of one motion map. The feature matrix of the motion map corresponding to the depth video sequence may be a 1×W×H matrix and can be calculated by the following formula:
MJ = (1/N) · Σ_{n=1}^{N} I_n
where MJ is the feature matrix of the motion map corresponding to the second video sequence, N is the total number of frames of the second video sequence, I_n is the feature matrix of the n-th frame of the second video sequence, and n is an integer in the range [1, N]. N and T may be equal, and n and τ may take equal values, i.e., the video frames of the first video sequence and the video frames of the second video sequence correspond one-to-one in temporal order.
In addition, the elements of the feature matrix of each grayscale frame corresponding to the second video sequence may be integers in the range [0, 255], and each element of the feature matrix of the motion map MJ corresponding to the second video sequence may be an integer in the range [0, 255].
It should be noted that the video frames of the first video sequence in RGB format and the video frames of the second video sequence represented by depth information may correspond one-to-one. The grayscale image sequence obtained by converting the video frames of the second video sequence to grayscale also corresponds one-to-one to the video frames of the first video sequence in RGB format.
步骤S103,将第一运动图和第二运动图输入训练后的双流神经网络模型,通过训练后的双流神经网络模型对第一运动图的特征和第二运动图的特征进行交互学习,得到第一视频序列的第一预测结果和第二视频序列的第二预测结果。Step S103, inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the second motion map through the trained dual-stream neural network model to obtain the first motion map. A first predictor for a video sequence and a second predictor for a second video sequence.
在一些实施例中,双流神经网络模型为包括两路独立的卷积神经网络模型以及路由模块的整体模型。双流神经网络模型包括两路输入和两路输出。其中,两路输入分别对应视频数据两种模态的特征信息,两路输出分别对应两种模态输入信息的预测结果。In some embodiments, the dual-stream neural network model is an overall model including two independent convolutional neural network models and a routing module. The dual-stream neural network model includes two inputs and two outputs. Among them, the two inputs correspond to the feature information of the two modalities of the video data respectively, and the two outputs correspond to the prediction results of the input information of the two modalities respectively.
如图3所示,本申请实施例提供的双流神经网络模型的网络架构示意图,双流神经网络模型包括两路独立的卷积神经网络模型和路由模块,两路卷积神经网络模型的的输入分别为第一运动图和第二运动图;每一路的卷积神经网络模型包括多个卷积层,例如卷积模块Conv1、卷积模块Conv2_x、卷积模块Conv5_x以及全连接层,其中卷积模块Conv2_x、卷积模块Conv5_x分别表示一个总的卷积模块,一个总的卷积模块可以包括若干数量的卷积层或卷积计算单元。两路卷积神经网络模型的每个卷积模块之后通过路由模块对上一模块的输出结果进行交互学习,路由模块的输出作为和上一卷积模块的输出叠加,作为下一卷积模块的输入,通过路由模块学习双流神经网络模型中不同模态的中层交互特征。As shown in FIG. 3 , a schematic diagram of the network architecture of the dual-stream neural network model provided in the embodiment of the present application, the dual-stream neural network model includes two independent convolutional neural network models and routing modules, and the inputs of the two convolutional neural network models are respectively are the first motion map and the second motion map; the convolutional neural network model of each channel includes multiple convolutional layers, such as the convolutional module Conv1, the convolutional module Conv2_x, the convolutional module Conv5_x and the fully connected layer, wherein the convolutional module Conv2_x and convolution module Conv5_x respectively represent a total convolution module, and a total convolution module may include a number of convolution layers or convolution calculation units. After each convolution module of the two-way convolutional neural network model, the output of the previous module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the output of the next convolution module. Input, the mid-level interaction features of different modalities in the dual-stream neural network model are learned through the routing module.
其中,两路卷积神经网络模型的基础网络可以为残差网络(ResNet),由于残差网络的高度模块化,可以将残差网络中的各个模块作为基础模块对第一运动图和第二运动图不同模态的特征信息进行模型训练和特征的交互学习。双流神经网络模型通过双损失函数进行模型的优化和训练。Among them, the basic network of the two-way convolutional neural network model can be a residual network (ResNet). Due to the high modularity of the residual network, each module in the residual network can be used as the basic module for the first motion map and the second. The feature information of different modalities of the motion map is used for model training and interactive learning of features. The dual-stream neural network model optimizes and trains the model through dual loss functions.
示例性的,双流神经网络模型的基础网络模型可以为Inception、ImageNet、TSN以及双流网络等深度网络模型;通过微调对基础网络模型的参数进行训练及调整;还可以根据需要设计网络模型进行参数的训练集调整。通过双流神经网络模型对不同模态的运动图像的特征学习后,通过双损失函数进行联合优化训练,得到与输入的不同模态的图像特征对应模态的双流高层特征;如输入的模态为RGB格式的运动图像和深度信息表示的运动图像,则可以得RGB格式及深度信息两种模态的双流高层特征。Exemplarily, the basic network model of the dual-stream neural network model can be a deep network model such as Inception, ImageNet, TSN, and dual-stream network; the parameters of the basic network model are trained and adjusted by fine-tuning; the network model can also be designed as needed to adjust the parameters. Training set adjustment. After learning the features of moving images of different modalities through the dual-stream neural network model, the joint optimization training is performed through the dual-loss function to obtain the dual-stream high-level features of the modalities corresponding to the input image features of different modalities; for example, the input modalities are For moving images in RGB format and moving images represented by depth information, dual-stream high-level features of two modalities of RGB format and depth information can be obtained.
在一些实施例中,两路输入可以包括多个通道输入;例如若其中一路输入为RGB运动图,则该路输入可以包括三个通道的输入,分别对应输入RGB运动图的红色R通道的特征矩阵、绿色G通道的特征矩阵以及蓝色B通道的特征矩阵。In some embodiments, the two inputs may include multiple channel inputs; for example, if one of the inputs is an RGB motion image, the input may include three channel inputs, corresponding to the features of the red R channel of the input RGB motion image respectively. matrix, feature matrix for the green G channel, and feature matrix for the blue B channel.
In some embodiments, the trained dual-stream neural network model includes a first neural network model, a second neural network model, and routing modules, the routing modules being disposed between the intermediate convolution modules of the first neural network model and the intermediate convolution modules of the second neural network model. The input of the first neural network model is the first motion map and its output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map and its output is the second prediction result of the second video sequence. The routing modules are used, between the intermediate convolution modules of the first neural network model and the intermediate convolution modules of the second neural network model, to perform interactive learning on the output features of each layer of convolution modules of the dual-stream neural network model.
As shown in FIG. 4, which is an architecture diagram of the dual-stream neural network model provided in an embodiment of the present application, the first neural network model corresponds to one input and output, and the second neural network model corresponds to the other input and output. The first motion map input into the first neural network model may be an RGB motion map; the first prediction result output by the first neural network model is the recognition result corresponding to the first video sequence, and the first video sequence may be an RGB video sequence in RGB format; the RGB motion map is obtained by compressing the RGB video sequence. The second motion map input into the second neural network model may be a depth motion map; the second prediction result output by the second neural network model is the recognition result corresponding to the second video sequence, and the second video sequence may be a depth video sequence represented by depth information; the depth motion map is obtained by compressing the depth video sequence represented by depth information.
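The exact compression of a video sequence into a motion map is defined elsewhere in the application; purely as an assumption-laden illustration, the sketch below shows one possible temporally ordered compression, a linearly weighted accumulation of frames, applied to an RGB sequence and a depth sequence.

```python
# Sketch only: one possible temporally ordered compression of a video sequence
# into a single motion map (weighted accumulation of frames). The weighting
# scheme is an illustrative assumption, not the method defined by the embodiment.
import torch

def compress_to_motion_map(frames: torch.Tensor) -> torch.Tensor:
    """frames: tensor of shape (T, C, H, W), ordered by time."""
    t = frames.shape[0]
    weights = torch.arange(1, t + 1, dtype=frames.dtype) / t  # later frames weigh more
    return (weights.view(t, 1, 1, 1) * frames).sum(dim=0)

rgb_sequence = torch.rand(16, 3, 224, 224)    # assumed 16-frame RGB sequence
depth_sequence = torch.rand(16, 1, 224, 224)  # assumed 16-frame depth sequence
first_motion_map = compress_to_motion_map(rgb_sequence)
second_motion_map = compress_to_motion_map(depth_sequence)
print(first_motion_map.shape, second_motion_map.shape)  # (3,224,224) (1,224,224)
```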
The middle layers of the dual-stream neural network include multiple convolution modules and multiple routing modules, such as the convolution module Conv1, the convolution module Conv2_x, and the convolution module Conv5_x shown in FIG. 4. A routing module is arranged after each convolution module of the two convolutional neural network models, and the output of the preceding module is interactively learned through the routing module; the output of the routing module is superimposed on the output of the preceding convolution module and used as the input of the next convolution module, so that the routing modules learn the mid-level interaction features of the different modalities in the dual-stream neural network model.
In some embodiments, the intermediate convolution modules of the first neural network model include first convolution modules with a preset number of layers, and the intermediate convolution modules of the second neural network model include second convolution modules corresponding to the first convolution modules.
As shown in FIG. 4, inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, includes:
C1. Using the output of the first convolution module of the first layer and the output of the second convolution module of the first layer as the input of the routing module of the first layer, and performing feature interactive learning by the routing module of the first layer to obtain a first routing output;
C2. Using the superposition of the output of the first convolution module of the first layer and the first routing output as the input of the first convolution module of the second layer, and performing feature learning by the first convolution module of the second layer to obtain the output of the first convolution module of the second layer;
C3. Using the superposition of the output of the second convolution module of the first layer and the first routing output as the input of the second convolution module of the second layer, and performing feature learning by the second convolution module of the second layer to obtain the output of the second convolution module of the second layer;
C4. Using the output of the first convolution module of the second layer and the output of the second convolution module of the second layer as the input of the routing module of the second layer, and performing feature interactive learning by the routing module of the second layer to obtain a second routing output.
Here, the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolutional layers in the intermediate convolution modules of the first neural network model; the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolutional layers in the intermediate convolution modules of the second neural network model; and the routing module of the first layer and the routing module of the second layer are two adjacent computation modules. A sketch of this interleaved forward pass is given below.
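The following sketch assumes each convolution module and routing module is an nn.Module, that the routing module consumes the concatenation of the two streams' outputs, and that the routing output is channel-compatible with both streams; the module definitions, feature widths, and spatial sizes are placeholders.

```python
# Sketch only: the interleaved forward pass of steps C1-C4.
import torch
import torch.nn as nn

channels = 64  # assumed feature width, kept constant for simplicity

def conv_block(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

conv1_s1, conv2_s1 = conv_block(channels), conv_block(channels)  # first stream, layers 1 and 2
conv1_s2, conv2_s2 = conv_block(channels), conv_block(channels)  # second stream, layers 1 and 2
route1 = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU())
route2 = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU())

x1 = torch.randn(2, channels, 56, 56)  # features of the first motion map
x2 = torch.randn(2, channels, 56, 56)  # features of the second motion map

o1, o2 = conv1_s1(x1), conv1_s2(x2)            # outputs of the first-layer conv modules
r1 = route1(torch.cat([o1, o2], dim=1))        # C1: first routing output
p1, p2 = conv2_s1(o1 + r1), conv2_s2(o2 + r1)  # C2, C3: superposition feeds the second layer
r2 = route2(torch.cat([p1, p2], dim=1))        # C4: second routing output
print(p1.shape, p2.shape, r2.shape)
```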
In some embodiments, one convolution module includes a number of convolutional layers or convolution computation units. A convolutional layer can be regarded as a group of parallel feature maps, formed by sliding different convolution kernels over the input image and performing certain operations; at each sliding position, an element-wise product-and-sum operation is performed between the convolution kernel and the input image, so that the information within the receptive field is projected onto one element of the feature map. The size of the convolution kernel is smaller than the size of the input image, and the kernel acts on the input image in an overlapping or parallel manner; every element in the feature map output by each layer of the convolution modules in the middle of the dual-stream neural network model is computed by a convolution kernel.
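The element-wise product-and-sum described above can be checked with a tiny example; the 3x3 input and 2x2 kernel below are arbitrary values chosen only for illustration.

```python
# Sketch only: convolution as a sliding element-wise product-and-sum.
import torch
import torch.nn.functional as F

image = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)  # 3x3 input
kernel = torch.tensor([[[[1.0, 0.0], [0.0, -1.0]]]])              # 2x2 kernel

out = F.conv2d(image, kernel)  # 2x2 feature map
# Manual check of the top-left element: receptive field [[0,1],[3,4]]
manual = (image[0, 0, :2, :2] * kernel[0, 0]).sum()
print(out)
print(manual == out[0, 0, 0, 0])  # True
```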
In addition, the dual-stream neural network model further includes fully connected layers, a first loss function, and a second loss function. As shown in FIG. 4, the features output by the convolution module Conv5_x serve as the input of one fully connected layer, and the output features of the routing module of the last layer serve as the input of another fully connected layer; the results of the two fully connected layers are added as the output of the overall fully connected layer, yielding the first prediction result and the second prediction result.
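One plausible reading of the two fully connected branches whose results are added is sketched below for a single stream; the feature widths and the global pooling step are assumptions.

```python
# Sketch only: two fully connected branches (Conv5_x features and the last
# routing output) whose results are added to form one stream's prediction.
import torch
import torch.nn as nn

num_classes = 10                           # assumed
conv5_channels, route_channels = 512, 512  # assumed feature widths

pool = nn.AdaptiveAvgPool2d(1)             # assumed global pooling before the FC layers
fc_conv5 = nn.Linear(conv5_channels, num_classes)
fc_route = nn.Linear(route_channels, num_classes)

conv5_feat = torch.randn(2, conv5_channels, 7, 7)  # output of Conv5_x
route_feat = torch.randn(2, route_channels, 7, 7)  # output of the last routing module

logits = fc_conv5(pool(conv5_feat).flatten(1)) + fc_route(pool(route_feat).flatten(1))
print(logits.shape)  # (2, num_classes)
```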
In some embodiments, the routing module includes a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit. Through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit of the routing module, interactive learning is performed in sequence on the feature matrix output by the convolution computation module of the first neural network model and the feature matrix output by the convolution computation module of the second neural network model, to obtain the feature matrix output by the routing module.
As shown in FIG. 5, which is a schematic diagram of the architecture of the routing module provided in an embodiment of the present application, the routing module includes two convolution units, two normalization units, and two activation units, which may respectively be a first convolution unit Conv1D, a first normalization unit Batch Normalization, a first activation unit ReLU, a second convolution unit Conv1D, a second normalization unit Batch Normalization, and a second activation unit ReLU. The outputs of the two convolution modules at each layer of the intermediate convolution modules of the dual-stream neural network model are used as the input of the corresponding routing module, and the output of each layer's routing module is used as the input of the next layer's convolution module or of the fully connected layer. The routing module may be a computing unit based on 1x1 convolution; after the outputs of the two convolution modules of the previous layer are learned and redirected by the 1x1 convolution, they are output to the convolution modules of the subsequent layer. The outputs of the two convolution modules may be information flows of multi-modal image features, such as an information flow in RGB format and an information flow of depth image features.
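A sketch of a routing module built from the listed units (convolution, batch normalization, ReLU, applied twice) is shown below. Reading the figure's Conv1D units as 1x1 two-dimensional convolutions over the concatenated two-stream feature maps, and the channel sizes, are assumptions.

```python
# Sketch only: a routing module of Conv -> BatchNorm -> ReLU, twice,
# using 1x1 convolutions as the computing units.
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(2 * in_channels, out_channels, kernel_size=1),  # first convolution unit
            nn.BatchNorm2d(out_channels),                             # first normalization unit
            nn.ReLU(inplace=True),                                    # first activation unit
            nn.Conv2d(out_channels, out_channels, kernel_size=1),     # second convolution unit
            nn.BatchNorm2d(out_channels),                             # second normalization unit
            nn.ReLU(inplace=True),                                    # second activation unit
        )

    def forward(self, feat_stream1: torch.Tensor, feat_stream2: torch.Tensor) -> torch.Tensor:
        # Assumption: the two streams' feature matrices are concatenated on the channel axis.
        return self.block(torch.cat([feat_stream1, feat_stream2], dim=1))

route = RoutingModule(64, 64)
out = route(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
print(out.shape)  # (2, 64, 56, 56)
```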
In some embodiments, the first neural network model includes a first loss function and the second neural network model includes a second loss function. The first neural network model, the second neural network model, and the routing modules are trained through video sample data, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing modules are adjusted respectively according to the first loss function and the second loss function; if the first loss function and the second loss function satisfy a preset threshold, training of the parameters of the first neural network model, the parameters of the second neural network model, and the routing modules is stopped, and the trained dual-stream neural network model is obtained.
In some embodiments, the dual-stream neural network model is optimized and trained through the two loss functions. The parameters of the convolutional neural network of the first stream are trained and adjusted through the first loss function according to the output of its fully connected layer; the parameters of the convolutional neural network of the second stream are trained and adjusted through the second loss function according to the output of its fully connected layer; and the parameters of the routing modules are trained and adjusted through both the first loss function and the second loss function. A compact sketch of this dual-loss optimization follows.
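The sketch below assumes cross-entropy losses and a single optimizer over both streams and the routing parameters; the linear modules are placeholders for the real convolutional models, used only to show that both losses back-propagate through the routing parameters.

```python
# Sketch only: joint training with two loss functions. Both losses reach the
# routing parameters, so they are adjusted by both streams.
import torch
import torch.nn as nn

num_classes = 10
stream1 = nn.Linear(128, num_classes)  # placeholder for the first neural network model
stream2 = nn.Linear(128, num_classes)  # placeholder for the second neural network model
routing = nn.Linear(128, 128)          # placeholder for the routing modules

criterion1, criterion2 = nn.CrossEntropyLoss(), nn.CrossEntropyLoss()
params = list(stream1.parameters()) + list(stream2.parameters()) + list(routing.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

feat1, feat2 = torch.randn(8, 128), torch.randn(8, 128)  # sample features per stream
labels = torch.randint(0, num_classes, (8,))

shared = routing(feat1 + feat2)                        # toy interaction between the streams
loss1 = criterion1(stream1(feat1 + shared), labels)    # first loss function
loss2 = criterion2(stream2(feat2 + shared), labels)    # second loss function

optimizer.zero_grad()
(loss1 + loss2).backward()  # both losses adjust the routing parameters
optimizer.step()
```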
Step S104: based on the first prediction result and the second prediction result, determine the classification result of the action to be recognized.
In some embodiments, the first prediction result and the second prediction result are multi-modal dual-stream high-level features output by the trained neural network model. Feature fusion is performed on the dual-stream high-level features to obtain the final output of the network architecture of the dual-stream neural network model. The final output is a one-dimensional score vector (a probability distribution), and the final classification result is determined by the highest probability in the score vector; that is, the category with the highest score is the classification result of the action to be recognized.
In some embodiments, determining the classification result of the action to be recognized based on the first prediction result and the second prediction result includes:
D1. Performing feature fusion on the first prediction result and the second prediction result to obtain a probability distribution over action categories;
D2. Using the action category with the highest probability in the probability distribution as the classification result of the action to be recognized.
In some embodiments, feature fusion is a computation step in the network architecture of the dual-stream neural network model: after the dual-stream neural network model has obtained the feature information of the RGB-format stream and of the depth stream, the two are fused, the fused result is mapped to probabilities, and the category is finally judged. For example, the final output is a one-dimensional score vector (probabilities) containing 10 elements, each element a probability between 0 and 1 and the 10 elements summing to 1; if the second element is the largest, say 0.3, the classification result of the action to be recognized is determined to be the second category.
The feature fusion may be computed by fusing the two matrices finally output by the network architecture through element-wise (dot) multiplication, weighted addition, or taking the maximum, to obtain the final probability distribution; the type of the action to be recognized is determined by the category corresponding to the maximum value in the probability distribution, as sketched below.
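The sketch below fuses the two prediction results by weighted addition and takes the most probable action class; weighted addition followed by softmax is only one of the fusion options mentioned above, and the class count is an assumption.

```python
# Sketch only: fusing the two prediction results into a probability distribution
# and taking the most probable action class.
import torch
import torch.nn.functional as F

num_classes = 10
pred1 = torch.randn(1, num_classes)  # first prediction result (one stream)
pred2 = torch.randn(1, num_classes)  # second prediction result (other stream)

fused = 0.5 * pred1 + 0.5 * pred2           # weighted addition; dot product or max also possible
probabilities = F.softmax(fused, dim=1)     # D1: one-dimensional score vector summing to 1
action_class = probabilities.argmax(dim=1)  # D2: class with the highest probability
print(probabilities, action_class)
```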
Through the embodiments of the present application, a terminal device can acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence; the first video sequence and the second video sequence are compressed to obtain a first motion map and a second motion map respectively, giving a richer spatio-temporal representation of the video data so that the information is more complete and the features are richer. The first motion map and the second motion map are then used as inputs of the dual-stream neural network model, and the neural network model performs interactive learning on the multi-modal image features, which improves the accuracy of action recognition.
It should be understood that the above embodiments are only used to illustrate the technical solutions of the present application and are not intended to limit them. Modifications to the technical solutions recorded in the foregoing embodiments, or equivalent replacements of some of their technical features (for example, increasing the dimensionality of the model, adding features of video sequences of multiple modalities as model inputs, or modifying the dual-stream neural network model into multiple independent convolutional neural network models and routing modules that perform interactive learning on the features of video sequences of multiple modalities) belong to similar inventive concepts; they do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the protection scope of the present application.
It should also be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the action recognition method described in the above embodiments, FIG. 6 shows a structural block diagram of the action recognition apparatus provided by an embodiment of the present application. For convenience of description, only the parts related to the embodiment of the present application are shown.
Referring to FIG. 6, the apparatus includes:
an acquisition unit 61, configured to acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
a processing unit 62, configured to compress the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
a computing unit 63, configured to input the first motion map and the second motion map into a trained dual-stream neural network model, and perform interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence;
an output unit 64, configured to determine a classification result of the action to be recognized based on the first prediction result and the second prediction result.
Through the embodiments of the present application, a terminal device can acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence; the first video sequence and the second video sequence are compressed to obtain a first motion map and a second motion map respectively, giving a richer spatio-temporal representation of the video data so that the information is more complete and the features are richer. The first motion map and the second motion map are then used as inputs of the dual-stream neural network model, and the neural network model performs interactive learning on the multi-modal image features, which improves the accuracy of action recognition.
It should be noted that, since the information exchange and execution processes between the above apparatuses/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division of the above functional units and modules is only used as an example. In practical applications, the above functions may be allocated to different functional units or modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 7, the terminal device 7 of this embodiment includes: at least one processor 70 (only one is shown in FIG. 7), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70; when the processor 70 executes the computer program 72, the steps in any of the foregoing method embodiments are implemented.
The terminal device 7 may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art can understand that FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7; it may include more or fewer components than shown, combine certain components, or use different components, and may, for example, also include input and output devices, network access devices, and the like.
The so-called processor 70 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In some embodiments, the memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 71 may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
An embodiment of the present application provides a computer program product; when the computer program product runs on a mobile terminal, the mobile terminal implements the steps in the foregoing method embodiments when executing it.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunication signals.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative; the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of the technical features therein; these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the protection scope of the present application.

Claims (10)

  1. An action recognition method based on feature interactive learning, characterized by comprising:
    acquiring video data of an action to be recognized, the video data comprising a first video sequence and a second video sequence;
    compressing the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
    inputting the first motion map and the second motion map into a trained dual-stream neural network model, and performing interactive learning on features of the first motion map and features of the second motion map through the trained dual-stream neural network model to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence output by the trained dual-stream neural network model;
    determining a classification result of the action to be recognized based on the first prediction result and the second prediction result.
  2. The method according to claim 1, characterized in that compressing the first video sequence to obtain the first motion map corresponding to the first video sequence comprises:
    acquiring a feature matrix of each video frame in the first video sequence;
    compressing the feature matrices of the video frames according to the temporal order of the video frames in the first video sequence to obtain a feature matrix representing the first motion map.
  3. The method according to claim 1, characterized in that compressing the second video sequence to obtain the second motion map corresponding to the second video sequence comprises:
    performing grayscale processing on the second video sequence to obtain grayscale sequence frames corresponding to the second video sequence;
    compressing the feature matrices of the grayscale sequence frames according to the temporal order of the video frames in the second video sequence to obtain a feature matrix representing the second motion map.
  4. The method according to claim 1, characterized in that the trained dual-stream neural network model comprises a first neural network model, a second neural network model, and a routing module, the routing module being disposed between an intermediate convolution module of the first neural network model and an intermediate convolution module of the second neural network model;
    the input of the first neural network model is the first motion map, and its output is the first prediction result of the first video sequence;
    the input of the second neural network model is the second motion map, and its output is the second prediction result of the second video sequence;
    the routing module is configured to perform interactive learning on the features of the first motion map and the features of the second motion map between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
  5. The method according to claim 4, characterized in that the intermediate convolution modules of the first neural network model comprise first convolution modules with a preset number of layers, and the intermediate convolution modules of the second neural network model comprise second convolution modules corresponding to the first convolution modules;
    inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, comprises:
    using the output of the first convolution module of a first layer and the output of the second convolution module of the first layer as the input of the routing module of the first layer, and performing feature interactive learning by the routing module of the first layer to obtain a first routing output;
    using the superposition of the output of the first convolution module of the first layer and the first routing output as the input of the first convolution module of a second layer, and performing feature learning by the first convolution module of the second layer to obtain the output of the first convolution module of the second layer;
    using the superposition of the output of the second convolution module of the first layer and the first routing output as the input of the second convolution module of the second layer, and performing feature learning by the second convolution module of the second layer to obtain the output of the second convolution module of the second layer;
    using the output of the first convolution module of the second layer and the output of the second convolution module of the second layer as the input of the routing module of the second layer, and performing feature interactive learning by the routing module of the second layer to obtain a second routing output;
    wherein the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolutional layers in the intermediate convolution modules of the first neural network model; the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolutional layers in the intermediate convolution modules of the second neural network model; and the routing module of the first layer and the routing module of the second layer are two adjacent computation modules.
  6. The method according to claim 4, characterized in that the routing module comprises: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit;
    interactive learning is performed, in sequence through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit of the routing module, on the feature matrix output by the convolution computation module of the first neural network model and the feature matrix output by the convolution computation module of the second neural network model, to obtain the feature matrix output by the routing module.
  7. The method according to any one of claims 1 to 6, characterized in that determining the classification result of the action to be recognized based on the first prediction result and the second prediction result comprises:
    performing feature fusion on the first prediction result and the second prediction result to obtain a probability distribution over action categories;
    using the action category with the highest probability in the probability distribution as the classification result of the action to be recognized.
  8. The method according to claim 4, characterized in that the first neural network model comprises a first loss function and the second neural network model comprises a second loss function;
    the first neural network model, the second neural network model, and the routing module are trained through sample video data, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module are adjusted respectively according to the first loss function and the second loss function;
    if the first loss function and the second loss function satisfy a preset threshold, training of the parameters of the first neural network model, the parameters of the second neural network model, and the routing module is stopped, and the trained dual-stream neural network model is obtained.
  9. A terminal device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 8 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
PCT/CN2020/129550 2020-10-10 2020-11-17 Motion recognition method based on feature interactive learning, and terminal device WO2022073282A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011078182.6A CN112257526B (en) 2020-10-10 2020-10-10 Action recognition method based on feature interactive learning and terminal equipment
CN202011078182.6 2020-10-10

Publications (1)

Publication Number Publication Date
WO2022073282A1 true WO2022073282A1 (en) 2022-04-14

Family

ID=74241911

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129550 WO2022073282A1 (en) 2020-10-10 2020-11-17 Motion recognition method based on feature interactive learning, and terminal device

Country Status (2)

Country Link
CN (1) CN112257526B (en)
WO (1) WO2022073282A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434335A (en) * 2023-03-30 2023-07-14 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN117556381A (en) * 2024-01-04 2024-02-13 华中师范大学 Knowledge level depth mining method and system for cross-disciplinary subjective test questions

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022193312A1 (en) * 2021-03-19 2022-09-22 京东方科技集团股份有限公司 Electrocardiogram signal identification method and electrocardiogram signal identification apparatus based on multiple leads
CN113326835B (en) * 2021-08-04 2021-10-29 中国科学院深圳先进技术研究院 Action detection method and device, terminal equipment and storage medium
CN115100740B (en) * 2022-06-15 2024-04-05 东莞理工学院 Human motion recognition and intention understanding method, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020658A (en) * 2019-03-28 2019-07-16 大连理工大学 A kind of well-marked target detection method based on multitask deep learning
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks
CN110633630A (en) * 2019-08-05 2019-12-31 中国科学院深圳先进技术研究院 Behavior identification method and device and terminal equipment
CN111199238A (en) * 2018-11-16 2020-05-26 顺丰科技有限公司 Behavior identification method and equipment based on double-current convolutional neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220616B (en) * 2017-05-25 2021-01-19 北京大学 Adaptive weight-based double-path collaborative learning video classification method
CN107862376A (en) * 2017-10-30 2018-03-30 中山大学 A kind of human body image action identification method based on double-current neutral net
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
TWI745693B (en) * 2018-05-18 2021-11-11 宏達國際電子股份有限公司 Control method and medical system
CN110555340B (en) * 2018-05-31 2022-10-18 赛灵思电子科技(北京)有限公司 Neural network computing method and system and corresponding dual neural network implementation
CN110287820B (en) * 2019-06-06 2021-07-23 北京清微智能科技有限公司 Behavior recognition method, device, equipment and medium based on LRCN network
CN111027377B (en) * 2019-10-30 2021-06-04 杭州电子科技大学 Double-flow neural network time sequence action positioning method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199238A (en) * 2018-11-16 2020-05-26 顺丰科技有限公司 Behavior identification method and equipment based on double-current convolutional neural network
CN110020658A (en) * 2019-03-28 2019-07-16 大连理工大学 A kind of well-marked target detection method based on multitask deep learning
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks
CN110633630A (en) * 2019-08-05 2019-12-31 中国科学院深圳先进技术研究院 Behavior identification method and device and terminal equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434335A (en) * 2023-03-30 2023-07-14 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN117556381A (en) * 2024-01-04 2024-02-13 华中师范大学 Knowledge level depth mining method and system for cross-disciplinary subjective test questions
CN117556381B (en) * 2024-01-04 2024-04-02 华中师范大学 Knowledge level depth mining method and system for cross-disciplinary subjective test questions

Also Published As

Publication number Publication date
CN112257526B (en) 2023-06-20
CN112257526A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2022073282A1 (en) Motion recognition method based on feature interactive learning, and terminal device
US11551333B2 (en) Image reconstruction method and device
CN109359592B (en) Video frame processing method and device, electronic equipment and storage medium
CN110532871B (en) Image processing method and device
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2020192483A1 (en) Image display method and device
US20190304102A1 (en) Memory efficient blob based object classification in video analytics
KR20230013243A (en) Maintain a fixed size for the target object in the frame
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN111310676A (en) Video motion recognition method based on CNN-LSTM and attention
CN110222717B (en) Image processing method and device
US11526704B2 (en) Method and system of neural network object recognition for image processing
CN113066017B (en) Image enhancement method, model training method and equipment
CN111783620A (en) Expression recognition method, device, equipment and storage medium
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
WO2020092276A1 (en) Video recognition using multiple modalities
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN110222718A (en) The method and device of image procossing
WO2022104026A1 (en) Consistency measure for image segmentation processes
WO2022205329A1 (en) Object detection method, object detection apparatus, and object detection system
KR101344851B1 (en) Device and Method for Processing Image
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
CN110633630B (en) Behavior identification method and device and terminal equipment
WO2021189321A1 (en) Image processing method and device
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20956597

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20956597

Country of ref document: EP

Kind code of ref document: A1