WO2021098402A1 - Action recognition method, apparatus, computer storage medium and computer device - Google Patents

Action recognition method, apparatus, computer storage medium and computer device

Info

Publication number
WO2021098402A1
WO2021098402A1 · PCT/CN2020/120076 · CN2020120076W
Authority
WO
WIPO (PCT)
Prior art keywords
time series
frame
convolution
feature map
target
Prior art date
Application number
PCT/CN2020/120076
Other languages
English (en)
French (fr)
Inventor
罗栋豪
王亚彪
郭晨阳
邓博元
汪铖杰
李季檩
黄飞跃
吴永坚
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP20888898.2A priority Critical patent/EP3992846A4/en
Priority to KR1020227005895A priority patent/KR20220038434A/ko
Priority to JP2022516004A priority patent/JP7274048B2/ja
Publication of WO2021098402A1 publication Critical patent/WO2021098402A1/zh
Priority to US17/530,428 priority patent/US11928893B2/en

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02Alarms for ensuring the safety of persons
    • G08B21/04Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
    • G08B21/0407Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons based on behaviour analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02Alarms for ensuring the safety of persons
    • G08B21/0202Child monitoring systems using a transmitter-receiver system carried by the parent and the child
    • G08B21/0261System arrangements wherein the object is to detect trespassing over a fixed physical boundary, e.g. the end of a garden
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Definitions

  • This application relates to the field of artificial intelligence technology, and more specifically to the field of image processing technology, and in particular to an action recognition method, apparatus, computer-readable storage medium, and computer device.
  • In the related art, action recognition on video data generally uses a two-dimensional convolutional neural network to recognize each frame of the video data, and the action recognition results of all frames are then merged to obtain the action recognition result of the video data.
  • However, the accuracy of action recognition using a two-dimensional convolutional neural network alone is low.
  • an action recognition method is provided.
  • an action recognition method, which is executed by a computer device, and includes: acquiring image data of video data in different time series frames, and obtaining, through a multi-channel convolution layer, the original sub-feature maps of the image data of each time series frame on different convolution channels; taking each time series frame as the target time series frame, and calculating the motion information weight of the target time series frame on each convolution channel according to the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame adjacent to the target time series frame on each convolution channel; obtaining the motion information feature map of the target time series frame on each convolution channel according to the motion information weight and the original sub-feature map of the target time series frame on each convolution channel; performing time series convolution on the motion information feature map to obtain the time series motion feature map of the target time series frame on each convolution channel; and identifying, according to the time series motion feature map, the action type of the moving object in the image data of the target time series frame.
  • an action recognition device which is set in a computer device, and includes:
  • the image acquisition module is used to acquire the image data of the video data in different time series frames, and obtain the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer;
  • the weight acquisition module is used to take each time series frame as the target time series frame, and to calculate the motion information weight of the target time series frame on each convolution channel according to the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame adjacent to the target time series frame on each convolution channel;
  • the feature determination module is used to obtain the motion information feature map of the target time series frame on each convolution channel according to the motion information weight of the target time series frame on each convolution channel and the original sub-feature map of the target time series frame on each convolution channel;
  • the timing interaction module is used to perform timing convolution on the motion information feature map of the target timing frame on each convolution channel to obtain the timing motion feature map of the target timing frame on each convolution channel;
  • the action recognition module is used to identify the action type of the moving object in the image data of the target time series frame according to the time series motion feature map of the target time series frame in each convolution channel.
  • an action recognition device which is installed in a computer device, and includes:
  • the image acquisition module is used to acquire real-time surveillance video data, extract the image data of the surveillance video data in different time series frames, and obtain the original sub-feature maps of the image data of each time series frame on different convolution channels through a multi-channel convolution layer;
  • the weight acquisition module is used to take each of the time series frames as the target time series frame, and to calculate the motion information weight of the target time series frame on each convolution channel according to the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame adjacent to the target time series frame on each convolution channel;
  • the feature determination module is configured to obtain the motion information feature map of the target time series frame on each convolution channel according to the weight of the motion information and the original sub-feature map of the target time series frame on each convolution channel;
  • the time series interaction module is used to perform time series convolution on the motion information feature map to obtain the time series motion feature map of the target time series frame on each convolution channel;
  • the action recognition module is used for recognizing the action type of the moving object in the image data of the target time series frame according to the time series motion feature map, and for determining the action type as the action information of the moving object in the current surveillance video data.
  • One or more computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps in the action recognition method of the embodiments of the present application.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • When the computer-readable instructions are executed by the one or more processors, the one or more processors execute the steps in the action recognition method of each embodiment of the present application.
  • Figure 1 is an application environment diagram of an action recognition method in an embodiment
  • Figure 2 is a schematic structural diagram of an action recognition network model in an embodiment
  • FIG. 3 is a schematic flowchart of an action recognition method in an embodiment
  • FIG. 4 is a schematic diagram of a step of generating a time series motion feature map in an embodiment
  • FIG. 5 is a schematic flowchart of the step of calculating the weight of motion information in an embodiment
  • Fig. 6a is a schematic flowchart of a step of obtaining difference information in an embodiment
  • Fig. 6b is a schematic diagram of calculating the weight of motion information in an embodiment
  • FIG. 7 is a schematic flowchart of a step of generating a time series motion feature map in an embodiment
  • FIG. 8a is a schematic flowchart of a step of identifying the action type of a moving object in the image data of the target time series frame according to the time series motion characteristic diagram of the target time series frame in each convolution channel in an embodiment
  • Figure 8b is a schematic structural diagram of a residual network layer in an embodiment
  • FIG. 9 is a schematic flowchart of a parameter training step in an embodiment
  • FIG. 10 is a visualization schematic diagram of an original sub-feature map, a motion information feature map, and a time series motion feature map in an embodiment
  • FIG. 11 is a schematic flowchart of an action recognition method in another embodiment
  • Figure 12 is a structural block diagram of an action recognition device in an embodiment
  • Figure 13 is a structural block diagram of a weight acquisition module in an embodiment
  • Fig. 14 is a structural block diagram of a computer device in an embodiment.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer Vision is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers in place of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection.
  • Computer vision studies related theories and technologies trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric recognition technologies such as face recognition and fingerprint recognition.
  • Fig. 1 is an application environment diagram of the action recognition method in an embodiment.
  • the action recognition method is applied to a computer device, and the computer device may be a terminal or a server.
  • an action recognition network model is deployed in the computer device, and the action recognition network is a network model correspondingly constructed according to the action recognition method provided in this application.
  • the server extracts the image data of multiple time series frames from the video data.
  • the image data of multiple time series frames obtained from the video data all contain moving objects.
  • the image data is input to the action recognition network model.
  • the action recognition network model performs action recognition on the image data of each time series frame obtained from the video data and obtains the action type corresponding to the image data of each time series frame; the action types corresponding to the image data of all the extracted time series frames can then be merged to obtain the action recognition result of the video data.
  • the video data can be real-time surveillance video.
  • Real-time action recognition can be performed on the monitored object in the image data of each time series frame of the surveillance video to obtain the action information of the monitored object in each frame, realizing real-time monitoring of the monitored object without the need to manually watch the video data to learn its behavior.
  • the video data may be a sign language video.
  • In that case, recognition is performed on the hand motions in the image data of each time series frame of the sign language video.
  • Fig. 2 is a schematic structural diagram of an action recognition network model in an embodiment.
  • the action recognition network model includes a multi-channel convolutional layer, an action information enhancement module, a timing interaction module, and a backbone network layer.
  • the multi-channel convolution layer is used to obtain the original feature map of the image data of each time series frame, where the original feature map includes the original sub-feature maps on different convolution channels;
  • the action information enhancement module is used to enhance the action information of the original sub-feature maps of the image data of each time series frame on different convolution channels to obtain the action information feature maps of the image data of each time series frame on different convolution channels;
  • the timing interaction module is used to perform convolution operations on the motion information feature maps of the image data of adjacent time series frames on the same convolution channel to obtain a time series motion feature map, which combines the motion information of the previous and next adjacent time series frames; the backbone network layer is used to obtain the action type according to the time series motion feature map.
  • the backbone network layer is a 2D convolutional network used for action recognition, which is composed of multiple sub-network layers connected in sequence; for example, the backbone network layer may be composed of three sub-network layers connected in sequence.
  • the backbone network layer may be a ResNet-50 convolutional neural network.
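  • For illustration only, the following minimal PyTorch sketch shows how the four components described above could be composed; the kernel sizes, channel counts, and the use of identity placeholders for the action information enhancement and timing interaction modules (concrete sketches of which appear later in this document) are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class ActionRecognitionNet(nn.Module):
    """Illustrative composition: multi-channel convolution layer -> action information
    enhancement -> timing interaction -> 2D backbone (e.g. ResNet-50)."""
    def __init__(self, num_actions, channels=64):
        super().__init__()
        # Multi-channel convolution layer: each output channel is one original sub-feature map.
        self.multi_channel_conv = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        # Stand-ins for the two modules; concrete sketches are given later in this document.
        self.action_info_enhancement = nn.Identity()
        self.timing_interaction = nn.Identity()
        # Backbone network layer: a 2D CNN classifier, here ResNet-50 adapted to `channels` inputs.
        backbone = torchvision.models.resnet50(num_classes=num_actions)
        backbone.conv1 = nn.Conv2d(channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = backbone

    def forward(self, frames):                      # frames: (N, T, 3, H, W)
        n, t, c, h, w = frames.shape
        x = self.multi_channel_conv(frames.reshape(n * t, c, h, w))
        x = self.action_info_enhancement(x)         # motion information feature maps
        x = self.timing_interaction(x)              # time series motion feature maps
        logits = self.backbone(x)                   # per-frame action logits
        return logits.reshape(n, t, -1)

# Usage: 2 clips of 8 RGB frames each.
model = ActionRecognitionNet(num_actions=10)
out = model(torch.randn(2, 8, 3, 224, 224))         # (2, 8, 10)
```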
  • an action recognition method is provided.
  • the method is mainly applied to the server 102 in FIG. 1 as an example.
  • the action recognition method specifically includes the following steps:
  • Step S302 Obtain the image data of the video data in different time series frames, and obtain the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer.
  • the video data can be any video data.
  • video data refers to video that includes a moving object, for example a dance video, a surveillance video, or a sign language video; in terms of source, the video data may be surveillance video shot by a camera, or video data sent by another device.
  • the image data of different time series frames refers to the image data extracted from the video data in a time sequence, which may include the image data of all time series frames in the video data, and may also include the image data of some consecutive time series frames.
  • the image data of the video data in different time series frames can be obtained sequentially according to the order of the image data in the video data, or extracted from the video data at a certain sampling frequency; for example, the image data of the first frame of the video data is taken as the image data of the first time series frame, and the image data of subsequent time series frames is then extracted at a certain sampling frequency according to the arrangement order of the image data in the video data.
  • the number of frames of image data may be determined according to the complexity requirements of action recognition, or determined according to the number of image data frames in the video data.
  • the original sub-feature map refers to feature information that characterizes the image data.
  • the multi-channel convolution layer refers to the network model used to obtain the feature information in the image data; the multi-channel convolution layer here is a trained network model that can be directly used to obtain the feature information of the image data.
  • the multi-channel convolution layer includes multiple convolution kernels, and the convolution channels are determined by the multi-channel convolution layer: the number of convolution kernels used to extract features from the image data in the multi-channel convolution layer is the number of convolution channels.
  • the image data is input into the multi-channel convolution layer as its input data, and each convolution kernel in the multi-channel convolution layer performs a convolution calculation on the image data to obtain the original sub-feature map corresponding to each convolution kernel, that is, to each convolution channel.
  • the image data of different time series frames obtained from the video data is a grayscale image
  • the grayscale image is input into the multi-channel convolutional layer to obtain the original feature map of the output of the multi-channel convolutional layer.
  • the data dimensions of the original feature map are C, H, and W, where H and W denote the height and width of the original feature map, and C represents the channel dimension; that is, the original feature map includes C original sub-feature maps.
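  • As a concrete illustration of this step, the following sketch (assuming grayscale frames and C = 64 convolution kernels, both illustrative choices) shows a multi-channel convolution layer producing C original sub-feature maps of spatial size H*W for each time series frame.

```python
import torch
import torch.nn as nn

# Assume 8 grayscale frames sampled from the video, each 224x224.
frames = torch.randn(8, 1, 224, 224)          # (T, 1, H, W)

# A multi-channel convolution layer with C = 64 convolution kernels:
# each kernel produces one original sub-feature map, i.e. one convolution channel.
multi_channel_conv = nn.Conv2d(in_channels=1, out_channels=64,
                               kernel_size=3, padding=1)

original_feature_maps = multi_channel_conv(frames)   # (T, C, H, W) = (8, 64, 224, 224)
print(original_feature_maps.shape)
# Each frame now has 64 original sub-feature maps, one per convolution channel.
```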
  • Step S304 Use each time series frame as the target time series frame, and calculate the motion information weight of the target time series frame on each convolution channel according to the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame adjacent to the target time series frame on each convolution channel.
  • the next time series frame refers to the time series frame corresponding to the next time relative to the target time series frame.
  • For example, the target time series frame is the t-th frame, that is, the image data of the target time series frame is the image data of the t-th frame obtained from the video data; the next time series frame is the (t+1)-th frame, that is, the image data of the next time series frame is the image data of the (t+1)-th frame acquired from the video data.
  • the motion information weight refers to the probability distribution of attention allocated to the original sub-feature maps of the image data of the target time series frame on the different convolution channels; the size of the motion information weight is related to how strongly the original sub-feature map of the image data of the target time series frame on each convolution channel correlates with the motion information of the moving object, or, in other words, to how much motion information the original sub-feature map on each convolution channel contains. It is understandable that the more the original sub-feature map of the image data of the target time series frame on a certain convolution channel is related to the motion information of the moving object, and the more motion information it contains, the more attention is allocated to the original sub-feature map on that convolution channel, and the greater its motion information weight.
  • the image data of each time series frame acquired from the video data contains information that is critical for action recognition, such as the apparent information of moving objects, and also contains information that is useless or even counterproductive for action recognition, such as noise or background information in the image data.
  • By allocating more attention to the original sub-feature maps on the convolution channels that contain more motion information, and suppressing the original sub-feature maps that contain less motion information or more noise information, that is, allocating less attention to the original sub-feature maps on those convolution channels, information that is beneficial to action recognition is enhanced and information that is irrelevant or even harmful to action recognition is suppressed, which effectively improves the accuracy of action recognition.
  • Since, within a single frame, the moving object and the background information are both static, and movement is a process of change, the change process of the object's action needs to be described through the image data of the target time series frame together with the image data of the next time series frame in order to improve the accuracy of action recognition. Specifically, after acquiring the original sub-feature maps of the image data of each time series frame on each convolution channel, each time series frame is in turn used as the target time series frame, and the motion information weight corresponding to the original sub-feature map of the target time series frame on each convolution channel is obtained according to the original sub-feature maps of its image data on the different convolution channels and the original sub-feature maps of the image data of the subsequent time series frame on the different convolution channels.
  • Specifically, the degree of difference between the original sub-feature map of the image data of the target time series frame on each convolution channel and the original sub-feature map of the image data of the next time series frame on the corresponding convolution channel can be calculated first, and the motion information weight of the image data of the target time series frame on each convolution channel is then determined according to the degree of difference between the original sub-feature maps on each convolution channel.
  • Step S306 according to the motion information weight of the target time series frame on each convolution channel and the original sub-feature map of the target time series frame on each convolution channel, obtain the motion information feature map of the target time series frame on each convolution channel.
  • Specifically, the motion information weight of the target time series frame on each convolution channel can be applied to the original sub-feature map of the target time series frame on the corresponding convolution channel in order to obtain the motion information feature map of the target time series frame on each convolution channel.
  • Since the motion information weight describes the correlation between the original sub-feature maps of the image data of the target time series frame on the different convolution channels and the motion information of the moving object, the motion information weight on each convolution channel is multiplied with the original sub-feature map on the corresponding channel to obtain the motion information feature map of the target time series frame on each convolution channel. In this way, the original sub-feature maps that are strongly correlated with the motion information of the moving object are enhanced, and the original sub-feature maps that are weakly correlated with the motion information of the moving object are suppressed, so that information beneficial to action recognition is enhanced and information irrelevant or even harmful to action recognition is suppressed. The motion information feature map therefore contains more motion information related to the moving object, which is conducive to subsequent action recognition of the moving object and effectively improves its accuracy.
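  • The channel-wise weighting described above can be illustrated in a few lines; the weight values below are placeholders standing in for the Sigmoid outputs described later.

```python
import torch

# Original sub-feature maps of one target time series frame: (C, H, W).
x = torch.randn(64, 56, 56)

# Motion information weights for this frame, one scalar per convolution channel,
# e.g. produced by the sigmoid mapping described later (placeholder values here).
weights = torch.rand(64)

# Broadcast multiply: channels with larger weights are enhanced,
# channels with smaller weights are suppressed.
motion_info_feature_map = x * weights.view(64, 1, 1)
```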
  • Step S308 Perform time series convolution on the motion information feature map of the target time series frame on each convolution channel to obtain the time series motion feature map of the target time series frame on each convolution channel.
  • Specifically, when time series convolution is performed on the motion information feature map of the target time series frame on each convolution channel, the time series frames to be convolved are determined according to the target time series frame, and the motion information feature maps of the target time series frame and of the time series frames to be convolved on the same convolution channel are convolved.
  • the time series frames to be convolved are time series frames adjacent to the target time series frame; they may be the two time series frames immediately before and after the target time series frame, or the four time series frames before and after it (two on each side).
  • For example, when the target time series frame is the t-th frame, the time series frames to be convolved may include the time series frames immediately before and after the target time series frame, that is, the (t-1)-th frame and the (t+1)-th frame; in this case, for the t-th frame, the motion information feature maps of the (t-1)-th frame, the t-th frame, and the (t+1)-th frame on the same convolution channel are convolved to obtain the time series motion feature map of the t-th frame on each convolution channel. The time series frames to be convolved may also include the first two and last two time series frames adjacent to the target time series frame, that is, the (t-2)-th, (t-1)-th, (t+1)-th and (t+2)-th frames; in this case, for the t-th frame, the motion information feature maps of the (t-2)-th, (t-1)-th, t-th, (t+1)-th and (t+2)-th frames on the same convolution channel are convolved to obtain the time series motion feature map of the t-th frame on each convolution channel.
  • In this embodiment, the time series frames adjacent to the target time series frame can be determined as the time series frames to be convolved, and a convolution operation is performed on the motion information feature maps of the target time series frame and of the time series frames to be convolved on the same convolution channel to obtain the time series motion feature map of the target time series frame on each convolution channel, so that the time series motion feature map combines the motion feature maps of the previous and next time series frames; that is, the motion information of the moving object is modeled in the time series dimension.
  • the method for acquiring the motion information feature map of the time series frame to be convolved on each convolution channel is the same as the method for acquiring the motion information feature map of the target time series frame on each convolution channel.
  • FIG. 4 is a schematic diagram of performing temporal convolution on the motion information feature map of the target timing frame on each convolution channel in an embodiment to obtain the timing motion feature map of the target timing frame on each convolution channel
  • the matrix diagram on the left in the figure represents the motion information feature diagram of each time series frame on each convolution channel
  • the matrix diagram on the right represents the time series motion feature diagram of each time series frame on each convolution channel
  • the horizontal axis of the matrix diagrams in the figure represents the dimension of the convolution channel, and the vertical axis represents the dimension of the time series frame; the first row represents the motion information feature map of the first time series frame on each convolution channel, the second row represents the motion information feature map of the second time series frame on each convolution channel, and so on.
  • Taking the case where the time series frames to be convolved include the previous time series frame and the next time series frame of the second time series frame as an example, for the time series motion feature map of the second time series frame, a 3*1 convolution kernel is used to perform a convolution operation on the motion information feature map of the first time series frame on the first convolution channel, the motion information feature map of the second time series frame on the first convolution channel, and the motion information feature map of the third time series frame on the first convolution channel, to obtain the time series motion feature map of the second time series frame on the first convolution channel. Similarly, a 3*1 convolution kernel is used to perform a convolution operation on the motion information feature map of the first time series frame on the second convolution channel (A1 in the figure), the motion information feature map of the second time series frame on the second convolution channel (A2 in the figure), and the motion information feature map of the third time series frame on the second convolution channel (A3 in the figure), to obtain the time series motion feature map of the second time series frame on the second convolution channel (B in the figure), and so on, until the time series motion feature maps of the second time series frame on all convolution channels are obtained.
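  • One way to realize the 3*1 time series convolution of Fig. 4 is a channel-wise one-dimensional convolution over the time axis; the following sketch assumes this formulation, with padding so that every time series frame receives an output.

```python
import torch
import torch.nn as nn

T, C, H, W = 8, 64, 56, 56
# Motion information feature maps for all time series frames: (T, C, H, W).
motion_feats = torch.randn(T, C, H, W)

# A 3x1 temporal convolution per convolution channel (groups=C keeps channels independent),
# padding=1 so each frame is convolved with its previous and next adjacent frames.
temporal_conv = nn.Conv1d(in_channels=C, out_channels=C,
                          kernel_size=3, padding=1, groups=C)

# Rearrange so the convolution slides over the time dimension at each spatial position.
x = motion_feats.permute(2, 3, 1, 0).reshape(H * W, C, T)             # (H*W, C, T)
y = temporal_conv(x)                                                   # (H*W, C, T)
timeseries_motion_feats = y.reshape(H, W, C, T).permute(3, 2, 0, 1)    # (T, C, H, W)
```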
  • Step S310 Acquire the action type of the moving object in the image data of the target time series frame according to the time series motion characteristic diagram of the target time series frame in each convolution channel.
  • the time series motion characteristic diagram can be used as the characteristic information of the image data to identify the action type of the moving object in the image data of the target time series frame.
  • the time series motion feature map includes both strong motion-related information and time series information. Using the time series motion feature map for action recognition can effectively improve the accuracy of action recognition.
  • the time series motion feature map can be used as the feature information of the image data and input into the 2D convolutional network for motion recognition to identify the motion type of the moving object in the image data of the target time series frame.
  • the 2D convolutional network may include a ResNet-50 convolutional neural network; after the time series motion feature maps of the target time series frame on each channel are input into the ResNet-50 convolutional neural network, it outputs the probability that the time series feature maps point to each action type, so as to identify the action type of the moving object in the image data of the target time series frame.
  • the step of identifying the action type of the moving object in the image data of the target time series frame according to the time series motion feature map of each convolution channel of the target time series frame is executed by the backbone network layer.
  • the time series motion feature maps of the target time series frames in each convolution channel are input to the backbone network layer.
  • the backbone network layer functions as a classifier, and the backbone network layer outputs the motion types of the moving objects in the image data of the target time series frames.
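  • As a small illustration of this classification step (the action class names below are placeholders), the backbone's output probabilities for one target time series frame can be turned into an action type by taking the most probable class.

```python
import torch

action_types = ["wave", "fall", "walk", "sit"]        # placeholder action classes

# Suppose the backbone network layer produced these logits for one target time series frame.
frame_logits = torch.randn(len(action_types))

probs = torch.softmax(frame_logits, dim=0)             # probability of each action type
action_type = action_types[int(probs.argmax())]        # action type of the moving object
print(action_type)
```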
  • the step, described in step S302, of obtaining the original sub-feature maps of the image data of each time series frame on different convolution channels is performed by the multi-channel convolution layer; that is, the original sub-feature maps of the image data of each time series frame on the different convolution channels are obtained through the multi-channel convolution layer.
  • the step, described in step S304, of taking each time series frame as the target time series frame and calculating the motion information weight of the target time series frame on each convolution channel according to the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame adjacent to the target time series frame on each convolution channel is executed by the action information enhancement module.
  • the step, described in step S308, of performing time series convolution on the motion information feature map of the target time series frame on each convolution channel to obtain the time series motion feature map of the target time series frame on each convolution channel is executed by the time series interaction module.
  • The above action recognition method obtains the image data of the video data in different time series frames and, after obtaining the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer, uses each time series frame as the target time series frame. The motion information weight of the target time series frame on each convolution channel is obtained from the original sub-feature maps of the target time series frame and of the next time series frame on each convolution channel, and the motion information weight is applied to the original sub-feature map on the corresponding convolution channel to enhance the motion information on the original sub-feature maps within a single time series frame, yielding the motion information feature map of the target time series frame on each convolution channel. Time series convolution is then performed on the motion information feature maps of the target time series frame on each convolution channel, so that they merge the motion information feature maps of the adjacent time series frames and modeling in the time series dimension is realized, yielding the time series motion feature map of the target time series frame on each convolution channel. Finally, the time series motion feature maps of the target time series frame on each convolution channel are used as the feature information of the image data of the target time series frame for action recognition, and the action type of the moving object in the image data of the target time series frame is identified.
  • This action recognition method not only enhances the motion information on the original sub-feature maps within a single time series frame, but also models the time series information between the time series frames, so that disrupting the order of the time series frames would yield completely different action recognition results; this effectively improves the accuracy of action recognition.
  • In one embodiment, the method further includes: after the action type of the moving object in the image data of each time series frame is obtained, determining the action type of the video data according to the action type of each time series frame.
  • the subsequent time series frames are successively used as the target time series frame, and the action type of the moving object in the image data is obtained.
  • the action type corresponding to the moving object in the image data of all the time series frames of the video data is finally merged to obtain the action recognition result of the video data.
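  • The merging rule is not fixed here; one simple, hedged example is majority voting over the per-frame action types (averaging per-frame probabilities would be another option).

```python
from collections import Counter

# Action types recognized for the image data of each extracted time series frame (placeholders).
frame_action_types = ["wave", "wave", "walk", "wave", "wave"]

# One simple merge rule: the most frequent per-frame action type becomes the
# action recognition result of the video data.
video_action_type = Counter(frame_action_types).most_common(1)[0][0]
print(video_action_type)   # "wave"
```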
  • As shown in FIG. 5, the step of calculating the motion information weight of the target time series frame on each convolution channel includes:
  • Step S502 Obtain the difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel.
  • the difference information can describe the degree of motion change of the moving object between the image data of the two time series frames, that is, the information related to the motion of the moving object; as mentioned above, the image data of each time series frame acquired from the video data contains both information that is critical for action recognition and noise information that is useless or even counterproductive for action recognition.
  • Within a single frame, both the moving object and the background information are static, and movement is a process of change; therefore, it is difficult to obtain the motion information of the moving object based on the image data of a single time series frame alone.
  • The action changes between the previous and subsequent time series frames are reflected by the difference information between the original sub-feature maps on the corresponding convolution channels, from which the motion information contained in the original sub-feature maps of the image data of the target time series frame on each convolution channel can be obtained.
  • The greater the difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the subsequent time series frame on the corresponding convolution channel, the more the original sub-feature map on that convolution channel is related to the motion information of the moving object, and the more motion-related feature information it contains; conversely, the smaller the difference information, the more irrelevant the original sub-feature map on that convolution channel is to the motion information of the moving object, and the less motion-related feature information it contains.
  • The difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel can be obtained specifically by calculating the difference between the original sub-feature map of the image data of the target time series frame on each convolution channel and the original sub-feature map of the image data of the subsequent time series frame on the corresponding convolution channel.
  • step S504 the difference information on each convolution channel is mapped to the weight of the motion information of the target time series frame on each convolution channel through the activation function.
  • Specifically, the motion information weight of each convolution channel can be obtained from the difference information on that convolution channel through the activation function: the greater the difference information between the original sub-feature maps of the target time series frame and of the subsequent time series frame on the corresponding convolution channel, the greater the motion information weight of the original sub-feature map on that convolution channel; conversely, the more irrelevant the original sub-feature map on a convolution channel is to the motion information of the moving object, the smaller the motion information weight of the original sub-feature map on that convolution channel.
  • the activation function may be a Sigmoid function.
  • Specifically, the Sigmoid function can be used to map the difference information on each convolution channel to a weight value between 0 and 1, so as to obtain the motion information weight of the original sub-feature map of the target time series frame on each channel.
  • the step of obtaining difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel includes:
  • step S602 the original sub-feature map of the target time-series frame on each convolution channel and the original sub-feature map of the next time-series frame on each convolution channel are respectively transformed into unit sub-feature maps through the unit pooling layer.
  • the unit pooling layer refers to the pooling layer used for dimensionality reduction of the original sub-feature map.
  • the unit pooling layer may include an average pooling layer, such as global average pooling.
  • the unit sub-feature map refers to a feature map whose length and width are both 1.
  • the original sub-feature map with a space size of H*W can be reduced to a unit sub-feature map with a space size of 1*1 through the unit pooling layer.
  • the dimension of the convolution channel is unchanged at this point; that is, the number of convolution channels of the obtained unit sub-feature map is equal to the number of convolution channels of the original sub-feature map.
  • Step S604 The unit sub-feature map of the target time series frame on each convolution channel and the unit sub-feature map of the next time series frame on each convolution channel are reduced in the convolution channel dimension by a preset zoom factor to obtain the reduced unit sub-feature maps.
  • the preset zoom factor is set according to the actual situation; it can be determined from the ratio between the number of original sub-feature maps in the convolution channel dimension and the number of unit sub-feature maps in the convolution channel dimension after dimensionality reduction. For example, if the number of original sub-feature maps in the convolution channel dimension is 256 and, after dimensionality reduction, the number of unit sub-feature maps in the convolution channel dimension is 16, the preset zoom factor is 16.
  • Specifically, the number of unit sub-feature maps corresponding to the target time series frame and the subsequent time series frame in the convolution channel dimension can be reduced by a dimensionality-reduction convolution layer, where the size of each convolution kernel in the dimensionality-reduction convolution layer is 1*1 and the number of convolution kernels is equal to the number, in the convolution channel dimension, of unit sub-feature maps that need to be obtained after the dimensionality reduction.
  • For example, if the spatial size of the original sub-feature maps of each time series frame is H*W and their number in the convolution channel dimension is C, that is, C original sub-feature maps with spatial size H*W are included, then the data dimension of the original sub-feature maps of the image data of each time series frame is C*H*W. The unit sub-feature maps obtained after the unit pooling layer keep the same number in the convolution channel dimension, while the spatial size is reduced to 1*1, so the data dimension of the unit sub-feature maps is (C*1*1). Then, the convolution channel dimension is reduced through the dimensionality-reduction convolution layer, and the number of unit sub-feature maps in the convolution channel dimension is reduced to (C/r); that is, the data dimension of the unit sub-feature maps after dimensionality reduction is (C/r*1*1), where r is the zoom factor.
  • Step S606 Obtain dimensionality reduction difference information between the unit sub-feature map after the dimensionality reduction of the target time series frame and the unit sub-feature map after the dimensionality reduction of the next time series frame.
  • Specifically, the dimensionality-reduction difference information between the reduced unit sub-feature map of the target time series frame and the reduced unit sub-feature map of the next time series frame can be obtained by calculating the difference, on the corresponding convolution channel, between the reduced unit sub-feature map of the target time series frame and the reduced unit sub-feature map of the next time series frame.
  • step S608 the dimensionality reduction difference information is increased by a preset zoom factor to obtain the difference information between the original sub-feature map of the target time-series frame and the original sub-feature map of the next time-series frame on each convolution channel.
  • Specifically, the number of dimensionality-reduction difference information entries in the convolution channel dimension can be restored, through a dimensionality-raising convolution layer, to be consistent with the number of convolution channels of the original sub-feature maps.
  • the size of the convolution kernel in the dimension-up convolution layer is 1*1, and the number of convolution kernels is equal to the number of convolution channels of the original sub-feature map.
  • In this embodiment, the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the subsequent time series frame on each convolution channel are transformed into unit sub-feature maps through the unit pooling layer, and the obtained unit sub-feature maps are reduced by the preset zoom factor in the convolution channel dimension. The data amount of the reduced unit sub-feature maps is thus greatly reduced compared with the original sub-feature maps, so that calculating the difference information between the original sub-feature maps of the target time series frame and of the next time series frame on each convolution channel is converted into calculating the difference information between the reduced unit sub-feature map of the target time series frame and the reduced unit sub-feature map of the next time series frame, which effectively reduces the amount of calculation and increases the calculation speed.
  • Fig. 6b is a schematic diagram of calculating the motion information weight of the target time series frame on each convolution channel in an embodiment.
  • the two inputs A and B in Fig. 6b respectively represent the original sub-feature maps of the target time series frame and the original sub-feature maps of the next time series frame.
  • The data dimensions of input A and input B are both C*H*W, where H and W respectively denote the height and width of the original sub-feature maps and C represents their number in the convolution channel dimension; that is, both input A and input B include original sub-feature maps of C convolution channels with spatial size H*W.
  • The original sub-feature maps in input A and in input B are first reduced by the unit pooling layer to obtain unit sub-feature maps with C convolution channels and spatial size 1*1.
  • The unit sub-feature maps corresponding to input A are reduced in the convolution channel dimension through the first dimensionality-reduction convolution layer, and the data dimension after dimensionality reduction is C/r*1*1; the unit sub-feature maps corresponding to input B are reduced in the convolution channel dimension through the second dimensionality-reduction convolution layer, and the data dimension after dimensionality reduction is likewise C/r*1*1. It should be understood that the network parameters of the first dimensionality-reduction convolution layer and the second dimensionality-reduction convolution layer are consistent.
  • the unit sub-feature map (data dimension C/r*1*1) after the dimensionality reduction of the two time series frames of input A and input B is subtracted to obtain the dimensionality reduction difference information that characterizes the motion information.
  • The data dimension of the dimensionality-reduction difference information is C/r*1*1; its number of convolution channels is then restored, through the dimensionality-raising convolution layer, to the same number of convolution channels as the original sub-feature maps, yielding difference information with data dimension C*1*1. Finally, through the Sigmoid function, the difference information corresponding to each convolution channel is mapped to a motion information weight with a value between 0 and 1.
  • The motion information weight of each convolution channel is multiplied by the original sub-feature map of the corresponding convolution channel, so that the feature information of the original sub-feature maps of some convolution channels is enhanced to varying degrees while that of the other convolution channels is suppressed to varying degrees; in this way, the feature information of the subsequent time series frame is used to enhance the feature information related to the motion information in the original sub-feature maps of the target time series frame. It should be understood that since the last time series frame has no subsequent frame, feature information from a subsequent frame cannot be used to enhance it; that is, its motion information feature map is consistent with its original sub-feature maps.
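  • Pulling the steps of Fig. 5, Fig. 6a and Fig. 6b together, the following is a minimal sketch of the motion information weighting described above (unit pooling, shared 1*1 dimensionality-reduction convolution, subtraction, 1*1 dimensionality-raising convolution, Sigmoid, channel-wise multiplication); the zoom factor r = 16 and all module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionInformationWeights(nn.Module):
    """Sketch of the motion information enhancement described above (Fig. 6b)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # unit pooling layer: HxW -> 1x1
        # 1x1 dimensionality-reduction convolution, shared by the target frame and the
        # next frame (the document states their parameters are consistent).
        self.reduce = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.expand = nn.Conv2d(channels // r, channels, kernel_size=1)  # back to C channels

    def forward(self, frames):                              # frames: (T, C, H, W)
        pooled = self.pool(frames)                          # (T, C, 1, 1) unit sub-feature maps
        reduced = self.reduce(pooled)                        # (T, C/r, 1, 1)
        # Dimensionality-reduction difference info between frame t and frame t+1.
        diff = reduced[1:] - reduced[:-1]                    # (T-1, C/r, 1, 1)
        weights = torch.sigmoid(self.expand(diff))           # (T-1, C, 1, 1), values in (0, 1)
        out = frames.clone()
        # Channel-wise weighting of frames 0 .. T-2; the last frame has no next frame,
        # so its motion information feature map equals its original sub-feature maps.
        out[:-1] = frames[:-1] * weights
        return out

# Usage: 8 frames with 64 convolution channels.
feats = torch.randn(8, 64, 56, 56)
motion_feats = MotionInformationWeights(64)(feats)          # (8, 64, 56, 56)
```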
  • In one embodiment, the step of performing time series convolution on the motion information feature map of the target time series frame on each convolution channel to obtain the time series motion feature map of the target time series frame on each convolution channel includes:
  • Step S702 respectively acquiring the motion information feature map of the previous time series frame adjacent to the target time series frame in each convolution channel and the motion information feature map of the next time series frame adjacent to the target time series frame in each convolution channel;
  • Step S704 Use a time series convolution kernel to perform a convolution operation on the motion information feature maps of the target time series frame, the previous time series frame, and the next time series frame on the same convolution channel, to obtain the time series motion feature map of the target time series frame on each convolution channel.
  • Specifically, the motion information feature maps of the target time series frame, the previous time series frame, and the next time series frame on the same convolution channel are subjected to a convolution operation to obtain the time series motion feature map of the target time series frame on that convolution channel, and then the time series motion feature maps of the target time series frame on all convolution channels are obtained; this enables the time series motion feature map to integrate the motion feature maps of the previous and subsequent time series frames, that is, the motion information of the moving object, and realizes modeling in the time series dimension.
  • The method of acquiring the motion information feature map of the previous time series frame on each convolution channel and the motion information feature map of the next time series frame on each convolution channel is the same as the method of acquiring the motion information feature map of the target time series frame on each convolution channel.
  • For example, the target time series frame is the t-th frame, and the previous time series frame adjacent to the target time series frame is the (t-1)-th frame. For the motion information feature map of the previous time series frame (the (t-1)-th frame), the motion information weight of the (t-1)-th frame on each convolution channel is first calculated according to the original sub-feature map of the (t-1)-th frame on each convolution channel and the original sub-feature map of the t-th frame on each convolution channel; the motion information feature map of the (t-1)-th frame on each convolution channel is then obtained according to the motion information weight of the (t-1)-th frame on each convolution channel and the original sub-feature map of the (t-1)-th frame on each convolution channel. The next time series frame adjacent to the target time series frame is the (t+1)-th frame; the motion information weight of the (t+1)-th frame on each convolution channel is calculated according to the original sub-feature map of the (t+1)-th frame on each convolution channel and the original sub-feature map of the (t+2)-th frame on each convolution channel, and the motion information feature map of the (t+1)-th frame on each convolution channel is then obtained according to the motion information weight of the (t+1)-th frame on each convolution channel and the original sub-feature map of the (t+1)-th frame on each convolution channel.
  • Taking the action recognition network model shown in Figure 2 as an example, the above step of using the time series convolution kernel to convolve the motion information feature maps of the target time series frame, the previous time series frame, and the next time series frame on the same convolution channel to obtain the time series motion feature map of the target time series frame on each convolution channel can be executed by the timing interaction module, as shown in Figure 4. Taking the third time series frame in the figure as the target time series frame, a 3*1 convolution kernel is used to convolve the first convolution channel of the second, third, and fourth time series frames to obtain the time series motion feature map of the third time series frame on the first convolution channel.
  • Similarly, a 3*1 convolution kernel is used to convolve the second convolution channel of the second, third, and fourth time series frames to obtain the time series motion feature map of the third time series frame on the second convolution channel, and so on, to obtain the time series motion feature map of the third time series frame on every convolution channel.
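  • To make the per-channel temporal convolution concrete, the following is a minimal PyTorch-style sketch (not the patent's reference implementation): the tensor layout (N, T, C, H, W), the module name TemporalInteraction, and the use of a grouped Conv1d are assumptions introduced here. It slides a 3*1 kernel over each convolution channel independently across adjacent time series frames, with zero padding covering the first and last frames as described above:

      import torch
      import torch.nn as nn

      class TemporalInteraction(nn.Module):
          """Per-channel 3x1 temporal convolution over adjacent time series frames (a sketch)."""

          def __init__(self, channels: int):
              super().__init__()
              # groups=channels keeps every convolution channel independent, so the
              # information of adjacent convolution channels is never mixed.
              self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                                    padding=1, groups=channels, bias=False)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              # x: motion information feature maps of all frames, shape (N, T, C, H, W)
              n, t, c, h, w = x.shape
              # fold the spatial positions into the batch so Conv1d slides over time
              y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
              y = self.conv(y)            # zero padding handles the first and last frames
              y = y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)
              return y                    # time series motion feature maps, (N, T, C, H, W)

      # usage sketch: 4 frames, 2 convolution channels, 7x7 feature maps
      feats = torch.randn(1, 4, 2, 7, 7)
      print(TemporalInteraction(channels=2)(feats).shape)   # torch.Size([1, 4, 2, 7, 7])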
  • In one embodiment, as shown in Figure 8a, the step of identifying the action type of the moving object in the image data of the target time series frame according to the time series motion feature maps of the target time series frame on each convolution channel includes:
  • Step S802: Input the time series motion feature maps of the target time series frame into the residual network layer to obtain the action feature information of the image data of the target time series frame.
  • The residual network layer performs further feature learning on the time series motion feature maps to obtain action feature information that better characterizes the action type of the moving object. Specifically, after the time series motion feature maps of the target time series frame on each convolution channel are obtained, they are used as the feature information of the image data of the target time series frame and input into the residual network layer, where feature learning is performed on each time series motion feature map to obtain the action feature information of the image data. The number of channels of the action feature information can be consistent with that of the time series motion feature maps.
  • Step S804 Input the action feature information into the action classification network layer, and identify the action type of the moving object in the image data of the target time series frame.
  • the action classification network layer is a network structure used to recognize action types according to the action feature information of the image data.
  • The action classification network layer here is a trained network layer that can be used directly to obtain the action type of the moving object in the image data. Specifically, after the action feature information of the image data of the target time series frame is acquired, it is input into the action classification network layer, and the action type of the moving object in the image data of the target time series frame is obtained through the action classification network layer.
  • Further, in one embodiment, the network structure of the residual network layer can be as shown in Figure 8b, which includes three convolutional neural networks: two two-dimensional convolutional layers (2D conv) with a kernel size of 1*1 at the two ends, and a two-dimensional convolutional layer with a kernel size of 3*3 in the middle.
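  • As an illustration of a residual block of that shape, here is a minimal sketch in PyTorch-style Python; the channel widths, the batch normalization, and the activation placement are assumptions rather than details taken from the patent:

      import torch
      import torch.nn as nn

      class ResidualLayer(nn.Module):
          """1x1 -> 3x3 -> 1x1 2D convolutions with an identity shortcut (cf. Figure 8b)."""

          def __init__(self, channels: int, bottleneck: int):
              super().__init__()
              self.body = nn.Sequential(
                  nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
                  nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
                  nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
                  nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
                  nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
                  nn.BatchNorm2d(channels),
              )
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              # x: time series motion feature maps of one frame, shape (N, C, H, W)
              return self.relu(x + self.body(x))   # action feature information, same shape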
  • In one embodiment, after the step of inputting the time series motion feature maps of the target time series frame into the residual network layer to obtain the action feature information of the image data of the target time series frame, the method further includes: determining the action feature information as the original sub-feature maps of the image data of the target time series frame on the different convolution channels; and re-executing the step of calculating the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame on each convolution channel.
  • Specifically, after the action feature information of the image data of the target time series frame is obtained, it can be re-determined as the original sub-feature maps of the image data of the target time series frame on the different convolution channels, and the same operations are then performed on the newly determined original sub-feature maps: the motion information weight of the original sub-feature map on each convolution channel is calculated and applied to the original sub-feature map of the corresponding convolution channel to obtain the motion information feature map of the target time series frame on each convolution channel; the time series convolution kernel is then used to convolve the motion information feature maps of the target time series frame and its adjacent time series frames on the same convolution channel, so that the motion information feature map of the target time series frame on each convolution channel fuses the motion information feature maps of the adjacent time series frames, yielding the time series motion feature map of the target time series frame on each convolution channel.
  • By determining the action feature information as new original sub-feature maps, enhancing the motion-related feature information again based on the attention mechanism, and modeling the time series information again, the ability of the action feature information to represent the action can be effectively improved; using this action feature information for subsequent action recognition effectively improves the accuracy of action recognition.
  • Taking the action recognition network model shown in Figure 2 as an example, the action information enhancement module in the figure is used to enhance the action information of the original sub-feature maps of the image data of each time series frame on the different convolution channels, obtaining the action information feature map of each time series frame on each convolution channel; the timing interaction module convolves the action information feature maps of adjacent time series frames on the same convolution channel to obtain the time series motion feature maps; and the residual network layer in the backbone performs further feature learning to obtain action feature information that better represents the action type. The action information enhancement module, the timing interaction module, and the residual network layer can together serve as one feature extraction unit; stacking multiple such feature extraction units improves the accuracy of feature learning and thereby effectively improves the accuracy of action recognition.
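  • The following sketch shows, under stated assumptions, how such a feature extraction unit could be composed and stacked; the class name FeatureExtractionUnit and the injected sub-modules are placeholders for the action information enhancement module, the timing interaction module, and the residual network layer described above:

      import torch.nn as nn

      class FeatureExtractionUnit(nn.Module):
          """One unit: motion-information enhancement -> timing interaction -> residual layer."""

          def __init__(self, enhance: nn.Module, interact: nn.Module, residual: nn.Module):
              super().__init__()
              self.enhance, self.interact, self.residual = enhance, interact, residual

          def forward(self, x):
              # x: original sub-feature maps of all frames, shape (N, T, C, H, W)
              x = self.enhance(x)                    # motion information feature maps
              x = self.interact(x)                   # time series motion feature maps
              n, t, c, h, w = x.shape
              y = self.residual(x.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
              return y                               # fed to the next unit as new sub-feature maps

      # stacking several units, as the text describes, simply chains them:
      # backbone = nn.Sequential(unit1, unit2, unit3)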
  • In one embodiment, as shown in Figure 9, the action recognition method further includes the following training steps:
  • Step S902 Obtain video samples, where the video samples include multiple image samples of different sample time series frames and the standard action types of moving objects in the image samples of each sample time series frame.
  • Step S904 Obtain the original sub-feature map samples of each image sample on different convolution channels through the multi-channel convolution layer.
  • Specifically, the image sample is input into the multi-channel convolution layer as its input data, and each convolution kernel in the multi-channel convolution layer performs a convolution calculation on the image sample to obtain the original sub-feature map sample of the convolution channel corresponding to that convolution kernel.
  • Step S906: Take each sample time series frame in turn as the target sample time series frame, and obtain the sample difference information between the original sub-feature map samples of the target sample time series frame and the original sub-feature map samples of the next sample time series frame on each convolution channel.
  • The sample difference information describes the degree of motion change of the moving object between the image samples of two sample time series frames, that is, information related to the motion of the moving object. The image sample of each sample time series frame acquired from the video sample contains information that is critical for action recognition as well as noise information that is useless or even counterproductive for action recognition. However, in the image sample of a single sample time series frame, the moving object and the background information are both static, while motion is a process of change, so it is difficult to obtain the motion information of the moving object from the image sample of a single sample time series frame alone. The difference information between the original sub-feature map samples of the target sample time series frame on each convolution channel and the original sub-feature map samples of the next sample time series frame on the corresponding convolution channels reflects the motion change of the moving object between the two sample time series frames; by obtaining this sample difference information, the motion information contained in the original sub-feature map samples of the image sample of the target sample time series frame on each convolution channel can be obtained. Specifically, the sample difference information can be obtained by calculating the difference between the original sub-feature map samples of the target sample time series frame and those of the next sample time series frame on the corresponding convolution channels.
  • Further, data dimensionality reduction can be performed on the original sub-feature map samples of the target sample time series frame on each convolution channel to obtain dimensionality-reduced unit sub-feature map samples of the target sample time series frame, and likewise on the original sub-feature map samples of the next sample time series frame on each convolution channel to obtain dimensionality-reduced unit sub-feature map samples of the next sample time series frame. The data amount of the unit sub-feature maps after dimensionality reduction is greatly reduced relative to that of the original sub-feature maps; converting the computation of the sample difference information between the original sub-feature map samples of the two frames into a computation over their dimensionality-reduced unit sub-feature map samples effectively reduces the amount of computation and increases the computation speed.
  • Step S908: Map the sample difference information on each convolution channel to the motion information weight samples of the target sample time series frame on each convolution channel through the activation function. Specifically, the activation function may be a Sigmoid function, which maps the sample difference information on each convolution channel to a weight value between 0 and 1.
  • Step S910: Obtain the motion information feature map samples of the target sample time series frame on each convolution channel according to the motion information weight samples of the target sample time series frame on each convolution channel and the original sub-feature map samples.
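  • A minimal sketch of this weight computation (steps S906 to S910), written in PyTorch-style Python, is given below; the reduction factor of 16, the layer names, and the exact placement of the pooling and 1*1 convolutions are assumptions used only for illustration:

      import torch
      import torch.nn as nn

      class MotionEnhancement(nn.Module):
          """Motion-information weights from the difference between adjacent frames (a sketch)."""

          def __init__(self, channels: int, reduction: int = 16):
              super().__init__()
              self.pool = nn.AdaptiveAvgPool2d(1)      # unit pooling: HxW -> 1x1
              self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False)
              self.expand = nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False)

          def forward(self, cur: torch.Tensor, nxt: torch.Tensor) -> torch.Tensor:
              # cur, nxt: original sub-feature maps of the target frame and the next frame, (N, C, H, W)
              a = self.squeeze(self.pool(cur))         # dimensionality-reduced unit sub-feature maps
              b = self.squeeze(self.pool(nxt))
              diff = self.expand(b - a)                # difference information, back to C channels
              weight = torch.sigmoid(diff)             # motion information weights in (0, 1)
              return cur * weight                      # motion information feature maps

      # usage sketch
      cur, nxt = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
      print(MotionEnhancement(64)(cur, nxt).shape)     # torch.Size([2, 64, 7, 7])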
  • Step S912 Perform time-series convolution on the motion information feature map samples of the target sample time-series frame on each convolution channel to obtain time-series motion feature map samples of the target sample time-series frame on each convolution channel.
  • After the motion information feature map samples of each sample time series frame on each convolution channel are obtained, the time series convolution kernel can be used to perform a convolution operation on the motion information feature map samples of the target sample time series frame and its adjacent sample time series frames on the same convolution channel, yielding the time series motion feature map samples of the target sample time series frame on each convolution channel. The time series motion feature map samples thus fuse the motion feature map samples of the preceding and following sample time series frames, that is, the action information of the moving object, realizing modeling in the time series dimension.
  • Step S914 Obtain the predicted action type of the moving object in the image sample of the target sample time series frame according to the time series motion feature map samples of the target sample time series frame in each convolution channel.
  • After the time series motion feature map samples of the image sample of the target sample time series frame are obtained, they can be used as the feature information of the image sample to obtain the action type of the moving object in the image sample of the target sample time series frame.
  • the time series motion feature map samples can be input into a 2D convolutional network for action recognition to obtain the predicted action type of the moving object in the image samples of the target sample time series frame.
  • Step S916 According to the difference between the predicted action type and the standard action type, adjust the parameters of the multi-channel convolutional layer, activation function, and time-series convolution kernel, and continue training until the training end condition is met.
  • After the predicted action type of the image sample is obtained, the difference between the predicted action type and the standard action type can be used as a loss function to adjust the parameters of the multi-channel convolution layer, the activation function, and the time series convolution kernel until the training end condition is met.
  • the training end condition here can be adjusted or set according to actual needs. For example, when the loss function satisfies the convergence condition, the training end condition can be considered to be reached; or when the number of training times reaches the preset number, the training end condition can be considered to be reached.
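  • As one hedged illustration of such a training iteration, the sketch below uses cross-entropy as the loss on the difference between predicted and standard action types and stochastic gradient descent for the parameter adjustment; both choices, and the helper name train_step, are assumptions, since the text only requires that the difference be used as a loss function:

      import torch
      import torch.nn as nn

      def train_step(model, optimizer, frames, labels):
          """One training iteration on a video sample (frames: (N, T, C, H, W), labels: (N, T))."""
          criterion = nn.CrossEntropyLoss()
          logits = model(frames)                                    # predicted action type per frame, (N, T, K)
          loss = criterion(logits.flatten(0, 1), labels.flatten())  # difference to the standard action type
          optimizer.zero_grad()
          loss.backward()            # gradients reach the conv layers and the time series kernels
          optimizer.step()
          return loss.item()

      # usage sketch (model and optimizer are placeholders for the recognition network):
      # optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
      # for frames, labels in loader:
      #     train_step(model, optimizer, frames, labels)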
  • In one embodiment, an action recognition method includes the following steps:
  • 1. Obtain the image data of the video data in different time series frames, and obtain the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer.
  • 2. Take each time series frame in turn as the target time series frame, and calculate the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the adjacent next time series frame on each convolution channel. This includes: 2-1. obtaining the difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel, by (2-1-1) transforming the original sub-feature maps of both frames on each convolution channel into unit sub-feature maps through the unit pooling layer, (2-1-2) reducing the dimensionality of the unit sub-feature maps of both frames on each convolution channel by a preset zoom factor to obtain dimensionality-reduced unit sub-feature maps, (2-1-3) obtaining the dimensionality-reduction difference information between the dimensionality-reduced unit sub-feature maps of the two frames, and (2-1-4) raising the dimensionality of the dimensionality-reduction difference information by the preset zoom factor to obtain the difference information on each convolution channel; and 2-2. mapping the difference information on each convolution channel to the motion information weights of the target time series frame on each convolution channel through the activation function.
  • 3. Obtain the motion information feature map of the target time series frame on each convolution channel according to the motion information weights and the original sub-feature maps of the target time series frame on each convolution channel.
  • 4. Perform time series convolution on the motion information feature maps of the target time series frame on each convolution channel to obtain the time series motion feature maps of the target time series frame on each convolution channel: (4-1) obtain the motion information feature maps of the previous time series frame and of the next time series frame adjacent to the target time series frame on each convolution channel; (4-2) use the time series convolution kernel to convolve the motion information feature maps of the target, previous, and next time series frames on the same convolution channel.
  • 5. Identify the action type of the moving object in the image data of the target time series frame according to the time series motion feature maps of the target time series frame on each convolution channel: (5-1) input the time series motion feature maps of the target time series frame into the residual network layer to obtain the action feature information of the image data; (5-2) input the action feature information into the action classification network layer to identify the action type.
  • 6. After the action type of the moving object in the image data of each time series frame is obtained, determine the action type of the video data according to the action types of the time series frames (a minimal fusion sketch follows this list).
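  • A minimal sketch of one possible fusion of per-frame results into a video-level action type follows; averaging the frame scores is an assumed fusion rule, not one prescribed by the text:

      import torch

      def video_action_type(frame_logits: torch.Tensor) -> int:
          """Fuse per-frame predictions into one action type for the whole video.

          frame_logits: (T, K) scores for T time series frames over K action types.
          Averaging the frame scores is one simple fusion choice (an assumption here);
          the action type of the video is the class with the highest fused score.
          """
          return int(frame_logits.mean(dim=0).argmax().item())

      # usage sketch: 8 frames, 5 candidate action types
      print(video_action_type(torch.randn(8, 5)))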
  • Further, the action recognition method is described with reference to the action recognition network model shown in Figure 2 and to Figure 10. In Figure 10, the left column shows the image data of two time series frames captured from a video in chronological order: the first column of the left group is the image data of the target time series frame and the second column is the image data of the next time series frame. In the right group, the first column is a visualization of the original sub-feature map corresponding to the image data of the target time series frame; the second column is a visualization of the motion information feature map obtained after the original sub-feature map passes through the action information enhancement module; and the third column is a visualization of the time series motion feature map obtained after the motion information feature map passes through the timing interaction module. It can be seen from Figure 10 that the original sub-feature map includes information that is critical for action recognition as well as noise information that is useless or even counterproductive for action recognition; the noise is considerable and the contour of the moving object is blurry. In the motion information feature map obtained after the action information enhancement module, the contour of the moving object becomes clearer and the background noise unrelated to the action information is suppressed to a certain extent. The time series motion feature map obtained after the timing interaction module contains not only the information of the target time series frame image data in the first column of the left group, but also the information of the next time series frame image data in the second column, achieving the purpose of modeling the time series information.
  • Further, the operations on the data in steps 2 to 4 above are carried out in the convolution channel dimension: the feature maps of different convolution channels (whether original sub-feature maps or motion information feature maps) are independent of each other, and the information of the feature maps of adjacent convolution channels is not mixed, so the amount of computation remains low and the computation speed is high. Likewise, the action information enhancement module and the timing interaction module in Figure 2 both operate on the convolution channels, that is, on the feature maps (original sub-feature maps or motion information feature maps) of a single time series frame or of multiple time series frames on each convolution channel; the feature maps of different convolution channels are independent of each other and are not mixed, keeping the amount of computation low and the computation speed high.
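  • The saving from keeping the convolution channels independent can be illustrated by comparing parameter counts; in the sketch below, the channel count of 256 is an arbitrary example, and the grouped Conv1d stands in for the per-channel time series convolution kernel:

      import torch.nn as nn

      # Channel-independent temporal kernels (groups=C) versus a full temporal convolution
      # that mixes channels: the grouped version needs C*3 weights instead of C*C*3,
      # which is why the per-channel design keeps the computation low.
      C = 256
      independent = nn.Conv1d(C, C, kernel_size=3, padding=1, groups=C, bias=False)
      mixing = nn.Conv1d(C, C, kernel_size=3, padding=1, bias=False)
      count = lambda m: sum(p.numel() for p in m.parameters())
      print(count(independent), count(mixing))   # 768 vs 196608 parameters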
  • In one embodiment, as shown in Figure 11, an action recognition method includes:
  • Step S1102 Obtain real-time surveillance video data.
  • This embodiment is applied to a real-time monitoring scenario, and the video data is surveillance video data acquired in real time. The surveillance video data may be real-time video captured by a camera, and the images of the surveillance video data include the moving object to be monitored.
  • Step S1104 Extract the image data of the surveillance video data in different time series frames, and obtain the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer.
  • the image data of different time series frames refers to the image data extracted from the surveillance video data according to the time sequence of shooting, which includes the image data of all time series frames in the surveillance video data.
  • The image data of the surveillance video data in different time series frames may specifically be obtained sequentially according to the arrangement order of the image data in the video data.
  • The original sub-feature map refers to feature information that characterizes the image data. The multi-channel convolution layer refers to the network model used to obtain the feature information from the image data; the multi-channel convolution layer here is a trained network model and can be used directly to obtain the feature information of the image data. The multi-channel convolution layer includes multiple convolution kernels, and the convolution channels are determined by the multi-channel convolution layer: the number of convolution kernels used to extract features from the image data in the multi-channel convolution layer is the number of convolution channels.
  • Specifically, the image data of each time series frame in the surveillance video is input into the multi-channel convolution layer as its input data, and each convolution kernel in the multi-channel convolution layer performs a convolution calculation on the image data to obtain the original sub-feature map of the convolution channel corresponding to each convolution kernel.
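  • A minimal sketch of such a multi-channel convolution layer follows; the 64 output channels, the 7*7 kernel, and the 224*224 input size are illustrative assumptions, the only point being that each convolution kernel yields one original sub-feature map:

      import torch
      import torch.nn as nn

      # A trained multi-channel convolution layer produces one original sub-feature map per
      # convolution kernel; the number of output channels below (64) is an illustrative choice.
      multi_channel_conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7,
                                     stride=2, padding=3, bias=False)

      frame = torch.randn(1, 3, 224, 224)           # image data of one time series frame (RGB)
      sub_feature_maps = multi_channel_conv(frame)  # original sub-feature maps, (1, 64, 112, 112)
      print(sub_feature_maps.shape)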
  • Step S1106: Determine the target time series frame, and calculate the motion information weights of the target time series frame on each convolution channel based on the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame adjacent to the target time series frame on each convolution channel.
  • The target time series frame refers to the time series frame corresponding to the image data acquired at the current time, and the next time series frame refers to the time series frame corresponding to the next time relative to the target time series frame.
  • The image data of each time series frame obtained from the surveillance video data contains information that is critical for action recognition, such as the appearance information of the moving object, and also contains noise information that is useless or even counterproductive for action recognition, such as noise or background information in the image data. After obtaining the correlation between the original sub-feature maps of the target time series frame on the different convolution channels and the action information of the moving object, that is, the motion information weights, the feature information in the original sub-feature maps that are more relevant to the motion of the moving object can be enhanced, that is, the original sub-feature maps on those convolution channels are allocated more attention, while the original sub-feature maps containing less motion information or more noise information are suppressed, that is, allocated less attention. In this way, information that is beneficial to action recognition is enhanced and information that is irrelevant or even harmful to action recognition is suppressed, which effectively improves the accuracy of action recognition.
  • Since the moving object and the background information in the image data of a single time series frame are both static, while motion is a process of change, the image data of the target time series frame and the image data of the next time series frame are both needed to describe the change process of the action of the moving object, so as to improve the accuracy of action recognition.
  • Specifically, the motion information weights corresponding to the original sub-feature maps of the target time series frame on each convolution channel are obtained according to the original sub-feature maps of the image data of the target time series frame on the different convolution channels and the original sub-feature maps of the image data of the next time series frame on the different convolution channels. Further, the degree of difference between the original sub-feature maps of the image data of the target time series frame on each convolution channel and the original sub-feature maps of the image data of the next time series frame on the corresponding convolution channels can be calculated first, and the motion information weight of the image data of the target time series frame on each convolution channel is then determined according to the degree of difference between the original sub-feature maps on that convolution channel.
  • Step S1108: According to the motion information weights of the target time series frame on each convolution channel and the original sub-feature maps of the target time series frame on each convolution channel, obtain the motion information feature maps of the target time series frame on each convolution channel.
  • Step S1110: Perform time series convolution on the motion information feature maps of the target time series frame on each convolution channel to obtain the time series motion feature maps of the target time series frame on each convolution channel. Specifically, the time series frames to be convolved are determined according to the target time series frame, and the motion information feature maps of the target time series frame and of the time series frames to be convolved on the same convolution channel are convolved to obtain the time series motion feature map of the target time series frame on each convolution channel. The time series motion feature map thus fuses the motion feature maps of the preceding and following time series frames, that is, the action information of the moving object in those frames, realizing modeling in the time series dimension.
  • the method for acquiring the motion information feature map of the time series frame to be convolved on each convolution channel is the same as the method for acquiring the motion information feature map of the target time series frame on each convolution channel.
  • Step S1112 Identify the action type of the moving object in the image data of the target time series frame according to the time series motion feature map of the target time series frame in each convolution channel.
  • The time series motion feature maps can be determined as the feature information of the image data of the target time series frame, and the action type of the moving object in the image data of the target time series frame can be identified according to this feature information.
  • Specifically, the time series motion feature maps can be input into a 2D convolutional network for action recognition to identify the action type of the moving object in the image data of the target time series frame.
  • The time series motion feature maps contain both strong motion-related information and time series information, so using them for action recognition can effectively improve the accuracy of action recognition.
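  • The sketch below shows a simple classification head of the kind such a 2D convolutional network would end with (global pooling followed by a fully connected layer); the channel count of 2048 and the class name are assumptions and do not reproduce the ResNet-50 backbone itself:

      import torch
      import torch.nn as nn

      class ActionClassificationHead(nn.Module):
          """Pooling plus a fully connected layer mapping features to K action types (a sketch)."""

          def __init__(self, channels: int, num_actions: int):
              super().__init__()
              self.pool = nn.AdaptiveAvgPool2d(1)
              self.fc = nn.Linear(channels, num_actions)

          def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
              # feature_map: time series motion feature maps / action feature information, (N, C, H, W)
              logits = self.fc(self.pool(feature_map).flatten(1))
              return logits.argmax(dim=1)        # index of the recognized action type per frame

      # usage sketch
      head = ActionClassificationHead(channels=2048, num_actions=10)
      print(head(torch.randn(2, 2048, 7, 7)).shape)    # torch.Size([2])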
  • Step S1114 Determine the action type as the action information of the moving object in the current surveillance video data.
  • After the action type of the moving object in the image data of the target time series frame is identified, the action type is determined as the action information of the moving object in the current surveillance video data, so that the action information of the moving object is updated in real time without anyone having to watch the surveillance video, ensuring real-time monitoring of the moving object. Further, the action information can be displayed on a display device, so that monitoring personnel can learn the motion state of the moving object in the surveillance video.
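  • A schematic monitoring loop under these steps might look like the sketch below; the iterables and callables (video_frames, recognizer, display) are placeholders for the camera stream, the trained recognition network, and the display device, and the three-frame buffer is an assumed minimal window:

      import torch

      def monitor(video_frames, recognizer, display):
          """Run per-frame action recognition over a surveillance stream and surface the result.

          video_frames: iterable yielding image tensors in shooting order, shape (C, H, W);
          recognizer:   callable mapping a short clip tensor (1, T, C, H, W) to an action type;
          display:      callable that shows the action information to monitoring personnel.
          All three are placeholders for components described in the text.
          """
          buffer = []
          for frame in video_frames:
              buffer.append(frame)
              if len(buffer) < 3:            # a next frame is needed before weights can be computed
                  continue
              clip = torch.stack(buffer[-3:]).unsqueeze(0)   # target frame plus its neighbours
              action_type = recognizer(clip)
              display(action_type)           # action information of the moving object, in real time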
  • In one embodiment, as shown in Figure 12, an action recognition apparatus 1200 is provided, which is set in a computer device.
  • The apparatus includes: an image acquisition module 1202, a weight acquisition module 1204, a feature determination module 1206, a timing interaction module 1208, and an action recognition module 1210, where:
  • the image acquisition module 1202 is configured to acquire the image data of the video data in different time series frames, and acquire the original sub-feature maps of the image data of each time series frame on different convolution channels through the multi-channel convolution layer;
  • the weight acquisition module 1204 is configured to take each time series frame in turn as the target time series frame, and calculate the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame adjacent to the target time series frame on each convolution channel;
  • the feature determination module 1206 is configured to obtain the motion information feature map of the target time series frame on each convolution channel according to the motion information weights of the target time series frame on each convolution channel and the original sub-feature maps of the target time series frame on each convolution channel;
  • the timing interaction module 1208 is used to perform timing convolution on the motion information feature map of the target timing frame on each convolution channel to obtain the timing motion feature map of the target timing frame on each convolution channel;
  • the action recognition module 1210 is used for recognizing the action type of the moving object in the image data of the target time series frame according to the time series motion feature map of each convolution channel of the target time series frame.
  • the weight obtaining module 1204 includes:
  • the difference information acquisition module 1204a is configured to acquire difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel;
  • the weight mapping module 1204b is used to map the difference information on each convolution channel to the motion information weight of the target time series frame on each convolution channel through an activation function.
  • In one embodiment, the difference information acquisition module is configured to: transform the original sub-feature map of the target time series frame on each convolution channel and the original sub-feature map of the next time series frame on each convolution channel into unit sub-feature maps through the unit pooling layer; reduce the dimensionality of the unit sub-feature maps of the target time series frame and of the next time series frame on each convolution channel by a preset zoom factor to obtain dimensionality-reduced unit sub-feature maps; obtain the dimensionality-reduction difference information between the dimensionality-reduced unit sub-feature map of the target time series frame and that of the next time series frame; and raise the dimensionality of the dimensionality-reduction difference information by the preset zoom factor to obtain the difference information between the original sub-feature map of the target time series frame and the original sub-feature map of the next time series frame on each convolution channel.
  • In one embodiment, the timing interaction module is configured to: obtain the motion information feature maps of the previous time series frame adjacent to the target time series frame on each convolution channel and the motion information feature maps of the next time series frame adjacent to the target time series frame on each convolution channel; and convolve the motion information feature maps of the target time series frame, the previous time series frame, and the next time series frame on the same convolution channel using the time series convolution kernel to obtain the time series motion feature map of the target time series frame on each convolution channel.
  • In one embodiment, the action recognition module is configured to input the time series motion feature maps of the target time series frame into the residual network layer to obtain the action feature information of the image data of the target time series frame, and to input the action feature information into the action classification network layer to obtain the action type of the moving object in the image data of the target time series frame.
  • In one embodiment, the timing interaction module is further configured to determine the action feature information as the original sub-feature maps of the image data of the target time series frame on the different convolution channels, and to cause the weight acquisition module 1204 to again calculate the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame adjacent to the target time series frame on each convolution channel.
  • the action recognition module is further configured to determine the action type of the video data according to the action type of each time series frame after obtaining the action type of the moving object in the image data of each time series frame.
  • In one embodiment, the action recognition apparatus further includes a training module configured to: obtain video samples, where the video samples include image samples of multiple different sample time series frames and the standard action types of the moving objects in the image samples of each sample time series frame; obtain the original sub-feature map samples of each image sample on different convolution channels through the multi-channel convolution layer; take each sample time series frame in turn as the target sample time series frame, and obtain the sample difference information between the original sub-feature map samples of the target sample time series frame and the original sub-feature map samples of the next sample time series frame on each convolution channel; map the sample difference information on each convolution channel to the motion information weight samples of the target sample time series frame on each convolution channel through the activation function; obtain the motion information feature map samples of the target sample time series frame on each convolution channel according to the motion information weight samples and the original sub-feature map samples of the target sample time series frame on each convolution channel; perform time series convolution on the motion information feature map samples of the target sample time series frame on each convolution channel to obtain the time series motion feature map samples of the target sample time series frame on each convolution channel; obtain the predicted action type of the moving object in the image sample of the target sample time series frame according to the time series motion feature map samples; and adjust the parameters of the multi-channel convolution layer, the activation function, and the time series convolution kernel according to the difference between the predicted action type and the standard action type, continuing training until the training end condition is met.
  • In one embodiment, an action recognition apparatus is provided, which is set in a computer device. The apparatus includes: an image acquisition module, a weight acquisition module, a feature determination module, a timing interaction module, and an action recognition module, where:
  • the image acquisition module is used to acquire real-time surveillance video data; extract image data of surveillance video data in different time series frames, and obtain the original sub-feature maps of image data of each time series frame on different convolution channels through a multi-channel convolution layer.
  • the weight acquisition module is configured to take each time series frame in turn as the target time series frame, and calculate the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame adjacent to the target time series frame on each convolution channel;
  • the feature determination module is used to obtain the motion information feature map of the target time series frame on each convolution channel according to the weight of the motion information and the original sub-feature map of the target time series frame on each convolution channel.
  • the timing interaction module is used to perform timing convolution on the motion information feature map to obtain the timing motion feature map of the target timing frame on each convolution channel.
  • the action recognition module is configured to identify the action type of the moving object in the image data of the target time series frame according to the time series motion feature maps, and to determine the action type as the action information of the moving object in the current surveillance video data.
  • Each module in the above action recognition apparatus can be implemented in whole or in part by software, by hardware, or by a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • Fig. 14 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be the server 102 in FIG. 1.
  • the computer equipment includes one or more processors, memories, network interfaces, input devices, and display screens connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and may also store computer-readable instructions.
  • When the computer-readable instructions are executed by the one or more processors, the one or more processors can implement the action recognition method.
  • the internal memory may also store computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors can execute the action recognition method.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen.
  • The input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad set on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 14 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; the specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • the action recognition apparatus provided in the present application may be implemented in a form of computer-readable instructions, and the computer-readable instructions may run on the computer device as shown in FIG. 14.
  • the memory of the computer device can store various program modules that make up the action recognition apparatus, such as the image acquisition module 1202, the weight acquisition module 1204, the feature determination module 1206, the timing interaction module 1208, and the action recognition module 1210 shown in FIG. 12.
  • the computer-readable instructions formed by each program module cause one or more processors to execute the steps in the action recognition method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 14 may execute step S302 through the image acquisition module 1202 in the motion recognition apparatus shown in FIG. 12.
  • the computer device may execute step S304 through the weight obtaining module 1204.
  • the computer device can execute step S306 through the feature determination module 1206.
  • the computer device may execute step S308 through the time sequence interaction module 1208.
  • the computer device may execute step S310 through the motion recognition module 1210.
  • In one embodiment, a computer device is provided, including a memory and one or more processors. The memory stores computer-readable instructions, and when the computer-readable instructions are executed by the one or more processors, the one or more processors execute the steps of the above action recognition method. Here, the steps of the action recognition method may be the steps in the action recognition method of each of the above embodiments.
  • In one embodiment, one or more computer-readable storage media are provided, storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the above action recognition method. Here, the steps of the action recognition method may be the steps in the action recognition method of each of the above embodiments.
  • the “multiple” in each embodiment of the present application means at least two.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).


Abstract

An action recognition method, comprising: obtaining, through a multi-channel convolution layer, the original sub-feature maps of the image data of each time series frame on different convolution channels; taking each time series frame in turn as the target time series frame, calculating the motion information weights of the target time series frame on each convolution channel according to the original sub-feature maps of the target time series frame on each convolution channel and the original sub-feature maps of the next time series frame on each convolution channel, and obtaining the motion information feature maps of the target time series frame on each convolution channel according to the motion information weights; performing time series convolution on the motion information feature maps of the target time series frame on each convolution channel to obtain the time series motion feature maps of the target time series frame on each convolution channel; and identifying the action type of the moving object in the image data of the target time series frame according to the time series motion feature maps of the target time series frame on each convolution channel.

Description

动作识别方法、装置、计算机存储介质和计算机设备
本申请要求于2019年11月20日提交中国专利局,申请号为2019111430082,申请名称为“动作识别方法、装置、计算机可读存储介质和计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域、更涉及图像处理技术领域,特别是涉及一种动作识别方法、装置、计算机可读存储介质和计算机设备。
背景技术
随着计算机技术和人工智能技术的发展,动作识别技术从图像领域扩展到了视频领域。传统方法中,对视频数据进行动作识别,一般是使用二维卷积网络神经网络对视频数据中的每一帧图像进行识别,最终对该视频数据的所有帧的动作识别结果进行融合,得到对视频数据的动作识别结果。但是在强调运动对象动作变化的场景中,即使打乱视频数据中各帧图像的顺序,并不会影响二维卷积网络神经网络对视频数据中动作类型的识别结果。因此,使用二维卷积神经网络进行动作识别的精度低。
发明内容
根据本申请提供的各种实施例,提供一种动作识别方法、装置、计算机可读存储介质和计算机设备。
根据本申请的一个方面,提供了一种动作识别方法,由计算机设备执行,包括:
获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
分别以每个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
根据运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
对运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;及
根据时序运动特征图识别目标时序帧的图像数据中运动对象的动作类 型。
根据本申请的一个方面,提供了一种动作识别方法,由计算机设备执行,包括:
获取实时的监控视频数据;
提取所述监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
分别以每个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
根据运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
对运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;及
根据时序运动特征图,识别目标时序帧的图像数据中运动对象的动作类型;
将所述动作类型确定为当前所述监控视频数据中运动对象的动作信息。
根据本申请的一个方面,提供了一种动作识别装置,设置于计算机设备中,包括:
图像获取模块,用于获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
权重获取模块,用于分别以各个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
特征确定模块,用于根据目标时序帧在各卷积通道上的运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
时序交互模块,用于对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;
动作识别模块,用于根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型。
根据本申请的一个方面,提供了一种动作识别装置,设置于计算机设备 中,包括:
图像获取模块,用于获取实时的监控视频数据;提取所述监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
权重获取模块,用于分别以每个所述时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
特征确定模块,用于根据所述运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
时序交互模块,用于对所述运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;及
动作识别模块,用于根据时序运动特征图,识别目标时序帧的图像数据中运动对象的动作类型;将所述动作类型确定为当前所述监控视频数据中运动对象的动作信息。
一个或多个计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行本申请各实施例的动作识别方法中的步骤:
一种计算机设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行本申请各实施例的动作识别方法中的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。基于本申请的说明书、附图以及权利要求书,本申请的其它特征、目的和优点将变得更加明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为一个实施例中动作识别方法的应用环境图;
图2是一个实施例中动作识别网络模型的结构示意图;
图3为一个实施例中动作识别方法的流程示意图;
图4为一个实施例中时序运动特征图生成步骤的示意图;
图5为一个实施例中计算运动信息权重步骤的流程示意图;
图6a为一个实施例中差异信息获取步骤的流程示意图;
图6b为一个实施例中计算运动信息权重的示意图;
图7为一个实施例中时序运动特征图生成步骤的流程示意图;
图8a为一个实施例中根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型步骤的流程示意图;
图8b为一个实施例中残差网络层的结构示意图;
图9为一个实施例中参数训练步骤的流程示意图;
图10为一个实施中原始子特征图、运动信息特征图以及时序运动特征图的可视化示意图;
图11为另一个实施例中动作识别方法的流程示意图;
图12为一个实施例中动作识别装置的结构框图;
图13为一个实施例中权重获取模块的结构框图;
图14为一个实施例中计算机设备的结构框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
图1为一个实施例中动作识别方法的应用环境图。参见图1,该动作识别方法应用于一种计算机设备中,该计算机设备可以是终端或者服务器。如图1所示,以该计算机设备是服务器为例,计算机设备中部署有动作识别网络模型,该动作识别网络是根据本申请所提供的动作识别方法对应构建的网络模型。服务器从视频数据中提取多个时序帧的图像数据,如图1中所示,从视频数据中获取的多个时序帧的图像数据均包含运动对象,服务器通过将提取到的多个时序帧的图像数据输入至动作识别网络模型中,动作识别网络模型对从视频数据中获得的每一时序帧的图像数据进行动作识别,得到每一时序帧图像数据对应的动作类型,后续可对从视频数据提取到的所有时序帧的图像数据对应的动作类型进行融合,得到对视频数据的动作识别结果。
比如,在一个示例性的应用场景中,视频数据可以是实时的监控视频,通过将实时的监控视频输入至动作识别模型中,以对监控视频中每一时序帧的图像数据中的监控对象的实时动作进行识别,获得监控视频中每一帧的图像数据中监控对象的动作信息,实现对监控对象的实时监控,无需通过人工观看视频数据获知监控对象的行为动作。
又比如,在一个示例性的应用场景中,视频数据可以是手语视频,通过将手语视频输入至动作识别模型中,以对手语视频中每一时序帧的图像数据中的手部动作进行识别,获得手语视频中每一时序帧的图像数据对应的手语动作信息,实现手语翻译。
图2是一个实施例中动作识别网络模型的结构示意图,如图2所示,动作识别网络模型中包括多通道卷积层、动作信息增强模块、时序交互模块以及主干网络层。其中,在获取视频数据在不同时序帧的图像数据后,多通道卷积层用于获取每一时序帧的图像数据的原始特征图,其中原始特征图包括 在不同卷积通道上的原始子特征图;动作信息增强模块用于对每一时序帧的图像数据在不同卷积通道上的原始子特征图进行动作信息增强,得到每一时序帧的图像数据在不同卷积通道上的动作信息特征图;时序交互模块用于对前后相邻时序帧的图像数据的动作信息特征图在相同卷积通道上进行卷积运算,得到时序运动特征图,该时序运行特征图融合了前后相邻时序帧的运动信息;主干网络层用于根据时序运动特征图获取图像数据中运动对象的动作类型。
在一个实施例中,主干网络层是用于动作识别的2D卷积网络,由依次连接的多个网络层构成,如图2示出的动作识别网络模型中,主干网络层由依次连接的3层子网络层构成。可选的,主干网络层可以是ResNet-50卷积神经网络。
如图3所示,在一个实施例中,提供了一种动作识别方法。本实施例主要以该方法应用于上述图1中的服务器102来举例说明。参照图3,该动作识别方法具体包如下步骤:
步骤S302,获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图。
其中,视频数据可以是任意视频数据。从内容上来讲,视频数据是指包括有运动对象的视频,例如可以是舞蹈类视频、监控类视频、手语类视频等,从来源上讲,视频数据可以是通过摄像头拍摄的监控视频,也可以是由其他设备发送获得的视频数据。
其中,不同时序帧的图像数据是指按照时间顺序从视频数据中抽取的图像数据,其中,可以包括视频数据中所有时序帧的图像数据,也可包括部分连续时序帧的图像数据。获取视频数据在不同时序帧的图像数据,具体可以是按照视频数据中图像数据的排列顺序依次获取,也可以是以一定的采样频率从视频数据中提取的,例如,以视频数据第一帧的图像数据作为第一时序帧的图像数据,然后根据视频数据中图像数据的排列顺序,以一定的采样频率抽取后续时序帧的图像数据。应该理解的是,图像数据的帧数量,可以是根据动作识别的复杂度要求确定的,或者根据视频数据中的图像数据帧数量确定的。
其中,原始子特征图是指表征图像数据的特征信息;多通道卷积层是指用于获取图像数据中的特征信息的网络模型,这里的多通道卷积层是已训练好的网络模型,可直接用来获取图像数据的特征信息。其中,多通道卷积层 包括多个卷积核,卷积通道是由多通道卷积层所决定的,多通道卷积层中用于抽取图像数据的卷积核的数量,即为卷积通道数量。具体地,将图像数据作为多通道卷积层的输入数据输入至多通道卷积层中,多通道卷积层中的各个卷积核对图像数据进行卷积计算,获取与各个卷积核对应的卷积通道的原始子特征图。
比如,以灰度图为例,从视频数据中获得的不同时序帧的图像数据为灰度图,将该灰度图输入至多通道卷积层中,获取多通道卷积层输出原始特征图,其中原始特征图的数据维度为C,H,W,其中,H、W标识原始特征图的长度以及宽度,C表示原始特征图的通道维度,即原始特征图包括有C张原始子特征图。
步骤S304,分别以各个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重。
其中,后一时序帧是指相对于目标时序帧,下一时刻所对应的时序帧,例如,目标时序帧为第t帧,即目标时序帧的图像数据是从视频数据中获取到的第t帧的图像数据,则后一时序帧为第(t+1)帧,即后一时序帧的图像数据是从视频数据中获取到的第(t+1)帧的图像数据。
其中,运动信息权重是指对目标时序帧的图像数据在不同通道上的原始子特征图的注意力分配的概率分布;运动性权重的大小,与目标时序帧的图像数据在不同卷积通道上的原始子特征图与运动对象的动作信息的相关性相关,也可以说与目标时序帧的图像数据在不同卷积通道上的原始子特征图中所包含的运动信息的多少相关。可以理解的是,当目标时序帧的图像数据在某一卷积通道上的原始子特征图与运动对象的动作信息的相关性越大,包含的运动信息越多,则在该卷积通道上的原始子特征图分配到的注意力越多,即运动运动信息权重越大。
在视频数据中获取的每一时序帧的图像数据中,都包含有对动作识别而言关键的信息,例如运动对象的表观信息,也包含有对动作识别而已无用甚至起反作用的噪声信息,例如图像数据中的噪声或背景信息。在获取目标时序帧的图像数据在不同卷积通道上的原始子特征图与运动对象的动作信息的相关性,即运动信息权重后,可通过增大包含与运动对象动作信息更相关的原始子特征图中的特征信息,即该卷积通道上的原始子特征图分配更多的注意力,而抑制包含较少的运动对象的动作信息或包含更多噪声信息的原始子 特征图,即该卷积通道上的原始子特征图分配较少的注意力,实现对对动作识别有利的信息得到增强而对动作识别无关甚至有害的信息得到一致,有效提高动作识别的准确性。
由于单一的时序帧的图像数据中,运动对象以及背景信息都是静态的,而运动是一个动作变化的过程,因此需要通过目标时序帧的图像数据及其后一时序帧的图像数据实现对运动对象的动作变化过程进行描述,以提高动作识别的准确度。具体地,在获取到每一时序帧的图像数据在各个卷积通道上的原始子特征后,对于每一个时序帧的图像数据,以其本身作为目标时序帧的图像数据,从而根据目标时序帧的图像数据在不同卷积通道上的原始子特征图,及其后一时序帧的图像数据在不同卷积通道上原始子特征图,获取目标时序帧在各个卷积通道上的原始子特征图对应的运动信息权重。
进一步地,获取目标时序帧在各个卷积通道上的原始子特征图所包含的运动信息权重,具体可以先计算目标时序帧的图像数据在各个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上原始子特征图间的差异度,然后根据各个卷积通道上原始子特征图间的差异度确定目标时序帧的图像数据在各个卷积通道上对应的运动信息权重。
步骤S306,根据目标时序帧在各卷积通道上的运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图。
其中,在获取到目标时序帧在各卷积通道上的运动信息权重后,可将目标时序帧在各卷积通道上的运动信息权重加到目标时序帧在对应卷积通道的原始子特征图中,以获取目标时序帧在各卷积通道上的运动信息特征图。
由于运动信息权重用于描述在目标时序帧的图像数据在不同卷积通道上的原始子特征图与运动对象的动作信息的相关性,通过将每个卷积通道上的运动信息权重以及对应通道上原始子特征图进行相乘,获取目标时序帧在各个卷积通道上的动作信息特征图,使得与运动对象的动作信息相关性较强的原始子特征图得到增强,而与运动对象的动作信息相关性较弱的原始子特征图得到抑制,实现对对动作识别有利的信息得到增强而对动作识别无关甚至有害的信息得到抑制,使得动作信息特征图包含更多与运动对象的动作信息,利于后续对运动对象的动作识别,有效提高动作识别的准确性。
步骤S308,对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图。
其中,对目标时序帧在在各个卷积通道上的运动信息特征图进行时序卷积,具体可以是根据目标时序帧确定待卷积时序帧,对目标时序帧与待卷积时序帧在同一卷积通道上的运动信息特征图进行卷积。其中,待卷积时序帧是指与目标时序帧相邻的时序帧,可以包括目标时序帧的前、后两个时序帧,也可以包括目标时序帧的前、后四个时序帧等。例如,目标时序帧为第t帧,待卷积时序帧可以包括与目标时序帧相邻的前、后两个时序帧,即待卷积时序帧可以包括第(t-1)帧以及第(t+1)帧,也就是说,针对第t帧,对第(t-1)帧、第t帧以及第(t+1)帧在同一卷积通道上的运动信息特征图进行卷积,以获取第t帧的在各卷积通道上的时序运动特征图;待卷积时序帧还包括与目标时序帧相邻的前、后两个时序帧,即待卷积时序帧包括第(t-2)帧、第(t-1)帧、第(t+1)帧以及第(t+2)帧,此时针对第t帧,对第(t-2)帧、第(t-1)帧、第t帧以及第(t+1)帧以及第(t+2)帧在同一卷积通道上的运动信息特征图进行卷积,以获取第t帧的在各卷积通道上的时序运动特征图。
具体地,在得到各个时序帧在各个卷积通道上的运动信息特征图后,可以将与目标时序帧相邻的时序帧确定为待卷积时序帧,并对目标时序帧以及待卷积时序帧在同一卷积通道上的运动信息特征图进行卷积运算,以获取目标时序帧在各个卷积通道上的时序运动特征图,使得时序运动特征图中融合了前后时序帧的运动特征图,即运动对象的动作信息,实现在时序这一维度进行建模。其中,待卷积时序帧在各个卷积通道上的运动信息特征图的获取方法,与目标时序帧在各个卷积通道上的运动信息特征图的获取方法相同。
如图4所示,图4为一个实施例中对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图的示意图,图中左边的矩阵图表示各个时序帧在各个卷积通道上的运动信息特征图,右边矩阵图表示各个时序帧在各个卷积通道上的时序运动特征图;图中矩阵图的横轴表示卷积通道的维度,纵轴表示时序帧的维度,以左边的矩阵图为例,图中左边的矩阵图第一行表示第1时序帧在各个卷积通道上的运动信息特征图,第二行表示第2时序帧在各个卷积通道上的运动信息特征图,依次类推。以第2时序帧作为目标时序帧、卷积时序帧包括第2时序帧的前一时序帧以及第2时序帧的后一时序帧为例,对于第2时序帧的时序运动特征图,利用一个3*1的卷积核对第1时序帧在第1卷积通道上的运动信息特征图、第2时序帧在第1卷积通道上的运动信息特征图以及第3时序帧 在第1卷积通道上的运动信息特征图进行卷积运算,以获取第2时序帧在第1卷积通道的时序运动特征图,同样的,利用一个3*1的卷积核对第1时序帧在第2卷积通道上的运动信息特征图(图中A1)、第2时序帧在第2卷积通道上的运动信息特征图(图中A2)以及第3时序帧在第2卷积通道上的运动信息特征图(图中A3)进行卷积运算,以获取第2时序帧在第2卷积通道的时序运动特征图(图中B),以此类推,获得第2时序帧在各个卷积通道上的时序运动特征图。对于任意一个时序帧,可以利用其前后相邻的相邻时序帧,在各个卷积通道上进行时间维度上的卷积运算,使得运算后的时序运动特征图融合了前后时序帧的运动特征图,即运动对象的动作信息。
应该理解的是,如图4所示,对于第1时序帧以及最后一个的第4时序帧,由于没有前一时序帧或后一时序帧的图像数据,可以将第1时序帧的前一时序帧以及最后一个的第4时序帧的后一时序帧进行填0操作。
步骤S310,根据目标时序帧在各卷积通道的时序运动特征图获取目标时序帧的图像数据中运动对象的动作类型。
其中,在得到目标时序帧的图像数据的时序运动特征后,可以利用时序运动特征图作为图像数据的特征信息,识别目标时序帧的图像数据中运动对象的动作类型。时序运动特征图中即包括较强的与运动相关的信息,又包括时序信息,利用时序运动特征图进进行动作识别,可有效提高动作识别的精确度。
具体地,可以将时序运动特征图作为图像数据的特征信息,输入至用于动作识别的2D卷积网络中,以识别目标时序帧的图像数据中运动对象的动作类型。其中,2D卷积网络可以包括ResNet-50卷积神经网络,目标时序帧在各个通道的时序运动特征图输入至ResNet-50卷积神经网络后,相应输出时序特征图指向各个动作类型的概率,以识别目标时序帧的图像数据中运动对象的动作类型。
以图2所示的动作识别网络模型为例,根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型的步骤是由主干网络层执行的,将目标时序帧在各卷积通道的时序运动特征图输入至主干网络层中,主干网络层起到分类器的作用,主干网络层输出目标时序帧的图像数据中运动对象的动作类型。步骤S302描述的获取各时序帧的图像数据在不同卷积通道上的原始子特征图的步骤由多通道卷积层执行,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;步骤 S304描述的分别以各个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重的步骤是由动作信息增强模块执行的。而步骤S308,对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图的步骤,是由时序交互模块执行的。
上述动作识别方法,获取视频数据在不同时序帧的图像数据,在通过多通道卷积层获取各个时序帧的图像数据在不同卷积通道上的原始子特征图后,分别以各时序帧作为目标时序帧,通过目标时序帧以及后一时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各个卷积通道上的运行信息权重,并将运动信息权重加到对应的卷积通道的原始子特征图上,增强单一时序帧中原始子特征图上的运动信息,获得目标时序帧在各卷积通道上的运动信息特征图,然后对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,使得目标时序帧在各卷积通道上的运动信息特征图融合了来自相邻时序帧的运动信息特征图,实现在时序上这一维度上的建模,获得目标时序帧在各卷积通道上的时序运动特征图,最终将目标时序帧在各个卷积通道上的时序运动特征图作为目标时序帧的图像数据的特征信息进行动作识别,识别目标时序帧的图像数据中运动对象的动作类型,该动作识别方法在增强单一时序帧中原始子特征图上的运动信息的同时,实现对各时序帧间的时序信息进行建模,打乱各时序帧间的顺序会得到完全不同的动作识别结果,有效提高动作识别的精度性。
在一个实施例中,根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型的步骤之后,还包括:在得到各个时序帧的图像数据中运动对象的动作类型后,根据各时序帧的动作类型确定视频数据的动作类型。
其中,在获得目标时序帧的图像数据中运动对象的动作类型后,根据时序帧的次序,依次将后续的时序帧作为目标时序帧,并获取其图像数据中运动对象的动作类型,在得到所有时序帧的图像数据中的运动对象的动作类型后,最终对视频数据的所有时序帧的图像数据中的运动对象对应的动作类型进行融合,以获取对该视频数据的动作识别结果。
在一个实施例中,如图5所示,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特 征图,计算目标时序帧在各卷积通道上的运动信息权重的步骤,包括:
步骤S502,获取目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息。
其中,差异信息可以描述两个时序帧的图像数据中运动对象的动作变化程度,即与运动对象的动作相关的信息;如前所述,在视频数据中获取的每一时序帧的图像数据中,都包含有对动作识别而言关键的信息,也包含有对动作识别而已无用甚至起反作用的噪声信息。但是由于在单一的时序帧的图像数据中,运动对象以及背景信息都是静态的,而运动是一个动作变化的过程,因此仅仅根据单一的时序帧的图像数据难以获取到运动对象的动作信息。而目标时序帧的图像数据在各个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上的原始子特征图间的差异信息,即前后时序帧中运动对象的动作变化,获取到前后时序帧中对应卷积通道上原始子特征图间的差异信息,即可得到目标时序帧的图像数据在各个卷积通道上的原始子特征图中所包含的运动信息。
可以理解的是,当目标时序帧的图像数据在某个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上的原始子特征图上间的差异信息越大,该卷积通道上的原始子特征图与运动对象的动作信息越相关,原始子特征图中包含越多的与运行相关的特征信息,相反,该卷积通道上的原始子特征图与运动对象的动作信息越不相关,原始子特征图中包含较少的与运行相关的特征信息。
具体地,获取目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息,具体可以是通过计算目标时序帧的图像数据在各个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上的原始子特征图上间的差值获得的。
步骤S504,通过激活函数将各卷积通道上差异信息映射为目标时序帧在各卷积通道上的运动信息权重。
其中,在得到目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息后,可以通过激活函数根据各个卷积通道上的差异信息获取对应卷积通道的运行信息权重。如上所述,当目标时序帧的图像数据在某个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上的原始子特征图上间的差异信息越大,该卷积通道上的原始子特征图的运动信息权重越大,相反,该卷积通道上的原始子特征图与运动对象 的动作信息越不相关,该卷积通道上的原始子特征图的运动信息权重越小。
具体地,激活函数可以是Sigmiod函数。在得到目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息后,可通过激活函数Sigmiod函数,将在各个卷积通道上的差异信息映射为0到1之间的权重值,得到目标时序帧在各个通道上的原始子特征图的运动信息权重。
在一个实施例中,如图6a所示,获取目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息的步骤,包括:
步骤S602,分别通过单位池化层将目标时序帧在各卷积通道上的原始子特征图以及后一时序帧在各卷积通道上的原始子特征图变换为单位子特征图。
其中,单位池化层是指用于原始子特征图进行降维的池化层,可选的,单位池化层可以包括均值池化层(average pooling),例如全局平均池化。
其中,单位子特征图是指长、宽均为1的特征图。具体地,通过单位池化层可以将空间大小为H*W的原始子特征图,降维为空间大小为1*1的单位子特征图。应该理解的是,此时在卷积通道这一维度是不变的,即获得的单位子特征图的卷积通道的数量与原始子特征图的卷积通道数据是相等的。
步骤S604,分别对目标时序帧在各卷积通道上的单位子特征图以及后一时序帧在各卷积通道上的单位子特征图进行预设缩放倍数的降维,得到降维后的单位子特征图。
其中,预设缩放倍数是根据实际情况设置的,可以根据原始子特征图在卷积通道这一维度上的数量与进行卷积通道降维后的单位子特征图在卷积通道这一维度上的数量的比值进行确定。例如,原始子特征图在卷积通道这一维度上的数量为265,而进行卷积通道降维后,单位子特征图在卷积通道这一维度上的数量为16,则预设缩放倍数为16倍。
其中,在获取到目标时序帧在各个卷积通道上的单位子特征图以及后一时序帧在各个卷积通道上的单位子特征图后,可以通过降维卷积层降低目标时序帧以及后一时序帧对应的单位子特征图在卷积通道这一维度上的数量,其中,该降维卷积层中卷积核的大小为1*1,卷积核的数量与降维后需要获得的单位子特征图在卷积通道这一维度上的数量相等。
例如,各个时序帧的原始子特征图的空间大小为H*W,在卷积通道这一维度的数量为C,即包括了C个空间大小为H*W的原始子特征图,各个时序帧的图像数据的原始子特征图的数据维度为C*H*W;在经过单位池化层后得 到的单位子特征图在卷积通道这一维度的数量不变,空间大小降维为1*1,即单位子特征图的数据维度为(C*1*1);然后,通过降维卷积层对卷积通道这一维度进行降维,将单位子特征图在卷积通道这一维度的数量降为(C/r),即获得降维后的单位子特征图的数据维度为(C/r*1*1),其中r为缩放倍数。
步骤S606,获取目标时序帧降维后的单位子特征图与后一时序帧降维后的单位子特征图间的降维差异信息。
其中,获取目标时序帧降维后的单位子特征图与后一时序帧降维后的单位子特征图间的降维差异信息,具体可以是通过计算目标时序帧降维后的单位子特征图与后一时序帧降维后的单位子特征图,在对应卷积通道上的单位子特征图的差值获得的。
步骤S608,对降维差异信息进行预设缩放倍数的升维,得到目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息。
其中,在得到降维差异信息后,可以通过升维卷积层将降维差异信息在卷积通道这一维度上的数量,恢复至与原始子特征图的卷积通道的数据一致。其中,该升维卷积层中卷积核的大小为1*1,卷积核的数量与原始子特征图的卷积通道数量相等。
本实施例中,通过单位池化层将目标时序帧在各卷积通道上的原始子特征图以及后一时序帧在各卷积通道上的原始子特征图变换为单位子特征图,并对获得的单位子特征图在卷积通道这一维度进行预设缩放倍数的降维后,降维后的单位子特征图的数据量相对于原始子特征图的数量量大大降低,使得计算目标时序帧与后一时序帧在各卷积通道上的原始子特征图的差异信息,转换为计算目标时序帧的降维后的单位子特征图与后一时序帧的降维后的单位子特征图间的差异信息,有效降低计算量,提高计算速度。
以图2所示的动作识别网络模型为例,上述如图5以及如图6所示的步骤可以由动作信息增强模块执行的;如图6b所示,图6b为一个实施例中计算目标时序帧在各卷积通道上的运动信息权重的示意图。图6中A、B两个输入分别表示目标时序帧的原始子特征图以及后一时序帧的原始子特征图,其中,输入A以及输入B的数据维度均为C*H*W,其中,H与W分别标识原始子特征图的长度以及宽度,C表示原始子特征图在卷积通道这一维度上的数量,即输入A与输入B均包括有C个卷积通道的、空间大小为H*W的原始子特征图。为了降低这一模块的计算量,先通过单位池化层分别对输入A中的原始子特征图以及输入B中的原始子特征图的空间维度进行降维,获得C个卷积 通道、空间大小为1*1的单位子特征图。接着,通过第一降维池化层在卷积通道这一维度上对与输入A对应的单位子特征图进行降维,降维后的单位子特征图的数据维度为C/r*1*1,同样的,通过第二降维池化层在卷积通道这一维度上对与输入B对应的单位子特征图进行降维,降维后的单位子特征图的数据维度同样为C/r*1*1。可以理解的是,第一降维卷保护层与第二降维卷积层的网络参数一致。然后,将输入A、输入B这两个时序帧降维后的单位子特征图(数据维度为C/r*1*1)相减,得到表征运动信息的降维差异信息,该降维差异信息的数据维度为C/r*1*1,再通过升维卷积层将卷积通道这一维度的数量恢复至与原始子特征图的卷积通道的数量一致,得到数据维度为C*1*1的差异信息。最后,经过sigmoid函数将对应每个卷积通道的差异信息,映射为数据值为0至1的运动信息权重。后续将每个卷积通道的运动信息权重与对应卷积通道的原始子特征图进行相乘,使得部分卷积通道的原始子特征图的特征信息得到不同程度的增强,而其余卷积通道的原始子特征图的特征信息得到不同程度的抑制,实现利用后一时序帧的特征信息来增强目标时序帧的原始子特征图中与运动信息相关的特征信息。应该理解的是,由于最后一时序帧由于没有后帧,因此不能利用后一时序帧的原始子特征图中的特征信息增强本时序帧,即运动信息特征图与原始子特征图一致。
在一个实施例中,如图7所示,对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图的步骤,包括:
步骤S702,分别获取与目标时序帧相邻的前一时序帧在各卷积通道的运动信息特征图以及与目标时序帧相邻的后一时序帧在各卷积通道的运动信息特征图;
步骤S704,利用时序卷积核对目标时序帧、前一时序帧以及后一时序帧在同一卷积通道的运动信息特征图进卷积运算,得到目标时序帧在各个卷积通道上的时序运动特征图。
其中,分别获取前一时序帧在各卷积通道的运动信息特征图以及后一时序帧在各卷积通道的运动信息特征图,然后利用时序卷积核对目标时序帧、前一时序帧以及后一时序帧在同一卷积通道上的运动信息特征图进卷积运算,以获得目标时序帧在该卷积通道上的时序运动特征图,进而获取目标时序帧在所有卷积通道上的时序运动特征图,使得时序运动特征图中融合了前后时序帧的运动特征图,即运动对象的动作信息,实现在时序这一维度进行 建模。
应该理解的是,前一时序帧在各卷积通道的运动信息特征图以及后一时序帧在各卷积通道的运动信息特征图的获取方法,与目标时序帧在各个卷积通道上的运动信息特征图的获取方法相同。例如,目标时序帧为第t帧,与目标时序帧相邻的前一时序帧为第(t-1)帧,则对于前一时序帧(第(t-1)帧)的运动信息特征图,是根据第(t-1)帧在各卷积通道上的原始子特征图,以及与第(t)帧在各卷积通道上的原始子特征图,计算第(t-1)帧在各卷积通道上的运动信息权重,然后根据第(t-1)帧在各卷积通道上的运动信息权重以及第(t-1)帧在各卷积通道上的原始子特征图,获取第(t-1)帧在各卷积通道上的运动信息特征图。同样的,与目标时序帧相邻的后一时序帧为第(t+1)帧,对于后一时序帧(第(t+1)帧)的运动信息特征图,是根据第(t+1)帧在各卷积通道上的原始子特征图,以及与第(t+2)帧在各卷积通道上的原始子特征图,计算第(t+1)帧在各卷积通道上的运动信息权重,然后根据第(t+1)帧在各卷积通道上的运动信息权重以及第(t+1)帧在各卷积通道上的原始子特征图,获取第(t+1)帧在各卷积通道上的运动信息特征图。
以图2所示的动作识别网络模型为例,上述利用时序卷积核对目标时序帧、前一时序帧以及后一时序帧在同一卷积通道的运动信息特征图进卷积运算,得到目标时序帧在各个卷积通道上的时序运动特征图的步骤可以由动作信息增强模块执行的,具体如图4所示,以图中第3时序帧作为目标时序帧,对于第3时序帧的时序运动特征图,利用一个3*1的卷积核对第2时序帧、第3时序帧以及第4时序帧的第1卷积通道进行卷积运算,以获取第3时序帧在第1卷积通道的时序运动特征图,同样的,利用一个3*1的卷积核对第2时序帧、第3时序帧以及第4时序帧的第2卷积通道进行卷积运算,以获取第3时序帧在第2卷积通道的时序运动特征图,以此类推,获得第3时序帧在各个卷积通道上的时序运动特征图。对于任意一个时序帧,可以利用其前后相邻的相邻时序帧,在各个卷积通道上进行时间维度上的卷积运算,使得运算后的时序运动特征图融合了前后时序帧的运动特征图,即运动对象的动作信息。
在一个实施例中,如图8a所示,根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型的步骤,包括:
步骤S802,将目标时序帧的时序运动特征图输入至残差网络层中,得到 目标时序帧的图像数据的动作特征信息。
其中,残差网络层是用于获取时序运动特征图进行进一步的特征学习,以获取能够更好表征运动对象动作类型的动作特征信息。
具体地,在获得目标时序帧在各卷积通道的时序运动特征图后,将目标时序帧在各卷积通道的时序运动特征图作为目标时序帧的图像数据的特征信息,输入至残差网络层中,通过残差网络层对各个时序运动特征图进行特征学习,以获取图像数据的动作特征信息。其中,运动特征信息在卷积通道这一维度上的数量可以与时序运行特征图的一致。
步骤S804,将动作特征信息输入至动作分类网络层中,识别目标时序帧的图像数据中运动对象的动作类型。
其中,动作分类网络层是用于根据图像数据的动作特征信息进行动作类型识别的网络结构,这里的动作分类网络层是经过训练的动作分类网络层,可直接用于获取图像数据中运动对象的动作类型。具体地,在获取到目标时序帧的图像数据的动作特征信息后,将动作特征信息输入至动作分类网络层中,通过动作分类网络层获取目标时序帧中图像数据运动对象的动作类型。
以图2所示的动作识别网络模型为例,上述根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型的步骤可以由主干网络层执行的,其中,主干网络层中的残差网络层是用于获取时序运动特征图进行进一步的特征学习,以获取能够更好表征运动对象动作类型的动作特征信息,而主干网络层中的池化层以及全连接层相当于动作分类网络层,用于根据输入的动作特征信息识别目标时序帧的图像数据中运动对象的动作类型。进一步地,在一个实施例中,残差网络层的网络结构可以如图8b所示,其中,包括三个卷积神经网络,分别为两端的2个大小为1*1的二维卷积神经网络(2Dconv)、以及中间大小为3*3的二维卷积神经网络。
在一个实施例中,将目标时序帧的时序运动特征图输入至残差网络层中,得到目标时序帧的图像数据的动作特征信息的步骤之后,还包括:将动作特征信息确定为目标时序帧的图像数据在不同卷积通道上的原始子特征图;重新执行根据目标时序帧在各卷积通道上的原始子特征图,以及后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重的步骤。
其中,在得到目标时序帧的图像数据的动作特征信息后,可以将动作特征信息重新确定为目标时序帧的图像数据在不同卷积通道上的原始子特征 图,然后重新对新确定的原始子特征图进行相同的操作,即计算原始子特征图在各个卷积通道上的运动信息权重,并将运动信息权重加到对应的卷积通道的原始子特征图上,获得目标时序帧在各卷积通道上的运动信息特征图,然后利用时序卷积核对目标时序帧与相邻时序帧在同一卷积通道的运动信息特征图进行卷积,使得目标时序帧在各卷积通道上的运动信息特征图融合了来自相邻时序帧的运动信息特征图,获得目标时序帧在各卷积通道上的时序运动特征图。
通过将动作特征信息确定为原始子特征图,再次基于注意力机制对运动特征信息进行的信息增强,以及再次对时序信息进行建模,可有效提高动作特征信息表征动作信息的能力,后续将动作特性信息用于动作识别,有效提高动作识别的精度。
以图2所示的动作识别网络模型为例,图中的动作信息增强模块用于对每一时序帧的图像数据在不同卷积通道上的原始子特征图进行动作信息增强,得到每一时序帧的图像数据在不同卷积通道上的动作信息特征图;而时序交互模块用于对前后相邻时序帧的图像数据的动作信息特征图在相同卷积通道上进行卷积运算,得到时序运动特征图,该时序运行特征图融合了前后相邻时序帧的运动信息;而主干网络层中的残差网络层用于获取时序运动特征图进行进一步的特征学习,以获取能够更好表征运动对象动作类型的动作特征信息。对于动作信息增强模块、时序交互模块以及残差网络层,可以作为一个特征提取单元,通过多个特征提取单元,提高特征学习的精度,可有效提高动作识别的精度。
进一步的,对于动作信息增强模块以及时序交互模块,不仅仅可应用于视频数据的动作识别这一应用环境中,还可以用于与任何需要对视频数据进行建模的场景中,例如,动作信息增强模块可以嵌入到对连续时序帧建模的神经网络中,根据应用场景的不同,针对性的增强有利于应用场景的特征信息而抑制不利于应用场景的噪声信息,时序交互模块也可以嵌入与任何2D卷积网络中进行对时序信息的建模,有助于特征学习。
在一个实施例中,如图9所示,动作识别方法还包括:
步骤S902,获取视频样本,其中视频样本包括多张不同样本时序帧的图像样本以及各样本时序帧的图像样本中运动对象的标准动作类型。
其中,视频样本是指用于动作识别网络模型的视频样本。视频样本中包括有多张不同样本时序帧的图像样本,以及各个图像样本对应的标准动作类 型。
步骤S904,通过多通道卷积层获取各图像样本在不同卷积通道上的原始子特征图样本。
其中,将图像样本作为多通道卷积层的输入数据输入至多通道卷积层中,多通道卷积层中的各个卷积核对图像样本进行卷积计算,获取与各个卷积核对应的卷积通道的原始子特征图样本。
步骤S906,分别以各个样本时序帧作为目标样本时序帧,获取目标样本时序帧的原始子特征图样本以及后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息。
其中,样本差异信息可以描述两个样本时序帧的图像样本中运动对象的动作变化程度,即与运动对象的动作相关的信息;在视频样本中获取的每一样本时序帧的图像样本中,都包含有对动作识别而言关键的信息,也包含有对动作识别而已无用甚至起反作用的噪声信息。但是由于在单一的样本时序帧的图像样本中,运动对象以及背景信息都是静态的,而运动是一个动作变化的过程,因此仅仅根据单一的样本时序帧的图像样本难以获取到运动对象的动作信息。而目标样本时序帧的图像样本在各个卷积通道上的原始子特征图样本,与其后一样本时序帧的图像样本在对应卷积通道上的原始子特征图样本间的差异信息,即前后样本时序帧中运动对象的动作变化,获取到前后样本时序帧中在对应卷积通道上原始子特征图样本间的样本差异信息,即可得到目标样本时序帧的图像样本在各个卷积通道上的原始子特征图样本中所包含的运动信息。
具体地,获取目标样本时序帧的原始子特征图样本与后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息,具体可以是通过计算目标样本时序帧的图像样本在各个卷积通道上的原始子特征图样本,与其后一样本时序帧的图像样本在对应卷积通道上的原始子特征图样本上间的差值获得的。
进一步地,可以对目标样本时序帧在各卷积通道上的原始子特征图样本进行数据降维,得到目标样本时序帧降维后的单位子特征图样本,并对后一样本时序帧在各卷积通道上的原始子特征图样本进行数据降维进行数据降维,得到后一样本时序帧降维后的单位子特征图样本,降维后的后的单位子特征图的数据量相对于原始子特征图的数量大大降低,通过将计算目标样本时序帧与后一样本时序帧在各卷积通道上的原始子特征图样本的样本差异信 息,转换为计算目标样本时序帧的降维后的单位子特征图样本与后一样本时序帧的降维后的单位子特征图样本间的差异信息,有效降低计算量,提高计算速度。
步骤S908,通过激活函数将各卷积通道上样本差异信息映射为目标样本时序帧在各卷积通道上的运动信息权重样本。
其中,在得到目标样本时序帧的原始子特征图样本与后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息后,可以通过激活函数根据各个卷积通道上的样本差异信息获取对应卷积通道的运动信息权重。具体地,激活函数可以是Sigmoid函数。在得到目标样本时序帧的原始子特征图样本与后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息后,可通过Sigmoid激活函数,将在各个卷积通道上的样本差异信息映射为0到1之间的权重值,得到目标样本时序帧在各个通道上的原始子特征图样本的运动信息权重。
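作为示意,上述先通过池化得到单位子特征图样本、按预设缩放倍数降维后求差、再升维并经Sigmoid映射为运动信息权重的过程,可以用如下代码草图表示。其中的类名MotionWeight、缩放倍数r=16等均为示例性假设:

```python
import torch
import torch.nn as nn

class MotionWeight(nn.Module):
    """运动信息权重计算的示意实现:池化 -> 降维 -> 求差 -> 升维 -> Sigmoid。"""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # 单位池化层:H*W -> 1*1
        self.reduce = nn.Conv2d(channels, channels // r, 1)  # 预设缩放倍数的降维
        self.expand = nn.Conv2d(channels // r, channels, 1)  # 预设缩放倍数的升维

    def forward(self, feat_t, feat_t_next):
        # feat_t / feat_t_next: [N, C, H, W],目标(样本)时序帧与其后一(样本)时序帧的原始子特征图
        diff = self.reduce(self.pool(feat_t_next)) - self.reduce(self.pool(feat_t))  # 降维差异信息
        return torch.sigmoid(self.expand(diff))              # [N, C, 1, 1],各卷积通道的运动信息权重
```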
步骤S910,根据目标样本时序帧在各卷积通道上的运动信息权重样本以及原始子特征图样本,获取目标样本时序帧在各卷积通道上的运动信息特征图样本。
步骤S912,对目标样本时序帧在各卷积通道上的运动信息特征图样本进行时序卷积,得到目标样本时序帧在各卷积通道上的时序运动特征图样本。
其中,在得到各个样本时序帧在各个卷积通道上的运动信息特征图样本后,可以利用时序卷积核,对目标样本时序帧以及相邻的样本时序帧在同一卷积通道上的运动信息特征图样本进行卷积运算,以获取目标样本时序帧在各个卷积通道上的时序运动特征图样本,使得时序运动特征图样本中融合了前后样本时序帧的运动特征图样本,即运动对象的动作信息,实现在时序这一维度进行建模。
步骤S914,根据目标样本时序帧在各卷积通道的时序运动特征图样本获取目标样本时序帧的图像样本中运动对象的预测动作类型。
其中,在得到目标样本时序帧的图像数据的时序运动特征图样本后,可以利用时序运动特征图样本作为图像样本的特征信息,获取目标样本时序帧的图像样本中运动对象的动作类型。具体地,可以将时序运动特征图样本输入至用于动作识别的2D卷积网络中,以获取目标样本时序帧的图像样本中运动对象的预测动作类型。
步骤S916,根据预测动作类型以及标准动作类型间的差异,调整多通道卷积层、激活函数以及时序卷积核的参数,继续训练直至满足训练结束条件。
其中,在获得图像样本的预测动作类型后,可根据预测动作类型与标准动作类型间的差异构建损失函数,对多通道卷积层、激活函数以及时序卷积核的参数进行调整,直至满足训练结束条件。这里的训练结束条件可根据实际需要进行调整或设置,例如,当损失函数满足收敛条件,则可认为达到训练结束条件;或者当训练次数达到预设次数时,则可认为达到训练结束条件。
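作为示意,上述根据预测动作类型与标准动作类型间的差异调整参数的训练过程,可以用如下代码草图表示。其中model、sample_loader、max_epochs等名称均为示例性假设,交叉熵也只是一种可能的损失函数选择:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                      # 以预测与标准动作类型间的差异构建损失
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(max_epochs):                        # 训练结束条件之一:达到预设训练次数
    for frames, labels in sample_loader:               # frames: [N, T, 3, H, W],labels: [N]
        logits = model(frames)                         # 预测各图像样本中运动对象的动作类型
        loss = criterion(logits, labels)               # 预测动作类型与标准动作类型间的差异
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                               # 调整多通道卷积层、激活函数相关层及时序卷积核的参数
```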
在一个实施例中,一种动作识别方法,包括:
1、获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
2、分别以各个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
2-1、获取目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息;
2-1-1、分别通过单位池化层将目标时序帧在各卷积通道上的原始子特征图以及后一时序帧在各卷积通道上的原始子特征图变换为单位子特征图;
2-1-2、分别对目标时序帧在各卷积通道上的单位子特征图以及后一时序帧在各所述卷积通道上的单位子特征图进行预设缩放倍数的降维,得到降维后的单位子特征图;
2-1-3、获取目标时序帧降维后的单位子特征图与后一时序帧降维后的单位子特征图间的降维差异信息;
2-1-4、对所述降维差异信息进行预设缩放倍数的升维,得到目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息。
2-2、通过激活函数将各卷积通道上差异信息映射为目标时序帧在各卷积通道上的运动信息权重。
3、根据目标时序帧在各卷积通道上的运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
4、对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;
4-1、分别获取与目标时序帧相邻的前一时序帧在各卷积通道的运动信息特征图以及与目标时序帧相邻的后一时序帧在各卷积通道的运动信息特征图;
4-2、利用时序卷积核对目标时序帧、前一时序帧以及后一时序帧在同一卷积通道的运动信息特征图进行卷积运算,得到目标时序帧在各个卷积通道上的时序运动特征图。
5、根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型。
5-1、将目标时序帧的时序运动特征图输入至残差网络层中,得到目标时序帧的图像数据的动作特征信息;
5-2、将所述动作特征信息输入至动作分类网络层中,识别目标时序帧的图像数据中运动对象的动作类型。
6、在得到各个时序帧的图像数据中运动对象的动作类型后,根据各时序帧的动作类型确定所述视频数据的动作类型。
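作为示意,上述步骤1至步骤6的整体流程可以用如下代码草图串联表示。其中沿用了前文示例中假设的MotionWeight、TemporalInteraction与ResidualLayer,通道数、类别数以及对最后一帧的近似处理等均为示例性假设,运动信息权重的具体施加方式也仅作举例:

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    def __init__(self, channels=64, num_classes=10):
        super().__init__()
        self.multi_channel_conv = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)  # 多通道卷积层
        self.motion_weight = MotionWeight(channels)         # 前文示例:运动信息权重
        self.temporal = TemporalInteraction(channels)       # 前文示例:时序卷积
        self.residual = ResidualLayer(channels)             # 前文示例:残差网络层
        self.classifier = nn.Linear(channels, num_classes)  # 相当于动作分类网络层

    def forward(self, frames):
        # frames: [N, T, 3, H, W]
        # 步骤1:获取各时序帧在不同卷积通道上的原始子特征图
        n, t = frames.shape[:2]
        x = self.multi_channel_conv(frames.flatten(0, 1))
        x = x.reshape(n, t, *x.shape[1:])                    # [N, T, C, H', W']

        # 步骤2~3:以每个时序帧为目标时序帧,与后一时序帧计算运动信息权重并施加到原始子特征图上
        nxt = torch.cat([x[:, 1:], x[:, -1:]], dim=1)        # 最后一帧以自身近似其后一帧(示例处理)
        weights = torch.stack([self.motion_weight(x[:, i], nxt[:, i]) for i in range(t)], dim=1)
        motion = x * weights                                  # 运动信息特征图(加权方式为示例)

        # 步骤4:时序卷积,得到时序运动特征图
        motion = self.temporal(motion)

        # 步骤5:残差网络层 + 池化 + 全连接,识别每一时序帧的动作类型
        feat = self.residual(motion.flatten(0, 1))
        feat = feat.mean(dim=(2, 3)).reshape(n, t, -1)        # 全局平均池化
        frame_logits = self.classifier(feat)                  # [N, T, num_classes]

        # 步骤6:综合各时序帧的结果,确定视频数据的动作类型
        return frame_logits.mean(dim=1)
```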
进一步地,结合图2所示的动作识别网络模型以及图10,对动作识别方法进行进一步说明。在图10中,左边栏目是从视频中按时间顺序截取的两个时序帧的图像数据,其中,左边栏目中的第一列图像数据为目标时序帧的图像数据,第二列为后一时序帧的图像数据;右边栏目中,第一列的图像是对左边栏目中的目标时序帧的图像数据对应的原始子特征图的可视化,第二列图像是原始子特征图经过动作信息增强模块后获得的运动信息特征图的可视化,第三列图像是运动信息特征图经过时序交互模块后获得的时序运动特征图的可视化。从图10中可以看出,原始子特征图中既包含有对动作识别而言关键的信息,也包含有对动作识别而言无用甚至起反作用的噪声信息,其中噪声信息较多,运动对象的轮廓较为模糊;而经过动作信息增强模块后获得的运动信息特征图中,运动对象的轮廓变得清晰,与动作信息无关的背景噪声信息得到一定程度上的抑制;而经过时序交互模块后获得的时序运动特征图中,不仅具有左边栏目中第一列目标时序帧图像数据的信息,还包括了左边栏目中第二列后一时序帧图像数据的信息,达到了对时序信息进行建模的目的。
进一步的,上述步骤2到步骤4中对数据的操作过程是在卷积通道这一维度上进行的,不同的卷积通道的特征图(包括原始子特征图或运动信息特征图)是相互独立的,相邻卷积通道的特征图的信息不会被混合,使得运算过程中的运算量保持在较低水平,从而具有较高的运算速度。同样的,图2中的动作信息增强模块以及时序交互模块都是在卷积通道上进行操作的,即对于单一或多个时序帧在各个卷积通道的特征图(原始子特征图或运动信息特征图),不同的卷积通道的特征图是相互独立的,相邻卷积通道的特征图的信息不会被混合,使得运算过程中的运算量保持在较低水平,从而具有较高的运算速度。
在一个实施例中,如图11所示,一种动作识别方法,包括:
步骤S1102,获取实时的监控视频数据。
其中,本实施例是应用于实时监控的场景中的,视频数据选用实时获取的监控视频数据。监控视频数据可以是通过摄像头拍摄的实时视频,其中,监控视频数据的图像中包括被监视的运动对象。
步骤S1104,提取监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图。
其中,不同时序帧的图像数据是指按照拍摄的时间顺序从监控视频数据中抽取的图像数据,其中,包括监控视频数据中所有时序帧的图像数据。获取视频数据在不同时序帧的图像数据,具体可以是按照视频数据中图像数据的排列顺序依次获取。
其中,原始子特征图是指表征图像数据的特征信息;多通道卷积层是指用于获取图像数据中的特征信息的网络模型,这里的多通道卷积层是已训练好的网络模型,可直接用来获取图像数据的特征信息。其中,多通道卷积层包括多个卷积核,卷积通道是由多通道卷积层所决定的,多通道卷积层中用于抽取图像数据的卷积核的数量,即为卷积通道数量。具体地,将监控视频中的各个时序帧的图像数据分别作为多通道卷积层的输入数据输入至多通道卷积层中,多通道卷积层中的各个卷积核对图像数据进行卷积计算,获取与各个卷积核对应的卷积通道的原始子特征图。
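作为示意,多通道卷积层可以理解为一组卷积核,卷积核的数量即为卷积通道数量,例如下述草图中的64个卷积核对应64个卷积通道(通道数、输入尺寸均为示例性假设):

```python
import torch
import torch.nn as nn

multi_channel_conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

frame = torch.randn(1, 3, 224, 224)           # 某一时序帧的图像数据(RGB,尺寸为示例)
sub_feature_maps = multi_channel_conv(frame)  # [1, 64, 224, 224]:64个卷积通道上的原始子特征图
```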
步骤S1106,确定目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重。
其中,目标时序帧是指当前时刻获取的图像数据所对应的时序帧,后一时序帧是指相对于目标时序帧,下一时刻所对应的时序帧。
在监控视频数据中获取的每一时序帧的图像数据中,都包含有对动作识别而言关键的信息,例如运动对象的表观信息,也包含有对动作识别而言无用甚至起反作用的噪声信息,例如图像数据中的噪声或背景信息。在获取目标时序帧的图像数据在不同卷积通道上的原始子特征图与运动对象的动作信息的相关性,即运动信息权重后,可为包含与运动对象动作信息更相关的特征信息的原始子特征图,即对应卷积通道上的原始子特征图,分配更多的注意力,而为包含较少运动对象动作信息或包含更多噪声信息的原始子特征图,即对应卷积通道上的原始子特征图,分配较少的注意力,使对动作识别有利的信息得到增强,而与动作识别无关甚至有害的信息得到抑制,有效提高动作识别的准确性。
由于单一的时序帧的图像数据中,运动对象以及背景信息都是静态的,而运动是一个动作变化的过程,因此需要通过目标时序帧的图像数据及其后一时序帧的图像数据,对运动对象的动作变化过程进行描述,以提高动作识别的准确度。其中,在获取到每一时序帧的图像数据在各个卷积通道上的原始子特征图后,对于每一时序帧的图像数据,以其本身作为目标时序帧的图像数据,从而根据目标时序帧的图像数据在不同卷积通道上的原始子特征图,及其后一时序帧的图像数据在不同卷积通道上原始子特征图,获取目标时序帧在各个卷积通道上的原始子特征图对应的运动信息权重。
具体地,获取目标时序帧在各个卷积通道上的原始子特征图所包含的运动信息权重,具体可以先计算目标时序帧的图像数据在各个卷积通道上的原始子特征图,与其后一时序帧的图像数据在对应卷积通道上原始子特征图间的差异度,然后根据各个卷积通道上原始子特征图间的差异度确定目标时序帧的图像数据在各个卷积通道上对应的运动信息权重。
步骤S1108,根据目标时序帧在各卷积通道上的运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图。
步骤S1110,对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图。
其中,对目标时序帧在各个卷积通道上的运动信息特征图进行时序卷积,具体可以是根据目标时序帧确定待卷积时序帧,对目标时序帧与待卷积时序帧在同一卷积通道上的运动信息特征图进行卷积,以获取目标时序帧在各个卷积通道上的时序运动特征图,使得时序运动特征图中融合了前后时序帧的运动特征图,即运动对象在前后时间的动作信息,实现在时序这一维度进行建模。其中,待卷积时序帧在各个卷积通道上的运动信息特征图的获取方法,与目标时序帧在各个卷积通道上的运动信息特征图的获取方法相同。
步骤S1112,根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型。
其中,在得到目标时序帧的图像数据的时序运动特征图后,可以将时序运动特征图确定为目标时序帧图像数据的特征信息,并根据该特征信息识别目标时序帧的图像数据中运动对象的动作类型。具体地,可以将时序运动特征图输入至用于动作识别的2D卷积网络中,以识别目标时序帧的图像数据中运动对象的动作类型。时序运动特征图中既包括较强的与运动相关的信息,又包括时序信息,利用时序运动特征图进行动作识别,可有效提高动作识别的精确度。
步骤S1114,将动作类型确定为当前监控视频数据中运动对象的动作信息。
其中,在获得目标时序帧的图像数据中运动对象的动作类型后,将该动作类型确定为监控视频数据中运动对象的动作信息,实现实时更新运动对象的运动信息,无需观看监控视频即可获取运动对象的运动信息,同时确保对运动对象的实时监控。
进一步地,可以通过显示装置显示该运动信息,使得监控人员获取监控视频中运动对象的运动状态。
以待监控对象是人为例,假设被监控人正在做跨步这一动作,获取实时监控视频数据中当前时刻拍摄到的目标时序帧以及与目标时序帧相邻的后一时序帧,通过目标时序帧以及后一时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各个卷积通道上的运动信息权重,并将运动信息权重加到目标时序帧对应的卷积通道的原始子特征图上,获得目标时序帧在各卷积通道上的运动信息特征图,然后根据目标时序帧确定待卷积时序帧,从而对目标时序帧与待卷积时序帧在同一卷积通道的运动信息特征图进行卷积,获得目标时序帧在各卷积通道上的时序运动特征图,最终将目标时序帧在各个卷积通道上的时序运动特征图作为目标时序帧的图像数据的特征信息进行动作识别,得到目标时序帧的图像数据中被监控人的动作类型,此时被监控人的动作类型为跨步动作类型,并将该动作类型确定为被监控人的动作信息。
应该理解的是,虽然上述流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而 是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图12所示,提供了一种动作识别装置1200,设置于计算机设备中,该装置包括:图像获取模块1202、权重获取模块1204、特征确定模块1206、时序交互模块1208以及动作识别模块1210,其中:
图像获取模块1202,用于获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
权重获取模块1204,用于分别以各个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
特征确定模块1206,用于根据目标时序帧在各卷积通道上的运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
时序交互模块1208,用于对目标时序帧在各卷积通道上的运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;
动作识别模块1210,用于根据目标时序帧在各卷积通道的时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型。
在一个实施例中,如图13所示,权重获取模块1204,包括:
差异信息获取模块1204a,用于获取目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息;
权重映射模块1204b,用于通过激活函数将各卷积通道上差异信息映射为目标时序帧在各卷积通道上的运动信息权重。
在一个实施例中,差异信息获取模块,用于:分别通过单位池化层将目标时序帧在各卷积通道上的原始子特征图以及后一时序帧在各卷积通道上的原始子特征图变换为单位子特征图;分别对目标时序帧在各卷积通道上的单位子特征图以及后一时序帧在各卷积通道上的单位子特征图进行预设缩放倍数的降维,得到降维后的单位子特征图;获取目标时序帧降维后的单位子特征图与后一时序帧降维后的单位子特征图间的降维差异信息;对降维差异信息进行预设缩放倍数的升维,得到目标时序帧的原始子特征图与后一时序帧的原始子特征图在各卷积通道上的差异信息。
在一个实施例中,时序交互模块,用于:分别获取与目标时序帧相邻的前一时序帧在各卷积通道的运动信息特征图以及与目标时序帧相邻的后一时序帧在各卷积通道的运动信息特征图;利用时序卷积核对目标时序帧、前一时序帧以及后一时序帧在同一卷积通道的运动信息特征图进行卷积运算,得到目标时序帧在各个卷积通道上的时序运动特征图。
在一个实施例中,动作识别模块,用于将目标时序帧的时序运动特征图输入至残差网络层中,得到目标时序帧的图像数据的动作特征信息;将动作特征信息输入至动作分类网络层中,获取目标时序帧图像数据中运动对象的动作类型。
在一个实施例中,时序交互模块,还用于将动作特征信息确定为目标时序帧的图像数据在不同卷积通道上的原始子特征图,并使得权重获取模块1204再次根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重。
在一个实施例中,动作识别模块,还用于在得到各个时序帧的图像数据中运动对象的动作类型后,根据各时序帧的动作类型确定视频数据的动作类型。
在一个实施例中,动作识别装置还包括训练模块,训练模块用于获取视频样本,其中视频样本包括多张不同样本时序帧的图像样本以及各样本时序帧的图像样本中运动对象的标准动作类型;通过多通道卷积层获取各图像样本在不同卷积通道上的原始子特征图样本;分别以各个样本时序帧作为目标样本时序帧,获取目标样本时序帧的原始子特征图样本以及后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息;通过激活函数将各卷积通道上样本差异信息映射为目标样本时序帧在各卷积通道上的运动信息权重样本;根据目标样本时序帧在各卷积通道上的运动信息权重样本以及原始子特征图样本,获取目标样本时序帧在各卷积通道上的运动信息特征图样本;对目标样本时序帧在各卷积通道上的运动信息特征图样本进行时序卷积,得到目标样本时序帧在各卷积通道上的时序运动特征图样本;根据目标样本时序帧在各卷积通道的时序运动特征图样本获取目标样本时序帧的图像样本中运动对象的预测动作类型;根据预测动作类型以及目标样本时序帧的标准动作类型间的差异,调整多通道卷积层、激活函数以及时序卷积核的参数,继续训练直至满足训练结束条件。
在一个实施例中,提供了一种动作识别装置,设置于计算机设备中,该装置包括:图像获取模块、权重获取模块、特征确定模块、时序交互模块以及动作识别模块;其中:
图像获取模块,用于获取实时的监控视频数据;提取监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图。
权重获取模块,用于分别以每个时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重。
特征确定模块,用于根据运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图。
时序交互模块,用于对运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图。
动作识别模块,用于根据时序运动特征图,识别目标时序帧的图像数据中运动对象的动作类型;将动作类型确定为当前监控视频数据中运动对象的动作信息。
关于动作识别装置的具体限定可以参见上文中对于动作识别方法的限定,在此不再赘述。上述动作识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
图14示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是图1中的服务器102。如图14所示,该计算机设备包括通过系统总线连接的一个或多个处理器、存储器、网络接口、输入装置和显示屏。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机可读指令,该计算机可读指令被一个或多个处理器执行时,可使得一个或多个处理器实现动作识别方法。该内存储器中也可储存有计算机可读指令,该计算机可读指令被一个或多个处理器执行时,可使得一个或多个处理器执行动作识别方法。计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图14中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的动作识别装置可以实现为一种计算机可读指令的形式,计算机可读指令可在如图14所示的计算机设备上运行。计算机设备的存储器中可存储组成该动作识别装置的各个程序模块,比如,图12所示的图像获取模块1202、权重获取模块1204、特征确定模块1206、时序交互模块1208以及动作识别模块1210。各个程序模块构成的计算机可读指令使得一个或多个处理器执行本说明书中描述的本申请各个实施例的动作识别方法中的步骤。
例如,图14所示的计算机设备可以通过如图12所示的动作识别装置中的图像获取模块1202执行步骤S302。计算机设备可通过权重获取模块1204执行步骤S304。计算机设备可通过特征确定模块1206执行步骤S306。计算机设备可通过时序交互模块1208执行步骤S308。计算机设备可通过动作识别模块1210执行步骤S310。
在一个实施例中,提供了一种计算机设备,包括存储器和一个或多个处理器,存储器存储有计算机可读指令,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述动作识别方法的步骤。此处动作识别方法的步骤可以是上述各个实施例的动作识别方法中的步骤。
在一个实施例中,提供了一个或多个计算机可读存储介质,存储有计算机可读指令,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述动作识别方法的步骤。此处动作识别方法的步骤可以是上述各个实施例的动作识别方法中的步骤。
本申请各实施例中的“多个”即为至少两个。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的程序可存储于一非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种动作识别方法,由计算机设备执行,包括:
    获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各所述时序帧的图像数据在不同卷积通道上的原始子特征图;
    分别以每个所述时序帧作为目标时序帧,根据所述目标时序帧在各所述卷积通道上的原始子特征图,以及与所述目标时序帧相邻的后一时序帧在各所述卷积通道上的原始子特征图,计算所述目标时序帧在各所述卷积通道上的运动信息权重;
    根据所述运动信息权重以及所述目标时序帧在各所述卷积通道上的原始子特征图,获取所述目标时序帧在各所述卷积通道上的运动信息特征图;
    对所述运动信息特征图进行时序卷积,得到所述目标时序帧在各卷积通道上的时序运动特征图;及
    根据所述时序运动特征图,识别所述目标时序帧的图像数据中运动对象的动作类型。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述目标时序帧在各所述卷积通道上的原始子特征图,以及与所述目标时序帧相邻的后一时序帧在各所述卷积通道上的原始子特征图,计算所述目标时序帧在各所述卷积通道上的运动信息权重,包括:
    获取所述目标时序帧的原始子特征图与所述后一时序帧的原始子特征图在各所述卷积通道上的差异信息;
    通过激活函数,将各所述卷积通道上差异信息映射为所述目标时序帧在各所述卷积通道上的运动信息权重。
  3. 根据权利要求2所述的方法,其特征在于,所述获取所述目标时序帧的原始子特征图与所述后一时序帧的原始子特征图在各所述卷积通道上的差异信息,包括:
    通过单位池化层,分别将所述目标时序帧在各卷积通道上的原始子特征图、以及所述后一时序帧在各所述卷积通道上的原始子特征图变换为单位子特征图;
    分别对目标时序帧的所述单位子特征图以及所述后一时序帧的所述单位子特征图进行预设缩放倍数的降维,得到降维后的单位子特征图;
    获取所述目标时序帧降维后的单位子特征图与所述后一时序帧降维后的单位子特征图间的降维差异信息;
    对所述降维差异信息进行所述预设缩放倍数的升维,得到所述目标时序帧的原始子特征图与所述后一时序帧的原始子特征图在各所述卷积通道上的差异信息。
  4. 根据权利要求1所述的方法,其特征在于,所述对所述运动信息特征图进行时序卷积,得到所述目标时序帧在各卷积通道上的时序运动特征图,包括:
    分别获取与目标时序帧相邻的前一时序帧在各所述卷积通道的运动信息特征图、以及所述后一时序帧在各所述卷积通道的运动信息特征图;
    利用时序卷积核,对目标时序帧、所述前一时序帧以及所述后一时序帧在同一卷积通道的运动信息特征图进行卷积运算,得到所述目标时序帧在各卷积通道上的时序运动特征图。
  5. 根据权利要求1所述的方法,其特征在于,所述根据所述时序运动特征图,识别所述目标时序帧的图像数据中运动对象的动作类型,包括:
    将所述目标时序帧的时序运动特征图输入至残差网络层中,得到所述目标时序帧的图像数据的动作特征信息;
    将所述动作特征信息输入至动作分类网络层中,识别所述目标时序帧的图像数据中运动对象的动作类型。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    将所述动作特征信息确定为所述目标时序帧的图像数据在不同卷积通道上的原始子特征图;
    重新执行根据所述目标时序帧在各所述卷积通道上的原始子特征图,以及与所述目标时序帧相邻的后一时序帧在各所述卷积通道上的原始子特征图,计算所述目标时序帧在各所述卷积通道上的运动信息权重的步骤。
  7. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在得到各个时序帧的图像数据中运动对象的动作类型后,根据各所述时序帧的动作类型,确定所述视频数据对应的动作类型。
  8. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    获取视频样本,其中所述视频样本包括多张不同样本时序帧的图像样本以及各样本时序帧的图像样本中运动对象的标准动作类型;
    通过多通道卷积层获取各所述图像样本在不同卷积通道上的原始子特征图样本;
    分别以每个所述样本时序帧作为目标样本时序帧,获取目标样本时序帧的原始子特征图样本以及后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息;
    通过激活函数将各卷积通道上样本差异信息映射为目标样本时序帧在各卷积通道上的运动信息权重样本;
    根据目标样本时序帧在各卷积通道上的运动信息权重样本以及原始子特征图样本,获取目标样本时序帧在各卷积通道上的运动信息特征图样本;
    对目标样本时序帧在各卷积通道上的运动信息特征图样本进行时序卷积,得到目标样本时序帧在各卷积通道上的时序运动特征图样本;
    根据目标样本时序帧在各卷积通道的时序运动特征图样本获取目标样本时序帧的图像样本中运动对象的预测动作类型;
    根据所述预测动作类型以及目标样本时序帧的标准动作类型间的差异,调整所述多通道卷积层、所述激活函数以及时序卷积核的参数,继续训练直至满足训练结束条件。
  9. 一种动作识别方法,其特征在于,由计算机设备执行,包括:
    获取实时的监控视频数据;
    提取所述监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
    分别以每个所述时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
    根据所述运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
    对所述运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;及
    根据时序运动特征图,识别目标时序帧的图像数据中运动对象的动作类型;
    将所述动作类型确定为当前所述监控视频数据中运动对象的动作信息。
  10. 一种动作识别装置,其特征在于,设置于计算机设备中,包括:
    图像获取模块,用于获取视频数据在不同时序帧的图像数据,通过多通道卷积层获取各所述时序帧的图像数据在不同卷积通道上的原始子特征图;
    权重获取模块,用于分别以每个所述时序帧作为目标时序帧,根据所述目标时序帧在各所述卷积通道上的原始子特征图,以及与所述目标时序帧相邻的后一时序帧在各所述卷积通道上的原始子特征图,计算所述目标时序帧在各所述卷积通道上的运动信息权重;
    特征确定模块,用于根据所述运动信息权重以及所述目标时序帧在各所述卷积通道上的原始子特征图,获取所述目标时序帧在各所述卷积通道上的运动信息特征图;
    时序交互模块,用于对所述运动信息特征图进行时序卷积,得到所述目标时序帧在各卷积通道上的时序运动特征图;及
    动作识别模块,用于根据所述时序运动特征图识别目标时序帧的图像数据中运动对象的动作类型。
  11. 根据权利要求10所述的装置,其特征在于,所述权重获取模块,包括:
    差异信息获取模块,用于获取所述目标时序帧的原始子特征图与所述后一时序帧的原始子特征图在各所述卷积通道上的差异信息;
    权重映射模块,用于通过激活函数,将各所述卷积通道上差异信息映射为所述目标时序帧在各所述卷积通道上的运动信息权重。
  12. 根据权利要求11所述的装置,其特征在于,所述差异信息获取模块,还用于:
    通过单位池化层,分别将所述目标时序帧在各卷积通道上的原始子特征图、以及所述后一时序帧在各所述卷积通道上的原始子特征图变换为单位子特征图;
    分别对目标时序帧的所述单位子特征图以及所述后一时序帧的所述单位子特征图进行预设缩放倍数的降维,得到降维后的单位子特征图;
    获取所述目标时序帧降维后的单位子特征图与所述后一时序帧降维后的单位子特征图间的降维差异信息;
    对所述降维差异信息进行所述预设缩放倍数的升维,得到所述目标时序帧的原始子特征图与所述后一时序帧的原始子特征图在各卷积通道上的差异信息。
  13. 根据权利要求10所述的装置,其特征在于,所述时序交互模块还用于:
    分别获取与目标时序帧相邻的前一时序帧在各卷积通道的运动信息特征图、以及所述后一时序帧在各卷积通道的运动信息特征图;
    利用时序卷积核,对目标时序帧、所述前一时序帧以及所述后一时序帧在同一卷积通道的运动信息特征图进行卷积运算,得到所述目标时序帧在各所述卷积通道上的时序运动特征图。
  14. 根据权利要求10所述的装置,其特征在于,所述动作识别模块还用于将所述目标时序帧的时序运动特征图输入至残差网络层中,得到所述目标时序帧的图像数据的动作特征信息;将所述动作特征信息输入至动作分类网络层中,识别所述目标时序帧的图像数据中运动对象的动作类型。
  15. 根据权利要求14所述的装置,其特征在于,所述时序交互模块,还用于将所述动作特征信息确定为所述目标时序帧的图像数据在不同卷积通道上的原始子特征图;并使得所述权重获取模块重新执行根据所述目标时序帧在各所述卷积通道上的原始子特征图,以及与所述目标时序帧相邻的后一时序帧在各所述卷积通道上的原始子特征图,计算所述目标时序帧在各所述卷积通道上的运动信息权重。
  16. 根据权利要求10所述的装置,其特征在于,所述动作识别模块,还用于在得到各个时序帧的图像数据中运动对象的动作类型后,根据各所述时序帧的动作类型,确定所述视频数据对应的动作类型。
  17. 根据权利要求11所述的装置,其特征在于,所述装置还包括:
    训练模块,用于获取视频样本,其中所述视频样本包括多张不同样本时序帧的图像样本以及各样本时序帧的图像样本中运动对象的标准动作类型;通过多通道卷积层获取各所述图像样本在不同卷积通道上的原始子特征图样本;分别以每个所述样本时序帧作为目标样本时序帧,获取目标样本时序帧的原始子特征图样本以及后一样本时序帧的原始子特征图样本在各卷积通道上的样本差异信息;通过激活函数将各卷积通道上样本差异信息映射为目标样本时序帧在各卷积通道上的运动信息权重样本;根据目标样本时序帧在各卷积通道上的运动信息权重样本以及原始子特征图样本,获取目标样本时序帧在各卷积通道上的运动信息特征图样本;对目标样本时序帧在各卷积通道上的运动信息特征图样本进行时序卷积,得到目标样本时序帧在各卷积通道上的时序运动特征图样本;根据目标样本时序帧在各卷积通道的时序运动特征图样本获取目标样本时序帧的图像样本中运动对象的预测动作类型;根据所述预测动作类型以及目标样本时序帧的标准动作类型间的差异,调整所述多通道卷积层、所述激活函数以及时序卷积核的参数,继续训练直至满足训练结束条件。
  18. 一种动作识别装置,其特征在于,设置于计算机设备中,包括:
    图像获取模块,用于获取实时的监控视频数据;提取所述监控视频数据在不同时序帧的图像数据,通过多通道卷积层获取各时序帧的图像数据在不同卷积通道上的原始子特征图;
    权重获取模块,用于分别以每个所述时序帧作为目标时序帧,根据目标时序帧在各卷积通道上的原始子特征图,以及与目标时序帧相邻的后一时序帧在各卷积通道上的原始子特征图,计算目标时序帧在各卷积通道上的运动信息权重;
    特征确定模块,用于根据所述运动信息权重以及目标时序帧在各卷积通道上的原始子特征图,获取目标时序帧在各卷积通道上的运动信息特征图;
    时序交互模块,用于对所述运动信息特征图进行时序卷积,得到目标时序帧在各卷积通道上的时序运动特征图;及
    动作识别模块,用于根据时序运动特征图,识别目标时序帧的图像数据中运动对象的动作类型;将所述动作类型确定为当前所述监控视频数据中运动对象的动作信息。
  19. 一个或多个计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1至9中任一项所述方法的步骤。
  20. 一种计算机设备,包括存储器和一个或多个处理器,所述存储器存储有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1至9中任一项所述方法的步骤。
PCT/CN2020/120076 2019-11-20 2020-10-10 动作识别方法、装置、计算机存储介质和计算机设备 WO2021098402A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP20888898.2A EP3992846A4 (en) 2019-11-20 2020-10-10 ACTION RECOGNITION METHOD AND APPARATUS, COMPUTER STORAGE MEDIUM AND COMPUTER DEVICE
KR1020227005895A KR20220038434A (ko) 2019-11-20 2020-10-10 액션 인식 방법 및 장치, 컴퓨터 저장 매체, 및 컴퓨터 디바이스
JP2022516004A JP7274048B2 (ja) 2019-11-20 2020-10-10 動作認識方法、装置、コンピュータプログラム及びコンピュータデバイス
US17/530,428 US11928893B2 (en) 2019-11-20 2021-11-18 Action recognition method and apparatus, computer storage medium, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911143008.2A CN110866509B (zh) 2019-11-20 2019-11-20 动作识别方法、装置、计算机存储介质和计算机设备
CN201911143008.2 2019-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/530,428 Continuation US11928893B2 (en) 2019-11-20 2021-11-18 Action recognition method and apparatus, computer storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2021098402A1 true WO2021098402A1 (zh) 2021-05-27

Family

ID=69655231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120076 WO2021098402A1 (zh) 2019-11-20 2020-10-10 动作识别方法、装置、计算机存储介质和计算机设备

Country Status (6)

Country Link
US (1) US11928893B2 (zh)
EP (1) EP3992846A4 (zh)
JP (1) JP7274048B2 (zh)
KR (1) KR20220038434A (zh)
CN (1) CN110866509B (zh)
WO (1) WO2021098402A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866509B (zh) 2019-11-20 2023-04-28 腾讯科技(深圳)有限公司 动作识别方法、装置、计算机存储介质和计算机设备
JP7297705B2 (ja) * 2020-03-18 2023-06-26 株式会社東芝 処理装置、処理方法、学習装置およびプログラム
CN111835448B (zh) * 2020-07-27 2022-05-24 上海挚想科技有限公司 多通道的通信时序控制方法及系统
CN112668410B (zh) * 2020-12-15 2024-03-29 浙江大华技术股份有限公司 分拣行为检测方法、系统、电子装置和存储介质
CN112749666B (zh) * 2021-01-15 2024-06-04 百果园技术(新加坡)有限公司 一种动作识别模型的训练及动作识别方法与相关装置
CN112633260B (zh) * 2021-03-08 2021-06-22 北京世纪好未来教育科技有限公司 视频动作分类方法、装置、可读存储介质及设备
CN113111842B (zh) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN113408585A (zh) * 2021-05-21 2021-09-17 上海师范大学 一种基于人工智能的智能印章移动检测方法
CN114997228B (zh) * 2022-05-30 2024-05-03 平安科技(深圳)有限公司 基于人工智能的动作检测方法、装置、计算机设备及介质
WO2024039225A1 (en) * 2022-08-18 2024-02-22 Samsung Electronics Co., Ltd. Method and electronic device of predicting next event in episode
CN116719420B (zh) * 2023-08-09 2023-11-21 世优(北京)科技有限公司 一种基于虚拟现实的用户动作识别方法及系统
CN117649630B (zh) * 2024-01-29 2024-04-26 武汉纺织大学 一种基于监控视频流的考场作弊行为识别方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061251A1 (en) * 2015-08-28 2017-03-02 Beijing Kuangshi Technology Co., Ltd. Liveness detection method, liveness detection system, and liveness detection device
CN110348345A (zh) * 2019-06-28 2019-10-18 西安交通大学 一种基于动作连贯性的弱监督时序动作定位方法
CN110362715A (zh) * 2019-06-28 2019-10-22 西安交通大学 一种基于图卷积网络的未剪辑视频动作时序定位方法
CN110427807A (zh) * 2019-06-21 2019-11-08 诸暨思阔信息科技有限公司 一种时序事件动作检测方法
CN110866509A (zh) * 2019-11-20 2020-03-06 腾讯科技(深圳)有限公司 动作识别方法、装置、计算机存储介质和计算机设备

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
WO2017000116A1 (zh) * 2015-06-29 2017-01-05 北京旷视科技有限公司 活体检测方法、活体检测系统以及计算机程序产品
US10402658B2 (en) * 2016-11-03 2019-09-03 Nec Corporation Video retrieval system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation
US9877056B1 (en) * 2016-11-10 2018-01-23 Google Inc. Compressed media with still images selected from a video stream
US10417498B2 (en) * 2016-12-30 2019-09-17 Mitsubishi Electric Research Laboratories, Inc. Method and system for multi-modal fusion model
CN107463949B (zh) * 2017-07-14 2020-02-21 北京协同创新研究院 一种视频动作分类的处理方法及装置
JP7002729B2 (ja) * 2017-07-31 2022-01-20 株式会社アイシン 画像データ生成装置、画像認識装置、画像データ生成プログラム、及び画像認識プログラム
CN108230359B (zh) * 2017-11-12 2021-01-26 北京市商汤科技开发有限公司 目标检测方法和装置、训练方法、电子设备、程序和介质
US10706508B2 (en) * 2018-03-29 2020-07-07 Disney Enterprises, Inc. Adaptive sampling in Monte Carlo renderings using error-predicting neural networks
CN109145150B (zh) 2018-06-15 2021-02-12 深圳市商汤科技有限公司 目标匹配方法及装置、电子设备和存储介质
CN108769535B (zh) * 2018-07-04 2021-08-10 腾讯科技(深圳)有限公司 图像处理方法、装置、存储介质和计算机设备
CN109086873B (zh) * 2018-08-01 2021-05-04 北京旷视科技有限公司 递归神经网络的训练方法、识别方法、装置及处理设备
CN109379550B (zh) * 2018-09-12 2020-04-17 上海交通大学 基于卷积神经网络的视频帧率上变换方法及系统
CN109344764A (zh) * 2018-09-28 2019-02-15 大连民族大学 度量视频连续帧与其卷积特征图间差异的系统及装置
CN109389588A (zh) * 2018-09-28 2019-02-26 大连民族大学 度量视频连续帧与其卷积特征图间差异的方法
CN109993096B (zh) * 2019-03-26 2022-12-20 东北大学 一种面向视频目标检测的光流多层帧特征传播及聚合方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061251A1 (en) * 2015-08-28 2017-03-02 Beijing Kuangshi Technology Co., Ltd. Liveness detection method, liveness detection system, and liveness detection device
CN110427807A (zh) * 2019-06-21 2019-11-08 诸暨思阔信息科技有限公司 一种时序事件动作检测方法
CN110348345A (zh) * 2019-06-28 2019-10-18 西安交通大学 一种基于动作连贯性的弱监督时序动作定位方法
CN110362715A (zh) * 2019-06-28 2019-10-22 西安交通大学 一种基于图卷积网络的未剪辑视频动作时序定位方法
CN110866509A (zh) * 2019-11-20 2020-03-06 腾讯科技(深圳)有限公司 动作识别方法、装置、计算机存储介质和计算机设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3992846A4

Also Published As

Publication number Publication date
EP3992846A1 (en) 2022-05-04
CN110866509A (zh) 2020-03-06
US20220076002A1 (en) 2022-03-10
US11928893B2 (en) 2024-03-12
KR20220038434A (ko) 2022-03-28
JP2022551396A (ja) 2022-12-09
CN110866509B (zh) 2023-04-28
EP3992846A4 (en) 2022-10-26
JP7274048B2 (ja) 2023-05-15

Similar Documents

Publication Publication Date Title
WO2021098402A1 (zh) 动作识别方法、装置、计算机存储介质和计算机设备
JP7208408B2 (ja) 検出モデルのトレーニング方法、装置、コンピュータデバイス及びコンピュータプログラム
AU2019213369B2 (en) Non-local memory network for semi-supervised video object segmentation
US20160133297A1 (en) Dynamic Video Summarization
CN109657533A (zh) 行人重识别方法及相关产品
CN111402294A (zh) 目标跟踪方法、装置、计算机可读存储介质和计算机设备
Grinciunaite et al. Human pose estimation in space and time using 3d cnn
Dundar et al. Unsupervised disentanglement of pose, appearance and background from images and videos
CN113435432B (zh) 视频异常检测模型训练方法、视频异常检测方法和装置
CN110930434A (zh) 目标对象跟踪方法、装置、存储介质和计算机设备
CN113159200B (zh) 对象分析方法、装置及存储介质
Atto et al. Timed-image based deep learning for action recognition in video sequences
Zhang et al. Exploring event-driven dynamic context for accident scene segmentation
Tsoli et al. Patch-based reconstruction of a textureless deformable 3d surface from a single rgb image
Luvizon et al. Adaptive multiplane image generation from a single internet picture
US20240046471A1 (en) Three-dimensional medical image recognition method and apparatus, device, storage medium, and product
DE102020207974B4 (de) Systeme und verfahren zum nachweis von bewegung während 3d-datenrekonstruktion
Zhang et al. Video extrapolation in space and time
GB2572435A (en) Manipulating a face in an image
Chen et al. MICPL: Motion-Inspired Cross-Pattern Learning for Small-Object Detection in Satellite Videos
CN110705513A (zh) 视频特征提取方法、装置、可读存储介质和计算机设备
Prajapati et al. Mri-gan: A generalized approach to detect deepfakes using perceptual image assessment
CN115731263A (zh) 融合移位窗口注意力的光流计算方法、系统、设备及介质
CN111915713A (zh) 一种三维动态场景的创建方法、计算机设备、存储介质
CN113807330B (zh) 面向资源受限场景的三维视线估计方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20888898

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 20 888 898.2

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2020888898

Country of ref document: EP

Effective date: 20220127

ENP Entry into the national phase

Ref document number: 20227005895

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2022516004

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE