WO2021017606A1 - Video processing method, apparatus, electronic device and storage medium - Google Patents

Video processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2021017606A1
WO2021017606A1 · PCT/CN2020/093077 · CN2020093077W
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
video
video frame
module
target
Prior art date
Application number
PCT/CN2020/093077
Other languages
English (en)
French (fr)
Inventor
易阳
李峰
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to EP20847097.1A priority Critical patent/EP4006772A4/en
Publication of WO2021017606A1 publication Critical patent/WO2021017606A1/zh
Priority to US17/343,088 priority patent/US20210326597A1/en

Classifications

    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/084 Learning methods: backpropagation, e.g. using gradient descent
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. tracking of corners or segments, involving models
    • G06V 10/764 Image or video recognition using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V 10/82 Image or video recognition using pattern recognition or machine learning: neural networks
    • G06V 20/20 Scene-specific elements in augmented reality scenes
    • G06V 20/40 Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic understanding of sport video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06N 3/04 Neural network architecture, e.g. interconnection topology
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; image merging

Definitions

  • This application relates to the field of computer technology, in particular to video processing technology.
  • In related art, video processing and understanding methods are usually based on a neural network model that extracts features from each frame of a video and then inputs the per-frame features into a pre-trained classifier to determine the category of the moving target in the video.
  • Such methods have low recognition accuracy for moving targets.
  • The embodiments of the present application provide a video processing method, apparatus, electronic device, and storage medium to improve the accuracy of recognizing moving targets in a video.
  • an embodiment of the present application provides a video processing method, including:
  • acquiring a video frame sequence containing a moving target, and obtaining, through a trained neural network model and according to the video frame sequence, a motion state feature that characterizes the moving target as expressed in the time sequence of the video frame sequence;
  • obtaining a matching result between the motion state feature of the moving target and the motion state feature of a specified target.
  • The neural network model includes multiple hierarchical modules, at least one multi-core time-domain processing module, and an average pooling layer; each of the at least one multi-core time-domain processing module is set between two adjacent level modules of the multiple level modules, and the average pooling layer is located after the last level module;
  • obtaining the characteristics of the motion state of the moving target represented by the time sequence of the video frame sequence through a trained neural network model specifically includes:
  • The first feature data corresponding to each video frame in the video frame sequence is extracted step by step from the input data through the level modules at all levels, and each piece of first feature data contains spatial features characterizing the moving target in the corresponding video frame.
  • the input data of the first-level hierarchical module includes the video frame sequence, and the input data of the other hierarchical modules is the data output by the hierarchical module at the upper level or the multi-core time domain processing module;
  • the target pixel in the first feature data output by the target level module is convolved in the time dimension to obtain the corresponding second feature data.
  • the second feature data includes time-series features that characterize the moving target in the time dimension;
  • The target level module is a level module located at the upper level of the multi-core time domain processing module, and the target pixels are pixels with the same position in the pieces of first feature data output by the target level module;
  • The convolution processing performed in the time dimension on the target pixels in the first feature data output by the target level module, to obtain the second feature data corresponding to each video frame, specifically includes:
  • According to the time information of each video frame, first time-domain feature data corresponding to each target pixel in the time dimension is determined from the first feature data output by the target level module;
  • According to the corresponding position of each target pixel in the first feature data, the second feature data corresponding in the spatial dimension to the pixels that have the same time information in each piece of second time-domain feature data is determined.
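The first dimension transform described above can be sketched in NumPy as follows. This is an illustrative reconstruction under assumed tensor shapes, with stacked frame features of shape (n*t, c, h, w); the function name is hypothetical and this is not the patent's actual implementation:

```python
import numpy as np

def to_temporal_sequences(first_feature, n, t):
    """First dimension transform (a sketch): for every spatial position
    (target pixel), gather the feature values of that same pixel across
    the t frames, yielding per-pixel time series (the first time-domain
    feature data).

    first_feature: (n*t, c, h, w) stacked per-frame feature maps.
    returns: (n*h*w, c, t) time series, one row group per target pixel.
    """
    nt, c, h, w = first_feature.shape
    assert nt == n * t, "leading dimension must be batch * frames"
    x = first_feature.reshape(n, t, c, h, w)
    # Move the time axis last so each (pixel, channel) row is a t-length series.
    x = x.transpose(0, 3, 4, 2, 1)      # (n, h, w, c, t)
    return x.reshape(n * h * w, c, t)
```

A 1-D temporal convolution applied to the last axis of the result then operates on pixels that share a spatial position but come from different frames, which is what gives the second feature data its time-series character.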
  • the performing convolution processing on each first temporal feature data to obtain the corresponding second temporal feature data specifically includes:
  • A first preset number of one-dimensional convolution layers with different convolution kernel sizes are used to perform convolution processing on the first time-domain feature data, to obtain the second time-domain feature data corresponding to the first time-domain feature data.
  • the one-dimensional convolutional layer is a one-dimensional Depthwise convolutional layer.
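A minimal NumPy sketch of the multi-kernel temporal convolution: several one-dimensional depthwise convolutions with different kernel sizes are applied to the same per-pixel time series and their outputs are summed. The identity-style kernels below are an assumption made only to keep the example deterministic; a trained layer would learn its kernel weights, and the summed aggregation is likewise illustrative:

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """1-D depthwise convolution: each channel is convolved with its own
    kernel, zero-padded so the time length t is preserved.
    x: (c, t); kernel: (c, k) with odd k."""
    c, t = x.shape
    k = kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(t):
        out[:, i] = (xp[:, i:i + k] * kernel).sum(axis=1)
    return out

def multi_kernel_temporal_conv(x, kernel_sizes=(1, 3, 5, 7)):
    """Apply depthwise 1-D convolutions with several kernel sizes to the
    same time series and sum the results (multi-kernel aggregation)."""
    c, _ = x.shape
    y = np.zeros_like(x)
    for k in kernel_sizes:
        kern = np.zeros((c, k))
        kern[:, k // 2] = 1.0   # identity kernel: demo assumption only
        y += depthwise_conv1d(x, kern)
    return y
```

Kernels of different sizes see different temporal receptive fields, so both short and long movements can be captured at once.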
  • the video processing method in the embodiment of the present application further includes:
  • Before the convolution processing, the number of channels of the feature data is reduced from a first value to a second value, and after the convolution processing, the number of channels of the second feature data is restored from the second value to the first value.
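The channel reduction and restoration can be pictured as a pair of 1x1 projections placed around the temporal convolution, which cuts the cost of the convolution itself. The linear form and the weight shapes below are assumptions for illustration:

```python
import numpy as np

def bottleneck(x, w_reduce, w_restore):
    """Reduce channels from a first value c1 to a second value c2 before
    the temporal convolution, then restore them to c1 afterwards.
    x: (c1, t); w_reduce: (c2, c1); w_restore: (c1, c2)."""
    reduced = w_reduce @ x          # (c2, t): cheaper to convolve
    restored = w_restore @ reduced  # (c1, t): channel count restored
    return reduced, restored
```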
  • the acquiring a video frame sequence containing a moving target specifically includes:
  • A third preset number of video frames are extracted from the video to be processed, and the extracted third preset number of video frames are determined as the video frame sequence.
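One common way to realize this frame extraction is uniform sampling over the video's length, sketched below. The evenly-spaced strategy is an assumption, as the text only fixes the number of extracted frames:

```python
import numpy as np

def sample_frames(num_frames_in_video, num_to_extract):
    """Pick a fixed number of frame indices spread evenly over the video,
    preserving the temporal order of the motion while bounding input size."""
    idx = np.linspace(0, num_frames_in_video - 1, num_to_extract)
    return idx.round().astype(int).tolist()
```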
  • the obtaining the matching result of the motion state feature of the moving target and the motion state feature of the specified target specifically includes:
  • When the matching result indicates that the motion state feature of the moving target matches the motion state feature of a specified target, the moving target is determined to be the specified target.
  • the probability that the motion target belongs to the action category corresponding to each specified target is obtained through a trained classifier, and the classifier is trained according to the motion state characteristics of the specified target .
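The matching step can be sketched as a softmax classifier over the motion state feature with a confidence threshold. The linear classifier and the 0.5 default threshold are illustrative assumptions; the text only requires a classifier trained on the specified targets' motion state features:

```python
import numpy as np

def match_specified_targets(motion_feature, class_weights, threshold=0.5):
    """Map the extracted motion state feature to a probability per action
    category; if the top probability clears the threshold, the moving
    target is taken to be that specified target."""
    logits = class_weights @ motion_feature
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    probs = exp / exp.sum()
    best = int(probs.argmax())
    if probs[best] >= threshold:
        return best, probs
    return None, probs
```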
  • the neural network model is trained in the following manner:
  • each video sample in the video sample set includes a video frame sequence marked with a category identifier, and the category identifier is used to characterize an action category corresponding to a moving target contained in the video frame sequence;
  • the neural network model is used to obtain the characteristics of the movement state that characterizes the moving target in the time sequence of the video samples
  • a classifier is used to determine the predicted probability that the motion target contained in the video sample belongs to each action category
  • According to the predicted probabilities and the annotated category identifiers, the weight parameters of the neural network model and the classifier are optimized.
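The training loop described above (forward pass, predicted probabilities, optimization of the weights against the category identifiers) can be sketched with a single linear layer standing in for the full network. Softmax plus cross-entropy with a plain gradient step is an assumption, since the text does not fix the loss or the optimizer:

```python
import numpy as np

def training_step(w, x, label, lr=0.1):
    """One optimization step: forward pass, softmax prediction,
    cross-entropy loss against the category identifier, gradient update.
    w: (num_classes, dim); x: (dim,); label: int class index."""
    logits = w @ x
    exp = np.exp(logits - logits.max())
    p = exp / exp.sum()
    loss = -np.log(p[label])
    grad_logits = p.copy()
    grad_logits[label] -= 1.0               # d(loss)/d(logits) for softmax+CE
    w_new = w - lr * np.outer(grad_logits, x)
    return w_new, loss
```

Repeated steps on labelled video samples drive the predicted probability for the annotated category upward.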
  • the neural network model includes multiple level modules, at least one multi-core time-domain processing module, and an average pooling layer, and each of the at least one multi-core time-domain processing module is set separately Between two adjacent level modules of the plurality of level modules, the average pooling layer is located after the last level module;
  • the neural network model is used to obtain the characteristics of the movement state of the moving target expressed in the time sequence of the video samples, which specifically includes:
  • The first feature data corresponding to each video frame in the video sample is extracted step by step from the input data through the level modules at all levels, and each piece of first feature data contains spatial features characterizing the moving target contained in the video sample.
  • the input data of the first level module includes the video samples, and the input data of the other levels of the module is the output of the level module at the upper level or the multi-core time domain processing module data;
  • the target pixel in the first feature data output by the target level module is convolved in the time dimension to obtain the corresponding second feature data.
  • The second feature data contains the temporal features that characterize the moving target in the time dimension; through the average pooling layer, average pooling is performed on the feature data output by the last level module to obtain the motion state features of the moving target.
  • An embodiment of the present application provides a neural network model, including multiple hierarchical modules, at least one multi-core time domain processing module, and an average pooling layer; each of the at least one multi-core time-domain processing module is respectively arranged between two adjacent level modules of the multiple level modules, and the average pooling layer is located after the last level module;
  • Each level module is used to extract and output first feature data corresponding to each video frame in the video frame sequence from the input data, and each piece of the first feature data includes spatial features characterizing the moving target in the corresponding video frame;
  • the input data of the first-level module includes the video frame sequence, and the input data of the other levels of the module is the data output by the higher-level module or the multi-core time-domain processing module;
  • the multi-core time domain processing module is configured to perform convolution processing on the time dimension of the target pixel in the first feature data output by the target level module according to the time information of each video frame, to obtain the corresponding second feature data respectively,
  • Each second feature data includes a time sequence feature that characterizes the moving target in the time dimension;
  • The target level module is a level module located at the upper level of the multi-core time domain processing module, and the target pixels are pixels with the same position in the first feature data output by the target level module;
  • the average pooling layer is used to perform average pooling processing on the feature data output by the last-level hierarchical module to obtain the motion state feature of the moving target.
  • the multi-core time-domain processing module includes: a first dimensional transformation layer, a multi-core time-domain convolution layer, and a second dimensional transformation layer;
  • the first dimension transformation layer is configured to determine, according to the time information of each video frame, the first time domain characteristic data corresponding to the target pixel in the time dimension in the first characteristic data output by the target level module;
  • the multi-core time-domain convolution layer is used to perform convolution processing on the first time-domain feature data corresponding to the target pixel for each target pixel to obtain second time-domain feature data;
  • The second dimension transformation layer is used to determine, according to the corresponding position of each target pixel in the first feature data, the second feature data corresponding in the spatial dimension to the pixels with the same time information in each piece of second time-domain feature data.
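Putting the three layers together, the whole multi-core time-domain processing module is shape-preserving, so it can be inserted between two adjacent level modules without disturbing them. The sketch below uses an identity placeholder where the learned multi-kernel convolution would run; names and shapes are assumptions:

```python
import numpy as np

def multi_kernel_temporal_block(features, n, t):
    """MKTB sketch: (1) first dimension transform (n*t, c, h, w) ->
    (n*h*w, c, t); (2) temporal convolution (identity placeholder here);
    (3) second dimension transform back to (n*t, c, h, w)."""
    nt, c, h, w = features.shape
    x = features.reshape(n, t, c, h, w).transpose(0, 3, 4, 2, 1)
    x = x.reshape(n * h * w, c, t)
    # ... learned multi-kernel depthwise temporal convolution would go here ...
    x = x.reshape(n, h, w, c, t).transpose(0, 4, 3, 1, 2)
    return x.reshape(n * t, c, h, w)
```

Because input and output shapes match, the surrounding hierarchical modules need no modification.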
  • an embodiment of the present application provides a video processing device, including:
  • the acquisition module is used to acquire a video frame sequence containing a moving target
  • the feature extraction module is configured to obtain, according to the video frame sequence, through a trained neural network model, characteristics of the motion state that characterizes the moving target expressed in the time sequence of the video frame sequence;
  • the matching module is used to obtain the matching result of the motion state feature of the moving target and the motion state feature of the specified target.
  • the neural network model includes multiple level modules, at least one multi-core time-domain processing module, and an average pooling layer, and each of the at least one multi-core time-domain processing module is set separately Between two adjacent level modules of the plurality of level modules, the average pooling layer is located after the last level module;
  • the feature extraction module is specifically used for:
  • The first feature data corresponding to each video frame in the video frame sequence is extracted step by step from the input data through the level modules at all levels, and each piece of first feature data contains spatial features characterizing the moving target in the corresponding video frame.
  • the input data of the first-level hierarchical module includes the video frame sequence, and the input data of the other hierarchical modules is the output data of the hierarchical module at the upper level or the multi-core time domain processing module;
  • the target pixel in the first feature data output by the target level module is convolved in the time dimension to obtain the corresponding second feature data.
  • the second feature data includes time-series features that characterize the moving target in the time dimension;
  • The target level module is a level module located at the upper level of the multi-core time domain processing module, and the target pixels are pixels with the same position in the pieces of first feature data output by the target level module;
  • the feature extraction module is specifically configured to:
  • According to the time information of each video frame, first time-domain feature data corresponding to each target pixel in the time dimension is determined from the first feature data output by the target level module;
  • According to the corresponding position of each target pixel in the first feature data, the second feature data corresponding in the spatial dimension to the pixels that have the same time information in each piece of second time-domain feature data is determined.
  • the feature extraction module is specifically configured to:
  • A first preset number of one-dimensional convolution layers with different convolution kernel sizes are used to perform convolution processing on the first time-domain feature data, to obtain the second time-domain feature data corresponding to the first time-domain feature data.
  • the one-dimensional convolutional layer is a one-dimensional Depthwise convolutional layer.
  • the feature extraction module is further used for:
  • Before the convolution processing, the number of channels of the feature data is reduced from a first value to a second value, and after the convolution processing, the number of channels of the second feature data is restored from the second value to the first value.
  • the obtaining module is specifically used for:
  • A third preset number of video frames are extracted from the video to be processed, and the extracted third preset number of video frames are determined as the video frame sequence.
  • the matching module is specifically used for:
  • When the matching result indicates that the motion state feature of the moving target matches the motion state feature of a specified target, the moving target is determined to be the specified target.
  • the probability that the motion target belongs to the action category corresponding to each specified target is obtained through a trained classifier, and the classifier is trained according to the motion state characteristics of the specified target .
  • The video processing device of the embodiment of the present application further includes a training module, and the training module is used to:
  • each video sample in the video sample set includes a video frame sequence marked with a category identifier, and the category identifier is used to characterize an action category corresponding to a moving target contained in the video frame sequence;
  • the neural network model is used to obtain the characteristics of the movement state that characterizes the moving target in the time sequence of the video samples
  • a classifier is used to determine the predicted probability that the motion target contained in the video sample belongs to each action category
  • According to the predicted probabilities and the annotated category identifiers, the weight parameters of the neural network model and the classifier are optimized.
  • the neural network model includes multiple level modules, at least one multi-core time-domain processing module, and an average pooling layer, and each of the at least one multi-core time-domain processing module is set separately Between two adjacent level modules of the plurality of level modules, the average pooling layer is located after the last level module;
  • the training module is specifically used for:
  • the neural network model is used to obtain the characteristics of the movement state of the moving target expressed in the time sequence of the video samples, which specifically includes:
  • The first feature data corresponding to each video frame in the video sample is extracted step by step from the input data through the level modules at all levels, and each piece of first feature data contains spatial features characterizing the moving target contained in the video sample.
  • the input data of the first level module includes the video samples, and the input data of the other levels of the module is the output of the level module at the upper level or the multi-core time domain processing module data;
  • the target pixel in the first feature data output by the target level module is convolved in the time dimension to obtain the corresponding second feature data.
  • The second feature data contains the temporal features that characterize the moving target in the time dimension; through the average pooling layer, average pooling is performed on the feature data output by the last level module to obtain the motion state features of the moving target.
  • An embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the foregoing methods.
  • an embodiment of the present application provides a computer-readable storage medium having computer program instructions stored thereon, and when the computer program instructions are executed by a processor, the steps of any of the foregoing methods are implemented.
  • The technical solutions provided by the embodiments of this application can obtain, through the neural network model, the motion state features of the moving target as expressed in the time sequence of the video frame sequence.
  • The motion state features include both the spatial features of the moving target in each video frame extracted by the hierarchical modules and the temporal features of the moving target in the time dimension extracted by the multi-core time domain processing module; that is, more comprehensive feature information can be obtained from the video frame sequence, thereby improving the accuracy of recognizing moving targets.
  • FIG. 1A is a schematic diagram of an application scenario of a video processing method provided by an embodiment of the application
  • FIG. 1B is a schematic diagram of an application scenario of a video processing method provided by an embodiment of the application
  • FIG. 1C is a schematic diagram of an application scenario of a video processing method provided by an embodiment of the application.
  • FIG. 2 is a schematic diagram of the structure of a neural network model provided by an embodiment of the application.
  • FIG. 3A is a schematic structural diagram of a multi-core time domain processing module provided by an embodiment of the application.
  • FIG. 3B is a schematic structural diagram of a multi-core time domain processing module provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a Reshape operation performed by the first dimension transformation layer provided by an embodiment of the application
  • FIG. 5 is a schematic structural diagram of a multi-core time-domain convolutional layer provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a neural network model applied to an action recognition scene provided by an embodiment of the application
  • Figure 7A is a schematic diagram of the structure of the ResNet50 network
  • FIG. 7B is a schematic structural diagram of a neural network model obtained by using a ResNet50 network as a basic network provided by an embodiment of the application;
  • FIG. 8A is a training process of a neural network model provided by an embodiment of the application.
  • FIG. 8B is a training process of a neural network model provided by an embodiment of the application.
  • FIG. 8C is a training process of a neural network model provided by an embodiment of the application.
  • FIG. 9 is a schematic flowchart of a video processing method provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of a process for determining motion state characteristics through a trained neural network model provided by an embodiment of the application
  • FIG. 11 is a schematic diagram of a process for determining second characteristic data through a multi-core time domain processing module provided by an embodiment of the application;
  • FIG. 12 is a schematic diagram of a process for determining second characteristic data through a multi-core time domain processing module according to an embodiment of the application
  • Figure 13 is a visual analysis result of the intermediate results obtained by the neural network model
  • FIG. 14 is a schematic structural diagram of a video processing device provided by an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer Vision (CV) is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on targets, and performs further graphics processing so that the resulting images are more suitable for human observation or for transmission to instruments for detection.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional (3- dimension, 3D) technology, virtual reality, augmented reality, synchronous positioning and map construction and other technologies, as well as common face recognition, fingerprint recognition and other biometric recognition technologies.
  • feature extraction can be performed through Image Semantic Understanding (ISU).
  • The embodiments of the present application mainly relate to Machine Learning (ML), a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
  • machine learning can be used to train neural network models and classifiers, so that the neural network model can be used to determine the characteristics of the motion state, or the classifier can be used to predict the probability of the action category of the motion target.
  • Sign language recognition plays a very important role as part of human body language understanding. On the one hand, it is the main means of virtual reality human-computer interaction; on the other hand, it is an auxiliary tool for deaf-mute people to use computers to communicate with normal people.
  • Each sign language is composed of a sequence of gestures, and each gesture is composed of a sequence of hand changes.
  • the main task of sign language recognition is to determine the type of sign language to be recognized based on the extracted features of the sign language to be recognized, and then use a classifier to do classification.
  • sign language recognition systems can be divided into two types: camera (visual)-based sign language recognition systems and device input (such as data gloves, stylus, mouse, position tracker, etc.)-based sign language recognition systems.
  • sign language recognition methods mainly include template matching, neural network, Hidden Markov Model (HMM), and Dynamic Time Warping (DTW).
  • Pedestrian re-identification, also known as person re-identification, is a technology that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence: given a monitored pedestrian image, it retrieves images of that pedestrian across devices. It is designed to make up for the visual limitations of fixed cameras, can be combined with pedestrian recognition/pedestrian tracking technology, and can be widely used in intelligent video surveillance, intelligent security, and other fields.
  • the Multi-Kernel Temporal Block (MKTB) is used to enhance the timing features between feature maps corresponding to multiple video frames.
  • Reshape is a function that can re-adjust the number of rows, columns, and dimensions of the matrix.
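  • As an illustrative sketch (NumPy is used here only as a stand-in for whatever matrix library an implementation might use), Reshape regroups the same elements into a different number of rows, columns, and dimensions:

```python
import numpy as np

# A 2x6 matrix reshaped to 3 rows x 4 columns, and to a 3-D shape.
m = np.arange(12).reshape(2, 6)
r1 = m.reshape(3, 4)     # re-adjust the number of rows and columns
r2 = m.reshape(2, 2, 3)  # re-adjust the number of dimensions

# Reshape never changes the number or order of the elements,
# only how they are grouped into dimensions.
assert r1.shape == (3, 4) and r2.shape == (2, 2, 3)
assert np.array_equal(r1.ravel(), m.ravel())
```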
  • Batch is a hyperparameter in the neural network, which specifically refers to the number of samples processed by the neural network each time.
  • existing video processing and understanding methods usually use a neural network model to extract features from each frame of a video, and then input the features corresponding to each frame image into a pre-trained classifier to determine the category of the moving target in the video. Since the features extracted in this way are obtained from each frame image independently, they cannot reflect the continuity and relevance of the moving target in the time dimension, resulting in low recognition accuracy for the moving target.
  • Take gesture recognition as an example.
  • Each gesture action representing a specific meaning is composed of a series of continuous actions with a definite timing relationship. Recognizing a gesture requires obtaining a sequence of video frames containing this series of continuous actions.
  • The video frames in the sequence are arranged strictly according to the time order of the action. If feature extraction is performed on each video frame separately, as in the existing method, what is obtained is only the spatial feature of the hand at a certain moment; the features corresponding to the hand at different moments are independent of each other, and the timing relationship between the hand actions contained in the video frame sequence is lost. Therefore, the hand features obtained by the existing methods ignore the continuity and correlation of gesture actions in the time dimension, resulting in lower gesture recognition accuracy.
  • the improved neural network model includes multiple hierarchical modules, at least one multi-core time-domain processing module, and an average pooling layer, and each multi-core time-domain processing module is set between two adjacent hierarchical modules in the multiple hierarchical modules.
  • the average pooling layer is located after the last level module.
  • each level module is used to extract and output the first feature data corresponding to each video frame in the video frame sequence from the input data.
  • Each piece of first feature data contains the spatial features that characterize the moving target in the corresponding video frame.
  • the input data of the first hierarchical module includes the video frame sequence, and the input data of each other hierarchical module is the output data of the previous hierarchical module or of a multi-core time-domain processing module; the multi-core time-domain processing module is used to convolve, in the time dimension and according to the time information of each video frame, the target pixels in the first feature data output by the target hierarchical module, to obtain the corresponding second feature data.
  • Each piece of second feature data contains the timing features that characterize the moving target in the time dimension. The target hierarchical module is the hierarchical module located at the level immediately preceding the multi-core time-domain processing module, and the target pixels are pixels at the same position in the first feature data output by the target hierarchical module. The average pooling layer is used to perform average pooling on the feature data output by the last hierarchical module to obtain the motion state features of the moving target.
  • the neural network model can be used to obtain the characteristics of the motion state of the moving target expressed in the time sequence of the video frame sequence.
  • the motion state features include both the spatial features of the moving target in each video frame, extracted by the hierarchical modules, and the timing features of the moving target in the time dimension, extracted by the multi-core time-domain processing modules; that is, based on the above neural network model, more comprehensive feature information can be obtained from the video frame sequence, thereby improving the recognition accuracy of moving targets.
  • the video processing method in the embodiments of the present application can be applied to scenes of motion recognition, such as gesture recognition scenes, sign language recognition scenes, action interaction scenes, behavior recognition scenes, and so on.
  • the application scenario includes a terminal device 101 and a server 102.
  • the above-mentioned terminal device 101 is connected to the server 102 via a wireless or wired network.
  • the terminal device 101 is an electronic device capable of capturing images, such as a smart phone, tablet computer, smart robot, somatosensory game device, or VR (Virtual Reality) device.
  • the server 102 is one server, or a server cluster or cloud computing center composed of several servers.
  • the terminal device 101 collects the to-be-processed video containing the user, and then sends the collected to-be-processed video to the server 102.
  • the server 102 can directly perform action recognition on the user in the received to-be-processed video, determine the action category corresponding to the action performed by the user, determine the response data corresponding to the recognized action category according to the stored correspondence between action categories and response data, and send the response data to the terminal device 101.
  • the terminal device 101 executes the response data returned by the server.
  • the response data includes but is not limited to text data, audio data, image data, video data, voice broadcasts, control instructions, etc.
  • control instructions include but are not limited to: instructions to control the terminal device to display facial expressions, instructions to control the movement of the terminal device's action parts (such as leading the way, navigating, taking pictures, dancing, etc.), instructions to display props or special effects on the screen of the terminal device, instructions to control smart home devices, etc.
  • the application scenario shown in FIG. 1A can also be used in a sign language recognition scenario.
  • the terminal device 101 collects the to-be-processed video containing the user's gesture sign language, and then sends the collected video to the server 102.
  • the server 102 can directly perform action recognition on the user in the received to-be-processed video, determine the sign language category corresponding to the sign language action in the video, determine the semantic data corresponding to the recognized sign language category according to the stored correspondence between sign language categories and semantic data (in this case, the semantic data is the response data), and send the semantic data to the terminal device 101.
  • the semantic data may be text data or voice data.
  • the terminal device 101 plays the semantic data returned by the server, so that other users can learn the meaning of the user's sign language gestures, allowing people with speech or hearing impairments to communicate without barriers.
  • the method executed by the server 102 may also be executed on the terminal device 101.
  • the application scenario includes multiple terminal devices 111 (including terminal device 111-1, terminal device 111-2, ... terminal device 111-n) and server 112.
  • terminal devices 111 including terminal device 111-1, terminal device 111-2, ... terminal device 111-n
  • For example, when terminal device 111-1, terminal device 111-2, ... and terminal device 111-n interact through the server 112, the terminal device 111-1 captures the to-be-processed video containing user 1, and then sends the captured to-be-processed video to the server 112.
  • the server 112 can directly recognize the action of user 1 in the received to-be-processed video, determine the action category corresponding to the action performed by user 1, determine the response data corresponding to the recognized action category based on the stored correspondence between action categories and response data, and send the response data to the terminal device 111-1 and to the terminal devices 111-2, ..., 111-n interacting with it.
  • the terminal device that receives the response data executes it; the response data includes but is not limited to text data, audio data, image data, video data, and props or special effects to display. For example, in an Internet live broadcast scenario, the host in a live room performs a specified action.
  • the terminal device 111-1 sends the collected to-be-processed video containing the host performing the specified action to the server 112; the server 112 determines the action category corresponding to the specified action performed by the host in the video, determines the special effect corresponding to that action category, and then adds the corresponding special effect to the live data. The viewers' terminal devices (such as terminal devices 111-2, 111-3, ..., 111-n) pull the live data from the server 112 and display the corresponding special effect on the live screen.
  • the video processing method in the embodiments of the present application can also be applied to scenes that identify and track moving targets in videos, such as pedestrian re-identification scenes, surveillance security scenes, intelligent traffic scenes, and military target recognition scenes.
  • the embodiments of the present application mainly perform target recognition and tracking based on the characteristics of the motion state of the target (such as the posture of the human body).
  • the application scenario includes a monitoring device 121, a server 122, and a terminal device 123.
  • the server 122 is connected to the monitoring device 121 and the terminal device 123 via a wireless network.
  • the monitoring device 121 is an electronic device with the function of capturing images, such as a camera, a video camera, a video recorder, etc.
  • the terminal device 123 is an electronic device with network communication capabilities, which may be a smart phone, a tablet computer, a portable personal computer, etc.
  • the server 122 is one server, or a server cluster or cloud computing center composed of several servers.
  • the monitoring device 121 collects the video to be processed in real time, and then sends the collected video to be processed to the server 122.
  • the server 122 can directly recognize pedestrians in the received to-be-processed video, extract the features of each pedestrian contained in the video, compare the features of each pedestrian with the features of the target person, and determine whether the to-be-processed video contains the target person.
  • if the server 122 recognizes the target person, it marks the target person in the to-be-processed video and then sends the marked to-be-processed video to the terminal device 123, which can play it so that relevant personnel can track and analyze the target person in the video.
  • the method executed by the server 122 can also be executed on the terminal device 123.
  • the foregoing video to be processed may also be a video pre-recorded by the monitoring device 121.
  • the method provided in the embodiment of the present application is not limited to be used in the application scenarios shown in FIG. 1A, FIG. 1B, and FIG. 1C, and may also be used in other possible application scenarios, which is not limited in the embodiment of the present application.
  • the functions that can be implemented by each device in the application scenarios shown in FIG. 1A, FIG. 1B, and FIG. 1C will be described together in the subsequent method embodiments, and will not be repeated here.
  • the neural network model shown in Figure 2 includes: multiple hierarchical modules (for example, the first-level, second-level, third-level, and fourth-level hierarchical modules in Figure 2), at least one multi-core time-domain processing module (for example, the first and second multi-core time-domain processing modules in Figure 2), and an average pooling layer. Each multi-core time-domain processing module is arranged between two adjacent hierarchical modules among the multiple hierarchical modules, and the average pooling layer is located after the last hierarchical module.
  • Each hierarchical module in Figure 2 extracts, from its input data, the first feature data corresponding to each video frame in the video frame sequence and outputs it. Each piece of first feature data contains the spatial features of the moving target in the corresponding video frame.
  • the input data of the first-level hierarchical module includes a video frame sequence, and the first-level hierarchical module extracts and outputs first feature data corresponding to each video frame from the input video frame sequence.
  • the input data of the hierarchical modules other than the first is the first feature data output by the previous hierarchical module or the second feature data output by a multi-core time-domain processing module.
  • the first hierarchical module extracts the first feature data P1 corresponding to each video frame from the input video frame sequence and outputs it
  • the second hierarchical module processes the second feature data Q1 output by the first multi-core time-domain processing module to obtain and output the first feature data P2
  • the third hierarchical module processes the second feature data Q2 output by the second multi-core time-domain processing module to obtain and output the first feature data P3
  • the fourth hierarchical module processes the first feature data P3 to obtain and output the first feature data P4.
  • the feature data output by each level module is collectively referred to as the first feature data
  • the feature data output by each multi-core time domain processing module is collectively referred to as the second feature data.
  • a single hierarchical module can contain only one network layer, for example, a single convolutional layer; it can also include multiple identical or different network layers, for example, a convolutional layer and a max pooling layer, or several different convolutional layers.
  • the neural network model shown in Figure 2 is only an example. In actual applications, the structure of each hierarchical module, the number of hierarchical modules included in the neural network model, and the number and positions of the multi-core time-domain processing modules can be set according to actual needs, which is not limited in the embodiments of this application.
  • the video frame sequence input to the neural network model can be a continuous video, or it can be an image sequence obtained by arranging multiple discrete video frames intercepted from a video in a time sequence.
  • the video frame sequence is essentially a four-dimensional matrix (B×T, C, H, W), where B is the batch number (Batch), that is, the number of video frame sequences the neural network model can process at one time; T is the frame sequence length, that is, the number of video frames contained in one video frame sequence; C is the number of channels of the image; H is the height of the image; and W is the width of the image.
  • the image referred to is the video frame.
  • For example, when B = 2 and T = 8, the video frame sequence input to the neural network model is a four-dimensional matrix (2×8, 3, 224, 224). If the neural network model processes only one video frame sequence at a time, B can be set to 1, that is, the neural network model processes the T video frames of one video frame sequence at a time.
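  • The input layout described above can be illustrated with NumPy (used only for illustration; any tensor library behaves the same way), with the dimension values taken from the example in this paragraph:

```python
import numpy as np

B, T, C, H, W = 2, 8, 3, 224, 224  # values from the example above

# A batch of B video frame sequences, each with T RGB frames,
# stacked along the first axis as the model input (B*T, C, H, W).
frames = np.zeros((B * T, C, H, W), dtype=np.float32)
assert frames.shape == (16, 3, 224, 224)

# With B = 1 the model processes the T frames of a single sequence.
single = np.zeros((1 * T, C, H, W), dtype=np.float32)
assert single.shape == (8, 3, 224, 224)
```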
  • the first feature data corresponding to each video frame includes multiple two-dimensional pictures (ie, two-dimensional matrix (H, W)), and each two-dimensional picture is a feature map (feature map).
  • the number of feature maps included in a piece of feature data is equal to its number of channels. For example, if the dimensions of the first feature data output by the first-level hierarchical module are (16, 64, 112, 112), the first feature data corresponding to one video frame contains 64 feature maps, each of size 112×112. It should be noted that the dimensions and sizes of the first feature data corresponding to each video frame output by the same module are the same.
  • the second feature data corresponding to each video frame also includes multiple feature maps.
  • the input data of the multi-core time-domain processing module in Figure 2 is the first feature data corresponding to each video frame output by the target level module.
  • the multi-core time-domain processing module performs, according to the time information of each video frame, convolution processing in the time dimension on the target pixels in the first feature data corresponding to each video frame, to obtain the corresponding second feature data; each piece of second feature data includes timing features representing the moving target in the time dimension.
  • the target level module is a level module located at the upper level of the multi-core time-domain processing module, and the target pixel is a pixel with the same position in the first feature data output by the target level module.
  • FIG. 3A shows an example diagram of a multi-core time-domain processing module.
  • a multi-core time-domain processing module includes at least: a first dimensional transformation layer, a multi-core time-domain convolution layer, and a second dimensional transformation layer.
  • the first dimension transformation layer is used to determine the first time domain feature data corresponding to the target pixel in the time dimension in the first feature data corresponding to all video frames output by the target level module according to the time information of each video frame.
  • the multi-core time-domain convolution layer is used to perform, for each target pixel, convolution processing on the first time-domain feature data corresponding to that target pixel, to obtain the second time-domain feature data.
  • the second dimension transformation layer is used to determine, according to the position of each target pixel in the first feature data, the second feature data in the spatial dimension corresponding to all pixels with the same time information in the second time-domain feature data.
  • the first dimension transformation layer can use the Reshape operation to transform the dimensions of the first feature data (B×T, C, H, W) output by the upper hierarchical module (the first feature data is shown in the left part of Figure 4): the spatial dimensions (H, W) of the first feature data (B×T, C, H, W) are merged into the batch dimension, and the time dimension T is separated out, giving a three-dimensional matrix (B×H×W, C, T). Each piece of first time-domain feature data consists of the pixels with the same H, the same W, and the same C in the first feature data (C, H, W) of each video frame, arranged in chronological order; it contains T data points and is a one-dimensional vector composed of these T data points (the first time-domain feature data is shown in the right part of Figure 4). For example, when T = 8, each piece of first time-domain feature data contains 8 data points.
  • the second dimension transformation layer can also use the Reshape operation to transform all the second time-domain feature data output by the multi-core time-domain convolution layer. The output of the multi-core time-domain convolution layer is a three-dimensional matrix of dimensions (B×H×W, C, T); using the Reshape operation, the time dimension T of this matrix is merged into the batch dimension B and the spatial dimensions (H, W) are separated out, giving a four-dimensional matrix of dimensions (B×T, C, H, W), where (C, H, W) is the second feature data corresponding to each video frame.
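  • The two dimension transformations can be sketched with NumPy reshape/transpose operations (the small sizes are illustrative only). The first check confirms that each row of the transformed matrix is the T-step time series of one pixel position in one channel; the second confirms that the second transformation restores the original layout:

```python
import numpy as np

B, T, C, H, W = 1, 8, 4, 3, 3  # small illustrative sizes
x = np.arange(B * T * C * H * W, dtype=np.float32).reshape(B * T, C, H, W)

# First dimension transformation: (B*T, C, H, W) -> (B*H*W, C, T).
# Each row along the last axis is now the T-step time series of one
# pixel position in one channel (a first time-domain feature vector).
x5 = x.reshape(B, T, C, H, W)
t_first = x5.transpose(0, 3, 4, 2, 1).reshape(B * H * W, C, T)

# The time series of pixel (h, w) = (1, 2) in channel 0 matches the
# values of that pixel across the T input frames.
h, w = 1, 2
assert np.array_equal(t_first[h * W + w, 0], x5[0, :, 0, h, w])

# Second dimension transformation: (B*H*W, C, T) -> (B*T, C, H, W),
# restoring the original layout after the temporal convolution.
back = (t_first.reshape(B, H, W, C, T)
               .transpose(0, 4, 3, 1, 2)
               .reshape(B * T, C, H, W))
assert np.array_equal(back, x)
```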
  • the multi-core time-domain convolution layer performs convolution processing on each first time-domain feature data output by the first dimensional transformation layer to obtain corresponding second time-domain feature data.
  • the multi-core time-domain convolution layer includes a first preset number of one-dimensional convolution layers with different convolution kernel sizes. For each piece of first time-domain feature data output by the first dimension transformation layer, the first preset number of one-dimensional convolution layers perform convolution processing on it, obtaining a first preset number of feature data of different scales, which are then fused to obtain the second time-domain feature data corresponding to that piece of first time-domain feature data. The fusion may, for example, add the first preset number of feature data of different scales together.
  • in this way, timing features of different scales can be extracted from the same piece of first time-domain feature data, and these timing features of different scales can be merged to obtain the second time-domain feature data, better retaining the timing features of the moving target.
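  • A minimal sketch of the multi-kernel fusion on a single time series, assuming 'same' padding and illustrative kernel sizes (the description above only specifies "a first preset number" of different kernel sizes; the kernels used here are made up):

```python
import numpy as np

def multi_kernel_1d(series, kernels):
    """Convolve one T-step time series with several 1-D kernels of
    different sizes ('same' padding) and fuse the results by adding."""
    outs = [np.convolve(series, k, mode='same') for k in kernels]
    return np.sum(outs, axis=0)

t_series = np.array([0., 1., 2., 3., 4., 5., 6., 7.])  # T = 8
# Kernel sizes 1, 3 and 5 are an illustrative choice only.
kernels = [np.ones(1), np.ones(3) / 3, np.ones(5) / 5]
fused = multi_kernel_1d(t_series, kernels)
assert fused.shape == t_series.shape  # 'same' padding keeps length T
```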
  • the one-dimensional convolution layers in the multi-core time-domain convolution layer can be one-dimensional depthwise convolution layers, which can effectively reduce the amount of computation and improve the processing efficiency of the multi-core time-domain convolution layer.
  • the multi-core time-domain convolutional layer includes four one-dimensional Depthwise convolutional layers with different convolution kernel sizes.
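  • The computational saving of depthwise convolution can be seen from a simple parameter count (bias terms ignored; the values of C and k below are illustrative, not taken from the embodiments):

```python
# Parameter counts for a 1-D convolution over C channels with kernel
# size k: a standard conv mixes channels, a depthwise conv applies
# one kernel per channel without mixing channels.
C, k = 256, 3
standard = C * C * k   # every output channel sees every input channel
depthwise = C * k      # one kernel per channel
assert standard == 196608 and depthwise == 768
assert standard // depthwise == C  # depthwise is C times cheaper
```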
  • convolution methods such as dilated convolution may also be used to perform convolution processing on each first time domain feature data output by the first dimensional transformation layer to obtain corresponding second time domain feature data.
  • before the first dimension transformation, the number of channels of the first feature data corresponding to each video frame can be reduced from a first value to a second value, thereby reducing the amount of data processed by the multi-core time-domain processing module; afterwards, the number of channels of the second feature data is restored from the second value to the first value.
  • the multi-core temporal processing module may further include: a first convolutional layer and a second convolutional layer.
  • the first convolutional layer is located before the first dimension transformation layer; its input is the output data (B×T, C1, H, W) of the upper hierarchical module, and convolving (B×T, C1, H, W) through the first convolutional layer gives (B×T, C2, H, W), so that the number of channels of the first feature data corresponding to each video frame is reduced from the first value C1 to the second value C2, which reduces the amount of data processed by the multi-core time-domain processing module and improves processing efficiency.
  • the second convolutional layer is located after the second dimensional transformation layer.
  • the input data of the second convolutional layer is a matrix (B ⁇ T, C 2 , H, W) composed of the second feature data output by the second dimensional transformation layer.
  • the second convolutional layer performs convolution processing on (B×T, C2, H, W) to obtain (B×T, C1, H, W), restoring the number of channels of the second feature data to the first value C1 and ensuring that the input and output data of the multi-core time-domain processing module have the same dimensions and sizes. Therefore, the multi-core time-domain processing module can be easily deployed anywhere in the neural network model.
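  • Assuming the two convolutional layers use 1×1 kernels (the usual way to change only the channel count; the description above does not fix the kernel size), the channel reduction and restoration can be sketched as a per-pixel linear map over the channel axis:

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over the channel axis.
    x: (N, C_in, H, W), weight: (C_out, C_in) -> (N, C_out, H, W)."""
    return np.einsum('oc,nchw->nohw', weight, x)

N, C1, C2, H, W = 16, 64, 16, 7, 7   # illustrative sizes
x = np.random.rand(N, C1, H, W)
w_down = np.random.rand(C2, C1)      # first conv layer: C1 -> C2
w_up = np.random.rand(C1, C2)        # second conv layer: C2 -> C1

reduced = conv1x1(x, w_down)
restored = conv1x1(reduced, w_up)
assert reduced.shape == (N, C2, H, W)   # fewer channels inside the module
assert restored.shape == x.shape        # same dimensions in as out
```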
  • the average pooling layer in Figure 2 is used to perform average pooling processing on the feature data output by the last level module to obtain the motion state characteristics of the moving target.
  • the output data of the last level module is a four-dimensional matrix with dimensions (B ⁇ T, C′, H′, W′)
  • the four-dimensional matrix can be processed by the average pooling layer to reduce the number of parameters in the feature data, finally obtaining a two-dimensional matrix of dimensions (B×T, C′), which is the motion state feature of the moving target.
  • C′ is the number of channels of the feature map, H′ is the height of the feature map, and W′ is the width of the feature map.
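  • For illustration, average pooling over the spatial dimensions reduces (B×T, C′, H′, W′) to (B×T, C′); the sizes below match the ResNet50 example in this application:

```python
import numpy as np

# Spatial average pooling: (B*T, C', H', W') -> (B*T, C').
x = np.random.rand(16, 2048, 7, 7)   # sizes from the ResNet50 example
pooled = x.mean(axis=(2, 3))         # average over H' and W'
assert pooled.shape == (16, 2048)
```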
  • the first feature data corresponding to each video frame in the video frame sequence is extracted step by step through the various levels of modules in the neural network model shown in Figure 2 to obtain the spatial features that characterize the moving target in the video frame.
  • Through the multi-core time-domain processing modules between the hierarchical modules, the timing features of the moving target in the time dimension are extracted, and finally the motion state features including both spatial and timing features are obtained. Therefore, based on the above neural network model, more comprehensive feature information can be obtained from the video frame sequence, thereby improving the recognition accuracy of moving targets.
  • the above neural network model can be applied to a variety of scenarios.
  • the neural network model is used to extract the motion state features of the moving target in the video, and then, based on the extracted motion state features, a matching result between the moving target and a specified target is obtained to determine whether the video contains the specified target, where the specified target can be a person, an animal, or a body part (such as hands, feet, etc.).
  • a classification module for the motion state features can be added after the above neural network model to directly output the matching result with the specified target, realizing an end-to-end video processing system.
  • the neural network model applied to an action recognition scenario specifically includes: multiple hierarchical modules, one or more multi-core time-domain processing modules, an average pooling layer, and a classification module (such as a classifier).
  • the functions and layout of the hierarchical module, the multi-core time-domain processing module, and the average pooling layer can be referred to the corresponding modules in the neural network model shown in FIG. 2 and will not be repeated.
  • the classification module is located after the average pooling layer.
  • the classification module is used to classify the motion state features output by the average pooling layer and determine the probability that the moving target belongs to the action category corresponding to each specified target.
  • the classification module may be, for example, a Fully Connected Layer (FC), a Softmax layer, and so on.
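  • For illustration, a Softmax layer converts the fully connected layer's scores into a probability distribution over action categories (the scores and the five-category size below are made up):

```python
import numpy as np

def softmax(logits):
    """Turn the fully connected layer's scores into per-class probabilities."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

# One motion state feature scored against, say, 5 action categories.
logits = np.array([[2.0, 0.5, 0.1, -1.0, 0.0]])
probs = softmax(logits)
assert probs.shape == (1, 5)
assert np.isclose(probs.sum(), 1.0)
assert probs.argmax() == 0  # highest score -> highest probability
```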
  • any existing neural network that can process images can be used as the basic network, and one or more multi-core time-domain processing modules can be inserted into it to obtain a neural network model capable of extracting the motion state features of moving targets from a video frame sequence. The available basic networks include but are not limited to: Residual Network (ResNet), Convolutional Neural Network (CNN), or Visual Geometry Group Network (VGG) models.
  • the ResNet50 network includes a first convolution module, a max pooling layer (max-pooling), 4 residual modules, a first average pooling layer, a fully connected layer, and a second average pooling layer, where each residual module contains at least one convolutional layer.
  • FIG. 7B is a neural network model obtained by using the ResNet50 network as the basic network.
  • the position and number of insertion of the multi-core time domain processing module are not limited to the manner shown in FIG. 7B.
  • the first convolution module includes the first convolution layer of the ResNet50 network, followed by the Batch Normalization (BN) layer and the ReLU (Rectified Linear Unit) layer.
  • the input data of the first convolution module is a video frame sequence.
  • the video frame sequence is expressed as a four-dimensional matrix, such as (8, 3, 224, 224), where 8 in the first dimension is the frame sequence length, 3 in the second dimension is the number of RGB channels, 224 in the third dimension is the height of a single video frame, and 224 in the fourth dimension is the width of a single video frame.
  • the max pooling layer represents the first max pooling layer of ResNet50; after it, the spatial size (i.e., height and width) of the feature map output by the first convolution module is reduced to half of the input size.
  • the four residual modules are used to perform step-by-step convolution processing on the data output by the maximum pooling layer to extract the spatial characteristics of the moving target in the video frame sequence.
  • the first average pooling layer performs average pooling in the spatial dimension on the data output by the fourth residual module.
  • the second average pooling layer acts on the time dimension and performs average pooling on the data output by the fully connected layer.
  • after the first convolution module, matrix data with dimensions (16, 64, 112, 112) is output.
  • after the max pooling layer, matrix data with dimensions (16, 64, 56, 56) is output.
  • after the first residual module, matrix data with dimensions (16, 256, 56, 56) is output.
  • after the first multi-core time-domain processing module, matrix data with dimensions (16, 256, 56, 56) is output.
  • after the second residual module, matrix data with dimensions (16, 512, 28, 28) is output.
  • after the second multi-core time-domain processing module, matrix data with dimensions (16, 512, 28, 28) is output.
  • after the third residual module, matrix data with dimensions (16, 1024, 14, 14) is output.
  • after the third multi-core time-domain processing module, matrix data with dimensions (16, 1024, 14, 14) is output.
  • after the fourth residual module, matrix data with dimensions (16, 2048, 7, 7) is output.
  • after the first average pooling layer, matrix data with dimensions (16, 2048) is output.
  • matrix data with dimensions (16, 249) is output, where 249 is the total number of preset action categories; the data output by the fully connected layer is the probability that each video frame belongs to each action category, that is, the per-frame classification result.
  • before the data output by the fully connected layer is input to the second average pooling layer, it needs to be reshaped to obtain matrix data with dimensions (2, 8, 249).
  • the Reshape operation here separates the time dimension T from the batch dimension, so that the second average pooling layer can process the classification results of each video frame in the time dimension to obtain the probability that each video frame sequence belongs to each action category. After processing by the second average pooling layer, matrix data with dimensions (2, 249) is output, which is the final classification result, that is, the probability that each video frame sequence belongs to each action category.
  • the batch number is 2, that is, two independent video frame sequences S1 and S2 are currently being processed.
  • the output result of the second average pooling layer includes the probabilities of the video frame sequence S1 corresponding to the 249 action categories and the probabilities of the video frame sequence S2 corresponding to the 249 action categories.
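The reshape and temporal average pooling described above can be sketched as follows (random scores stand in for real fully connected layer outputs):

```python
import numpy as np

B, T, K = 2, 8, 249                          # batch, frames per sequence, action categories
fc_out = np.random.rand(B * T, K)            # per-frame class scores, shape (16, 249)
per_sequence = fc_out.reshape(B, T, K)       # separate time dimension T from batch
sequence_scores = per_sequence.mean(axis=1)  # average pooling over time -> (2, 249)
assert sequence_scores.shape == (2, 249)
```

Row 0 of the result is the temporal average of the first 8 frame scores (sequence S1), and row 1 that of the last 8 (sequence S2).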
  • the neural network model of the embodiment of this application makes full use of the batch dimension of the input data during processing, conveniently switching the matrix data between the time dimension and the space dimension: the time dimension T in the matrix data corresponding to the video frame sequence is merged into the batch dimension B to obtain (B×T, C, H, W), so that each video frame can be convolved in the spatial dimension to extract the spatial features of each video frame; conversely, the spatial dimensions H and W are merged into the batch dimension B and the time dimension T is separated out to obtain a three-dimensional matrix (B×H×W, C, T), so that multiple video frames can be convolved in the time dimension to extract the timing features across them; finally, motion state features including both spatial features and timing features are obtained.
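The two dimension switches amount to a reshape and a transpose; a minimal NumPy sketch (array sizes chosen to match the examples above):

```python
import numpy as np

B, T, C, H, W = 2, 8, 64, 56, 56
x = np.zeros((B, T, C, H, W), dtype=np.float32)

# Merge T into the batch dimension for per-frame spatial (2D) convolution.
spatial_view = x.reshape(B * T, C, H, W)
assert spatial_view.shape == (16, 64, 56, 56)

# Merge H and W into the batch dimension and isolate T for temporal (1D)
# convolution over pixels that share the same spatial position.
temporal_view = x.transpose(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
assert temporal_view.shape == (2 * 56 * 56, 64, 8)
```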
  • FIG. 8A is a schematic diagram of the training process of the neural network model, which specifically includes the following steps:
  • Each video sample in the video sample set in the above step includes a video frame sequence marked with a category identifier, and the category identifier is used to characterize the action category corresponding to the moving target contained in the video frame sequence.
  • the video frame sequence in the video sample set may be a continuous video containing a third preset number of video frames.
  • the video frame sequence in the video sample set can also be an image sequence obtained by arranging discontinuous multiple video frames from a piece of video in time sequence.
  • one video frame can be extracted from the video every second preset number of video frames, and if the number of extracted video frames reaches the third preset number, the extracted third preset number of video frames are determined as the video frame sequence.
  • the third preset number is determined according to the requirements of the neural network model for input data, that is, the third preset number is equal to T.
  • the second preset number can be determined according to the length of the video containing a complete action and the third preset number.
  • if a video corresponding to an action contains 100 frames and the third preset number is 8, starting from the first frame, a video frame can be extracted every 14 frames, finally obtaining a video frame sequence composed of frames 1, 15, 29, 43, 57, 71, 85, and 99.
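The sampling rule in this example can be sketched as follows (the function name is illustrative; frame indices are 1-based as in the text):

```python
def sample_frames(total_frames, interval, count, start=1):
    """Extract `count` 1-based frame indices, one every `interval` frames."""
    indices = [start + k * interval for k in range(count)]
    return [i for i in indices if i <= total_frames]

# 100-frame video, one frame every 14 frames, third preset number = 8.
print(sample_frames(100, 14, 8))  # [1, 15, 29, 43, 57, 71, 85, 99]
```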
  • a neural network model is used to obtain a motion state feature that characterizes the moving target in the time sequence of the video sample.
  • the video samples in the video sample set can be input to the neural network model, and the neural network model is used to process the motion state characteristics.
  • the neural network model in the foregoing steps may be any neural network model that does not include a classification module provided in the embodiment of the present application.
  • each level module in the neural network extracts and outputs the first feature data corresponding to each video frame in the video sample from the input data.
  • the input data of the first level module includes the video samples, and the input data of the other level modules is the output data of the previous level module or the multi-core time domain processing module.
  • the multi-core time domain processing module, according to the time information of each video frame, performs convolution processing in the time dimension on the target pixels in the first feature data output by the target level module, obtaining the corresponding second feature data.
  • each second feature data contains timing features that characterize the moving target in the time dimension.
  • the average pooling layer performs average pooling processing on the feature data output by the last level module to obtain the motion state features of the moving target in the video sample.
  • S803 According to the motion state characteristics output by the neural network model, determine the predicted probability that the motion target included in the video sample belongs to each action category through the classifier.
  • the structure of the classifier can refer to the previous classification model, which will not be repeated.
  • S804 Optimize the weight parameters of the neural network model and the classifier according to the predicted probability and the category identification.
  • step S805 Determine whether the optimized neural network model meets the training requirement; if so, execute step S806, otherwise return to step S802.
  • the difference between the predicted probability and the category identifier corresponding to the video sample is calculated through a loss function (such as cross-entropy loss), and then optimization algorithms such as backpropagation (BP), gradient descent (GD), or stochastic gradient descent (SGD) are used to update the weight parameters in the neural network model and the classifier.
  • BP: backpropagation
  • GD: gradient descent
  • SGD: stochastic gradient descent
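As a toy illustration of the loss and update just described, a single linear classifier with made-up sizes can stand in for the full model; nothing here reflects the actual network, only the cross-entropy/gradient-descent mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4                       # feature size and number of categories (made up)
W = rng.normal(0.0, 0.1, (K, D))   # stand-in weight parameters
x = rng.normal(size=D)             # motion state feature of one video sample
y = 2                              # category identifier (ground truth)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(100):                        # plain gradient-descent steps
    p = softmax(W @ x)                      # predicted probability per category
    grad = np.outer(p - np.eye(K)[y], x)    # gradient of cross-entropy loss w.r.t. W
    W -= 0.5 * grad                         # weight update, learning rate 0.5

assert softmax(W @ x).argmax() == y         # classifier now predicts the label
```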
  • the neural network model trained by the method shown in FIG. 8A can extract the motion state characteristics of the moving target from the video frame sequence.
  • FIG. 8B is a schematic diagram of the training process of the neural network model; steps that are the same as those in FIG. 8A will not be repeated.
  • the training method shown in FIG. 8B specifically includes the following steps:
  • Each video sample in the video sample set in the above steps includes a video frame sequence marked with a corresponding category identifier, and the category identifier is used to characterize the action category corresponding to the moving target contained in the video frame sequence.
  • S812 According to the video samples in the video sample set, obtain the predicted probability of the moving target included in the video sample belonging to each action category through the neural network model.
  • the neural network model in the foregoing steps may be the neural network model including the classification module provided in the embodiment of the present application.
  • each level module in the neural network extracts and outputs the first feature data corresponding to each video frame in the video sample from the input data.
  • the input data of the first level module includes the video samples, and the input data of the other level modules is the output data of the previous level module or the multi-core time domain processing module.
  • the multi-core time domain processing module, according to the time information of each video frame, performs convolution processing in the time dimension on the target pixels in the first feature data output by the target level module, obtaining the corresponding second feature data.
  • each second feature data contains timing features that characterize the moving target in the time dimension.
  • the average pooling layer performs average pooling processing on the feature data output by the last level module to obtain the motion state features of the moving target in the video sample.
  • the classification module classifies the motion state features output by the average pooling layer and determines the predicted probability of the moving target belonging to each action category.
  • step S814: determine whether the optimized neural network model meets the training requirements; if yes, execute step S815; otherwise, return to step S812.
  • the neural network model trained by the method shown in FIG. 8B can identify the action category corresponding to the moving target in the video frame sequence.
  • FIG. 8C it is a schematic diagram of the training process of the neural network model, wherein the same steps as those in FIG. 8A will not be repeated.
  • the training method shown in FIG. 8C specifically includes the following steps:
  • a video sample in the video sample set is a triplet (S1, S2, S3) comprising three video frame sequences, where the video frame sequences S1 and S2 are positive samples, that is, the moving targets in S1 and S2 have the same motion state characteristics; the video frame sequence S3 is a negative sample, and the motion state characteristics of its moving target differ from those of the moving targets in the other two video frame sequences.
  • for example, the moving targets in the video frame sequences S1 and S2 perform the same action, while the moving target in the video frame sequence S3 performs a different action; or the moving targets in the video frame sequences S1 and S2 are the same pedestrian, while the moving target in the video frame sequence S3 is a different pedestrian.
  • the neural network model in the foregoing steps may be any neural network model that does not include a classification module provided in the embodiment of the present application.
  • the distance value d1 between the motion state features corresponding to S1 and S2, the distance value d2 between the motion state features corresponding to S1 and S3, and the distance value d3 between the motion state features corresponding to S2 and S3 need to be calculated.
  • the Euclidean distance algorithm can be used to calculate the distance between two motion state features. The smaller the distance value between the two motion state features, the higher their similarity, that is, the higher the probability that the moving targets corresponding to the two motion state features perform the same action or are the same pedestrian.
  • a batch stochastic gradient descent method can be used to update the weight parameters of the neural network model to minimize the distance d1 between the motion state features corresponding to S1 and S2 in the triple, and maximize the distance values d2 and d3.
  • the specific optimization process is based on the prior art, and will not be repeated.
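A minimal sketch of the objective above; the margin value and toy feature vectors are assumptions for illustration, and real features would come from the network:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def triplet_loss(f1, f2, f3, margin=1.0):
    """Hinge-style triplet objective: pull d1 down, push d2 and d3 up."""
    d1 = euclidean(f1, f2)   # positive pair (S1, S2)
    d2 = euclidean(f1, f3)   # S1 vs. negative sample S3
    d3 = euclidean(f2, f3)   # S2 vs. negative sample S3
    return max(0.0, d1 - d2 + margin) + max(0.0, d1 - d3 + margin)

f1 = np.array([1.0, 0.0])    # toy motion state features
f2 = np.array([1.1, 0.0])
f3 = np.array([-1.0, 0.0])
print(triplet_loss(f1, f2, f3))  # 0.0: positives are close, negative is far
```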
  • step S825: determine whether the optimized neural network model meets the training requirement; if so, execute step S826; otherwise, return to step S822.
  • steps S822 to S825 are executed cyclically until the distance value between the motion state features corresponding to the positive samples in the triplet is less than a first value and the distance value between the motion state features corresponding to the positive and negative samples in the triplet is greater than a second value, where the first value and the second value are determined according to the model accuracy requirements.
  • the above neural network model can also be trained by improved triplet or quadruplet methods, and the specific process will not be repeated.
  • the neural network model trained by the method shown in FIG. 8C can extract the motion state characteristics of the moving target from the video frame sequence.
  • the video sample set can also be divided into a training set, a verification set, and a test set, where the test set has no intersection with the training set or the verification set.
  • after training, the neural network model is tested with the verification set to verify whether its output results are accurate. If the accuracy of the output results does not meet the requirements, the training set needs to be used to continue training the neural network model; if the accuracy meets the requirements, the test set, which has not been used in training, is used to verify the accuracy of the model. If the test passes, training of the neural network is complete.
  • different video sample sets can be used to obtain neural network models applied to different application scenarios.
  • existing sample sets can be used to train the neural network model; for example, sign language data sets such as IsoGD or Jester can be used to train a neural network model that can recognize sign language; action and behavior recognition data sets such as UCF101 or HMDB51 can be used to train a neural network model that can recognize human actions; gesture recognition data sets such as the MSRC-12 Kinect Gesture Dataset can be used to train a neural network model applied to gesture recognition scenarios; and human body pose estimation data sets such as Human3.6M can be used to train a model for the corresponding scenario. In a similar way, a neural network model applied to the pedestrian re-recognition scene can be trained.
  • FIG. 9 is a schematic flow chart of the video processing method provided in this embodiment of the application.
  • the method can be executed, for example, by the server shown in FIGS. 1A, 1B, and 1C, or by the terminal device.
  • the flow of the video processing method is introduced below. Some steps are the same as the corresponding steps in the model introduction and the training process, so they are only briefly introduced here; for details, please refer to the descriptions of the model and the corresponding parts of the training method above.
  • the video frame sequence containing the moving target can be obtained in the following manner: according to the timing of the video frames in the video to be processed, one video frame is extracted from the video to be processed every second preset number of video frames; if the number of extracted video frames reaches the third preset number, the extracted third preset number of video frames are determined as a video frame sequence.
  • if the second preset number is 14 and the third preset number is 8, starting from the first frame in the video to be processed, one video frame is extracted every 14 frames, finally obtaining a first video frame sequence composed of frames 1, 15, 29, 43, 57, 71, 85, and 99. Extraction can then continue, one video frame every 14 frames, to obtain a second video frame sequence.
  • S902 According to the video frame sequence, obtain a movement state feature that characterizes the moving target in the time sequence of the video frame sequence through the trained neural network model.
  • the designated target is determined according to the application scenario. For example, if the application scenario is sign language recognition, the designated target is a hand; if the application scenario is pedestrian re-recognition, the designated target is a human.
  • step S902 may include the following steps: through the level modules in the neural network model, the first feature data corresponding to each video frame in the video frame sequence is extracted step by step from the input data and output.
  • each first feature data contains the spatial features of the moving target in the video frame.
  • the input data of the first level module includes the video frame sequence, and the input data of the other level modules is the data output by the previous level module or the multi-core time domain processing module.
  • through the multi-core time domain processing module, according to the time information of each video frame, the target pixels in the first feature data output by the target level module are convolved in the time dimension to obtain the corresponding second feature data.
  • each second feature data contains the timing features of the moving target in the time dimension.
  • the target level module is the level module at the upper level of the multi-core time domain processing module, and the target pixels are the pixels with the same position in the first feature data output by the target level module.
  • through the average pooling layer, average pooling processing is performed on the feature data output by the last level module to obtain the motion state features of the moving target.
  • step S902 may include the following steps:
  • the first-level hierarchical module extracts and outputs first feature data P1 corresponding to each video frame from the input video frame sequence.
  • the first multi-core time-domain processing module performs convolution processing in the time dimension on the target pixels in the first feature data P1 corresponding to each video frame output by the first-level hierarchical module, respectively obtaining the corresponding second feature data Q1.
  • the second-level hierarchical module extracts and outputs the first feature data P2 corresponding to each video frame from the second feature data Q1 output by the first multi-core time-domain processing module.
  • the second multi-core time-domain processing module performs convolution processing in the time dimension on the pixels with the same position in the first feature data P2 corresponding to each video frame output by the second-level hierarchical module, respectively obtaining the corresponding second feature data Q2.
  • the third-level hierarchical module extracts and outputs first feature data P3 corresponding to each video frame from the second feature data Q2 output by the second multi-core time domain processing module.
  • the fourth-level hierarchical module extracts and outputs first feature data P4 corresponding to each video frame from the first feature data P3 output by the third-level hierarchical module.
  • the average pooling layer performs average pooling processing on the first feature data P4 output by the fourth-level hierarchical module to obtain the motion state feature of the moving target.
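The flow of the steps above can be sketched as follows; the stand-in modules (a ReLU and an adjacent-frame average) are illustrative assumptions, not the real convolutional stages:

```python
import numpy as np

def forward(frames, level_modules, temporal_after):
    """Level modules interleaved with multi-core temporal modules,
    followed by average pooling over time and space."""
    x = frames                                   # (T, C, H, W)
    for i, level in enumerate(level_modules):
        x = level(x)                             # first feature data P(i+1)
        if i in temporal_after:
            x = temporal_after[i](x)             # second feature data Q(i+1)
    return x.mean(axis=(0, 2, 3))                # motion state feature, shape (C,)

spatial = lambda x: np.maximum(x, 0.0)                  # toy per-frame processing
temporal = lambda x: 0.5 * (x + np.roll(x, 1, axis=0))  # mixes adjacent frames

frames = np.random.rand(8, 4, 6, 6)
feature = forward(frames, [spatial] * 4, {0: temporal, 1: temporal})
assert feature.shape == (4,)
```

The dict keys select which level modules are followed by a temporal module, mirroring the two insertion points in this example.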
  • step S1002 may include:
  • the above step S1101 can be implemented by the first dimension transformation layer in the multi-core time domain processing module.
  • S1102. Perform convolution processing on each first time domain feature data to obtain corresponding second time domain feature data.
  • step S1102 can be implemented by the multi-core time-domain convolutional layer in the multi-core time domain processing module.
  • step S1103 can be implemented by the second dimension transformation layer in the multi-core time domain processing module.
  • step S1004 is similar to the step S1002, and will not be described again.
  • step S1002 may include:
  • S1201 Decrease the number of channels of the first feature data P1 corresponding to each video frame output by the first-level hierarchical module from a first value to a second value.
  • step S1201 can be implemented by the first convolutional layer in the multi-core time domain processing module.
  • step S1202 can be implemented by the first dimension transformation layer in the multi-core time domain processing module.
  • S1203 Perform convolution processing on each first time domain feature data to obtain corresponding second time domain feature data.
  • step S1203 can be implemented by the multi-core time-domain convolutional layer in the multi-core time-domain processing module.
  • step S1204 can be implemented by the second dimension transformation layer in the multi-core time domain processing module.
  • S1205 Restore the number of channels of the second characteristic data Q1 from the second value to the first value.
  • step S1205 can be implemented by the second convolutional layer in the multi-core time domain processing module.
  • step S1004 is similar to the step S1002, and will not be described again.
  • for example, the convolution processing may be performed on each first time domain feature data through the following steps: for each first time domain feature data, a first preset number of one-dimensional convolution layers with different convolution kernel sizes perform convolution processing on the first time domain feature data to obtain a first preset number of feature data of different scales; the first preset number of feature data of different scales corresponding to the first time domain feature data are then fused to obtain the second time domain feature data corresponding to that first time domain feature data.
  • the one-dimensional convolutional layer may be a one-dimensional depthwise convolutional layer, which can effectively reduce the amount of calculation and improve the processing efficiency of the multi-core time-domain convolutional layer.
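A sketch of the multi-kernel temporal convolution on data shaped (B×H×W, C, T); the kernel sizes, the uniform moving-average kernels, and summation as the fusion step are all assumptions for illustration (a real depthwise layer would learn one kernel per channel):

```python
import numpy as np

def depthwise_conv1d(x, k):
    """Per-channel 1D convolution along the last (time) axis with a
    uniform kernel of odd size k; a stand-in for learned weights."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)), mode="edge")
    out = np.empty_like(x)
    for t in range(x.shape[-1]):
        out[..., t] = xp[..., t:t + k].mean(axis=-1)
    return out

def multi_kernel_temporal(x, kernel_sizes=(1, 3, 5, 7)):
    """Convolve the time axis at several scales and fuse by summation."""
    return sum(depthwise_conv1d(x, k) for k in kernel_sizes)

x = np.random.rand(4, 16, 8)   # (B*H*W, C, T) after the dimension transformation
y = multi_kernel_temporal(x)
assert y.shape == x.shape
```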
  • step S903 may include: according to the motion state features of the moving target, obtaining, through a trained classifier, the probability that the moving target belongs to the action category corresponding to each designated target, where the classifier is trained according to the motion state features of the designated targets.
  • the training method of the above-mentioned classifier can refer to the related content in the above-mentioned training method of the neural network model.
  • alternatively, a neural network model containing the classification module can be used directly to obtain the probability that the moving target in the video frame sequence belongs to the action category corresponding to each designated target.
  • for the neural network model, please refer to the one shown in FIG. 6 or FIG. 7B.
  • step S903 may include: if the similarity between the motion state feature of the moving target and the motion state feature of the designated target is greater than a threshold, determining the motion target as the designated target.
  • the Euclidean distance algorithm may be used to calculate the distance value between the motion state feature of the moving target and the motion state feature of the specified target as the similarity.
  • the video processing method of the embodiment of the present application can obtain, through the neural network model, the motion state features of the moving target expressed in the time sequence of the video frame sequence.
  • the motion state features include both the spatial features of the moving target in each video frame extracted by the level modules and the timing features of the moving target in the time dimension extracted by the multi-core time domain processing module; that is, more comprehensive feature information can be obtained from the video frame sequence, thereby improving the accuracy of recognizing the moving target.
  • FIG. 13 shows visual analysis results of intermediate results obtained by the neural network model, where the first row of images is the input video frame sequence, the second row is the visualized image corresponding to the output data of the second residual layer of the first model, and the third row is the visualized image corresponding to the output data of the second residual layer of the second model.
  • the first model is obtained by training the model shown in FIG. 7A on the sign language data set, and the second model is obtained by training the model shown in FIG. 7B on the sign language data set.
  • the visualization process may include, for example, for the multiple feature maps corresponding to each video frame, calculating the mean square value of each feature map along the channel direction to obtain a visual image corresponding to each video frame; pixels with larger mean square values correspond to higher brightness responses.
  • the feature maps obtained by the video processing method of the embodiment of the present application have a higher response to the hand (for example, the positions marked by white circles in FIG. 13 represent the response of the feature maps to the hand), which is very important for sign language recognition, and at the same time better reflect the temporal continuity of the hand region. Therefore, the video processing method of the embodiment of the present application can enhance the feature information of the hand in time and space, thereby improving the recognition accuracy of sign language recognition.
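The channel-wise mean-square visualization described above can be sketched as follows (the feature-map sizes are placeholders):

```python
import numpy as np

def response_map(feature_maps):
    """feature_maps: (C, H, W). Mean square over the channel axis,
    normalised so that stronger responses map to brighter pixels."""
    m = (feature_maps ** 2).mean(axis=0)
    return m / (m.max() + 1e-8)

fm = np.random.rand(64, 14, 14)   # e.g. one frame's residual-layer output
heat = response_map(fm)
assert heat.shape == (14, 14) and heat.max() <= 1.0
```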
  • an embodiment of the present application further provides a video processing device 140, including: an acquisition module 1401, a feature extraction module 1402, and a matching module 1403.
  • the obtaining module 1401 is used to obtain a video frame sequence containing a moving target
  • the feature extraction module 1402 is configured to obtain, according to the video frame sequence, through a trained neural network model, a feature representing the motion state of the moving target expressed in the time sequence of the video frame sequence;
  • the matching module 1403 is configured to obtain a matching result between the motion state feature of the moving target and the motion state feature of the specified target.
  • the neural network model includes multiple level modules, at least one multi-core time-domain processing module, and an average pooling layer, and each of the at least one multi-core time-domain processing module is set separately Between two adjacent level modules of the plurality of level modules, the average pooling layer is located after the last level module;
  • the feature extraction module 1402 is specifically used for:
  • the first feature data corresponding to each video frame in the video frame sequence is extracted step by step from the input data through the level modules, and each first feature data contains spatial features characterizing the moving target in the video frame;
  • the input data of the first-level hierarchical module includes the video frame sequence, and the input data of the other hierarchical modules is the output data of the previous-level hierarchical module or the multi-core time domain processing module;
  • the target pixel in the first feature data output by the target level module is convolved in the time dimension to obtain the corresponding second feature data.
  • the second feature data includes time-series features that characterize the moving target in the time dimension;
  • the target level module is a level module located at the upper level of the multi-core time domain processing module, and the target pixels are pixels with the same position in the first feature data output by the target level module;
  • the feature extraction module 1402 is specifically configured to:
  • according to the time information of each video frame, determine the first time domain feature data corresponding to the target pixels in the time dimension in the first feature data output by the target level module; perform convolution processing on each first time domain feature data to obtain the corresponding second time domain feature data;
  • according to each second time domain feature data, determine the second feature data corresponding in the spatial dimension to the pixels with the same time information in the second time domain feature data.
  • the feature extraction module 1402 is specifically configured to:
  • a first preset number of one-dimensional convolution layers with different convolution kernel sizes are used to perform convolution processing on the first time domain feature data to obtain the first time domain feature data.
  • the one-dimensional convolutional layer is a one-dimensional Depthwise convolutional layer.
  • the feature extraction module 1402 is further configured to: before determining the first temporal feature data, reduce the number of channels of the first feature data corresponding to each video frame from a first value to a second value; After the second characteristic data is determined, the number of channels of the second characteristic data is restored from the second value to the first value.
  • the acquisition module 1401 is specifically configured to: according to the timing of the video frames in the to-be-processed video, extract one video frame from the to-be-processed video every second preset number of video frames; if the extracted video The number of frames reaches the third preset number, and the extracted third preset number of video frames are determined as the video frame sequence.
  • the matching module 1403 is specifically configured to: if the similarity between the motion state feature of the moving target and the motion state feature of the designated target is greater than a threshold, determine that the moving target is the designated target; or, according to the motion state features of the moving target, obtain, through a trained classifier, the probability that the moving target belongs to the action category corresponding to each designated target, the classifier being trained according to the motion state features of the designated targets.
  • the video processing module of the embodiment of the present application further includes a training module, and the training module is used to:
  • each video sample in the video sample set includes a video frame sequence marked with a category identifier, and the category identifier is used to characterize an action category corresponding to a moving target contained in the video frame sequence;
  • the neural network model is used to obtain the characteristics of the movement state that characterizes the moving target in the time sequence of the video samples
  • a classifier is used to determine the predicted probability that the motion target contained in the video sample belongs to each action category
  • the weight parameters of the neural network model and the classifier are optimized.
  • the neural network model includes multiple level modules, at least one multi-core time-domain processing module, and an average pooling layer, and each of the at least one multi-core time-domain processing module is set separately Between two adjacent level modules of the plurality of level modules, the average pooling layer is located after the last level module;
  • the training module is specifically used for:
  • the neural network model is used to obtain the characteristics of the movement state of the moving target expressed in the time sequence of the video samples, which specifically includes:
  • the first feature data corresponding to each video frame in the video sample is extracted step by step from the input data through the level modules at each level and output, and each piece of first feature data contains spatial features characterizing, in the corresponding video frame, the moving target contained in the video sample;
  • the input data of the first-level module includes the video samples, and the input data of each of the other level modules is the data output by the level module or the multi-core time-domain processing module at the level above it;
  • through the multi-core time-domain processing module, according to the time information of each video frame, convolution processing is performed in the time dimension on target pixels in the first feature data output by the target level module to obtain corresponding second feature data;
  • each piece of second feature data contains temporal features characterizing the moving target in the time dimension; through the average pooling layer, average pooling is performed on the feature data output by the last level module to obtain the motion state features of the moving target.
  • the video processing device provided in the embodiment of the present application adopts the same inventive concept as the above-mentioned video processing method, and can achieve the same beneficial effects, which will not be repeated here.
  • an embodiment of the present application also provides an electronic device, which specifically may be a terminal device (such as a desktop computer, a portable computer, a smart phone, a tablet computer, or a personal digital assistant (Personal Digital Assistant, PDA)), or may be an external device that communicates with the terminal device, such as a server.
  • the electronic device 150 may include a processor 1501 and a memory 1502.
  • the processor 1501 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 1502 as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the memory may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disc, and so on.
  • the memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 1502 in the embodiment of the present application may also be a circuit or any other device capable of realizing a storage function for storing program instructions and/or data.
  • the embodiment of the present application provides a computer-readable storage medium for storing computer program instructions used for the above-mentioned electronic device, which includes a program for executing the above-mentioned video processing method.
  • the above-mentioned computer storage medium may be any available medium or data storage device that the computer can access, including but not limited to magnetic storage (such as floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical storage (such as CD, DVD, BD, HVD, etc.), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state drive (SSD)), etc.

Abstract

本申请涉及计算机技术领域,公开了一种视频处理方法、装置、电子设备及存储介质,所述方法包括:获取包含运动目标的视频帧序列;将所述视频帧序列输入已训练的神经网络模型,得到表征所述运动目标以所述视频帧序列的时序而表现的运动状态特征;获得所述运动目标的运动状态特征与指定目标的运动状态特征的匹配结果。本申请实施例提供的技术方案,提高对视频中运动目标的识别准确率。

Description

视频处理方法、装置、电子设备及存储介质
本申请要求于2019年7月29日提交中国专利局、申请号201910690174.8、申请名称为“视频处理方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及视频处理技术。
背景技术
近年来,视频处理技术得到了快速的发展。其中,针对视频的处理与理解,由于其在动作识别、行人重识别等领域的广泛应用前景,得到了大量研究者的关注。随着深度学习的发展,特别是卷积神经网络在计算机视觉中的大量应用,以及其在识别、检测等领域取得的令人惊喜的成果,基于卷积神经网络的视频行为识别得到了大量的研究。
目前,针对视频的处理与理解方法,通常是基于神经网络模型对一段视频中的每一帧图像分别进行特征提取,然后将各帧图像对应的特征输入预先训练的分类器,从而确定这一段视频中的运动目标所属的类别。然而,上述方法针对运动目标的识别准确率较低。
发明内容
本申请实施例提供一种视频处理方法、装置、电子设备及存储介质,以提高对视频中运动目标的识别准确率。
第一方面,本申请一实施例提供了一种视频处理方法,包括:
获取包含运动目标的视频帧序列;
根据所述视频帧序列,通过已训练的神经网络模型得到表征所述运动目标以所述视频帧序列的时序而表现的运动状态特征;
获得所述运动目标的运动状态特征与指定目标的运动状态特征的匹配结果。
可选地,所述神经网络模型包括多个层级模块、至少一个多核时域处理模块和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
所述根据所述视频帧序列,通过已训练的神经网络模型得到表征所述运动目标以所述视频帧序列的时序而表现的运动状态特征,具体包括:
通过各级层级模块分别从输入数据中逐级提取所述视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述运动目标在所述视频帧中的空间特征, 其中,第一级层级模块的输入数据包括所述视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
通过所述多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;所述目标层级模块为位于所述多核时域处理模块上一级的层级模块,所述目标像素点为所述目标层级模块输出的第一特征数据中位置相同的像素点;
通过所述平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到所述运动目标的运动状态特征。
可选地,所述按照各视频帧的时间信息,对目标层级模块输出的所述第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到各视频帧对应的第二特征数据,具体包括:
按照各视频帧的时间信息,确定所述目标层级模块输出的第一特征数据中所述目标像素点在时间维度上对应的第一时域特征数据;
对每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据;
按照每个第二时域特征数据中所述目标像素点在第一特征数据中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据。
可选地,所述对每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据,具体包括:
针对每个所述第一时域特征数据,分别采用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到所述第一预设数量个不同尺度的特征数据;
融合所述第一预设数量个不同尺度的特征数据,得到所述第一时域特征数据对应的第二时域特征数据。
可选地,所述一维卷积层为一维Depthwise卷积层。
可选地,本申请实施例的视频处理方法还包括:
在确定所述第一时域特征数据之前,将各视频帧对应的第一特征数据的通道数目从第一数值降为第二数值;
在确定所述第二特征数据之后,将所述第二特征数据的通道数目从所述第二数值还原为所述第一数值。
可选地,所述获取包含运动目标的视频帧序列,具体包括:
按待处理视频中视频帧的时序,从所述待处理视频中每间隔第二预设数量个视频帧抽取一个视频帧;
若抽取的视频帧的数量达到第三预设数量,将抽取的第三预设数量个视频帧确定为所述视频帧序列。
可选地,所述获得所述运动目标的运动状态特征与指定目标的运动状态特征的匹配结果,具体包括:
若所述运动目标的运动状态特征与指定目标的运动状态特征的相似度大于阈值,确定所述运动目标为所述指定目标;或者,
根据所述运动目标的运动状态特征,通过已训练的分类器得到所述运动目标属于每个指定目标所对应动作类别的概率,所述分类器是根据所述指定目标的运动状态特征训练得到的。
可选地,通过以下方式训练所述神经网络模型:
获取视频样本集,所述视频样本集中每个视频样本包括标记有类别标识的视频帧序列,所述类别标识用于表征所述视频帧序列中包含的运动目标对应的动作类别;
根据所述视频样本集中的视频样本,通过所述神经网络模型得到表征所述运动目标以所述视频样本的时序而表现的运动状态特征;
根据所述运动状态特征,通过分类器确定所述视频样本中包含的运动目标属于每个动作类别的预测概率;
根据所述预测概率和所述类别标识,优化所述神经网络模型和分类器的权重参数。
可选地,所述神经网络模型包括多个层级模块、至少一个多核时域处理模块,和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
所述根据所述视频样本集中的视频样本,通过所述神经网络模型得到表征所述运动目标以所述视频样本的时序而表现的运动状态特征,具体包括:
通过各级层级模块分别从输入数据中逐级提取所述视频样本中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述视频样本中包含的所述运动目标在所述视频帧中的空间特征,其中第一级层级模块的输入数据包括所述视频样本,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
通过所述多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;通过所述平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到所述视频样本中的运动目标的运动状态特征。
第二方面,本申请一实施例提供了一种神经网络模型,多个层级模块、至少一个多核时域处理模块,和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
各层级模块,用于分别从输入数据中逐级提取所述视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述运动目标在所述视频帧中的空间特征,其中,第一级层级模块的输入数据包括所述视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
所述多核时域处理模块,用于按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;所述目标层级模块为位于所述多核时域处理模块上一级的层级模块,所述目标像素点为所述目标层级模块输出的第一特征数据中位置相同的像素点;
所述平均池化层,用于对最后一级层级模块输出的特征数据进行平均池化处理,得到所述运动目标的运动状态特征。
可选地,所述多核时域处理模块包括:第一维度变换层、多核时域卷积层以及第二维度变换层;
所述第一维度变换层,用于按照各视频帧的时间信息,确定所述目标层级模块输出的第一特征数据中所述目标像素点在时间维度上对应的第一时域特征数据;
所述多核时域卷积层,用于针对每个目标像素点,对所述目标像素点对应的第一时域特征数据进行卷积处理,得到第二时域特征数据;
所述第二维度变换层,用于按照每个第二时域特征数据中所述目标像素点在第一特征数据中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据。
第三方面,本申请一实施例提供了一种视频处理装置,包括:
获取模块,用于获取包含运动目标的视频帧序列;
特征提取模块,用于根据所述视频帧序列,通过已训练的神经网络模型得到表征所述运动目标以所述视频帧序列的时序而表现的运动状态特征;
匹配模块,用于获得所述运动目标的运动状态特征与指定目标的运动状态特征的匹配结果。
可选地,所述神经网络模型包括多个层级模块、至少一个多核时域处理模块,和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
所述特征提取模块,具体用于:
通过各级层级模块分别从输入数据中逐级提取所述视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述运动目标在所述视频帧中的空间特征,其中,第一级层级模块的输入数据包括所述视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
通过所述多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;所述目标层级模块为位于所述多核时域处理模块上一级的层级模块,所述目标像素点为所述目标层级模块输出的第一特征数据中位置相同的像素点;
通过所述平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到所述运动目标的运动状态特征。
可选地,所述特征提取模块,具体用于:
按照各视频帧的时间信息,确定所述目标层级模块输出的第一特征数据中所述目标像素点在时间维度上对应的第一时域特征数据;
对每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据;
按照每个第二时域特征数据中所述目标像素点在第一特征数据中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据。
可选地,所述特征提取模块,具体用于:
针对每个所述第一时域特征数据,分别采用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到所述第一预设数量个不同尺度的特征数据;
融合所述第一预设数量个不同尺度的特征数据,得到所述第一时域特征数据对应的第二时域特征数据。
可选地,所述一维卷积层为一维Depthwise卷积层。
可选地,所述特征提取模块,还用于:
在确定所述第一时域特征数据之前,将各视频帧对应的第一特征数据的通道数目从第一数值降为第二数值;
在确定所述第二特征数据之后,将所述第二特征数据的通道数目从所述第二数值还原为所述第一数值。
可选地,所述获取模块,具体用于:
按待处理视频中视频帧的时序,从所述待处理视频中每间隔第二预设数量个视频帧抽取一个视频帧;
若抽取的视频帧的数量达到第三预设数量,将抽取的第三预设数量个视频帧确定为所述视频帧序列。
可选地,所述匹配模块,具体用于:
若所述运动目标的运动状态特征与指定目标的运动状态特征的相似度大于阈值,确定所述运动目标为所述指定目标;或者,
根据所述运动目标的运动状态特征,通过已训练的分类器得到所述运动目标属于每个指定目标所对应动作类别的概率,所述分类器是根据所述指定目标的运动状态特征训练得到的。
可选地,本申请实施例的视频处理模块还包括训练模块,所述训练模块用于:
通过以下方式训练所述神经网络模型:
获取视频样本集,所述视频样本集中每个视频样本包括标记有类别标识的视频帧序列,所述类别标识用于表征所述视频帧序列中包含的运动目标对应的动作类别;
根据所述视频样本集中的视频样本,通过所述神经网络模型得到表征所述运动目标以所述视频样本的时序而表现的运动状态特征;
根据所述运动状态特征,通过分类器确定所述视频样本中包含的运动目标属于每个动作类别的预测概率;
根据所述预测概率和所述类别标识,优化所述神经网络模型和分类器的权重参数。
可选地,所述神经网络模型包括多个层级模块、至少一个多核时域处理模块,和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
所述训练模块,具体用于:
所述根据所述视频样本集中的视频样本,通过所述神经网络模型得到表征所述运动目标以所述视频样本的时序而表现的运动状态特征,具体包括:
通过各级层级模块分别从输入数据中逐级提取所述视频样本中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述视频样本中包含的所述运动目标在所述视频帧中的空间特征,其中第一级层级模块的输入数据包括所述视频样本,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
通过所述多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;通过所述平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到所述视频样本中的运动目标的运动状态特征。
第四方面,本申请一实施例提供了一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,处理器执行计算机程序时实现上述任一种方法的步骤。
第五方面,本申请一实施例提供了一种计算机可读存储介质,其上存储有计算机程序指令,该计算机程序指令被处理器执行时实现上述任一种方法的步骤。
本申请实施例提供的技术方案,可通过神经网络模型得到表征运动目标以视频帧序列的时序而表现的运动状态特征,该运动状态特征中既包含通过层级模块提取到的运动目标在各视频帧中的空间特征,还包含通过多核时域处理模块提取到的表征运动目标在时间维度上的时序特征,即能够从视频帧序列中获取到更加全面的特征信息,进而提高了针对运动目标的识别准确率。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例中所需要使用的附图作简单地介绍,显而易见地,下面所介绍的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1A为本申请实施例提供的视频处理方法的应用场景示意图;
图1B为本申请实施例提供的视频处理方法的应用场景示意图;
图1C为本申请实施例提供的视频处理方法的应用场景示意图;
图2为本申请实施例提供的神经网络模型的结构示意图;
图3A为本申请实施例提供的多核时域处理模块的结构示意图;
图3B为本申请实施例提供的多核时域处理模块的结构示意图;
图4为本申请实施例提供的第一维度变换层进行Reshape操作的示意图;
图5为本申请实施例提供的多核时域卷积层的结构示意图;
图6为本申请实施例提供的应用于动作识别场景的神经网络模型的结构示意图;
图7A为ResNet50网络的结构示意图;
图7B为本申请实施例提供的以ResNet50网络作为基础网络得到的神经网络模型的结构示意图;
图8A为本申请实施例提供的神经网络模型的训练流程;
图8B为本申请实施例提供的神经网络模型的训练流程;
图8C为本申请实施例提供的神经网络模型的训练流程;
图9为本申请实施例提供的视频处理方法的流程示意图;
图10为本申请实施例提供的通过已训练的神经网络模型确定运动状态特征的流程示意图;
图11为本申请实施例提供的通过多核时域处理模块确定第二特征数据的流程示意图;
图12为本申请实施例提供的通过多核时域处理模块确定第二特征数据的流程示意图;
图13为对神经网络模型得到的中间结果的可视化分析结果;
图14为本申请实施例提供的视频处理装置的结构示意图;
图15为本申请实施例提供的电子设备的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
为了方便理解,下面对本申请实施例中涉及的名词进行解释:
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
例如,本申请实施例主要涉及计算机视觉技术(Computer Vision,CV),计算机视觉是一门研究如何使机器"看"的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(Optical Character Recognition,OCR)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、三维(3-dimension,3D)技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。例如可以通过图像语义理解(Image Semantic Understanding,ISU)进行特征提取。
例如,本申请实施例主要涉及机器学习(Machine Learning,ML),机器学习是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科,专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。例如通过机器学习可以训练神经网络模型和分类器,从而利用神经网络模型确定运动状态特征,或利用分类器预测运动目标所属动作类别的概率等。
手语识别作为人体语言理解的一部分,有着非常重要的作用。一方面,它是虚拟现实人机交互的主要手段;另一方面它又是聋哑人利用计算机与正常人交流的辅助工具。每个手语由一个手势序列组成,而每个手势是由手形变化序列组成。手语识别的主要任务是根据提取的待识别手语的特征,然后用分类器做分类,确定待识别手语的类别。根据手语输入介质的不同,手语识别系统可分为两种:基于摄象机(视觉)的手语识别系统和基于设备输入(如数据手套、铁笔、鼠标、位置跟踪器等)的手语识别系统。目前,手语识别方法主要有基于模板匹配,神经网络,隐马尔可夫模型(Hidden Markov Model,HMM),动态时间归整(Dynamic Time Warping,DTW)等方法。
行人重识别,也称行人再识别,是利用计算机视觉技术判断图像或者视频序列中是否存在特定行人的技术。给定一个监控行人图像,检索跨设备下的该行人图像。旨在弥补目前固定的摄像头的视觉局限,并可与行人识别/行人跟踪技术相结合,可广泛应用于智能视频监控、智能安保等领域。
多核时域处理模块(Multi-Kernel Temporal Block,MKTB),用于增强多个视频帧对应的特征图之间的时序特征。
Reshape,是一种可以重新调整矩阵的行数、列数、维数的函数。
Batch,是神经网络中的一个超参数,具体指神经网络每次处理的样本数量。
附图中的任何元素数量均用于示例而非限制,以及任何命名都仅用于区分,而不具有任何限制含义。
在具体实践过程中,现有的针对视频的视频处理与理解方法,通常是基于神经网络模型对一段视频中的每一帧图像分别进行特征提取,然后将各帧图像对应的特征输入预先训练的分类器,从而确定这一段视频中的运动目标所属的类别。由于上述方法中提取的特征是分别从各帧图像中获得的,因此,提取的特征无法体现运动目标在时间维度上的连续性和相关性,导致针对运动目标的识别准确率较低。以手势识别为例,每一个代表具体含义的手势动作都是由一系列具有确定的时序关系且连续的动作构成的,识别一个手势需要获取到包含这一系列连续动作的视频帧序列,视频帧序列中的各个视频帧严格按照动作的时序排列,采用现有的方式分别对各个视频帧进行特征提取,得到的只是手部动作在某一时刻的空间特征,不同时刻的手部动作对应的特征相互独立,丧失了视频帧序列中包含的手部动作之间时序关系,因此,基于现有方法得到的手部特征忽略了手势动作在时间维度上的连续性和相关性,导致手势识别准确率较低。
为此,本申请对现有的神经网络的结构进行改进,在现有神经网络中增加了一个能够从视频帧序列中提取出表征运动目标在时间维度上的时序特征的多核时域处理模块,改进后的神经网络模型包括多个层级模块、至少一个多核时域处理模块和平均池化层,且每个多核时域处理模块分别设置在多个层级模块中的两个相邻层级模块之间,平均池化层位于最后一个层级模块之后。其中,各级层级模块用于分别从输入数据中逐级提取视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征该运动目标在视频帧中的空间特征,其中第一级层级模块的输入数据包括视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;多核时域处理模块用于按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征运动目标在时间维度上的时序特征;其中,目标层级模块为位于多核时域处理模块上一级的层级模块,目标像素点为目标层级模块输出的第一特征数据中位置相同的像素点;平均池化层用于对最后一级层级模块输出的特征数据进行平均池化处理,得到运动目标的运动状态特征。
基于此,将包含运动目标的视频帧序列输入上述神经网络模型后,可通过神经网络模型得到表征运动目标以视频帧序列的时序而表现的运动状态特征,该运动状态特征中既包含通过层级模块提取到的运动目标在各视频帧中的空间特征,又包含通过多核时域处理模块提取到的表征运动目标在时间维度上的时序特征,即基于上述神经网络模型,能够从视频帧序列中获取到更加全面的特征信息,进而提高了针对运动目标的识别准确率。
在介绍完本申请实施例的设计思想之后,下面对本申请实施例的技术方案能够适用的应用场景做一些简单介绍,需要说明的是,以下介绍的应用场景仅用于说明本申请实施例而非限定。在具体实施时,可以根据实际需要灵活地应用本申请实施例提供的技术方案。
本申请实施例中的视频处理方法可以应用于动作识别的场景,比如手势识别场景、手语识别场景、动作交互场景、行为识别场景等。
下面以人机交互过程中的动作交互场景为例进行示例性说明,如图1A所示,该应用场景包括终端设备101和服务器102。上述终端设备101通过无线或有线网络与服务器102连接,终端设备101是具备采集图像功能的电子设备,比如智能手机、平板电脑、智能机器人、体感游戏设备以及VR(Virtual Reality,虚拟现实技术)设备等,服务器102是一台服务器或若干台服务器组成的服务器集群或云计算中心。
终端设备101采集包含用户的待处理视频,然后将采集的待处理视频发送至服务器102。服务器102可以直接对接收的待处理视频中的用户进行动作识别,确定待处理视频中的用户执行的动作对应的动作类别,并根据存储的动作类别与响应数据的对应关系,确定识别到的动作类别对应的响应数据,将响应数据发送给终端设备101。终端设备101执行服务器返回的响应数据,该响应数据不限于文本数据、音频数据、图像数据、视频数据、语音播报或控制指令等,其中,控制指令包括但不限于:控制终端设备显示表情的指令、控制终端设备的动作部件运动的指令(如引领、导航、拍照、跳舞等)、在终端设备的屏幕上展示道具或特效的指令、控制智能家居的指令等。
图1A所示的应用场景还可以用于手语识别场景。终端设备101采集包含用户比划手语的待处理视频,然后将采集的视频发送至服务器102。服务器102可以直接对接收的待处理视频中的用户进行动作识别,确定待处理视频中的手语动作对应的手语类别,并根据存储的手语类别与语义数据(此时语义数据为响应数据)的对应关系,确定识别到的手语类别对应的语义数据,将语义数据发送给终端设备101,该语义数据可以是文本数据或语音数据。终端设备101播放服务器返回的语义数据,使得其他用户能够获知该用户比划的手语对应的意思,使得语言障碍或听觉障碍人士能够无障碍地进行交流。
当然,上述服务器102执行的方法也可以在终端设备101执行。
如图1B所示,该应用场景包括多个终端设备111(包括终端设备111-1、终端设备111-2、……终端设备111-n)和服务器112。例如,当终端设备111-1、终端设备111-2、……终端设备111-n通过服务器112进行交互时,终端设备111-1采集包含用户1的待处理视频,然后将采集的待处理视频发送至服务器112。服务器112可以直接对接收的待处理视频中的用户1进行动作识别,确定待处理视频中的用户1执行的动作对应的动作类别,并根据存储的动作类别与响应数据的对应关系,确定识别到的动作类别对应的响应数据,将响应数据发送给终端设备111-1以及与终端设备111-1进行交互的终端设备111-2、……终端设备111-n。接收到该响应数据的终端设备执行响应数据,该响应数据不限于文本数据、音频数据、图像数据、视频数据、展示道具或特效等。例如,在互联网直播场景下,直播间的主播执行指定动作,以用户1是主播为例,终端设备111-1将采集的包含主播执行指定动作的待处理视频发送给服务器112,服务器112确定出待处理视频中的主播执行的指定动作对应的动作类别,并确定该动作类别对应的特效,然后将对应的特效添加到直播数据中,观众的终端设备(例如终端设备111-2、111-3、……、111-n)从服务器112拉取直播数据,并在直播画面上展示对应的特效。
本申请实施例中的视频处理方法还可以应用于识别、跟踪视频中的运动目标的场景,比如行人重识别场景、监控安防场景、智能交通场景以及军事目标识别场景等。本申请实施例主要是基于目标的运动状态特征(如人体姿态)进行目标识别、跟踪。
下面以行人重识别场景为例进行示例性说明,如图1C所示,该应用场景包括监控设备121、服务器122、终端设备123。上述服务器122通过无线网络与监控设备121以及终端设备123连接,监控设备121是具备采集图像功能的电子设备,比如摄像头、摄像机、录像机等,终端设备123是具备网络通信能力的电子设备,该电子设备可以是智能手机、平板电脑或便携式个人计算机等,服务器122是一台服务器或若干台服务器组成的服务器集群或云计算中心。
监控设备121实时采集待处理视频,然后将采集的待处理视频发送至服务器122。服务器122可以直接对接收的待处理视频中的行人进行识别,提取待处理视频中包含的各个行人的特征,将各个行人的特征与目标人物的特征进行比对,确定待处理视频中是否包含目标人物。服务器122在识别到目标人物后,对待处理视频中的目标人物进行标记,然后将标记了目标对象的待处理视频发送至终端设备123,终端设备123上可以播放标记了目标人物的待处理视频,以便相关人员对视频中的目标人物进行跟踪和分析。
当然,上述服务器122执行的方法也可以在终端设备123执行。上述待处理视频也可以是监控设备121预先录制的视频。
当然,本申请实施例提供的方法并不限用于图1A、图1B、图1C所示的应用场景中,还可以用于其它可能的应用场景,本申请实施例并不进行限制。对于图1A、图1B、图1C所示的应用场景的各个设备所能实现的功能将在后续的方法实施例中一并进行描述,在此先不过多赘述。
为进一步说明本申请实施例提供的技术方案,下面结合附图以及具体实施方式对此进行详细的说明。虽然本申请实施例提供了如下述实施例或附图所示的方法操作步骤,但基于常规或者无需创造性的劳动在所述方法中可以包括更多或者更少的操作步骤。在逻辑上不存在必要因果关系的步骤中,这些步骤的执行顺序不限于本申请实施例提供的执行顺序。
请参见图2,为本申请实施例提供的用于视频处理的神经网络模型的示意图。其中,图2所示的神经网络模型包括:多个层级模块(例如图2中的第一级层级模块、第二级层级模块、第三级层级模块、第四级层级模块),至少一个多核时域处理模块(例如图2中的第一多核时域处理模块、第二多核时域处理模块),和平均池化层(mean-pooling),且至少一个多核时域处理模块中的每个多核时域处理模块分别设置在多个层级模块中的两个相邻层级模块之间,平均池化层位于最后一个层级模块之后。
图2中的各级层级模块分别从输入数据中逐级提取视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征视频帧序列中包含的运动目标在视频帧中的空间特征。
其中,第一级层级模块的输入数据包括视频帧序列,第一级层级模块从输入的视频帧序列中提取各视频帧对应的第一特征数据并输出。除第一级层级以外的其他各级层级模块的输入数据为位于其上一级的层级模块输出的第一特征数据或者多核时域处理模块输出的第二特征数据。参考图2,第一级层级模块从输入的视频帧序列中提取各视频帧对应的第一特征数据P1并输出,第二级层级模块对第一多核时域处理模块输出的第二特征数据Q1进行处理,得到第一特征数据P2并输出,第三级层级模块对第二多核时域处理模块输出的第二特征数据Q2进行处理,得到第一特征数据P3并输出,第四级层级模块对第一特征数据P3进行处理,得到第一特征数据P4并输出。
需要说明的是,本申请实施例将各层级模块输出的特征数据统称为第一特征数据,将各多核时域处理模块输出的特征数据统称为第二特征数据。
多个层级模块的结构可以相同,也可以不同。单个层级模块中可以仅包含一个网络层,例如,单个层级模块中仅包含一个卷积层;单个层级模块中也可以包括多个相同或不同的网络层,例如,单个层级模块中包含卷积层和最大池化层,或者单个层级模块中包含多个不同的卷积层。
图2所述的神经网络模型仅为一个示例,实际应用中,各个层级模块的结构、神经网络模型包含的层级模块的数量、多核时域处理模块的数量以及位置均可根据实际需求设定,本申请实施例不作限定。
输入神经网络模型的视频帧序列可以是一段连续的视频,也可以是从一段视频中截取的、不连续的多个视频帧按照时序排列后得到的图像序列。视频帧序列本质上是一个四维矩阵(B×T,C,H,W),其中,B为批处理数目(Batch),即神经网络模型可一次处理完的视频帧序列的数量,T为帧序列长度,即一个视频帧序列中包含的视频帧的数量,C为图像的通道数,H为图像的高,W为图像的宽,此时所指的图像为视频帧。以批处理数目B=2,帧序列长度T=8,RGB通道数C=3,高H=224,宽W=224的输入信息为例,即输入神经网络模型的视频帧序列为一个四维矩阵(2×8,3,224,224)。如果同一时间内,神经网络模型只处理一个视频帧序列,则B可以设置为1,即该神经网络模型一次可处理一个视频帧序列中的T个视频帧。
本申请实施例中,每个视频帧对应的第一特征数据包括多个二维图片(即二维矩阵(H,W)),每一个二维图片即为一个特征图(feature map),第一特征数据包含的特征图数量等于对应的通道数。例如,第一层级模块输出的第一特征数据的维度为(16,64,112,112),则一帧视频帧对应的第一特征数据包含的特征图的数量为64,每个特征图的大小为112×112。需要说明的是,相同模块输出的每个视频帧对应的第一特征数据的维度、大小均相同。同样,每个视频帧对应的第二特征数据也包括多个特征图。
图2中的多核时域处理模块的输入数据为目标层级模块输出的各视频帧对应的第一特征数据,多核时域处理模块按照各视频帧的时间信息,对输入的各视频帧对应的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征运动目标在时间维度上的时序特征。其中,目标层级模块为位于多核时域处理模块上一级的层级模块,目标像素点为目标层级模块输出的第一特征数据中位置相同的像素点。
参考图3A,图3A示出了一种多核时域处理模块的示例图,一个多核时域处理模块至少包括:第一维度变换层、多核时域卷积层以及第二维度变换层。其中,第一维度变换层用于按照各视频帧的时间信息,确定目标层级模块输出的所有视频帧对应的第一特征数据中目标像素点在时间维度上对应的第一时域特征数据。多核时域卷积层用于针对每个目标像素点,对该目标像素点对应的第一时域特征数据进行卷积处理,得到第二时域特征数据。第二维度变换层,用于按照每个第二时域特征数据中目标像素点在第一特征数据中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据。
参考图4,第一维度变换层可采用针对矩阵的Reshape操作,实现对上一级的层级模块输出的第一特征数据(B×T,C,H,W)的维度转换(第一特征数据如图4左图所示),即将第一特征数据(B×T,C,H,W)中的空间维度(H,W)合并到Batch批处理数目维度上,将时间维度T单独分离出来,得到三维矩阵(B×H×W,C,T),第一时域特征数据由各视频帧对应的第一特征数据(C,H,W)中H相同、W相同、C相同的像素点按照时间顺序排列而成,每个第一时域特征数据中包含T个数据,第一时域特征数据为由这T个数据组成的一维向量。示例性地,当B=1,T=8,C=64,H=56,W=56时,Reshape操作后可得到1×56×56×64个第一时域特征数据(第一时域特征数据如图4右图所示),每个第一时域特征数据包含8个数据。
第二维度变换层同样可采用Reshape操作,对多核时域卷积层输出的所有第二时域特征数据进行维度转换,多核时域卷积层输出的是一个维度为(B×H×W,C,T)的三维矩阵,采用Reshape操作,将这个三维矩阵的时间维度T合并到批处理数目维度B上,将空间维度(H,W)单独分离出来,得到维度为(B×T,C,H,W)的四维矩阵,其中,(C,H,W)即为各视频帧对应的第二特征数据。
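上述两次维度变换过程可以用如下示意性代码说明(仅为基于 NumPy 的演示草图,用于展示 Reshape 操作前后的维度变化,并非本申请的实际实现):

```python
import numpy as np

B, T, C, H, W = 1, 8, 64, 56, 56
# 目标层级模块输出的第一特征数据, 形状为 (B*T, C, H, W)
x = np.random.rand(B * T, C, H, W)

# 第一维度变换层: 将空间维度 (H, W) 合并到批处理数目维度上,
# 单独分离出时间维度 T, 得到 (B*H*W, C, T)
t = x.reshape(B, T, C, H, W).transpose(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
# t[i, c, :] 即为某一空间位置、某一通道上的一个第一时域特征数据 (长度为 T)

# 第二维度变换层: 逆向 Reshape, 将时间维度合并回批处理数目维度,
# 还原空间维度, 得到 (B*T, C, H, W)
y = t.reshape(B, H, W, C, T).transpose(0, 4, 3, 1, 2).reshape(B * T, C, H, W)

assert t.shape == (B * H * W, C, T)
assert np.allclose(x, y)  # 两次变换互逆, 数据内容不变
```

如上所示,对应B=1、T=8、C=64、H=W=56的示例,第一维度变换层恰好得到1×56×56×64个长度为8的第一时域特征数据。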
通过多核时域卷积层对第一维度变换层输出的每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据。在一些情况下,多核时域卷积层包含第一预设数量个卷积核大小不同的一维卷积层,针对第一维度变换层输出的每个第一时域特征数据,分别采用第一预设数量个一维卷积层对第一时域特征数据进行卷积处理,得到每个第一时域特征数据对应的第一预设数量个不同尺度的特征数据,然后,融合第一预设数量个不同尺度的特征数据,得到每个第一时域特征数据对应的第二时域特征数据。其中,融合的方式可以是,将第一预设数量个不同尺度的特征数据相加,得到每个第一时域特征数据对应的第二时域特征数据。通过设置多个卷积核大小不同的一维卷积层,可从同一第一时域特征数据中提取出不同尺度的时序特征,融合这多个不同尺度的时序特征,得到第二时域特征数据,较好保留了运动目标的时序特征。
在一些可能的实施例中,多核时域卷积层中的一维卷积层可以是一维Depthwise卷积层,一维Depthwise卷积层可有效降低计算量,提高多核时域卷积层的处理效率。
以图5所示的多核时域卷积层为例,该多核时域卷积层包括4个卷积核大小不同的一维Depthwise卷积层,卷积核分别为k=1,k=3,k=5,k=7,分别用这4个一维Depthwise卷积层对第一时域特征数据进行卷积处理,得到第一时域特征数据对应的4个不同尺度的特征数据,然后,将4个不同尺度的特征数据相加,得到该第一时域特征数据对应的第二时域特征数据。
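图5所示的多尺度卷积与融合过程可以用如下示意性代码说明(基于 NumPy 的草图,函数名与随机初始化的卷积核权重均为示意性假设,并非本申请的实际实现):

```python
import numpy as np

def multi_kernel_temporal_conv(x, kernel_sizes=(1, 3, 5, 7)):
    """x: 形状为 (N, C, T) 的第一时域特征数据集合。
    每个通道独立进行一维卷积(Depthwise 方式), 卷积核大小分别为
    k=1,3,5,7, 各尺度结果逐元素相加融合, 得到第二时域特征数据。
    卷积核此处用固定随机种子的权重示意。"""
    rng = np.random.default_rng(0)
    N, C, T = x.shape
    out = np.zeros_like(x)
    for k in kernel_sizes:
        w = rng.standard_normal((C, k))          # 每个通道一个长度为 k 的卷积核
        pad = k // 2                             # 'same' 填充, 保持时间长度 T 不变
        xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
        conv = np.zeros_like(x)
        for t in range(T):                       # 在时间维度上滑动窗口
            conv[..., t] = np.sum(xp[..., t:t + k] * w, axis=-1)
        out += conv                              # 融合: 不同尺度的特征数据相加
    return out
```

该草图说明了"多尺度提取后相加融合"的设计思路;实际部署时通常以深度学习框架中的分组一维卷积(groups=通道数)实现 Depthwise 卷积以降低计算量。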
另外,还可以采用空洞卷积(Dilated convolution)等其它的卷积方式对第一维度变换层输出的每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据。
在一些可能的实施例中,在确定第一时域特征数据之前,可以将各视频帧对应的第一特征数据的通道数目从第一数值降为第二数值,从而降低多核时域处理模块处理的数据量,提高处理效率。在确定所述第二特征数据之后,再将第二特征数据的通道数目从第二数值还原为第一数值。
参考图3B,在这种情况下,多核时域处理模块还可以包括:第一卷积层和第二卷积层。其中,第一卷积层位于第一维度变换层之前,第一卷积层的输入数据为位于该多核时域处理模块上一级的层级模块输出的数据(B×T,C 1,H,W),通过第一卷积层对(B×T,C 1,H,W)进行卷积处理,得到(B×T,C 2,H,W),以将各视频帧对应的第一特征数据的通道数目从第一数值C 1降为第二数值C 2,这样可以降低多核时域处理模块处理的数据量,提高处理效率。其中,C 2的取值可根据C 1的值确定,例如,C 2=C 1/4或C 2=C 1/8,实际应用中,C 1和C 2的关系可根据实际需求确定,本申请实施例不作限定。第二卷积层位于第二维度变换层之后,第二卷积层的输入数据为第二维度变换层输出的第二特征数据组成的矩阵(B×T,C 2,H,W),通过第二卷积层对(B×T,C 2,H,W)进行卷积处理,得到(B×T,C 1,H,W),以将第二特征数据的通道数目还原为第一数值C 1,保证多核时域处理模块的输入数据和输出数据的维度和尺寸均相同,因此,多核时域处理模块可以很容易地部署神经网络模型中的任意位置。
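第一卷积层与第二卷积层对通道数目的降维与还原,本质上是对每个像素点在通道维上的线性变换(1×1卷积)。下面给出一段示意性代码(基于 NumPy,函数名与随机权重均为示意性假设,C2=C1/4 仅为文中列举的一种取值):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 卷积等价于对每个像素点在通道维上做线性变换。
    x: (N, C_in, H, W), w: (C_out, C_in), 返回 (N, C_out, H, W)。"""
    return np.einsum('oc,nchw->nohw', w, x)

rng = np.random.default_rng(0)
C1, C2 = 256, 64                        # 假设第二数值 C2 = C1 / 4
x = rng.random((16, C1, 7, 7))          # 各视频帧对应的第一特征数据
y = conv1x1(x, rng.random((C2, C1)))    # 第一卷积层: 通道数 C1 -> C2, 降低数据量
# ...中间经过第一维度变换层、多核时域卷积层、第二维度变换层...
z = conv1x1(y, rng.random((C1, C2)))    # 第二卷积层: 通道数 C2 -> C1, 还原维度

assert y.shape == (16, C2, 7, 7)
assert z.shape == x.shape               # 模块输入输出维度相同, 可任意插入网络
```

输入输出维度一致这一点,正是多核时域处理模块可以部署在神经网络模型中任意两个相邻层级模块之间的原因。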
图2中的平均池化层用于对最后一级层级模块输出的特征数据进行平均池化处理,得到运动目标的运动状态特征。例如,最后一级层级模块的输出数据为一个维度为(B×T,C′,H′,W′)的四维矩阵,则可以通过平均池化层对该四维矩阵进行平均池化处理,以减少特征数据中包含的参数数量,最终得到一个维度为(B×T,C′)的二维矩阵,即运动目标的运动状态特征。其中,C′为特征图的通道数,H′为特征图的高,W′为特征图的宽。
通过图2所示的神经网络模型中的各级层级模块逐级提取视频帧序列中各视频帧对应的第一特征数据,以获取表征运动目标在视频帧中的空间特征,在此过程中,通过层级模块之间的多核时域处理模块提取表征运动目标在时间维度上的时序特征,最终得到包含空间特征和时序特征的运动状态特征。因此,基于上述神经网络模型,能够从视频帧序列中获取到更加全面的特征信息,进而提高了针对运动目标的识别准确率。
上述神经网络模型可应用于多种场景,通过神经网络模型提取视频中的运动目标的运动状态特征,进而基于提取的运动状态特征,获得运动目标与指定目标的匹配结果,以确定视频是否包含指定目标,其中,指定目标可以是人、动物或者肢体部位(如手、脚等)等。为此,针对不同的应用场景,可在上述神经网络模型后增加针对运动状态特征的分类模块,以直接输出与指定目标的匹配结果,实现端到端的视频处理系统。
下面以图1A、图1B所示的应用场景为例,本申请实施例还提供了一种应用于动作识别场景的神经网络模型,参考图6,该应用于动作识别场景的神经网络模型具体包括:多个层级模块,一个或多个多核时域处理模块,平均池化层以及分类模块(例如分类器)。其中,层级模块、多核时域处理模块以及平均池化层的功能以及布局方式,可参考图2所示的神经网络模型中对应的模块,不再赘述。分类模块位于平均池化层之后,分类模块用于对平均池化层输出的运动状态特征进行分类,确定运动目标属于每个指定目标所对应动作类别的概率。分类模块例如可以是全连接层(Fully Connected layer,FC)、Softmax层等。
实际应用中,可以将现有的任何一种可对图像进行处理的神经网络作为基础网络,在基础网络中插入一个或多个多核时域处理模块,从而得到能够提取视频帧序列中运动目标的运动状态特征的神经网络模型。可用的基础网络包括但不限于:残差网络(Residual Network,ResNet)、卷积神经网络(Convolutional Neural Networks,CNN)或者视觉几何群网络(Visual Geometry Group Network,VGG)模型。
下面以ResNet50网络为例,介绍一下以ResNet50网络作为基础网络得到的神经网络模型。
参考图7A,ResNet50网络包括第一卷积模块、最大池化层(max-pooling)、4个残差模块、第一平均池化层、全连接层和第二平均池化层,其中,每个残差模块中包含至少一个卷积层。
示例性地,图7B为以ResNet50网络作为基础网络得到的神经网络模型。实际应用中,多核时域处理模块插入的位置以及数量不限于图7B所示的方式。
图7B中,第一卷积模块包括ResNet50网络的第一个卷积层以及紧随其后的批量标准化(Batch Normalization,BN)层和ReLU(Rectified Linear Unit,修正线性单元)层。第一卷积模块的输入数据为视频帧序列,该视频帧序列表示为一个四维矩阵,例如(8,3,224,224),其中,第一个维度上的8是指帧序列长度,第二个维度上的3是指RGB三个通道数,第三个维度上的224为单个视频帧的高,第四个维度上的224为单个视频帧的宽。
最大池化层表示ResNet50的第一个最大池化层,经过该最大池化层后,第一卷积模块输出的特征图的空间尺寸(即高和宽)会降为输入前的一半。
4个残差模块用于对最大池化层输出的数据进行逐级的卷积处理,以提取出视频帧序列中运动目标的空间特征。
第一平均池化层作用于空间维度,对第四残差模块输出的数据在空间维度上进行平均池化处理。
第二平均池化层作用于时间维度,对全连接层输出的数据在时间维度上进行平均池化处理。
图7B所示的ResNet50网络中的各网络层以及各模块对应的结构、参数以及处理流程均保持不变。
以批处理数目B=2,帧序列长度T=8,RGB通道数C=3,高H=224,宽W=224的输入信息为例,即输入神经网络模型的视频帧序列为一个四维矩阵(2×8,3,224,224)。
经过第一个卷积层处理后输出维度为(16,64,112,112)的矩阵数据。
经过最大池化层处理后输出维度为(16,64,56,56)的矩阵数据。
经过第一残差模块处理后输出维度为(16,256,56,56)的矩阵数据。
经过第一多核时域处理模块处理后输出维度为(16,256,56,56)的矩阵数据。
经过第二残差模块处理后输出维度为(16,512,28,28)的矩阵数据。
经过第二多核时域处理模块处理后输出维度为(16,512,28,28)的矩阵数据。
经过第三残差模块处理后输出维度为(16,1024,14,14)的矩阵数据。
经过第三多核时域处理模块处理后输出维度为(16,1024,14,14)的矩阵数据。
经过第四残差模块处理后输出维度为(16,2048,7,7)的矩阵数据。
经过第一平均池化层处理后输出维度为(16,2048)的矩阵数据。
经过全连接层处理后输出维度为(16,249)的矩阵数据,其中249为预先设定的动作类别的总数,全连接层输出的数据为各视频帧属于每个动作类别的概率,即分类结果。
此处,在将全连接层输出的数据输入第二平均池化层之前,还需要对全连接层输出的数据进行Reshape操作,得到维度为(2,8,249)的矩阵数据,此处的Reshape操作是为了从batch维度中分离出时间维度T,以便利用第二平均池化层在时间维度上对各视频帧的分类结果进行处理,得到每个视频帧序列属于每个动作类别的概率。然后,再经过第二平均池化层处理后输出维度为(2,249)的矩阵数据,得到最终的分类结果,即每个视频帧序列属于每个动作类别的概率。其中,批处理数目为2,也就是当前处理的是两个独立的视频帧序列S1和视频帧序列S2,第二平均池化层输出的结果包括视频帧序列S1对应249个动作类别的概率,以及视频帧序列S2对应249个动作类别的概率。
本申请实施例的神经网络模型,在数据处理过程中,充分利用了神经网络输入数据中的batch维度,方便地实现了矩阵数据在时间维度和空间维度之间的切换,通过将视频帧序列对应的矩阵数据中的时间维度T合并到Batch批处理数目维度B上,得到(B×T,C,H,W),从而能够在空间维度上对各个视频帧进行卷积处理,提取各视频帧中的空间特征;通过将视频帧序列对应的矩阵数据中的空间维度(H,W)合并到Batch批处理数目维度B上,将时间维度T单独分离出来,得到三维矩阵(B×H×W,C,T),从而能够在时间维度上对多个视频帧进行卷积处理,提取多个视频帧中的时序特征,最终得到包含空间特征和时序特征的运动状态特征。
如图8A所示,为神经网络模型的训练流程示意图,具体包括如下步骤:
S801、获取视频样本集。
上述步骤中的视频样本集中的每个视频样本包括标记有类别标识的视频帧序列,类别标识用于表征视频帧序列中包含的运动目标对应的动作类别。
其中,视频样本集中的视频帧序列可以是一段连续的、包含第三预设数量个视频帧的视频。
视频样本集中的视频帧序列也可以是从一段视频中截取的、不连续的多个视频帧按照时序排列后得到的图像序列,例如,可按视频的时序,从视频中每间隔第二预设数量个视频帧抽取一个视频帧,若抽取的视频帧的数量达到第三预设数量,将抽取的第三预设数量个视频帧确定为所述视频帧序列。其中,第三预设数量根据神经网络模型对输入数据的要求确定,即第三预设数量等于T。第二预设数量可根据包含一个完整动作的视频的长度以及第三预设数量确定,例如,一个动作对应的视频包含100帧,第三预设数量为8,则可从视频中的第1帧开始,每隔14帧抽取一个视频帧,最终得到第1、15、29、43、57、71、85、99帧组成的视频帧序列。
获取视频样本集之后,可以存储于数据库中,在训练时则可以直接从数据库中读取。
S802、根据视频样本集中的视频样本,通过神经网络模型得到表征运动目标以视频样本的时序而表现的运动状态特征。
在得到视频样本集后,可以将视频样本集中的视频样本输入神经网络模型,利用神经网络模型处理得到运动状态特征。上述步骤中的神经网络模型可以是本申请实施例提供的任一不包含分类模块的神经网络模型。
以图2所示的神经网络模型为例,神经网络中的各级层级模块分别从输入数据中逐级提取视频样本中各视频帧对应的第一特征数据并输出,其中第一级层级模块的输入数据包括视频样本,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据。多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征运动目标在时间维度上的时序特征。平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到视频样本中的运动目标的运动状态特征。
S803、根据神经网络模型输出的运动状态特征,通过分类器确定视频样本中包含的运动目标属于每个动作类别的预测概率。
分类器的结构可参考之前的分类模型,不再赘述。
S804、根据预测概率和类别标识,优化神经网络模型和分类器的权重参数。
S805、判断优化后的神经网络模型是否达到训练要求;若是,则执行步骤S806,否则返回步骤S802。
S806、结束训练。
在一种可能的实现方式中,通过损失函数(如交叉熵损失)计算预测概率和视频样本对应的类别标识之间的差异度,然后通过反向传播(Backpropagation,BP)、梯度下降(Gradient Descent,GD)或者随机梯度下降(Stochastic Gradient Descent,SGD)等优化算法更新神经网络模型和分类器中的权重参数。循环执行上述步骤S802-步骤S804,直到基于神经网络模型和分类器得到的视频样本对应的预设概率与视频样本对应的类别标识一致的概率达到期望值,则表示已获得符合要求的神经网络模型,可结束训练。
通过图8A所示的方法训练的神经网络模型能够从视频帧序列中提取出运动目标的运动状态特征。
如图8B所示,为神经网络模型的训练流程示意图,其中,与图8A中相同的步骤不再赘述。图8B所示的训练方法具体包括如下步骤:
S811、获取视频样本集。
上述步骤中的视频样本集中的每个视频样本包括标记有对应的类别标识的视频帧序列,类别标识用于表征视频帧序列中包含的运动目标对应的动作类别。
S812、根据所述视频样本集中的视频样本,通过所述神经网络模型得到视频样本中包含的运动目标属于每个动作类别的预测概率。
上述步骤中的神经网络模型可以是本申请实施例提供的包含分类模块的神经网络模型。
以图6所示的神经网络模型为例,神经网络中的各级层级模块分别从输入数据中逐级提取视频样本中各视频帧对应的第一特征数据并输出,其中第一级层级模块的输入数据包括视频样本,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据。多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征运动目标在时间维度上的时序特征。平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到视频样本中的运动目标的运动状态特征。分类模块对平均池化层输出的运动状态特征进行分类,确定运动目标属于每个动作类别的预测概率。
S813、根据预测概率和视频样本对应的类别标识,优化神经网络模型的权重参数。
S814、判断优化后的神经网络模型是否达到训练要求,若是,则执行步骤S815,否则返回步骤S812。
S815、结束训练。
通过图8B所示的方法训练的神经网络模型能够识别视频帧序列中运动目标对应的动作类别。
如图8C所示,为神经网络模型的训练流程示意图,其中,与图8A中相同的步骤不再赘述。图8C所示的训练方法具体包括如下步骤:
S821、获取视频样本集,该视频样本集中的每个视频样本包含三个视频帧序列。
上述步骤中,视频样本集中的一个视频样本为一个包括三个视频帧序列的三元组(S1,S2,S3),其中,视频帧序列S1和S2为正样本,即视频帧序列S1和S2中的运动目标具有相同的运动状态特征;视频帧序列S3为负样本,视频帧序列S3中运动目标的运动状态特征与其它两个视频帧序列中的运动目标的运动状态特征均不同。例如,针对动作识别场景,视频帧序列S1和S2中的运动目标执行相同的动作,而视频帧序列S3中的运动目标执行的动作与前两个视频帧序列不同;针对行人重识别场景,视频帧序列S1和S2中的运动目标为同一行人,而视频帧序列S3中的运动目标与前两个视频帧序列中的运动目标不是同一行人。
S822、根据视频样本集中的视频样本,通过神经网络模型得到视频样本中的三个视频帧序列分别对应的运动状态特征。
上述步骤中的神经网络模型可以是本申请实施例提供的任一不包含分类模块的神经网络模型。
S823、针对视频帧序列中的每个视频帧序列,分别计算这个视频帧序列对应的运动状态特征与另外两个视频帧序列对应的运动状态特征之间的距离值。
本步骤中,针对三元组(S1,S2,S3),需计算S1对应的运动状态特征和S2对应的运动状态特征之间的距离值d1,S1对应的运动状态特征和S3对应的运动状态特征之间的距离值d2,以及S2对应的运动状态特征和S3对应的运动状态特征之间的距离值d3。具体地,可采用欧式距离算法计算两个运动状态特征之间的距离值。两个运动状态特征之间的距离值越小,表明这两个运动状态特征的相似度越高,即两个运动状态特征对应的运动目标执行相同动作或者是同一个行人的概率越高。
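欧式距离的计算可以用如下示意性代码说明(纯 Python 草图,函数名与示例特征向量均为示意性假设):

```python
import math

def euclidean_distance(f1, f2):
    """计算两个运动状态特征向量(等长)之间的欧式距离;
    距离值越小, 两个运动状态特征的相似度越高。"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

# 三元组示意: s1 与 s2 为正样本对, s3 为负样本 (特征向量为假设数据)
s1, s2, s3 = [1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [4.0, 0.0, 1.0]
d1 = euclidean_distance(s1, s2)   # 正样本对之间的距离值
d2 = euclidean_distance(s1, s3)   # 正样本与负样本之间的距离值
assert d1 < d2                    # 训练目标: 最小化 d1, 最大化 d2、d3
```

实际训练中,运动状态特征由神经网络模型输出,维度远高于此示例。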
S824、根据视频样本对应的三个距离值,优化神经网络模型的权重参数。
具体地,可采用批量随机梯度下降的方式更新神经网络模型的权重参数,以最小化三元组中S1和S2对应的运动状态特征之间的距离值d1,最大化距离值d2和d3。具体优化过程为现有技术,不再赘述。
S825、判断优化后的神经网络模型是否达到训练要求,若是,则执行步骤S826,否则返回步骤S822。
S826、结束训练。
循环执行上述步骤S822-步骤S825,直到三元组中正样本对应的运动状态特征之间的距离值小于第一数值,三元组中正样本和负样本对应的运动状态特征之间的距离值大于第二数值,其中,第一数值和第二数值根据模型精度要求确定。
在一些情况下,还可以通过改进三元组以及四元组的方法训练上述神经网络模型,具体过程不再赘述。
通过图8C所示的方法训练的神经网络模型能够从视频帧序列中提取出运动目标的运动状态特征。
针对上述任一种训练方法,还可以将视频样本集划分为训练集、验证集和测试集,其中,测试集与训练集之间不存在交集,且测试集与验证集之间也不存在交集。先利用训练集训练神经网络模型,优化神经网络模型的权重参数,训练完神经网络模型后,再用验证集对神经网络模型进行测试,验证神经网络模型的输出结果是否准确,若输出结果的准确率不符合要求,则需要用训练集继续对神经网络模型进行训练,若输出结果的准确率达到要求,则用没有经过训练的测试集验证测试模型的准确率,若通过测试,表示完成神经网络的训练。
针对不同的应用场景,可使用不同的视频样本集,得到应用于不同应用场景的神经网络模型。实际应用过程中,可利用现有的样本集合训练神经网络模型,例如,利用IsoGD或Jester等手语数据集,训练得到能够识别手语的神经网络模型,利用UCF101或HMDB51等动作行为识别数据集,训练得到能够识别人体动作的神经网络模型,利用MSRC-12 Kinect Gesture Dataset等手势识别数据集,训练得到应用于手势识别场景的神经网络模型,利用Human3.6M等人体姿态估计的数据集,训练得到应用于人体姿态估计的神经网络模型,利用MARS等行人重识别数据集,训练得到应用于行人重识别场景的神经网络模型。
在模型训练完成之后,则可以将神经网络模型应用于视频处理中,请参见图9,为本申请实施例提供的视频处理方法的流程示意图,该方法例如可以通过图1A、1B、1C中所示的服务器来执行,当然,也可以通过终端设备执行。下面对视频处理方法的流程进行介绍,其中,一些步骤与模型介绍部分的内容以及训练过程中的对应步骤相同,因此对于这些步骤仅进行简单介绍,具体可以参见上述模型介绍以及训练方法中相应部分的描述。
S901、获取包含运动目标的视频帧序列。
在一些实施例中,可通过如下方式获取包含运动目标的视频帧序列:按待处理视频中视频帧的时序,从待处理视频中每间隔第二预设数量个视频帧抽取一个视频帧;若抽取的视频帧的数量达到第三预设数量,将抽取的第三预设数量个视频帧确定为视频帧序列。
例如,第二预设数量为14,第三预设数量为8,则可从待处理视频中的第1帧开始,每隔14帧抽取一个视频帧,最终得到第1、15、29、43、57、71、85、99帧组成的第一个视频帧序列。可继续每间隔14帧抽取一个视频帧,得到第二个视频帧序列。
S902、根据视频帧序列,通过已训练的神经网络模型得到表征运动目标以视频帧序列的时序而表现的运动状态特征。
S903、获得运动目标的运动状态特征与指定目标的运动状态特征的匹配结果。
本申请实施例中,指定目标根据应用场景确定,例如,应用场景为手语识别,则指定目标为手,应用场景为行人重识别,则指定目标为人。
若神经网络模型包括多个层级模块、至少一个多核时域处理模块和平均池化层,且至少一个多核时域处理模块中的每个多核时域处理模块分别设置在多个层级模块中的两个相邻层级模块之间,平均池化层位于最后一个层级模块之后,步骤S902可以包括以下步骤:通过神经网络模型中的各级层级模块分别从输入数据中逐级提取视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征运动目标在视频帧中的空间特征,其中第一级层级模块的输入数据包括视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征运动目标在时间维度上的时序特征;目标层级模块为位于多核时域处理模块上一级的层级模块,目标像素点为目标层级模块输出的第一特征数据中位置相同的像素点;平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到运动目标的运动状态特征。
参考图10,以图2所示的神经网络模型为例,步骤S902可以包括以下步骤:
S1001、第一级层级模块从输入的视频帧序列中提取各视频帧对应的第一特征数据P1并输出。
S1002、第一多核时域处理模块按照各视频帧的时间信息,对第一级层级模块输出的各视频帧对应的第一特征数据P1中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据Q1。
S1003、第二级层级模块从第一多核时域处理模块输出的第二特征数据Q1中提取各视频帧对应的第一特征数据P2并输出。
S1004、第二多核时域处理模块按照各视频帧的时间信息,对第二级层级模块输出的各视频帧对应的第一特征数据P2中位置相同的像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据Q2。
S1005、第三级层级模块从第二多核时域处理模块输出的第二特征数据Q2中提取各视频帧对应的第一特征数据P3并输出。
S1006、第四级层级模块从第三级层级模块输出的第一特征数据P3中提取各视频帧对应的第一特征数据P4并输出。
S1007、平均池化层对第四级层级模块输出的第一特征数据P4进行平均池化处理,得到运动目标的运动状态特征。
作为一种可能的实施方式,以图3A所示的多核时域处理模块为例,参考图11,步骤S1002可以包括:
S1101、按照各视频帧的时间信息,确定位于第一多核时域处理模块上一级的第一级层级模块输出的所有视频帧对应的第一特征数据P1中目标像素点在时间维度上对应的第一时域特征数据。
上述步骤S1101可通过多核时域处理模块中的第一维度变换层实现。
S1102、对每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据。
上述步骤S1102可通过多核时域处理模块中的多核时域卷积层实现。
S1103、按照每个第二时域特征数据中各像素点在第一特征数据P1中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据Q1。
上述步骤S1103可通过多核时域处理模块中的第二维度变换层实现。
上述步骤S1004的具体实施方式与步骤S1002类似,不再赘述。
作为一种可能的实施方式,以图3B所示的多核时域处理模块为例,参考图12,步骤S1002可以包括:
S1201、将第一级层级模块输出的各视频帧对应的第一特征数据P1的通道数目从第一数值降为第二数值。
上述步骤S1201可通过多核时域处理模块中的第一卷积层实现。
S1202、按照各视频帧的时间信息,确定所有视频帧对应的第一特征数据P1中目标像素点在时间维度上对应的第一时域特征数据。
上述步骤S1202可通过多核时域处理模块中的第一维度变换层实现。
S1203、对每个第一时域特征数据进行卷积处理,得到对应的第二时域特征数据。
上述步骤S1203可通过多核时域处理模块中的多核时域卷积层实现。
S1204、按照每个第二时域特征数据中各像素点在第一特征数据P1中对应的位置,确定所有第二时域特征数据中时间信息相同的像素点在空间维度上对应的第二特征数据Q1。
上述步骤S1204可通过多核时域处理模块中的第二维度变换层实现。
S1205、将第二特征数据Q1的通道数目由第二数值还原为第一数值。
上述步骤S1205可通过多核时域处理模块中的第二卷积层实现。
上述步骤S1004的具体实施方式与步骤S1002类似,不再赘述。
在上述任一实施例的基础上,可通过如下方式对每个第一时域特征数据进行卷积处理,例如包括如下步骤:针对每个第一时域特征数据,分别用第一预设数量个卷积核大小不同的一维卷积层对第一时域特征数据进行卷积处理,得到第一预设数量个不同尺度的特征数据,融合第一时域特征数据对应的第一预设数量个不同尺度的特征数据,得到该第一时域特征数据对应的第二时域特征数据。
其中,一维卷积层可以是一维Depthwise卷积层,使用一维Depthwise卷积层可有效降低计算量,提高多核时域卷积层的处理效率。
针对图1A、图1B所示的动作识别场景,步骤S903可以包括:根据运动目标的运动状态特征,通过已训练的分类器得到运动目标属于每个指定目标对应的动作类别的概率,其中,分类器是根据指定目标的运动状态特征训练得到的。
上述分类器的训练方法可参考上述神经网络模型的训练方法中的相关内容。
针对动作识别场景,还可以直接使用包含分类模块的神经网络模型,得到视频帧序列中的运动目标属于每个指定目标对应的动作类别的概率,具体结构可参考图6或图7B所示的神经网络模型。
针对图1C所示的目标识别跟踪场景,步骤S903可以包括:若运动目标的运动状态特征与指定目标的运动状态特征的相似度大于阈值,确定运动目标为指定目标。
在一些实施例中,可采用欧式距离算法计算运动目标的运动状态特征与指定目标的运动状态特征之间的距离值,作为相似度。
若运动目标的运动状态特征与指定目标的运动状态特征的相似度不大于阈值,则确定视频帧序列中不包含指定目标。
本申请实施例的视频处理方法,可通过神经网络模型得到表征运动目标以视频帧序列的时序而表现的运动状态特征,该运动状态特征中既包含通过层级模块提取到的运动目标在各视频帧中的空间特征,还包含通过多核时域处理模块提取到的表征运动目标在时间维度上的时序特征,即能够从视频帧序列中获取到更加全面的特征信息,进而提高了针对运动目标的识别准确率。
参考图13,为对神经网络模型得到的中间结果的可视化分析结果,其中,第一行图像为输入的视频帧序列,第二行图像为第一模型的第二残差层的输出数据对应的可视化图像,第三行图像为第二模型的第二残差层的输出数据对应的可视化图像,上述第一模型是利用手语数据集训练图7A所示模型得到的模型,上述第二模型是利用手语数据集训练图7B所示模型得到的模型。可视化过程例如可以包括:针对每一视频帧所对应的多个特征图,沿着通道方向计算各特征图的均方值,得到每一视频帧对应的可视化图像,均方值越大的像素点对应的亮度响应越高。从图13中可以看出,通过本申请实施例的视频处理方法得到的特征图对手部的响应更高(例如图13中白色圆圈所标识的位置,其表征了特征图对手部的响应),这对于手语识别来说是十分重要的,同时,能够更好地体现出手部区域的时间连续性。因此,通过本申请实施例的视频处理方法,能够增强手部在时间和空间上的特征信息,进而提高了手语识别的识别准确率。
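上述"沿通道方向计算均方值"的可视化过程可以用如下示意性代码说明(基于 NumPy 的草图,函数名为示意性假设,输入为随机数据):

```python
import numpy as np

def feature_heatmap(feat):
    """feat: 单个视频帧对应的一组特征图, 形状为 (C, H, W)。
    沿通道方向计算每个像素点的均方值, 得到 (H, W) 的可视化响应图,
    均方值越大的像素点对应的亮度响应越高。"""
    return np.mean(feat ** 2, axis=0)

feat = np.random.rand(64, 14, 14)   # 假设某层级模块输出的 64 通道特征图
heat = feature_heatmap(feat)
assert heat.shape == (14, 14)
assert np.all(heat >= 0)            # 均方值非负
```

将该响应图按亮度渲染,即可得到图13所示的可视化图像。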
如图14所示,基于与上述视频处理方法相同的发明构思,本申请实施例还提供了一种视频处理装置140,包括:获取模块1401、特征提取模块1402以及匹配模块1403。
获取模块1401,用于获取包含运动目标的视频帧序列;
特征提取模块1402,用于根据所述视频帧序列,通过已训练的神经网络模型得到表征所述运动目标以所述视频帧序列的时序而表现的运动状态特征;
匹配模块1403,用于获得所述运动目标的运动状态特征与指定目标的运动状态特征的匹配结果。
可选地,所述神经网络模型包括多个层级模块、至少一个多核时域处理模块,和平均池化层,且所述至少一个多核时域处理模块中的每个多核时域处理模块分别设置在所述多个层级模块中的两个相邻层级模块之间,所述平均池化层位于最后一个层级模块之后;
相应地,所述特征提取模块1402,具体用于:
通过各级层级模块分别从输入数据中逐级提取所述视频帧序列中各视频帧对应的第一特征数据并输出,每个第一特征数据中包含表征所述运动目标在所述视频帧中的空间特征,其中,第一级层级模块的输入数据包括所述视频帧序列,其他各级层级模块的输入数据为位于其上一级的层级模块或者多核时域处理模块输出的数据;
通过所述多核时域处理模块,按照各视频帧的时间信息,对目标层级模块输出的第一特征数据中目标像素点在时间维度上进行卷积处理,分别得到对应的第二特征数据,每个第二特征数据中包含表征所述运动目标在时间维度上的时序特征;所述目标层级模块为位于所述多核时域处理模块上一级的层级模块,所述目标像素点为所述目标层级模块输出的第一特征数据中位置相同的像素点;
通过所述平均池化层对最后一级层级模块输出的特征数据进行平均池化处理,得到所述运动目标的运动状态特征。
Optionally, the feature extraction module 1402 is specifically configured to:
determine, according to the time information of each video frame, the first temporal feature data corresponding, in the time dimension, to the target pixels in the first feature data output by the target hierarchical module;
perform convolution processing on each piece of first temporal feature data to obtain the corresponding second temporal feature data;
determine, according to the position in the first feature data of the target pixel of each piece of second temporal feature data, the second feature data corresponding, in the spatial dimension, to pixels with the same time information across all second temporal feature data.
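The two dimension transforms described above amount to a reshape round-trip: gather each spatial position's values across the T frames into a temporal sequence, process each sequence, then regroup pixels with the same time index back into per-frame feature maps. A minimal numpy sketch (the tensor layout (T, C, H, W) and the random data are assumptions; the temporal convolution itself is omitted):

```python
import numpy as np

T, C, H, W = 4, 2, 3, 3
rng = np.random.default_rng(42)
first_features = rng.random((T, C, H, W))   # per-frame first feature data

# First dimension transform: for each (channel, pixel) position, collect
# its values across the T frames into one temporal sequence.
temporal = first_features.transpose(1, 2, 3, 0).reshape(-1, T)   # (C*H*W, T)

# ... a multi-kernel temporal convolution would process each row here ...

# Second dimension transform: pixels with the same time information are
# rearranged back, in the spatial dimension, into per-frame feature maps.
second_features = temporal.reshape(C, H, W, T).transpose(3, 0, 1, 2)

# With the convolution omitted, the round-trip is exact.
assert np.allclose(second_features, first_features)
```

The transforms themselves change only the arrangement of the data, which is why the round-trip without the intervening convolution reproduces the input exactly.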
Optionally, the feature extraction module 1402 is specifically configured to:
for each piece of the first temporal feature data, perform convolution processing on the first temporal feature data separately with a first preset number of one-dimensional convolutional layers having different kernel sizes, to obtain the first preset number of pieces of feature data at different scales;
fuse the first preset number of pieces of feature data at different scales, to obtain the second temporal feature data corresponding to the first temporal feature data.
Optionally, the one-dimensional convolutional layers are one-dimensional depthwise convolutional layers.
Optionally, the feature extraction module 1402 is further configured to: before determining the first temporal feature data, reduce the number of channels of the first feature data corresponding to each video frame from a first value to a second value; and after determining the second feature data, restore the number of channels of the second feature data from the second value to the first value.
Optionally, the acquisition module 1401 is specifically configured to: extract one video frame from the video to be processed at intervals of a second preset number of video frames, in the temporal order of the video frames in the video to be processed; and if the number of extracted video frames reaches a third preset number, determine the extracted third preset number of video frames as the video frame sequence.
Optionally, the matching module 1403 is specifically configured to: if the similarity between the motion state feature of the moving object and the motion state feature of a designated target is greater than a threshold, determine that the moving object is the designated target; or, based on the motion state feature of the moving object, obtain through a trained classifier the probability that the moving object belongs to the action category corresponding to each designated target, the classifier being trained on the motion state features of the designated targets.
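The interval sampling above can be sketched as follows. Reading "one frame at intervals of N frames" as "take one, skip N" is an assumption; the function name and the behaviour for too-short videos are likewise illustrative.

```python
def sample_frames(video, interval, count):
    """Take one frame every `interval` skipped frames, in temporal order,
    until `count` frames are collected; return None if the video is too
    short to yield `count` frames."""
    sampled = video[::interval + 1][:count]
    return sampled if len(sampled) == count else None

# e.g. a 20-frame video, skipping 3 frames between picks, collecting 4 frames
frames = sample_frames(list(range(20)), interval=3, count=4)
```

Only when the extracted frames reach the preset count is the result treated as a valid video frame sequence, matching the condition stated above.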
Optionally, the video processing apparatus of the embodiments of this application further includes a training module, and the training module is configured to:
train the neural network model in the following manner:
acquire a video sample set, where each video sample in the video sample set includes a video frame sequence labeled with a category identifier, and the category identifier characterizes the action category corresponding to the moving object contained in the video frame sequence;
obtain, based on the video samples in the video sample set and through the neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video sample;
determine, based on the motion state feature and through a classifier, the predicted probability that the moving object contained in the video sample belongs to each action category;
optimize the weight parameters of the neural network model and the classifier according to the predicted probabilities and the category identifier.
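The prediction-then-optimization loop above can be sketched with a linear classifier and one gradient step; in the real embodiment the gradient would also back-propagate into the neural network. The linear classifier, softmax/cross-entropy pairing, learning rate, and dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(feature, label, weights, lr=0.1):
    """One illustrative optimization step: map the motion-state feature to
    per-class probabilities, then update `weights` with the cross-entropy
    gradient against the labeled category."""
    probs = softmax(weights @ feature)            # predicted probability per action class
    onehot = np.eye(weights.shape[0])[label]      # the category identifier as a one-hot
    grad = np.outer(probs - onehot, feature)      # d(cross-entropy) / d(weights)
    return weights - lr * grad, probs

rng = np.random.default_rng(0)
weights = 0.1 * rng.normal(size=(3, 4))           # 3 action classes, feature dim 4
feature = rng.normal(size=4)

initial = softmax(weights @ feature)[1]
for _ in range(50):                               # repeated steps raise the labeled class's probability
    weights, probs = train_step(feature, label=1, weights=weights)
```

Each step nudges the predicted distribution toward the category identifier, which is the optimization of weight parameters the training module performs.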
Optionally, the neural network model includes a plurality of hierarchical modules, at least one multi-kernel temporal processing module, and an average pooling layer, where each multi-kernel temporal processing module of the at least one multi-kernel temporal processing module is disposed between two adjacent hierarchical modules of the plurality of hierarchical modules, and the average pooling layer is located after the last hierarchical module;
correspondingly, the training module is specifically configured to:
obtain, based on the video samples in the video sample set and through the neural network model, the motion state feature characterizing how the moving object behaves over the temporal order of the video sample, which specifically includes:
extracting, through the hierarchical modules at each level, stage by stage from the input data, the first feature data corresponding to each video frame in the video sample and outputting it, where each piece of first feature data contains spatial features characterizing, in the corresponding video frame, the moving object contained in the video sample, the input data of the first-level hierarchical module includes the video sample, and the input data of each other hierarchical module is the data output by the hierarchical module or multi-kernel temporal processing module at the level above it;
performing, through the multi-kernel temporal processing module and according to the time information of each video frame, convolution processing in the time dimension on target pixels in the first feature data output by the target hierarchical module, to obtain the corresponding second feature data respectively, where each piece of second feature data contains temporal features characterizing the moving object in the time dimension; and performing, through the average pooling layer, average pooling on the feature data output by the last-level hierarchical module, to obtain the motion state feature of the moving object in the video sample.
The video processing apparatus provided in the embodiments of this application adopts the same inventive concept as the above video processing method and can achieve the same beneficial effects, which will not be repeated here.
Based on the same inventive concept as the above video processing method, an embodiment of this application further provides an electronic device. The electronic device may specifically be a terminal device (such as a desktop computer, a portable computer, a smartphone, a tablet computer, or a personal digital assistant (PDA)), or a device that communicates with a terminal device, such as a server. As shown in FIG. 15, the electronic device 150 may include a processor 1501 and a memory 1502.
The processor 1501 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be executed directly by a hardware processor, or by a combination of hardware and software modules in a processor.
As a non-volatile computer-readable storage medium, the memory 1502 may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, a card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, or an optical disc. The memory may be, but is not limited to, any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1502 in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, for storing program instructions and/or data.
An embodiment of this application provides a computer-readable storage medium for storing computer program instructions used by the above electronic device, which contains a program for executing the above video processing method.
The above computer storage medium may be any available medium or data storage device accessible to a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical (MO) discs), optical memory (e.g., CD, DVD, BD, HVD), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD)).
The above embodiments are intended only to describe the technical solutions of this application in detail. The description of the above embodiments is merely intended to help understand the methods of the embodiments of this application and should not be construed as limiting them. Any variation or replacement readily conceivable by those skilled in the art shall fall within the protection scope of the embodiments of this application.

Claims (16)

  1. A video processing method, comprising:
    acquiring a video frame sequence containing a moving object;
    obtaining, based on the video frame sequence and through a trained neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video frame sequence;
    obtaining a matching result between the motion state feature of the moving object and a motion state feature of a designated target.
  2. The method according to claim 1, wherein the neural network model comprises a plurality of hierarchical modules, at least one multi-kernel temporal processing module, and an average pooling layer, each multi-kernel temporal processing module of the at least one multi-kernel temporal processing module being disposed between two adjacent hierarchical modules of the plurality of hierarchical modules, and the average pooling layer being located after the last hierarchical module;
    the obtaining, based on the video frame sequence and through a trained neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video frame sequence specifically comprises:
    extracting, through the hierarchical modules at each level, stage by stage from input data, first feature data corresponding to each video frame in the video frame sequence and outputting the first feature data, each piece of first feature data containing spatial features characterizing the moving object in the corresponding video frame, wherein input data of the first-level hierarchical module comprises the video frame sequence, and input data of each other hierarchical module is data output by the hierarchical module or multi-kernel temporal processing module at the level above it;
    performing, through the multi-kernel temporal processing module and according to time information of each video frame, convolution processing in the time dimension on target pixels in the first feature data output by a target hierarchical module, to obtain corresponding second feature data respectively, each piece of second feature data containing temporal features characterizing the moving object in the time dimension, wherein the target hierarchical module is the hierarchical module one level above the multi-kernel temporal processing module, and the target pixels are pixels at the same position in the first feature data output by the target hierarchical module;
    performing, through the average pooling layer, average pooling on the feature data output by the last-level hierarchical module, to obtain the motion state feature of the moving object.
  3. The method according to claim 2, wherein the performing, according to time information of each video frame, convolution processing in the time dimension on target pixels in the first feature data output by the target hierarchical module, to obtain second feature data corresponding to each video frame respectively, specifically comprises:
    determining, according to the time information of each video frame, first temporal feature data corresponding, in the time dimension, to the target pixels in the first feature data output by the target hierarchical module;
    performing convolution processing on each piece of first temporal feature data to obtain corresponding second temporal feature data;
    determining, according to the position in the first feature data of the target pixel of each piece of second temporal feature data, second feature data corresponding, in the spatial dimension, to pixels with the same time information across all second temporal feature data.
  4. The method according to claim 3, wherein the performing convolution processing on each piece of first temporal feature data to obtain corresponding second temporal feature data specifically comprises:
    for each piece of the first temporal feature data, performing convolution processing on the first temporal feature data separately with a first preset number of one-dimensional convolutional layers having different kernel sizes, to obtain the first preset number of pieces of feature data at different scales;
    fusing the first preset number of pieces of feature data at different scales, to obtain the second temporal feature data corresponding to the first temporal feature data.
  5. The method according to claim 4, wherein the one-dimensional convolutional layers are one-dimensional depthwise convolutional layers.
  6. The method according to any one of claims 3 to 5, further comprising:
    before determining the first temporal feature data, reducing the number of channels of the first feature data corresponding to each video frame from a first value to a second value;
    after determining the second feature data, restoring the number of channels of the second feature data from the second value to the first value.
  7. The method according to any one of claims 1 to 5, wherein the acquiring a video frame sequence containing a moving object specifically comprises:
    extracting one video frame from a video to be processed at intervals of a second preset number of video frames, in the temporal order of the video frames in the video to be processed;
    if the number of extracted video frames reaches a third preset number, determining the extracted third preset number of video frames as the video frame sequence.
  8. The method according to any one of claims 1 to 5, wherein the obtaining a matching result between the motion state feature of the moving object and a motion state feature of a designated target specifically comprises:
    if the similarity between the motion state feature of the moving object and the motion state feature of a designated target is greater than a threshold, determining that the moving object is the designated target; or,
    obtaining, based on the motion state feature of the moving object and through a trained classifier, a probability that the moving object belongs to the action category corresponding to each designated target, the classifier being trained on the motion state feature of the designated target.
  9. The method according to any one of claims 1 to 5, wherein the neural network model is trained in the following manner:
    acquiring a video sample set, each video sample in the video sample set comprising a video frame sequence labeled with a category identifier, the category identifier characterizing the action category corresponding to the moving object contained in the video frame sequence;
    obtaining, based on the video samples in the video sample set and through the neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video sample;
    determining, based on the motion state feature and through a classifier, a predicted probability that the moving object contained in the video sample belongs to each action category;
    optimizing weight parameters of the neural network model and the classifier according to the predicted probability and the category identifier.
  10. The method according to claim 9, wherein the neural network model comprises a plurality of hierarchical modules, at least one multi-kernel temporal processing module, and an average pooling layer, each multi-kernel temporal processing module of the at least one multi-kernel temporal processing module being disposed between two adjacent hierarchical modules of the plurality of hierarchical modules, and the average pooling layer being located after the last hierarchical module;
    the obtaining, based on the video samples in the video sample set and through the neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video sample specifically comprises:
    extracting, through the hierarchical modules at each level, stage by stage from input data, first feature data corresponding to each video frame in the video sample and outputting the first feature data, each piece of first feature data containing spatial features characterizing the moving object in the corresponding video frame, wherein input data of the first-level hierarchical module comprises the video sample, and input data of each other hierarchical module is data output by the hierarchical module or multi-kernel temporal processing module at the level above it;
    performing, through the multi-kernel temporal processing module and according to time information of each video frame, convolution processing in the time dimension on target pixels in the first feature data output by the target hierarchical module, to obtain corresponding second feature data respectively, each piece of second feature data containing temporal features characterizing the moving object in the time dimension; and performing, through the average pooling layer, average pooling on the feature data output by the last-level hierarchical module, to obtain the motion state feature of the moving object in the video sample.
  11. A neural network model, comprising: a plurality of hierarchical modules, at least one multi-kernel temporal processing module, and an average pooling layer, each multi-kernel temporal processing module of the at least one multi-kernel temporal processing module being disposed between two adjacent hierarchical modules of the plurality of hierarchical modules, and the average pooling layer being located after the last hierarchical module;
    each hierarchical module is configured to extract, stage by stage from input data, first feature data corresponding to each video frame in a video frame sequence and output the first feature data, each piece of first feature data containing spatial features characterizing a moving object in the corresponding video frame, wherein input data of the first-level hierarchical module comprises the video frame sequence, and input data of each other hierarchical module is data output by the hierarchical module or multi-kernel temporal processing module at the level above it;
    the multi-kernel temporal processing module is configured to perform, according to time information of each video frame, convolution processing in the time dimension on target pixels in the first feature data output by a target hierarchical module, to obtain corresponding second feature data respectively, each piece of second feature data containing temporal features characterizing the moving object in the time dimension, wherein the target hierarchical module is the hierarchical module one level above the multi-kernel temporal processing module, and the target pixels are pixels at the same position in the first feature data output by the target hierarchical module;
    the average pooling layer is configured to perform average pooling on the feature data output by the last-level hierarchical module, to obtain a motion state feature of the moving object.
  12. The neural network model according to claim 11, wherein the multi-kernel temporal processing module comprises a first dimension-transform layer, a multi-kernel temporal convolutional layer, and a second dimension-transform layer;
    the first dimension-transform layer is configured to determine, according to time information of each video frame, first temporal feature data corresponding, in the time dimension, to the target pixels in the first feature data output by the target hierarchical module;
    the multi-kernel temporal convolutional layer is configured to, for each target pixel, perform convolution processing on the first temporal feature data corresponding to the target pixel, to obtain second temporal feature data;
    the second dimension-transform layer is configured to determine, according to the position in the first feature data of the target pixel of each piece of second temporal feature data, second feature data corresponding, in the spatial dimension, to pixels with the same time information across all second temporal feature data.
  13. A video processing apparatus, comprising:
    an acquisition module, configured to acquire a video frame sequence containing a moving object;
    a feature extraction module, configured to obtain, based on the video frame sequence and through a trained neural network model, a motion state feature characterizing how the moving object behaves over the temporal order of the video frame sequence;
    a matching module, configured to obtain a matching result between the motion state feature of the moving object and a motion state feature of a designated target.
  14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 10.
  15. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
  16. A computer program product, which, when executed, is used to perform the method according to any one of claims 1 to 10.