WO2022152104A1 - Training method and device for action recognition model, and action recognition method and device - Google Patents

Training method and device for action recognition model, and action recognition method and device

Info

Publication number
WO2022152104A1
Authority
WO
WIPO (PCT)
Prior art keywords
local
action
global
dimensional
image data
Prior art date
Application number
PCT/CN2022/071211
Other languages
English (en)
French (fr)
Inventor
蔡祎俊
卢江虎
项伟
Original Assignee
百果园技术(新加坡)有限公司
蔡祎俊
Priority date
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司 and 蔡祎俊
Publication of WO2022152104A1



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present application relates to the technical field of computer vision, for example, to a method and device for training an action recognition model, and a method and device for action recognition.
  • Action recognition is part of content moderation and is used to filter video data involving violence, among other things.
  • video action recognition methods are mainly based on deep learning.
  • however, existing deep-learning-based methods are generally simple and not very flexible, resulting in low action recognition accuracy.
  • the present application proposes a training method and device for an action recognition model, and an action recognition method and device, so as to solve the problem of low accuracy of action recognition on video data by a method based on deep learning.
  • the present application provides an action recognition method, including:
  • receiving video data, wherein the video data has multiple frames of original image data;
  • sampling the multiple frames of original image data to obtain multiple frames of target image data;
  • identifying, according to global features of the multi-frame target image data, an action occurring in the video data to obtain a global action;
  • identifying, according to local features of the multi-frame target image data, an action occurring in the video data to obtain a local action;
  • fusing the global action and the local action into a target action present in the video data.
  • the application also provides a training method for an action recognition model, including:
  • determining an action recognition model, wherein the action recognition model includes a sampling network, a global action recognition network, and a local action recognition network; the sampling network is used to sample multiple frames of original image data of the video data to obtain multiple frames of target image data; the global action recognition network is used to identify, according to global features of the multi-frame target image data, actions appearing in the video data to obtain a global action; the local action recognition network is used to identify, according to local features of the multi-frame target image data, actions appearing in the video data to obtain a local action; and the global action and the local action are used to be fused into a target action appearing in the video data;
  • calculating a global loss value of the global action recognition network when recognizing the global action;
  • calculating a local loss value of the local action recognition network when recognizing the local action;
  • updating the sampling network, the global action recognition network and the local action recognition network according to the global loss value and the local loss value.
  • the present application also provides an action recognition device, including:
  • a video data receiving module configured to receive video data, wherein the video data has multiple frames of original image data
  • a sampling module configured to sample the multiple frames of original image data to obtain multiple frames of target image data
  • a global action recognition module configured to identify actions that appear in the video data according to the global features of the multi-frame target image data to obtain a global action
  • a local action recognition module configured to identify actions that appear in the video data according to local features of the multi-frame target image data, and obtain local actions;
  • a target action fusion module configured to fuse the global action and the local action into a target action appearing in the video data.
  • the application also provides a training device for an action recognition model, including:
  • the action recognition model determination module is configured to determine the action recognition model, wherein the action recognition model includes a sampling network, a global action recognition network, and a local action recognition network; the sampling network is used to sample multiple frames of original image data of the video data to obtain multi-frame target image data; the global action recognition network is used to identify the actions that appear in the video data according to the global features of the multi-frame target image data to obtain global actions; the local action recognition network is used to identify the actions that appear in the video data according to the local features of the multi-frame target image data to obtain local actions; and the global actions and the local actions are used for fusion into the target action appearing in the video data;
  • a global loss value calculation module configured to calculate the global loss value of the global action recognition network when recognizing the global action
  • a local loss value calculation module configured to calculate the local loss value of the local action recognition network when recognizing the local action
  • the action recognition model updating module is configured to update the sampling network, the global action recognition network and the local action recognition network according to the global loss value and the local loss value.
  • the present application also provides a computer device, the computer device comprising:
  • one or more processors;
  • a memory arranged to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the above-mentioned action recognition method or the above-mentioned training method of an action recognition model.
  • the present application also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned action recognition method or the above-mentioned training method of the action recognition model is implemented.
  • FIG. 1 is a flowchart of an action recognition method provided in Embodiment 1 of the present application.
  • FIG. 2 is a schematic structural diagram of an action recognition model provided in Embodiment 1 of the present application.
  • FIG. 3 is a schematic structural diagram of a two-dimensional feature extraction network provided in Embodiment 1 of the present application;
  • FIG. 4 is a schematic structural diagram of a two-dimensional projection block provided in Embodiment 1 of the present application.
  • FIG. 5 is a schematic structural diagram of a two-dimensional residual block provided in Embodiment 1 of the present application.
  • FIG. 6 is a schematic structural diagram of a three-dimensional feature extraction network provided in Embodiment 1 of the present application.
  • FIG. 7 is a schematic structural diagram of a three-dimensional projection block according to Embodiment 1 of the present application.
  • FIG. 8 is a schematic structural diagram of a three-dimensional residual block provided in Embodiment 1 of the present application.
  • FIG. 9 is a schematic diagram of a local timing modeling provided in Embodiment 1 of the present application.
  • FIG. 10 is a flowchart of a training method for an action recognition model provided in Embodiment 2 of the present application.
  • FIG. 11 is a schematic structural diagram of an action recognition device according to Embodiment 3 of the present application.
  • FIG. 12 is a schematic structural diagram of an apparatus for training an action recognition model according to Embodiment 3 of the present application.
  • FIG. 13 is a schematic structural diagram of a computer device according to Embodiment 4 of the present application.
  • Action recognition based on deep learning methods mainly includes two-dimensional (2Dimension, 2D) convolution in space, three-dimensional (3D) convolution in time and space, and one-dimensional (1D) convolution in time series.
  • in terms of the basic construction of the feature extraction network, the construction methods mainly fall into the following two categories:
  • a series of local image data is obtained by dense frame sampling and regarded as a local video sequence, and the actions contained in that part (i.e., the video segment) are identified from the local video sequence.
  • in this way, one piece of video data is subjected to multiple sampling processes, which increases the overall computational overhead.
  • alternatively, the video data is sparsely sampled to obtain global image data as a global video sequence, and the actions contained in the whole (i.e., the entire video) are identified from the global video sequence.
  • a global action recognition network and a local action recognition network are used in an action recognition model to respectively identify actions of different timing granularities, and the method of Multiple Instance Learning (MIL) is used for the local action recognition network.
  • this application models the problem of local action recognition as a multi-instance learning problem.
  • the action recognition model focuses on local action information with strong discriminative ability, thereby reducing the influence of irrelevant background segments.
  • FIG. 1 is a flowchart of an action recognition method provided in Embodiment 1 of the present application. This embodiment is applicable to the case where action recognition is performed on video data based on global and local.
  • the method can be executed by an action recognition device.
  • the action recognition device can be implemented by software and/or hardware, and can be configured in computer equipment, such as servers, workstations, personal computers, etc. The method includes the following steps:
  • Step 101 Receive video data.
  • users can produce video data in real time in the client or edit previously recorded video data, such as short videos, micro-movies, and live broadcast data, and upload the video data to the video platform, intending to publish it on the video platform.
  • the video platform can formulate video content review standards according to business, legal and other factors, and the content of video data is reviewed according to these standards before the video data is released. In this embodiment, the content of the video data can be reviewed in the dimension of actions, filtering out video data that does not meet the review standards, such as video data containing pornographic, vulgar, or violent content, so that only video data meeting the video content review standards is released.
  • in one case, a streaming real-time system can be set up in the video platform; the user uploads video data to the streaming real-time system through the client in real time, and the streaming real-time system transmits the video data to the computer equipment that performs content review of the video data in the dimension of actions.
  • in another case, a database (such as a distributed database) can be set up in the video platform; users upload video data to the database through the client, and the computer equipment that performs content review of the video data in the dimension of actions reads the video data from the database.
  • an action recognition model can be pre-trained, and the action recognition model can integrate local action information and global action information to predict the target action appearing in the video data.
  • after the training of the action recognition model is completed, the parameters and structure of the action recognition model are saved.
  • the action recognition model can be directly loaded and the target action recognition in the video data can be completed without retraining the action recognition model.
  • the action recognition model includes a sampling network (also known as a sampling layer) 210, a global action recognition network 220, and a local action recognition network 230.
  • the sampling network 210 uniformly provides features of the video data to both the global action recognition network 220 and the local action recognition network 230; the global action recognition network 220 recognizes actions in the video data in the global dimension, and the local action recognition network 230 recognizes actions in the video data in the local dimension.
  • the global action recognition network 220 and the local action recognition network 230 are parallel branches in the action recognition model, so that the action recognition model obtains both global and local modeling capabilities.
  • Step 102 Sampling multiple frames of original image data to obtain multiple frames of target image data.
  • for the video data 201, the multiple frames of original image data 202 can be decoded and extracted, and input into the sampling network 210 to perform a sampling operation and output the target image data, thereby reducing the data amount of the video data and the calculation amount of action recognition while maintaining the main features of the video data 201.
  • two-dimensional operations are simpler than three-dimensional operations. If, in the action recognition model, two-dimensional operations are primary and three-dimensional operations are supplemental, a two-dimensional sampling operation can be performed on the multiple frames of original image data to obtain multi-frame target image data.
  • a two-dimensional convolution operation can be performed on multiple frames of original image data to obtain multiple frames of target image data.
  • the two-dimensional convolution operation refers to the operation of convolution in two dimensions of height (H) and width (W).
  • if three-dimensional operations are primary and two-dimensional operations are supplemental, a three-dimensional sampling operation (such as a three-dimensional convolution operation, i.e., convolution over the three dimensions of time (T), H, and W) can be performed on the multiple frames of original image data to obtain multi-frame target image data, which is not limited in this embodiment.
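  • as a minimal sketch of the two-dimensional sampling variant (illustrative only: the patent does not fix layer counts, kernel sizes or channel widths, and PyTorch is used here purely for illustration even though the embodiment mentions MXNet), a shared two-dimensional convolution can be applied to every decoded frame independently:

```python
import torch
import torch.nn as nn

class SamplingNetwork2D(nn.Module):
    """Per-frame 2D convolutional sampling: original frames -> target image data.

    Hypothetical configuration; the patent only states that a 2D convolution
    over the H and W dimensions is applied to the original image data.
    """
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        # A strided 2D convolution reduces spatial size while keeping per-frame features.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=7, stride=2, padding=3)

    def forward(self, frames):          # frames: (T, C, H, W) decoded original image data
        return self.conv(frames)        # target image data: (T, out_channels, H/2, W/2)

# Example: 16 decoded RGB frames of size 224x224.
frames = torch.randn(16, 3, 224, 224)
target = SamplingNetwork2D()(frames)    # -> (16, 64, 112, 112)
```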
  • Step 103 according to the global feature of the multi-frame target image data, identify the action appearing in the video data, and obtain the global action.
  • the multi-frame target image data can be input into the global action recognition network 220, which has the ability of global modeling and can predict the actions likely to occur in the video data as a whole, as global actions.
  • step 103 may include the following steps:
  • Step 1031 extracting two-dimensional features from the target image data to obtain global spatial features.
  • the global action recognition network 220 uses two structures of a two-dimensional feature extraction network 2221 and a three-dimensional feature extraction network 2211 to extract features.
  • a two-dimensional convolution operation can be performed on each frame of target image data, thereby modeling spatial information and obtaining global spatial features.
  • the two-dimensional feature extraction network is a multi-layer residual neural network
  • the two-dimensional feature extraction network includes a plurality of two-dimensional stages (stages).
  • Each two-dimensional stage is provided with a two-dimensional projection block and a plurality of two-dimensional residual blocks in sequence, that is, the two-dimensional feature extraction network is divided into multiple two-dimensional stages in sequence when extracting global spatial features.
  • a two-dimensional projection block and a plurality of two-dimensional residual blocks are sequentially arranged in each two-dimensional stage.
  • a two-dimensional residual block is a convolutional neural network module constructed with skip connections operating in the H and W dimensions, usually consisting of two to three convolutional layers.
  • step 1031 includes the following steps:
  • Step 10311 In the current two-dimensional stage, call a two-dimensional projection block to perform a two-dimensional convolution operation on the multi-frame target image data, and sequentially call multiple two-dimensional residual blocks to perform a two-dimensional convolution operation on the multi-frame target image data.
  • Step 10312 Determine whether all 2D stages have been traversed; if all 2D stages have been traversed, go to Step 10313, and if all 2D stages have not been traversed, go to Step 10314.
  • Step 10313 Output the multi-frame target image data after performing the two-dimensional convolution operation as global spatial features.
  • Step 10314 output the multi-frame target image data after performing the two-dimensional convolution operation to the next two-dimensional stage, and return to step 10311 .
  • starting from the first two-dimensional stage (that is, the current two-dimensional stage is initially the first two-dimensional stage), each two-dimensional stage is traversed in order, that is, each two-dimensional stage is called in order to process the multi-frame target image data.
  • assuming there are n (n is a positive integer, n ≥ 2) two-dimensional stages in the two-dimensional feature extraction network, the input of the first two-dimensional stage is the initial multi-frame target image data, the input of the second to nth two-dimensional stages is the multi-frame target image data output by the previous two-dimensional stage, and the multi-frame target image data output by the nth two-dimensional stage is the global spatial feature output by the entire two-dimensional feature extraction network.
  • a two-dimensional pooling layer 2222 may be provided in the global action recognition network 220, and the two-dimensional pooling layer 2222 is cascaded after the two-dimensional feature extraction network 2221.
  • in this case, in step 10313, a spatial global pooling operation (such as a global average pooling operation) is performed on the multi-frame target image data after the two-dimensional convolution operation, and the result is output as the global spatial feature.
  • a two-dimensional projection block can be called to perform a two-dimensional convolution operation on the multi-frame target image data, and multiple two-dimensional residual blocks can be called in turn to perform a two-dimensional convolution operation on the multi-frame target image data.
  • calling multiple two-dimensional residual blocks in sequence means that the multi-frame target image data input to the first two-dimensional residual block is the multi-frame target image data output by the two-dimensional projection block; for a two-dimensional residual block other than the first one, the input multi-frame target image data is the multi-frame target image data output by the previous two-dimensional residual block; and the multi-frame target image data output by the last two-dimensional residual block is the multi-frame target image data output by the entire two-dimensional stage.
  • when passing through a two-dimensional stage, the two-dimensional projection block can reduce the size of the target image data, expand its channels, and extract features from each frame of the input target image data separately, i.e., spatial information at multiple points in time.
  • the two-dimensional projection block is provided with a first two-dimensional convolutional layer (2D Conv_1) and a plurality of second two-dimensional convolutional layers (2D Conv_2).
  • in each two-dimensional projection block, on one hand, the first two-dimensional convolution layer (2D Conv_1) is called to perform a two-dimensional convolution operation on the multi-frame target image data; on the other hand, the multiple second two-dimensional convolution layers (2D Conv_2) are sequentially called to perform two-dimensional convolution operations on the multi-frame target image data.
  • sequentially calling multiple second two-dimensional convolution layers (2D Conv_2) means that the multi-frame target image data input to the first second two-dimensional convolution layer (2D Conv_2) is the original multi-frame target image data or the multi-frame target image data output by the previous two-dimensional stage, and the input of a second two-dimensional convolution layer other than the first one is the multi-frame target image data output by the previous second two-dimensional convolution layer (2D Conv_2).
  • the multi-frame target image data output by the first two-dimensional convolution layer (2D Conv_1) and the multi-frame target image data output by the last second two-dimensional convolution layer (2D Conv_2) are combined into the multi-frame target image data output by the two-dimensional projection block.
  • the two-dimensional residual block can use the design concept of a bottleneck (bottleneck).
  • the two-dimensional residual block is provided with multiple third two-dimensional convolution layers (2D Conv_3); therefore, in each two-dimensional residual block, the multiple third two-dimensional convolution layers (2D Conv_3) are sequentially called to perform two-dimensional convolution operations on the multi-frame target image data.
  • the first third two-dimensional convolution layer (2D Conv_3) can compress the number of channels of the multi-frame target image data, and the last third two-dimensional convolution layer (2D Conv_3) recovers the number of channels of the multi-frame target image data.
  • sequentially calling multiple third two-dimensional convolution layers (2D Conv_3) means that the multi-frame target image data input to the first third two-dimensional convolution layer (2D Conv_3) is the multi-frame target image data output by the two-dimensional projection block or the previous two-dimensional residual block, the input of a third two-dimensional convolution layer other than the first one is the multi-frame target image data output by the previous third two-dimensional convolution layer (2D Conv_3), and the multi-frame target image data output by the last third two-dimensional convolution layer (2D Conv_3) is the multi-frame target image data output by the entire two-dimensional residual block.
  • the two-dimensional feature extraction network in the embodiment of the present application is described below by using an example.
  • the 2D feature extraction network is divided into four 2D stages in order:
  • in the first two-dimensional stage (stage_1), one two-dimensional projection block and three two-dimensional residual blocks (two-dimensional residual block_1 to two-dimensional residual block_3) are sequentially set.
  • in the second two-dimensional stage (stage_2), one two-dimensional projection block and four two-dimensional residual blocks (two-dimensional residual block_1 to two-dimensional residual block_4) are sequentially set.
  • in the third two-dimensional stage (stage_3), one two-dimensional projection block and six two-dimensional residual blocks (two-dimensional residual block_1 to two-dimensional residual block_6) are sequentially set.
  • in the fourth two-dimensional stage (stage_4), one two-dimensional projection block and three two-dimensional residual blocks (two-dimensional residual block_1 to two-dimensional residual block_3) are sequentially set.
  • each two-dimensional projection block is provided with one first two-dimensional convolution layer (2D Conv_1) and three second two-dimensional convolution layers (2D Conv_2); the convolution kernel of the first two-dimensional convolution layer (2D Conv_1) is 1×1, and the convolution kernels of the second two-dimensional convolution layers (2D Conv_2) are 1×1, 3×3, and 1×1 in turn.
  • three third two-dimensional convolution layers (2D Conv_3) are set in each two-dimensional residual block, and the convolution kernels of the third two-dimensional convolution layer (2D Conv_3) are 1 ⁇ 1, 3 ⁇ 3, 1 ⁇ 1.
  • the above two-dimensional feature extraction network is only an example.
  • other two-dimensional feature extraction networks may be set according to actual conditions.
  • for example, one first two-dimensional convolution layer and two second two-dimensional convolution layers may be set in each two-dimensional projection block, and/or two third two-dimensional convolution layers may be arranged in each two-dimensional residual block, and so on, which is not limited here.
  • those skilled in the art may also adopt other two-dimensional feature extraction networks according to actual needs, which are not limited in this embodiment of the present application.
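  • a minimal sketch of the two-dimensional projection block and the bottleneck two-dimensional residual block described above; only the 1×1/3×3/1×1 kernel pattern and the projection/residual structure come from the example configuration, while channel widths, strides and the ReLU placement are assumptions, and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

class Projection2D(nn.Module):
    """2D projection block: a 1x1 shortcut convolution (first conv layer) merged with a
    1x1 -> 3x3 -> 1x1 main path (second conv layers). Widths/strides are assumptions."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=stride)          # 2D Conv_1
        self.main = nn.Sequential(                                        # 2D Conv_2 x 3
            nn.Conv2d(c_in, c_out // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out // 4, c_out // 4, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out // 4, c_out, 1),
        )

    def forward(self, x):                     # x: (T, C_in, H, W) target image data
        return torch.relu(self.shortcut(x) + self.main(x))

class Residual2D(nn.Module):
    """2D residual (bottleneck) block: 1x1 compresses channels, 3x3 transforms,
    1x1 restores channels; identity shortcut."""
    def __init__(self, c):
        super().__init__()
        self.main = nn.Sequential(                                        # 2D Conv_3 x 3
            nn.Conv2d(c, c // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // 4, c // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c // 4, c, 1),
        )

    def forward(self, x):
        return torch.relu(x + self.main(x))

# One 2D stage as in the example configuration: a projection block followed by residual blocks.
stage_1 = nn.Sequential(Projection2D(64, 256), *[Residual2D(256) for _ in range(3)])
```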
  • Step 1032 extracting three-dimensional features from the multi-frame target image data to obtain global time series features.
  • a 3D convolution operation can be performed on each frame of target image data, thereby modeling the time series information of adjacent target image data to obtain global time series features.
  • the three-dimensional feature extraction network is a multi-layer residual neural network.
  • the number of layers of the three-dimensional feature extraction network is smaller than that of the two-dimensional feature extraction network.
  • the 3D feature extraction network includes multiple 3D stages.
  • One or more 3D residual blocks are set in the first 3D stage, and 3D projection blocks and one or more 3D residual blocks are set in sequence in the other 3D stages.
  • the three-dimensional residual block is a convolutional neural network module constructed with skip connections operating in the T, H, and W dimensions, usually consisting of two to three convolutional layers.
  • step 1032 includes the following steps:
  • Step 10321 In the current 3D stage, call the 3D projection block to perform the 3D convolution operation on the multi-frame target image data, and/or call the 3D residual block to perform the 3D convolution operation on the multi-frame target image data.
  • Step 10322 Determine whether all three-dimensional stages have been traversed; if all three-dimensional stages have been traversed, go to step 10323, and if all three-dimensional stages have not been traversed, go to step 10324.
  • Step 10323 outputting the multi-frame target image data after performing the three-dimensional convolution operation as a global timing feature
  • Step 10324 Output the multi-frame target image data after performing the three-dimensional convolution operation to the next three-dimensional stage, and return to step 10321.
  • starting from the first three-dimensional stage, each three-dimensional stage is traversed in sequence, that is, each three-dimensional stage is called in sequence to process the multi-frame target image data.
  • assuming there are m (m is a positive integer, m ≥ 2) three-dimensional stages in the three-dimensional feature extraction network, the input of the first three-dimensional stage is the initial multi-frame target image data, the input of the second to mth three-dimensional stages is the multi-frame target image data output by the previous three-dimensional stage, and the multi-frame target image data output by the mth three-dimensional stage is the global time series feature output by the entire three-dimensional feature extraction network.
  • a 3D pooling layer 2212 may be provided in the global action recognition network 220, and the 3D pooling layer 2212 is cascaded after the 3D feature extraction network 2211.
  • in this case, in step 10323, a temporal global pooling operation (such as a global average pooling operation) is performed on the multi-frame target image data after the three-dimensional convolution operation, and the result is output as the global time series feature.
  • in the first three-dimensional stage, the three-dimensional residual block can be called to perform a three-dimensional convolution operation on the multi-frame target image data; in the other three-dimensional stages, the three-dimensional projection block can be called to perform a three-dimensional convolution operation on the multi-frame target image data, and then the three-dimensional residual block is called to perform a three-dimensional convolution operation on the multi-frame target image data.
  • if there are multiple three-dimensional residual blocks, they can be called in sequence to perform three-dimensional convolution operations on the multi-frame target image data.
  • calling multiple three-dimensional residual blocks in sequence means that the multi-frame target image data input to the first three-dimensional residual block is the multi-frame target image data output by the three-dimensional projection block, the input of a three-dimensional residual block other than the first one is the multi-frame target image data output by the previous three-dimensional residual block, and the multi-frame target image data output by the last three-dimensional residual block is the multi-frame target image data output by the entire three-dimensional stage.
  • the three-dimensional projection block can reduce the size of the target image data, expand its channels, and extract features from the correlation between two adjacent input frames of target image data to obtain timing information of the video data.
  • the 3D projection block is provided with a plurality of first 3D convolution layers (3D Conv_1) and a second 3D convolution layer (3D Conv_2).
  • in each three-dimensional projection block, on one hand, the multiple first three-dimensional convolution layers (3D Conv_1) are sequentially called to perform three-dimensional convolution operations on the multi-frame target image data; on the other hand, the second three-dimensional convolution layer (3D Conv_2) is called to perform a three-dimensional convolution operation on the multi-frame target image data.
  • sequentially calling multiple first three-dimensional convolution layers (3D Conv_1) means that the multi-frame target image data input to the first first three-dimensional convolution layer (3D Conv_1) is the multi-frame target image data output by the previous three-dimensional stage, and the input of a first three-dimensional convolution layer other than the first one is the multi-frame target image data output by the previous first three-dimensional convolution layer (3D Conv_1).
  • the multi-frame target image data output by the last first three-dimensional convolution layer (3D Conv_1) and the multi-frame target image data output by the second three-dimensional convolution layer (3D Conv_2) are combined into the multi-frame target image data output by the three-dimensional projection block.
  • the 3D residual block can use the design concept of a bottleneck.
  • the three-dimensional residual block is provided with multiple third three-dimensional convolution layers (3D Conv_3); therefore, in each three-dimensional residual block, the multiple third three-dimensional convolution layers (3D Conv_3) are sequentially called to perform three-dimensional convolution operations on the multi-frame target image data.
  • the first third three-dimensional convolution layer (3D Conv_3) can compress the number of channels of the multi-frame target image data, and the last third three-dimensional convolution layer (3D Conv_3) can recover the number of channels of the multi-frame target image data.
  • sequentially calling multiple third three-dimensional convolution layers (3D Conv_3) means that the multi-frame target image data input to the first third three-dimensional convolution layer (3D Conv_3) is the original target image data or the multi-frame target image data output by the previous three-dimensional stage, the input of a third three-dimensional convolution layer other than the first one is the multi-frame target image data output by the previous third three-dimensional convolution layer (3D Conv_3), and the multi-frame target image data output by the last third three-dimensional convolution layer (3D Conv_3) is the multi-frame target image data output by the entire three-dimensional residual block.
  • the three-dimensional feature extraction network in the embodiment of the present application is described below by using an example.
  • the 3D feature extraction network is divided into three 3D stages in order:
  • a 3D residual block is set up in the first 3D stage (stage_1).
  • in the second three-dimensional stage (stage_2), a three-dimensional projection block and a three-dimensional residual block are sequentially set.
  • in the third three-dimensional stage (stage_3), a three-dimensional projection block and a three-dimensional residual block are sequentially set.
  • two first three-dimensional convolution layers (3D Conv_1) and one second three-dimensional convolution layer (3D Conv_2) are set in each three-dimensional projection block; the convolution kernels of the first three-dimensional convolution layers (3D Conv_1) are 3 × 3 × 3 and 3 × 3 × 3 in turn, and the convolution kernel of the second three-dimensional convolution layer (3D Conv_2) is 3 × 3 × 3.
  • two third three-dimensional convolution layers (3D Conv_3) are set in each three-dimensional residual block, and the convolution kernels of the third three-dimensional convolution layers (3D Conv_3) are 3 × 3 × 3 and 3 × 3 × 3 in turn.
  • the above three-dimensional feature extraction network is only an example.
  • other three-dimensional feature extraction networks may be set according to actual conditions.
  • for example, three first three-dimensional convolution layers and a second three-dimensional convolution layer are set in each three-dimensional projection block, and/or three third three-dimensional convolution layers are set in each three-dimensional residual block, etc., which is not limited in this embodiment of the present application.
  • those skilled in the art may also adopt other three-dimensional feature extraction networks according to actual needs, which are not limited in this embodiment of the present application.
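  • a corresponding sketch of the three-dimensional projection block and three-dimensional residual block with the 3×3×3 kernels of the example configuration above; channel widths, strides and activation placement are assumptions, and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

class Projection3D(nn.Module):
    """3D projection block: two 3x3x3 convolutions on the main path (first conv layers)
    merged with a single 3x3x3 shortcut convolution (second conv layer)."""
    def __init__(self, c_in, c_out, stride=(1, 2, 2)):
        super().__init__()
        self.main = nn.Sequential(                                            # 3D Conv_1 x 2
            nn.Conv3d(c_in, c_out, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, 3, padding=1),
        )
        self.shortcut = nn.Conv3d(c_in, c_out, 3, stride=stride, padding=1)   # 3D Conv_2

    def forward(self, x):                 # x: (N, C, T, H, W) multi-frame target image data
        return torch.relu(self.main(x) + self.shortcut(x))

class Residual3D(nn.Module):
    """3D residual block: two 3x3x3 convolutions with an identity shortcut."""
    def __init__(self, c):
        super().__init__()
        self.main = nn.Sequential(                                            # 3D Conv_3 x 2
            nn.Conv3d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.main(x))

# The example configuration: stage_1 has only a residual block,
# stage_2 and stage_3 each have a projection block followed by a residual block.
backbone_3d = nn.Sequential(
    Residual3D(64),
    Projection3D(64, 128), Residual3D(128),
    Projection3D(128, 256), Residual3D(256),
)
```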
  • Step 1033 splicing the global spatial feature and the global time series feature into a global target feature.
  • after the multi-frame target image data is processed by the two-dimensional feature extraction network, multi-dimensional (such as 1024-dimensional) global spatial features can be output; after the multi-frame target image data is processed by the three-dimensional feature extraction network, multi-dimensional (such as 512-dimensional) global time series features can be output. The two sets of features, global spatial features and global time series features, can be spliced by the feature splicer 223 to obtain a multi-dimensional (such as 1536-dimensional) global target feature.
  • Step 1034 Map the global target feature to a preset action to obtain a global action appearing in the video data.
  • a linear global classifier (such as a fully connected layer) 224 can be set, and the global classifier 224 can perform global action classification.
  • the global target feature is input into the global classifier 224 and mapped to a preset action, so as to predict the action that appears in the video data in the global dimension; for convenience of distinction, this action is recorded as the global action.
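  • a short sketch of the splicing and classification step; the 1024- and 512-dimensional feature sizes come from the example above, while the number of preset actions is a hypothetical value:

```python
import torch
import torch.nn as nn

# Global spatial features (e.g. 1024-d) and global time series features (e.g. 512-d)
# are spliced and mapped to preset actions by a linear global classifier.
num_actions = 10                                           # hypothetical number of preset actions
global_classifier = nn.Linear(1024 + 512, num_actions)     # fully connected layer

spatial_feat = torch.randn(1, 1024)        # 2D branch output after spatial global pooling
temporal_feat = torch.randn(1, 512)        # 3D branch output after temporal global pooling
global_target_feat = torch.cat([spatial_feat, temporal_feat], dim=1)   # 1536-d global target feature
global_prob = global_classifier(global_target_feat).softmax(dim=1)    # probability of each global action
```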
  • Step 104 according to the local features of the multi-frame target image data, identify the actions that appear in the video data, and obtain the local actions.
  • a lightweight local action recognition network can be added on the basis of the global action recognition network, in order to make better use of the local action information in the video data and complement the global action recognition network, thereby improving the overall action recognition ability of the action recognition model.
  • the local action recognition network can be learned in a data-driven manner, thereby utilizing the information extracted by temporal convolution branches that match the local actions to complete local action recognition.
  • the sampling network is matched with the global action recognition network.
  • a local action recognition network can be constructed on the basis of the sampling network, and the multi-frame target image data output by the sampling network can be reused.
  • the additional computational overhead brought by the local action recognition branch is effectively controlled, so that the computational cost of the entire action recognition model is still kept at a low level.
  • the multi-frame target image data can be input into the local action recognition network, which has the ability of local modeling and can predict the actions likely to occur in local parts of the video data, as local actions.
  • step 104 may include the following steps:
  • Step 1041 extracting features representing motion in the local target image data as local motion features.
  • the initial part of the local action recognition network is local time series modeling, and the input of this part is the features extracted by the sampling network (such as a two-dimensional convolutional network) 210 from the multiple frames of original image data 202, that is, the target image data f_t^{local-2D}, where t ∈ {1, 2, ..., T} and T is the number of sampled frames.
  • the target image data itself does not contain timing information, and the timing information of local actions mainly lies in local motion features; therefore, as shown in FIG. 9, local motion features are extracted first.
  • the local motion feature can be represented by optical flow (each pixel value in the image data represents the displacement of the pixel at the corresponding spatial position from the current frame to the next frame), etc.; however, optical flow has large calculation and storage overhead and is therefore difficult to apply in large-scale video action recognition.
  • this embodiment uses frame difference features on target image data to express local motion features.
  • the difference 902 between any two adjacent frames of target image data 901 is calculated as a local motion feature 903 .
  • the change of two frames can reflect the local motion information between the target image data of two adjacent frames.
  • the local motion feature can be smoothed, for example by using a channel-wise spatial convolution with a 3×3 convolution kernel as a smoothing operation, to reduce the noise of the local motion features.
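  • a sketch of the frame-difference local motion features with channel-wise 3×3 smoothing; initializing the smoothing kernel as a box filter is an assumption (in practice the filter can be learned), and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

def local_motion_features(target_frames):
    """Frame-difference local motion features with channel-wise 3x3 smoothing.

    target_frames: (T, C, H, W) per-frame features output by the sampling network.
    Returns (T-1, C, H, W) smoothed differences between adjacent frames.
    """
    diff = target_frames[1:] - target_frames[:-1]          # difference of adjacent frames
    channels = diff.shape[1]
    # Channel-wise (depthwise) 3x3 spatial convolution used as a smoothing filter;
    # the box-filter initialization below is an assumption.
    smooth = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
    nn.init.constant_(smooth.weight, 1.0 / 9.0)
    return smooth(diff)

motion = local_motion_features(torch.randn(16, 64, 56, 56))   # -> (15, 64, 56, 56)
```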
  • Step 1042 Perform time-series convolution operations on the local motion features at multiple scales to obtain local time-series features.
  • this embodiment can perform multi-scale time series convolution 232 on the local motion feature in the time dimension, so as to model the local time series and obtain the local time series features.
  • by using multi-scale temporal convolution 232 in the local action recognition branch to learn local actions of different temporal granularities, the recognition ability of the action recognition model for different local actions can be improved.
  • the so-called multiple scales can refer to multiple convolution kernels 9051 of different sizes, such as convolution kernels of sizes 3, 5, and 7, used for one-dimensional time series convolution in the time dimension, so as to realize local time series modeling at a set of temporal scales.
  • time series convolution can refer to the convolution operation that performs convolution on the time dimension T.
  • Multiple convolution kernels 9051 may be used in parallel to perform convolution operations on local motion features along the time dimension to obtain multiple inter-frame time series features.
  • the local motion feature and multiple inter-frame time series features are added element-wise 9052 to obtain features of different time scales as local time series features.
  • in addition to using multiple temporal convolutions with different kernel sizes, there are other ways to perform multi-scale temporal modeling; for example, multiple max pooling layers or average pooling layers with different pooling window sizes can be used, which is not limited in this embodiment.
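  • a sketch of the multi-scale temporal convolution with kernel sizes 3, 5 and 7; operating on spatially pooled (channel, time) features is a simplifying assumption, and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Parallel 1D temporal convolutions with kernel sizes 3, 5 and 7; their outputs
    (inter-frame time series features) are added element-wise to the local motion feature."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):              # x: (N, C, T) local motion features
        out = x
        for branch in self.branches:   # element-wise sum of all temporal scales
            out = out + branch(x)
        return out                     # local time series features, same shape as x

feats = torch.randn(1, 64, 15)         # 15 inter-frame motion features, 64 channels (assumed sizes)
local_temporal = MultiScaleTemporalConv(64)(feats)
```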
  • Step 1043 Map the local time sequence feature to a preset action to obtain a single action appearing in the target image data of a single frame.
  • a linear local classifier (such as a fully connected layer) 233 can be set, and the local classifier 233 can perform local action classification.
  • the local time series features can be mapped to preset actions, so as to predict the actions that appear in the single frame of target image data.
  • the actions are recorded as single actions.
  • Step 1044 Integrate all single actions into local actions appearing in the video data.
  • if a local segment of the video data contains the target action, the video as a whole can be considered to contain the target action; therefore, as shown in FIG. 2, the frame-level action recognition results (i.e., single actions) are integrated through the pooling operation 234 to obtain the actions that appear in the entire video data.
  • the number of single actions under each action type can be counted, and the single action with the largest number is selected as the local action that appears in the video data.
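  • a sketch of the local classifier and the integration of single actions by majority vote; the feature width, the number of preset actions and the way the local probability is read out are assumptions, and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

num_actions = 10                                  # hypothetical number of preset actions
local_classifier = nn.Linear(64, num_actions)     # linear local classifier (fully connected layer)

local_temporal = torch.randn(15, 64)              # one local time series feature per frame position
frame_logits = local_classifier(local_temporal)   # (T, num_actions)
single_actions = frame_logits.argmax(dim=1)       # single action predicted for each frame

# Integrate all single actions into the local action of the whole video:
# count single actions per action type and keep the most frequent one (majority vote).
counts = torch.bincount(single_actions, minlength=num_actions)
local_action = counts.argmax().item()
# Assumption: the local probability is read out as the highest frame probability of that action.
local_prob = frame_logits.softmax(dim=1)[:, local_action].max().item()
```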
  • Step 105 Integrate the global action and the local action into a target action appearing in the video data.
  • as shown in FIG. 2, a fusion layer 240 can be added after the global action recognition network 220 and the local action recognition network 230 to fuse the global action and the local action, so as to comprehensively predict the action appearing in the video data.
  • the action is recorded as the target action.
  • the probability of a global action is determined, on the one hand, as a global probability, and, on the other hand, the probability of a local action is determined, as a local probability.
  • if the global action is the same as the local action, this action is set as the target action appearing in the video data, and the target probability of the target action is calculated based on the global probability and the local probability, wherein the target probability is positively correlated with both the global probability and the local probability, that is, the larger the global probability and the local probability, the larger the target probability, and the smaller the global probability and the local probability, the smaller the target probability.
  • the local action recognition network and the global action recognition network use local action information and global action information respectively for action recognition, the two have strong complementarity.
  • assuming that the global probability of the global action is p_global and the local probability of the local action is p_local, and that the two are approximately independent of each other, the product of the first difference and the second difference can be calculated as the inverse probability, where the first difference is one minus the global probability and the second difference is one minus the local probability; the inverse probability is then subtracted from one to obtain the target probability of the target action.
  • the target probability p is expressed as: p = 1 - (1 - p_global)(1 - p_local).
  • the prediction results (i.e., the global action and the local action) of the global action recognition branch and the local action recognition branch are fused based on the assumption of approximate independence, so that the complementarity of the two branches can be better utilized and the overall action recognition accuracy of the action recognition model is enhanced.
  • the above method of calculating the target probability is only an example.
  • other methods of calculating the target probability may be set according to the actual situation; for example, the global probability and the local probability may be weighted and then summed to calculate the target probability, etc., which is not limited in the embodiments of the present application.
  • those skilled in the art may also adopt other methods for calculating the target probability according to actual needs, which are not limited in this embodiment of the present application.
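  • a minimal sketch of the two fusion rules discussed above; the 0.5/0.5 weights in the weighted variant are assumptions:

```python
def fuse_probabilities(p_global: float, p_local: float, mode: str = "independence") -> float:
    """Fuse the global and local probabilities of the same predicted action.

    'independence' follows the formula in the text: p = 1 - (1 - p_global) * (1 - p_local).
    'weighted' is the alternative mentioned above; the 0.5/0.5 weights are assumptions.
    """
    if mode == "independence":
        inverse = (1.0 - p_global) * (1.0 - p_local)   # probability that neither branch is correct
        return 1.0 - inverse
    return 0.5 * p_global + 0.5 * p_local

print(fuse_probabilities(0.8, 0.6))   # 0.92 -> target probability of the target action
```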
  • after the target action and the target probability are obtained, the video data can be labeled with the target action according to the target probability, or pushed to technicians for manual review, which is not limited in this embodiment.
  • in this embodiment, video data having multiple frames of original image data is received, the multiple frames of original image data are sampled to obtain multiple frames of target image data, actions appearing in the video data are identified according to the global features of the multi-frame target image data to obtain a global action, actions appearing in the video data are identified according to the local features of the multi-frame target image data to obtain a local action, and the global action and the local action are fused into the target action appearing in the video data. A single sampling operation serves both the local action recognition branch and the global action recognition branch, that is, the two branches multiplex the same features, which reduces the data amount of the video data while maintaining its main features.
  • the method can be executed by a training device of the action recognition model; the training device can be implemented by software and/or hardware, and can be configured in computer equipment, such as servers, workstations, personal computers, etc. The method includes the following steps:
  • Step 1001 determine an action recognition model.
  • an action recognition model can be pre-built, the action recognition model can be implemented using MXNet (a deep learning framework designed for efficiency and flexibility) as the underlying support library, and the action recognition model can be trained using four graphics cards.
  • the action recognition model includes the following structure:
  • the sampling network is used to sample multiple frames of original image data of the video data to obtain multiple frames of target image data.
  • the sampling network is also used to perform a two-dimensional convolution operation on multiple frames of original image data to obtain multiple frames of target image data.
  • the global action recognition network is used to identify the actions that appear in the video data according to the global features of the multi-frame target image data, and obtain the global action.
  • the global action recognition network includes the following structures:
  • the two-dimensional feature extraction network is used to extract two-dimensional features from multi-frame target image data to obtain global spatial features.
  • the 2D feature extraction network is sequentially divided into multiple 2D stages, and each 2D stage is sequentially provided with a 2D projection block and a plurality of 2D residual blocks.
  • the 2D projection block is used to perform 2D convolution operations on multiple frames of target image data.
  • the two-dimensional residual block is used to perform a two-dimensional convolution operation on the multi-frame target image data, and the multi-frame target image data output by the last two-dimensional residual block is a global spatial feature.
  • the global action recognition network also includes a two-dimensional pooling layer, which is connected to the two-dimensional feature extraction network and is used to perform spatial global pooling operations on multi-frame target image data (such as global pooling average operation), as the global spatial feature.
  • the two-dimensional projection block is provided with a first two-dimensional convolution layer and a plurality of second two-dimensional convolution layers; the first two-dimensional convolution layer is used to perform a two-dimensional convolution operation on multiple frames of target image data.
  • the second two-dimensional convolution layer is used to perform a two-dimensional convolution operation on the multi-frame target image data; the multi-frame target image data output by the first two-dimensional convolution layer and the multi-frame target image data output by the second two-dimensional convolution layer are merged into the multi-frame target image data output by the two-dimensional projection block.
  • the two-dimensional residual block is provided with a plurality of third two-dimensional convolutional layers.
  • the third 2D convolution layer is used to perform 2D convolution operations on multiple frames of target image data.
  • the 3D feature extraction network is used to extract 3D features from multi-frame target image data to obtain global time series features.
  • the global action recognition network also includes a three-dimensional pooling layer, which is connected to the three-dimensional feature extraction network and performs a temporal global pooling operation (such as a global average pooling operation) on the multi-frame target image data, the result being used as the global time series feature.
  • the three-dimensional feature extraction network is divided into multiple three-dimensional stages in sequence; one or more three-dimensional residual blocks are set in the first three-dimensional stage, and a three-dimensional projection block and one or more three-dimensional residual blocks are set in turn in the other three-dimensional stages.
  • the 3D projection block is used to perform 3D convolution operation on multi-frame target image data;
  • the 3D residual block is used to perform 3D convolution operation on multi-frame target image data, and the multi-frame target image data output by the last 3D residual block is global timing characteristics.
  • the three-dimensional projection block is provided with a plurality of first three-dimensional convolutional layers and a second three-dimensional convolutional layer;
  • the first three-dimensional convolution layer is used to perform a three-dimensional convolution operation on the multi-frame target image data; the second three-dimensional convolution layer is used to perform a three-dimensional convolution operation on the multi-frame target image data; the multi-frame target image data output by the first three-dimensional convolution layer is combined with the multi-frame target image data output by the second three-dimensional convolution layer to form the multi-frame target image data output by the three-dimensional projection block.
  • the three-dimensional residual block is provided with a plurality of third three-dimensional convolutional layers.
  • the third 3D convolution layer is used to perform 3D convolution operation on multiple frames of target image data.
  • the feature splicer is used to splice the global spatial features and the global time series features into the global target feature.
  • the global classifier is used to map the global target features to preset actions to obtain the global actions that appear in the video data.
  • the local action recognition network is used to identify the actions appearing in the video data according to the local features of the multi-frame target image data, and obtain the local actions.
  • the local action recognition network includes the following structures:
  • the motion feature extraction network is used to extract features that characterize motion in parts of multi-frame target image data as local motion features.
  • the motion feature extraction network is also used to calculate the difference between any adjacent two frames of target image data as a local motion feature.
  • the local action recognition network further includes the following structure:
  • the smoothing layer cascaded after the motion feature extraction network, is used to smooth the local motion features.
  • the temporal feature extraction network is used to perform temporal convolution operations on local motion features at multiple scales to obtain local temporal features.
  • the time series feature extraction network includes:
  • multiple time series convolution layers, set with convolution kernels of different sizes; each time series convolution layer is used to perform a convolution operation on the local motion features along the time dimension with its specified convolution kernel to obtain inter-frame time series features.
  • the feature fusion layer is used to add local motion features and multiple inter-frame time series features to obtain local time series features.
  • the local classifier is used to map the local time series features to preset actions, and obtain the single action that appears in the single frame of target image data.
  • a global pooling layer is used to fuse all single actions into local actions that appear in the video data.
  • the global pooling layer is also used to count the number of single actions under each action type, and select the single action with the largest number as the local action that appears in the video data.
  • the global action and the local action are used to merge into a target action appearing in the video data.
  • the action recognition model may further include an action fusion layer, configured to determine the probability of the global action as the global probability and the probability of the local action as the local probability; if the global action is the same as the local action, this action is set as the target action appearing in the video data, and the target probability of the target action is calculated based on the global probability and the local probability, the target probability being positively correlated with both the global probability and the local probability.
  • the action fusion layer is further configured to calculate the product of the first difference value and the second difference value as the inverse probability, where the first difference value represents one minus the global probability and the second difference value represents one minus the local probability, and to subtract the inverse probability from one to obtain the target probability of the target action.
  • different data augmentation schemes can be adopted for the video data used as samples according to the data requirements of the business, such as random scaling and cropping, random motion blur, random flipping, and so on, which is not limited in this embodiment.
  • Step 1002 Calculate the global loss value of the global action recognition network when recognizing the global action.
  • a preset loss function can be used to calculate its loss value when recognizing the global action as the global loss value.
  • the loss function may be a cross-entropy loss, a loss function for classification tasks whose goal is to minimize the difference between the probability distribution of the global action predicted by the global action recognition network and the distribution of the correct global action given by the annotation.
  • Step 1003 Calculate the local loss value of the local action recognition network when recognizing the local action.
  • a preset loss function can be used to calculate the loss value of the local action recognition network as the local loss value.
  • the loss function may be a cross-entropy loss, a loss function for classification tasks whose goal is to minimize the difference between the probability distribution of the local action predicted by the local action recognition network and the distribution of the labelled correct local action.
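A sketch of the two cross-entropy losses, assuming PyTorch and integer video-level labels; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def global_and_local_losses(global_logits, local_logits, labels):
    """global_logits: (N, num_classes) scores from the global branch.
    local_logits:  (N, num_classes) video-level scores from the local branch.
    labels:        (N,) ground-truth action indices for each video.
    Returns the global loss value and the local loss value."""
    global_loss = F.cross_entropy(global_logits, labels)
    local_loss = F.cross_entropy(local_logits, labels)
    return global_loss, local_loss
```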
  • the video data is regarded as a sample package, and each local segment (ie, each frame of original image data) is regarded as an example.
  • when an example contains the target action, the example is a positive example, and the multiple local segments in the video data of a positive sample constitute a positive sample bag in multi-instance learning; the positive sample bag contains at least one positive example.
  • when none of the examples contains the target action, each example is a negative example, and the multiple local segments in the video data of a negative sample constitute a negative sample bag in multi-instance learning; all examples in the negative sample bag are negative examples.
  • multi-instance learning performs model training at the level of sample packages rather than individual examples: the local network is called to process the video data to determine the action that appears in the video data as the reference action; with the video data as the sample package and each frame of original image data as an example, the action with the highest probability in the sample package is taken as the local action of the sample package.
  • the difference between the reference action and the local action is calculated using the preset loss function as the local loss value of the local action recognition network when recognizing the local action.
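A sketch of the multi-instance treatment described above, where each frame is an instance and the video is the bag; PyTorch is assumed, and taking the per-class maximum over frames as the bag score is an illustrative reading of "the action with the highest probability in the sample package".

```python
import torch
import torch.nn.functional as F

def mil_local_loss(frame_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """frame_logits: (N, T, num_classes) per-frame (per-instance) scores from
    the local branch; labels: (N,) video-level (bag-level) action indices.
    Takes the highest-scoring instance per class in each bag as the bag
    prediction and computes cross-entropy against the video-level label."""
    bag_logits, _ = frame_logits.max(dim=1)      # (N, num_classes): max over instances
    return F.cross_entropy(bag_logits, labels)
```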
  • This embodiment uses the multi-instance learning method to train the local action recognition network to ensure that the local action recognition network is effectively trained, which can solve the problem that the target action often only appears in some segments of the video data in actual business scenarios.
  • Step 1004 Update the sampling network, the global action recognition network and the local action recognition network according to the global loss value and the local loss value.
  • the action recognition model (including the sampling network, the global action recognition network and the local action recognition network) can be regarded as a function mapping, and its training process is a process of solving a function optimization problem.
  • the goal of the optimization is to continuously update the parameters of the action recognition model (including the sampling network, the global action recognition network and the local action recognition network) so that, with the labelled samples as input data, the loss value between the prediction output by the action recognition model and the annotation is minimized.
  • the training process of the action recognition model is a process of parameter updating: compute the gradient direction of the loss function at the current parameters, then compute the update magnitude of the parameters according to the loss value and the learning rate, and update the parameters in the direction opposite to the gradient.
  • assuming the parameters of the action recognition model are denoted w and the loss function is f, the parameter gradient g_t of the loss function at the t-th step can be expressed as g_t = ∇_w f(w_{t-1}).
  • with learning rate a, the update magnitude of the parameters at the t-th step can be expressed as Δw_t = -a_t · g_t.
  • the update at step t+1 can be expressed as w_{t+1} = w_t + Δw_t.
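The update rule above is plain gradient descent; a minimal, framework-free numeric sketch.

```python
def sgd_step(w, grad, lr):
    """w_{t+1} = w_t + delta_w_t, with delta_w_t = -lr * g_t."""
    delta = [-lr * g for g in grad]                  # update magnitude
    return [wi + di for wi, di in zip(w, delta)]     # move against the gradient

# usage: sgd_step([0.5, -1.2], grad=[0.1, -0.3], lr=0.01) -> [0.499, -1.197]
```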
  • the gradient in the global action recognition network can be calculated based on the global loss value as the global gradient, so that the global action recognition network is applied with gradient descent to update the parameters in the global action recognition network.
  • the gradient in the local action recognition network can be calculated based on the local loss value as the local gradient, so that the local gradient is applied to the local action recognition network for gradient descent to update the parameters in the local action recognition network.
  • the global gradient and the local gradient can be combined (ie, vector addition) into an intersecting gradient, so that the intersecting gradient is applied to the sampling network for gradient descent to update the parameters of the sampling network.
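A sketch of how the shared sampling network can receive the combined gradient while each branch keeps its own, assuming PyTorch autograd: summing the two losses before a single backward pass deposits the vector sum of the global and local gradients (the intersecting gradient) on the shared sampling-network parameters.

```python
import torch

def joint_update(global_loss: torch.Tensor, local_loss: torch.Tensor, optimizer):
    """Backpropagating the sum of the two losses gives each branch its own
    gradient and gives the shared sampling network the vector sum of the
    global and local gradients."""
    optimizer.zero_grad()
    (global_loss + local_loss).backward()
    optimizer.step()
```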
  • some non-heuristic optimization algorithms can be used to improve the convergence speed of gradient descent and optimize the performance.
  • an action recognition model is determined, where the action recognition model includes a sampling network, a global action recognition network and a local action recognition network; the sampling network is used to sample multiple frames of original image data of the video data to obtain multiple frames of target image data, the global action recognition network is used to identify the actions that appear in the video data according to the global features of the multi-frame target image data to obtain the global action, and the local action recognition network is used to identify the actions that appear in the video data according to the local features of the multi-frame target image data to obtain the local action; the global action and the local action are used to be fused into the target action that appears in the video data.
  • the global loss value of the global action recognition network when recognizing the global action is calculated, the local loss value of the local action recognition network when recognizing the local action is calculated, and the sampling network, the global action recognition network and the local action recognition network are updated according to the global loss value and the local loss value; a single sampling operation is performed for the local action recognition network and the global action recognition network in the action recognition model, that is, the local action recognition network and the global action recognition network reuse the same feature.
  • the data volume of the video data can be reduced, and the calculation amount of the recognition action can be reduced.
  • by using the local action recognition network and the global action recognition network to separately perform action modeling and action recognition on the video data, the defect of focusing only on local action information or only on global action information is avoided and the flexibility of action recognition is improved; by fusing local actions and global actions to predict the action in the video data, the accuracy of recognizing a variety of different video data is improved.
  • the action recognition model is jointly trained by combining the loss values of the global action recognition network and the local action recognition network, so that the global action recognition network and the local action recognition network can better share the sampling network of the action recognition model, and achieve better results. overall performance.
  • FIG. 11 is a structural block diagram of a motion recognition apparatus provided in Embodiment 3 of the present application, which may include the following modules:
  • the video data receiving module 1101 is configured to receive video data, where the video data has multiple frames of original image data; the sampling module 1102 is configured to sample the original image data to obtain multiple frames of target image data; the global action recognition module 1103 is configured to identify the actions appearing in the video data according to the global features of the multi-frame target image data to obtain the global action; the local action recognition module 1104 is configured to identify the actions appearing in the video data according to the local features of the multi-frame target image data to obtain the local action; the target action fusion module 1105 is configured to fuse the global action and the local action into the target action appearing in the video data.
  • the global action recognition module 1103 includes:
  • the global spatial feature extraction module is configured to extract two-dimensional features from the multi-frame target image data to obtain the global spatial feature;
  • the global time sequence feature extraction module is configured to extract three-dimensional features from the multi-frame target image data to obtain the global time sequence feature;
  • the global target feature splicing module is configured to splice the global spatial feature and the global time sequence feature into the global target feature;
  • the global target feature mapping module is configured to map the global target feature to a preset action to obtain the global action appearing in the video data.
  • the two-dimensional feature extraction network for extracting global spatial features is divided into multiple two-dimensional stages in sequence, and each two-dimensional stage is sequentially provided with a two-dimensional projection block and a plurality of two-dimensional residual blocks;
  • the global space feature extraction module is also set to:
  • in the current two-dimensional stage, the two-dimensional projection block is called to perform a two-dimensional convolution operation on the multi-frame target image data, and the multiple two-dimensional residual blocks are sequentially called to perform two-dimensional convolution operations on the multi-frame target image data;
  • it is judged whether all two-dimensional stages have been traversed; if all two-dimensional stages have been traversed, the multi-frame target image data after the two-dimensional convolution operations is output as the global spatial feature; if not all two-dimensional stages have been traversed, the multi-frame target image data after the two-dimensional convolution operations is output to the next two-dimensional stage, and the process returns to calling the two-dimensional projection block and the two-dimensional residual blocks in the current two-dimensional stage.
  • the two-dimensional projection block is provided with a first two-dimensional convolutional layer and a plurality of second two-dimensional convolutional layers; the global spatial feature extraction module is further configured as:
  • the two-dimensional residual block is provided with a plurality of third two-dimensional convolutional layers; the global spatial feature extraction module is further set to:
  • the plurality of third two-dimensional convolution layers are sequentially invoked to perform two-dimensional convolution operations on the target image data.
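A sketch of one bottleneck 2D residual block of the kind described above, following the 1×1, 3×3, 1×1 kernel pattern given as an example later in the description; PyTorch, the channel sizes and the activation choice are assumptions.

```python
import torch
import torch.nn as nn

class Residual2DBlock(nn.Module):
    """Bottleneck 2D residual block: three 2D convolutions applied in sequence
    (1x1 to shrink channels, 3x3, 1x1 to restore channels) with a skip
    connection; applied independently to every sampled frame."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W), frames folded into the batch dimension
        return torch.relu(x + self.body(x))
```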
  • the 3D feature extraction network for extracting global time series features is divided into multiple 3D stages in sequence; one or more 3D residual blocks are set in the first 3D stage, and in each of the other 3D stages a three-dimensional projection block and one or more three-dimensional residual blocks are arranged in sequence; the global time series feature extraction module is also set to:
  • in the current three-dimensional stage, call the three-dimensional projection block to perform a three-dimensional convolution operation on the multi-frame target image data, and/or sequentially call the three-dimensional residual blocks to perform three-dimensional convolution operations on the multi-frame target image data; judge whether all three-dimensional stages have been traversed; if all three-dimensional stages have been traversed, output the multi-frame target image data after the three-dimensional convolution operations as the global time series feature; if not, output the multi-frame target image data after the three-dimensional convolution operations to the next three-dimensional stage, and return to the above step in the current three-dimensional stage.
  • the three-dimensional projection block is provided with a plurality of first three-dimensional convolution layers and a second three-dimensional convolution layer; the global time series feature extraction module is further configured as:
  • the plurality of first three-dimensional convolution layers are sequentially called to perform three-dimensional convolution operations on the multi-frame target image data, the second three-dimensional convolution layer is called to perform a three-dimensional convolution operation on the multi-frame target image data, and the multi-frame target image data output by the first three-dimensional convolution layers is combined with the multi-frame target image data output by the second three-dimensional convolution layer.
  • the three-dimensional residual block is provided with a plurality of third three-dimensional convolution layers; the global time series feature extraction module is further set to:
  • the plurality of third three-dimensional convolution layers are sequentially called to perform three-dimensional convolution operations on the multi-frame target image data.
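A matching sketch of a 3D residual block with two 3×3×3 convolutions (the kernel sizes follow the example given later in the description); PyTorch and the activation choice are assumptions.

```python
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """3D residual block: 3D convolutions over (T, H, W) with a skip
    connection, modelling temporal relations between adjacent frames."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W)
        return torch.relu(x + self.body(x))
```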
  • the local action recognition module 1104 includes:
  • the local motion feature extraction module is configured to extract the feature representing motion in the part of the multi-frame target image data as the local motion feature;
  • the local time series feature generation module is configured to perform time series convolution operations on the local motion features at multiple scales to obtain local time series features;
  • the local time series feature mapping module is configured to map the local time series features to preset actions to obtain the single action that appears in a single frame of target image data;
  • the single action fusion module is configured to fuse all single actions into the local action that appears in the video data.
  • the local motion feature extraction module includes:
  • the inter-frame difference calculation module is set to calculate the difference between any adjacent two frames of target image data as a local motion feature.
  • the local timing feature generation module includes:
  • the multi-sequence convolution module is configured to use multiple convolution kernels to perform convolution operations on the local motion features along the time dimension to obtain multiple inter-frame time-series features;
  • the feature fusion module is configured to add the local motion features and the multiple inter-frame time series features to obtain the local time series features.
  • the single action fusion module includes:
  • the quantity statistics module is set to count the number of single actions under each action type; the local action selection module is set to select the single action with the largest number as the partial action appearing in the video data.
  • the local action recognition module 1104 further includes:
  • a smoothing operation module configured to perform a smoothing operation on the local motion feature.
  • the target action fusion module 1105 includes:
  • a global probability determination module, configured to determine the probability of the global action as the global probability;
  • a local probability determination module, configured to determine the probability of the local action as the local probability;
  • a target action determination module, configured to, if the global action is the same as the local action, set the global action and the local action as the target action appearing in the video data;
  • a target probability calculation module, configured to calculate the target probability of the target action based on the global probability and the local probability, where the target probability is positively correlated with both the global probability and the local probability.
  • the target probability calculation module includes:
  • an inversion probability calculation module, configured to calculate the product of a first difference and a second difference as the inversion probability, where the first difference is one minus the global probability and the second difference is one minus the local probability; an inversion probability subtraction module, configured to subtract the inversion probability from one to obtain the target probability of the target action.
  • the motion recognition device provided by the embodiment of the present application can execute the motion recognition method provided by any embodiment of the present application, and has functional modules and effects corresponding to the execution method.
  • FIG. 12 is a structural block diagram of an apparatus for training an action recognition model provided in Embodiment 4 of the present application, which may include the following modules:
  • the action recognition model determination module 1201 is configured to determine an action recognition model, where the action recognition model includes a sampling network, a global action recognition network and a local action recognition network; the sampling network is used to sample multiple frames of original image data of the video data to obtain multi-frame target image data, the global action recognition network is used to identify the actions that appear in the video data according to the global features of the multi-frame target image data to obtain the global action, the local action recognition network is used to identify the actions that appear in the video data according to the local features of the multi-frame target image data to obtain the local action, and the global action and the local action are used to be fused into the target action appearing in the video data;
  • the global loss value calculation module 1202 is configured to calculate the global loss value of the global action recognition network when recognizing the global action; the local loss value calculation module 1203 is configured to calculate the local loss value of the local action recognition network when recognizing the local action; the action recognition model update module 1204 is configured to update the sampling network, the global action recognition network and the local action recognition network according to the global loss value and the local loss value.
  • the local loss value calculation module 1203 includes:
  • the reference action determination module is configured to determine the action that occurs in the video data as the reference action; the local action determination module is configured to take the video data as a sample package and each frame of original image data as an example, and to take the action with the highest probability in the sample package as the local action of the sample package; the action difference calculation module is configured to calculate the difference between the reference action and the local action as the local loss value of the local action recognition network when recognizing the local action.
  • the action recognition model update module 1204 includes:
  • a global gradient calculation module configured to calculate the gradient in the global action recognition network based on the global loss value, as a global gradient
  • a local gradient calculation module configured to calculate the gradient in the local action recognition network based on the local loss value, as a local gradient
  • an intersection gradient calculation module configured to combine the global gradient and the local gradient into an intersection gradient
  • a global action recognition network update module configured to apply the global gradient to perform gradient descent on the global action recognition network , to update the global action recognition network
  • the local action recognition network update module is set to apply the local gradient to perform gradient descent on the local action recognition network to update the local action recognition network
  • the sampling network update module is configured to apply the intersecting gradient to perform gradient descent on the sampling network to update the sampling network.
  • the motion recognition model training apparatus provided in the embodiment of the present application can execute the training method of the motion recognition model provided by any embodiment of the present application, and has functional modules and effects corresponding to the execution method.
  • FIG. 13 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application.
  • Figure 13 shows a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application.
  • the computer device 12 shown in FIG. 13 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including both volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
  • Computer device 12 may include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in FIG. 13, commonly referred to as a "hard drive").
  • although not shown in FIG. 13, a magnetic disk drive for reading and writing to removable non-volatile magnetic disks (such as "floppy disks"), and an optical disc drive for reading and writing to removable non-volatile optical discs (such as Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc Read-Only Memory (DVD-ROM) or other optical media), may be provided.
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • the memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of the embodiments of the present application.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each of these examples, or some combination thereof, may include an implementation of a network environment.
  • Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the processing unit 16 executes a variety of functional applications and data processing by running programs stored in the system memory 28 , such as implementing the motion recognition method and the motion recognition model training method provided by the embodiments of the present application.
  • Embodiment 6 of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the processes of the above-mentioned motion recognition method and motion recognition model training method are implemented and the same technical effects can be achieved, which are not repeated here to avoid repetition.
  • Computer-readable storage media may include, but are not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof, for example.
  • Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible or non-transitory storage medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

本文公开了一种动作识别模型的训练方法及装置、动作识别方法及装置。该动作识别方法包括:接收视频数据,视频数据中具有多帧原始图像数据,对多帧原始图像数据进行采样,获得多帧目标图像数据,根据多帧目标图像数据在全局的特征,识别在视频数据中出现的动作,获得全局动作,根据多帧目标图像数据在局部的特征,识别在视频数据中出现的动作,获得局部动作,将全局动作与局部动作融合为在视频数据中出现的目标动作。

Description

动作识别模型的训练方法及装置、动作识别方法及装置
本申请要求在2021年01月15日提交中国专利局、申请号为202110056978.X的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机视觉的技术领域,例如涉及一种动作识别模型的训练方法及装置、动作识别方法及装置。
背景技术
随着短视频等视频应用的快速发展,用户可以随时随地制作视频数据并上传至视频平台,使得互联网上存在着海量的视频数据。由于互联网具有公开性和广泛传播性,所以视频平台在视频数据公开之前,会对这些视频数据进行内容审核、实施有效的监管。
动作识别是内容审核的一部分,用于过滤涉及暴力等视频数据。
传统的对视频数据进行动作识别的方法基于人工设计的特征提取算子,使得提取到的特征很难适应视频数据的内容多样性,因此,视频动作识别方法主要使用基于深度学习的方法。但是,基于深度学习的方法一般较为单一,灵活性较差,导致动作识别的精度较低。
发明内容
本申请提出了一种动作识别模型的训练方法及装置、动作识别方法及装置,以解决基于深度学习的方法对视频数据进行动作识别精度较低的问题。
本申请提供了一种动作识别方法,包括:
接收视频数据,其中,所述视频数据中具有多帧原始图像数据;
对所述多帧原始图像数据进行采样,获得多帧目标图像数据;
根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作;
根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作;
将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作。
本申请还提供了一种动作识别模型的训练方法,包括:
确定动作识别模型,所述动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;所述采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,所述全局动作识别网络用于根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,所述局部动作识别网络用于根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,所述全局动作与所述局部动作用于融合为在所述视频数据中出现的目标动作;
计算所述全局动作识别网络在识别所述全局动作时的全局损失值;
计算所述局部动作识别网络在识别所述局部动作时的局部损失值;
根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作识别网络与所述局部动作识别网络。
本申请还提供了一种动作识别装置,包括:
视频数据接收模块,设置为接收视频数据,其中,所述视频数据中具有多帧原始图像数据;
采样模块,设置为对所述多帧原始图像数据进行采样,获得多帧目标图像数据;
全局动作识别模块,设置为根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作;
局部动作识别模块,设置为根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作;
目标动作融合模块,设置为将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作。
本申请还提供了一种动作识别模型的训练装置,包括:
动作识别模型确定模块,设置为确定动作识别模型,其中,所述动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;所述采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,所述全局动作识别网络用于根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,所述局部动作识别网络用于根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,所述全局动作与所述局部动作用于融合为在所述视频数据中出现的目标动作;
全局损失值计算模块,设置为计算所述全局动作识别网络在识别所述全局动作时的全局损失值;
局部损失值计算模块,设置为计算所述局部动作识别网络在识别所述局部动作时的局部损失值;
动作识别模型更新模块,设置为根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作识别网络与所述局部动作识别网络。
本申请还提供了一种计算机设备,所述计算机设备包括:
一个或多个处理器;
存储器,设置为存储一个或多个程序;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现上述的动作识别方法或者上述的动作识别模型的训练方法。
本申请还提供了一种计算机可读存储介质,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现上述的动作识别方法或者上述的动作识别模型的训练方法。
附图说明
图1为本申请实施例一提供的一种动作识别方法的流程图;
图2为本申请实施例一提供的一种动作识别的架构示意图;
图3为本申请实施例一提供的一种二维特征提取网络的结构示例图;
图4为本申请实施例一提供的一种二维投影块的结构示例图;
图5为本申请实施例一提供的一种二维残差块的结构示例图;
图6为本申请实施例一提供的一种三维特征提取网络的结构示例图;
图7为本申请实施例一提供的一种三维投影块的结构示例图;
图8为本申请实施例一提供的一种三维残差块的结构示例图;
图9为本申请实施例一提供的一种局部时序建模的示意图;
图10是本申请实施例二提供的一种动作识别模型的训练方法的流程图;
图11为本申请实施例三提供的一种动作识别装置的结构示意图;
图12为本申请实施例三提供的一种动作识别模型的训练装置的结构示意图;
图13为本申请实施例四提供的一种计算机设备的结构示意图。
具体实施方式
下面结合附图和实施例对本申请进行说明。此处所描述的具体实施例仅仅用于解释本申请。为了便于描述,附图中仅示出了与本申请相关的部分。
基于深度学习的方法对视频进行动作识别主要包括以空间上的二维(2 Dimension,2D)卷积,时间和空间上的三维(3D)卷积以及时序上的一维(1D)卷积等为基础构建特征提取网络,特征提取网络的构建方法主要包括如下两类:
(1)、基于局部视频片段建模的方法
在视频数据中通过密集采帧的方法获得一系列局部的图像数据,作为局部视频序列,从局部视频序列中识别出局部(即视频片段)包含的动作。
但是,基于局部视频片段建模的方法无法利用视频数据中的全局时序信息,因此,无法直接识别整个视频数据中包含的动作。
为了提高识别整个视频数据中包含的动作的准确率,一般是对一个视频数据进行多次采样处理,这会增加了整体的计算开销。
同时,基于局部视频片段建模的方法中假设视频数据中的每一个局部中都包含具有判别能力的动作,这一假设在实际应用场景中是不合理的。
(2)基于全局视频序列建模的方法
对视频数据进行稀疏采帧得到全局的图像数据,作为全局视频序列,并从全局视频序列中识别出全局(即视频整体)包含的动作。
但是,基于全局视频序列建模的方法需要对视频数据进行整体建模,因此,具有判别能力的局部动作信息可能会被视频数据中的其他内容所抑制,从而无法被有效利用。
此外,基于全局视频序列建模的方法假设作为目标的动作分布在整个视频数据中,但在实际应用场景中,作为目标的动作经常分布在视频数据中的一个片段内,其余片段则包含与作为目标的动作无关的其他动作。
综上,这两种方法都无法很好地覆盖不同时序粒度的动作,基于局部视频片段建模的方法无法利用视频数据中的全局时序信息,基于全局视频序列建模的方法对局部动作的时序信息缺乏关注。
这两种方法都假设视频数据中的每个片段包含具有判别能力的动作,这一假设在实际应用场景中经常是不成立的,实际场景下短视频中经常包含大量与作为目标的动作无关的内容,而具有判别能力的动作只出现在一部分片段中。
为了提高动作识别的准确率,基于全局视频序列建模的方法通常需求对一个视频数据进行多次采样处理,增加了整体的计算开销。
针对上述问题,本申请实施例在一个动作识别模型内利用全局动作识别网络和局部动作识别网络分别识别不同时序粒度的动作,使用多示例学习(Multiple Instance Learning,MIL)的方法对局部动作识别网络进行学习以关注具有判别能力的片段(即包含作为目标的动作的片段),从而可以高效地识别不同时序粒度的动作,对每一个视频数据,动作识别模型做一次采样和处理,从而可以降低整体计算开销,提高视频内容审核的效率。
此外,本申请将局部动作识别的问题建模为多示例学习问题,通过多示例学习的方法,使动作识别模型关注到具有较强判别能力的局部动作信息,从而减少无关背景片段的影响。
实施例一
图1为本申请实施例一提供的一种动作识别方法的流程图,本实施例可适用于基于全局和局部对视频数据进行动作识别的情况,该方法可以由动作识别装置来执行,该动作识别装置可以由软件和/或硬件实现,可配置在计算机设备中,例如,服务器、工作站、个人电脑,等等,包括如下步骤:
步骤101、接收视频数据。
在实际应用中,用户可以在客户端中实时制作视频数据或编辑在先的视频数据,如短视频、微电影、直播数据等,将视频数据上传至视频平台,意图在该视频平台发布该视频数据,让公众传阅、浏览。
不同的视频平台可按照业务、法律等因素制定视频内容审核标准,在发布视频数据之前,按照该审核规范对该视频数据的内容进行审核,本实施例可在动作的维度上对视频数据的内容进行审核,过滤掉一些不符合视频内容审核标准的视频文件,如包含色情、低俗、暴力等内容的视频数据,从而发布一些符合视频内容审核标准的视频数据。
如果对于实时性要求较高,在视频平台中可设置流式实时系统,用户通过客户端实时将视频数据上传至该流式实时系统,该流式实时系统可将该视频数据传输至在动作的维度上对视频数据进行内容审核的计算机设备。
如果对于实时性要求较低,在视频平台中可设置数据库,如分布式数据库等,用户通过客户端将视频数据上传至该数据库,在动作的维度上对视频数据进行内容审核的计算机设备可从该数据库读取该视频数据。
在本实施例中,如图2所示,可预先训练动作识别模型,该动作识别模型可融合局部动作信息和全局动作信息预测视频数据出现的目标动作,在训练动作识别模型完成时,可将动作识别模型的参数和结构保存起来,在执行视频内容审核的流程时,直接加载动作识别模型、完成视频数据中的目标动作识别即可,无需重新训练动作识别模型。
该动作识别模型包括采样网络(又称采样层)210、全局动作识别网络220、局部动作识别网络230,采样网络210统一为全局动作识别网络220、局部动作识别网络230提供视频数据的特征,全局动作识别网络220可在全局的维度对视频数据识别动作,局部动作识别网络230可在局部的维度对视频数据识别动作,全局动作识别网络220与局部动作识别网络230为动作识别模型中并行的分支,使得动作识别模型同时得到全局和局部的建模能力。
步骤102、对多帧原始图像数据进行采样,获得多帧目标图像数据。
如图2所示,在视频数据201中具有多帧原始图像数据202,针对视频数据201,可进行解码并抽取多帧原始图像数据202,将多帧原始图像数据202输入采样网络210中执行采样操作,输出目标图像数据,从而在保持视频数据201的主要特征的情况下,降低视频数据的数据量,降低识别动作的计算量。
在部分情况下,为降低计算的开销量,并非把视频数据的所有原始数据作为动作识别模型的输入,而是逐秒均匀地抽取多帧(如15帧)原始图像数据作为代表输入到动作识别模型中。视频数据的抽取方式会对识别动作的精度造成一定的影响,此时,可以根据业务场景设计更具偏向性的抽取方式。
一般情况下,二维的操作较三维的操作简单,若在动作识别模型中,以二维的操作为主、三维操作为辅,则可以对多帧原始图像数据执行二维的采样操作,获得多帧目标图像数据。
在一种采样的方式中,可对多帧原始图像数据执行二维卷积操作,获得多帧目标图像数据。
二维卷积操作是指以高(H)、宽(W)上的2个维度进行卷积的操作。
若在动作识别模型中,以三维的操作为主、二维的操作为辅,则可以对多帧原始图像数据执行三维的采样操作(如三维卷积操作,即,以时间(T)、H、W上的3个维度进行卷积的卷积操作),获得多帧目标图像数据,本实施例对此不加以限制。
步骤103、根据多帧目标图像数据在全局的特征,识别在视频数据中出现的动作,获得全局动作。
针对目标图像数据,可输入全局动作识别网络220,全局动作识别网络220 具有全局建模的能力,预测全局视频数据可能出现的动作,作为全局动作。
在本申请的一个实施例中,步骤103可以包括如下步骤:
步骤1031、对目标图像数据提取二维下的特征,获得全局空间特征。
如图2所示,为了高效实现对视频数据进行动作识别的时序建模,全局动作识别网络220使用了二维特征提取网络2221与三维特征提取网络2211这两种结构提取特征。
在二维特征提取网络2221中,可对每一帧目标图像数据执行二维卷积操作,从而对空间信息建模,获得全局空间特征。
在一种二维特征提取网络的结构示例中,如图3所示,二维特征提取网络为多层的残差神经网络,二维特征提取网络包括多个二维阶段(stage),在每个二维阶段中依次设置有二维投影块、以及多个二维残差块,即,在提取全局空间特征时按照顺序将二维特征提取网络划分为多个二维阶段(stage),在每个二维阶段中依次设置有二维投影块、多个二维残差块。
二维残差块是一种在H、W的维度下使用跨步连接来构建的卷积神经网络模块,通常由两到三个卷积层组成。
则在本示例中,步骤1031包括如下步骤:
步骤10311、在当前二维阶段中,调用二维投影块对多帧目标图像数据执行二维卷积操作,依次调用多个二维残差块对多帧目标图像数据执行二维卷积操作。
步骤10312、判断是否已遍历所有二维阶段;若已遍历所有二维阶段,则执行步骤10313,若未遍历所有二维阶段,则执行步骤10314。
步骤10313、将执行二维卷积操作后的多帧目标图像数据输出为全局空间特征。
步骤10314、将执行二维卷积操作后的多帧目标图像数据输出至下一个二维阶段,返回执行步骤10311。
从第一个二维阶段开始(即当前二维阶段初始为第一个二维阶段)、按照顺序依次遍历每个二维阶段,即按照顺序依次调用每个二维阶段对多帧目标图像数据进行处理。
二维特征提取网络中存在n(n为正整数,n≥2)个二维阶段,第1个二维阶段的输入,为初始的多帧目标图像数据,第2-n个二维阶段的输入,为上一个二维阶段输出的多帧目标图像数据。第n个二维阶段输出的多帧目标图像数据,为整个二维特征提取网络输出的全局空间特征。
在一些设计中,如图2所示,可在全局动作识别网络220中设置二维池化层2222,该二维池化层级2222级联在二维特征提取网络2221之后,在步骤10313中,对执行二维卷积操作后的多帧目标图像数据执行空间上的全局池化操作(如全局池化平均操作),作为全局空间特征。
在每个二维阶段中,均可调用二维投影块对多帧目标图像数据执行二维卷积操作,依次调用多个二维残差块对多帧目标图像数据执行二维卷积操作。
所谓依次调用多个二维残差块,是指第一个二维残差块输入的多帧目标图像数据为二维投影块输出的多帧目标图像数据,非第一个二维残差块输入的多帧目标图像数据为上一个二维残差块输出的多帧目标图像数据,最后一个二维残差块输出的多帧目标图像数据为整个二维阶段输出的多帧目标图像数据。
对于二维投影块,在每经过一个二维阶段时,二维投影块可以缩小目标图像数据的尺寸、扩大目标图像数据的通道,对输入的每帧目标图像数据单独提取特征,获取视频数据在多个时间点上的空间信息。
如图4所示,二维投影块设置有第一二维卷积层(2D Conv_1)、以及多个第二二维卷积层(2D Conv_2),则在二维投影块提取特征时,一方面,调用第一二维卷积层(2D Conv_1)对多帧目标图像数据执行二维卷积操作,另一方面,依次调用多个第二二维卷积层(2D Conv_2)对多帧目标图像数据执行二维卷积操作。
所谓依次调用多个第二二维卷积层(2D Conv_2),是指第一个第二二维卷积层(2D Conv_2)输入的多帧目标图像数据为原始的多帧目标图像数据或上一个二维阶段输出的多帧目标图像数据,非第一个第二二维卷积层(2D Conv_2)输入的多帧目标图像数据为上一个第二二维卷积层(2D Conv_2)输出的多帧目标图像数据。
从而对第一二维卷积层(2D Conv_1)输出的多帧目标图像数据与第二二维卷积层(2D Conv_2)输出的多帧目标图像数据进行合并。
此外,为了减少二维残差块运算的通道数、进而减少参数量,二维残差块可使用瓶颈(bottleneck)的设计理念。
如图5所示,对于bottleneck的设计,二维残差块设置有多个第三二维卷积层(2D Conv_3),因此,在每个二维残差块中,可依次调用多个第三二维卷积层(2D Conv_3)对多帧目标图像数据执行二维卷积操作,第一个第三二维卷积层(2D Conv_3)可压缩多帧目标图像数据的通道数,最后一个第三二维卷积层(2D Conv_3)可恢复多帧目标图像数据的通道数。
所谓依次调用多个第三二维卷积层(2D Conv_3),是指第一个第三二维卷 积层(2D Conv_3)输入的多帧目标图像数据为二维投影块或上一个二维残差块输出的多帧目标图像数据,非第一个第三二维卷积层(2D Conv_3)输入的多帧目标图像数据为上一个第三二维卷积层(2D Conv_3)输出的多帧目标图像数据,最后一个第三二维卷积层(2D Conv_3)输出的多帧目标图像数据为整个二维残差块输出的多帧目标图像数据。
以下通过示例来说明本申请实施例中的二维特征提取网络。
在本示例中,如图3所示,二维特征提取网络按照顺序被划分为四个二维阶段(stage):
在第一个二维阶段(stage_1)中依次设置有一个二维投影块、三个二维残差块(二维残差块_1-二维残差块_3)。
在第二个二维阶段(stage_2)中依次设置有一个二维投影块、四个二维残差块(二维残差块_1-二维残差块_4)。
在第三个二维阶段(stage_3)中依次设置有一个二维投影块、六个二维残差块(二维残差块_1-二维残差块_6)。
在第四个二维阶段(stage_4)中依次设置有一个二维投影块、三个二维残差块(二维残差块_1-二维残差块_3)。
如图4所示,在每个二维投影块中均设置有一个第一二维卷积层(2D Conv_1)、三个第二二维卷积层(2D Conv_2),第一二维卷积层(2D Conv_1)的卷积核为1×1,第二二维卷积层(2D Conv_2)的卷积核依次为1×1、3×3、1×1。
如图5所示,在每个二维残差块中均设置有三个第三二维卷积层(2D Conv_3),第三二维卷积层(2D Conv_3)的卷积核依次为1×1、3×3、1×1。
上述二维特征提取网络只是作为示例,在实施本申请实施例时,可以根据实际情况设置其它二维特征提取网络,例如,为了降低计算量,在每个二维投影块中设置有一个第一二维卷积层、两个第二二维卷积层,和/或,在每个二维残差块中设置有两个第三二维卷积层,等等,本申请实施例对此不加以限制。另外,除了上述二维特征提取网络外,本领域技术人员还可以根据实际需要采用其它二维特征提取网络,本申请实施例对此也不加以限制。
步骤1032、对多帧目标图像数据提取三维下的特征,获得全局时序特征。
在三维特征提取网络中,可对每一帧目标图像数据执行三维卷积操作,从而对相邻的目标图像数据进行时序信息建模,获得全局时序特征。
在一种三维特征提取网络的结构示例中,如图6所示,三维特征提取网络为多层的残差神经网络,一般情况下,为了降低计算量,三维特征提取网络的 层级小于二维特征提取网络的层级。
三维特征提取网络包括多个三维阶段(stage),第一个三维阶段中设置一个或多个三维残差块,其他三维阶段中依次设置有三维投影块、一个或多个三维残差块,在提取全局时序特征时将三维特征提取网络按照顺序划分为多个三维阶段(stage),第一个三维阶段中设置一个或多个三维残差块,其他三维阶段中依次设置有三维投影块、一个或多个三维残差块。
三维残差块是一种在T、H、W的维度下使用跨步连接来构建的卷积神经网络模块,通常由两到三个卷积层组成。
则在本示例中,步骤1032包括如下步骤:
步骤10321、在当前三维阶段中,调用三维投影块对多帧目标图像数据执行三维卷积操作,和/或,调用三维残差块对多帧目标图像数据执行三维卷积操作。
步骤10322、判断是否已遍历所有三维阶段;若已遍历所有三维阶段,则执行步骤10323,若未遍历所有三维阶段,则执行步骤10324。
步骤10323、将执行三维卷积操作后的多帧目标图像数据输出为全局时序特征;
步骤10324、将执行三维卷积操作后的多帧目标图像数据输出至下一个三维阶段,返回执行步骤10321。
从第一个三维阶段开始(即当前三维阶段初始为第一个三维阶段)、按照顺序依次遍历每个三维阶段,即按照顺序依次调用每个三维阶段对多帧目标图像数据进行处理。
三维特征提取网络中存在m(m为正整数,m≥2)个三维阶段,第1个三维阶段的输入,为初始的多帧目标图像数据,第2-m个三维阶段的输入,为上一个三维阶段输出的多帧目标图像数据。第m个三维阶段输出的多帧目标图像数据,为整个三维特征提取网络输出的全局时序特征。
在一些设计中,如图2所示,可在全局动作识别网络220中设置三维池化层2212,该三维池化层2212级联在三维特征提取网络2211之后,在步骤10323中,对执行三维卷积操作后的多帧目标图像数据执行时序上的全局池化操作(如全局池化平均操作),作为全局时序特征。
在第一个三维阶段中,可调用三维残差块对多帧目标图像数据执行三维卷积操作,在第二个及第二个以后的三维阶段中,可调用三维投影块对多帧目标图像数据执行三维卷积操作,调用三维残差块对多帧目标图像数据执行三维卷积操作。
若三维阶段存在多个三维残差块,则可以依次调用多个三维残差块对多帧目标图像数据执行三维卷积操作。
所谓依次调用多个三维残差块,是指第一个三维残差块输入的多帧目标图像数据为三维投影块输出的多帧目标图像数据,非第一个二维残差块输入的多帧目标图像数据为上一个三维残差块输出的多帧目标图像数据,最后一个维残差块输出的多帧目标图像数据为整个三维阶段输出的多帧目标图像数据。
对于三维投影块,在每经过非第一个三维阶段时,三维投影块可以缩小目标图像数据的尺寸、扩大目标图像数据的通道,对输入的相邻两帧目标图像数据之间关联提取特征,获取视频数据的时序信息。
如图7所示,三维投影块设置有多个第一三维卷积层(3D Conv_1)、以及第二三维卷积层(3D Conv_2),则在三维投影块提取特征时,一方面,依次调用多个第一三维卷积层(3D Conv_1)对多帧目标图像数据执行三维卷积操作,另一方面,调用第二三维卷积层(3D Conv_2)对多帧目标图像数据执行三维卷积操作。
所谓依次调用多个第一三维卷积层(3D Conv_1),是指第一个第一三维卷积层(3D Conv_1)输入的多帧目标图像数据为上一个三维残阶段输出的多帧目标图像数据,非第一个第一三维卷积层(3D Conv_1)输入的多帧目标图像数据为上一个第一三维卷积层(3D Conv_1)输出的多帧目标图像数据。
从而对第一三维卷积层(3D Conv_1)输出的多帧目标图像数据与第二三维卷积层(3D Conv_2)输出的多帧目标图像数据进行合并。
此外,为了减少三维残差块运算的通道数、进而减少参数量,三维残差块可使用瓶颈(bottleneck)的设计理念。
如图8所示,对于bottleneck的设计,三维残差块设置有多个第三三维卷积层(3D Conv_3),因此,在每个三维残差块中,依次调用多个第三三维卷积层(3D Conv_3)对多帧目标图像数据执行三维卷积操作,第一个第三三维卷积层(3D Conv_3)可压缩多帧目标图像数据的通道数,最后一个第三三维卷积层(3D Conv_3)可恢复多帧目标图像数据的通道数。
所谓依次调用多个第三三维卷积层(3D Conv_3),是指第一个第三二维卷积层(3D Conv_3)输入的多帧目标图像数据为原始的目标图像数据或上一个三维阶段输出的多帧目标图像数据,非第一个第三三维卷积层(3D Conv_3)输入的多帧目标图像数据为上一个第三三维卷积层(3D Conv_3)输出的多帧目标图像数据,最后一个第三三维卷积层(3D Conv_3)输出的多帧目标图像数据为整个三维残差块输出的多帧目标图像数据。
以下通过示例来说明本申请实施例中的三维特征提取网络。
在本示例中,三维特征提取网络按照顺序被划分为三个三维阶段(stage):
在第一个三维阶段(stage_1)中设置有一个三维残差块。
在第二个三维阶段(stage_2)中依次设置有一个三维投影块、一个三维残差块。
在第三个三维阶段(stage_3)中依次设置有一个三维投影块、一个三维残差块。
如图7所示,在每个三维投影块中均设置有两个第一三维卷积层(3D Conv_1)、一个第二三维卷积层(3D Conv_2),第一三维卷积层(3D Conv_1)的卷积核依次为3×3×3、3×3×3,第二三维卷积层(3D Conv_2)的卷积核为3×3×3。
如图8所示,在每个三维残差块中均设置有两个第三三维卷积层(3D Conv_3),第三三维卷积层(3D Conv_3)的卷积核依次为3×3×3、3×3×3。
上述三维特征提取网络只是作为示例,在实施本申请实施例时,可以根据实际情况设置其它三维特征提取网络,例如,为了通过精确度,在每三维投影块中设置有三个第一三维卷积层、一个第二三维卷积层,和/或,在每个三维残差块中设置有三个第三三维卷积层,等等,本申请实施例对此不加以限制。另外,除了上述三维特征提取网络外,本领域技术人员还可以根据实际需要采用其它三维特征提取网络,本申请实施例对此也不加以限制。
步骤1033、将全局空间特征与全局时序特征拼接为全局目标特征。
如图2所示,多帧目标图像数据经过二维特征提取模型的处理,可输出多维(如1024维)的全局空间特征,多帧目标图像数据三维特征提取模型的处理,可输出多维(如512维)的全局时序特征,全局空间特征与全局时序特征这两组特征可通过特征拼接器223进行拼接,可得到多维(1536维)的全局目标特征。
步骤1034、将全局目标特征映射为预设的动作,获得在视频数据中出现的全局动作。
在本实施例中,如图2所示,可设置线性的全局分类器(如全连接层)224,该全局分类器224可进行全局的动作分类。
将全局目标特征输入到该全局分类器224中,可将全局目标特征映射为预设的动作,从而在全局的维度下、在视频数据中出现的动作,为便于区分,该动作记为全局动作。
步骤104、根据多帧目标图像数据在局部的特征,识别在视频数据中出现的 动作,获得局部动作。
如果全局动作识别网络可以较为高效地识别全局动作,但其无法有效利用局部动作信息,无法有效识别不同时序粒度的动作,则可以在全局动作识别网络的基础上添加一个轻量化的局部动作识别网络,以便更好地利用视频数据中的局部动作信息,并和全局动作识别网络形成互补,以提高动作识别模型整体的动作识别能力。
局部动作识别网络可以通过数据驱动的方式进行学习,从而利用与该局部工作相匹配的时序卷积分支提取的信息来完成局部动作识别。
此时,采样网络与全局动作识别网络配套,为降低在局部动作识别网络的额外计算开销,可以在采样网络的基础上构建局部动作识别网络,复用采样网络输出的多帧目标图像数据,可以有效控制局部动作识别分支带来的额外计算开销,使整个动作识别模型的计算开销仍然保持在较低水平。
针对多帧目标图像数据,可输入局部动作识别网络,局部动作识别网络具有局部建模的能力,预测局部视频数据可能出现的动作,作为局部动作。
在本申请的一个实施例中,步骤104可以包括如下步骤:
步骤1041、提取在局部目标图像数据中表征运动的特征,作为局部运动特征。
在本实施例中,如图2所示,局部动作识别模式的初始是局部时序建模,这一部分的输入是采样网络(如二维卷积网络)210对多帧原始图像数据202提取出的特征,即目标图像数据f^{local-2D}_t,其中,t∈{1,2,…,K}表示帧序号,K为采样的总帧数。
目标图像数据不包含时序信息,局部动作的时序信息主要为局部运动特征,因此,如图2所示,可在多帧目标图像数据的局部中提取表征运动的特征,作为局部运动特征231。
一般情况下,可以使用光流(图像数据中的每一个像素值表示对应空间位置上的像素从当前帧到下一帧的位移)等形式表示局部运动特征,但光流具有较大的计算和存储开销,因此,难以应用在大规模的视频动作识别中。
为了实现高效的局部运动特征建模,本实施例使用目标图像数据上的帧差特征来表达局部运动特征。
如图9所示,在时间的维度上,计算任意相邻的两帧目标图像数据901之间的差值902,作为局部运动特征903。
对于第t帧,计算每相邻两帧目标图像数据的差值d^{local-2D}_t = f^{local-2D}_t − f^{local-2D}_{t-1},获得帧差特征,这一帧差特征表示相邻两帧的变化,从而可以反映相邻两帧目标图像数据之间的局部运动信息。
此外,考虑到帧差特征受噪声影响较大,如图9所示,可对局部运动特征进行平滑操作,如,使用逐通道(channel-wise)、卷积核为3x3的空间卷积进行平滑操作,从而降低局部运动特征的噪声。
步骤1042、在多个尺度下对局部运动特征进行时序卷积操作,获得局部时序特征。
单一的局部运动特征可能不足用于识别动作,因此,如图2所示,本实施例可在时间维度上对局部运动特征执行多尺度时序卷积232,从而在局部时序建模,获得局部时序特征,通过在局部动作识别分支中使用多尺度的时序卷积232来学习不同时序粒度的局部动作,可提高动作识别模型对不同局部动作的识别能力。
如图9所示,所谓多个尺度,可以指多个大小不同的卷积核9051,如大小为3、5、7的卷积核,实现一组时间维度上的一维时序卷积进行多尺度的局部时序建模。
所谓时序卷积,可以指在时间维度T上进行卷积的卷积操作。
可以以并行的方式使用多个卷积核9051对局部运动特征沿时间维度进行卷积操作,获得多个帧间时序特征。
将局部运动特征与多个帧间时序特征逐元素(element-wise)相加9052,得到不同时间尺度的特征,作为局部时序特征。
除了使用多个具有不同卷积核大小的时序卷积进行多尺度时序卷积之外,还可以采用其他方式进行多尺度时序卷积,例如,可以使用多个不同池化窗口大小的最大值池化或平均值池化层进行多尺度时序卷积,等等,本实施例对此不加以限制。
步骤1043、将局部时序特征映射为预设的动作,获得在单帧目标图像数据中出现的单体动作。
在本实施例中,如图2所示,可设置线性的局部分类器(如全连接层)233,该局部分类器233可进行局部的动作分类。
将局部时序特征输入到该局部分类器233中,可将局部时序特征映射为预设的动作,从而预测在单帧目标图像数据中出现的动作,为便于区分,该动作记为单体动作。
步骤1044、将所有单体动作融合为在视频数据中出现的局部动作。
对于给定的视频数据,在其中任意位置和任意时间长度的一个片段中包含作为目标的动作时,便可认为该视频整体上包含作为目标的动作,因此,如图2所示,不同片段的动作识别结果(即单体动作)通过池化操作234进行整合,可以得到在全局的维度下、在整个视频数据中出现的动作。
以最大值池化操作作为池化操作的示例,在本示例中,可统计每个动作类型下单体动作的数量,选择数量最大的单体动作为在视频数据中出现的局部动作。
步骤105、将全局动作与局部动作融合为在视频数据中出现的目标动作。
在本实施例中,如图2所示,若在全局维度下预测视频数据出现的全局动作、以及在局部维度下预测视频数据出现的局部动作,则可以在局部动作识别网络230的末尾添加一个融合层,将全局动作与局部动作进行融合240,从而综合预测视频数据中出现的动作,为便于区分,该动作记为目标动作。
在实现中,一方面,确定全局动作的概率,作为全局概率,另一方面,确定局部动作的概率,作为局部概率。
将每个全局动作与每个局部动作进行比较。
若全局动作与局部动作相同,则将全局动作与局部动作设置为在视频数据中出现的目标动作,以及,基于全局概率与局部概率计算目标动作的目标概率,其中,目标概率与全局概率、以及局部概率均正相关,即全局概率越大、局部概率越大,则目标概率越大,全局概率越小、局部概率越小,则目标概率越小。
在一个示例中,由于局部动作识别网络和全局动作识别网络分别利用的是局部动作信息和全局动作信息进行动作识别,因此,二者具有较强的互补性。
则在本示例中,全局动作的全局概率为p_global、局部动作的局部概率为p_local,两者近似且相互独立,则可以计算第一差值与第二差值之间的乘积,作为反相概率,第一差值表示一减去全局概率,第二差值表示一减去局部概率。
将一减去反相概率,获得目标动作的目标概率,则目标概率P表示如下:
P = 1 − (1 − p_global)(1 − p_local)
在本示例中,基于近似独立性的假设对全局动作识别分支和局部动作识别分支的预测结果(即全局动作、局部动作)进行融合,从而能够更好地利用全局动作识别分支和局部动作识别分支的互补性,增强动作识别模型整体识别动作的准确率。
上述计算目标概率的方式只是作为示例,在实施本申请实施例时,可以根据实际情况设置其它计算目标概率的方式,例如,将全局概率与局部概率通过 相乘、加权之后相加计算目标概率,等等,本申请实施例对此不加以限制。另外,除了上述计算目标概率的方式外,本领域技术人员还可以根据实际需要采用其它计算目标概率的方式,本申请实施例对此也不加以限制。
目标概率越高,视频数据越有可能含有该目标动作,目标概率越低,视频数据越有可能不含有该目标动作,本实施例可视目标概率的情况对视频数据标注该目标动作的标签,后者,推送技术人员进行人工审核,本实施例对此不加以限制。
在本实施例中,接收视频数据,视频数据中具有多帧原始图像数据,对多帧原始图像数据进行采样,获得多帧目标图像数据,根据多帧目标图像数据在全局的特征,识别在视频数据中出现的动作,获得全局动作,根据多帧目标图像数据在局部的特征,识别在视频数据中出现的动作,获得局部动作,将全局动作与局部动作融合为在视频数据中出现的目标动作,针对局部动作识别分支和全局动作识别分支执行一次采样操作,即局部动作识别分支和全局动作识别分支复用同一个特征,在保持视频数据的主要特征的情况下,可降低视频数据的数据量,降低识别动作的计算量,通过使用局部动作识别分支和全局动作识别分支分别进行视频数据进行动作建模和识别动作,避免只关注局部动作信息或全局动作信息的缺陷,提高了识别动作的灵活性,通过融合局部动作和全局动作预测视频数据的动作,提高了识别多种不同视频数据的精确度。
实施例二
图10为本申请实施例二提供的一种动作识别模型的训练方法的流程图,本实施例可适用于基于全局和局部对视频数据识别动作的情况,该方法可以由动作识别模型的训练装置来执行,该动作识别模型的训练装置可以由软件和/或硬件实现,可配置在计算机设备中,例如,服务器、工作站、个人电脑,等等,包括如下步骤:
步骤1001、确定动作识别模型。
在本实施例中,可预先构建动作识别模型,动作识别模型可以使用MXNet(一款设计为效率和灵活性的深度学习框架)作为底层支持库实现,动作识别模型可利用四张显卡完成训练。
在实现中,动作识别模型包括如下结构:
1、采样网络
采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据。
采样网络还用于对多帧原始图像数据执行二维卷积操作,获得多帧目标图像数据。
2、全局动作识别网络
全局动作识别网络用于根据多帧目标图像数据在全局的特征,识别在视频数据中出现的动作,获得全局动作。
全局动作识别网络包括如下结构:
2.1、二维特征提取网络
二维特征提取网络用于对多帧目标图像数据提取二维下的特征,获得全局空间特征。
在一些设计中,二维特征提取网络按照顺序划分为多个二维阶段,在每个二维阶段中依次设置有二维投影块、以及多个二维残差块。
二维投影块用于对多帧目标图像数据执行二维卷积操作。
二维残差块用于对多帧目标图像数据执行二维卷积操作,最后一个二维残差块输出的多帧目标图像数据为全局空间特征。
在部分情况下,全局动作识别网络还包括二维池化层,该二维池化层级联在二维特征提取网络之后,用于对多帧目标图像数据执行空间上的全局池化操作(如全局池化平均操作),作为全局空间特征。
示例性地,二维投影块设置有第一二维卷积层、以及多个第二二维卷积层;第一二维卷积层用于对多帧目标图像数据执行二维卷积操作;第二二维卷积层用于对多帧目标图像数据执行二维卷积操作;第一二维卷积层输出的多帧目标图像数据与第二二维卷积层输出的多帧目标图像数据合并为二维投影块输出的多帧目标图像数据。
示例性地,二维残差块设置有多个第三二维卷积层。
第三二维卷积层用于对多帧目标图像数据执行二维卷积操作。
2.2、三维特征提取网络
三维特征提取网络用于对多帧目标图像数据提取三维下的特征,获得全局时序特征。
在部分情况下,全局动作识别网络还包括三维池化层,该三维池化层级联在三维特征提取网络之后,对多帧目标图像数据执行时序上的全局池化操作(如全局池化平均操作),作为全局时序特征。
在一些设计中,三维特征提取网络按照顺序划分为多个三维阶段,第一个 三维阶段中设置一个或多个三维残差块,其他三维阶段中依次设置有三维投影块、一个或多个三维残差块。
三维投影块用于对多帧目标图像数据执行三维卷积操作;三维残差块用于对多帧目标图像数据执行三维卷积操作,最后一个三维残差块输出的多帧目标图像数据为全局时序特征。
示例性地,三维投影块设置有多个第一三维卷积层、以及第二三维卷积层;
第一三维卷积层用于对多帧目标图像数据执行三维卷积操作;第二三维卷积层用于对多帧目标图像数据执行三维卷积操作;第一三维卷积层输出的多帧目标图像数据与第二三维卷积层输出的多帧目标图像数据合并为三维投影块输出的多帧目标图像数据。
示例性地,三维残差块设置有多个第三三维卷积层。
第三三维卷积层用于对多帧目标图像数据执行三维卷积操作。
2.3、特征拼接器
特征拼接器,用于将全局空间特征与全局时序特征拼接为全局目标特征。
2.4、全局分类器
全局分类器,用于将全局目标特征映射为预设的动作,获得在视频数据中出现的全局动作。
3、局部动作识别网络
局部动作识别网络用于根据多帧目标图像数据在局部的特征,识别在视频数据中出现的动作,获得局部动作。
局部动作识别网络包括如下结构:
3.1、运动特征提取网络
运动特征提取网络用于提取在多帧目标图像数据的局部中表征运动的特征,作为局部运动特征。
示例性地,运动特征提取网络还用于计算任意相邻的两帧目标图像数据之间的差值,作为局部运动特征。
在一些实施方式中,局部动作识别网络还包括如下结构:
平滑层,级联在运动特征提取网络之后,用于对局部运动特征进行平滑操作。
3.2、时序特征提取网络
时序特征提取网络,用于在多个尺度下对局部运动特征进行时序卷积操作,获得局部时序特征。
在实现中,时序特征提取网络包括:
多个时序卷积层,多个时序卷积层设置有大小不同的卷积核;每个时序卷积层用于使用指定卷积核对局部运动特征沿时间维度进行卷积操作,获得多个帧间时序特征。
特征融合层,用于将局部运动特征与多个帧间时序特征相加,得到局部时序特征。
3.3、局部分类器
局部分类器用于将局部时序特征映射为预设的动作,获得在单帧目标图像数据中出现的单体动作。
3.4、全局池化层
全局池化层用于将所有单体动作融合为在视频数据中出现的局部动作。
在实现中,全局池化层还用于统计每个动作类型下单体动作的数量,选择数量最大的单体动作为在视频数据中出现的局部动作。
在本实施例中,全局动作与局部动作用于融合为在视频数据中出现的目标动作。
在实现中,可确定全局动作的概率,作为全局概率,确定局部动作的概率,作为局部概率,若全局动作与局部动作相同,则将全局动作与局部动作设置为在视频数据中出现的目标动作,基于全局概率与局部概率计算目标动作的目标概率,目标概率与全局概率、局部概率均正相关。
示例性地,动作融合层还用于计算第一差值与第二差值之间的乘积,作为反相概率,第一差值表示一减去全局概率,第二差值表示一减去局部概率;将一减去反相概率,获得目标动作的目标概率。
在本申请实施例中,由于动作识别模型的结构及其应用与实施例一的应用基本相似,所以描述的比较简单,相关之处参见实施例一的部分说明即可,本申请实施例在此不加以详述。
在训练时,可对作为样本的视频数据根据业务的数据需求,采取不同的数据增强方案进行数据增强,例如,随机缩放裁剪、随机运动模糊、随机翻转,等等。本实施例对此不加以限制。
步骤1002、计算全局动作识别网络在识别全局动作时的全局损失值。
在训练动作识别模型时,对于全局动作识别网络,可以使用预设的损失函数计算其在识别全局动作时的损失值,作为全局损失值。
示例性地,损失函数可以为交叉熵损失,即一种用于分类任务的损失函数,其目标是使全局动作识别网络预测出的全局动作的概率分布与标注给定的正确全局动作的分布之间的差异最小化。
步骤1003、计算局部动作识别网络在识别局部动作时的局部损失值。
在训练动作识别模型时,对于局部动作识别网络,可以使用预设的损失函数计算其在识别局部动作时的损失值,作为局部损失值。
示例性地,损失函数可以为交叉熵损失,即一种用于分类任务的损失函数,其目标是使局部动作识别网络预测出的局部动作的概率分布与标注给定的正确局部动作的分布之间的差异最小化。
在视频数据的内容审核的业务场景中,一个视频数据中经常只有一部分片段包含作为目标的动作,其余片段则包含无关的背景内容。而标注时通常只给出视频级别的动作,即整个视频数据中是否包含作为目标的动作,而没有视频数据中哪个片段包含作为目标的动作的标注。
为此,在本实施例中,从多示例学习的角度将视频数据看成是一个样本包,将每一个局部片段(即每帧原始图像数据)看成一个示例。
在一个示例中具有作为目标的动作时,该示例为正样本示例,正样本的视频数据中的多个局部片段构成多示例学习中的一个正样本包,该正样本包中至少包含一个正样本示例。
在示例中均不具有作为目标的动作时,该示例为负样本示例,负样本的视频数据中的多个局部片段构成多示例学习中的一个负样本包,该负样本包中的所有示例均为负样本示例。
多示例学习是在样本包而非示例的级别上进行模型训练,调用局部网络对视频数据进行处理,从而确定在视频数据出现的动作,作为参考动作,以视频数据作为样本包、每帧原始图像数据作为示例,在样本包中取概率最高的动作,作为样本包的局部动作。
使用预设的损失函数计算参考动作与局部动作之间的差异,作为局部动作识别网络在识别局部动作时的局部损失值。
本实施例使用多示例学习的方法对局部动作识别网络进行训练,保证局部动作识别网络得到有效的训练,可解决实际业务场景中目标动作经常只出现在视频数据中的部分片段的问题。
步骤1004、根据全局损失值与局部损失值更新采样网络、全局动作识别网络与局部动作识别网络。
动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)可以看作是一种函数映射,而动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)的训练过程是一个函数优化求解的过程。优化求解的目标就是不断更新该动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)所包含的参数,将已标注的样本作为输入的数据,经过动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)的计算,输出的预测值和标注之间的损失值最小。
动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)训练的过程就是参数更新的过程:计算损失函数在当前参数的梯度方向,然后根据损失值和学习速率,计算参数的更新幅度,在梯度相反方向更新参数。
假设动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)的参数表示为w,损失函数为f,则损失函数在第t个时刻时的参数梯度g t可以表示为:
g_t = ∇_w f(w_{t-1})
其中,
∇_w f(w_{t-1})
指动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)中的一层(参数为w)在优化第t-1个时刻时的梯度,也可以通指整个动作识别模型(包括采样网络、全局动作识别网络与局部动作识别网络)在第t-1个时刻时所有层的梯度。
因此,学习率为a时,第t个时刻参数的更新幅度可以表示为:
Δw_t = -a_t·g_t
第t+1个时刻时的更新可以表示为:
w_{t+1} = w_t + Δw_t
对于全局动作识别网络而言,可基于全局损失值计算全局动作识别网络中的梯度,作为全局梯度,从而应用全局梯度对全局动作识别网络进行梯度下降,以更新全局动作识别网络中的参数。
对于局部动作识别网络而言,可基于局部损失值计算局部动作识别网络中的梯度,作为局部梯度,从而应用局部梯度对局部动作识别网络进行梯度下降,以更新局部动作识别网络中的参数。
对于采样网络而言,可将全局梯度与局部梯度结合(即向量相加)为相交梯度,从而应用相交梯度对采样网络进行梯度下降,以更新采样网络的参数。
在训练全局动作识别网络、局部动作识别网络、采样网络的过程中,可使用一些非启发式的优化算法来提高梯度下降的收敛速度,以及优化性能。
在本实施例中,确定动作识别模型,动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,全局动作识别网络用于根据多帧目标图像数据在全局的特征,识别在视频数据中出现的动作,获得全局动作,局部动作识别网络用于根据多帧目标图像数据在局部的特征,识别在视频数据中出现的动作,获得局部动作,全局动作与局部动作用于融合为在视频数据中出现的目标动作,计算全局动作识别网络在识别全局动作时的全局损失值,计算局部动作识别网络在识别局部动作时的局部损失值,根据全局损失值与局部损失值更新采样网络、全局动作识别网络与局部动作识别网络,在动作识别模型中针对局部动作识别网络和全局动作识别网络执行一次采样操作,即局部动作识别网络和全局动作识别网络复用同一个特征,在保持视频数据的主要特征的情况下,可降低视频数据的数据量,降低识别动作的计算量,通过使用局部动作识别网络和全局动作识别网络分别进行视频数据进行动作建模和识别动作,避免只关注局部动作信息或全局动作信息的缺陷,提高了识别动作的灵活性,通过融合局部动作和全局动作预测视频数据的动作,提高了识别多种不同视频数据的精确度。
此外,结合全局动作识别网络和局部动作识别网络的损失值对动作识别模型进行联合训练,从而使全局动作识别网络和局部动作识别网络更好地共享动作识别模型的采样网络,并取得更好的整体性能。
对于方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,一些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例中所涉及的动作并不一定是本申请实施例所必须的。
实施例三
图11为本申请实施例三提供的一种动作识别装置的结构框图,可以包括如下模块:
视频数据接收模块1101,设置为接收视频数据,所述视频数据中具有多帧原始图像数据;采样模块1102,设置为对所述原始图像数据进行采样,获得多 帧目标图像数据;全局动作识别模块1103,设置为根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作;局部动作识别模块1104,设置为根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作;目标动作融合模块1105,设置为将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作。
在本申请的一个实施例中,所述全局动作识别模块1103包括:
全局空间特征提取模块,设置为对所述多帧目标图像数据提取二维下的特征,获得全局空间特征;全局时序特征提取模块,设置为对所述多帧目标图像数据提取三维下的特征,获得全局时序特征;全局目标特征拼接模块,设置为将所述全局空间特征与所述全局时序特征拼接为全局目标特征;全局目标特征映射模块,设置为将所述全局目标特征映射为预设的动作,获得在所述视频数据中出现的全局动作。
在本申请的一个实施例中,将提取全局空间特征的二维特征提取网路按照顺序划分为多个二维阶段,在每个二维阶段中依次设置有二维投影块、以及多个二维残差块;所述全局空间特征提取模块还设置为:
在当前二维阶段中,调用所述二维投影块对所述多帧目标图像数据执行二维卷积操作,依次调用所述多个二维残差块对所述多帧目标图像数据执行二维卷积操作;判断是否已遍历所有二维阶段;若已遍历所有二维阶段,则将执行二维卷积操作后的所述多帧目标图像数据输出为全局空间特征;若未遍历所述所有二维阶段,则将执行二维卷积操作后的所述多帧目标图像数据输出至下一个所述二维阶段,返回执行所述在当前二维阶段中,调用所述二维投影块对所述多帧目标图像数据执行二维卷积操作,依次调用所述多个二维残差块对所述多帧目标图像数据执行二维卷积操作。
在本申请的一个实施例中,所述二维投影块设置有第一二维卷积层、以及多个第二二维卷积层;所述全局空间特征提取模块还设置为:
调用所述第一二维卷积层对所述多帧目标图像数据执行二维卷积操作;依次调用所述多个第二二维卷积层对所述多帧目标图像数据执行二维卷积操作;对所述第一二维卷积层输出的多帧目标图像数据与所述第二二维卷积层输出的多帧目标图像数据进行合并。
在本申请的一个实施例中,所述二维残差块设置有多个第三二维卷积层;所述全局空间特征提取模块还设置为:
在每个二维残差块中,依次调用所述多个第三二维卷积层对所述目标图像数据执行二维卷积操作。
在本申请的一个实施例中,将提取全局时序特征的三维特征提取网络按照顺序划分为多个三维阶段,第一个三维阶段中设置一个或多个三维残差块,其他每个三维阶段中依次设置有三维投影块、以及一个或多个三维残差块;所述全局时序特征提取模块还设置为:
在当前三维阶段中,调用所述三维投影块对所述多帧目标图像数据执行三维卷积操作,和/或,调用所述三维残差块对所述多帧目标图像数据执行三维卷积操作;判断是否已遍历所有三维阶段;若已遍历所有三维阶段,则将执行三维卷积操作后的所述多帧目标图像数据输出为全局时序特征;若未遍历所有三维阶段,则将执行三维卷积操作后的所述多帧目标图像数据输出至下一个所述三维阶段,返回执行所述在当前三维阶段中,调用所述三维投影块对所述多帧目标图像数据执行三维卷积操作,和/或,依次调用多个所述三维残差块对所述多帧目标图像数据执行三维卷积操作。
在本申请的一个实施例中,所述三维投影块设置有多个第一三维卷积层、以及第二三维卷积层;所述全局时序特征提取模块还设置为:
依次调用所述多个第一三维卷积层对所述多帧目标图像数据执行三维卷积操作;调用所述第二三维卷积层对所述多帧目标图像数据执行三维卷积操作;对所述第一三维卷积层输出的多帧目标图像数据与所述第二三维卷积层输出的多帧目标图像数据进行合并。
在本申请的一个实施例中,所述三维残差块设置有多个第三三维卷积层;所述全局时序特征提取模块还设置为:
在每个三维残差块中,依次调用所述多个第三三维卷积层对所述多帧目标图像数据执行三维卷积操作。
在本申请的一个实施例中,所述局部动作识别模块1104包括:
局部运动特征提取模块,设置为提取在所述多帧目标图像数据的局部中表征运动的特征,作为局部运动特征;局部时序特征生成模块,设置为在多个尺度下对所述局部运动特征进行时序卷积操作,获得局部时序特征;局部时序特征映射模块,设置为将所述局部时序特征映射为预设的动作,获得在单帧目标图像数据中出现的单体动作;单体动作融合模块,设置为将所有单体动作融合为在所述视频数据中出现的局部动作。
在本申请的一个实施例中,所述局部运动特征提取模块包括:
帧间差计算模块,设置为计算任意相邻的两帧目标图像数据之间的差值,作为局部运动特征。
在本申请的一个实施例中,所述局部时序特征生成模块包括:
多时序卷积模块,设置为使用多个卷积核对所述局部运动特征沿时间维度进行卷积操作,获得多个帧间时序特征;特征融合模块,设置为将所述局部运动特征与所述多个帧间时序特征相加,得到局部时序特征。
在本申请的一个实施例中,所述单体动作融合模块包括:
数量统计模块,设置为统计每个动作类型下单体动作的数量;局部动作选择模块,设置为选择数量最大的单体动作为在所述视频数据中出现的局部动作。
在本申请的一个实施例中,所述局部动作识别模块1104还包括:
平滑操作模块,设置为对所述局部运动特征进行平滑操作。
在本申请的一个实施例中,所述目标动作融合模块1105包括:
全局概率确定模块,设置为确定所述全局动作的概率,作为全局概率;局部概率确定模块,设置为确定所述局部动作的概率,作为局部概率;目标动作确定模块,设置为若所述全局动作与所述局部动作相同,则将所述全局动作与所述局部动作设置为在所述视频数据中出现的目标动作;目标概率计算模块,设置为基于所述全局概率与所述局部概率计算所述目标动作的目标概率,所述目标概率与所述全局概率、所述局部概率均正相关。
在本申请的一个实施例中,所述目标概率计算模块包括:
反相概率计算模块,设置为计算第一差值与第二差值之间的乘积,作为反相概率,所述第一差值表示一减去所述全局概率,所述第二差值表示一减去所述局部概率;反相概率相减模块,设置为将一减去所述反相概率,获得所述目标动作的目标概率。
本申请实施例所提供的动作识别装置可执行本申请任意实施例所提供的动作识别方法,具备执行方法相应的功能模块和效果。
实施例四
图12为本申请实施例四提供的一种动作识别模型的训练装置的结构框图,可以包括如下模块:
动作识别模型确定模块1201,设置为确定动作识别模型,所述动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;所述采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,所述全局动作识别网络用于根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,所述局部动作识别网络用于根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动 作,所述全局动作与所述局部动作用于融合为在所述视频数据中出现的目标动作;全局损失值计算模块1202,设置为计算所述全局动作识别网络在识别所述全局动作时的全局损失值;局部损失值计算模块1203,设置为计算所述局部动作识别网络在识别所述局部动作时的局部损失值;动作识别模型更新模块1204,设置为根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作识别网络与所述局部动作识别网络。
在本申请的一个实施例中,所述局部损失值计算模块1203包括:
参考动作确定模块,设置为确定在所述视频数据出现的动作,作为参考动作;局部动作确定模块,设置为以所述视频数据作为样本包、每帧所述原始图像数据作为示例,在所述样本包中取概率最高的动作,作为所述样本包的局部动作;动作差异计算模块,设置为计算所述参考动作与所述局部动作之间的差异,作为所述局部动作识别网络在识别所述局部动作时的局部损失值。
在本申请的一个实施例中,所述动作识别模型更新模块1204包括:
全局梯度计算模块,设置为基于所述全局损失值计算所述全局动作识别网络中的梯度,作为全局梯度;局部梯度计算模块,设置为基于所述局部损失值计算局部动作识别网络中的梯度,作为局部梯度;相交梯度计算模块,设置为将所述全局梯度与所述局部梯度结合为相交梯度;全局动作识别网络更新模块,设置为应用所述全局梯度对所述全局动作识别网络进行梯度下降,以更新所述全局动作识别网络;局部动作识别网络更新模块,设置为应用所述局部梯度对所述局部动作识别网络进行梯度下降,以更新所述局部动作识别网络;采样网络更新模块,设置为应用所述相交梯度对所述采样网络进行梯度下降,以更新所述采样网络。
本申请实施例所提供的动作识别模型的训练装置可执行本申请任意实施例所提供的动作识别模型的训练方法,具备执行方法相应的功能模块和效果。
实施例五
图13为本申请实施例五提供的一种计算机设备的结构示意图。图13示出了适于用来实现本申请实施方式的示例性计算机设备12的框图。图13显示的计算机设备12仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图13所示,计算机设备12以通用计算设备的形式表现。计算机设备12的组件可以包括但不限于:一个或者多个处理器或者处理单元16,系统存储器28,连接不同系统组件(包括系统存储器28和处理单元16)的总线18。
总线18表示几类总线结构中的一种或多种,包括存储器总线或者存储器控制器,外围总线,图形加速端口,处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说,这些体系结构包括但不限于工业标准体系结构(Industry Standard Architecture,ISA)总线,微通道体系结构(Micro Channel Architecture,MAC)总线,增强型ISA总线、视频电子标准协会(Video Electronics Standards Association,VESA)局域总线以及外围组件互连(Peripheral Component Interconnect,PCI)总线。
计算机设备12包括多种计算机系统可读介质。这些介质可以是任何能够被计算机设备12访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。
系统存储器28可以包括易失性存储器形式的计算机系统可读介质,例如随机存取存储器(Random Access Memory,RAM)30和/或高速缓存存储器32。计算机设备12可以包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例,存储系统34可以用于读写不可移动的、非易失性磁介质(图13未显示,通常称为“硬盘驱动器”)。尽管图13中未示出,可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器,以及对可移动非易失性光盘(例如只读光盘存储器(Compact Disc Read-Only Memory,CD-ROM),数字通用光盘只读存储器(Digital Versatile Disc Read-Only Memory,DVD-ROM)或者其它光介质)读写的光盘驱动器。在这些情况下,每个驱动器可以通过一个或者多个数据介质接口与总线18相连。存储器28可以包括至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块,这些程序模块被配置以执行本申请实施例的功能。
具有一组(至少一个)程序模块42的程序/实用工具40,可以存储在例如存储器28中,这样的程序模块42包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或一种组合中可能包括网络环境的实现。程序模块42通常执行本申请所描述的实施例中的功能和/或方法。
计算机设备12也可以与一个或多个外部设备14(例如键盘、指向设备、显示器24等)通信,还可与一个或者多个使得用户能与该计算机设备12交互的设备通信,和/或与使得该计算机设备12能与一个或多个其它计算设备进行通信的任何设备(例如网卡,调制解调器等等)通信。这种通信可以通过输入/输出(Input/Output,I/O)接口22进行。并且,计算机设备12还可以通过网络适配器20与一个或者多个网络(例如局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN)和/或公共网络,例如因特网)通信。如图所示, 网络适配器20通过总线18与计算机设备12的其它模块通信。应当明白,尽管图中未示出,可以结合计算机设备12使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
处理单元16通过运行存储在系统存储器28中的程序,从而执行多种功能应用以及数据处理,例如实现本申请实施例所提供的动作识别方法、动作识别模型的训练方法。
实施例六
本申请实施例六还提供一种计算机可读存储介质,计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现上述动作识别方法、动作识别模型的训练方法的多个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
计算机可读存储介质例如可以包括但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、RAM、ROM、可擦式可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM或闪存)、光纤、CD-ROM、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质或非暂态存储介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。

Claims (20)

  1. 一种动作识别方法,包括:
    接收视频数据,其中,所述视频数据中具有多帧原始图像数据;
    对所述多帧原始图像数据进行采样,获得多帧目标图像数据;
    根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作;
    根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作;
    将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作。
  2. 根据权利要求1所述的方法,其中,所述根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,包括:
    对所述多帧目标图像数据提取二维下的特征,获得全局空间特征;
    对所述多帧目标图像数据提取三维下的特征,获得全局时序特征;
    将所述全局空间特征与所述全局时序特征拼接为全局目标特征;
    将所述全局目标特征映射为预设的动作,获得在所述视频数据中出现的所述全局动作。
  3. 根据权利要求2所述的方法,其中,将提取所述全局空间特征的二维特征提取网路按照顺序划分为多个二维阶段,在每个二维阶段中依次设置有二维投影块、以及多个二维残差块;
    所述对所述多帧目标图像数据提取二维下的特征,获得全局空间特征,包括:
    在当前二维阶段中,调用所述二维投影块对所述多帧目标图像数据执行二维卷积操作,依次调用所述多个二维残差块对所述多帧目标图像数据执行二维卷积操作;
    判断是否已遍历所有二维阶段;
    响应于已遍历所述所有二维阶段的,将执行二维卷积操作后的所述多帧目标图像数据输出为全局空间特征;
    响应于未遍历所述所有二维阶段,将执行二维卷积操作后的所述多帧目标图像数据输出至下一个二维阶段,返回执行所述在当前二维阶段中,调用所述二维投影块对所述多帧目标图像数据执行二维卷积操作,依次调用所述多个二维残差块对所述多帧目标图像数据执行二维卷积操作。
  4. 根据权利要求3所述的方法,其中,
    所述二维投影块设置有第一二维卷积层、以及多个第二二维卷积层;
    所述调用所述二维投影块对所述多帧目标图像数据执行二维卷积操作,包括:
    调用所述第一二维卷积层对所述多帧目标图像数据执行二维卷积操作;
    依次调用所述多个第二二维卷积层对所述多帧目标图像数据执行二维卷积操作;
    对所述第一二维卷积层输出的多帧目标图像数据与所述多个第二二维卷积层输出的多帧目标图像数据进行合并;
    所述二维残差块设置有多个第三二维卷积层;
    所述依次调用所述多个二维残差块对所述目标图像数据执行二维卷积操作,包括:
    在每个二维残差块中,依次调用所述多个第三二维卷积层对所述目标图像数据执行二维卷积操作。
  5. 根据权利要求2所述的方法,其中,将提取所述全局时序特征的三维特征提取网络按照顺序划分为多个三维阶段,第一个三维阶段中设置至少一个三维残差块,其他每个三维阶段中依次设置有三维投影块、以及至少一个三维残差块;
    所述对所述多帧目标图像数据提取三维下的特征,获得全局时序特征,包括:
    在当前三维阶段中,调用所述三维投影块和所述至少一个三维残差块中的至少之一对所述多帧目标图像数据执行三维卷积操作;
    判断是否已遍历所有三维阶段;
    响应于已遍历所述所有三维阶段,将执行三维卷积操作后的所述多帧目标图像数据输出为全局时序特征;
    响应于未遍历所述所有三维阶段,将执行三维卷积操作后的所述多帧目标图像数据输出至下一个三维阶段,返回执行所述在当前三维阶段中,调用所述三维投影块和所述至少一个三维残差块中的至少之一对所述多帧目标图像数据执行三维卷积操作。
  6. 根据权利要求5所述的方法,其中,所述三维投影块设置有多个第一三维卷积层、以及第二三维卷积层;
    调用所述三维投影块对所述多帧目标图像数据执行三维卷积操作,包括:
    依次调用所述多个第一三维卷积层对所述多帧目标图像数据执行三维卷积操作;
    调用所述第二三维卷积层对所述多帧目标图像数据执行三维卷积操作;
    对所述第一三维卷积层输出的多帧目标图像数据与所述第二三维卷积层输出的多帧目标图像数据进行合并;
    所述三维残差块设置有多个第三三维卷积层;
    调用所述至少一个三维残差块对所述多帧目标图像数据执行三维卷积操作,包括:
    在每个三维残差块中,依次调用所述多个第三三维卷积层对所述多帧目标图像数据执行三维卷积操作。
  7. 根据权利要求1-6任一项所述的方法,其中,所述根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,包括:
    提取在所述多帧目标图像数据的局部中表征运动的特征,作为局部运动特征;
    在多个尺度下对所述局部运动特征进行时序卷积操作,获得局部时序特征;
    将所述局部时序特征映射为预设的动作,获得在单帧目标图像数据中出现的单体动作;
    将所有单体动作融合为在所述视频数据中出现的所述局部动作。
  8. 根据权利要求7所述的方法,其中,所述提取在所述多帧目标图像数据的局部中表征运动的特征,作为局部运动特征,包括:
    计算相邻的两帧目标图像数据之间的差值,作为所述局部运动特征。
  9. 根据权利要求7所述的方法,其中,所述在多个尺度下对所述局部运动特征进行时序卷积操作,获得局部时序特征,包括:
    使用多个卷积核对所述局部运动特征沿时间维度进行卷积操作,获得多个帧间时序特征;
    将所述局部运动特征与所述多个帧间时序特征相加,得到所述局部时序特征。
  10. 根据权利要求7所述的方法,其中,所述将所有单体动作融合为在所 述视频数据中出现的所述局部动作,包括:
    统计每个动作类型下单体动作的数量;
    选择数量最大的单体动作为在所述视频数据中出现的所述局部动作。
  11. 根据权利要求7所述的方法,其中,所述根据所述多个目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,还包括:
    对所述局部运动特征进行平滑操作。
  12. 根据权利要求1-6、7-11任一项所述的方法,其中,所述将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作,包括:
    确定所述全局动作的概率,作为全局概率;
    确定所述局部动作的概率,作为局部概率;
    在所述全局动作与所述局部动作相同的情况下,将所述全局动作与所述局部动作均设置为在所述视频数据中出现的所述目标动作;
    基于所述全局概率与所述局部概率计算所述目标动作的目标概率,其中,所述目标概率与所述全局概率、以及所述局部概率均正相关。
  13. 根据权利要求12所述的方法,其中,所述基于所述全局概率与所述局部概率计算所述目标动作的目标概率,包括:
    计算第一差值与第二差值之间的乘积,作为反相概率,其中,所述第一差值表示一减去所述全局概率,所述第二差值表示一减去所述局部概率;
    将一减去所述反相概率,获得所述目标动作的目标概率。
  14. 一种动作识别模型的训练方法,包括:
    确定动作识别模型,其中,所述动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;所述采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,所述全局动作识别网络用于根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,所述局部动作识别网络用于根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,所述全局动作与所述局部动作用于融合为在所述视频数据中出现的目标动作;
    计算所述全局动作识别网络在识别所述全局动作时的全局损失值;
    计算所述局部动作识别网络在识别所述局部动作时的局部损失值;
    根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作 识别网络与所述局部动作识别网络。
  15. 根据权利要求14所述的方法,其中,所述计算所述局部动作识别网络在识别所述局部动作时的局部损失值,包括:
    确定在所述视频数据出现的动作,作为参考动作;
    以所述视频数据作为样本包、每帧原始图像数据作为示例,在所述样本包中取概率最高的动作,作为所述样本包的局部动作;
    计算所述参考动作与所述局部动作之间的差异,作为所述局部动作识别网络在识别所述局部动作时的局部损失值。
  16. 根据权利要求14或15所述的方法,其中,所述根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作识别网络与所述局部动作识别网络,包括:
    基于所述全局损失值计算所述全局动作识别网络中的梯度,作为全局梯度;
    基于所述局部损失值计算局部动作识别网络中的梯度,作为局部梯度;
    将所述全局梯度与所述局部梯度结合为相交梯度;
    应用所述全局梯度对所述全局动作识别网络进行梯度下降,以更新所述全局动作识别网络;
    应用所述局部梯度对所述局部动作识别网络进行梯度下降,以更新所述局部动作识别网络;
    应用所述相交梯度对所述采样网络进行梯度下降,以更新所述采样网络。
  17. 一种动作识别装置,包括:
    视频数据接收模块,设置为接收视频数据,其中,所述视频数据中具有多帧原始图像数据;
    采样模块,设置为对所述多帧原始图像数据进行采样,获得多帧目标图像数据;
    全局动作识别模块,设置为根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作;
    局部动作识别模块,设置为根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作;
    目标动作融合模块,设置为将所述全局动作与所述局部动作融合为在所述视频数据中出现的目标动作。
  18. 一种动作识别模型的训练装置,包括:
    动作识别模型确定模块,设置为确定动作识别模型,其中,所述动作识别模型包括采样网络、全局动作识别网络、局部动作识别网络;所述采样网络用于对视频数据的多帧原始图像数据进行采样,获得多帧目标图像数据,所述全局动作识别网络用于根据所述多帧目标图像数据在全局的特征,识别在所述视频数据中出现的动作,获得全局动作,所述局部动作识别网络用于根据所述多帧目标图像数据在局部的特征,识别在所述视频数据中出现的动作,获得局部动作,所述全局动作与所述局部动作用于融合为在所述视频数据中出现的目标动作;
    全局损失值计算模块,设置为计算所述全局动作识别网络在识别所述全局动作时的全局损失值;
    局部损失值计算模块,设置为计算所述局部动作识别网络在识别所述局部动作时的局部损失值;
    动作识别模型更新模块,设置为根据所述全局损失值与所述局部损失值更新所述采样网络、所述全局动作识别网络与所述局部动作识别网络。
  19. 一种计算机设备,包括:
    至少一个处理器;
    存储器,设置为存储至少一个程序;
    当所述至少一个程序被所述至少一个处理器执行,使得所述至少一个处理器实现如权利要求1-13中任一项所述的动作识别方法或者如权利要求14-16中任一项所述的动作识别模型的训练方法。
  20. 一种计算机可读存储介质,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现如权利要求1-13中任一项所述的动作识别方法或者如权利要求14-16中任一项所述的动作识别模型的训练方法。
PCT/CN2022/071211 2021-01-15 2022-01-11 动作识别模型的训练方法及装置、动作识别方法及装置 WO2022152104A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110056978.XA CN112749666B (zh) 2021-01-15 2021-01-15 一种动作识别模型的训练及动作识别方法与相关装置
CN202110056978.X 2021-01-15

Publications (1)

Publication Number Publication Date
WO2022152104A1 true WO2022152104A1 (zh) 2022-07-21

Family

ID=75652157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071211 WO2022152104A1 (zh) 2021-01-15 2022-01-11 Method and device for training action recognition model, and action recognition method and device

Country Status (2)

Country Link
CN (1) CN112749666B (zh)
WO (1) WO2022152104A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749666B (zh) * 2021-01-15 2024-06-04 百果园技术(新加坡)有限公司 Action recognition model training and action recognition method, and related apparatus
CN113762121A (zh) * 2021-08-30 2021-12-07 北京金山云网络技术有限公司 Action recognition method and apparatus, electronic device, and storage medium
CN114241376A (zh) * 2021-12-15 2022-03-25 深圳先进技术研究院 Behavior recognition model training and behavior recognition method, apparatus, system and medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2720172A1 (en) * 2012-10-12 2014-04-16 Nederlandse Organisatie voor toegepast -natuurwetenschappelijk onderzoek TNO Video access system and method based on action type detection
CN107463949B (zh) * 2017-07-14 2020-02-21 北京协同创新研究院 Processing method and apparatus for video action classification
CN109919011A (zh) * 2019-01-28 2019-06-21 浙江工业大学 Action video recognition method based on multi-duration information
CN110084202B (zh) * 2019-04-29 2023-04-18 东南大学 Video behavior recognition method based on efficient three-dimensional convolution
CN110188653A (zh) * 2019-05-27 2019-08-30 东南大学 Behavior recognition method based on local feature aggregation coding and long short-term memory networks
CN110866509B (zh) * 2019-11-20 2023-04-28 腾讯科技(深圳)有限公司 Action recognition method and apparatus, computer storage medium, and computer device
CN110893277B (zh) * 2019-11-28 2021-05-28 腾讯科技(深圳)有限公司 Method, apparatus and storage medium for controlling a virtual object to interact with a projectile
CN111310676A (zh) * 2020-02-21 2020-06-19 重庆邮电大学 Video action recognition method based on CNN-LSTM and attention
CN111353452A (zh) * 2020-03-06 2020-06-30 国网湖南省电力有限公司 RGB image-based behavior recognition method, apparatus, medium and device
CN111598026B (zh) * 2020-05-20 2023-05-30 广州市百果园信息技术有限公司 Action recognition method, apparatus, device and storage medium
CN111914925B (zh) * 2020-07-28 2022-03-29 复旦大学 Deep learning-based multimodal patient behavior perception and analysis system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416288A (zh) * 2018-03-04 2018-08-17 南京理工大学 First-person interactive action recognition method based on global and local network fusion
CN110765854A (zh) * 2019-09-12 2020-02-07 昆明理工大学 Video action recognition method
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN111985343A (zh) * 2020-07-23 2020-11-24 深圳大学 Method for constructing a deep network model for behavior recognition, and behavior recognition method
CN112749666A (zh) * 2021-01-15 2021-05-04 百果园技术(新加坡)有限公司 Action recognition model training and action recognition method, and related apparatus

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (zh) * 2023-03-30 2023-04-28 中国科学技术大学 Weakly supervised action detection method, system, device and storage medium
CN116030538B (zh) * 2023-03-30 2023-06-16 中国科学技术大学 Weakly supervised action detection method, system, device and storage medium
CN116614666A (zh) * 2023-07-17 2023-08-18 微网优联科技(成都)有限公司 AI camera-based feature extraction system and method
CN116614666B (zh) * 2023-07-17 2023-10-20 微网优联科技(成都)有限公司 AI camera-based feature extraction system and method

Also Published As

Publication number Publication date
CN112749666A (zh) 2021-05-04
CN112749666B (zh) 2024-06-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22738984; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22738984; Country of ref document: EP; Kind code of ref document: A1)