US20220164569A1 - Action recognition method and apparatus based on spatio-temporal self-attention - Google Patents
- Publication number: US20220164569A1 (application US 17/512,544)
- Authority
- US
- United States
- Prior art keywords
- feature map
- action
- feature
- temporal
- processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/08—Learning methods
- G06K9/00335, G06K9/00362, G06K9/6232 (legacy classification codes)
Definitions
- the present disclosure relates to an action recognition method and apparatus and, more particularly, to a method and apparatus for recognizing a human action by using an action recognition neural network.
- Action recognition, which locates a person in a video and identifies what the person is doing, is a core technology in the field of computer vision that is widely used in various industries such as video surveillance, human-computer interaction, and autonomous driving.
- One of the most widely used approaches to recognizing human actions is object-detection-based recognition.
- However, action recognition requires discriminating among the complex and varied motions contained in videos, and it involves many complicated real-world problems that must be addressed.
- Provided are a method and apparatus for recognizing a human action by applying a self-attention mechanism to extract a feature map in the spatial axis domain and a feature map in the temporal axis domain, and recognizing the action using both feature maps.
- an action recognition method includes: acquiring video features for input videos; generating a bounding box surrounding a person who may be a target for an action recognition; pooling the video features based on bounding box information; extracting at least one spatial feature map from pooled video features; extracting at least one temporal feature map from pooled video features; concatenating the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and performing a human action recognition based on the concatenated feature map.
- Extracting of at least one spatial feature map may include a process of generating a feature map for a spatially fast action and a process of generating a feature map for a spatially slow action.
- Extracting of at least one temporal feature map may include a process of generating a feature map for a temporally fast action and a process of generating a feature map for a temporally slow action.
- Each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action may include: projecting the pooled video features into two new feature spaces; calculating a spatial attention map having components representing influences between spatial regions; and obtaining a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
- Each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action may further include: generating the spatial feature map by multiplying the spatial feature vector by a first scaling parameter and adding the video feature.
- Each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action may include: projecting the pooled video features into two new feature spaces; calculating a temporal attention map having components representing influences between temporal regions; and obtaining a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
- Each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action may further include: generating the temporal feature map by multiplying the temporal feature vector by a second scaling parameter and adding the video feature.
- an apparatus for recognizing a human action from videos includes: a processor and a memory storing program instructions to be executed by the processor.
- the program instructions cause the processor to acquire video features for input videos; generate a bounding box surrounding a person who may be a target for an action recognition; pool the video features based on bounding box information; extract at least one spatial feature map from pooled video features; extract at least one temporal feature map from pooled video features; concatenate the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and perform a human action recognition based on the concatenated feature map.
- the program instructions causing the processor to pool the video features may cause the processor to pool the video features through RoIAlign operations.
- the program instructions causing the processor to extract the at least one spatial feature map may include instructions causing the processor to: generate a feature map for a spatially fast action; and generate a feature map for a spatially slow action.
- the program instructions causing the processor to extract the at least one temporal feature map may include instructions causing the processor to: generate a feature map for a temporally fast action; and generate a feature map for a temporally slow action.
- Each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action may include instructions causing the processor to: project the pooled video features into two new feature spaces; calculate a spatial attention map having components representing influences between spatial regions; and obtain a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
- Each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action may further include instructions causing the processor to generate the spatial feature map by multiplying the spatial feature vector by a first scaling parameter and adding the video feature.
- Each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action may include instructions causing the processor to: project the pooled video features into two new feature spaces; calculate a temporal attention map having components representing influences between temporal regions; and obtain a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
- Each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action may further include instructions causing the processor to generate the temporal feature map by multiplying the temporal feature vector by a second scaling parameter and adding the video feature.
- According to the present disclosure, the self-attention mechanism recognizes the human action using both the spatial feature map and the temporal feature map.
- the human action may be recognized by taking into account a person's hand, face, surrounding objects, and other persons' features.
- Since the feature map is extracted by reflecting the features of both slow and fast actions, it is possible to properly distinguish differences in characteristic features according to the genders and ages of persons.
- Performance improvement was confirmed in 44 of the 60 evaluation items compared to a basic action recognition algorithm.
- the performance improvement may be achieved by a simple network structure.
- FIG. 1 is a block diagram showing an overall structure of a spatio-temporal self-attention network according to an exemplary embodiment of the present disclosure
- FIG. 2 is a block diagram of an action recognition apparatus according to an exemplary embodiment of the present disclosure
- FIG. 3 is a flowchart showing the action recognition method according to an exemplary embodiment of the present disclosure
- FIG. 4 is an illustration for explaining a process of generating the feature map for the spatially slow action
- FIG. 5 is an illustration for explaining a process of generating the feature map for the spatially fast action
- FIG. 6 is an illustration for explaining a process of generating the feature map for the temporally slow action
- FIG. 7 is an illustration for explaining a process of generating the feature map for the temporally fast action
- FIG. 8 is a table summarizing performance evaluation results of the action recognition method of the present disclosure and conventional methods performed using the AVA dataset.
- FIGS. 9A and 9B are graphs showing comparison results of the frame APs for the cases with and without the spatio-temporal self-attention mechanism according to the present disclosure.
- Terms such as “first” and “second” used for describing various components in this specification are intended to distinguish one component from another and are not intended to limit the components.
- a second component may be referred to as a first component and, similarly, a first component may also be referred to as a second component without departing from the scope of the present disclosure.
- the term “and/or” may include a presence of one or more of the associated listed items and any and all combinations of the listed items.
- When a component is referred to as being “connected” or “coupled” to another component, the component may be directly connected or coupled, logically or physically, to the other component or indirectly through an object therebetween. Contrarily, when a component is referred to as being “directly connected” or “directly coupled” to another component, it is to be understood that there is no intervening object between the components. Other words used to describe the relationship between elements should be interpreted in a similar fashion.
- The self-attention mechanism, which is more widely used than Recurrent Neural Networks (RNNs) in the field of natural language processing, shows good performance in the fields of machine translation and image captioning.
- the self-attention mechanism is expected to show noticeable performance improvements and expand its use in many other fields as well.
- a general self-attention mechanism performs a matrix operation on a key feature vector and a query feature vector to find relationships among three feature vectors (the key, the query, and a value), and extracts an attention map that takes long-range interactions into account through a softmax operation.
- the extracted attention map serves as an index for determining the relationship of each element with the other elements in the input data.
- the attention map is subjected to a matrix multiplication with the value feature vector so that the relationship is reflected.
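As an illustration of the mechanism just described, the following pure-Python sketch projects an input into key, query, and value features, forms the attention map with a row-wise softmax, and multiplies it with the values. The toy dimensions and identity weights are illustrative assumptions, not parameters from the patent:

```python
import math

def matmul(A, B):
    """Naive matrix multiplication of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Row-wise softmax, turning raw scores into an attention map."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def self_attention(x, Wk, Wq, Wv):
    """x: N x d input; Wk/Wq/Wv: d x d learned projection weights."""
    K, Q, V = matmul(x, Wk), matmul(x, Wq), matmul(x, Wv)
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q @ K^T
    attn = softmax_rows(scores)    # influence of element i on element j
    return matmul(attn, V), attn   # weighted values and the map itself

# toy example: two elements, two features, identity projections
I = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, I, I, I)
```

Each row of `attn` sums to one, and the diagonal dominates here because each element is most similar to itself.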
- the action recognition of the present disclosure applies the self-attention mechanism taking into account the long-range interaction to an action recognition problem, and uses temporal information along with spatial information when applying the self-attention mechanism to the video action recognition problem.
- FIG. 1 is a block diagram showing an overall structure of a spatio-temporal self-attention network according to an exemplary embodiment of the present disclosure.
- the spatio-temporal self-attention network shown in the drawing includes a backbone network 100, a bounding box generator 110, a region-of-interest (RoI) alignment unit 120, a spatial attention module 200, a temporal attention module 300, a concatenator 400, and a determination unit 420.
- the backbone network 100 receives a certain number of video frames as one video data unit and extracts features of the input videos.
- the video data unit may include 32 video frames, for example.
- the backbone network 100 may be implemented by a Residual Network (ResNet) or an Inflated 3D convolutional network (I3D) pre-trained on the Kinetics-400 dataset, for example.
- the bounding box generator 110 finds a location of a person in the video who may be a target of the action recognition, based on the input video features output by the backbone network 100 , and generates a bounding box surrounding the person. Also, the bounding box generator 110 may update a position and size of the bounding box by performing regression operations with reference to a feature map output by the concatenator 400 .
- the bounding box generator 110 may be implemented based on a Region Proposal Network (RPN) as used in Fast Region-based Convolutional Neural Networks (Fast R-CNN).
- the RoI alignment unit 120 may pool the video features from the backbone network 100 through RoIAlign operations with reference to the bounding box information from the bounding box generator 110.
- the spatial attention module 200 extracts a feature map for an area to be intensively considered on a spatial axis domain from the RoIAligned video features.
- the spatial attention module 200 may separately extract a spatial slow action self-attention feature map and a spatial fast action self-attention feature map. While conventional self-attention mechanisms were used to identify relationships between pixels in an image, an exemplary embodiment of the present disclosure uses the spatial self-attention mechanism to extract spatially significant regions from the video features.
- the spatial attention module 200 is pre-trained to focus on the video features (e.g., a hand or face) that are useful for determining the human action from the video features.
- the temporal attention module 300 extracts a feature map for an area to be intensively considered on a temporal axis domain from the RoIAligned video features.
- the temporal attention module 300 may separately extract a temporal slow action self-attention feature map and a temporal fast action self-attention feature map.
- the temporal attention module 300 may extract the feature vector useful for determining the human action from the video features when viewed from the temporal axis domain.
- the concatenator 400 may concatenate all the feature maps extracted by the spatial attention module 200 and the temporal attention module 300 to create a concatenated feature map, and the determining unit 420 performs the human action recognition based on the concatenated feature map.
- a binary (dichotomous) cross-entropy may be applied for each behavior class, so that an action is recognized as the human action if its determination value is higher than a threshold, according to an exemplary embodiment of the present disclosure.
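The per-behavior thresholding described above amounts to an independent yes/no decision for each action class. A minimal sketch, assuming sigmoid-activated per-class scores and an illustrative threshold of 0.5 (the patent does not specify these values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def recognize_actions(logits, threshold=0.5):
    """Multi-label decision: each behavior class is scored independently
    (matching a per-class binary cross-entropy training objective) and is
    reported whenever its score exceeds the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) > threshold]

# hypothetical per-class logits for, e.g., (stand, walk, talk)
detected = recognize_actions([2.0, -1.5, 0.3])  # classes 0 and 2 score above 0.5
```

Unlike a softmax classifier, this scheme can report several simultaneous actions (e.g., standing while talking), which suits the multi-label nature of the AVA evaluation mentioned later.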
- the spatial attention module 200 and the temporal attention module 300 will now be described in more detail.
- the spatial attention module 200 may receive the video features having a shape of C×T×H×W from the backbone network 100 through the RoI alignment unit 120, where C, T, H, and W denote the number of channels, the temporal duration, the height, and the width, respectively.
- the spatial attention module 200 transforms the video features into C×T first features and H×W second features.
- the data transformation may be performed by a separate member other than the spatial attention module 200.
- the data transformation may just mean a selective use of only some portion of the video features stored in the memory rather than an actual data manipulation.
- ‘C×T’ may represent a number of feature channels and temporal spaces
- ‘H×W’ may represent a number of spatial feature maps.
- the spatial attention module 200 projects the transformed video features x ∈ R^((C×T)×(H×W)) into two new feature spaces F and G according to Equation 1. This projection corresponds to a multiplication of a key matrix by a query matrix in the spatial axis domain.
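The transformation from a C×T×H×W tensor into a (C×T)-by-(H×W) matrix is a pure reshaping (consistent with the note above that no actual data manipulation is needed). A minimal pure-Python sketch, with nested lists standing in for the real tensor type:

```python
def flatten_ct_hw(x, C, T, H, W):
    """Reshape a C x T x H x W feature tensor (nested lists) into a
    (C*T) x (H*W) matrix: rows index channel-time pairs, columns index
    spatial positions. Only the view changes; no values are altered."""
    return [[x[c][t][h][w] for h in range(H) for w in range(W)]
            for c in range(C) for t in range(T)]

# tiny synthetic tensor whose entries encode their own indices
C, T, H, W = 2, 3, 4, 5
x = [[[[c * 1000 + t * 100 + h * 10 + w for w in range(W)]
       for h in range(H)] for t in range(T)] for c in range(C)]
m = flatten_ct_hw(x, C, T, H, W)  # 6 rows (C*T), 20 columns (H*W)
```

In this layout, the spatial attention map compares columns (spatial positions) while the temporal module described later compares along the channel-time axis instead.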
- the spatial attention module 200 may calculate a spatial attention map.
- Each component of the spatial attention map may be referred to as a spatial attention level β_{j,i} between regions, e.g., pixels, and may be calculated by Equation 2.
- the spatial attention level β_{j,i}, which is a Softmax function value, may represent the extent to which the model attends to the i-th region when synthesizing the j-th region.
- the spatial attention level may denote a degree of influence of the i-th region on the j-th region.
- the spatial attention module 200 may obtain a spatial feature vector by a matrix multiplication of the spatial attention map with the input data. That is, each component of the spatial feature vector may be expressed by Equation 3.
- the spatial feature vector may be constructed to reflect the weights by the multiplication of the spatial attention map by a value matrix.
- W_F, W_G, and W_h are learned weight parameters, which may be implemented by respective 3D vectors having shapes of 1×1×1.
- the spatial attention module 200 may output the spatial feature vector expressed by the Equation 3 as a spatial feature map.
- the spatial attention module 200 may calculate a separate spatial self-attention feature vector by multiplying the spatial feature vector by a scaling parameter and adding the initial input video feature, as shown in Equation 4, to output as the spatial feature map.
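The bodies of Equations 1 through 4 did not survive extraction. Based on the surrounding description and the standard self-attention formulation, they plausibly take the following form; this is a reconstruction with assumed notation, not the patent's verbatim equations:

```latex
% Eq. 1: key/query projections of the reshaped features x
F(x_i) = W_F\, x_i, \qquad G(x_j) = W_G\, x_j

% Eq. 2: spatial attention level (softmax over regions i)
\beta_{j,i} = \frac{\exp\!\big(F(x_i)^{\top} G(x_j)\big)}
                   {\sum_{i'} \exp\!\big(F(x_{i'})^{\top} G(x_j)\big)}

% Eq. 3: spatial feature vector via the value projection W_h
o_j = \sum_{i} \beta_{j,i}\, W_h\, x_i

% Eq. 4: scaled residual output with a learnable scaling parameter \gamma
y_j = \gamma\, o_j + x_j
```

The residual form of Equation 4 lets the module start from the unmodified backbone features (γ near zero) and gradually learn to mix in the attended features during training.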
- the temporal attention module 300 may receive the video features having a shape of C×T×H×W from the backbone network 100 through the RoI alignment unit 120, where C, T, H, and W denote the number of channels, the temporal duration, the height, and the width, respectively.
- the temporal attention module 300 transforms the video features into C×T first features and H×W second features.
- the temporal attention module 300 may receive the transformed video features from the spatial attention module 200.
- the data transformation may be performed by a separate member other than the spatial attention module 200 or the temporal attention module 300.
- the data transformation may just mean a selective use of only some portion of the video features stored in the memory rather than an actual data manipulation.
- ‘C×T’ may represent a number of feature channels and temporal spaces
- ‘H×W’ may represent a number of spatial feature maps.
- the temporal attention module 300 projects the transformed video features x ∈ R^((C×T)×(H×W)) into two new feature spaces K and L according to Equation 5. This projection corresponds to a multiplication of a key matrix by a query matrix in the time axis domain.
- the temporal attention module 300 may calculate a temporal attention map.
- Each component of the temporal attention map may be referred to as a temporal attention level α_{j,i} between regions, e.g., pixels, and may be calculated by Equation 6.
- the temporal attention level α_{j,i}, which is a Softmax function value, may represent the extent to which the model attends to the i-th region when synthesizing the j-th region.
- the temporal attention level α_{j,i} may denote a degree of influence of the i-th region on the j-th region.
- the temporal attention module 300 may obtain a temporal feature vector by a matrix multiplication of the temporal attention map with the input data. That is, each component of the temporal feature vector may be expressed by Equation 7.
- the temporal feature vector may be constructed to reflect the weights by the multiplication of the temporal attention map by a value matrix.
- W_K, W_L, and W_b are learned weight parameters, which may be implemented by respective 3D vectors having shapes of 1×1×1.
- the temporal attention module 300 may output the temporal feature vector expressed by the Equation 7 as a temporal feature map.
- the temporal attention module 300 may calculate a separate temporal self-attention feature vector by multiplying the temporal feature vector by a scaling parameter and adding the initial input video feature, as shown in Equation 8, to output as the temporal feature map.
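As with the spatial module, the bodies of Equations 5 through 8 were lost in extraction. Mirroring the spatial formulation with the K, L, and W_b projections named in the text, they plausibly read as follows; again a reconstruction, not the patent's verbatim notation:

```latex
% Eq. 5: key/query projections in the temporal axis domain
K(x_i) = W_K\, x_i, \qquad L(x_j) = W_L\, x_j

% Eq. 6: temporal attention level (softmax over temporal regions i)
\alpha_{j,i} = \frac{\exp\!\big(K(x_i)^{\top} L(x_j)\big)}
                    {\sum_{i'} \exp\!\big(K(x_{i'})^{\top} L(x_j)\big)}

% Eq. 7: temporal feature vector via the value projection W_b
o_j = \sum_{i} \alpha_{j,i}\, W_b\, x_i

% Eq. 8: scaled residual output with a second scaling parameter \delta
y_j = \delta\, o_j + x_j
```

The only structural difference from Equations 1 through 4 is the axis over which the attention is computed: here the matrix multiplication compares channel-time slices rather than spatial positions.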
- Human actions may be divided into two categories: slow-moving actions and fast-moving actions.
- Most existing action recognition networks put an emphasis on the slow actions and treat the fast actions as merely a kind of feature.
- the fast actions may be important at every moment, while the slow actions are usually unnecessary but may be meaningful in rare cases. Therefore, according to an exemplary embodiment of the present disclosure, human actions are divided into the fast actions and the slow actions, and the feature maps are extracted separately for each. That is, the kernels used in the convolution operations in the spatial attention module 200 and the temporal attention module 300 are differentiated to separately extract the feature map for the slow action and that for the fast action.
- the spatial attention module 200 may include a first kernel for the slow action recognition and a second kernel for the fast action recognition to store the transformed (i.e., projected) video features to be provided to a convolution operator.
- the first kernel may have a shape of 7×1×1, for example, and the second kernel may have a shape of 1×1×1, for example.
- the larger first kernel may be used to store the transformed video features during the process of calculating the feature map for the slow action recognition.
- the smaller second kernel may be used to store the transformed video features during the process of calculating the feature map for the fast action recognition.
- only one of the first and second kernels may operate at each moment under a control of a controller.
- the first kernel and the second kernel may operate simultaneously, so that both the feature map for the slow action recognition and the feature map for the fast action recognition may be calculated and concatenated by the concatenator 400 .
- the temporal attention module 300 may include a third kernel for the slow action recognition and a fourth kernel for the fast action recognition to store the transformed video features to be provided to a convolution operator.
- the third kernel may have a shape of 7×1×1, for example, and the fourth kernel may have a shape of 1×1×1, for example.
- the larger third kernel may be used to store the transformed video features during the process of calculating the feature map for the slow action recognition.
- the smaller fourth kernel may be used to store the transformed video features during the process of calculating the feature map for the fast action recognition.
- only one of the third and fourth kernels may operate at each moment under the control of the controller.
- the third kernel and the fourth kernel may operate simultaneously, so that both the feature map for the slow action recognition and the feature map for the fast action recognition may be calculated and concatenated by the concatenator 400 .
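The effect of the two kernel sizes can be illustrated with a simple stand-in: a width-7 temporal window (like the 7×1×1 kernel) mixes information from neighboring frames, which suits slowly evolving actions, while a width-1 window (like the 1×1×1 kernel) keeps each frame distinct, which preserves fast, brief events. A mean filter is used here purely for illustration; the patent's kernels are learned convolution weights:

```python
def temporal_window_features(seq, window):
    """Aggregate a per-frame feature sequence over a temporal window
    (simple mean as a stand-in for a learned temporal convolution).
    window=7 mimics the larger 'slow action' kernel; window=1 mimics
    the 'fast action' kernel, leaving each frame untouched."""
    half = window // 2
    out = []
    for t in range(len(seq)):
        lo, hi = max(0, t - half), min(len(seq), t + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

frames = [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]  # a brief, fast event at t=3
slow = temporal_window_features(frames, 7)      # the spike is smeared across frames
fast = temporal_window_features(frames, 1)      # the spike is preserved as-is
```

Running both window sizes and concatenating the results, as the modules above do, gives the classifier access to both views of the same motion.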
- the two feature maps from the spatial attention module 200 and the two feature maps from the temporal attention module 300 may be concatenated by the concatenator 400.
- FIG. 2 is a block diagram of an action recognition apparatus according to an exemplary embodiment of the present disclosure.
- the action recognition apparatus may include a processor 1020 , a memory 1040 , and a storage 1060 .
- the processor 1020 may execute program instructions stored in the memory 1040 and/or the storage 1060.
- the processor 1020 may be a central processing unit (CPU), a graphics processing unit (GPU), or another kind of dedicated processor suitable for performing the methods of the present disclosure.
- the memory 1040 may include, for example, a volatile memory such as a random access memory (RAM) and a nonvolatile memory such as a read only memory (ROM).
- the memory 1040 may load the program instructions stored in the storage 1060 to provide to the processor 1020 .
- the storage 1060 may include a tangible recording medium suitable for storing the program instructions, data files, data structures, and a combination thereof. Any device capable of storing data that may be readable by a computer system may be used for the storage. Examples of the storage medium may include magnetic media such as a hard disk, a floppy disk, and a magnetic tape; optical media such as a compact disk read only memory (CD-ROM) and a digital video disk (DVD); magneto-optical media such as a floptical disk; and semiconductor memories such as ROM, RAM, a flash memory, and a solid-state drive (SSD).
- the program instructions stored in the memory 1040 and/or the storage 1060 may implement an action recognition method according to an exemplary embodiment of the present disclosure. Such program instructions may be loaded into the memory 1040 and executed by the processor 1020 to implement the method according to the present disclosure.
- FIG. 3 is a flowchart showing the action recognition method according to an exemplary embodiment of the present disclosure.
- the backbone network 100 receives a certain number of video frames as one video data unit and extracts features of the input videos (S500). Subsequently, the bounding box generator 110 finds the location of a person in the video who may be the target of the action recognition, based on the input video features output by the backbone network 100, and generates the bounding box surrounding the person (S510).
- the RoI alignment unit 120 may pool the video features from the backbone network 100 through RoIAlign operations with reference to the bounding box information (S 520 ).
- the spatial attention module 200 may extract the spatial feature map from the RoIAligned video features (S 530 ).
- the temporal attention module 300 may extract the temporal feature map from the RoIAligned video features (S 540 ).
- the concatenator 400 may concatenate all the feature maps extracted by the spatial attention module 200 and the temporal attention module 300 to create a concatenated feature map (S 550).
- the determining unit 420 may perform the human action recognition based on the concatenated feature map (S 560 ).
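The determination in S 560 is described later in the disclosure as a dichotomous (binary) decision per action class against a threshold, so several actions can be recognized at once. A minimal sketch of that per-class decision, with hypothetical class names, logits, and threshold:

```python
import math

def recognize_actions(logits, class_names, threshold=0.5):
    """Per-class sigmoid score; an action is reported whenever its score
    exceeds the threshold, so co-occurring actions are all recognized."""
    scores = {name: 1.0 / (1.0 + math.exp(-z))
              for name, z in zip(class_names, logits)}
    return [name for name, s in scores.items() if s > threshold]

# Hypothetical logits from the determining unit for three action classes.
print(recognize_actions([2.0, -1.5, 0.3], ["stand", "sit", "talk to a person"]))
# -> ['stand', 'talk to a person']   (sigmoid(0.3) ~ 0.574 > 0.5)
```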
- FIG. 4 is an illustration for explaining a process of generating the feature map for the spatially slow action.
- the self-attention mechanism may include matrix operations of the key, query, and value matrices.
- the key matrix and the query matrix can be projected into different dimensions by a three-dimensional (3D) convolutional neural network.
- the window size along the spatial axis is set large, as is suitable for extracting the feature map for the spatially slow action, so that features spanning several frames may be extracted.
- the matrix multiplication of the key matrix and the query matrix may be performed in the spatial axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix.
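The step above (key-query matrix multiplication along the spatial axis, a Softmax attention map, then weighting of the value matrix) can be sketched as follows. The dense matrices standing in for the 3D-convolutional key/query/value projections are a simplifying assumption, and the large/small window is illustrated by letting the attention span all frames versus a single frame.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(feats, wk, wq, wv):
    # key/query matrix multiplication, Softmax self-attention map,
    # then the weight is reflected by multiplying with the value matrix
    k, q, v = feats @ wk, feats @ wq, feats @ wv
    return softmax(q @ k.T) @ v

rng = np.random.default_rng(1)
T, N, C = 4, 9, 8                       # frames, spatial positions (H*W), channels
video = rng.normal(size=(T, N, C))
wk, wq, wv = (rng.normal(size=(C, C)) for _ in range(3))

# "Slow" setting: a large window lets the attention span several frames at once.
slow = attention(video.reshape(T * N, C), wk, wq, wv).reshape(T, N, C)
# "Fast" setting: a small window restricts the attention to a single frame.
fast = np.stack([attention(frame, wk, wq, wv) for frame in video])
print(slow.shape, fast.shape)
```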
- FIG. 5 is an illustration for explaining a process of generating the feature map for the spatially fast action.
- the key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network.
- the window size along the spatial axis is set small, as is suitable for extracting the feature map for the spatially fast action, so that the features for a single frame may be extracted.
- the matrix multiplication of the key matrix and the query matrix may be performed in the spatial axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix.
- FIG. 6 is an illustration for explaining a process of generating the feature map for the temporally slow action.
- the key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network.
- the window size along the temporal axis is set large, as is suitable for extracting the feature map for the temporally slow action, so that features spanning several frames may be extracted.
- the matrix multiplication of the key matrix and the query matrix may be performed in the temporal axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix.
- FIG. 7 is an illustration for explaining a process of generating the feature map for the temporally fast action.
- the key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network.
- the window size along the temporal axis is set small, as is suitable for extracting the feature map for the temporally fast action, so that the features for a single frame may be extracted.
- the matrix multiplication of the key matrix and the query matrix may be performed in the temporal axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix.
- the AVA dataset is described in Chunhui Gu, Chen Sun, et al., “Ava: A video dataset of spatiotemporally localized atomic visual actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047-6056, and consists of 80 action classes. Each class falls largely into one of three categories: individual behaviors, behaviors related to people, and behaviors related to objects.
- the AVA dataset includes a total of 430 videos, which are split into 235 for training, 64 for validation, and 131 for testing. Each video is a 15-minute clip and includes one annotation per second.
- Frame-level average precision (frame-AP) was used as the evaluation metric.
- The intersection over union (IoU) threshold was set to 0.5 on the center frame of each video clip.
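Under that protocol, a predicted person box in the center frame counts as a true positive when its IoU with a ground-truth box is at least 0.5. A sketch of the IoU computation for axis-aligned boxes (the coordinates below are hypothetical):

```python
def box_iou(a, b):
    """Intersection over union of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Half-overlapping 10x10 boxes: intersection 50, union 150.
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333, below the 0.5 threshold
```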
- FIG. 8 is a table summarizing performance evaluation results of the action recognition method of the present disclosure and conventional methods performed using the AVA dataset.
- the Single Frame model and AVA Baseline model are disclosed in Chunhui Gu, et al., “Ava: A video dataset of spatiotemporally localized atomic visual actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047-6056.
- the ARCN model is disclosed in Chen Sun, et al., “Actor-centric relation network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 318-334.
- the STEP model is disclosed in Xitong Yang, et al., “Step: Spatiotemporal progressive learning for video action detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 264-272.
- the Structured Model for Action Detection is disclosed in Yubo Zhang, et al., “A structured model for action detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9975-9984.
- the Action Transformer model is disclosed in Rohit Girdhar, et al., “Video action transformer network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 244-253.
- FIGS. 9A and 9B are graphs showing comparison results of the frame APs for the cases with and without the spatio-temporal self-attention mechanism according to the present disclosure.
- When the spatio-temporal self-attention mechanism of the present disclosure was used, the performance improved in 39 classes; in particular, large gains occurred for low-performance classes such as those associated with interactions with objects or with other humans.
- the reason is that the spatio-temporal self-attention mechanism is applied to the features obtained through RoIPool, allowing the network to focus more on objects or humans in the pooled surrounding context. Therefore, it can be said that the spatio-temporal self-attention mechanism of the present disclosure may be useful for long-range interactions.
- the spatio-temporal self-attention mechanism may extract important spatial information, temporal information, slow action information, and fast action information from the input videos.
- the proposed features may play major roles in distinguishing action classes.
- Experiments revealed that the method of the present disclosure may achieve remarkable performance compared to conventional networks while using fewer resources and having a simpler structure.
- the apparatus and method according to exemplary embodiments of the present disclosure can be implemented by computer-readable program codes or instructions stored on a tangible computer-readable recording medium.
- the computer-readable recording medium includes all types of recording media storing data readable by a computer system.
- the computer-readable recording medium may be distributed over computer systems connected through a network so that a computer-readable program or code may be stored and executed in a distributed manner.
- the computer-readable recording medium may include a hardware device specially configured to store and execute program commands, such as ROM, RAM, and flash memory.
- the program commands may include not only machine language codes such as those produced by a compiler, but also high-level language codes executable by a computer using an interpreter or the like.
- Some aspects of the present disclosure described above in the context of the device may indicate corresponding descriptions of the method according to the present disclosure, and the blocks or devices may correspond to operations of the method or features of the operations. Similarly, some aspects described in the context of the method may be expressed by features of blocks, items, or devices corresponding thereto. Some or all of the operations of the method may be performed by use of a hardware device such as a microprocessor, a programmable computer, or electronic circuits, for example. In some exemplary embodiments, one or more of the most important operations of the method may be performed by such a device.
- a programmable logic device such as a field-programmable gate array may be used to perform some or all of functions of the methods described herein.
- the field-programmable gate array may be operated with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by a certain hardware device.
Abstract
The present disclosure provides an action recognition method including: acquiring video features for input videos; generating a bounding box surrounding a person who may be a target for an action recognition; pooling the video features based on bounding box information; extracting at least one spatial feature map from pooled video features; extracting at least one temporal feature map from pooled video features; concatenating the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and performing a human action recognition based on the concatenated feature map.
Description
- The present application claims a convention priority based on Korean Patent Application No. 10-2020-0161680 filed on Nov. 26, 2020, with the Korean Intellectual Property Office (KIPO), the entire content of which is incorporated herein by reference.
- The present disclosure relates to an action recognition method and apparatus and, more particularly, to a method and apparatus for recognizing a human action by using an action recognition neural network.
- Action recognition, which locates a person in a video and recognizes what action the person is doing, is a core technology in the field of computer vision that is widely used in various industries such as video surveillance cameras, human-computer interactions, and autonomous driving. One of the most widely used methods of recognizing the human action is object-detection-based recognition. The action recognition requires discriminating the complex and various motions contained in videos, and is associated with many complicated real-world problems that must be addressed.
- Deep Convolutional Neural Networks (CNNs) have achieved great performance in image classification, object detection, and semantic segmentation. Attempts are being made to apply CNNs to the action recognition, but progress is slow, partly because many human actions are associated with other persons or objects and are difficult to recognize using only local features. Human actions may be divided into three categories: person movement, object manipulation, and person interaction. Thus, in order to recognize the human action, the interactions with objects and/or other persons should be taken into account.
- Provided is a method and apparatus for recognizing a human action taking the interactions with objects and/or other persons into account.
- Provided is a method and apparatus for recognizing the human action by applying a self-attention mechanism to extract a feature map in the spatial axis domain and a feature map in the temporal axis domain, and performing the recognition by using all the feature maps.
- According to an aspect of an exemplary embodiment, an action recognition method includes: acquiring video features for input videos; generating a bounding box surrounding a person who may be a target for an action recognition; pooling the video features based on bounding box information; extracting at least one spatial feature map from pooled video features; extracting at least one temporal feature map from pooled video features; concatenating the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and performing a human action recognition based on the concatenated feature map.
- Pooling of the video features may be performed through RoIAlign operations.
- Extracting of at least one spatial feature map may include a process of generating a feature map for a spatially fast action and a process of generating a feature map for a spatially slow action.
- Extracting of at least one temporal feature map may include a process of generating a feature map for a temporally fast action and a process of generating a feature map for a temporally slow action.
- Each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action may include: projecting the pooled video features into two new feature spaces; calculating a spatial attention map having components representing influences between spatial regions; and obtaining a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
- Each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action may further include: generating the spatial feature map by multiplying a first scaling parameter to the spatial feature vector and adding the video feature.
- Each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action may include: projecting the pooled video features into two new feature spaces; calculating a temporal attention map having components representing influences between temporal regions; and obtaining a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
- Each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action may further include: generating the temporal feature map by multiplying a second scaling parameter to the temporal feature vector and adding the video feature.
- According to another aspect of an exemplary embodiment, an apparatus for recognizing a human action from videos includes: a processor and a memory storing program instructions to be executed by the processor. When executed by the processor, the program instructions cause the processor to: acquire video features for input videos; generate a bounding box surrounding a person who may be a target for an action recognition; pool the video features based on bounding box information; extract at least one spatial feature map from pooled video features; extract at least one temporal feature map from pooled video features; concatenate the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and perform a human action recognition based on the concatenated feature map.
- The program instructions causing the processor to pool the video features may cause the processor to pool the video features through RoIAlign operations.
- The program instructions causing the processor to extract the at least one spatial feature map may include instructions causing the processor to: generate a feature map for a spatially fast action; and generate a feature map for a spatially slow action.
- The program instructions causing the processor to extract the at least one temporal feature map may include instructions causing the processor to: generate a feature map for a temporally fast action; and generate a feature map for a temporally slow action.
- Each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action may include instructions causing the processor to: project the pooled video features into two new feature spaces; calculate a spatial attention map having components representing influences between spatial regions; and obtain a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
- Each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action may further include instructions causing the processor to generate the spatial feature map by multiplying a first scaling parameter to the spatial feature vector and adding the video feature.
- Each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action may include instructions causing the processor to: project the pooled video features into two new feature spaces; calculate a temporal attention map having components representing influences between temporal regions; and obtain a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
- Each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action may further include instructions causing the processor to generate the temporal feature map by multiplying a second scaling parameter to the temporal feature vector and adding the video feature.
- Since the self-attention mechanism according to an exemplary embodiment of the present disclosure recognizes the human action using both the spatial feature map and the temporal feature map, the human action may be recognized by taking into account a person's hand, face, objects, and other persons' features. In addition, since the feature map is extracted by reflecting the features of both the slow action and the fast action, it is possible to properly distinguish differences in characteristic features according to the genders and ages of persons. Performance improvement was confirmed in 44 of the 60 evaluation items compared to a basic action recognition algorithm. In addition, the performance improvement may be achieved with a simple network structure.
- In order that the disclosure may be well understood, there will now be described various forms thereof, given by way of example, reference being made to the accompanying drawings, in which:
- FIG. 1 is a block diagram showing an overall structure of a spatio-temporal self-attention network according to an exemplary embodiment of the present disclosure;
- FIG. 2 is a block diagram of an action recognition apparatus according to an exemplary embodiment of the present disclosure;
- FIG. 3 is a flowchart showing the action recognition method according to an exemplary embodiment of the present disclosure;
- FIG. 4 is an illustration for explaining a process of generating the feature map for the spatially slow action;
- FIG. 5 is an illustration for explaining a process of generating the feature map for the spatially fast action;
- FIG. 6 is an illustration for explaining a process of generating the feature map for the temporally slow action;
- FIG. 7 is an illustration for explaining a process of generating the feature map for the temporally fast action;
- FIG. 8 is a table summarizing performance evaluation results of the action recognition method of the present disclosure and conventional methods performed using the AVA dataset; and
- FIGS. 9A and 9B are graphs showing comparison results of the frame APs for the cases with and without the spatio-temporal self-attention mechanism according to the present disclosure.
- The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.
- For a clearer understanding of the features and advantages of the present disclosure, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. However, it should be understood that the present disclosure is not limited to particular embodiments disclosed herein but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure. In the drawings, similar or corresponding components may be designated by the same or similar reference numerals.
- The terminologies including ordinals such as “first” and “second” designated for explaining various components in this specification are used to discriminate a component from the other ones but are not intended to be limiting to a specific component. For example, a second component may be referred to as a first component and, similarly, a first component may also be referred to as a second component without departing from the scope of the present disclosure. As used herein, the term “and/or” may include a presence of one or more of the associated listed items and any and all combinations of the listed items.
- When a component is referred to as being “connected” or “coupled” to another component, the component may be directly connected or coupled logically or physically to the other component or indirectly through an object therebetween. Contrarily, when a component is referred to as being “directly connected” or “directly coupled” to another component, it is to be understood that there is no intervening object between the components. Other words used to describe the relationship between elements should be interpreted in a similar fashion.
- The terminologies are used herein for the purpose of describing particular exemplary embodiments only and are not intended to limit the present disclosure. The singular forms include plural referents as well unless the context clearly dictates otherwise. Also, the expressions “comprises,” “includes,” “constructed,” and “configured” are used to refer to the presence of a combination of stated features, numbers, processing steps, operations, elements, or components, but are not intended to preclude the presence or addition of another feature, number, processing step, operation, element, or component.
- Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present disclosure pertains. Terms such as those defined in a commonly used dictionary should be interpreted as having meanings consistent with their meanings in the context of related literatures and will not be interpreted as having ideal or excessively formal meanings unless explicitly defined in the present application.
- In the following description, in order to facilitate an overall understanding thereof, the same components are assigned the same reference numerals in the drawings and are not redundantly described here.
- Research on analyzing and localizing human behavior in video data has recently accelerated. The datasets most commonly used for training and evaluating such work include Kinetics and UCF-101. A dataset may include person movements, human-to-human interactions, and human-object interactions. As new data have come out, understanding the relationships between people and the associations between people and objects has become a critical factor in action recognition, and it is also important to apprehend the situation appropriately. There have been several approaches to action recognition. In some approaches, human joint information is found through human pose estimation, while other approaches judge the human action by capturing how each joint moves along the temporal axis. Some other networks use more abundant information by fusing video and optical flow features. However, the recent trend is to solve the action recognition using only video clips.
- A self-attention mechanism, which is more widely used than Recurrent Neural Networks (RNNs) in the field of natural language processing, reveals good performance in machine translation and image captioning. The self-attention mechanism is expected to reveal noticeable performance improvements and to expand its use in many other fields as well.
- A general self-attention mechanism performs a matrix operation on a key feature vector and a query feature vector to find a relationship among three feature vectors (the key, the query, and the value), and extracts an attention map that takes the long-range interaction into account through a softmax operation. The extracted attention map serves as an index for determining the relationship of each element with the other elements in the input data. Finally, the attention map is subjected to a matrix multiplication with the value feature vector so that the relationship is reflected.
- The action recognition of the present disclosure applies the self-attention mechanism, which takes the long-range interaction into account, to the action recognition problem, and uses temporal information along with spatial information when applying the self-attention mechanism to the video action recognition problem.
- FIG. 1 is a block diagram showing an overall structure of a spatio-temporal self-attention network according to an exemplary embodiment of the present disclosure. The spatio-temporal self-attention network shown in the drawing includes a backbone network 100, a bounding box generator 110, a region-of-interest (RoI) alignment unit 120, a spatial attention module 200, a temporal attention module 300, a concatenator 400, and a determination unit 420. - The
backbone network 100 receives a certain number of video frames as one video data unit and extracts features of the input videos. The video data unit may include 32 video frames, for example. The backbone network 100 may be implemented by a Residual network (ResNet) or an Inflated 3D convolutional network (I3D) pre-trained with the Kinetics-400 dataset, for example. - The
bounding box generator 110 finds a location of a person in the video who may be a target of the action recognition, based on the input video features output by the backbone network 100, and generates a bounding box surrounding the person. Also, the bounding box generator 110 may update a position and size of the bounding box by performing regression operations with reference to a feature map output by the concatenator 400. The bounding box generator 110 may be implemented based on a Region Proposal Network (RPN) used in Fast Region-based Convolutional Neural Networks (Fast R-CNNs). - The
RoI alignment unit 120 may pool the video features from the backbone network 100 through RoIAlign operations with reference to the bounding box information from the bounding box generator 110. - The
spatial attention module 200 extracts a feature map for an area to be intensively considered on the spatial axis domain from the RoIAligned video features. In particular, the spatial attention module 200 may separately extract a spatial slow action self-attention feature map and a spatial fast action self-attention feature map. While conventional self-attention mechanisms were used to identify relationships between pixels in an image, an exemplary embodiment of the present disclosure uses the spatial self-attention mechanism to extract spatially significant regions from the video features. For this purpose, the spatial attention module 200 is pre-trained to focus on the video features (e.g., a hand or face) that are useful for determining the human action from the video features. - The
temporal attention module 300 extracts a feature map for an area to be intensively considered on the temporal axis domain from the RoIAligned video features. In particular, the temporal attention module 300 may separately extract a temporal slow action self-attention feature map and a temporal fast action self-attention feature map. In general, there is a difference in the amount of information that may be obtained from the input frames constituting the input video between feature vectors at a point where the human action starts or ends and a feature vector while the action is in progress. Therefore, the temporal attention module 300 may extract the feature vector useful for determining the human action from the video features when viewed from the temporal axis domain. - The
concatenator 400 may concatenate all the feature maps extracted by the spatial attention module 200 and the temporal attention module 300 to create a concatenated feature map, and the determining unit 420 performs the human action recognition based on the concatenated feature map. Considering that the human action is complex, a dichotomous cross-entropy may be applied for each behavior so that an action may be recognized as the human action if a determination value is higher than a threshold, according to an exemplary embodiment of the present disclosure. - The
spatial attention module 200 and the temporal attention module 300 will now be described in more detail. - The
spatial attention module 200 may receive the video features having a shape of C×T×H×W from the backbone network 100 through the RoI alignment unit 120, where C, T, H, W denote a number of channels, a temporal duration, a height, and a width, respectively. The spatial attention module 200 transforms the video features into C×T first features and H×W second features. The data transformation may be performed by a separate member other than the spatial attention module 200. Alternatively, the data transformation may just mean a selective use of only some portion of the video features stored in the memory rather than an actual data manipulation. Here, ‘C×T’ may represent a number of feature channels and temporal spaces, and ‘H×W’ may represent a number of spatial feature maps. - The
spatial attention module 200 projects the transformed video features x ∈ R^((C×T)×(H×W)) into two new feature spaces F, G according to Equation 1. This projection corresponds to a multiplication of a key matrix by a query matrix in the spatial axis domain.
- F(x_i) = W_F·x_i, G(x_j) = W_G·x_j (Equation 1)
- Subsequently, the
spatial attention module 200 may calculate a spatial attention map. Each component of the spatial attention map may be referred to as a spatial attention level β_j,i between regions, e.g., pixels, and may be calculated by Equation 2. The spatial attention level β_j,i, which is a Softmax function value, may represent an extent to which the model attends to an i-th region when synthesizing a j-th region. In other words, the spatial attention level may denote a degree of influence of the i-th region on the j-th region.
- β_j,i = exp(s_j,i) / Σ_k exp(s_j,k), where s_j,i = F(x_i)ᵀ·G(x_j) (Equation 2)
- The
spatial attention module 200 may obtain a spatial feature vector by a matrix multiplication of the spatial attention map with the input data. That is, each component of the spatial feature vector may be expressed by Equation 3. The spatial feature vector may be constructed to reflect the weights by the multiplication of the spatial attention map by a value matrix.
- oj = Σi βj,i·h(xi),  where h(xi) = Wh·xi    (Equation 3)

- In the formulation above, WF, WG, and Wh are learned weight parameters, which may be implemented by respective 3D kernels having shapes of 1×1×1.
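As an illustration of the spatial self-attention computation described above (Equations 1 through 3), the following is a minimal numpy sketch. It assumes, for illustration only, that the 1×1×1 projections WF, WG, and Wh can be modeled as dense matrices acting on the reshaped (C·T)×(H·W) features; all shapes and values are hypothetical.

```python
import numpy as np

# Hypothetical sizes: C channels, T frames, an H x W spatial grid.
C, T, H, W = 4, 2, 3, 3
rng = np.random.default_rng(0)

# Video features of shape C x T x H x W, reshaped so that rows index the
# C x T axis and columns index the H x W spatial positions.
x = rng.standard_normal((C, T, H, W)).reshape(C * T, H * W)

# The 1x1x1 projections are modeled here as dense (C*T) x (C*T) matrices
# (an assumption for this sketch; the real parameters are learned).
W_F = rng.standard_normal((C * T, C * T))
W_G = rng.standard_normal((C * T, C * T))
W_h = rng.standard_normal((C * T, C * T))

F, G, h = W_F @ x, W_G @ x, W_h @ x   # Equation 1, plus the value projection

# Equation 2: s[i, j] = F(x_i)^T G(x_j); a softmax over i yields the spatial
# attention map, whose column j is the distribution of attention that
# region j pays to every region i.
s = F.T @ G                            # (H*W) x (H*W)
beta = np.exp(s - s.max(axis=0, keepdims=True))
beta /= beta.sum(axis=0, keepdims=True)

# Equation 3: o_j = sum_i beta_{j,i} h(x_i), done as one matrix product.
o = h @ beta                           # (C*T) x (H*W)

print(o.shape)                         # (8, 9)
```

Each column of the attention map sums to one, so every synthesized region mixes the value features of all regions with normalized weights.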
- In an exemplary embodiment, the
spatial attention module 200 may output the spatial feature vector expressed by the Equation 3 as a spatial feature map. Alternatively, however, the spatial attention module 200 may calculate a separate spatial self-attention feature vector by multiplying the spatial feature vector by a scaling parameter and adding the initial input video feature as shown in Equation 4 to output as the spatial feature map. -
- yj = γ·oj + xj,  where γ is a learned scaling parameter    (Equation 4)

- The
temporal attention module 300 may receive the video features having a shape of C×T×H×W from the backbone network 100 through the RoI alignment unit 120, where C, T, H, W denote a number of channels, a temporal duration, a height, and a width, respectively. The temporal attention module 300 transforms the video features into C×T first features and H×W second features. The temporal attention module 300 may receive the transformed video features from the spatial attention module 200. Also, the data transformation may be performed by a separate member other than the spatial attention module 200 or the temporal attention module 300. Alternatively, the data transformation may just mean a selective use of only some portion of the video features stored in the memory rather than an actual data manipulation. Here, ‘C×T’ may represent a number of feature channels and temporal spaces, and ‘H×W’ may represent a number of spatial feature maps. - The
temporal attention module 300 projects the transformed video features x ∈ R^((C×T)×(H×W)) into two new feature spaces K, L according to Equation 5. This projection corresponds to a multiplication of a key matrix by a query matrix in the time axis domain. -
- K(x) = WK·x,  L(x) = WL·x    (Equation 5)

- Subsequently, the
temporal attention module 300 may calculate a temporal attention map. Each component of the temporal attention map may be referred to as a temporal attention level αj,i between regions, e.g., pixels, and may be calculated by Equation 6. The temporal attention level αj,i, which is a Softmax function value, may represent an extent to which the model attends to an i-th region when synthesizing a j-th region. In other words, the temporal attention level αj,i may denote a degree of influence of the i-th region on the j-th region. -
- αj,i = exp(tij) / Σi exp(tij),  where tij = K(xi)^T L(xj)    (Equation 6)

- The
temporal attention module 300 may obtain a temporal feature vector by a matrix multiplication of the temporal attention map with the input data. That is, each component of the temporal feature vector may be expressed by Equation 7. The temporal feature vector may be constructed to reflect the weights by the multiplication of the temporal attention map by a value matrix. -
- oj = Σi αj,i·b(xi),  where b(xi) = Wb·xi    (Equation 7)

- In the formulation above, WK, WL, and Wb are learned weight parameters, which may be implemented by respective 3D kernels having shapes of 1×1×1.
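By analogy, the temporal branch (Equations 5 through 7, together with the scaled residual of Equation 8) can be sketched as follows. It is assumed, purely for illustration, that attention operates among the T time positions, with each frame's features flattened into one row; all shapes and the scaling value are hypothetical.

```python
import numpy as np

# Hypothetical sizes: T frames, D = C*H*W features per frame.
T, D = 5, 8
rng = np.random.default_rng(1)
x = rng.standard_normal((T, D))    # one row of features per frame

W_K = rng.standard_normal((D, D))  # key projection   (Equation 5)
W_L = rng.standard_normal((D, D))  # query projection (Equation 5)
W_b = rng.standard_normal((D, D))  # value projection

K, L, b = x @ W_K, x @ W_L, x @ W_b

# Equation 6: temporal attention map. Row j is a softmax distribution over
# the time positions i that influence position j.
s = L @ K.T                         # T x T, s[j, i] = L(x_j) . K(x_i)
alpha = np.exp(s - s.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# Equation 7: temporal feature vector o_j = sum_i alpha[j, i] * b(x_i).
o = alpha @ b

# Equation 8: scaled residual output; delta stands in for the learned
# scaling parameter.
delta = 0.1
y = delta * o + x

print(y.shape)                      # (5, 8)
```

The residual term keeps the original video feature dominant early in training, with the attention contribution growing as the scaling parameter is learned.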
- In an exemplary embodiment, the
temporal attention module 300 may output the temporal feature vector expressed by the Equation 7 as a temporal feature map. Alternatively, however, the temporal attention module 300 may calculate a separate temporal self-attention feature vector by multiplying the temporal feature vector by a scaling parameter and adding the initial input video feature as shown in Equation 8 to output as the temporal feature map. -
- yj = δ·oj + xj,  where δ is a learned scaling parameter    (Equation 8)

- Human actions may be divided into two categories: slow-moving actions and fast-moving actions. Most of the existing action recognition networks put an emphasis on the slow actions and treat the fast actions as a kind of feature. However, the inventors believe that the fast actions may be important at every moment while the slow actions are usually unnecessary but may be meaningful in rare cases. Therefore, according to an exemplary embodiment of the present disclosure, human actions are divided into the fast actions and the slow actions, and the feature maps are extracted separately for the fast actions and the slow actions. That is, the kernels used in the convolution operations in each of the
spatial attention module 200 and the temporal attention module 300 are differentiated to separately extract the feature map for the slow action and that for the fast action. - That is, the
spatial attention module 200 may include a first kernel for the slow action recognition and a second kernel for the fast action recognition to store the transformed (i.e., projected) video features to be provided to a convolution operator. The first kernel may have a shape of 7×1×1, for example, and the second kernel may have a shape of 1×1×1, for example. The larger first kernel may be used to store the transformed video features during the process of calculating the feature map for the slow action recognition. The smaller second kernel may be used to store the transformed video features during the process of calculating the feature map for the fast action recognition. In an embodiment, only one of the first and second kernels may operate at each moment under a control of a controller. Alternatively, however, the first kernel and the second kernel may operate simultaneously, so that both the feature map for the slow action recognition and the feature map for the fast action recognition may be calculated and concatenated by the concatenator 400. - The
temporal attention module 300 may include a third kernel for the slow action recognition and a fourth kernel for the fast action recognition to store the transformed video features to be provided to a convolution operator. The third kernel may have a shape of 7×1×1, for example, and the fourth kernel may have a shape of 1×1×1, for example. The larger third kernel may be used to store the transformed video features during the process of calculating the feature map for the slow action recognition. The smaller fourth kernel may be used to store the transformed video features during the process of calculating the feature map for the fast action recognition. In an embodiment, only one of the third and fourth kernels may operate at each moment under the control of the controller. Alternatively, however, the third kernel and the fourth kernel may operate simultaneously, so that both the feature map for the slow action recognition and the feature map for the fast action recognition may be calculated and concatenated by the concatenator 400. In this case, the two feature maps from the spatial attention module 200 and the two feature maps from the temporal attention module 300 may be concatenated by the concatenator 400. -
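The effect of the differently sized kernels can be illustrated with a toy depthwise temporal convolution: a 7-wide window aggregates context across several frames for the slow-action map, while a 1-wide window preserves per-frame detail for the fast-action map. The uniform kernel weights below are hypothetical stand-ins for the learned 7×1×1 and 1×1×1 kernels, with the spatial dimensions collapsed for brevity.

```python
import numpy as np

# Hypothetical input: C channels over T frames (spatial dims collapsed).
C, T = 3, 16
rng = np.random.default_rng(2)
feats = rng.standard_normal((C, T))

def temporal_conv(x, k):
    """Depthwise 1-D convolution along the time axis with a k-wide window.
    Uniform weights are used purely for illustration (the real kernels are
    learned); 'same' padding keeps the temporal length T."""
    kernel = np.ones(k) / k
    return np.stack([np.convolve(row, kernel, mode="same") for row in x])

slow = temporal_conv(feats, 7)   # large window: context over several frames
fast = temporal_conv(feats, 1)   # one-frame window: per-frame features

# The slow and fast feature maps are concatenated along the channel axis,
# analogous to what the concatenator 400 does.
combined = np.concatenate([slow, fast], axis=0)

print(combined.shape)            # (6, 16)
```

Note that a size-1 uniform kernel is the identity, so the fast branch passes the per-frame features through unchanged, while the slow branch smooths each channel over a 7-frame neighborhood.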
FIG. 2 is a block diagram of an action recognition apparatus according to an exemplary embodiment of the present disclosure. The action recognition apparatus according to an embodiment of the present disclosure may include a processor 1020, a memory 1040, and a storage 1060. - The
processor 1020 may execute program instructions stored in the memory 1040 and/or the storage 1060. The processor 1020 may be a central processing unit (CPU), a graphics processing unit (GPU), or another kind of dedicated processor suitable for performing the methods of the present disclosure. - The
memory 1040 may include, for example, a volatile memory such as a random access memory (RAM) and a nonvolatile memory such as a read only memory (ROM). The memory 1040 may load the program instructions stored in the storage 1060 to provide them to the processor 1020. - The
storage 1060 may include a non-transitory recording medium suitable for storing the program instructions, data files, data structures, and a combination thereof. Any device capable of storing data that may be readable by a computer system may be used for the storage. Examples of the storage medium may include magnetic media such as a hard disk, a floppy disk, and a magnetic tape, optical media such as a compact disk read only memory (CD-ROM) and a digital video disk (DVD), magneto-optical media such as a floptical disk, and semiconductor memories such as ROM, RAM, a flash memory, and a solid-state drive (SSD). - The program instructions stored in the
memory 1040 and/or the storage 1060 may implement an action recognition method according to an exemplary embodiment of the present disclosure. Such program instructions may be loaded into the memory 1040 under the control of the processor 1020 and executed by the processor 1020 to implement the method according to the present disclosure. -
FIG. 3 is a flowchart showing the action recognition method according to an exemplary embodiment of the present disclosure. - First, the
backbone network 100 receives a certain number of video frames as one video data unit and extracts features of the input videos (S500). Subsequently, the bounding box generator 110 finds a location of a person in the video who may be the target of the action recognition, based on the input video features output by the backbone network 100, and generates the bounding box surrounding the person (S510). The RoI alignment unit 120 may pool the video features from the backbone network 100 through RoIAlign operations with reference to the bounding box information (S520). - Next, the
spatial attention module 200 may extract the spatial feature map from the RoIAligned video features (S530). Meanwhile, the temporal attention module 300 may extract the temporal feature map from the RoIAligned video features (S540). In operation S550, the concatenator 400 may concatenate all the feature maps extracted by the spatial attention module 200 and the temporal attention module 300 to create a concatenated feature map. Finally, the determining unit 420 may perform the human action recognition based on the concatenated feature map (S560). -
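The flow of operations S500 through S560 can be sketched schematically. The backbone, bounding-box generator, RoIAlign, and attention modules below are crude stand-ins (the real components are learned networks), intended only to show how data moves through the pipeline; every shape and the box format are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
C, T, H, W = 8, 4, 14, 14

def backbone(frames):                 # S500: extract video features
    return rng.standard_normal((C, T, H, W))

def generate_bbox(features):          # S510: locate the person
    return (2, 2, 10, 10)             # hypothetical (x1, y1, x2, y2)

def roi_align(features, box, out=4):  # S520: crude stand-in for RoIAlign
    x1, y1, x2, y2 = box
    crop = features[:, :, y1:y2, x1:x2]
    idx = np.linspace(0, crop.shape[2] - 1, out).astype(int)
    return crop[:, :, idx][:, :, :, idx]

def spatial_attention(f):             # S530: spatial feature map (stub)
    return f.mean(axis=1)             # -> C x out x out

def temporal_attention(f):            # S540: temporal feature map (stub)
    return f.mean(axis=(2, 3))        # -> C x T

video = None                          # input frames would go here
feats = backbone(video)
pooled = roi_align(feats, generate_bbox(feats))
s_map = spatial_attention(pooled)
t_map = temporal_attention(pooled)

# S550: concatenate the extracted maps (flattened); S560 would classify
# the action from this concatenated feature map.
concat = np.concatenate([s_map.ravel(), t_map.ravel()])
print(concat.size)                    # 160
```

The point of the sketch is the data flow: pooling restricts attention to the person's region before the spatial and temporal maps are computed and joined.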
FIG. 4 is an illustration for explaining a process of generating the feature map for the spatially slow action. The self-attention mechanism may include matrix operations of the key, query, and value matrices. The key matrix and the query matrix can be projected into different dimensions by a three-dimensional (3D) convolutional neural network. In this case, the window size of the spatial axis is set to be large to be suitable for the extraction of the feature map for the spatially slow action, so that the features for several frames may be extracted. Afterwards, the matrix multiplication of the key matrix and the query matrix may be performed in the spatial axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix. -
FIG. 5 is an illustration for explaining a process of generating the feature map for the spatially fast action. The key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network. In this case, the window size of the spatial axis is set to be small to be suitable for the extraction of the feature map for the spatially fast action, so that the features for a single frame may be extracted. Afterwards, the matrix multiplication of the key matrix and the query matrix may be performed in the spatial axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix. -
FIG. 6 is an illustration for explaining a process of generating the feature map for the temporally slow action. The key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network. In this case, the window size of the temporal axis is set to be large to be suitable for the extraction of the feature map for the temporally slow action, so that the features for several frames may be extracted. Afterwards, the matrix multiplication of the key matrix and the query matrix may be performed in the temporal axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix. -
FIG. 7 is an illustration for explaining a process of generating the feature map for the temporally fast action. The key matrix and the query matrix can be projected into different dimensions by the 3D convolutional neural network. In this case, the window size of the temporal axis is set to be small to be suitable for the extraction of the feature map for the temporally fast action, so that the features for a single frame may be extracted. Afterwards, the matrix multiplication of the key matrix and the query matrix may be performed in the temporal axis domain, and the self-attention map may be generated using the Softmax function. Then, the weight may be reflected by multiplying the generated self-attention map by the value matrix. - The inventors evaluated the action recognition method according to an exemplary embodiment of the present disclosure using the AVA dataset. The AVA dataset is described in Chunhui Gu, Chen Sun, et al., "Ava: A video dataset of spatiotemporally localized atomic visual actions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047-6056, and consists of 80 action classes. Each class falls largely into one of three categories: individual behaviors, behaviors related to people, and behaviors related to objects. The AVA dataset includes a total of 430 videos, which are split into 235 for training, 64 for validation, and 131 for test. Each video is a 15-minute clip and includes one annotation per second. The inventors evaluated 60 classes, following other researchers' evaluations, and used at least 25 instances for validation. Frame-level average precision (frame-AP) was used as an evaluation index, and the intersection over union (IoU) threshold was set to 0.5 at the center frame of each video clip.
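The IoU criterion used in the evaluation above can be computed as follows; the (x1, y1, x2, y2) box format is an assumption for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, as used by
    the frame-AP metric with its 0.5 threshold."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x2 strip: intersection 2, union 6.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.3333333333333333
```

A detection counts as correct only when its IoU with the ground-truth box is at least the 0.5 threshold and the action class matches.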
-
FIG. 8 is a table summarizing performance evaluation results of the action recognition method of the present disclosure and conventional methods performed using the AVA dataset. In the table, the Single Frame model and the AVA Baseline model are disclosed in Chunhui Gu, et al., "Ava: A video dataset of spatiotemporally localized atomic visual actions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047-6056. The ARCN model is disclosed in Chen Sun, et al., "Actor-centric relation network," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 318-334. The STEP model is disclosed in Xitong Yang, et al., "Step: Spatiotemporal progressive learning for video action detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 264-272. The Structured Model for Action Detection is disclosed in Yubo Zhang, et al., "A structured model for action detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9975-9984. The Action Transformer model is disclosed in Rohit Girdhar, et al., "Video action transformer network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 244-253. - While early-stage action recognition networks used both RGB images and optical flow features, recently developed networks use only RGB images owing to the use of richer features such as the Graph Convolutional Network (GCN) and the attention mechanism. From Table 1, it can be seen that the recognition method of the present disclosure can obtain meaningful results using fewer image frames and lower resolution compared to other networks.
-
FIGS. 9A and 9B are graphs showing comparison results of the frame APs for the cases with and without the spatio-temporal self-attention mechanism according to the present disclosure. When the spatio-temporal self-attention mechanism of the present disclosure was used, the performance improved in 39 classes, and in particular, large improvements occurred for low-performance classes such as those associated with interactions with objects or interactions with other humans. The reason is that the spatio-temporal self-attention mechanism is applied to the features obtained through RoIPool, allowing the network to focus more on objects or humans in the surrounding pooled context. Therefore, it can be said that the spatio-temporal self-attention mechanism of the present disclosure may be useful for long-range interactions. - As described above, the spatio-temporal self-attention mechanism according to an exemplary embodiment of the present disclosure may extract important spatial information, temporal information, slow action information, and fast action information from the input videos. The proposed features may play major roles in distinguishing action classes. Experiments revealed that the method of the present disclosure may achieve remarkable performance compared to the conventional networks while using a smaller amount of resources and having a simpler structure.
- As mentioned above, the apparatus and method according to exemplary embodiments of the present disclosure can be implemented by computer-readable program codes or instructions stored on a computer-readable non-transitory recording medium. The computer-readable recording medium includes all types of recording media storing data readable by a computer system. The computer-readable recording medium may be distributed over computer systems connected through a network so that a computer-readable program or code may be stored and executed in a distributed manner.
- The computer-readable recording medium may include a hardware device specially configured to store and execute program commands, such as ROM, RAM, and flash memory. The program commands may include not only machine language codes such as those produced by a compiler, but also high-level language codes executable by a computer using an interpreter or the like.
- Some aspects of the present disclosure described above in the context of the device may indicate corresponding descriptions of the method according to the present disclosure, and the blocks or devices may correspond to operations of the method or features of the operations. Similarly, some aspects described in the context of the method may be expressed by features of blocks, items, or devices corresponding thereto. Some or all of the operations of the method may be performed by use of a hardware device such as a microprocessor, a programmable computer, or electronic circuits, for example. In some exemplary embodiments, one or more of the most important operations of the method may be performed by such a device.
- In some exemplary embodiments, a programmable logic device such as a field-programmable gate array may be used to perform some or all of functions of the methods described herein. In some exemplary embodiments, the field-programmable gate array may be operated with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by a certain hardware device.
- The description of the disclosure is merely exemplary in nature and, thus, variations that do not depart from the substance of the disclosure are intended to be within the scope of the disclosure. Such variations are not to be regarded as a departure from the spirit and scope of the disclosure. Thus, it will be understood by those of ordinary skill in the art that various changes in form and details may be made without departing from the spirit and scope as defined by the following claims.
Claims (16)
1. An action recognition method, comprising:
acquiring video features for input videos;
generating a bounding box surrounding a person who may be a target for an action recognition;
pooling the video features based on bounding box information;
extracting at least one spatial feature map from pooled video features;
extracting at least one temporal feature map from pooled video features;
concatenating the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and
performing a human action recognition based on the concatenated feature map.
2. The action recognition method of claim 1 , wherein pooling the video features is performed through RoIAlign operations.
3. The action recognition method of claim 1 , wherein extracting at least one spatial feature map comprises a process of generating a feature map for a spatially fast action and a process of generating a feature map for a spatially slow action.
4. The action recognition method of claim 3 , wherein extracting at least one temporal feature map comprises a process of generating a feature map for a temporally fast action and a process of generating a feature map for a temporally slow action.
5. The action recognition method of claim 4 , wherein each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action comprises:
projecting the pooled video features into two new feature spaces;
calculating a spatial attention map having components representing influences between spatial regions; and
obtaining a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
6. The action recognition method of claim 5 , wherein each of the process of generating the feature map for the spatially fast action and the process of generating the feature map for the spatially slow action further comprises:
generating the spatial feature map by multiplying a first scaling parameter to the spatial feature vector and adding the video feature.
7. The action recognition method of claim 4 , wherein each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action comprises:
projecting the pooled video features into two new feature spaces;
calculating a temporal attention map having components representing influences between temporal regions; and
obtaining a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
8. The action recognition method of claim 7 , wherein each of the process of generating the feature map for the temporally fast action and the process of generating the feature map for the temporally slow action further comprises:
generating the temporal feature map by multiplying a second scaling parameter to the temporal feature vector and adding the video feature.
9. An apparatus for recognizing a human action from videos, comprising:
a processor; and
a memory storing program instructions to be executed by the processor,
wherein the program instructions, when executed by the processor, cause the processor to:
acquire video features for input videos;
generate a bounding box surrounding a person who may be a target for an action recognition;
pool the video features based on bounding box information;
extract at least one spatial feature map from pooled video features;
extract at least one temporal feature map from pooled video features;
concatenate the at least one spatial feature map and the at least one temporal feature map to generate a concatenated feature map; and
perform a human action recognition based on the concatenated feature map.
10. The apparatus of claim 9 , wherein the program instructions causing the processor to pool the video features causes the processor to pool the video features through RoIAlign operations.
11. The apparatus of claim 9 , wherein the program instructions causing the processor to extract the at least one spatial feature map comprise instructions causing the processor to:
generate a feature map for a spatially fast action; and
generate a feature map for a spatially slow action.
12. The apparatus of claim 11 , wherein the program instructions causing the processor to extract the at least one temporal feature map comprise instructions causing the processor to:
generate a feature map for a temporally fast action; and
generate a feature map for a temporally slow action.
13. The apparatus of claim 12 , wherein each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action comprise instructions causing the processor to:
project the pooled video features into two new feature spaces;
calculate a spatial attention map having components representing influences between spatial regions; and
obtain a spatial feature vector by performing a matrix multiplication of the spatial attention map with input video data.
14. The apparatus of claim 13 , wherein each of the program instructions causing the processor to generate the feature map for the spatially fast action and the program instructions causing the processor to generate the feature map for the spatially slow action further comprise instructions causing the processor to:
generate the spatial feature map by multiplying a first scaling parameter to the spatial feature vector and adding the video feature.
15. The apparatus of claim 12 , wherein each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action comprise instructions causing the processor to:
project the pooled video features into two new feature spaces;
calculate a temporal attention map having components representing influences between temporal regions; and
obtain a temporal feature vector by performing a matrix multiplication of the temporal attention map with the input video feature.
16. The apparatus of claim 15 , wherein each of the program instructions causing the processor to generate the feature map for the temporally fast action and the program instructions causing the processor to generate the feature map for the temporally slow action further comprise instructions causing the processor to:
generate the temporal feature map by multiplying a second scaling parameter to the temporal feature vector and adding the video feature.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2020-0161680 | 2020-11-26 | ||
KR20200161680 | 2020-11-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220164569A1 true US20220164569A1 (en) | 2022-05-26 |
Family
ID=81658846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/512,544 Abandoned US20220164569A1 (en) | 2020-11-26 | 2021-10-27 | Action recognition method and apparatus based on spatio-temporal self-attention |
Country Status (2)
Country | Link |
---|---|
US (1) | US20220164569A1 (en) |
KR (1) | KR20220073645A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220303560A1 (en) * | 2021-03-16 | 2022-09-22 | Deepak Sridhar | Systems, methods and computer media for joint attention video processing |
CN115100740A (en) * | 2022-06-15 | 2022-09-23 | 东莞理工学院 | Human body action recognition and intention understanding method, terminal device and storage medium |
CN117351218A (en) * | 2023-12-04 | 2024-01-05 | 武汉大学人民医院(湖北省人民医院) | Method for identifying inflammatory bowel disease pathological morphological feature crypt stretching image |
CN117649630A (en) * | 2024-01-29 | 2024-03-05 | 武汉纺织大学 | Examination room cheating behavior identification method based on monitoring video stream |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102560480B1 (en) * | 2022-06-28 | 2023-07-27 | 퀀텀테크엔시큐 주식회사 | Systems and methods to support artificial intelligence modeling services on behavior perception over time |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10850693B1 (en) * | 2018-04-05 | 2020-12-01 | Ambarella International Lp | Determining comfort settings in vehicles using computer vision |
US20210073525A1 (en) * | 2019-09-11 | 2021-03-11 | Naver Corporation | Action Recognition Using Implicit Pose Representations |
US20220019807A1 (en) * | 2018-11-20 | 2022-01-20 | Deepmind Technologies Limited | Action classification in video clips using attention-based neural networks |
US20220058394A1 (en) * | 2020-08-20 | 2022-02-24 | Ambarella International Lp | Person-of-interest centric timelapse video with ai input on home security camera to protect privacy |
US20220059132A1 (en) * | 2020-08-19 | 2022-02-24 | Ambarella International Lp | Event/object-of-interest centric timelapse video generation on camera device with the assistance of neural network input |
US20220156944A1 (en) * | 2020-11-13 | 2022-05-19 | Samsung Electronics Co., Ltd. | Apparatus and method with video processing |
US20220292827A1 (en) * | 2021-03-09 | 2022-09-15 | The Research Foundation For The State University Of New York | Interactive video surveillance as an edge service using unsupervised feature queries |
US20220327835A1 (en) * | 2019-12-31 | 2022-10-13 | Huawei Technologies Co., Ltd. | Video processing method and apparatus |
US11498500B1 (en) * | 2018-08-31 | 2022-11-15 | Ambarella International Lp | Determining comfort settings in vehicles using computer vision |
US20220383639A1 (en) * | 2020-03-27 | 2022-12-01 | Sportlogiq Inc. | System and Method for Group Activity Recognition in Images and Videos with Self-Attention Mechanisms |
-
2021
- 2021-10-27 US US17/512,544 patent/US20220164569A1/en not_active Abandoned
- 2021-10-28 KR KR1020210145311A patent/KR20220073645A/en not_active Application Discontinuation
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220303560A1 (en) * | 2021-03-16 | 2022-09-22 | Deepak Sridhar | Systems, methods and computer media for joint attention video processing |
US11902548B2 (en) * | 2021-03-16 | 2024-02-13 | Huawei Technologies Co., Ltd. | Systems, methods and computer media for joint attention video processing |
CN115100740A (en) * | 2022-06-15 | 2022-09-23 | 东莞理工学院 | Human body action recognition and intention understanding method, terminal device and storage medium |
CN117351218A (en) * | 2023-12-04 | 2024-01-05 | 武汉大学人民医院(湖北省人民医院) | Method for identifying inflammatory bowel disease pathological morphological feature crypt stretching image |
CN117649630A (en) * | 2024-01-29 | 2024-03-05 | 武汉纺织大学 | Examination room cheating behavior identification method based on monitoring video stream |
Also Published As
Publication number | Publication date |
---|---|
KR20220073645A (en) | 2022-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220164569A1 (en) | Action recognition method and apparatus based on spatio-temporal self-attention | |
EP3399460B1 (en) | Captioning a region of an image | |
US9830529B2 (en) | End-to-end saliency mapping via probability distribution prediction | |
Najibi et al. | G-CNN: an iterative grid based object detector |
US9767381B2 (en) | Similarity-based detection of prominent objects using deep CNN pooling layers as features | |
KR102224253B1 (en) | Teacher-student framework for light weighted ensemble classifier combined with deep network and random forest and the classification method based on thereof | |
KR100421740B1 (en) | Object activity modeling method | |
Mahdi et al. | DeepFeat: A bottom-up and top-down saliency model based on deep features of convolutional neural networks | |
JP2023549579A (en) | Temporal Bottleneck Attention Architecture for Video Behavior Recognition | |
CN111523421A (en) | Multi-user behavior detection method and system based on deep learning and fusion of various interaction information | |
Munir et al. | LDNet: End-to-end lane marking detection approach using a dynamic vision sensor | |
Ahmadi et al. | Efficient and fast objects detection technique for intelligent video surveillance using transfer learning and fine-tuning | |
EP3995992A1 (en) | Method and system for detecting an action in a video clip | |
Termritthikun et al. | On-device facial verification using NUF-Net model of deep learning | |
KR102178469B1 (en) | Method and system for estimation of pedestrian pose orientation using soft target training based on teacher-student framework | |
Do et al. | Face tracking with convolutional neural network heat-map | |
Zhou et al. | Feature extraction based on local directional pattern with SVM decision-level fusion for facial expression recognition |
Huan et al. | Learning deep cross-scale feature propagation for indoor semantic segmentation | |
Zhang et al. | Facial keypoints detection using neural network | |
EP3627391A1 (en) | Deep neural net for localising objects in images, methods for preparing such a neural net and for localising objects in images, corresponding computer program product, and corresponding computer-readable medium | |
An et al. | MTAtrack: Multilevel transformer attention for visual tracking | |
Lin et al. | Human centric visual analysis with deep learning | |
TanujaPatgar | Convolution neural network based emotion classification cognitive model for facial expression |
Gouizi et al. | Nested-Net: a deep nested network for background subtraction | |
Wang et al. | G-NET: Accurate Lane Detection Model for Autonomous Vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: POSTECH RESEARCH AND BUSINESS DEVELOPMENT FOUNDATION, KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KIM, DAI JIN;KIM, MYEONG JUN;REEL/FRAME:058011/0913 Effective date: 20211027 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |