CN107944409B - Video analysis method and device capable of distinguishing key actions - Google Patents

Video analysis method and device capable of distinguishing key actions

Info

Publication number
CN107944409B
CN107944409B (application CN201711243388.8A)
Authority
CN
China
Prior art keywords
video block
matrix
video
attention
frame
Prior art date
Legal status
Active
Application number
CN201711243388.8A
Other languages
Chinese (zh)
Other versions
CN107944409A (en)
Inventor
季向阳
杨武魁
陈孝罡
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201711243388.8A
Publication of CN107944409A
Application granted
Publication of CN107944409B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The present disclosure relates to a video analysis method and apparatus, the method comprising: inputting a video to be identified into a single-frame identification model to obtain single-frame features of the single-frame images in the video to be identified; dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length; determining a feature stream matrix of each video block according to the single-frame features and the frame length of the single-frame images included in each video block; inputting the initial attention matrix and the feature stream matrix of the video block into a long-short term memory model for processing to obtain the attention matrix of the video block; and determining the attention vector of the video to be identified according to the attention matrices of the video blocks. The present disclosure selectively focuses on regions of the video that are spatially important and frames that are temporally relatively important, thereby reducing the impact of irrelevant information on video analysis results.

Description

Video analysis method and device capable of distinguishing key actions
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a video analysis method and apparatus.
Background
Video analysis is an important direction in the field of computer vision. In recent years, neural networks have made major breakthroughs in the field of image analysis, but compared with images, videos add time-dimension information, so it becomes all the more important for a machine to understand the relations between different video frames along the time dimension. Traditional methods usually describe the temporal information of a video with manual features such as optical flow, and often consider only the analysis results of individual single-frame images, so some key actions within the overall action in the video cannot be accurately distinguished, and the recognition result for the video is inaccurate.
Disclosure of Invention
In view of this, the present disclosure provides a video analysis method and apparatus to solve the problem that the recognition result of a video is inaccurate because the key actions within the overall action in the video cannot be accurately distinguished by conventional video analysis methods.
According to an aspect of the present disclosure, there is provided a video analysis method, the method including:
inputting a video to be identified into a single-frame identification model to obtain single-frame characteristics of a single-frame image in the video to be identified;
dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length;
determining a feature stream matrix of each video block according to the single-frame features and the frame length of the single-frame images included in each video block;
inputting the initial attention matrix and the feature stream matrix of the video block into a long-short term memory model for processing to obtain the attention matrix of the video block;
and determining the attention vector of the video to be identified according to the attention matrix of the video block.
In a possible implementation manner, inputting the initial attention matrix and the feature stream matrix of the video block into a long-term and short-term memory model for processing to obtain the attention matrix of the video block, including:
determining an initial attention matrix of the video block according to the feature width of the single frame feature, the feature height of the single frame feature and the frame length;
inputting the initial attention matrix and the feature stream matrix of the first video block into a long-short term memory model for processing to obtain an attention matrix of the first video block;
and taking the second video block and the subsequent video blocks as the current video block, and inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into a long-short term memory model for processing to obtain the attention matrix of the current video block.
In a possible implementation manner, inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into a long-term and short-term memory model for processing to obtain the attention matrix of the current video block, including:
weighting and summing the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integrated feature matrix;
and inputting the integrated characteristic matrix into a long-term and short-term memory model for processing to obtain an attention matrix of the current video block.
In one possible implementation, determining an attention vector of the video to be identified according to an attention matrix of a video block includes:
averaging the attention matrices of the video blocks where the single-frame images are located to obtain single-frame vectors of the single-frame images;
and obtaining the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
In one possible implementation, after the initial attention matrix and the feature stream matrix of the video block are input into the long-short term memory model for processing to obtain the attention matrix of the video block, the method further comprises:
obtaining the class probability of the current video block;
inputting the category probability into a classifier for processing to obtain the video block category of the current video block;
and determining the video category of the video to be identified according to the video block category of the video block.
According to another aspect of the present disclosure, there is provided a video analysis apparatus including:
the single-frame characteristic determining module is used for inputting the video to be identified into the single-frame identification model to obtain the single-frame characteristics of the single-frame image in the video to be identified;
the video block dividing module is used for dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length;
the feature stream matrix determining module is used for determining the feature stream matrix of each video block according to the single-frame features and the frame length of the single-frame images included in each video block;
the attention matrix determining module is used for inputting the initial attention matrix and the feature stream matrix of the video block into the long-short term memory model for processing to obtain the attention matrix of the video block;
and the attention vector determining module is used for determining the attention vector of the video to be identified according to the attention matrix of the video block.
In one possible implementation, the attention matrix determination module includes:
the initial attention matrix determining submodule is used for determining an initial attention matrix of the video block according to the feature width of the single frame features, the feature height of the single frame features and the frame length;
the first attention matrix determining submodule is used for inputting the initial attention matrix and the feature stream matrix of the first video block into a long-short term memory model for processing to obtain an attention matrix of the first video block;
and the subsequent attention matrix determining sub-module is used for taking the second video block and the subsequent video blocks as the current video block, and inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into the long-short term memory model for processing to obtain the attention matrix of the current video block.
In one possible implementation, the subsequent attention matrix determining sub-module includes:
the integration submodule is used for weighting and summing the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integration feature matrix;
and the long-short term memory model processing submodule is used for inputting the integrated characteristic matrix into the long-short term memory model for processing to obtain an attention matrix of the current video block.
In one possible implementation, the attention vector determination module includes:
the single-frame vector determining submodule is used for averaging the attention matrices of the video blocks where the single-frame image is located to obtain a single-frame vector of the single-frame image;
and the summation submodule is used for obtaining the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
In a possible implementation manner, the attention matrix determining module further includes:
a category probability determination submodule for obtaining a category probability of the current video block;
the classifier submodule is used for inputting the class probability into a classifier for processing to obtain the video block class of the current video block;
and the video category determining submodule is used for determining the video category of the video to be identified according to the video block category of the video block.
According to an aspect of the present disclosure, there is provided a video analysis apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions so as to implement any one of the methods described above.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement any one of the methods described above.
By dividing the video to be identified into video blocks and acquiring the single-frame characteristics of the single-frame image of the video to be identified, the method selectively focuses on the more important regions in the video space and the relatively important frames in time, and further reduces the influence of irrelevant information on the video analysis result. Furthermore, the attention model in the time domain can be used to filter key frames of the video. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a video analysis method according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a video analysis method according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a video analysis method according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a video analysis method according to an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a video analysis method according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram illustrating an application example of a video analysis method according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram illustrating an application example of a video analysis method according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram illustrating an application example of a video analysis method according to an embodiment of the present disclosure;
fig. 9 shows a block diagram of a video analysis apparatus according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of a video analysis apparatus according to an embodiment of the present disclosure;
fig. 11 shows a block diagram of a video analysis apparatus according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow chart of a video analysis method according to an embodiment of the present disclosure, as shown in fig. 1, the method includes the steps of:
and step S10, inputting the video to be recognized into the single-frame recognition model to obtain the single-frame characteristics of the single-frame image in the video to be recognized.
In one possible implementation, the video to be identified includes a plurality of frames of consecutive images. The single-frame recognition model is, for example, a trained convolutional neural network model. After the video to be recognized is input into the trained single-frame recognition model, the single-frame features of each frame image are obtained according to the set feature width (W), feature height (H) and feature dimension (D) of the single-frame features. The feature width W indexes the position of a single-frame feature vector along the width direction of the frame image, and the feature height H indexes its position along the height direction of the frame image.
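A minimal sketch of step S10, assuming a pretrained ResNet-18 as the single-frame recognition model (an illustrative choice; the disclosure only requires a trained convolutional neural network). The backbone is truncated before global pooling so each frame yields a W × H × D feature map; all names and shapes below are assumptions.

```python
import torch
import torchvision.models as models

# Keep everything up to the last convolutional stage so a spatial map survives.
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(cnn.children())[:-2])
backbone.eval()

def single_frame_features(frames):
    """frames: (N, 3, 224, 224) tensor of N video frames -> (N, W, H, D)."""
    with torch.no_grad():
        fmap = backbone(frames)        # (N, D, H, W); here D=512, H=W=7
    return fmap.permute(0, 3, 2, 1)    # reorder to (N, W, H, D)
```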
And step S20, dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length.
In one possible implementation, the frame length (T) is the number of consecutive frames in a video block, the starting frame is the first frame of each video block, and the identification step is the stride by which consecutive video blocks are offset. For example, with a frame length of 10, a starting frame of 1 and an identification step of 2, the 1st to 10th frames of the video to be identified form the first video block, the 3rd to 12th frames form the second video block, and so on. The position of the starting frame may be chosen randomly.
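The block division of step S20 is plain indexing; a sketch under the conventions above (the 1-based frame numbering and function name are illustrative):

```python
def divide_into_blocks(n_frames, frame_length=10, start_frame=1, step=2):
    """Return the 1-based frame indices of each video block."""
    blocks = []
    first = start_frame
    while first + frame_length - 1 <= n_frames:
        blocks.append(list(range(first, first + frame_length)))
        first += step
    return blocks

# divide_into_blocks(14) -> [[1, ..., 10], [3, ..., 12], [5, ..., 14]]
```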
Step S30, determining a feature stream matrix of each video block according to the single frame feature and the frame length of the single frame image included in each video block.
In a possible implementation manner, the feature stream matrix of each video block is obtained by stacking the single-frame features of the single-frame images in the video block. Each single-frame feature is a feature tensor of dimensions W × H × D, so the feature stream matrix (F) of each video block has dimensions T × W × H × D. It will be appreciated that the feature stream matrix expresses the spatial features of the video block.
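A sketch of step S30 under the same assumptions: the W × H × D single-frame features of the T frames in a block are stacked into the T × W × H × D feature stream matrix F.

```python
import torch

def feature_stream_matrix(frame_feats, block):
    """frame_feats: (N, W, H, D) single-frame features of the whole video;
    block: 1-based frame indices of one video block.
    Returns the (T, W, H, D) feature stream matrix F of that block."""
    idx = torch.tensor([i - 1 for i in block])   # convert to 0-based indices
    return frame_feats[idx]
```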
And step S40, inputting the initial attention matrix and the feature stream matrix of the video block into a long-short term memory model for processing to obtain the attention matrix of the video block.
In one possible implementation, the Long Short-Term Memory (LSTM) model is a time-recursive neural network suitable for processing and predicting significant events with relatively long intervals and delays in a time series. The attention matrix of a video block obtained through the long short-term memory model can provide feature information about the time sequence within the video block. Since the calculation of the long short-term memory model requires the input information of the previous time step, an initial attention matrix needs to be set for the first video block. The initial attention matrix can be given randomly or according to the training result of the long short-term memory model.
The long short-term memory model takes as input the previously calculated attention matrix of the last video block and the feature stream matrix of the current video block, obtains the feature information of each frame image of the video block in the time sequence, and yields the frames where the attention of each video block lies and the feature positions within those frames. It will be appreciated that the attention matrix expresses the temporal features of the video block.
Step S50, according to the attention matrix of the video block, determining the attention vector of the video to be identified.
In a possible implementation manner, the attention vector of the video to be identified can be obtained by integrating the attention matrices of the video blocks. If the entire video to be identified is treated as one video block and processed in the above steps, the attention matrix obtained for that video block is itself the attention vector of the video to be identified. The attention vector obtained by integration therefore carries both the spatial features and the temporal features of the video blocks.
The present disclosure selectively focuses on the regions of the video that are spatially important and the frames that are temporally relatively important, thereby reducing the impact of irrelevant information on video analysis results. Furthermore, the attention model in the time domain can be used to screen the key frames of the video. The method is closer to the way humans analyze video, and reduces the influence of irrelevant and redundant information on the key information.
Fig. 2 shows a flowchart of a video analysis method according to an embodiment of the present disclosure, and as shown in fig. 2, on the basis of the above embodiment, step S40 of the method includes:
step S41, determining the initial attention matrix of the video block according to the feature width of the single frame feature, the feature height of the single frame feature and the frame length.
And step S42, inputting the initial attention matrix and the feature stream matrix of the first video block into a long-short term memory model for processing to obtain the attention matrix of the first video block.
And step S43, taking the second video block and the subsequent video blocks as the current video block, and inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into the long-short term memory model for processing to obtain the attention matrix of the current video block.
In one possible implementation, the frame length T carries the time-series information. The attention matrix (L) of each video block has dimensions T × W × H, composed of the frame length T, the feature width W and the feature height H, and carries the temporal features within the video block. The initial attention matrix L0 and the feature stream matrix F1 of the first video block are input into the trained multi-layer LSTM model for processing to obtain the attention matrix L1 of the first video block. The attention matrix L1 of the first video block and the feature stream matrix F2 of the second video block are then input into the trained LSTM model for processing to obtain the attention matrix L2 of the second video block. The iterative calculation is repeated until the attention matrices of all the video blocks are obtained.
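The iteration of steps S41 to S43 could be sketched as below. The single LSTM cell, the hidden size, the softmax normalization of the attention logits, and the uniform initial attention matrix L0 are all illustrative assumptions rather than the exact architecture of this disclosure.

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Maps the integrated feature f of a block (a vector in R^D) to the
    T x W x H attention matrix of that block."""
    def __init__(self, T, W, H, D, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(D, hidden)
        self.head = nn.Linear(hidden, T * W * H)
        self.shape = (T, W, H)

    def forward(self, f, state=None):
        h, c = self.cell(f, state)                     # f: (1, D)
        logits = self.head(h).view(self.shape)
        L = torch.softmax(logits.flatten(), 0).view(self.shape)
        return L, (h, c)

def run_blocks(model, F_blocks):
    """F_blocks: list of (T, W, H, D) feature stream matrices F1, F2, ..."""
    T, W, H, _ = F_blocks[0].shape
    L = torch.full((T, W, H), 1.0 / (T * W * H))       # uniform initial L0
    state, attentions = None, []
    for F in F_blocks:
        f = (L.unsqueeze(-1) * F).sum(dim=(0, 1, 2))   # equation (1): f in R^D
        L, state = model(f.unsqueeze(0), state)        # next attention matrix
        attentions.append(L)
    return attentions
```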
Compared with a traditional 2-dimensional spatial attention model, the method and the device can adaptively focus on the area where the information in the video frame is relatively concentrated, and can also adaptively screen out the key frame in the video and focus on the key frame, so that the effect of video analysis is optimized.
Fig. 3 shows a flowchart of a video analysis method according to an embodiment of the present disclosure. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, step S43 of the method includes:
and step S431, carrying out weighted summation on the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integrated feature matrix.
And step S432, inputting the integrated feature matrix into a long-short term memory model for processing to obtain the attention matrix of the current video block.
In one possible implementation, F1 and L0 are weighted and summed according to equation (1) to obtain the integrated feature f0 ∈ R^D. f0 is fed into the trained LSTM, which outputs the attention matrix L1. The feature stream matrix F2 and L1 are then weighted and summed to obtain f1. By analogy, after N iterations, the attention matrix LN of the Nth video block is obtained.
f_{n-1} = Σ_{t=1}^{T} Σ_{i=1}^{W} Σ_{j=1}^{H} L_{n-1}(t, i, j) · F_n(t, i, j)    (1)
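Read in isolation, equation (1) is an attention-weighted pooling of the feature stream matrix; the NumPy sketch below assumes the (T, W, H) and (T, W, H, D) layouts used above.

```python
import numpy as np

def integrated_feature(L_prev, F_cur):
    """Equation (1): the attention weights L_prev (T, W, H) pool the
    D-dimensional feature vectors of F_cur (T, W, H, D) into f in R^D."""
    return np.einsum('twh,twhd->d', L_prev, F_cur)
```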
Fig. 6 is a schematic diagram illustrating an application example of a video analysis method according to an embodiment of the present disclosure. The left side of fig. 6 shows the single-frame features of the single-frame images in a video block. Each layer plane represents the feature of one frame image, and each solid square in a layer represents one vector value in a single-frame feature. It is to be understood that the position of each vector value in the plane, i.e., its feature height and feature width, is the position of the corresponding pixel in the single-frame image, and the height of each vector value represents the feature dimension. The single-frame features of the single-frame images are stacked together to form the feature stream matrix of the video block. On the right side of fig. 6 is the attention matrix of the video block, which consists of the feature width, the feature height and the frame length. The feature stream matrix on the left and the attention matrix on the right are integrated according to equation (1) to obtain the integrated feature matrix.
Fig. 4 shows a flowchart of a video analysis method according to an embodiment of the present disclosure. As shown in fig. 4, on the basis of the above embodiment, step S50 of the method includes:
step S51, averaging the attention moment arrays of the video blocks in which the single-frame images are located, to obtain a single-frame vector of the single-frame image.
And step S52, obtaining the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
In one possible implementation, when the video to be identified is divided into different video blocks, the same frame image may belong to multiple video blocks. Each pixel of a single-frame image is averaged over its values in the attention matrices of the different video blocks it belongs to, and the averages of all the pixels give the single-frame vector of the single-frame image. The single-frame vectors of all the single-frame images are then concatenated to obtain the attention vector of the video to be identified.
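A sketch of steps S51 and S52 under the same assumed layouts; the helper name is illustrative.

```python
import numpy as np

def video_attention_vector(attentions, blocks, n_frames):
    """attentions: list of (T, W, H) per-block attention arrays;
    blocks: the 1-based frame indices of each block, in the same order.
    Each frame's W x H attention map is averaged over every block that
    contains the frame; the per-frame maps are then concatenated."""
    _, W, H = attentions[0].shape
    acc = np.zeros((n_frames, W, H))
    cnt = np.zeros(n_frames)
    for L, block in zip(attentions, blocks):
        for t, frame in enumerate(block):
            acc[frame - 1] += L[t]
            cnt[frame - 1] += 1
    per_frame = acc / np.maximum(cnt, 1)[:, None, None]  # (n_frames, W, H)
    return per_frame.reshape(n_frames * W * H)           # attention vector
```

Summing per_frame over its two spatial dimensions instead of flattening it yields a vector with one value per frame, which is the key-frame screening vector described next.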
The present disclosure can be used to extract the key-frame information in a video and for information screening. Because the attention matrix reflects the degree of influence of different spatio-temporal regions on the video analysis result, the attention matrix can be summed over the spatial dimensions to obtain a vector whose length equals the number of frames, and the frames corresponding to the positions with higher values in that vector are more 'critical'.
Fig. 5 shows a flowchart of a video analysis method according to an embodiment of the present disclosure. On the basis of the above embodiment, after step S40 the method further includes:
in step S60, the class probability of the current video block is obtained.
And step S70, inputting the category probability into a classifier for processing to obtain the video block category of the current video block.
And step S80, determining the video category of the video to be identified according to the video block category of the video block.
In one possible implementation, the trained LSTM model may also output a video class probability vector for the video block.
In one possible implementation, the LSTM model is trained using video samples labeled with action categories, and the trained LSTM model can then be used to output the action categories in a video. After a video block is input into the LSTM model, the model outputs the action category probability of the video block at the same time as its attention matrix. The action category probability output by the LSTM model is passed through a softmax classifier to obtain the action category of the video block. Fig. 7 is a schematic diagram illustrating an application example of a video analysis method according to an embodiment of the present disclosure. As shown in fig. 7, alongside the attention matrix L output by the LSTM model, the softmax classifier produces the probability matrices of the different action classes, and among these the video category is the class with the highest score. The Nth iteration obtains the attention matrix LN, which records the positions in time and space where the key information of the video block is located. Summing it over space yields a vector of length T, and the video frames corresponding to positions with larger vector values are relatively more critical.
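A sketch of steps S60 to S80: each block's class scores pass through a softmax, and the block probabilities are aggregated here by a simple mean, an illustrative rule since the text does not spell out the aggregation.

```python
import torch

def classify_video(block_logits):
    """block_logits: (num_blocks, num_classes) class scores from the LSTM head.
    Softmax per block, then average over blocks and take the argmax as the
    video category."""
    probs = torch.softmax(block_logits, dim=1)   # per-block class probabilities
    return int(probs.mean(dim=0).argmax())       # index of the video category
```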
Fig. 8 shows an application flowchart of a video analysis method according to an embodiment of the present disclosure. As shown in fig. 8, the video to be recognized (the video in the leftmost block in the figure) is input into a trained CNN (convolutional neural network) model, and features are extracted frame by frame to obtain the single-frame features of the single-frame images. Video blocks are then selected in the video to be identified, starting from a randomly chosen frame. The feature stream matrix of each video block is weighted and summed with its attention matrix and input into a multi-layer LSTM model, which yields the attention matrix and the action classification result of each video block, where the attention matrix of each video block is calculated iteratively. For example, when a person watches a video, attention differs not only across space but also, depending on the video content, across the time dimension. In a video to be recognized of someone playing football, the first half of the video shows a person running and the second half shows the player actually kicking the ball. Therefore, in the attention vector of the video to be recognized obtained by the method of the present disclosure, the kicking action in the second half is given a higher weight. Compared with the traditional LSTM-based video classification approach, the present disclosure obtains higher classification precision.
Fig. 9 shows a block diagram of a video analysis apparatus according to an embodiment of the present disclosure, as shown in fig. 9, the apparatus including:
and the single-frame feature determining module 41 is configured to input the video to be recognized into the single-frame recognition model, so as to obtain a single-frame feature of a single-frame image in the video to be recognized.
And the video block dividing module 42 is configured to divide the video to be identified into video blocks according to the frame length, the start frame, and the identification step length.
And a feature stream matrix determining module 43, configured to determine a feature stream matrix of each video block according to a single frame feature and a frame length of a single frame image included in each video block.
And the attention matrix determining module 44 is used for inputting the initial attention matrix and the feature stream matrix of the video block into a long-term and short-term memory model for processing, so as to obtain the attention matrix of the video block.
And an attention vector determination module 45, configured to determine an attention vector of the video to be identified according to the attention matrix of the video block.
Fig. 10 shows a block diagram of a video analysis apparatus according to an embodiment of the present disclosure, on the basis of the embodiment as shown in fig. 9,
in one possible implementation, the attention matrix determining module 44 includes:
an initial attention moment matrix determining submodule 441, configured to determine an initial attention matrix of the video block according to a feature width of a single frame feature, a feature height of the single frame feature, and the frame length;
a first attention moment matrix determining sub-module 442, configured to input the initial attention matrix and the feature stream matrix of the first video block into a long-term and short-term memory model for processing, so as to obtain an attention matrix of the first video block;
and the subsequent attention moment matrix determining sub-module 443 is configured to use the second video block and the subsequent video blocks as a current video block, and input the attention matrix of the previous video block and the feature stream matrix of the current video block into the long-term and short-term memory model in sequence for processing, so as to obtain the attention matrix of the current video block.
In one possible implementation, the subsequent attention matrix determining sub-module includes:
the integration submodule is used for weighting and summing the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integration feature matrix;
and the long-short term memory model processing submodule is used for inputting the integrated characteristic matrix into the long-short term memory model for processing to obtain an attention matrix of the current video block.
In one possible implementation, the attention vector determination module 45 includes:
the single-frame vector determining sub-module 451 is used for averaging the attention moment arrays of the video blocks where the single-frame images are located to obtain single-frame vectors of the single-frame images;
and the summation submodule 452 is configured to obtain the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
In one possible implementation, the attention matrix determining module 44 further includes:
a category probability determination sub-module 444, configured to obtain a category probability of the current video block;
the classifier sub-module 445 is configured to input the class probability into a classifier for processing to obtain a video block class of the current video block;
the video category determining sub-module 446 is configured to determine the video category of the video to be identified according to the video block category of the video block.
Fig. 11 is a block diagram illustrating an apparatus 800 for video recognition according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 11, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the device 800 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of video analysis, the method comprising:
inputting a video to be identified into a single-frame identification model to obtain single-frame characteristics of a single-frame image in the video to be identified;
dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length;
determining a feature stream matrix of each video block according to the single-frame features and the frame length of the single-frame images included in each video block, wherein the feature stream matrix is used for representing the spatial features of the video block;
inputting the initial attention matrix and the feature stream matrix of the video block into a long-short term memory model for processing to obtain an attention matrix of the video block, wherein the attention matrix is used for representing the temporal features of the video block;
determining an attention vector of the video to be identified according to an attention matrix of a video block,
wherein, according to the attention matrix of the video block, determining the attention vector of the video to be identified comprises:
averaging the attention matrices of the video blocks where the single-frame images are located to obtain single-frame vectors of the single-frame images;
and obtaining the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
2. The method of claim 1, wherein inputting the initial attention matrix and the feature stream matrix of the video block into a long-term and short-term memory model for processing to obtain the attention matrix of the video block comprises:
determining an initial attention matrix of the video block according to the feature width of the single frame feature, the feature height of the single frame feature and the frame length;
inputting the initial attention matrix and the feature stream matrix of the first video block into a long-short term memory model for processing to obtain an attention matrix of the first video block;
and taking the second video block and the subsequent video blocks as the current video block, and inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into a long-short term memory model for processing to obtain the attention matrix of the current video block.
3. The method of claim 2, wherein inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into a long-term and short-term memory model for processing to obtain the attention matrix of the current video block comprises:
weighting and summing the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integrated feature matrix;
and inputting the integrated characteristic matrix into a long-term and short-term memory model for processing to obtain an attention matrix of the current video block.
4. The method of any one of claims 1 to 3, wherein the initial attention matrix and the feature stream matrix of the video block are input into a long-term and short-term memory model for processing to obtain the attention matrix of the video block, and further comprising:
obtaining the class probability of the current video block;
inputting the category probability into a classifier for processing to obtain the video block category of the current video block;
and determining the video category of the video to be identified according to the video block category of the video block.
5. A video analysis apparatus, comprising:
the single-frame characteristic determining module is used for inputting the video to be identified into the single-frame identification model to obtain the single-frame characteristics of the single-frame image in the video to be identified;
the video block dividing module is used for dividing the video to be identified into video blocks according to the frame length, the initial frame and the identification step length;
the characteristic flow matrix determining module is used for determining a characteristic flow matrix of each video block according to the single frame characteristics and the frame length of a single frame image included in each video block, wherein the characteristic flow matrix is used for representing the spatial characteristics of the video blocks;
the attention moment array determining module is used for inputting the initial attention matrix and the characteristic flow matrix of the video block into a long-short term memory model for processing to obtain an attention matrix of the video block, and the attention moment array is used for representing the time characteristic of the video block;
an attention vector determination module for determining an attention vector of the video to be identified according to an attention matrix of a video block,
wherein the attention vector determination module comprises:
the single-frame vector determining submodule is used for averaging the attention moment array of the video block where the single-frame image is located to obtain a single-frame vector of the single-frame image;
and the summation submodule is used for obtaining the attention vector of the video to be identified according to the single-frame vectors of all the single-frame images.
6. The apparatus of claim 5, wherein the attention matrix determination module comprises:
the initial attention matrix determining submodule is used for determining an initial attention matrix of the video block according to the feature width of the single frame features, the feature height of the single frame features and the frame length;
the first attention moment matrix determining submodule is used for inputting the initial attention matrix and the characteristic flow matrix of the first video block into a long-short term memory model for processing to obtain an attention matrix of the first video block;
and the subsequent attention moment matrix determining sub-module is used for taking the second video block and the subsequent video blocks as the current video block, and inputting the attention matrix of the previous video block and the feature stream matrix of the current video block into the long-short term memory model for processing to obtain the attention matrix of the current video block.
7. The apparatus of claim 6, wherein the subsequent attention matrix determination submodule comprises:
the integration submodule is used for weighting and summing the attention matrix of the previous video block and the feature stream matrix of the current video block to obtain an integration feature matrix;
and the long-short term memory model processing submodule is used for inputting the integrated characteristic matrix into the long-short term memory model for processing to obtain an attention matrix of the current video block.
8. The apparatus of any of claims 5 to 7, wherein the attention matrix determination module further comprises:
a category probability determination submodule for obtaining a category probability of the current video block;
the classifier submodule is used for inputting the class probability into a classifier for processing to obtain the video block class of the current video block;
and the video category determining submodule is used for determining the video category of the video to be identified according to the video block category of the video block.
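A hypothetical sketch of claim 8's classification path: a linear classifier with softmax yields each block's category probability, and the video category is taken as the argmax of the averaged block probabilities (the aggregation rule is assumed; the claim only states that the video category follows from the block categories):

```python
import torch
import torch.nn as nn

def classify_blocks(block_features, classifier):
    # category probability submodule + classifier submodule, per video block
    block_probs = [torch.softmax(classifier(f), dim=-1) for f in block_features]
    block_cats = [int(p.argmax()) for p in block_probs]
    # video category submodule: aggregate block probabilities (assumed rule)
    video_cat = int(torch.stack(block_probs).mean(dim=0).argmax())
    return block_cats, video_cat

# usage sketch: 49-dim block features, 10 categories
classifier = nn.Linear(49, 10)
feats = [torch.randn(49) for _ in range(5)]
block_cats, video_cat = classify_blocks(feats, classifier)
```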
9. A video analysis apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 4.
10. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1 to 4.
CN201711243388.8A 2017-11-30 2017-11-30 Video analysis method and device capable of distinguishing key actions Active CN107944409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711243388.8A CN107944409B (en) 2017-11-30 2017-11-30 Video analysis method and device capable of distinguishing key actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711243388.8A CN107944409B (en) 2017-11-30 2017-11-30 Video analysis method and device capable of distinguishing key actions

Publications (2)

Publication Number Publication Date
CN107944409A CN107944409A (en) 2018-04-20
CN107944409B true CN107944409B (en) 2020-05-08

Family

ID=61947090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711243388.8A Active CN107944409B (en) 2017-11-30 2017-11-30 Video analysis method and device capable of distinguishing key actions

Country Status (1)

Country Link
CN (1) CN107944409B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694382B (en) * 2018-05-14 2022-03-25 电子科技大学 Soil pH classification method based on ultra-wideband radar sensor
CN110532833A * 2018-05-23 2019-12-03 北京国双科技有限公司 Video analysis method and device
CN108846332B (en) * 2018-05-30 2022-04-29 西南交通大学 CLSTA-based railway driver behavior identification method
CN110738070A * 2018-07-02 2020-01-31 中国科学院深圳先进技术研究院 Video-based behavior identification method and device, and terminal equipment
CN110969066B (en) * 2018-09-30 2023-10-10 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN109447164B 2018-11-01 2019-07-19 厦门大学 Motion behavior pattern classification method, system and device
CN109509484A * 2018-12-25 2019-03-22 科大讯飞股份有限公司 Method and device for predicting the cause of baby crying
CN110263916B (en) * 2019-05-31 2021-09-10 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and electronic device
CN110751030A (en) * 2019-09-12 2020-02-04 厦门网宿有限公司 Video classification method, device and system
CN110826475B (en) * 2019-11-01 2022-10-04 北京齐尔布莱特科技有限公司 Method and device for detecting near-duplicate video and computing equipment
CN111191537A (en) * 2019-12-19 2020-05-22 中译语通文娱科技(青岛)有限公司 Video analysis method
CN112437279B (en) * 2020-11-23 2023-06-09 方战领 Video analysis method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107273800A * 2017-05-17 2017-10-20 大连理工大学 Action recognition method using a convolutional recurrent neural network based on an attention mechanism
CN107341462A * 2017-06-28 2017-11-10 电子科技大学 Video classification method based on an attention mechanism

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Improving human action recognition by temporal attention; Zhikang Liu et al.; 2017 IEEE International Conference on Image Processing (ICIP); 2017-09-20; 870-874 *
Long-term Recurrent Convolutional Networks for Visual Recognition and Description; Jeff Donahue et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2016-09-01; 2625-2634 *
Long-Term Temporal Convolutions for Action Recognition; G. Varol et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2017-06-30; 1510-1517 *
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification; Zuxuan Wu et al.; MM '15: Proceedings of the 23rd ACM International Conference on Multimedia; 2015-10-31; 461-470 *
Real-time Action Recognition with Enhanced Motion Vector CNNs; Bowen Zhang et al.; Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2016-06-30; 2718-2726 *
A survey of human action recognition algorithms based on deep learning; Zhu Yu et al.; Acta Automatica Sinica; 2016-06-30; Vol. 42, No. 6; 848-857 *

Also Published As

Publication number Publication date
CN107944409A (en) 2018-04-20

Similar Documents

Publication Publication Date Title
CN107944409B (en) Video analysis method and device capable of distinguishing key actions
TWI781359B (en) Face and hand association detection method and device, electronic device and computer-readable storage medium
CN109257645B (en) Video cover generation method and device
CN110287874B (en) Target tracking method and device, electronic equipment and storage medium
CN108985176B (en) Image generation method and device
CN109697734B (en) Pose estimation method and device, electronic equipment and storage medium
CN107692997B (en) Heart rate detection method and device
CN109165738B (en) Neural network model optimization method and device, electronic device and storage medium
CN110858924B (en) Video background music generation method and device and storage medium
CN109543537B Re-identification model incremental training method and device, electronic equipment and storage medium
CN109145970B (en) Image-based question and answer processing method and device, electronic equipment and storage medium
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN106534951B (en) Video segmentation method and device
CN107147936B (en) Display control method and device for barrage
CN111523346B (en) Image recognition method and device, electronic equipment and storage medium
CN110781813A (en) Image recognition method and device, electronic equipment and storage medium
CN106599191B (en) User attribute analysis method and device
CN108174269B (en) Visual audio playing method and device
CN110633715B (en) Image processing method, network training method and device and electronic equipment
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN110955800A (en) Video retrieval method and device
CN109756783B (en) Poster generation method and device
CN111311588B (en) Repositioning method and device, electronic equipment and storage medium
CN109635926B (en) Attention feature acquisition method and device for neural network and storage medium
CN110121115B Method and device for determining highlight video clips

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant