CN113361344B - Video event identification method, device, equipment and storage medium - Google Patents

Video event identification method, device, equipment and storage medium

Info

Publication number
CN113361344B
Authority
CN
China
Prior art keywords
video
event
attributes
key frame
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110559992.1A
Other languages
Chinese (zh)
Other versions
CN113361344A (en)
Inventor
汪琦
冯知凡
柴春光
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110559992.1A
Publication of CN113361344A
Application granted
Publication of CN113361344B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video event identification method, device, equipment and storage medium, relating to the field of computer technology and in particular to the technical fields of knowledge graphs, deep learning, computer vision and the like. The specific implementation scheme is as follows: obtain a video of an event to be identified; extract a key frame image from the video and determine the predicted action of the video according to the key frame image; query the action map corresponding to the predicted action to determine a plurality of attributes of the predicted action; determine the attribute values corresponding to the attributes; construct an event map according to the predicted action, the attributes and their corresponding attribute values; and match the event map with each event in a pre-constructed video event map, taking the matched event as the event associated with the video. In this way, the event associated with the video is determined through the constructed event map and the pre-constructed video event map, which improves the accuracy of video event identification.

Description

Video event identification method, device, equipment and storage medium
Technical Field
The application discloses a video event identification method, device, equipment and storage medium, and relates to the field of computer technology, in particular to the technical fields of knowledge graphs, image processing, deep learning, computer vision and the like.
Background
With the explosion of video in the information age, video understanding has become an important technical requirement. A typical example is video event recognition (also referred to as video event understanding), whose aim is to understand video content in depth from the perspective of content understanding and to describe the core elements of a video along multiple dimensions: it identifies the scene in which the video takes place, the main actions that occur, the entities involved and the specific event that happens, providing a structured, in-depth description of the video content. Video event recognition is therefore an important technology in the field of video understanding.
At present, video event identification is still based on recognizing the content of video images, so the events associated with a video cannot be accurately identified.
Disclosure of Invention
The application provides a video event identification method, a device, equipment and a storage medium.
According to an aspect of the present application, there is provided a video event recognition method, including:
acquiring a video of an event to be identified;
extracting a key frame image from the video to determine a prediction action of the video according to the key frame image;
determining a plurality of attributes of the predicted action according to the action map corresponding to the predicted action query;
determining attribute values corresponding to the plurality of attributes according to the prediction action, the plurality of attributes and the key frame image;
constructing an event diagram according to the predicted action, the plurality of attributes and attribute values corresponding to the plurality of attributes;
and matching the event map with each event in a pre-constructed video event map, and taking the event obtained by matching as the event associated with the video.
According to another aspect of the present application, there is provided a video event recognition apparatus including:
the acquisition module is used for acquiring videos of the events to be identified;
the first determining module is used for extracting a key frame image from the video so as to determine the prediction action of the video according to the key frame image;
the second determining module is used for determining a plurality of attributes of the predicted action according to the action pattern corresponding to the predicted action query;
a third determining module, configured to determine attribute values corresponding to the plurality of attributes according to the prediction action, the plurality of attributes, and the key frame image;
the construction module is used for constructing an event diagram according to the prediction action, the plurality of attributes and attribute values corresponding to the plurality of attributes;
and the processing module is used for matching the event map with each event in a pre-constructed video event map, and taking the event obtained by matching as the event associated with the video.
According to another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video event recognition method described in the above embodiments.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the video event recognition method according to the above-described embodiment.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video event recognition method described in the above embodiments.
According to the method and the device for identifying the video event, the problem that the video associated event cannot be accurately identified in the related technology is solved, and the accuracy of video event identification is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
fig. 1 is a flow chart of a video event recognition method according to an embodiment of the present application;
fig. 2 is an exemplary diagram of an action map corresponding to a predicted action according to an embodiment of the present application;
fig. 3 is a flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 4 is an exemplary diagram of identifying a key frame image according to an embodiment of the present application;
fig. 5 is a flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 6 is an exemplary diagram of determining attribute values provided by an embodiment of the present application;
fig. 7 is a flowchart of a method for generating an event map according to an embodiment of the present application;
fig. 8 is a flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 9 is an exemplary diagram of a graph embedded representation provided by an embodiment of the present application;
FIG. 10 is an exemplary diagram of video event recognition provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video event recognition device according to an embodiment of the present application;
fig. 12 is a block diagram of an electronic device for implementing a video event recognition method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The video event recognition technology is currently generally solved by adopting the following three schemes:
(1) Manual labeling: the content of a video event is identified through manual annotation, typically with crowdsourced labeling. However, identifying video events by manual annotation requires a great deal of labor and suffers from low efficiency, high cost and the like.
(2) Event label classification: event labels are assigned by way of video classification. However, common labels are far from sufficient to describe video content in depth.
(3) Picture-level understanding: a single key frame in the video is modeled on the basis of picture content understanding, and the picture content is understood in order to further characterize the video content. Using key frame information in this way turns the video understanding task into a picture understanding task; because the video is not modeled and understood globally, the events in the video cannot be accurately understood.
In view of the above problems, the present application provides a video event identification method. A video of an event to be identified is obtained, and a key frame image is extracted from the video in order to determine the predicted action of the video according to the key frame image. The action map corresponding to the predicted action is queried to determine a plurality of attributes of the predicted action, and the attribute values corresponding to the attributes are determined according to the predicted action, the attributes and the key frame image. An event map is then constructed according to the predicted action, the attributes and their attribute values; the event map is matched with each event in a pre-constructed video event map, and the matched event is taken as the event associated with the video.
Events can be largely divided into three granularities:
(1) Coarse-grained action events: general scenes such as weddings, working out and the like;
(2) Fine-grained action events: fine-grained actions such as reversing into a parking space, lifting and the like;
(3) Specific events: concrete instances, such as the Qingdao Beer Festival.
The following describes a video event recognition method, apparatus, device and storage medium according to embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a video event recognition method according to an embodiment of the present application.
The embodiment of the application is exemplified by the video event recognition method being configured in a video event recognition device, and the video event recognition device can be applied to any electronic equipment so that the electronic equipment can execute the video event recognition function.
The electronic device may be a personal computer (Personal Computer, abbreviated as PC), a cloud device, a mobile device, etc., and the mobile device may be a hardware device with various operating systems, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, a vehicle-mounted device, etc.
As shown in fig. 1, the video event recognition method may include the steps of:
step 101, acquiring a video of an event to be identified.
In the embodiment of the application, the video of the event to be identified may be a video shot by the acquisition device, a video downloaded from a server, a locally stored video, or the like, which is not limited herein.
Step 102, extracting a key frame image from the video to determine a prediction action of the video according to the key frame image.
In the embodiment of the application, after the video of the event to be identified is acquired, the key frame image can be extracted from the video so as to determine the prediction action of the video according to the key frame image.
It can be understood that, because many frame images in the video are irrelevant to the motion of the video, in order to eliminate redundant information in the video and improve the efficiency of video motion prediction, in the application, when the frame extraction processing is performed on the video, key frame images can be extracted from the video, so as to determine the prediction motion of the video according to the key frame images.
Alternatively, when the key frame image is extracted from the video, a shot-based extraction method, a motion analysis method, a video clustering method and the like may be used; this is not limited here, and any method that can extract a key frame image from the video is applicable to the present application.
In the embodiment of the application, the key frame image is not limited to one frame: a single key frame image may be extracted from the video, or multiple key frame images may be extracted, and no limitation is imposed here.
In one possible case, after a single key frame image is extracted from the video, action recognition is performed on that key frame image, and the action of the key frame image is determined as the predicted action of the video.
In another possible case, after multiple key frame images are extracted from the video, action recognition can be performed on each of them separately, so that the predicted action of the video is determined according to the actions of the multiple key frame images. For example, the action that accounts for the largest proportion among the actions corresponding to the key frame images can be taken as the predicted action of the video.
As an example, assuming that 5 key frame images are extracted from a video and action recognition is performed on them, if 4 of the key frame images are recognized as action A and 1 as action B, action A may be taken as the predicted action of the video.
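As a rough illustration of this majority-vote strategy, the sketch below (in Python) assumes a per-frame action classifier is available as a function recognize_action; both that function and the helper name are hypothetical and are not part of the patent's disclosed implementation.

    from collections import Counter

    def predict_video_action(key_frames, recognize_action):
        """Predict the video's action as the most frequent action among
        the actions recognized on its key frame images."""
        # recognize_action is assumed to map one key frame image to an action label
        actions = [recognize_action(frame) for frame in key_frames]
        most_common_action, _count = Counter(actions).most_common(1)[0]
        return most_common_action

    # Example: 4 frames recognized as "A" and 1 as "B" -> predicted action "A"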
And step 103, determining a plurality of attributes of the predicted action according to the action map corresponding to the predicted action query.
It can be understood that each action corresponds to a pre-constructed action map, and the action map includes attributes corresponding to the action. In the embodiment of the application, after the predicted action of the video is determined, the action pattern corresponding to the predicted action can be obtained, and then the action pattern is queried to determine a plurality of attributes of the predicted action.
In the embodiment of the application, the action maps can be fixed sets that are preset manually; the attributes corresponding to each action differ, and they are all action-related knowledge constructed by domain experts.
As an example, as shown in fig. 2, after a key frame image is extracted from a video, the predicted action of the video is determined to be "jump" according to the key frame image, and the action map corresponding to the action "jump" is queried to determine that this action contains 5 attributes, namely an executor, a starting point, an obstacle, a destination and a place.
And 104, determining attribute values corresponding to the attributes according to the predicted actions, the attributes and the key frame images.
In the embodiment of the application, after the prediction action of the video is determined, a plurality of attributes corresponding to the prediction action can be determined, and then the attribute values corresponding to the attributes can be determined according to the prediction action, the attributes and the key frame image of the video.
Optionally, the predicted actions of the video, the plurality of attributes, and the key frame images may be input into a trained attention model to determine attribute values corresponding to the attributes from the output of the model.
In one possible scenario, if the key frame image is a frame, the predicted motion of the video, the plurality of attributes, and the frame of key frame image may be input into a trained attention model to determine attribute values corresponding to the attributes from the output of the model.
In another possible case, the key frame image is a plurality of frames, and the plurality of frames of key frame images can be combined to obtain a combined key frame image, and then the prediction action, the plurality of attributes and the combined key frame image of the video are input into the trained attention model to determine attribute values corresponding to the attributes according to the output of the model.
Continuing with fig. 2 as an example, the predicted action corresponding to the key frame image in fig. 2 is determined to be "jump"; each of the 5 attributes is encoded together with "jump", and the encoding results are input, together with the key frame image, into a trained attention model to determine the attribute values corresponding to the attributes according to the output of the model.
Step 105, constructing an event map according to the predicted actions, the attributes and the attribute values corresponding to the attributes.
The event map refers to a representation of the event formed by all of its nodes.
In the embodiment of the application, after the predicted action of the video, the plurality of attributes and the attribute values corresponding to the attributes have been determined, an event map can be constructed according to the predicted action, the attributes and their corresponding attribute values.
And 106, matching the event map with each event in the pre-constructed video event map, and taking the event obtained by matching as the event of video association.
Any event in the video event map respectively comprises a prediction action of the event, a plurality of attributes and attribute values corresponding to the attributes.
In the embodiment of the application, after the event map corresponding to the video event is constructed, the event map can be matched with each event in the pre-constructed video event map, and the event obtained by matching is used as the event associated with the video.
According to the video event identification method, the video of the event to be identified is obtained, after the key frame image is extracted from the video, the motion prediction is carried out on the video based on the key frame image, the motion map corresponding to the predicted motion is inquired, the plurality of attributes of the predicted motion are determined, further, the attribute values corresponding to the plurality of attributes are determined, the event map is constructed according to the predicted motion, the plurality of attributes and the attribute values corresponding to the plurality of attributes, the event map is matched with each event in the pre-constructed video event map, and the event obtained through matching is used as the event associated with the video. Therefore, a plurality of attributes of the predicted action are determined based on the action map corresponding to the predicted action query, and after the attribute values corresponding to the attributes are determined, the event associated with the video is determined through the constructed event map and the pre-constructed video event map, so that the accuracy rate of video event identification is improved.
Based on the above embodiment, multiple frames of video frame images can be extracted from the video, and the prediction motion of the video is determined according to the motion of at least one frame of key frame image determined in the images in the multiple frames of video, which is described in detail below with reference to fig. 3, and fig. 3 is a flow chart of another video event recognition method provided in an embodiment of the present application.
As shown in fig. 3, the video event recognition method may include the steps of:
step 301, a video of an event to be identified is acquired.
In the embodiment of the present application, the implementation process of step 301 may refer to the implementation process of step 101 in the above embodiment, which is not described herein again.
Step 302, frame extraction processing is performed on the video to obtain multi-frame video frame images.
In the embodiment of the application, after the video of the event to be identified is obtained, frame extraction processing can be performed on the video so as to extract and obtain multi-frame video frame images from the video.
Alternatively, multiple video frame images may be extracted from the video by sampling frames at intervals. For example, frames may be sampled from all the images of the video at an interval of 3 frames to obtain the multi-frame video frame images.
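A minimal frame-sampling sketch is given below. It assumes OpenCV is available and uses a fixed interval of 3 frames; the function name and the interval value are illustrative assumptions only.

    import cv2

    def sample_frames(video_path, interval=3):
        """Extract video frame images at a fixed frame interval."""
        cap = cv2.VideoCapture(video_path)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % interval == 0:  # keep one frame every `interval` frames
                frames.append(frame)
            index += 1
        cap.release()
        return frames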
At step 303, at least one key frame image is determined from the multi-frame video frame images.
It can be understood that, in order to improve the accuracy of motion recognition of a video, after extracting multiple frames of video frame images from the video, at least one frame of key frame image can be determined from the multiple frames of video frame images, so that the accuracy of motion recognition can be improved by removing video frame images which are difficult to classify.
As a possible implementation manner of the embodiment of the application, feature extraction is performed on video frame images of a video to obtain video frame image features, feature fusion is performed on the video frame image features and global image features of the video to obtain fused image features, and the fused image features are input into a classifier to perform key frame recognition to obtain at least one frame of key frame image. Therefore, based on the image characteristics of the video frame and the global image characteristics of the video, at least one frame of key frame image is determined, and redundant or difficultly classified images in the video frame image are screened out, so that the accuracy of video action recognition is improved.
Alternatively, an MLP (Multi-Layer Perceptron) may be used to perform feature extraction on the video frame images to obtain the video frame image features.
As an example, as shown in fig. 4, assuming that N video frame images are extracted from the video of the event to be identified, an MLP may be used to extract features from each video frame image X_i to obtain the image features of each frame. The image features of each frame are then fused with the global image features of the video to obtain fused image features, the fused image features are input into a classifier, and key frame recognition is carried out by the classifier to obtain at least one key frame image.
Alternatively, a CNN (Convolutional Neural Network) may be employed to extract the image features.
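The following sketch shows one way such a key-frame selector could be assembled in PyTorch. The layer sizes, the use of mean pooling to form the global video feature, and the binary key-frame classifier are assumptions made for illustration, not the exact design disclosed in fig. 4.

    import torch
    import torch.nn as nn

    class KeyFrameSelector(nn.Module):
        def __init__(self, in_dim=2048, hid_dim=512):
            super().__init__()
            # per-frame MLP over CNN features
            self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, hid_dim))
            # classifies each fused feature as key frame / not key frame
            self.classifier = nn.Linear(2 * hid_dim, 2)

        def forward(self, frame_feats):           # frame_feats: (N, in_dim)
            per_frame = self.mlp(frame_feats)     # (N, hid_dim)
            global_feat = per_frame.mean(dim=0, keepdim=True)   # global video feature
            fused = torch.cat([per_frame, global_feat.expand_as(per_frame)], dim=-1)
            logits = self.classifier(fused)       # (N, 2)
            return logits.argmax(dim=-1)          # 1 marks a predicted key frame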
Step 304, determining a prediction action of the video according to the action of at least one frame of key frame image.
In the embodiment of the application, after at least one frame of key frame image is determined from multiple frames of video frame images, the at least one frame of key frame image is subjected to action recognition so as to determine the prediction action of the video according to the action of the at least one frame of key frame image.
In one possible case, after a single key frame image is extracted from the video, action recognition is performed on that key frame image, and the action of the key frame image is determined as the predicted action of the video.
As an example, assuming that a single key frame image is extracted from a video and action recognition is performed on it, if the action of that key frame image is determined to be "jump", the predicted action of the video is determined to be "jump".
In another possible case, after multiple key frame images are extracted from the video, action recognition can be performed on each of them separately, so that the predicted action of the video is determined according to the actions of the multiple key frame images. For example, the action that accounts for the largest proportion among the actions corresponding to the key frame images can be taken as the predicted action of the video.
As one possible implementation, feature extraction may be performed on at least one frame of key frame images, and the at least one frame of key frame images is input into a trained motion recognition model to determine a motion of the at least one key frame image based on an output of the model.
Step 305, determining a plurality of attributes of the predicted actions according to the action patterns corresponding to the predicted action queries.
And step 306, determining attribute values corresponding to the attributes according to the predicted actions, the attributes and the key frame images.
Step 307, constructing an event map according to the predicted actions, the plurality of attributes and the attribute values corresponding to the plurality of attributes.
Step 308, matching the event map with each event in the pre-constructed video event map, and taking the event obtained by matching as the event of video association.
In the embodiment of the present application, the implementation process of step 305 to step 308 may refer to the implementation process of step 103 to step 106 in the above embodiment, and will not be described herein.
In the embodiment of the application, after frame extraction processing is carried out on a video to obtain multi-frame video frame images, at least one frame of key frame image is determined from the multi-frame video frame images, and the prediction action of the video is determined according to the action of the at least one frame of key frame image. Therefore, the key frame images of the video are extracted, and the action recognition is carried out on the key frame images, so that the predicted action of the video is determined according to the action obtained by recognition, and the accuracy and the efficiency of the video action recognition are improved.
Based on the above embodiments, the attribute values corresponding to the attributes may be determined according to the prediction actions, the attributes and the key frame images, and detailed description is given below with reference to fig. 5, where fig. 5 is a schematic flow chart of another video event recognition method according to an embodiment of the present application.
As shown in fig. 5, the video event recognition method may include the steps of:
step 501, a video of an event to be identified is acquired.
Step 502, extracting a key frame image from the video to determine a prediction action of the video according to the key frame image.
Step 503, determining a plurality of attributes of the predicted actions according to the action patterns corresponding to the predicted action queries.
In the embodiment of the present application, the implementation process of step 501 to step 503 may refer to the implementation process of step 101 to step 103 in the above embodiment, and will not be described herein.
And 504, respectively encoding the prediction action and each attribute to obtain a plurality of encoding results.
In the embodiment of the application, after the predicted action of the video and its plurality of attributes are determined, the predicted action can be encoded together with each attribute in turn to obtain a plurality of encoding results.
As an example, as shown in fig. 6, action recognition is performed on the key frame image in fig. 6, and the predicted action of the video is determined to be "comb hair"; the corresponding action map is queried according to the predicted action, and the attributes of the predicted action are determined to be an executor, an object, an intention and a tool, respectively. As can be seen from fig. 6, the predicted action "comb hair" may be encoded together with each of the attributes to obtain a plurality of encoding results.
And 505, extracting features of the key frame image to obtain extracted image features.
In the embodiment of the application, after the key frame image is extracted from the video of the event to be identified, the feature extraction can be performed on the key frame image so as to obtain the image feature of the key frame image.
Feature extraction is an important concept in computer vision and image processing; it uses a computer to extract image information and to determine whether each point of the image belongs to an image feature. The result of feature extraction is a division of the points of the image into different subsets, which often correspond to isolated points, continuous curves or continuous regions. Typical image features include color features, texture features, shape features and spatial relationship features.
Alternatively, the following feature extraction method may be used to perform feature extraction on the key frame image: fourier transform, windowed Fourier transform, wavelet transform, least squares, boundary direction histogram, texture feature extraction based on Tamura texture features, and the like. The image feature extraction method in the present application is not limited to the above-described method, and any image feature extraction method that can be implemented is applicable to the present application, and is not limited thereto.
And step 506, respectively inputting the plurality of coding results and the image characteristics into the trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes.
In the embodiment of the application, after the prediction action and the plurality of attributes are respectively encoded to obtain a plurality of encoding results, the plurality of encoding results are respectively input into a trained attention mechanism model together with image features to obtain attribute values corresponding to the plurality of attributes.
Continuing with the example of fig. 6, assuming that the encoding result obtained by encoding the predicted action "comb hair" together with the attribute "executor" is input into the attention mechanism model (Top-Down Attention) together with the image features, the attribute value corresponding to that attribute may be obtained as "girl".
As shown in fig. 6, the encoding result q and the image features E are concatenated to obtain fa; fa is linearly transformed to obtain Wa; Wa is normalized; and the normalized result, together with the image features, is fed into the TDA model, so that the attribute value corresponding to the attribute can be obtained.
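A sketch of this attention step is given below, assuming the query q is the encoding of the predicted action with one attribute and E is a set of region features of the key frame image; the layer shapes and the final classifier over attribute values are illustrative assumptions rather than the patent's exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownAttention(nn.Module):
        """Score each image region with the (action, attribute) query,
        normalize the scores, pool the regions, and predict the attribute value."""
        def __init__(self, query_dim, feat_dim, vocab_size):
            super().__init__()
            self.score = nn.Linear(query_dim + feat_dim, 1)        # fa -> Wa
            self.predict = nn.Linear(feat_dim + query_dim, vocab_size)

        def forward(self, q, E):                  # q: (query_dim,), E: (R, feat_dim)
            fa = torch.cat([q.expand(E.size(0), -1), E], dim=-1)   # connect q with E
            wa = self.score(fa).squeeze(-1)                        # linear transform
            alpha = F.softmax(wa, dim=0)                           # normalization
            attended = (alpha.unsqueeze(-1) * E).sum(dim=0)        # weighted region feature
            return self.predict(torch.cat([attended, q], dim=-1))  # attribute-value logits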
Step 507, constructing an event map according to the predicted actions, the plurality of attributes and attribute values corresponding to the plurality of attributes.
And step 508, matching the event map with each event in the pre-constructed video event map, and taking the event obtained by matching as the event of video association.
In the embodiment of the present application, the implementation process of step 507 to step 508 may refer to the implementation process of step 105 to step 106 in the above embodiment, which is not described herein.
According to the video event identification method, after a video of an event to be identified is acquired, a key frame image is extracted from the video, a predicted action of the video is determined according to the key frame image, a corresponding action map is inquired according to the predicted action, a plurality of attributes of the predicted action are determined, the predicted action and each attribute are respectively encoded to obtain a plurality of encoding results, feature extraction is carried out on the key frame image to obtain extracted image features, the plurality of encoding results and the image features are respectively input into a trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes, an event map is constructed according to the predicted action, the plurality of attributes and the attribute values corresponding to the plurality of attributes, the event map is matched with each event in a pre-constructed video event map, and the matched event is used as an event associated with the video. Therefore, the attribute values of all the attributes are determined based on the attention mechanism model, the accuracy of generating the attribute values of the prediction action is improved, and further, the accuracy of identifying the video event is improved.
On the basis of the embodiment, after the predicted actions, the plurality of attributes and the attribute values corresponding to the attributes of the video are determined, an event map is further constructed according to the predicted actions, the plurality of attributes and the attribute values corresponding to the attributes. Fig. 7 is a schematic flow chart of an event map generating method according to an embodiment of the present application.
As shown in fig. 7, the method may include the steps of:
in step 701, the predicted action is taken as a first layer node of the event graph.
In the embodiment of the application, after the predicted action of the video is determined, the predicted action can be used as the first layer node of the event diagram.
As an example, assuming the predicted action of the video is a jump, the jump may be taken as a first layer node of the event map.
In step 702, the plurality of attributes are respectively connected to the first-layer nodes as second-layer nodes.
In the embodiment of the application, after determining a plurality of attributes corresponding to the predicted action according to the predicted action query action map, the attributes can be respectively connected with the first layer node to serve as the second layer node.
As an example, assuming that the predicted action is a jump, by querying an action map corresponding to the action jump, determining that this action contains 5 attributes, namely, a executor, a starting point, an obstacle, a destination, and a place, the 5 attributes may be respectively connected to the first-layer nodes as the second-layer nodes.
Step 703, connecting the attribute values corresponding to the plurality of attributes with the second layer nodes corresponding to the attributes respectively, and using the second layer nodes as the third layer nodes to obtain the event graph.
In the embodiment of the application, after the attribute values corresponding to the attributes are determined according to the prediction action, the attributes and the key frame image, the attribute values corresponding to the attributes can be respectively connected with the second layer nodes corresponding to the attributes to be used as the third layer nodes so as to obtain the event graph.
Continuing with the example in step 702, if it is determined that the attribute values corresponding to the 5 attributes (executor, starting point, obstacle, destination and place) are "jockey", "land", "fence", "land" and "outdoor", respectively, then the attribute value of each attribute may be taken as a third-layer node and connected to the corresponding attribute, so as to obtain the event map corresponding to "jump".
In the embodiment of the application, the predicted action is used as a first layer node of the event graph, a plurality of attributes are respectively connected with the first layer node and used as a second layer node, and attribute values corresponding to the attributes are respectively connected with the second layer node corresponding to the attributes and used as a third layer node, so that the event graph is obtained. Therefore, the event map corresponding to the video of the event to be identified is generated, and the accuracy of the event identification result is improved.
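A minimal sketch of this three-layer construction, using networkx and assuming the predicted action, attributes and attribute values are already available as strings, might look as follows; the attribute names and values reuse the "jump" example above, and identical value strings simply share a node in this simplified version.

    import networkx as nx

    def build_event_graph(action, attribute_values):
        """Build the three-layer event graph:
        layer 1: predicted action; layer 2: attributes; layer 3: attribute values."""
        g = nx.DiGraph()
        g.add_node(action, layer=1)
        for attr, value in attribute_values.items():
            g.add_node(attr, layer=2)
            g.add_edge(action, attr)      # action -> attribute
            g.add_node(value, layer=3)
            g.add_edge(attr, value)       # attribute -> attribute value
        return g

    graph = build_event_graph("jump", {
        "executor": "jockey", "starting point": "land", "obstacle": "fence",
        "destination": "land", "place": "outdoor"})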
As a possible implementation manner of the embodiment of the present application, when the event map is matched with each event in the pre-constructed event map, the event associated with the video may be determined based on the similarity between the map embedded representation corresponding to the event map and the map embedded representation of each event in the video event map. Fig. 8 is a flowchart of another video event recognition method according to an embodiment of the present application.
As shown in fig. 8, the video event recognition method may further include the steps of:
step 801, obtaining a graph embedded representation corresponding to an event graph.
As a possible implementation manner of the embodiment of the present application, the event map constructed from the predicted action, the plurality of attributes and the attribute values corresponding to the attributes may be input into a GGNN (Gated Graph Neural Network), and the graph embedded representation corresponding to the event map may be determined according to the output of the GGNN. Because the GGNN has strong extensibility, the graph embedded representation of the event can be determined from the mutual information among the nodes.
The method for obtaining the graph embedded representation corresponding to the event map is not limited to the GGNN; any other feasible method may be applied to the present application, and no limitation is imposed here.
When representing an event as a graph embedded representation, the specific method is as follows:
First, an event graph G = (V, E) is constructed, where each node v ∈ V stores a D-dimensional vector and each edge e ∈ E stores a D×D-dimensional matrix. The GGNN is used to iteratively learn the graph embedded representation of each node v over several steps, and the representation of the whole graph is finally obtained from the graph embedded representations of all the nodes.
The GGNN is a classical spatial-domain message-passing model based on the GRU (Gated Recurrent Unit). The generic message-passing framework contains three operations: a message-passing operation (M), an update operation (U) and a readout operation (R). As the following formulas show, the graph embedded representation of node v at time t+1 is determined by the representation of node v at the current time, the current representations of its neighboring nodes, and the information of the edges connecting them:
h_v^(1) = [x_v^T, 0]^T    (1)
a_v^(t) = A_v^T [h_1^(t-1)T, ..., h_|V|^(t-1)T]^T + b    (2)
z_v^(t) = σ(W^z a_v^(t) + U^z h_v^(t-1))    (3)
r_v^(t) = σ(W^r a_v^(t) + U^r h_v^(t-1))    (4)
h̃_v^(t) = tanh(W a_v^(t) + U(r_v^(t) ⊙ h_v^(t-1)))    (5)
h_v^(t) = (1 - z_v^(t)) ⊙ h_v^(t-1) + z_v^(t) ⊙ h̃_v^(t)    (6)
In formula (1), h_v^(1), the initial hidden vector of node v, is a D-dimensional vector; when the input feature x_v of the node has a dimension smaller than D, it is padded with zeros.
In formula (2), A_v selects the two columns of the matrix A of fig. 9(c) corresponding to node v; A has dimension D×2D and A_v has dimension 1×2D. The concatenation [h_1^(t-1), ..., h_|V|^(t-1)] collects the D-dimensional features of all nodes at time t-1, and a_v^(t) is a 1×2D vector representing the result of the interaction between the current node and its neighboring nodes through the edges. Because both the incoming and outgoing columns of A are used in the computation, the result accounts for bi-directional information transfer (as in fig. 9(b)).
In formulas (3) to (6), z_v^(t) controls which information is forgotten and r_v^(t) controls which new information is generated. In formula (6), the first term (1 - z_v^(t)) ⊙ h_v^(t-1) selects which past information to forget, while the second term z_v^(t) ⊙ h̃_v^(t) selects which newly generated information to remember. h_v^(t) is then the final updated node state.
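Since formulas (3) to (6) are exactly the GRU gates, one GGNN propagation step can be sketched with a standard GRU cell, as below; the simple adjacency-matrix aggregation stands in for the A_v-based message computation of formula (2) and is an illustrative simplification, not the patent's exact implementation.

    import torch
    import torch.nn as nn

    class GGNNStep(nn.Module):
        """One GGNN propagation step: aggregate neighbor messages (formula (2)),
        then update each node state with a GRU (formulas (3)-(6))."""
        def __init__(self, dim):
            super().__init__()
            self.msg_in = nn.Linear(dim, dim, bias=False)    # incoming-edge transform
            self.msg_out = nn.Linear(dim, dim, bias=False)   # outgoing-edge transform
            self.gru = nn.GRUCell(2 * dim, dim)

        def forward(self, h, adj):               # h: (|V|, D), adj: (|V|, |V|)
            a_in = adj.t() @ self.msg_in(h)      # messages along incoming edges
            a_out = adj @ self.msg_out(h)        # messages along outgoing edges
            a = torch.cat([a_in, a_out], dim=-1) # bi-directional information
            return self.gru(a, h)                # gated update of every node state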
In the application, when obtaining the graph embedded representation corresponding to the event graph, N central nodes can be selected from all the nodes in the event graph, where N is a positive integer greater than one and less than the number of nodes contained in the event graph. Then, for each central node, the following processing can be performed: acquire the neighborhood nodes of the central node (the nodes connected to it), determine the vector representation corresponding to the subgraph formed by the central node and its neighborhood nodes, and input the obtained vector representations into the GGNN to obtain the graph embedded representation corresponding to the event graph.
Step 802, determining a similarity between a graph embedded representation corresponding to an event graph and a graph embedded representation of each event in a video event graph.
In step 803, the event with the highest similarity is determined as the video-associated event.
In the embodiment of the application, the graph embedded representation corresponding to each event in the video event map is obtained, so that the similarity between the graph embedded representation corresponding to the constructed event map and the graph embedded representation of each event in the video event map can be calculated respectively; the event with the highest similarity is taken as the selected event, that is, as the event associated with the video to be identified. In this way, the event associated with the video is determined based on graph embedded representations, which improves the accuracy of video event identification.
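A sketch of this matching step is shown below; cosine similarity is assumed here, since the embodiment does not fix a particular similarity measure, and the event names and embeddings are placeholders.

    import numpy as np

    def match_event(query_embedding, event_embeddings):
        """Return the event whose graph embedded representation is most similar
        (cosine similarity assumed) to that of the constructed event map."""
        best_event, best_score = None, -1.0
        q = query_embedding / np.linalg.norm(query_embedding)
        for event_name, emb in event_embeddings.items():
            score = float(q @ (emb / np.linalg.norm(emb)))
            if score > best_score:
                best_event, best_score = event_name, score
        return best_event, best_score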
Optionally, after the event whose graph embedded representation is most similar to that of the constructed event map has been determined, a secondary judgment can be made on the attributes of the event to realize matching at multiple frame levels, thereby obtaining the event associated with the video.
As an example, as shown in fig. 10, after a key frame is extracted from a video to be identified, a key frame image is obtained, motion recognition is performed on the key frame image, and a predicted motion of the video is determined as "riding" according to the motion of the key frame image. Inquiring a corresponding action map according to the identified action to determine a plurality of attributes (such as a main body, a carrier, a place and a scene in fig. 10) corresponding to the action, determining attribute values corresponding to the attributes according to the predicted action, the attributes and the key frame image, extracting features of the key frame to obtain extracted image features, and constructing an event map according to the predicted action, the attributes and the attribute values corresponding to the attributes. And matching the event map with each event in the pre-constructed video event map, and taking the event obtained by matching as the event of video association.
In order to implement the above embodiment, the present application proposes a video event recognition apparatus.
Fig. 11 is a schematic structural diagram of a video event recognition device according to an embodiment of the present application.
As shown in fig. 11, the video event recognition apparatus 1100 may include: acquisition module 1110, first determination module 1120, second determination module 1130, third determination module 1140, build module 1150, and processing module 1160.
The acquiring module 1110 is configured to acquire a video of an event to be identified.
A first determining module 1120, configured to extract a key frame image from the video, so as to determine a prediction action of the video according to the key frame image;
a second determining module 1130, configured to determine a plurality of attributes of the predicted action according to the action profile corresponding to the predicted action query;
a third determining module 1140, configured to determine attribute values corresponding to the plurality of attributes according to the predicted action, the plurality of attributes, and the key frame image;
a construction module 1150, configured to construct an event map according to the predicted actions, the plurality of attributes, and attribute values corresponding to the plurality of attributes;
and the processing module 1160 is configured to match the event map with each event in the pre-constructed video event map, and use the event obtained by matching as the event associated with the video.
As a possible scenario, the first determining module 1120 may further include:
the processing unit is used for performing frame extraction processing on the video to obtain multi-frame video frame images;
a first determining unit, configured to determine at least one frame of key frame image from multiple frames of video frame images;
and the second determining unit is used for determining the prediction action of the video according to the action of at least one frame of key frame image.
As another possible case, the first determining unit may be further configured to: extracting features of the video frame images to obtain video frame image features; feature fusion is carried out on the image features of the video frames and the global image features of the video, so that fused image features are obtained; and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
As another possible scenario, the third determining module 1140 may also be used to:
coding the prediction action and each attribute respectively to obtain a plurality of coding results;
extracting features of the key frame image to obtain extracted image features;
and respectively inputting the plurality of coding results and the image characteristics into a trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes.
As another possible scenario, build module 1150 may also be used to:
taking the predicted action as a first layer node of the event graph;
respectively connecting the plurality of attributes with the first layer of nodes to serve as second layer of nodes;
and respectively connecting the attribute values corresponding to the attributes with the second-layer nodes corresponding to the attributes to serve as third-layer nodes so as to obtain an event graph.
As another possible scenario, the processing module 1160 may also be used to:
acquiring a graph embedded representation corresponding to the event graph;
determining the similarity between the graph embedded representation corresponding to the event graph and the graph embedded representation of each event in the video event graph;
and determining the event with the highest similarity as the event associated with the video.
It should be noted that the foregoing explanation of the embodiment of the video event recognition method is also applicable to the video event recognition device of this embodiment, and will not be repeated herein.
According to the video event recognition device, the video of the event to be recognized is obtained, after the key frame image is extracted from the video, the motion prediction is carried out on the video based on the key frame image, the motion map corresponding to the predicted motion is inquired, the plurality of attributes of the predicted motion are determined, further, the attribute values corresponding to the plurality of attributes are determined, the event map is constructed according to the predicted motion, the plurality of attributes and the attribute values corresponding to the plurality of attributes, the event map is matched with each event in the pre-constructed video event map, and the matched event is used as the event associated with the video. Therefore, a plurality of attributes of the predicted action are determined based on the action map corresponding to the predicted action query, and after the attribute values corresponding to the attributes are determined, the event associated with the video is determined through the constructed event map and the pre-constructed video event map, so that the accuracy rate of video event identification is improved.
In order to achieve the above embodiments, the present application proposes an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the identification method described in the above embodiments.
In order to achieve the above-described embodiments, the present application proposes a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the identification method described in the above-described embodiments.
In order to achieve the above embodiments, the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements the identification method described in the above embodiments.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
As shown in fig. 12, there is a block diagram of an electronic device of a video event recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 12, the electronic device includes: one or more processors 1201, memory 1202, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 1201 is illustrated in fig. 12.
Memory 1202 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the video event recognition method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the video event recognition method provided by the present application.
The memory 1202 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the acquisition module 1110, the first determination module 1120, the second determination module 1130, the third determination module 1140, the construction module 1150, and the processing module 1160 shown in fig. 11) corresponding to the video event recognition method according to the embodiments of the present application. The processor 1201 performs various functional applications of the server and data processing, i.e., implements the video event recognition method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 1202.
Memory 1202 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory 1202 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1202 optionally includes memory remotely located relative to processor 1201, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or in other ways; in fig. 12, connection by a bus is taken as an example.
The input device 1203 may receive entered numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device. The output device 1204 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to the technical solution of the embodiments of the present application, a video of an event to be identified is obtained and a key frame image is extracted from the video; action prediction is then performed on the video based on the key frame image; the action map corresponding to the predicted action is queried to determine a plurality of attributes of the predicted action, and attribute values corresponding to the plurality of attributes are further determined; an event map is constructed according to the predicted action, the plurality of attributes, and the corresponding attribute values; and the event map is matched with each event in a pre-constructed video event map, with the event obtained by matching taken as the event associated with the video. In this way, the plurality of attributes of the predicted action are determined by querying the action map corresponding to the predicted action, and after the attribute values corresponding to the attributes are determined, the event associated with the video is determined through the constructed event map and the pre-constructed video event map, which improves the accuracy of video event identification.
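As a purely illustrative aid, and not part of the disclosed implementation, the Python sketch below walks through the same flow end to end. The stub functions, the toy action map, and the set-overlap matching are assumptions introduced only for readability; the actual scheme relies on learned models and on a pre-constructed video event map.

# Minimal, non-normative sketch of the pipeline described above.
# All names and the toy "models" below are hypothetical placeholders.

def extract_key_frames(video_frames):
    # Placeholder: pretend the middle frame is the key frame.
    return [video_frames[len(video_frames) // 2]]

def predict_action(key_frames):
    # Placeholder: a real system would run an action classifier here.
    return "cooking"

ACTION_MAP = {  # action -> attributes that an action map might define
    "cooking": ["agent", "tool", "location"],
}

def predict_attribute_values(action, attributes, key_frames):
    # Placeholder: a real system would use an attention model over image features.
    return {"agent": "person", "tool": "pan", "location": "kitchen"}

def build_event_graph(action, attr_values):
    # Action -> attributes -> attribute values, collapsed into a small dict here.
    return {"action": action, "attributes": attr_values}

def match_event(event_graph, candidate_events):
    # Placeholder matching: count shared (attribute, value) pairs.
    def score(candidate):
        return len(set(event_graph["attributes"].items()) & set(candidate["attributes"].items()))
    return max(candidate_events, key=score)

video_frames = ["frame_%d" % i for i in range(10)]   # stand-in for decoded frames
key_frames = extract_key_frames(video_frames)
action = predict_action(key_frames)
attributes = ACTION_MAP[action]
attr_values = predict_attribute_values(action, attributes, key_frames)
event_graph = build_event_graph(action, attr_values)

candidates = [
    {"action": "cooking", "attributes": {"agent": "person", "tool": "pan", "location": "kitchen"}},
    {"action": "repairing", "attributes": {"agent": "person", "tool": "wrench", "location": "garage"}},
]
print(match_event(event_graph, candidates)["action"])  # -> "cooking"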
In the technical solution of the present application, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of the relevant laws and regulations and do not violate public order and good customs.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (12)

1. A method of video event identification, comprising:
acquiring a video of an event to be identified;
extracting a key frame image from the video to determine a prediction action of the video according to the key frame image;
querying an action map corresponding to the predicted action, and determining a plurality of attributes of the predicted action according to the action map;
determining attribute values corresponding to the plurality of attributes according to the prediction action, the plurality of attributes and the key frame image;
taking the predicted action as a first layer node of an event graph, connecting the plurality of attributes respectively with the first layer node as second layer nodes, and connecting the attribute values corresponding to the plurality of attributes respectively with the second layer nodes corresponding to the respective attributes as third layer nodes, to obtain the event graph;
and matching the event map with each event in a pre-constructed video event map, and taking the event obtained by matching as the event associated with the video.
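Purely as an illustration of the three-layer structure recited in claim 1, the snippet below builds such a graph with networkx as one possible in-memory representation; the choice of library, the example action, and the attribute names are assumptions and are not taken from the claim.

# Hypothetical illustration of the three-layer event graph (action -> attributes -> values).
import networkx as nx

def build_event_graph(predicted_action, attributes, attribute_values):
    g = nx.DiGraph()
    g.add_node(predicted_action, layer=1)            # first layer node: the predicted action
    for attr in attributes:
        g.add_node(attr, layer=2)                    # second layer nodes: attributes
        g.add_edge(predicted_action, attr)
        g.add_node(attribute_values[attr], layer=3)  # third layer nodes: attribute values
        g.add_edge(attr, attribute_values[attr])
    return g

g = build_event_graph(
    "award ceremony",
    ["agent", "object", "location"],
    {"agent": "athlete", "object": "trophy", "location": "stadium"},
)
print(sorted(g.edges()))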
2. The identification method of claim 1, wherein the extracting key frame images from the video to determine the predicted action of the video from the key frame images comprises:
performing frame extraction processing on the video to obtain multi-frame video frame images;
determining at least one frame of key frame image from the multi-frame video frame images;
and determining the prediction action of the video according to the action of the at least one frame of key frame image.
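Claim 2 leaves open how the actions of one or more key frame images are turned into a single predicted action for the video. One simple, hypothetical choice is to average per-key-frame action scores and take the highest-scoring class, as sketched below; the action list and scores are invented for illustration only.

# Hypothetical aggregation of per-key-frame action scores into one video-level action.
import numpy as np

ACTIONS = ["cooking", "playing basketball", "award ceremony"]

def predict_video_action(key_frame_scores):
    # key_frame_scores: (num_key_frames, num_actions) per-frame action probabilities
    mean_scores = np.asarray(key_frame_scores).mean(axis=0)
    return ACTIONS[int(mean_scores.argmax())]

scores = [[0.1, 0.7, 0.2],   # key frame 1
          [0.2, 0.6, 0.2]]   # key frame 2
print(predict_video_action(scores))  # -> "playing basketball"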
3. The identification method of claim 2, wherein the determining at least one frame of key frame image from the plurality of frames of video frame images comprises:
performing feature extraction on the video frame images to obtain video frame image features;
performing feature fusion on the video frame image features and global image features of the video to obtain fused image features;
and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
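The following PyTorch-style sketch shows one way the steps of claim 3 could be wired together, under the assumptions that feature fusion is implemented as concatenation with a broadcast global feature, that the classifier is a small linear head, and that the global video feature is obtained by mean pooling; the dimensions and the top-k selection of key frames are likewise assumptions.

# Non-normative sketch: per-frame features fused with a global feature, then classified.
import torch
import torch.nn as nn

class KeyFrameClassifier(nn.Module):
    def __init__(self, frame_dim=512, global_dim=512):
        super().__init__()
        # Binary head: key frame vs. non-key frame.
        self.head = nn.Linear(frame_dim + global_dim, 2)

    def forward(self, frame_features, global_features):
        # frame_features: (num_frames, frame_dim); global_features: (global_dim,)
        fused = torch.cat(
            [frame_features, global_features.expand(frame_features.size(0), -1)],
            dim=1,
        )  # "feature fusion" assumed to be concatenation
        return self.head(fused).softmax(dim=1)  # per-frame key-frame probabilities

model = KeyFrameClassifier()
frame_feats = torch.randn(8, 512)        # stand-in features of 8 extracted video frames
global_feats = frame_feats.mean(dim=0)   # one simple choice of global video feature
probs = model(frame_feats, global_feats)
key_frame_indices = probs[:, 1].topk(k=2).indices  # keep the 2 most likely key frames
print(key_frame_indices)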
4. The identification method according to claim 1, wherein the determining attribute values corresponding to the plurality of attributes according to the prediction action, the plurality of attributes, and the key frame image includes:
coding the prediction action and each attribute respectively to obtain a plurality of coding results;
extracting features of the key frame images to obtain extracted image features;
and respectively inputting the plurality of coding results and the image features into a trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes.
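As a hypothetical sketch of claim 4, the snippet below lets an encoded (action, attribute) pair attend over key frame image features and classifies the attended feature into a candidate attribute value. The dimensions, the dot-product form of attention, and the fixed value vocabulary size are assumptions, not details taken from the disclosure.

# Non-normative attention sketch: (action, attribute) encoding attends over image features.
import torch
import torch.nn as nn

class AttributeValuePredictor(nn.Module):
    def __init__(self, enc_dim=256, img_dim=512, num_values=100):
        super().__init__()
        self.query_proj = nn.Linear(enc_dim, img_dim)
        self.classifier = nn.Linear(img_dim, num_values)

    def forward(self, pair_encoding, image_features):
        # pair_encoding: (enc_dim,) encoding of one (action, attribute) pair
        # image_features: (num_regions, img_dim) features extracted from the key frame
        query = self.query_proj(pair_encoding)                        # (img_dim,)
        scores = image_features @ query / image_features.size(1) ** 0.5
        weights = scores.softmax(dim=0)                               # attention weights
        attended = weights @ image_features                           # (img_dim,)
        return self.classifier(attended).argmax().item()              # index of predicted value

model = AttributeValuePredictor()
pair_enc = torch.randn(256)        # stand-in encoding of, e.g., ("cooking", "tool")
img_feats = torch.randn(36, 512)   # stand-in key-frame region features
print(model(pair_enc, img_feats))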
5. The identifying method according to claim 1, wherein the matching the event map with each event in a pre-constructed video event map, and using the event obtained by matching as the event associated with the video, includes:
obtaining a graph embedded representation corresponding to the event graph;
determining the similarity between the graph embedded representation corresponding to the event graph and the graph embedded representation of each event in the video event graph;
and determining the event with the highest similarity as the event associated with the video.
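As a non-normative illustration of claim 5, the snippet below compares a graph embedded representation of the constructed event graph against stored embeddings of the events in the video event map using cosine similarity and keeps the most similar event. The embeddings and event names are invented, cosine similarity is only one possible similarity measure, and how the graph embeddings themselves are produced is not shown.

# Hypothetical matching by cosine similarity between graph embeddings.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def most_similar_event(query_embedding, event_embeddings):
    # event_embeddings: {event name -> graph embedding of that event}
    return max(event_embeddings,
               key=lambda name: cosine_similarity(query_embedding, event_embeddings[name]))

query = [0.9, 0.1, 0.3]                       # embedding of the constructed event graph
events = {
    "award ceremony": [0.8, 0.2, 0.3],
    "traffic accident": [0.1, 0.9, 0.5],
}
print(most_similar_event(query, events))      # -> "award ceremony"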
6. A video event recognition device, comprising:
the acquisition module is used for acquiring videos of the events to be identified;
the first determining module is used for extracting a key frame image from the video so as to determine the prediction action of the video according to the key frame image;
the second determining module is used for querying an action map corresponding to the predicted action and determining a plurality of attributes of the predicted action according to the action map;
a third determining module, configured to determine attribute values corresponding to the plurality of attributes according to the prediction action, the plurality of attributes, and the key frame image;
the construction module is used for taking the prediction action as a first layer node of the event graph, connecting the plurality of attributes respectively with the first layer node to serve as second layer nodes, and connecting the attribute values corresponding to the plurality of attributes respectively with the second layer nodes corresponding to the respective attributes to serve as third layer nodes, so as to obtain the event graph;
and the processing module is used for matching the event map with each event in a pre-constructed video event map, and taking the event obtained by matching as the event associated with the video.
7. The identification device of claim 6, wherein the first determination module further comprises:
the processing unit is used for performing frame extraction processing on the video to obtain multi-frame video frame images;
a first determining unit, configured to determine at least one frame of key frame image from the multiple frames of video frame images;
and the second determining unit is used for determining the prediction action of the video according to the action of the at least one frame of key frame image.
8. The identification device of claim 7, wherein the first determination unit is further configured to:
performing feature extraction on the video frame images to obtain video frame image features;
performing feature fusion on the video frame image features and global image features of the video to obtain fused image features;
and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
9. The identification device of claim 6, wherein the third determination module is further configured to:
coding the prediction action and each attribute respectively to obtain a plurality of coding results;
extracting features of the key frame images to obtain extracted image features;
and respectively inputting the plurality of coding results and the image features into a trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes.
10. The identification device of claim 6, wherein the processing module is further configured to:
obtaining a graph embedded representation corresponding to the event graph;
determining the similarity between the graph embedded representation corresponding to the event graph and the graph embedded representation of each event in the video event graph;
and determining the event with the highest similarity as the event associated with the video.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the identification method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the identification method of any one of claims 1-5.
CN202110559992.1A 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium Active CN113361344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110559992.1A CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110559992.1A CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113361344A CN113361344A (en) 2021-09-07
CN113361344B true CN113361344B (en) 2023-10-03

Family

ID=77526653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110559992.1A Active CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113361344B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661953B (en) * 2022-03-18 2023-05-16 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN115188067A (en) * 2022-06-06 2022-10-14 深圳先进技术研究院 Video behavior identification method and device, electronic equipment and storage medium
CN117370602A (en) * 2023-04-24 2024-01-09 深圳云视智景科技有限公司 Video processing method, device, equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144856A1 (en) * 2018-01-24 2019-08-01 腾讯科技(深圳)有限公司 Video description generation method and device, video playing method and device, and storage medium
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9432702B2 (en) * 2014-07-07 2016-08-30 TCL Research America Inc. System and method for video program recognition
US20170277955A1 (en) * 2016-03-23 2017-09-28 Le Holdings (Beijing) Co., Ltd. Video identification method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144856A1 (en) * 2018-01-24 2019-08-01 腾讯科技(深圳)有限公司 Video description generation method and device, video playing method and device, and storage medium
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Deep Learning Video Action Recognition Method Based on Key Frame Algorithm";Li Tan 等;《Springer》;全文 *
基于顺序验证提取关键帧的行为识别;张舟;吴克伟;高扬;;智能计算机与应用(03);全文 *
智能视频异常事件检测方法综述;王思齐;胡婧韬;余广;祝恩;蔡志平;;计算机工程与科学(08);全文 *

Also Published As

Publication number Publication date
CN113361344A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN113361344B (en) Video event identification method, device, equipment and storage medium
Ou et al. Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes
WO2021238062A1 (en) Vehicle tracking method and apparatus, and electronic device
Garcia-Garcia et al. A review on deep learning techniques applied to semantic segmentation
CN111783870B (en) Human body attribute identification method, device, equipment and storage medium
CN112241764B (en) Image recognition method, device, electronic equipment and storage medium
Wang et al. RGB-D salient object detection via minimum barrier distance transform and saliency fusion
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
EP4156017A1 (en) Action recognition method and apparatus, and device and storage medium
CN110914836A (en) System and method for implementing continuous memory bounded learning in artificial intelligence and deep learning for continuously running applications across networked computing edges
CN111832568B (en) License plate recognition method, training method and device of license plate recognition model
CN110705460B (en) Image category identification method and device
CN111709873B (en) Training method and device for image conversion model generator
CN112001265B (en) Video event identification method and device, electronic equipment and storage medium
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
Zhou et al. Position-aware relation learning for rgb-thermal salient object detection
Fan Research and realization of video target detection system based on deep learning
Yao et al. Dynamicbev: Leveraging dynamic queries and temporal context for 3d object detection
CN111932530B (en) Three-dimensional object detection method, device, equipment and readable storage medium
Vaishali Real-time object detection system using caffe model
Qin et al. Application of video scene semantic recognition technology in smart video
US20210407184A1 (en) Method and apparatus for processing image, electronic device, and storage medium
Cai et al. Learning pose dictionary for human action recognition
CN111275110B (en) Image description method, device, electronic equipment and storage medium
Tang et al. Graph-based motion prediction for abnormal action detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant