CN113361344A - Video event identification method, device, equipment and storage medium - Google Patents

Video event identification method, device, equipment and storage medium

Info

Publication number
CN113361344A
Authority
CN
China
Prior art keywords
video
event
attributes
key frame
graph
Prior art date
Legal status
Granted
Application number
CN202110559992.1A
Other languages
Chinese (zh)
Other versions
CN113361344B (en)
Inventor
汪琦
冯知凡
柴春光
朱勇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110559992.1A priority Critical patent/CN113361344B/en
Publication of CN113361344A publication Critical patent/CN113361344A/en
Application granted granted Critical
Publication of CN113361344B publication Critical patent/CN113361344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video event identification method, apparatus, device, and storage medium, relating to the field of computer technology and, in particular, to the technical fields of knowledge graphs, deep learning, computer vision, and the like. The specific implementation scheme is as follows: a video of an event to be identified is acquired; a key frame image is extracted from the video, and a predicted action of the video is determined according to the key frame image; a corresponding action map is queried according to the predicted action to determine a plurality of attributes of the predicted action; attribute values corresponding to the attributes are determined; an event graph is constructed according to the predicted action, the attributes, and the attribute values corresponding to the attributes; and the event graph is matched against each event in a pre-constructed video event graph, with the event obtained by matching taken as the event related to the video. In this way, the event related to the video is determined through the constructed event graph and the pre-constructed video event graph, which improves the accuracy of video event identification.

Description

Video event identification method, device, equipment and storage medium
Technical Field
The application discloses a video event identification method, apparatus, device, and storage medium, and relates to the field of computer technology, in particular to the technical fields of knowledge graphs, image processing, deep learning, computer vision, and the like.
Background
With the explosion of video in the information age, video understanding has become an important technical requirement. One example is video event recognition (also referred to as video event understanding). The main idea of video event recognition is to understand video content in depth from the perspective of content understanding and to characterize the core elements of a video along multiple dimensions: it can recognize the scene in which the video takes place, the main actions that occur, the entities involved, and the specific events that happen, thereby providing a deep, structured description of the video content. It is gradually becoming an important technology in the field of video understanding.
At present, video event identification is still performed as content recognition on video images, which cannot accurately identify the events related to a video.
Disclosure of Invention
The application provides a video event identification method, a video event identification device, video event identification equipment and a storage medium.
According to an aspect of the present application, there is provided a video event recognition method, including:
acquiring a video of an event to be identified;
extracting a key frame image from the video to determine a prediction action of the video according to the key frame image;
inquiring a corresponding action map according to the predicted action, and determining a plurality of attributes of the predicted action;
determining attribute values corresponding to the attributes according to the predicted action, the attributes and the key frame image;
constructing an event graph according to the predicted action, the attributes and attribute values corresponding to the attributes;
and matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained by matching as the event related to the video.
According to another aspect of the present application, there is provided a video event recognition apparatus including:
the acquisition module is used for acquiring a video of an event to be identified;
the first determination module is used for extracting a key frame image from the video so as to determine a prediction action of the video according to the key frame image;
the second determining module is used for inquiring a corresponding action map according to the predicted action and determining a plurality of attributes of the predicted action;
a third determining module, configured to determine attribute values corresponding to the plurality of attributes according to the predicted motion, the plurality of attributes, and the key frame image;
the construction module is used for constructing an event graph according to the predicted action, the attributes and attribute values corresponding to the attributes;
and the processing module is used for matching the event graph with each event in a pre-constructed video event graph and taking the event obtained by matching as the event related to the video.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video event recognition method of the above embodiments.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the video event recognition method according to the above-described embodiment.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video event recognition method of the above-described embodiments.
According to the technology of the application, the problem that the video-related events cannot be accurately identified in the related technology is solved, and the accuracy of video event identification is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a video event recognition method according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of an action map corresponding to a predicted action provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an example of identifying key frame images according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 6 is an exemplary diagram of determining attribute values according to an embodiment of the present application;
fig. 7 is a schematic flowchart of an event graph generating method according to an embodiment of the present application;
fig. 8 is a schematic flowchart of another video event recognition method according to an embodiment of the present application;
FIG. 9 is an exemplary diagram of an embedded representation of a graph provided by an embodiment of the present application;
fig. 10 is an exemplary diagram of video event recognition provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video event recognition apparatus according to an embodiment of the present application;
fig. 12 is a block diagram of an electronic device for implementing a video event recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Video event recognition currently adopts one of the following three schemes:
(1) Manual labeling: the content of video events is identified through manual annotation using crowdsourcing. However, identifying video events by manual annotation consumes a great deal of labor, and suffers from low efficiency, high cost, and similar drawbacks.
(2) Event labels assigned by video classification. However, such generic labels fall far short of a deep description of video content.
(3) Picture-content understanding: a single key frame of the video is modeled and its picture content is understood in order to characterize the video content. By using key frame information, video understanding is reduced to picture understanding; however, because the video is not modeled and understood globally, the events in the video cannot be accurately understood.
In order to solve these problems, the application provides a video event identification method: a video of an event to be identified is acquired; a key frame image is extracted from the video, and a predicted action of the video is determined according to the key frame image; a corresponding action map is queried according to the predicted action to determine a plurality of attributes of the predicted action; attribute values corresponding to the attributes are determined according to the predicted action, the attributes, and the key frame image; an event graph is constructed according to the predicted action, the attributes, and the attribute values corresponding to the attributes; and the event graph is matched against each event in a pre-constructed video event graph, with the event obtained by matching taken as the event related to the video.
Events can be divided into three main granularities:
(1) Coarse-grained action events: general scenarios such as weddings, fitness, and the like;
(2) Fine-grained action events: fine-grained actions such as reversing into a parking space, lifting, and the like;
(3) Specific events: concrete instances such as the Qingdao Beer Festival, and the like.
A video event recognition method, apparatus, device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a video event identification method according to an embodiment of the present application.
The embodiments of the present application are described with the video event recognition method configured in a video event recognition apparatus, which can be applied to any electronic device so that the electronic device can perform the video event recognition function.
The electronic device may be a personal computer (PC), a cloud device, a mobile device, and the like, and the mobile device may be a hardware device having various operating systems, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and an in-vehicle device.
As shown in fig. 1, the video event recognition method may include the following steps:
step 101, obtaining a video of an event to be identified.
In this embodiment of the application, the video of the event to be identified may be a video captured by a camera or other capture device, a video downloaded from a server, a locally stored video, or the like, which is not limited herein.
Step 102, extracting a key frame image from the video to determine a prediction action of the video according to the key frame image.
In the embodiment of the application, after the video of the event to be identified is acquired, the key frame image can be extracted from the video, so that the prediction action of the video is determined according to the key frame image.
It can be understood that, since many frame images in the video are not related to the motion of the video itself, in order to eliminate redundant information in the video and improve the efficiency of video motion prediction, in the present application, when the frame extraction processing is performed on the video, the key frame image can be extracted from the video, so as to determine the prediction motion of the video according to the key frame image.
Optionally, when the key frame image is extracted from the video, a shot-based extraction method, a motion analysis method, a video clustering method, and the like may be adopted, which is not limited herein, and any method that can extract the key frame image from the video is applicable to the present application.
In the embodiment of the present application, the key frame image is not limited to one frame, and may be extracted from a video to obtain a frame of key frame image, or may also be extracted from a video to obtain a plurality of frames of key frame images, which is not limited herein.
In a possible case, after a frame of key frame image is extracted from the video, the motion of the frame of key frame image is identified, and the motion of the key frame image is determined as the predicted motion of the video.
In another possible case, after multiple key frame images are extracted from the video, the motions of the key frame images can be recognized separately, so as to determine the predicted motion of the video according to the motions of the key frame images. For example, the motion that occurs most frequently among the motions corresponding to the key frame images may be used as the predicted motion of the video.
As an example, if 5 key frame images are extracted from a video, the 5 key frame images are subjected to motion recognition, and it is determined that 4 key frame images are motion a and 1 key frame image is motion B, the motion a can be used as a predicted motion of the video.
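A minimal sketch of how such a majority vote over the recognized key-frame actions could look (the function name and labels are illustrative, not part of the patent):

```python
from collections import Counter

def predict_video_action(key_frame_actions):
    """Pick the most frequent action among the recognized key-frame actions.

    key_frame_actions: list of action labels, one per key frame,
    e.g. ["A", "A", "A", "A", "B"].
    """
    counts = Counter(key_frame_actions)
    action, _ = counts.most_common(1)[0]
    return action

# With 4 key frames recognized as action A and 1 as action B,
# action A is returned as the predicted action of the video.
print(predict_video_action(["A", "A", "A", "A", "B"]))  # -> "A"
```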
Step 103, inquiring a corresponding action map according to the predicted action, and determining a plurality of attributes of the predicted action.
It can be understood that each action corresponds to a pre-constructed action map, and the action map includes an attribute corresponding to the action. In the embodiment of the application, after the predicted action of the video is determined, the action map corresponding to the predicted action can be acquired, and further, the action map is inquired to determine a plurality of attributes of the predicted action.
In the embodiment of the application, the action map may be a fixed, manually preset collection; the attributes corresponding to each action differ, and they constitute action-related knowledge constructed in advance by domain experts.
As an example, as shown in fig. 2, after a key frame image is extracted from a video, the predicted action of the video is determined to be a jump according to the key frame image, and by querying the action map corresponding to the action "jump", the action is determined to contain 5 attributes, namely performer, starting point, obstacle, destination, and place.
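For illustration, the action map can be thought of as a mapping from each action to its attribute slots. A minimal sketch, assuming a hand-built dictionary (the attribute names simply follow the examples in this document):

```python
# Hypothetical hand-built action map: each action is associated with the
# attribute slots that a domain expert defined for it in advance.
ACTION_MAP = {
    "jump": ["performer", "starting point", "obstacle", "destination", "place"],
    "comb": ["performer", "object", "intention", "tool"],
}

def query_attributes(predicted_action):
    """Return the attribute slots of the predicted action, if known."""
    return ACTION_MAP.get(predicted_action, [])

print(query_attributes("jump"))
```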
And 104, determining attribute values corresponding to the attributes according to the predicted action, the attributes and the key frame image.
In the embodiment of the application, the predicted action of the video is determined, and after the plurality of attributes corresponding to the action are predicted, the attribute values corresponding to the plurality of attributes can be determined according to the predicted action, the plurality of attributes and the key frame image of the video.
Alternatively, the predicted motion of the video, the plurality of attributes, and the key frame image may be input into a trained attention model to determine attribute values corresponding to the attributes from the output of the model.
In one possible case, if the key frame image is a frame, the predicted motion of the video, the plurality of attributes and the key frame image may be input into a trained attention model to determine the attribute value corresponding to each attribute according to the output of the model.
In another possible case, the key frame images are multiple frames, and multiple frames of key frame images may be merged to obtain a merged key frame image, and then the predicted motion, multiple attributes, and the merged key frame image of the video are input into a trained attention model, so as to determine attribute values corresponding to the attributes according to the output of the model.
Continuing with the example of fig. 2, the predicted motion corresponding to the key frame image in fig. 2 is determined as a jump, 5 attributes may be encoded together with the jump, the encoded result may be input to the trained attention model together with the key frame image, and the attribute value corresponding to each attribute may be determined from the output of the model.
And 105, constructing an event graph according to the predicted action, the attributes and the attribute values corresponding to the attributes.
An event graph is a representation of an event composed of nodes.
In the embodiment of the application, after the predicted action, the plurality of attributes and the attribute values corresponding to the attributes of the video are determined, the event graph can be constructed according to the predicted action, the plurality of attributes and the attribute values corresponding to the attributes.
And 106, matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained by matching as a video-related event.
Any event in the video event map comprises a predicted action of the event, a plurality of attributes and attribute values corresponding to the attributes.
In the embodiment of the application, after the event graph corresponding to the video event is constructed, the event graph can be matched with each event in the video event graph constructed in advance, and the event obtained through matching is used as the video-related event.
According to the video event identification method, after a video of an event to be identified is obtained, a key frame image is extracted from the video, action prediction is conducted on the video based on the key frame image, a corresponding action map is inquired according to the predicted action, a plurality of attributes of the predicted action are determined, further attribute values corresponding to the attributes are determined, an event map is constructed according to the predicted action, the attributes and the attribute values corresponding to the attributes, the event map is matched with each event in a pre-constructed video event map, and the event obtained through matching is used as an event related to the video. Therefore, a plurality of attributes of the predicted action are determined based on the action map corresponding to the predicted action query, and after the attribute value corresponding to each attribute is determined, the video-related event is determined through the constructed event map and the video event map constructed in advance, so that the accuracy of video event identification is improved.
Based on the foregoing embodiment, multiple frames of video frame images may be extracted from a video, and a predicted action of the video is determined according to an action of at least one determined key frame image in the multiple frames of video frames, which is described in detail below with reference to fig. 3, where fig. 3 is a schematic flowchart of another video event identification method provided in this embodiment of the present application.
As shown in fig. 3, the video event recognition method may include the following steps:
step 301, acquiring a video of an event to be identified.
In the embodiment of the present application, the implementation process of step 301 may refer to the implementation process of step 101 in the foregoing embodiment, and is not described herein again.
Step 302, performing frame extraction processing on the video to obtain a plurality of frames of video frame images.
In the embodiment of the application, after the video of the event to be identified is acquired, frame extraction processing can be performed on the video, so that a plurality of frames of video frame images can be extracted from the video.
Alternatively, a plurality of frames of video frame images can be extracted from the video by extracting at certain intervals. For example, a plurality of frames of video frame images are extracted from all images of the video every 3 frames.
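A minimal sketch of such interval-based frame extraction; the use of OpenCV for decoding is an assumption, and any video decoder would work:

```python
import cv2  # assumption: OpenCV is used for decoding

def extract_frames(video_path, interval=3):
    """Extract one frame every `interval` frames from the video."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```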
Step 303, determining at least one key frame image from the plurality of video frame images.
It can be understood that, in order to improve the accuracy of motion recognition of the video, after extracting multiple frames of video frame images from the video, at least one frame of key frame image can be determined from the multiple frames of video frame images, and the video frame images which are difficult to classify are removed, so as to contribute to improving the accuracy of motion recognition.
As a possible implementation manner of the embodiment of the application, feature extraction is performed on a video frame image of a video to obtain video frame image features, feature fusion is performed on the video frame image features and global image features of the video to obtain fused image features, and the fused image features are input into a classifier to perform key frame identification to obtain at least one frame of key frame image. Therefore, at least one frame of key frame image is determined based on the video frame image characteristics and the global image characteristics of the video, so that redundant or difficultly classified images in the video frame image are screened out, and the accuracy of video motion identification is improved.
Alternatively, MLP (Multi-Layer Perceptron) may be used to perform feature extraction on the video frame image to obtain the video frame image features.
As an example, as shown in FIG. 4, assuming that N video frame images are extracted from the video of the event to be identified, an MLP may be used to perform feature extraction on each video frame image X_i to obtain the image features of each frame; the image features of each frame are then fused with the global image features of the video to obtain fused image features, the fused image features are input into a classifier, and key frame identification is performed by the classifier to obtain at least one key frame image.
Alternatively, CNN (Convolutional Neural Network) extraction may be used to obtain image features.
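A minimal PyTorch-style sketch of the key-frame selector described above (per-frame MLP, fusion with the global video feature, and a binary classifier); the layer sizes, the fusion by concatenation, and the CNN feature dimension are assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class KeyFrameSelector(nn.Module):
    """Score each video frame feature as key frame / non-key frame."""

    def __init__(self, frame_dim=2048, global_dim=2048, hidden_dim=512):
        super().__init__()
        # MLP that maps per-frame features (e.g. CNN features) to a compact representation
        self.frame_mlp = nn.Sequential(
            nn.Linear(frame_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Classifier over the fused (frame + global) feature
        self.classifier = nn.Linear(hidden_dim + global_dim, 2)

    def forward(self, frame_feats, global_feat):
        # frame_feats: (N, frame_dim) features of the N extracted frames
        # global_feat: (global_dim,) global feature of the whole video
        h = self.frame_mlp(frame_feats)                      # (N, hidden_dim)
        g = global_feat.unsqueeze(0).expand(h.size(0), -1)   # (N, global_dim)
        fused = torch.cat([h, g], dim=-1)                    # feature fusion by concatenation
        return self.classifier(fused)                        # (N, 2) key-frame logits
```

Frames whose key-frame logit wins would be kept as the key frame images.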
Step 304, determining a predicted action of the video according to the action of the at least one frame of key frame image.
In the embodiment of the application, after at least one key frame image is determined from a plurality of frames of video frame images, the at least one key frame image is subjected to action recognition, so that the predicted action of the video is determined according to the action of the at least one key frame image.
In a possible case, after a frame of key frame image is extracted from the video, the motion of the frame of key frame image is identified, and the motion of the key frame image is determined as the predicted motion of the video.
As an example, it is assumed that a frame key frame image is extracted from a video, motion recognition is performed on the frame key frame image, and the motion of the frame key frame image is determined to be "jumping", and then the predicted motion of the video is determined to be "jumping".
In another possible case, after multiple key frame images are extracted from the video, the motions of the key frame images can be recognized separately, so as to determine the predicted motion of the video according to the motions of the key frame images. For example, the motion that occurs most frequently among the motions corresponding to the key frame images may be used as the predicted motion of the video.
As a possible implementation manner, feature extraction may be performed on at least one frame of key frame image, and the at least one frame of key frame image is input into a trained motion recognition model, so as to determine a motion of the at least one key frame image according to an output of the model.
Step 305, querying a corresponding action map according to the predicted action, and determining a plurality of attributes of the predicted action.
Step 306, determining attribute values corresponding to the attributes according to the predicted action, the attributes and the key frame image.
Step 307, an event graph is constructed according to the predicted action, the plurality of attributes and the attribute values corresponding to the plurality of attributes.
And 308, matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained by matching as a video-related event.
In the embodiment of the present application, the implementation process of step 305 to step 308 may refer to the implementation process of step 103 to step 106 in the above embodiment, and is not described herein again.
In the embodiment of the application, after the frame extraction processing is performed on the video to obtain the multi-frame video frame images, at least one key frame image is determined from the multi-frame video frame images, and the prediction action of the video is determined according to the action of the at least one key frame image. Therefore, the key frame images of the video are extracted, the motion recognition is carried out on the key frame images, the predicted motion of the video is determined according to the motion obtained through recognition, and therefore the accuracy and the efficiency of the video motion recognition are improved.
Based on the foregoing embodiment, attribute values corresponding to the attributes may be determined according to the predicted motion, the attributes, and the key frame image, which is described in detail below with reference to fig. 5, where fig. 5 is a flowchart of another video event identification method provided in this embodiment of the present application.
As shown in fig. 5, the video event recognition method may include the steps of:
step 501, obtaining a video of an event to be identified.
Step 502, extracting a key frame image from the video to determine a predicted action of the video according to the key frame image.
Step 503, querying a corresponding action map according to the predicted action, and determining a plurality of attributes of the predicted action.
In the embodiment of the present application, the implementation process of step 501 to step 503 may refer to the implementation process of step 101 to step 103 in the above embodiment, and is not described herein again.
Step 504, the predicted actions and the attributes are encoded separately to obtain a plurality of encoding results.
In the embodiment of the application, after the predicted action of the video and the attributes of the predicted action are determined, the predicted action and the attributes can be coded together respectively to obtain a plurality of coding results.
As an example, as shown in fig. 6, motion recognition is performed on the key frame image in fig. 6, the predicted action of the video is determined to be combing, the corresponding action map is queried according to the predicted action, and the attributes of the predicted action are determined to be performer, object, intention, and tool. As shown in fig. 6, the predicted action "combing" can then be encoded together with each of the attributes to obtain a plurality of encoding results.
And 505, performing feature extraction on the key frame image to obtain the extracted image features.
In the embodiment of the application, after the key frame image is extracted from the video of the event to be identified, the feature of the key frame image can be extracted to obtain the image feature of the key frame image.
Feature extraction is an important concept in computer vision and image processing: a computer is used to extract image information and determine whether each point of an image belongs to an image feature. The result of feature extraction is to separate the points of the image into distinct subsets, which often correspond to isolated points, continuous curves, or continuous regions. Image features include color features, texture features, shape features, and spatial-relationship features.
Optionally, the following feature extraction method may be adopted to perform feature extraction on the key frame image: fourier transform method, window Fourier transform method, wavelet transform method, least square method, histogram of boundary directions, texture feature extraction method based on Tamura texture features, and the like. The image feature extraction method in the present application is not limited to the above-described method, and any realizable image feature extraction method is applicable to the present application, and is not limited herein.
Step 506, inputting the plurality of coding results and the image features into the trained attention mechanism model respectively to obtain attribute values corresponding to the plurality of attributes.
In the embodiment of the application, after the prediction action and the attributes are respectively encoded to obtain a plurality of encoding results, the encoding results and the image features are respectively input into a trained attention mechanism model together to obtain attribute values corresponding to the attributes.
Continuing with the example of fig. 6, assuming that the encoding result obtained by encoding the predicted action "combing" together with the attribute "performer" is input into the attention mechanism model (Top-Down Attention) together with the image features, the attribute value corresponding to that attribute can be obtained as "girl".
As shown in fig. 6, an attribute value corresponding to an attribute can be obtained by concatenating the encoding result q with the image feature E to obtain f_a, applying a linear transformation to f_a to obtain w_a, normalizing w_a, combining the normalized result with the image features, and feeding the result into the TDA model.
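A minimal sketch of an attention head following the description above (concatenate the query encoding with the image features, apply a linear transformation, normalize, weight the image features, and predict the attribute value); the dimensions, the region-level features, and the classification head over a value vocabulary are assumptions, not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeValueHead(nn.Module):
    """Attend over key-frame features conditioned on an (action, attribute) query."""

    def __init__(self, query_dim=512, feat_dim=2048, vocab_size=1000):
        super().__init__()
        self.attn = nn.Linear(query_dim + feat_dim, 1)           # f_a -> attention logit w_a
        self.out = nn.Linear(query_dim + feat_dim, vocab_size)   # predicts the attribute value

    def forward(self, q, E):
        # q: (query_dim,) encoding of the predicted action together with one attribute
        # E: (R, feat_dim) image features of R regions of the key frame image
        q_exp = q.unsqueeze(0).expand(E.size(0), -1)
        f_a = torch.cat([q_exp, E], dim=-1)                # concatenate query and image features
        w_a = self.attn(f_a).squeeze(-1)                   # linear transformation
        alpha = F.softmax(w_a, dim=0)                      # normalization
        attended = (alpha.unsqueeze(-1) * E).sum(dim=0)    # attention-weighted image feature
        return self.out(torch.cat([q, attended], dim=-1))  # logits over attribute values
```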
Step 507, according to the predicted action, the attributes and the attribute values corresponding to the attributes, an event graph is constructed.
And step 508, matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained through matching as a video-related event.
In the embodiment of the present application, the implementation process of step 507 to step 508 may refer to the implementation process of step 105 to step 106 in the above embodiment, and is not described herein again.
According to the video event identification method, after a video of an event to be identified is obtained, a key frame image is extracted from the video, a prediction action of the video is determined according to the key frame image, a corresponding action map is inquired according to the prediction action, a plurality of attributes of the prediction action are determined, the prediction action and the attributes are respectively encoded to obtain a plurality of encoding results, feature extraction is carried out on the key frame image to obtain extracted image features, the encoding results and the image features are respectively input into a trained attention mechanism model to obtain attribute values corresponding to the attributes, an event map is constructed according to the prediction action, the attributes and the attribute values corresponding to the attributes, the event map is matched with each event in a pre-constructed video event map, and the event obtained through matching is used as a video related event. Therefore, the attribute values of the attributes are determined based on the attention mechanism model, the accuracy of generation of the attribute values of the predicted action is improved, and the accuracy of video event identification is improved.
Based on the above embodiment, after the predicted action, the plurality of attributes, and the attribute values corresponding to the attributes of the video are determined, an event graph is further constructed according to the predicted action, the plurality of attributes, and the attribute values corresponding to the attributes. Referring to fig. 7 for details, fig. 7 is a flowchart illustrating an event graph generating method according to an embodiment of the present disclosure.
As shown in fig. 7, the method may include the steps of:
step 701, the predicted action is taken as a first-layer node of the event graph.
In the embodiment of the application, after the predicted action of the video is determined, the predicted action can be used as a first-layer node of an event graph.
As an example, assuming that the predicted action of the video is a jump, the jump may be taken as a first level node of the event graph.
Step 702, connecting the plurality of attributes with the first layer of nodes respectively to serve as second layer of nodes.
In the embodiment of the application, after the action map is queried according to the predicted action and a plurality of attributes corresponding to the predicted action are determined, the plurality of attributes can be respectively connected with the first-layer node to serve as the second-layer node.
As an example, assuming that the predicted action is a jump, by querying the action map corresponding to the action "jump" it is determined that the action contains 5 attributes, namely performer, starting point, obstacle, destination, and place; these 5 attributes may then each be connected to the first-layer node as second-layer nodes.
And 703, connecting the attribute values corresponding to the attributes with the second-layer nodes corresponding to the attributes respectively to serve as third-layer nodes so as to obtain the event graph.
In the embodiment of the application, after the attribute values corresponding to the attributes are determined according to the predicted action, the attributes and the key frame image, the attribute values corresponding to the attributes can be respectively connected with the second-layer nodes corresponding to the attributes to serve as the third-layer nodes, so as to obtain the event graph.
Continuing with the example in step 702, if it is determined that the attribute values corresponding to the 5 attributes performer, starting point, obstacle, destination, and place are a jockey, the ground, a fence, the ground, and outdoors respectively, the attribute value corresponding to each attribute may be used as a third-layer node and connected to its corresponding attribute, so as to obtain the event graph corresponding to "jump".
In the embodiment of the application, the predicted action is used as a first-layer node of an event graph, the attributes are respectively connected with the first-layer node to be used as a second-layer node, the attribute values corresponding to the attributes are respectively connected with the second-layer node corresponding to each attribute to be used as a third-layer node, and therefore the event graph is obtained. Therefore, the event graph corresponding to the video of the event to be identified is generated, and the accuracy of the event identification result is improved.
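A minimal sketch of the three-layer event graph construction using plain Python structures; the representation as node and edge lists is an assumption chosen only for illustration:

```python
def build_event_graph(predicted_action, attribute_values):
    """Build the three-layer event graph as (nodes, edges).

    attribute_values: dict mapping each attribute to its attribute value,
    e.g. {"performer": "jockey", "starting point": "ground", ...}.
    """
    nodes = [predicted_action]            # layer 1: the predicted action
    edges = []
    for attr, value in attribute_values.items():
        nodes.append(attr)                # layer 2: attribute, connected to the action
        edges.append((predicted_action, attr))
        nodes.append(value)               # layer 3: attribute value, connected to its attribute
        edges.append((attr, value))
    return nodes, edges

nodes, edges = build_event_graph(
    "jump",
    {"performer": "jockey", "starting point": "ground",
     "obstacle": "fence", "destination": "ground", "place": "outdoors"},
)
```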
As a possible implementation manner of the embodiment of the present application, when the event graph is matched with each event in the pre-constructed event graph, the event associated with the video may be determined based on a similarity between the graph embedded representation corresponding to the event graph and the graph embedded representation of each event in the video event graph. Referring to fig. 8 for details, fig. 8 is a schematic flowchart of another video event recognition method according to an embodiment of the present disclosure.
As shown in fig. 8, the video event recognition method may further include the following steps:
step 801, obtaining graph embedding representation corresponding to the event graph.
As a possible implementation manner of the embodiment of the present application, an event Graph constructed according to a predicted action, a plurality of attributes, and attribute values corresponding to the attributes may be input into a GGNN (Gated Graph Neural Network), and a Graph embedding representation corresponding to the event Graph is determined according to an output of the GGNN Network. Because the GGNN has stronger expansibility, the graph embedded representation of the event can be determined according to the mutual information among the nodes.
It should be noted that the graph embedding representation method for obtaining the event graph is not limited to the GGNN network, and any other implementable manner may also be applied to the present application, and is not limited herein.
When representing an event as a graph embedding, the procedure is as follows:
First, an event graph G = (V, E) is constructed, in which each node v \in V stores a D-dimensional vector and each edge e \in E stores a D \times D matrix. The graph embedding representation of each node v is learned with the GGNN over multiple iterations, and the representation of the whole graph is finally obtained from the graph embedding representations of all nodes.
GGNN is a classical spatial-domain message-passing model built on the GRU (Gated Recurrent Unit). The general message-passing framework comprises three operations: a message-passing operation (M), an update operation (U), and a readout operation (R). As the following formulas show, the representation of node v at time t is determined by its representation at the previous time step, the representations of its neighbor nodes at the previous time step, and the information on the edges connecting them:
h_v^{(1)} = [x_v^\top, 0]^\top    (1)
a_v^{(t)} = A_v^\top [h_1^{(t-1)\top}, \ldots, h_{|V|}^{(t-1)\top}]^\top + b    (2)
z_v^{(t)} = \sigma(W^z a_v^{(t)} + U^z h_v^{(t-1)})    (3)
r_v^{(t)} = \sigma(W^r a_v^{(t)} + U^r h_v^{(t-1)})    (4)
\tilde{h}_v^{(t)} = \tanh(W a_v^{(t)} + U (r_v^{(t)} \odot h_v^{(t-1)}))    (5)
h_v^{(t)} = (1 - z_v^{(t)}) \odot h_v^{(t-1)} + z_v^{(t)} \odot \tilde{h}_v^{(t)}    (6)
In formula (1), h_v^{(1)} is the initial hidden vector of node v and is a D-dimensional vector; when the input feature x_v of the node has fewer than D dimensions, it is padded with zeros at the end.
In formula (2), A_v is formed by the two columns corresponding to node v selected from the matrix A in fig. 9(c), where A is of dimension D \times 2D and A_v of dimension 1 \times 2D; [h_1^{(t-1)}, \ldots, h_{|V|}^{(t-1)}] is the D-dimensional vector formed by the features of all nodes at time t-1, and a_v^{(t)}, a vector of dimension 1 \times 2D, represents the result of the interaction between the current node and its adjacent nodes through the edges. Both the incoming and the outgoing columns of A are used in the computation, so the result accounts for two-way information transfer (see fig. 9(b)).
In formulas (3) to (6), z_v^{(t)} controls how much past information is forgotten and r_v^{(t)} controls how much new information is generated. In formula (6), the first term (1 - z_v^{(t)}) \odot h_v^{(t-1)} selects which information is forgotten, the second term z_v^{(t)} \odot \tilde{h}_v^{(t)} selects which newly generated information is remembered, and h_v^{(t)} is the finally updated node state.
In the present application, when obtaining the graph-embedded representation corresponding to the event graph, N central nodes may be selected from each node in the event graph, where N is a positive integer greater than one and smaller than the number of nodes included in the event graph, and then, for any central node, the following processes may be performed, respectively: and acquiring neighborhood nodes of the central node, wherein the neighborhood nodes are nodes connected with the central node, determining vector representation corresponding to a sub-graph formed by the central node and the neighborhood nodes, and further inputting the obtained vector representation into the GGNN network so as to obtain graph embedded representation corresponding to the event graph.
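A minimal PyTorch sketch of GGNN-style propagation over the event graph followed by a simple readout into a graph embedding; the shared linear message functions and the mean readout are simplifications and assumptions (the approach described above stores a D × D matrix per edge and may aggregate sub-graph representations around selected central nodes):

```python
import torch
import torch.nn as nn

class SimpleGGNN(nn.Module):
    """Minimal gated graph neural network producing a graph-level embedding."""

    def __init__(self, dim=64, steps=5):
        super().__init__()
        self.steps = steps
        self.msg_in = nn.Linear(dim, dim)    # stands in for the per-edge matrices
        self.msg_out = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(2 * dim, dim)  # GRU-style update of formulas (3)-(6)
        self.readout = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h:   (|V|, dim) initial node vectors (zero-padded node features)
        # adj: (|V|, |V|) adjacency matrix of the event graph
        for _ in range(self.steps):
            a_in = adj.t() @ self.msg_in(h)        # messages along incoming edges
            a_out = adj @ self.msg_out(h)          # messages along outgoing edges
            a = torch.cat([a_in, a_out], dim=-1)   # two-way information transfer, formula (2)
            h = self.gru(a, h)                     # gated update of every node state
        # graph embedding: aggregate the final node states
        return self.readout(h).mean(dim=0)
```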
At step 802, a similarity between a graph-embedded representation corresponding to the event graph and a graph-embedded representation of each event in the video event graph is determined.
And step 803, determining the event with the highest similarity as the video-related event.
In the embodiment of the application, graph embedded representations corresponding to all events in a video event graph are obtained, so that the similarity between the graph embedded representations corresponding to the event graphs and the graph embedded representations corresponding to all events in the video event graph can be respectively calculated, and the event with the highest similarity can be used as a selected event, namely, an event related to a video of an event to be identified. Therefore, the video-related events are determined based on the graph embedding representation mode, and the accuracy of video event identification is improved.
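A minimal sketch of the matching step; the patent does not fix the similarity measure, so cosine similarity between graph embeddings is used here as an assumption:

```python
import torch
import torch.nn.functional as F

def match_event(event_embedding, event_graph_embeddings):
    """Return the event whose graph embedding is most similar to the video's event graph.

    event_graph_embeddings: dict mapping event name -> embedding tensor,
    assumed to be precomputed from the pre-constructed video event graph.
    """
    best_event, best_sim = None, float("-inf")
    for name, emb in event_graph_embeddings.items():
        sim = F.cosine_similarity(event_embedding, emb, dim=0).item()
        if sim > best_sim:
            best_event, best_sim = name, sim
    return best_event, best_sim
```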
Optionally, after determining the event whose graph embedding representation has the highest similarity to the graph embedding representation corresponding to the event graph, a secondary check may be performed on the attributes of that event to achieve multi-frame-level scenario matching and obtain the event associated with the video.
As an example, as shown in fig. 10, after a key frame is extracted from a video to be recognized and a key frame image is obtained, the motion of the key frame image is recognized, and a predicted motion of the video is determined as "riding a horse" according to the motion of the key frame image. The method comprises the steps of inquiring a corresponding action map according to the identified action to determine a plurality of attributes (such as a main body, a carrier, a place and a scene in fig. 10) corresponding to the action, determining attribute values corresponding to the attributes according to the predicted action, the attributes and a key frame image, extracting features of the key frame to obtain extracted image features, and constructing an event graph according to the predicted action, the attributes and the attribute values corresponding to the attributes. And matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained by matching as a video-related event.
In order to implement the above embodiments, the present application provides a video event recognition apparatus.
Fig. 11 is a schematic structural diagram of a video event recognition apparatus according to an embodiment of the present application.
As shown in fig. 11, the video event recognition apparatus 1100 may include: an acquisition module 1110, a first determination module 1120, a second determination module 1130, a third determination module 1140, a construction module 1150, and a processing module 1160.
The obtaining module 1110 is configured to obtain a video of an event to be identified.
A first determining module 1120, configured to extract a key frame image from the video, so as to determine a predicted action of the video according to the key frame image;
a second determining module 1130, configured to query a corresponding action map according to the predicted action, and determine a plurality of attributes of the predicted action;
a third determining module 1140, configured to determine attribute values corresponding to the plurality of attributes according to the predicted motion, the plurality of attributes, and the key frame image;
a building module 1150, configured to build an event graph according to the predicted action, the plurality of attributes, and attribute values corresponding to the plurality of attributes;
the processing module 1160 is configured to match the event graph with each event in a pre-constructed video event graph, and use the event obtained through matching as a video-related event.
As a possible scenario, the first determining module 1120 may further include:
the processing unit is used for performing frame extraction processing on the video to obtain a multi-frame video frame image;
a first determining unit configured to determine at least one key frame image from a plurality of frame video frame images;
and the second determining unit is used for determining the prediction action of the video according to the action of at least one frame of key frame image.
As another possible case, the first determining unit may be further configured to: performing feature extraction on the video frame image to obtain video frame image features; performing feature fusion on the video frame image features and the global image features of the video to obtain fused image features; and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
As another possible scenario, the third determining module 1140 may further be configured to:
respectively coding the predicted action and each attribute to obtain a plurality of coding results;
extracting the characteristics of the key frame image to obtain the extracted image characteristics;
and respectively inputting the plurality of coding results and the image characteristics into the trained attention mechanism model to obtain attribute values corresponding to the plurality of attributes.
As another possible scenario, the building block 1150 may also be configured to:
taking the predicted action as a first-layer node of the event graph;
connecting a plurality of attributes with the first layer nodes respectively to serve as second layer nodes;
and respectively connecting the attribute values corresponding to the attributes with the second-layer nodes corresponding to the attributes to obtain an event graph as third-layer nodes.
As another possible scenario, the processing module 1160 may be further configured to:
acquiring graph embedding representation corresponding to the event graph;
determining similarity between graph-embedded representations corresponding to the event graph and graph-embedded representations of the events in the video event graph;
and determining the event with the highest similarity as the video-related event.
It should be noted that the foregoing explanation on the embodiment of the video event recognition method is also applicable to the video event recognition apparatus of this embodiment, and details are not repeated here.
The identification device for the video event obtains the video of the event to be identified, extracts the key frame image from the video, predicts the action of the video based on the key frame image, queries the corresponding action map according to the predicted action, determines a plurality of attributes of the predicted action, further determines attribute values corresponding to the attributes, constructs an event map according to the predicted action, the attributes and the attribute values corresponding to the attributes, matches the event map with each event in the pre-constructed video event map, and takes the event obtained by matching as the video-related event. Therefore, a plurality of attributes of the predicted action are determined based on the action map corresponding to the predicted action query, and after the attribute value corresponding to each attribute is determined, the video-related event is determined through the constructed event map and the video event map constructed in advance, so that the accuracy of video event identification is improved.
In order to achieve the above embodiments, the present application proposes an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the identification method described in the above embodiments.
In order to achieve the above embodiments, the present application proposes a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the identification method described in the above embodiments.
In order to implement the above embodiments, the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements the identification method described in the above embodiments.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 12 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 12, the electronic apparatus includes: one or more processors 1201, a memory 1202, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 12 illustrates an example with one processor 1201.
Memory 1202 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the video event recognition methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the video event recognition method provided herein.
The memory 1202, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the video event recognition method in the embodiments of the present application (e.g., the obtaining module 1110, the first determining module 1120, the second determining module 1130, the third determining module 1140, the constructing module 1150, and the processing module 1160 shown in fig. 11). The processor 1201 implements the video event recognition method in the above-described method embodiments by executing non-transitory software programs, instructions, and modules stored in the memory 1202 to execute various functional applications of the server and data processing.
The memory 1202 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 1202 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1202 may optionally include memory located remotely from the processor 1201, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or in other manners; in Fig. 12, connection by a bus is taken as an example.
The input device 1203 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; such an input device may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, or a joystick. The output device 1204 may include a display device, an auxiliary lighting device (e.g., an LED), a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and is coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability in conventional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical solution of the embodiments of the present application, after the video of the event to be identified is obtained and a key frame image is extracted from it, the action of the video is predicted based on the key frame image, the corresponding action map is queried according to the predicted action to determine a plurality of attributes of the predicted action, and the attribute value corresponding to each attribute is then determined. An event graph is constructed from the predicted action, the attributes, and the corresponding attribute values; the event graph is matched against each event in the pre-constructed video event graph, and the matched event is taken as the event related to the video. Because the attributes of the predicted action are obtained from the action map queried for the predicted action, and the event is determined by matching the constructed event graph against the pre-constructed video event graph, the accuracy of video event identification is improved.
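For illustration only, the flow described above can be written as glue code in which every step is a placeholder callable. This is a sketch, not the disclosed implementation: the function names, signatures, and the dictionary stand-in for the event graph are assumptions of this example.

```python
from typing import Any, Callable, Dict, List

def recognize_video_event(
    video: Any,
    extract_key_frames: Callable[[Any], List[Any]],        # key frame extraction step
    predict_action: Callable[[List[Any]], str],            # action prediction step
    query_action_map: Callable[[str], List[str]],          # attributes queried from the action map
    infer_attribute_value: Callable[[str, str, List[Any]], str],
    match_event: Callable[[Dict[str, Any]], str],          # matching against the video event graph
) -> str:
    """Glue code mirroring the flow described above; each callable is a placeholder."""
    key_frames = extract_key_frames(video)
    action = predict_action(key_frames)
    attributes = query_action_map(action)
    values = {attr: infer_attribute_value(action, attr, key_frames) for attr in attributes}
    event_graph = {"action": action, "attributes": values}  # simplified stand-in for the event graph
    return match_event(event_graph)
```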
In the technical solutions of the present application, the collection, storage, and use of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
It should be understood that the various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (15)

1. A video event identification method, comprising:
acquiring a video of an event to be identified;
extracting a key frame image from the video to determine a predicted action of the video according to the key frame image;
inquiring a corresponding action map according to the predicted action, and determining a plurality of attributes of the predicted action;
determining attribute values corresponding to the attributes according to the predicted action, the attributes and the key frame image;
constructing an event graph according to the predicted action, the attributes and attribute values corresponding to the attributes;
and matching the event graph with each event in a pre-constructed video event graph, and taking the event obtained by matching as the event related to the video.
2. The identification method according to claim 1, wherein the extracting a key frame image from the video to determine the predicted action of the video according to the key frame image comprises:
performing frame extraction processing on the video to obtain a plurality of frames of video frame images;
determining at least one key frame image from the plurality of frames of video frame images;
determining a predicted action of the video based on the action of the at least one key frame image.
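As a hedged illustration of the frame extraction step in claim 2, the following sketch samples frames at a fixed interval; the sampling interval and the use of OpenCV are assumptions of this example, not requirements of the claim.

```python
import cv2  # OpenCV is assumed here only for decoding; any frame reader would do

def sample_video_frames(video_path: str, every_n: int = 10):
    """Return every n-th decoded frame as the candidate video frame images."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:            # end of video or read failure
            break
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```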
3. The identification method of claim 2, wherein said determining at least one key frame image from said plurality of video frame images comprises:
performing feature extraction on the video frame image to obtain video frame image features;
performing feature fusion on the video frame image features and the global image features of the video to obtain fused image features;
and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
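One hedged reading of claim 3 is sketched below in PyTorch: per-frame features are concatenated with a global video feature and scored by a binary classifier. The concatenation fusion, the feature dimensions, and the two-class head are assumptions of this sketch, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class KeyFrameClassifier(nn.Module):
    """Fuse per-frame features with a global video feature, then score each frame."""
    def __init__(self, frame_dim: int = 512, global_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Linear(frame_dim + global_dim, hidden)  # simple concatenation fusion
        self.head = nn.Linear(hidden, 2)                        # key frame vs. non key frame

    def forward(self, frame_feats: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, frame_dim); global_feat: (global_dim,)
        tiled = global_feat.unsqueeze(0).expand(frame_feats.size(0), -1)
        fused = torch.relu(self.fuse(torch.cat([frame_feats, tiled], dim=-1)))
        return self.head(fused)                                 # per-frame key-frame logits
```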
4. The identification method of claim 1, wherein the determining attribute values corresponding to the plurality of attributes according to the predicted action, the plurality of attributes and the key frame image comprises:
encoding the predicted action and each of the attributes respectively to obtain a plurality of encoding results;
performing feature extraction on the key frame image to obtain extracted image features;
and inputting the plurality of encoding results and the image features into a trained attention mechanism model to obtain the attribute values corresponding to the plurality of attributes.
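The attention step of claim 4 could, for example, use the encoded (predicted action, attribute) pair as the query and the key frame image features as keys and values. The layer choice (multi-head attention), the dimensions, and the candidate-value vocabulary below are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class AttributeValueAttention(nn.Module):
    """Attend from an encoded (action, attribute) query to key frame image features."""
    def __init__(self, dim: int = 256, num_candidate_values: int = 1000):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, num_candidate_values)   # logits over candidate attribute values

    def forward(self, action_attr_code: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # action_attr_code: (batch, 1, dim); image_feats: (batch, regions, dim)
        attended, _ = self.attn(action_attr_code, image_feats, image_feats)
        return self.score(attended.squeeze(1))               # one prediction per (action, attribute) pair
```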
5. The identification method of claim 1, wherein the constructing an event graph according to the predicted action, the plurality of attributes, and attribute values corresponding to the plurality of attributes comprises:
taking the predicted action as a first-layer node of an event graph;
connecting the attributes with the first layer nodes respectively to serve as second layer nodes;
and connecting the attribute values corresponding to the attributes with the second-layer nodes corresponding to the attributes respectively, as third-layer nodes, to obtain the event graph.
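A minimal sketch of the three-layer event graph of claim 5, using networkx purely as a convenience; the library choice and the node attributes are assumptions of this example.

```python
import networkx as nx  # any graph structure (even nested dicts) would serve

def build_event_graph(predicted_action: str, attribute_values: dict) -> nx.DiGraph:
    """Build action -> attributes -> attribute values as a three-layer directed graph."""
    graph = nx.DiGraph()
    graph.add_node(predicted_action, layer=1)          # first-layer node: predicted action
    for attribute, value in attribute_values.items():
        graph.add_node(attribute, layer=2)             # second-layer node: attribute
        graph.add_edge(predicted_action, attribute)
        graph.add_node(value, layer=3)                 # third-layer node: attribute value
        graph.add_edge(attribute, value)
    return graph

# e.g. build_event_graph("shooting a basketball", {"scene": "court", "subject": "player"})
```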
6. The identification method according to claim 1, wherein the matching the event graph with each event in a pre-constructed video event graph, and taking the matched event as the video-related event comprises:
acquiring a graph embedding representation corresponding to the event graph;
determining the similarity between the graph embedding representation corresponding to the event graph and the graph embedding representation of each event in the video event graph;
and determining the event with the highest similarity as the event related to the video.
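The matching step of claim 6 can be illustrated with cosine similarity over precomputed graph embeddings; how the embeddings themselves are produced (for example, by a graph neural network) is outside this sketch, and the names below are hypothetical.

```python
from typing import Dict
import numpy as np

def best_matching_event(query_embedding: np.ndarray,
                        event_embeddings: Dict[str, np.ndarray]) -> str:
    """Return the event whose graph embedding is most similar to the query embedding."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(event_embeddings, key=lambda event: cosine(query_embedding, event_embeddings[event]))
```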
7. A video event recognition apparatus comprising:
the acquisition module is used for acquiring a video of an event to be identified;
the first determining module is used for extracting a key frame image from the video to determine a predicted action of the video according to the key frame image;
the second determining module is used for inquiring a corresponding action map according to the predicted action and determining a plurality of attributes of the predicted action;
a third determining module, configured to determine attribute values corresponding to the plurality of attributes according to the predicted motion, the plurality of attributes, and the key frame image;
the construction module is used for constructing an event graph according to the predicted action, the attributes and attribute values corresponding to the attributes;
and the processing module is used for matching the event graph with each event in a pre-constructed video event graph and taking the event obtained by matching as the event related to the video.
8. The identification apparatus of claim 7, wherein the first determining module further comprises:
the processing unit is used for performing frame extraction processing on the video to obtain a plurality of frames of video frame images;
a first determining unit configured to determine at least one key frame image from the plurality of frames of video frame images;
a second determining unit for determining a predicted action of the video according to the action of the at least one frame of key frame image.
9. The identification apparatus of claim 8, wherein the first determination unit is further configured to:
performing feature extraction on the video frame image to obtain video frame image features;
performing feature fusion on the video frame image features and the global image features of the video to obtain fused image features;
and inputting the fused image features into a classifier for key frame identification to obtain at least one frame of key frame image.
10. The identification apparatus of claim 7, wherein the third determination module is further configured to:
encoding the predicted action and each of the attributes respectively to obtain a plurality of encoding results;
performing feature extraction on the key frame image to obtain extracted image features;
and inputting the plurality of encoding results and the image features into a trained attention mechanism model to obtain the attribute values corresponding to the plurality of attributes.
11. The identification apparatus of claim 7, wherein the building module is further configured to:
taking the predicted action as a first-layer node of an event graph;
connecting the attributes with the first layer nodes respectively to serve as second layer nodes;
and connecting the attribute values corresponding to the attributes with the second-layer nodes corresponding to the attributes respectively, as third-layer nodes, to obtain the event graph.
12. The identification apparatus of claim 7, wherein the processing module is further configured to:
acquiring a graph embedding representation corresponding to the event graph;
determining the similarity between the graph embedding representation corresponding to the event graph and the graph embedding representation of each event in the video event graph;
and determining the event with the highest similarity as the event related to the video.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the identification method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the identification method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the identification method of any one of claims 1-6.
CN202110559992.1A 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium Active CN113361344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110559992.1A CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110559992.1A CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113361344A true CN113361344A (en) 2021-09-07
CN113361344B CN113361344B (en) 2023-10-03

Family

ID=77526653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110559992.1A Active CN113361344B (en) 2021-05-21 2021-05-21 Video event identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113361344B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
WO2023236469A1 (en) * 2022-06-06 2023-12-14 深圳先进技术研究院 Video action recognition method and apparatus, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160007058A1 (en) * 2014-07-07 2016-01-07 TCL Research America Inc. System and method for video program recognition
US20170277955A1 (en) * 2016-03-23 2017-09-28 Le Holdings (Beijing) Co., Ltd. Video identification method and system
WO2019144856A1 (en) * 2018-01-24 2019-08-01 腾讯科技(深圳)有限公司 Video description generation method and device, video playing method and device, and storage medium
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI TAN et al.: "Deep Learning Video Action Recognition Method Based on Key Frame Algorithm", Springer *
Zhang Zhou; Wu Kewei; Gao Yang: "Action recognition by key frame extraction based on sequential verification", Intelligent Computer and Applications, no. 03
Wang Siqi; Hu Jingtao; Yu Guang; Zhu En; Cai Zhiping: "A survey of intelligent video anomaly event detection methods", Computer Engineering and Science, no. 08

Also Published As

Publication number Publication date
CN113361344B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
JP7429796B2 (en) Vehicle tracking methods, devices and electronic equipment
CN111783870B (en) Human body attribute identification method, device, equipment and storage medium
Huang et al. Deep learning driven visual path prediction from a single image
CN111931591B (en) Method, device, electronic equipment and readable storage medium for constructing key point learning model
KR20210040326A (en) Cross-modality processing method and apparatus, and computer storage medium
US20200394499A1 (en) Identifying complex events from hierarchical representation of data set features
CN111626119B (en) Target recognition model training method, device, equipment and storage medium
CN110914836A (en) System and method for implementing continuous memory bounded learning in artificial intelligence and deep learning for continuously running applications across networked computing edges
CN111523597B (en) Target recognition model training method, device, equipment and storage medium
CN112241764B (en) Image recognition method, device, electronic equipment and storage medium
CN110705460B (en) Image category identification method and device
CN111767379A (en) Image question-answering method, device, equipment and storage medium
CN111783646B (en) Training method, device, equipment and storage medium of pedestrian re-identification model
CN111967302A (en) Video tag generation method and device and electronic equipment
CN113361344B (en) Video event identification method, device, equipment and storage medium
JP7242994B2 (en) Video event identification method, apparatus, electronic device and storage medium
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
CN111815595A (en) Image semantic segmentation method, device, equipment and readable storage medium
CN112001248A (en) Active interaction method and device, electronic equipment and readable storage medium
Ren et al. Visual semantic segmentation based on few/zero-shot learning: An overview
CN112149741A (en) Training method and device of image recognition model, electronic equipment and storage medium
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN114386503A (en) Method and apparatus for training a model
Moghaddam et al. Jointly human semantic parsing and attribute recognition with feature pyramid structure in EfficientNets
Qin et al. Application of video scene semantic recognition technology in smart video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant