CN112163525B - Event type prediction method and device, electronic equipment and storage medium - Google Patents

Event type prediction method and device, electronic equipment and storage medium

Info

Publication number
CN112163525B
CN112163525B (application CN202011050761.XA)
Authority
CN
China
Prior art keywords
video frame
feature
event type
feature map
sublayer
Prior art date
Legal status
Active
Application number
CN202011050761.XA
Other languages
Chinese (zh)
Other versions
CN112163525A (en)
Inventor
孙尚勇
Current Assignee
New H3C Security Technologies Co Ltd
Original Assignee
New H3C Security Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by New H3C Security Technologies Co Ltd filed Critical New H3C Security Technologies Co Ltd
Priority to CN202011050761.XA priority Critical patent/CN112163525B/en
Publication of CN112163525A publication Critical patent/CN112163525A/en
Application granted granted Critical
Publication of CN112163525B publication Critical patent/CN112163525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an event type prediction method, an event type prediction device, an electronic device and a storage medium, relating to the technical field of data processing. The method comprises the following steps: obtaining video frames acquired by an image acquisition device; extracting object features of the objects in each video frame, and fusing the extracted object features based on the context of the content in each video frame and the movement relationship of the objects between the video frames; and predicting, based on the fused object features, the event types of events that the image acquisition device can monitor. By applying the event type prediction scheme provided by the embodiment of the application, the efficiency of predicting the event types of events that an image acquisition device can monitor can therefore be improved.

Description

Event type prediction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an event type prediction method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of security technologies, image acquisition devices are more and more widely applied. Image acquisition devices at different monitoring points can be used to monitor events of different event types. Taking image acquisition devices in a traffic scene as an example, a device erected near a pedestrian crossing can monitor events such as vehicles failing to yield to pedestrians; a device erected near a traffic intersection can monitor events such as vehicles running red lights and pedestrians crossing against red lights; a device erected at the roadside of a mid-road section can monitor events such as lane changes across a solid line and illegal parking.
In the prior art, when predicting the event types of events that can be monitored by image acquisition devices at different monitoring points, a worker usually needs to observe the monitoring videos acquired by the devices and manually determine the event types of the events to be monitored by each device. Because the number of image acquisition devices at different monitoring points is huge, manually predicting the event types of the events they monitor is inefficient.
Disclosure of Invention
An object of the embodiments of the present application is to provide an event type prediction method, an event type prediction apparatus, an electronic device, and a storage medium, so as to improve the efficiency of predicting the event types of events that can be monitored by an image acquisition device. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present application provides an event type prediction method, where the method includes:
obtaining a video frame acquired by image acquisition equipment;
extracting object features of objects in each video frame, and fusing the extracted object features based on the context relationship of contents in each video frame and the moving relationship of the objects between the video frames;
and predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics.
In an embodiment of the present application, the extracting object features of objects in each video frame, and fusing the extracted object features based on a context of content in each video frame and a moving relationship of the objects between the video frames includes:
extracting object features of objects in each video frame, and combining the object features corresponding to each video frame in a channel dimension according to the acquisition sequence of each video frame to obtain a first feature map;
fusing the features of different channels in the first feature map to obtain a second feature map, wherein the number of the channels of the second feature map is less than that of the channels of the first feature map;
performing dimension conversion on the second feature map to obtain a one-dimensional feature vector;
and performing feature fusion on the features in the one-dimensional feature vector.
In an embodiment of the present application, the extracting object features of objects in each video frame includes:
respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of the objects in each video frame as the object features of the objects in each video frame, wherein the feature map comprises a channel dimension, a height dimension and a width dimension, and the event type prediction model comprises: a feature fusion layer and a fully-connected layer, the feature fusion layer comprising: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long-short term memory network sublayer;
the method for combining the object features corresponding to the video frames in the channel dimension according to the acquisition sequence of the video frames to obtain a first feature map comprises the following steps:
inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in a channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map;
the fusing the features of different channels in the first feature map to obtain a second feature map includes:
inputting the first feature map into the convolution sublayer, and fusing features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map;
the performing dimension conversion on the second feature map to obtain a one-dimensional feature vector includes:
inputting the second feature map into the dimension conversion sublayer, and performing dimension conversion on the second feature map on the height dimension and the width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector;
the feature fusion of the features in the one-dimensional feature vector comprises:
inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on features in the one-dimensional feature vector by using the long-short term memory network sublayer;
predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object features, wherein the predicting comprises the following steps:
and inputting the fused object features into the full-connection layer, and predicting the types of events which can be monitored by the image acquisition equipment by using the full-connection layer.
In an embodiment of the application, the predicting, based on the fused object feature, an event type of an event that can be monitored by the image capturing device includes:
predicting, based on the fused object features, the probability of the obtained video frames corresponding to each preset event type, wherein the probability of each event type characterizes: the probability that an event of the event type occurs in the scene reflected by the obtained video frames;
and determining the event type of which the corresponding probability reaches a preset probability threshold, and taking the determined event type as the event type of the event which can be monitored by the image acquisition equipment.
In an embodiment of the present application, the obtaining a video frame captured by an image capturing device includes:
the method comprises the steps of obtaining a video collected by image collecting equipment, and extracting a preset number of frames of video frames from the video according to a preset interval.
In an embodiment of the present application, the obtaining a video frame captured by an image capturing device includes:
and obtaining video frames acquired by the image acquisition equipment, and carrying out scaling processing on each video frame to obtain video frames of a preset size.
In a second aspect, an embodiment of the present application provides an event type prediction apparatus, where the apparatus includes:
the video frame acquisition module is used for acquiring a video frame acquired by the image acquisition equipment;
the event type prediction module is used for extracting object characteristics of objects in each video frame and fusing the extracted object characteristics based on the context relationship of contents in each video frame and the moving relationship of the objects between the video frames; and predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics.
In an embodiment of the application, the event type prediction module is specifically configured to:
extracting object features of objects in each video frame;
according to the acquisition sequence of each video frame, combining the object features corresponding to each video frame in the channel dimension to obtain a first feature map;
fusing the features of different channels in the first feature map to obtain a second feature map, wherein the number of the channels of the second feature map is less than that of the channels of the first feature map;
performing dimension conversion on the second characteristic diagram to obtain a one-dimensional characteristic vector;
and performing feature fusion on the features in the one-dimensional feature vector.
In an embodiment of the application, the event type prediction module is specifically configured to:
respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of an object in each video frame, wherein the feature map is used as an object feature of the object in each video frame, the feature map comprises a channel dimension, a height dimension and a width dimension, and the event type prediction model comprises the following steps: a feature fusion layer and a fully-connected layer, the feature fusion layer comprising: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long-short term memory network sublayer;
inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in a channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map;
inputting the first feature map into the convolution sublayer, and fusing features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map;
inputting the second feature map into the dimension conversion sublayer, and performing dimension conversion on the second feature map on the height dimension and the width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector;
inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on features in the one-dimensional feature vector by using the long-short term memory network sublayer;
and inputting the fused object features into the full-connection layer, and predicting the types of events which can be monitored by the image acquisition equipment by using the full-connection layer.
In an embodiment of the application, the event type prediction module is specifically configured to:
predicting the probability of the obtained video frame corresponding to each preset event type based on the fused object characteristics, wherein the probability of each event type is characterized by: the obtained video frame reflects the occurrence probability of the event type in the scene;
and determining the event type of which the corresponding probability reaches a preset probability threshold, and taking the determined event type as the event type of the event which can be monitored by the image acquisition equipment.
In an embodiment of the application, the video frame obtaining module is specifically configured to:
the method comprises the steps of obtaining a video collected by image collecting equipment, and extracting a preset number of frames of video frames from the video according to a preset interval.
In an embodiment of the application, the video frame obtaining module is specifically configured to:
and obtaining video frames acquired by the image acquisition equipment, and carrying out scaling processing on each video frame to obtain video frames of a preset size.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of the first aspect when executing a program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the first aspect.
Embodiments of the present application also provide a computer program product containing instructions that, when executed on a computer, cause the computer to perform any of the event type prediction methods described above.
The embodiment of the application has the following beneficial effects:
when the scheme provided by the embodiment of the application is applied to event type prediction, video frames acquired by image acquisition equipment are firstly obtained, object features of objects in the video frames are extracted, the extracted object features are fused based on the context relationship of contents in the video frames and the moving relationship of the objects between the video frames, and the event type of an event which can be monitored by the image acquisition equipment is predicted based on the fused object features. Therefore, the object characteristics of the object in each video frame are obtained by analyzing the video frames acquired by the image acquisition equipment, and the event types of the events monitored by each image acquisition equipment are predicted based on the object characteristics without manual prediction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of an event type prediction method according to an embodiment of the present application;
fig. 2 is a video frame provided in an embodiment of the present application;
fig. 3a and 3b are another video frame provided by the embodiment of the present application;
fig. 4 is a schematic structural diagram of an event type prediction model according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another event type prediction method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another event type prediction model provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an event type prediction apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to improve the efficiency of predicting the event types of events that can be monitored by an image acquisition device, embodiments of the present application provide an event type prediction method, an event type prediction apparatus, an electronic device, and a storage medium, which are described in detail below.
Referring to fig. 1, fig. 1 is a schematic flowchart of an event type prediction method provided in an embodiment of the present application. The method may be applied to an electronic device such as an image capturing device, a computer or a mobile phone, and includes the following steps 101 to 103.
Step 101, obtaining a video frame collected by an image collecting device.
The image acquisition equipment can be erected in scenes such as pedestrian crossings, traffic intersections, road side of road middle sections, markets, factories and schools.
Specifically, a monitoring video acquired by the image acquisition device can be obtained, and video frames are extracted from the acquired monitoring video. The number of video frames may be multiple, such as 3, 5 or 10 frames; consecutive frames may be extracted from the video, or frames with higher imaging quality may be selected from the video.
In an embodiment of the application, a video acquired by an image acquisition device may be obtained, and a preset number of frames of video frames may be extracted from the video according to a preset interval. The preset number may be 10 frames, 15 frames, 30 frames, etc.
The preset interval may be a preset time interval, that is, a preset number of frames of video frames are extracted in a manner of every preset time interval, where the time interval may be 30 milliseconds, 100 milliseconds, 1 second, and the like, and the embodiment of the present application is not limited thereto. For example, assuming that the preset time interval is 1 second and the preset number is 15 frames, a video of not less than 15 seconds can be acquired, one video frame is extracted every 1 second interval, and finally 15 video frames are acquired.
The preset interval may also be a preset video frame interval, that is, a preset number of video frames are extracted at every preset video frame interval, and the video frame interval may be 5 frames, 10 frames, 20 frames, and so on. For example, assuming that the preset video frame interval is 5 frames and the preset number is 10 frames, a video of no fewer than 50 frames can be obtained, one video frame is extracted every 5 frames, and finally 10 video frames are obtained.
Therefore, the multi-frame video frames are selected, and when the event type prediction is performed according to the multi-frame video frames subsequently, the information contained in the multi-frame video frames is richer, for example, if an object in one of the video frames is blocked, the information about the object can be obtained in other video frames, so that the accuracy of the event type obtained according to the multi-frame video frames is higher.
In an embodiment of the application, video frames acquired by image acquisition equipment can be obtained, and each video frame is subjected to scaling processing to obtain a video frame with a preset size. The preset size is used for representing pixel values of the video frame, and may be 300 × 300, 640 × 480, or the like. Therefore, the pixel size of the video frame can be reduced, the calculation complexity can be reduced when the video frame is processed subsequently, and the calculation efficiency can be improved.
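The frame sampling and scaling described above could be sketched as follows. This is an illustrative sketch only, not part of the patented embodiments; OpenCV is assumed as the video-reading library, and the interval, frame count and target size (one frame every 30 frames, 15 frames, 300 × 300) are example values taken from the description.

```python
import cv2

def sample_frames(video_path, frame_interval=30, num_frames=15, size=(300, 300)):
    """Extract num_frames frames from the video, one every frame_interval
    frames, and scale each extracted frame to the preset size (width, height)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:                       # video ended before enough frames were collected
            break
        if index % frame_interval == 0:  # keep one frame per preset interval
            frames.append(cv2.resize(frame, size))
        index += 1
    cap.release()
    return frames
```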
And 102, extracting object characteristics of the objects in each video frame, and fusing the extracted object characteristics based on the context of the content in each video frame and the moving relation of the objects between the video frames.
The object is an object contained in a video frame acquired by image acquisition equipment. For the image acquisition equipment erected at the pedestrian crossing, the objects contained in the acquired video frames can be pedestrian crossings, pedestrians, traffic lights, vehicles, deceleration marks and the like; for the image acquisition equipment erected in a shopping mall, the objects contained in the acquired video frames can be pedestrians, shops, billboards, ornaments and the like. The characteristic of the object may be an object tag.
For each video frame, object features of one or more objects in the video frame may be extracted. For example, referring to fig. 2, fig. 2 is a video frame provided by an embodiment of the present application. In the case that the object feature is an object tag, the object feature in fig. 2 may include: vehicles, pedestrians, traffic lights, pedestrian crossings, and the like.
The context of the content in a video frame is used to characterize the semantic associations between the object features; fusing the object features based on the context of the content makes it possible to obtain an image description of the video frame. For example, referring to fig. 3a and 3b, fig. 3a and 3b are other video frames provided in the embodiment of the present application. As shown in fig. 3a, the object features of the objects in the video frame include: batter, referee, observing, court, club, swing; fusing these object features according to the context of the content in the video frame, the image description of the video frame can be obtained as: the batter is preparing to swing on the court while the referee is observing. As shown in fig. 3b, the object features of the objects in the video frame include: bus, parked, building; fusing these object features according to the context of the content in the video frame, the image description of the video frame can be obtained as: a bus is parked alongside the building.
The movement relationship of the objects between the video frames is used to characterize how each object moves between different video frames, reflecting the change of the object at different moments, that is, the temporal characteristics of the object across the video frames. Taking a vehicle as an example, if the vehicle is inside a parking space in one video frame, is straddling the edge of the parking space in the next video frame, and has driven away from the parking space in the frame after that, the movement relationship of the object between the video frames can be characterized as: the vehicle drives out of the parking space.
Specifically, after the object features of the objects in each video frame are obtained, the context relationship of the content in each video frame and the moving relationship of the objects between each video frame may be analyzed, and the object features may be fused according to the analyzed relationship, and the obtained fusion features may be used to describe the relationship of the objects between each video frame, that is, the events described by each video frame and the scenes reflected by each video frame may be represented.
In one embodiment of the application, when the object features are extracted, the object features of the objects in each video frame can be extracted by using a feature extraction model which is trained in advance; the label of the object in each video frame can be obtained by using a label classification model trained in advance and used as the object feature of each object. When feature fusion is performed, a semantic analysis algorithm can be used to fuse the features of each object in each video frame.
And 103, predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics.
Specifically, based on the fused object features, events and reflected scenes described by the video frames can be analyzed, and then event types of events that can be monitored by the image acquisition device can be predicted according to the described events.
In an embodiment of the application, the feature classification model trained in advance can be used to classify the fused object features, so as to obtain the event type of the event described by each video frame. The event type of the event that the image capturing device can monitor can then be predicted from the obtained event type.
In an embodiment of the present application, a corresponding relationship between an event type of an event described by a video frame and an event type of an event monitored by an image capturing device may be pre-established, and after the event type of the event described by the video frame is obtained, the event type of the event that can be monitored by the image capturing device is predicted according to the corresponding relationship.
Referring to table 1 below, table 1 is a schematic table of a corresponding relationship provided in the embodiments of the present application.
TABLE 1
For example, if the event type of the event described by the video frames is obtained as a prohibited parking area on a road, it indicates that an event of a vehicle parking in the prohibited area may occur in the scene monitored by the image capture device, and therefore it can be predicted that the image capture device can monitor events of vehicles parking in the prohibited area.
For an image acquisition device erected in the scene of a shopping mall cash register, if the event type of the event described by the video frames is obtained as customers queuing, it indicates that a queue-jumping event may occur in the scene monitored by the device, and therefore it can be predicted that the image acquisition device can monitor queue-jumping events.
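As a minimal sketch of the pre-established correspondence described above, the lookup could look like the following. The dictionary entries and the function name are hypothetical illustrations; the actual correspondence of table 1 is provided as an image in the original publication and is not reproduced here.

```python
# Hypothetical correspondence between the event type described by the video
# frames and the event type the image acquisition device can monitor.
DESCRIBED_TO_MONITORED = {
    "prohibited parking area on a road": "vehicle parking in a prohibited area",
    "customers queuing at a cash register": "queue jumping",
}

def monitored_event_types(described_event_type):
    """Look up the event type the device can monitor for a described event type."""
    monitored = DESCRIBED_TO_MONITORED.get(described_event_type)
    return [monitored] if monitored is not None else []
```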
In an embodiment of the application, probabilities of the obtained video frames corresponding to the respective preset event types may be predicted based on the fused object features, an event type whose corresponding probability reaches a preset probability threshold is determined, and the determined event type is used as an event type of an event that the image acquisition device can monitor.
For video frames collected in a traffic scene, the preset event types can be forbidden regional parking, road congestion, vehicle solid line lane change and the like; for video frames acquired in a factory scene, the preset event types can be late arrival of employees, early departure of employees, aggregation of employees and the like.
The probability of each event type characterizes: the probability that an event of that event type occurs in the scene reflected by the obtained video frames. Specifically, the scene reflected by the video frames, which is the scene monitored by the image acquisition device, can be obtained based on the fused object features, and the probability of each event type can then represent the probability that an event of that type occurs in this scene.
The preset probability threshold may be 0.6, 0.8, 0.9, etc.
Specifically, based on the fused object features, the occurrence probability of an event of each preset event type in the scene reflected by the video frames can be predicted. For an event type whose probability reaches the probability threshold, the probability that an event of that type occurs in the scene is high, that is, an event of that type may occur in the scene monitored by the image acquisition device, so the event type can be determined as an event type of an event that the device can monitor. The probabilities of the event types do not affect one another, and the number of event types of events that the image acquisition device can monitor may be 1 or more.
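The threshold-based selection described above might be sketched as follows; the threshold of 0.6 and the event-type names are example values from the description, and the function name is illustrative.

```python
def select_event_types(probabilities, threshold=0.6):
    """Keep every preset event type whose predicted occurrence probability
    reaches the threshold; zero, one or several types may be returned."""
    return [event_type for event_type, prob in probabilities.items() if prob >= threshold]

# Example: probabilities predicted for each preset event type
probs = {"parking in a prohibited area": 0.9, "vehicle driving in reverse": 0.7,
         "road congestion": 0.4}
print(select_event_types(probs))  # ['parking in a prohibited area', 'vehicle driving in reverse']
```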
In an embodiment of the present application, the probabilities, predicted based on the fused object features, of the obtained video frames corresponding to the respective preset event types may each include a first probability and a second probability.
Wherein the first probability characterizes: the probability of the event type occurring in the scene reflected by the obtained video frame;
the second probability characterization: the obtained video frame reflects a probability that an event of the event type does not occur in the scene.
For example, assuming that the event type of the event is "the vehicle travels backward on the road", the probability corresponding to the event type includes a first probability that an event of the "the vehicle travels backward on the road" type can occur in the scene reflected by the video frame, and a second probability that an event of the "the vehicle travels backward on the road" type cannot occur in the scene reflected by the video frame.
Subsequently, the event type whose corresponding first probability reaches the preset probability threshold can be determined, and the determined event type is used as an event type of an event that the image acquisition device can monitor.
Referring to table 2, table 2 is a schematic table of probabilities of respective event types provided in the embodiment of the present application.
TABLE 2
Type of event | First probability | Second probability
Parking in a prohibited area | 0.9 | 0.1
Vehicle driving in reverse | 0.7 | 0.3
Vehicle not yielding to pedestrians | 0.25 | 0.75
Road congestion | 0.4 | 0.6
Vehicle changing lanes across a solid line | 0.5 | 0.5
Assuming that the probabilities of the video frames corresponding to the respective preset event types are as shown in table 2 above, the first probability that an event of the type "parking in a prohibited area" occurs is 0.9 and the second probability that it does not occur is 0.1; similarly, the first probability that an event of the type "vehicle changing lanes across a solid line" occurs is 0.5 and the second probability that it does not occur is 0.5.
Assuming that the preset probability threshold is 0.6, the event types whose first probability reaches 0.6 are "parking in a prohibited area" and "vehicle driving in reverse"; it can therefore be predicted that the image acquisition device can monitor events of these two event types.
When the scheme provided by this embodiment is applied to event type prediction, video frames acquired by the image acquisition device are first obtained, object features of the objects in each video frame are extracted, the extracted object features are fused based on the context of the content in each video frame and the movement relationship of the objects between the video frames, and the event types of events that the image acquisition device can monitor are predicted based on the fused object features. In this way, the object features of the objects in each video frame are obtained by analyzing the video frames acquired by the image acquisition device, and the event types of events that the device can monitor are predicted based on these object features, so the event types do not need to be predicted manually for each image acquisition device; applying the event type prediction scheme provided by this embodiment therefore improves the efficiency of predicting the event types of events that image acquisition devices can monitor.
In an embodiment of the present application, when the object features are fused in the step 102, the object features of the objects in each video frame may be extracted, and the object features corresponding to each video frame are merged in the channel dimension according to the acquisition sequence of each video frame to obtain a first feature map, the features of different channels in the first feature map are fused to obtain a second feature map, the second feature map is subjected to dimension conversion to obtain a one-dimensional feature vector, and the features in the one-dimensional feature vector are subjected to feature fusion.
And the number of the channels of the second characteristic diagram is less than that of the channels of the first characteristic diagram.
Specifically, after the video frames are obtained, feature extraction is performed on the video frames first, so that a feature map corresponding to each video frame can be obtained. The feature map corresponding to each video frame includes a channel dimension C, a height dimension H, and a width dimension W. And then combining the channel dimensions C of the feature maps corresponding to the video frames according to the acquisition sequence of the video frames to obtain a first feature map. Therefore, the object features corresponding to the video frames can be combined together without changing the height and the width of the feature map.
For the first feature map, the features of different channels in the first feature map may be fused to obtain the second feature map. Since the channel dimension of the first feature map contains the object features within each video frame, that is, the spatial features of each video frame, and also contains the object features across the video frames, that is, the temporal features between the video frames, fusing the features of the channel dimension in the first feature map can be understood as fusing the object features of the video frames in both the temporal and spatial dimensions.
And then, dimension reduction processing can be carried out on the second feature map to obtain a one-dimensional feature vector, and features in the one-dimensional feature vector are further fused, so that the features of each object can be further fused in time domain and space domain dimensions, and the accuracy of the obtained fused features is improved. Dimension reduction processing is performed on the second feature map, that is, the height dimension and the width dimension of the second feature map are merged, so that a one-dimensional feature vector is obtained.
In an embodiment of the present application, a preset feature extraction algorithm may be used to extract object features of objects in each video frame respectively. And then combining the object features corresponding to the video frames in the channel dimension by using a feature combination algorithm according to the acquisition sequence of the video frames to obtain a first feature map. And then, fusing the characteristics of different channels in the first characteristic diagram by using a first characteristic fusion algorithm to obtain a second characteristic diagram. And performing dimension conversion on the second feature map by using a dimension conversion algorithm to obtain a one-dimensional feature vector. And finally, performing feature fusion on the features in the one-dimensional feature vector by using a second feature fusion algorithm.
In an embodiment of the present application, for the above steps 102 and 103, the event type prediction model may also be implemented by means of pre-training, which is described in detail below.
In an embodiment of the application, each video frame may be input into a pre-trained event type prediction model, and an event type of an event monitored by the image capturing device may be determined according to an output result of the event type prediction model.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an event type prediction model provided in an embodiment of the present application, where the event type prediction model includes: a convolution layer, a feature fusion layer and a fully-connected layer, and the feature fusion layer includes: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long-short term memory network sublayer.
Wherein the convolutional layer is used for: and performing convolution transformation on each video frame to obtain a feature map of the object in each video frame, wherein the feature map is used as the object feature of the object in each video frame, the feature map comprises a channel dimension, a height dimension and a width dimension, and the feature map corresponding to each video frame is input into a feature merging sublayer.
The network model used by the convolutional layer may be a VGG16 network model, and the convolutional layer may include a plurality of network layers.
Taking the example that the convolutional layer includes 5 VGG16 network models, each video frame may be sequentially input into the 5 VGG16 network models, each VGG16 network model includes a convolution module and a pooling module, and is used to perform convolution processing and pooling processing on the input data, so that after performing convolution processing and pooling processing on each video frame by the 5 VGG16 network models, a feature map of an object in each video frame may be extracted and obtained as an object feature of the object included in each video frame.
The convolution layer is multiplexed (shared) by all video frames, which reduces the model structure and saves computation of the network model.
The feature merging sublayer is to: and combining the feature maps corresponding to the video frames in the channel dimension according to the acquisition sequence of the video frames to obtain a first feature map, and inputting the first feature map into a convolution sublayer. Wherein the feature merging sublayer may be a Concat layer.
The convolution sublayer is used to: and fusing the characteristics of different channels in the first characteristic diagram in a mode of carrying out convolution transformation on the first characteristic diagram to obtain a second characteristic diagram, and inputting the second characteristic diagram into the dimension conversion sublayer. Wherein, the convolution sublayer is a CNN network layer.
The dimension conversion sublayer is used for: performing dimension conversion on the second feature map based on the height dimension and the width dimension to obtain a one-dimensional feature vector, and inputting the one-dimensional feature vector into the long-short term memory network sublayer, so that the long-short term memory network can further fuse the features in the one-dimensional feature vector. The dimension conversion sublayer may be a Flatten layer.
The long-short term memory network sublayer is used for: and performing feature fusion on features in the one-dimensional feature vector, and inputting the fused features into the full connection layer. The Long-Short Term Memory network layer may be an LSTM (Long Short-Term Memory) layer.
The full connection layer is used for: and based on the feature after feature fusion, predicting the event type of the event which can be monitored by the image acquisition device, wherein the full connection layer can be an FC layer.
Specifically, the full connection layer may output probabilities of the corresponding event types, and select an event type with a corresponding probability reaching a preset probability threshold as an event type of an event that the image capture device can monitor.
In an embodiment of the present application, the full connection layer may output a probability of an event type of each event that the image capturing device can monitor, where the probability of each event type includes: the image capture device is capable of monitoring a first probability of an event of the event type and is incapable of monitoring a second probability of an event of the event type. The event type of the event, the first probability of which reaches the preset probability threshold, may be selected as the event type of the event that the image capturing device can monitor.
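For illustration, a minimal PyTorch sketch of the model structure of fig. 4 is given below. It is an assumption-laden sketch rather than the exact implementation: the backbone is passed in as an argument (in this description it is a stack of 5 VGG16 layers), the number of event types and the LSTM hidden size are placeholder values, and the fully-connected layer is assumed to output a (first, second) logit pair per event type, with a softmax over the last dimension giving the first and second probabilities.

```python
import torch
import torch.nn as nn

class EventTypePredictionModel(nn.Module):
    """Sketch of the structure in fig. 4: shared convolutional backbone,
    feature merging sublayer (concat), convolution sublayer (1x1 conv),
    dimension conversion sublayer (flatten H and W), long-short term memory
    network sublayer (LSTM) and a fully-connected layer."""

    def __init__(self, backbone, num_frames=15, backbone_channels=512,
                 fused_channels=256, num_event_types=5):
        super().__init__()
        self.backbone = backbone                      # e.g. a VGG16 feature extractor shared by all frames
        self.fuse = nn.Conv2d(num_frames * backbone_channels, fused_channels,
                              kernel_size=1)          # fuses features of different channels
        self.lstm = nn.LSTM(input_size=fused_channels, hidden_size=fused_channels,
                            batch_first=True)
        self.fc = nn.Linear(fused_channels, num_event_types * 2)  # (first, second) pair per type
        self.num_event_types = num_event_types

    def forward(self, frames):
        # frames: (batch, num_frames, 3, H, W), e.g. 15 frames of 300 x 300
        b, t = frames.shape[:2]
        maps = [self.backbone(frames[:, i]) for i in range(t)]   # each: (b, 512, h, w)
        first = torch.cat(maps, dim=1)                # feature merging sublayer: (b, t*512, h, w)
        second = self.fuse(first)                     # convolution sublayer:     (b, 256, h, w)
        seq = second.flatten(2).transpose(1, 2)       # dimension conversion:     (b, h*w, 256)
        fused, _ = self.lstm(seq)                     # LSTM sublayer
        logits = self.fc(fused[:, -1])                # fully-connected layer
        # softmax over the last dimension gives the first/second probability per event type
        return logits.view(b, self.num_event_types, 2)
```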
Referring to fig. 5, fig. 5 is a schematic flowchart of another event type prediction method provided in this embodiment of the present application, where the method includes the following steps 501 to 507.
Step 501, obtaining a video frame collected by an image collecting device.
And 502, respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, and performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of an object in each video frame as an object feature of the object in each video frame.
The feature map includes a channel dimension, a height dimension and a width dimension, and the event type prediction model includes: a feature fusion layer and a fully-connected layer, where the feature fusion layer includes: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long-short term memory network sublayer;
step 503, inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in the channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map.
Step 504, inputting the first feature map into a convolution sublayer, and fusing the features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map.
And 505, inputting the second feature map into a dimension conversion sublayer, and performing dimension conversion on the second feature map in the height dimension and the width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector.
Step 506, inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on the features in the one-dimensional feature vector by using the long-short term memory network sublayer;
and 507, inputting the fused object features into a full connection layer, and predicting the type of an event which can be monitored by image acquisition equipment by using the full connection layer.
By applying the scheme provided by the embodiment, after the video frames acquired by each image acquisition device are obtained, the event type of the event which can be monitored by the image acquisition device can be predicted by using the event type prediction model which is trained in advance, so that the event type prediction efficiency can be further improved, and the robustness of the event type prediction can be improved.
When the event type prediction model is trained, a large number of video frames can be selected as samples, the event type of an event monitored by the image acquisition equipment is judged manually according to the samples, and the samples are labeled. And then, inputting the labeled sample into a model to be trained to obtain a model output result, calculating the loss of the model output result relative to the label, adjusting the parameters of the model based on the loss, and then re-training the model until the training end condition is reached to obtain a trained event type prediction model.
Wherein, the cross entropy loss function can be used to calculate the loss of the output result of the model relative to the label.
The training end condition may be that the number of times of training reaches a preset threshold, such as 1000 times, 10000 times, 50000 times, and the like, and the training end condition may also be that a loss of a model output result relative to a label is smaller than a preset loss threshold, and the like, which is not limited in the embodiment of the present application.
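Assuming the model outputs one (first, second) logit pair per event type as in the earlier sketch, the training procedure described here might look roughly as follows; the Adam optimizer, learning rate, label encoding and step-count end condition are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, data_loader, num_steps=10000, lr=1e-4):
    """Train the event type prediction model on manually labelled samples
    until a preset number of training steps is reached (one possible end condition)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                   # cross entropy loss of the output relative to the label
    step = 0
    while step < num_steps:
        for frames, labels in data_loader:            # labels: (batch, num_event_types); assumed 0 = occurs, 1 = does not
            logits = model(frames)                    # (batch, num_event_types, 2)
            loss = loss_fn(logits.transpose(1, 2), labels)  # CrossEntropyLoss expects (batch, classes, ...)
            optimizer.zero_grad()
            loss.backward()                           # adjust the model parameters based on the loss
            optimizer.step()
            step += 1
            if step >= num_steps:
                break
    return model
```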
Referring to fig. 6, fig. 6 is a schematic structural diagram of another event type prediction model provided in the embodiment of the present application. The event type prediction model includes 5 VGG16 layers, a Concat layer, a CNN layer, a Flatten layer, an LSTM layer and an FC layer.
The scheme of the embodiments of the present application is described below by way of a specific example.
A monitoring video collected by the image collecting device is obtained, 15 video frames are extracted from it according to a preset time interval, and each of the 15 video frames is scaled to obtain video frames of 300 × 300 pixels.
The 15 video frames are sequentially input into the event type prediction model, and the convolution modules and pooling modules in the 5 VGG16 layers perform convolution and pooling processing on them, so as to obtain 15 feature maps with channel dimension C = 512, height dimension H = 10 and width dimension W = 10;
then, the Concat layer may merge the 15 feature maps of C × H × W = 512 × 10 × 10 in the channel dimension according to the acquisition order of the video frames, so as to obtain a first feature map of C × H × W = 7680 × 10 × 10;
then, the CNN layer performs feature fusion on the first feature map in the channel dimension using a 1 × 1 convolution kernel with 256 output channels, to obtain a second feature map of C × H × W = 256 × 10 × 10;
the Flatten layer performs dimension conversion on the second feature map in the H and W dimensions, converting the second feature map of C × H × W = 256 × 10 × 10 into a one-dimensional feature vector with channel dimension 256 and length 100;
the LSTM layer performs feature fusion on features in the one-dimensional feature vector;
the FC layer outputs the probability of the event type of each event which can be monitored by the image acquisition equipment based on the fused features, wherein the probability of each event type comprises the following steps: the image capture device is capable of monitoring a first probability of an event of the event type and is incapable of monitoring a second probability of an event of the event type.
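Continuing the earlier sketch of the fig. 4 model, the concrete dimensions of this example can be traced as follows; the stand-in backbone below is only there to reproduce the 512 × 10 × 10 per-frame feature map shape and does not represent the actual VGG16 layers.

```python
import torch
import torch.nn as nn

# Stand-in backbone mapping a 3 x 300 x 300 frame to a 512 x 10 x 10 feature map,
# mimicking the shape produced by the 5 VGG16 layers in this example.
backbone = nn.Sequential(nn.Conv2d(3, 512, kernel_size=30, stride=30), nn.ReLU())

model = EventTypePredictionModel(backbone, num_frames=15)   # class from the earlier sketch
frames = torch.randn(1, 15, 3, 300, 300)                    # 15 video frames scaled to 300 x 300
probs = model(frames).softmax(dim=-1)                       # (1, num_event_types, 2): first/second probability
print(probs.shape)
```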
In an embodiment of the present application, after predicting an event type of an event that can be monitored by an image capturing device, an identification model for identifying the event of the event type may be deployed in the image capturing device.
Specifically, after the event type of an event that the image capturing device can monitor is predicted, it means that events of this type may exist in the monitoring video captured by the device, and in order to recognize such events, a recognition model for recognizing events of this type can be deployed on the image capturing device in a targeted manner. For example, after it is predicted that the image capturing device can monitor wrong-way driving events, a wrong-way driving recognition model can be deployed on the device; this model can recognize vehicles travelling in the wrong direction on the road and obtain information such as their license plate numbers.
Therefore, the identification model can be deployed on the image acquisition equipment in a targeted manner, and for the event type of the event which cannot be monitored by the image acquisition equipment, the relevant identification model does not need to be deployed, so that the resource occupation of the image acquisition equipment can be reduced, and the material resources are saved.
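A minimal sketch of such targeted deployment; the model registry, the event-type names and the deploy call on the device are hypothetical.

```python
# Hypothetical registry mapping a monitorable event type to the recognition
# model that should be deployed on the image acquisition device for it.
RECOGNITION_MODELS = {
    "wrong-way driving": "wrong_way_driving_recognition",
    "parking in a prohibited area": "prohibited_parking_recognition",
}

def deploy_models(device, monitorable_event_types):
    """Deploy only the recognition models matching the predicted event types."""
    for event_type in monitorable_event_types:
        model_name = RECOGNITION_MODELS.get(event_type)
        if model_name is not None:
            device.deploy(model_name)   # hypothetical deployment call on the device
```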
Referring to fig. 7, fig. 7 is a schematic structural diagram of an event type prediction apparatus according to an embodiment of the present application, where the apparatus includes:
a video frame obtaining module 701, configured to obtain a video frame acquired by an image acquisition device;
an event type prediction module 702, configured to extract object features of objects in each video frame, and fuse the extracted object features based on a context of content in each video frame and a moving relationship of the objects between the video frames; and predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics.
In an embodiment of the application, the event type prediction module 702 is specifically configured to:
extracting object features of objects in each video frame;
according to the acquisition sequence of each video frame, combining object features corresponding to each video frame in a channel dimension to obtain a first feature map;
fusing the features of different channels in the first feature map to obtain a second feature map, wherein the number of the channels of the second feature map is smaller than that of the channels of the first feature map;
performing dimension conversion on the second feature map to obtain a one-dimensional feature vector;
and performing feature fusion on the features in the one-dimensional feature vector.
In an embodiment of the application, the event type prediction module 702 is specifically configured to:
respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of an object in each video frame, wherein the feature map is used as an object feature of the object in each video frame, the feature map comprises a channel dimension, a height dimension and a width dimension, and the event type prediction model comprises the following steps: a feature fusion layer and a fully-connected layer, the feature fusion layer comprising: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long-term and short-term memory network sublayer;
inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in a channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map;
inputting the first feature map into the convolution sublayer, and fusing features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map;
inputting the second feature map into the dimension conversion sublayer, and performing dimension conversion on the second feature map on the height dimension and the width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector;
inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on features in the one-dimensional feature vector by using the long-short term memory network sublayer;
and inputting the fused object features into the full-connection layer, and predicting the types of events which can be monitored by the image acquisition equipment by using the full-connection layer.
In an embodiment of the application, the event type prediction module 702 is specifically configured to:
predicting the probability of the obtained video frame corresponding to each preset event type based on the fused object characteristics, wherein the probability of each event type is characterized by: the occurrence probability of the event type in the scene reflected by the obtained video frame;
and determining the event type of which the corresponding probability reaches a preset probability threshold, and taking the determined event type as the event type of the event which can be monitored by the image acquisition equipment.
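The thresholding step can be sketched as below; the sigmoid mapping from scores to probabilities, the event type names and the 0.5 threshold are all illustrative assumptions rather than details fixed by this application.

```python
import torch

EVENT_TYPES = ["fighting", "fall detection", "area intrusion"]   # hypothetical preset event types
PROBABILITY_THRESHOLD = 0.5                                       # illustrative threshold

scores = torch.tensor([2.1, -0.7, 0.3])              # fused-feature scores from the model (example values)
probs = torch.sigmoid(scores)                         # occurrence probability per preset event type
monitorable = [name for name, p in zip(EVENT_TYPES, probs.tolist())
               if p >= PROBABILITY_THRESHOLD]         # ["fighting", "area intrusion"]
```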
In an embodiment of the present application, the video frame obtaining module 701 is specifically configured to:
the method comprises the steps of obtaining a video collected by image collecting equipment, and extracting a preset number of frames of video frames from the video according to a preset interval.
In an embodiment of the present application, the video frame obtaining module 701 is specifically configured to:
obtain video frames acquired by the image acquisition equipment, and scale each video frame to obtain video frames of a preset size.
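Correspondingly, the scaling step might be sketched as follows, with 224x224 as an assumed preset size:

```python
import cv2

PRESET_SIZE = (224, 224)   # (width, height), assumed for illustration

def scale_frames(frames):
    # resize every obtained video frame to the preset size expected by the model
    return [cv2.resize(frame, PRESET_SIZE) for frame in frames]
```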
When the scheme provided by this embodiment is applied to event type prediction, video frames acquired by the image acquisition equipment are first obtained, object features of objects in each video frame are extracted, the extracted object features are fused based on the context of content in each video frame and the moving relationship of the objects between the video frames, and the event type of the event which can be monitored by the image acquisition equipment is predicted based on the fused object features. In this way, the object features of the objects in each video frame are obtained by analyzing the video frames acquired by the image acquisition equipment, and the event type of the event which can be monitored by the image acquisition equipment is predicted based on those features, so there is no need to predict the event type for each image acquisition equipment manually; applying the event type prediction scheme provided by this embodiment therefore improves the efficiency of predicting the event type of the event which can be monitored by the image acquisition equipment.
The embodiment of the present application further provides an electronic device, as shown in fig. 8, which includes a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete mutual communication through the communication bus 804,
a memory 803 for storing a computer program;
the processor 801 is configured to implement the steps of the event type prediction method when executing the program stored in the memory 803.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above event type prediction methods.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any of the event type prediction methods of the above embodiments.
When the scheme provided by this embodiment is applied to event type prediction, video frames acquired by the image acquisition equipment are first obtained, object features of objects in each video frame are extracted, the extracted object features are fused based on the context of the content in each video frame and the moving relationship of the objects between the video frames, and the event type of the event which can be monitored by the image acquisition equipment is predicted based on the fused object features. In this way, the object features of the objects in each video frame are obtained by analyzing the video frames acquired by the image acquisition equipment, and the event type of the event which can be monitored by each image acquisition equipment is predicted based on those object features, without the event type having to be predicted manually.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the application are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in this specification are described in a related manner, the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (6)

1. A method for event type prediction, the method comprising:
obtaining a video frame acquired by image acquisition equipment;
extracting object features of objects in each video frame, and fusing the extracted object features based on the context relationship of contents in each video frame and the moving relationship of the objects between the video frames;
predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics;
the extracting object features of objects in each video frame, and fusing the extracted object features based on the context of content in each video frame and the moving relationship of objects between video frames includes:
extracting object features of objects in each video frame, and combining the object features corresponding to each video frame in a channel dimension according to the acquisition sequence of each video frame to obtain a first feature map;
fusing the features of different channels in the first feature map to obtain a second feature map, wherein the number of the channels of the second feature map is smaller than that of the channels of the first feature map;
performing dimension conversion on the second feature map to obtain a one-dimensional feature vector;
performing feature fusion on features in the one-dimensional feature vector;
the extracting the object features of the objects in each video frame includes:
respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, and performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of an object in each video frame as the object feature of the object in that video frame, wherein the feature map comprises a channel dimension, a height dimension and a width dimension, and the event type prediction model further comprises: a feature fusion layer and a fully-connected layer, the feature fusion layer comprising: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long short-term memory network sublayer;
the method for combining the object features corresponding to the video frames in the channel dimension according to the acquisition sequence of the video frames to obtain a first feature map comprises the following steps:
inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in a channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map;
the fusing the features of different channels in the first feature map to obtain a second feature map includes:
inputting the first feature map into the convolution sublayer, and fusing features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map;
the performing dimension conversion on the second feature map to obtain a one-dimensional feature vector includes:
inputting the second feature map into the dimension conversion sublayer, and performing dimension conversion on the second feature map in a height dimension and a width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector;
the feature fusion of the features in the one-dimensional feature vector includes:
inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on features in the one-dimensional feature vector by using the long-short term memory network sublayer;
the predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object features includes:
and inputting the fused object features into the fully-connected layer, and predicting the event type of the event which can be monitored by the image acquisition equipment by using the fully-connected layer.
2. The method according to claim 1, wherein predicting the event type of the event that can be monitored by the image capturing device based on the fused object feature comprises:
predicting the probability of the obtained video frame corresponding to each preset event type based on the fused object features, wherein the probability of each event type is characterized by: the occurrence probability of the event type in the scene reflected by the obtained video frame;
and determining the event type of which the corresponding probability reaches a preset probability threshold, and taking the determined event type as the event type of the event which can be monitored by the image acquisition equipment.
3. The method according to claim 1 or 2, wherein the obtaining of the video frame captured by the image capturing device comprises:
acquiring a video acquired by the image acquisition equipment, and extracting a preset number of video frames from the video at a preset interval; or,
obtaining video frames acquired by the image acquisition equipment, and scaling each video frame to obtain video frames of a preset size.
4. An event type prediction apparatus, characterized in that the apparatus comprises:
the video frame acquisition module is used for acquiring a video frame acquired by the image acquisition equipment;
the event type prediction module is used for extracting object characteristics of objects in each video frame and fusing the extracted object characteristics based on the context relation of contents in each video frame and the movement relation of the objects between the video frames; predicting the event type of the event which can be monitored by the image acquisition equipment based on the fused object characteristics;
the event type prediction module is specifically configured to:
extracting object features of objects in each video frame;
according to the acquisition sequence of each video frame, combining the object features corresponding to each video frame in the channel dimension to obtain a first feature map;
fusing the features of different channels in the first feature map to obtain a second feature map, wherein the number of the channels of the second feature map is less than that of the channels of the first feature map;
performing dimension conversion on the second feature map to obtain a one-dimensional feature vector;
performing feature fusion on features in the one-dimensional feature vector;
the event type prediction module is specifically configured to:
respectively inputting each video frame into a convolution layer of a pre-trained event type prediction model, and performing convolution transformation on each video frame by using the convolution layer to obtain a feature map of an object in each video frame as the object feature of the object in that video frame, wherein the feature map comprises a channel dimension, a height dimension and a width dimension, and the event type prediction model further comprises: a feature fusion layer and a fully-connected layer, the feature fusion layer comprising: a feature merging sublayer, a convolution sublayer, a dimension conversion sublayer and a long short-term memory network sublayer;
inputting the feature maps of the objects in the video frames into the feature merging sublayer, and merging the feature maps corresponding to the video frames in a channel dimension by using the feature merging sublayer according to the acquisition sequence of the video frames to obtain a first feature map;
inputting the first feature map into the convolution sublayer, and fusing features of different channels in the first feature map by using the convolution sublayer according to a mode of performing convolution transformation on the first feature map to obtain a second feature map;
inputting the second feature map into the dimension conversion sublayer, and performing dimension conversion on the second feature map in a height dimension and a width dimension by using the dimension conversion sublayer to obtain a one-dimensional feature vector;
inputting the one-dimensional feature vector into the long-short term memory network sublayer, and performing feature fusion on features in the one-dimensional feature vector by using the long-short term memory network sublayer;
and inputting the fused object features into the fully-connected layer, and predicting the event type of the event which can be monitored by the image acquisition equipment by using the fully-connected layer.
5. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 3 when executing a program stored in the memory.
6. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-3.
CN202011050761.XA 2020-09-29 2020-09-29 Event type prediction method and device, electronic equipment and storage medium Active CN112163525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011050761.XA CN112163525B (en) 2020-09-29 2020-09-29 Event type prediction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011050761.XA CN112163525B (en) 2020-09-29 2020-09-29 Event type prediction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112163525A CN112163525A (en) 2021-01-01
CN112163525B true CN112163525B (en) 2023-02-21

Family

ID=73861011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011050761.XA Active CN112163525B (en) 2020-09-29 2020-09-29 Event type prediction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112163525B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283286B (en) * 2021-03-24 2023-11-21 上海高德威智能交通系统有限公司 Driver abnormal behavior detection method and device
CN113516846B (en) * 2021-06-24 2022-12-13 长安大学 Vehicle lane change behavior prediction model construction, prediction and early warning method and system
CN113516080A (en) * 2021-07-16 2021-10-19 上海高德威智能交通系统有限公司 Behavior detection method and device
CN113596369B (en) * 2021-07-23 2023-09-01 深圳市警威警用装备有限公司 Multi-terminal collaborative law enforcement recording method, electronic device and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A basketball match context event recognition method fusing domain knowledge and multi-stage deep features
CN110276265A (en) * 2019-05-27 2019-09-24 魏运 Pedestrian monitoring method and device based on intelligent three-dimensional solid monitoring device
CN110533053A (en) * 2018-05-23 2019-12-03 杭州海康威视数字技术股份有限公司 Event detection method, device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A basketball match context event recognition method fusing domain knowledge and multi-stage deep features
CN110533053A (en) * 2018-05-23 2019-12-03 杭州海康威视数字技术股份有限公司 Event detection method, device and electronic equipment
CN110276265A (en) * 2019-05-27 2019-09-24 魏运 Pedestrian monitoring method and device based on intelligent three-dimensional solid monitoring device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ActionVLAD: Learning spatio-temporal aggregation for action classification; Rohit Girdhar et al.; 《https://arxiv.org/abs/1704.02895v1》; 20170411; pp. 1-14 *
Research on video abnormal behavior event detection methods based on deep learning; Wang Xin; 《China Master's Theses Full-text Database, Information Science and Technology》; 20200915; Vol. 2020, No. 09; pp. I138-68 *

Also Published As

Publication number Publication date
CN112163525A (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN112163525B (en) Event type prediction method and device, electronic equipment and storage medium
KR102197946B1 (en) object recognition and counting method using deep learning artificial intelligence technology
US10152644B2 (en) Progressive vehicle searching method and device
CN111368687A (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
WO2021135879A1 (en) Vehicle data monitoring method and apparatus, computer device, and storage medium
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN111008600B (en) Lane line detection method
WO2022227490A1 (en) Behavior recognition method and apparatus, device, storage medium, computer program, and program product
US20230222844A1 (en) Parking lot management and control method based on object activity prediction, and electronic device
CN110147707B (en) High-precision vehicle identification method and system
JP6595375B2 (en) Traffic condition analysis device, traffic condition analysis method, and traffic condition analysis program
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN111652035B (en) Pedestrian re-identification method and system based on ST-SSCA-Net
CN112836657A (en) Pedestrian detection method and system based on lightweight YOLOv3
CN113111838A (en) Behavior recognition method and device, equipment and storage medium
CN112733666A (en) Method, equipment and storage medium for collecting difficult images and training models
Kejriwal et al. Vehicle detection and counting using deep learning basedYOLO and deep SORT algorithm for urban traffic management system
CN114663871A (en) Image recognition method, training method, device, system and storage medium
CN114708426A (en) Target detection method, model training method, device, equipment and storage medium
CN111753610A (en) Weather identification method and device
CN116189063B (en) Key frame optimization method and device for intelligent video monitoring
EP3244344A1 (en) Ground object tracking system
CN116311166A (en) Traffic obstacle recognition method and device and electronic equipment
CN112949341B (en) Information obtaining method and device, electronic equipment and storage medium
CN114219073A (en) Method and device for determining attribute information, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant