CN108334910B - Event detection model training method and event detection method - Google Patents

Event detection model training method and event detection method

Info

Publication number
CN108334910B
Authority
CN
China
Prior art keywords
training
batch
image frames
training image
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810297702.9A
Other languages
Chinese (zh)
Other versions
CN108334910A (en)
Inventor
孙源良
夏虎
李长升
樊雨茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoxin Youe Data Co Ltd filed Critical Guoxin Youe Data Co Ltd
Priority to CN201810297702.9A priority Critical patent/CN108334910B/en
Publication of CN108334910A publication Critical patent/CN108334910A/en
Application granted granted Critical
Publication of CN108334910B publication Critical patent/CN108334910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an event detection model training method and an event detection method. The event detection model training method comprises the following steps: acquiring training image frames in a plurality of labeled training videos, and dividing the training image frames into a plurality of batches; extracting feature vectors for the training image frames in all batches by using a target neural network; performing at least two rounds of weight assignment on the feature vectors of the training image frames in each batch by using an attention mechanism processing network; inputting the weight-assigned feature vectors of the training image frames in each batch into a target classifier to obtain a classification result of the training video; and training the target neural network, the attention mechanism processing network and the target classifier according to the comparison between the classification result of the training video and the label of the training video. With the method and the device, the amount of computation required during training can be reduced, and the consumption of computing resources and training time can be reduced, without affecting model accuracy.

Description

Event detection model training method and event detection method
Technical Field
The application relates to the technical field of deep learning, in particular to an event detection model training method and an event detection method.
Background
With the rapid development of neural networks in fields such as images, video, speech, and text, a series of intelligent products have been brought to market, and users' accuracy requirements for the various neural-network-based models keep rising. When an event detection model is built on a neural network, a large number of training videos must be input into the neural network for training so that the network can fully learn the features of the images in the videos and improve the classification accuracy of the event detection model.
However, a training video usually contains a large number of images, so the amount of data is huge. Training the neural network with such videos can indeed improve the accuracy of the resulting model, but the excessive data volume makes the computation required during model training enormous and consumes too many computing resources and too much training time.
Summary of the application
In view of this, an object of the embodiments of the present application is to provide an event detection model training method and an event detection method, which can reduce the amount of computation required during training and the consumption of computing resources and training time without affecting model accuracy.
In a first aspect, an embodiment of the present application provides an event detection model training method, including:
acquiring training image frames in a plurality of training videos with labels, and dividing the training image frames into a plurality of batches; each batch comprises a preset number of training image frames;
extracting feature vectors for the training image frames in all batches by using a target neural network;
performing at least two rounds of weight assignment on the feature vectors of the training image frames in each batch by using an attention mechanism processing network;
inputting the feature vectors of the training image frames in each batch subjected to weight assignment into a target classifier to obtain a classification result of the training video;
and training the target neural network, the attention mechanism processing network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where: the acquiring of training image frames in a plurality of training videos with labels specifically includes:
acquiring a plurality of training videos with labels;
sampling the training video according to a preset sampling frequency;
and taking the image obtained by sampling each training video as a training image frame in the training video.
With reference to the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where: using an attention mechanism processing network to perform weight assignment on the feature vectors of the training image frames in each batch, specifically comprising:
and respectively carrying out weight assignment on the feature vectors of the training image frames in each batch by using the attention mechanism processing network with the feature vectors as the granularity, and respectively carrying out weight assignment on each batch by using the attention mechanism processing network with the batches as the granularity.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where: the feature vectors are used as granularity, the attention mechanism processing network is used for carrying out weight assignment on the feature vectors of the training image frames in each batch, and the obtained weight assignment result a (i) of the ith batch meets the formula (1):
(1)a(i)=tanh(W1F1+W2F2+…+WnFn+c);
wherein n represents the number of training image frames in the ith batch; w1To WnRespectively representing the weight corresponding to the 1 st training image frame to the nth training image frame in each batch; f1To FnRepresenting the characteristic vectors corresponding to the 1 st to the nth training image frames in each batch respectively; c represents the bias execution items when the batches are taken as the granularity and the attention mechanism processing network is used for respectively carrying out weight assignment on each batch; tanh represents an activation function;
the batch is taken as the granularity, the attention mechanism processing network is used for carrying out weight assignment on each batch respectively, and the obtained weight assignment result b (j) of the jth batch meets the formula (2):
(2)b(j)=M1a(1)+M2a(2)+…+Mma(m)+d;
M1to MmRepresenting the weight corresponding to the 1 st to the m-th batches respectively; d represents processing the web using an attention mechanism with the batch as a particle sizeRespectively carrying out weight assignment on each batch according to the network;
after the batch is taken as the granularity and the attention mechanism processing network is used for respectively carrying out weight assignment on each batch, the method further comprises the following steps: and normalizing the weight assignment result of each batch.
With reference to the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where:
the inputting the feature vectors of the training image frames in each batch subjected to the weight assignment to a classifier to obtain the classification result of the training video specifically comprises:
respectively inputting the weight-assigned feature vectors corresponding to each batch into a target classifier to obtain a classification result corresponding to each batch;
and taking the classification result corresponding to the largest number of batches as the classification result of the training video.
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where: the step of respectively inputting the weight-assigned feature vectors corresponding to each batch into the target classifier to obtain the classification result corresponding to each batch specifically includes:
sequentially inputting the weight-assigned feature vectors corresponding to each batch into the target classifier respectively to obtain a classification result of each training image frame represented by the weight-assigned feature vectors;
and taking the classification result corresponding to the maximum number of the training image frames as the classification result of the batch.
With reference to the first aspect, an embodiment of the present application provides a sixth possible implementation manner of the first aspect, where:
further comprising:
respectively splicing the feature vectors of the training image frames in each batch to form spliced feature vectors;
the using of the attention mechanism processing network for performing at least two rounds of weight assignment on the feature vectors of the training image frames in each batch specifically comprises the following steps:
performing at least two rounds of weight assignment on the splicing vectors corresponding to each batch by using an attention mechanism processing network;
the inputting the feature vectors of the training image frames in each batch subjected to the weight assignment to a classifier to obtain the classification result of the training video specifically comprises:
inputting the splicing feature vectors subjected to weight assignment and corresponding to each batch into a target classifier, and obtaining a classification result of the training video.
With reference to the first aspect, an embodiment of the present application provides a seventh possible implementation manner of the first aspect, where:
inputting the spliced feature vectors subjected to weight assignment and corresponding to each batch into a target classifier to obtain a classification result of the training video, specifically comprising:
inputting the spliced eigenvectors subjected to weight assignment and corresponding to each batch into a target classifier respectively to obtain a classification result corresponding to each batch;
and taking the classification result corresponding to the largest number of batches as the classification result of the training video.
With reference to the first aspect, an embodiment of the present application provides an eighth possible implementation manner of the first aspect, where: the training the target neural network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video specifically comprises:
performing the following comparison operation until the classification result of the training video is consistent with the label of the training video;
the comparison operation comprises the following steps:
comparing the classification result of the training video with the label of the training video;
if the classification result of the training video is inconsistent with the label of the training video, adjusting parameters of the target neural network, the attention mechanism processing network and the target classifier;
based on the adjusted parameters, extracting new feature vectors for the training image frames in all batches by using a target neural network, and performing at least two times of weight assignment again on the new feature vectors of the training image frames in each batch by using an attention mechanism processing network;
inputting new feature vectors of the training image frames in each batch subjected to the re-weight assignment into a classifier to obtain a new classification result of the training video;
and performing the comparison operation again.
In a second aspect, an embodiment of the present application further provides an event detection method, including:
acquiring a video to be detected;
inputting the video to be detected into an event detection model obtained by the event detection model training method of any one of the first aspect, so as to obtain a classification result of the video to be detected;
wherein the event detection model comprises: the target neural network, the attention mechanism processing network, and the target classifier.
When training an event detection model with the training image frames in a training video, the embodiment of the application first divides the training image frames into a plurality of batches and then extracts feature vectors for the training image frames in all the batches using a target neural network. Next, at least two rounds of weight assignment are performed on the feature vectors of the training image frames in each batch using an attention mechanism processing network, so that the weight of the feature vectors corresponding to the training image frames belonging to the main event in the training video is increased, while the weight of the feature vectors corresponding to the training image frames not belonging to the main event is reduced. In the process of training the event detection model based on the weight-assigned feature vectors, the model can therefore learn the features of the training image frames belonging to the main event well, which ensures the accuracy of the finally obtained event detection model. Meanwhile, reducing the weight of the feature vectors corresponding to the training image frames not belonging to the main event means that the element values in those feature vectors are correspondingly reduced, with some elements even set directly to zero, so a large amount of calculation is saved when the event detection model is trained on these feature vectors. This reduces the computation required in the event detection model training process and the consumption of computing resources and training time.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of an event detection model training method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method for performing two rounds of weight assignment on feature vectors of training image frames in each batch using an attention mechanism processing network according to a second embodiment of the present application;
FIG. 3 is a flowchart illustrating a specific method, further provided in the third embodiment of the present application, for inputting the weight-assigned feature vectors of the training image frames in each batch into the classifier to obtain the classification result of the training video;
FIG. 4 is a flowchart illustrating a comparison operation method provided in the fourth embodiment of the present application;
FIG. 5 is a flowchart illustrating an event detection model training method according to the fifth embodiment of the present application;
FIG. 6 is a schematic structural diagram of an event detection model training apparatus provided in a sixth embodiment of the present application;
FIG. 7 is a flow chart of an event detection method provided in the seventh embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device according to a ninth embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
At present, when an event detection model is trained with a training video, the training video is directly input into a neural network and a classifier, and the neural network and the classifier are trained. During this training, the neural network and the classifier have to operate on every image in the training video. However, a training video generally contains multiple events, and the images of some events do not actually contribute positively to the classification of the video; instead, they interfere with the normal training of the event detection model. Using the neural network and the classifier to learn features from images that contribute nothing to the classification therefore spends a large amount of computation where it is not needed, which makes the computation required during model training enormous and consumes a great deal of computing resources and training time. Based on this, the application provides an event detection model training method and an event detection method, which can, without affecting model accuracy, reduce the amount of computation required during training and the consumption of computing resources and training time.
To facilitate understanding of the present embodiment, the event detection model training method disclosed in the embodiments of the present application is first described in detail. The event detection model obtained with this training method can effectively classify the events occurring in un-clipped videos, enables automatic classification of online videos, and can provide reasonable label support for a video recommendation system to facilitate effective video recommendation.
Referring to fig. 1, a method for training an event detection model according to an embodiment of the present application includes:
s101: acquiring training image frames in a plurality of training videos with labels, and dividing the training image frames into a plurality of batches; each batch includes a preset number of training image frames.
In a specific implementation, the training video is typically a relatively long video that includes at least one event; when the training video includes a plurality of events, one event generally serves as the primary event and the other events as secondary events, and the training video is labeled according to the primary event.
For example, in a video of a swimming match, in addition to the event of the swimming match, a spectator event and a player follow-up event may be involved, but the swimming match may have a greater weight in the whole video, and therefore the swimming match is taken as a main event, and the video is labeled as the swimming match.
Training the event detection model with the whole training video generally suffers from problems such as slow model convergence caused by the large volume of input data and a training process that consumes a long time and many resources. Therefore, in order to accelerate model convergence and reduce the time and resources consumed during training, training image frames need to be obtained from the whole training video; the training image frames are a subset of all the images included in the training video. Generally, each of the plurality of training videos is sampled at a preset sampling frequency, the images obtained by sampling each training video are used as the training image frames of that training video, and the event detection model is then trained based on the obtained training image frames of each training video.
Meanwhile, each training video usually includes at least one event, and especially when the training video includes multiple events, different events usually appear interspersed with each other and transitions exist between them. Therefore, in order to better locate the primary event in the training video, strengthen the weight occupied by the primary event among all events, and weaken the weight occupied by the secondary events, the training image frames can be divided into a plurality of batches, each batch comprising a preset number of training image frames, so that the training image frames belonging to different events are separated as far as possible and different events fall into different batches.
Here, the number of training image frames corresponding to each batch may be specifically selected according to actual needs; for example, if the event switching in the training video is fast, the number of the corresponding training image frames in each batch may be set to be small; if the event switching in the training video is slow, the number of the corresponding training image frames in each batch can be set to be large.
In addition, it should be noted that when the training image frames of a training video are divided into a plurality of batches, the number of training image frames obtained from the training video is usually not an integer multiple of the number of training image frames per batch, so the last batch cannot reach the preset number. Such a batch can be padded: transparent frames, all-black frames, or all-white frames are appended after the image sequence formed by its training image frames until the batch contains the preset number of training image frames.
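The sampling and batching of S101 can be sketched in Python as follows; this is a minimal sketch assuming OpenCV for frame reading, and the sampling interval, the batch size of 64, the all-black padding frame, and the helper names are illustrative choices rather than anything fixed by the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path, sample_every_n=10):
    """Sample frames from a training video at a preset frequency (assumed interval)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def split_into_batches(frames, batch_size=64):
    """Divide sampled frames into batches; pad the last batch with all-black frames."""
    batches = []
    for start in range(0, len(frames), batch_size):
        batch = frames[start:start + batch_size]
        while len(batch) < batch_size:
            batch.append(np.zeros_like(frames[0]))  # all-black padding frame
        batches.append(batch)
    return batches
```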
S102: feature vectors are extracted for the training image frames in all batches using the target neural network.
In a specific implementation, the target neural network may be a Convolutional Neural Network (CNN) model, which performs feature extraction on the training image frames in each batch and obtains a feature vector corresponding to each training image frame.
Here, in order to accelerate convergence during event detection model training, the target neural network that is used may be obtained by inputting training image frames from training videos into a target neural network to be trained and pre-training it.
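A sketch of the feature extraction in S102 is given below; the torchvision ResNet-18 backbone is only an assumed stand-in for the target neural network, since the patent does not prescribe a specific CNN architecture, and the preprocessing choices are likewise illustrative.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()      # drop the classification head, keep 512-d features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
])

def extract_batch_features(batch_frames):
    """Return one feature vector per training image frame in the batch."""
    x = torch.stack([preprocess(f) for f in batch_frames])
    with torch.no_grad():
        feats = backbone(x)            # shape: (batch_size, 512)
    return feats
```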
S103: at least two rounds of weight assignments are performed on the feature vectors of the training image frames in each batch using an attention mechanism processing network.
In a specific implementation, the attention mechanism learns which part of a training image frame should be processed: at each step, the current state is learned from the previous state and/or the currently input image to obtain the position that should be attended to, so that only the pixels of the attended region are processed rather than all pixels of the image. For example, if a training video A includes two events, a diving event and an auditorium event, and the diving event is the primary event, the attention mechanism can focus more attention on the diving event, strengthening the attention paid to the diving event and weakening the attention paid to the auditorium event.
Strengthening the attention paid to the primary event while weakening the attention paid to the secondary events amounts to performing weight assignment on the feature vectors of the training image frames: the weights of the training image frames corresponding to the primary event are increased, and the weights of those corresponding to the secondary events are reduced.
Specifically, referring to fig. 2, a second embodiment of the present application provides a method for performing two rounds of weight assignment on feature vectors of training image frames in each batch using an attention mechanism processing network, including:
s201: respectively carrying out weight assignment on the feature vectors of the training image frames in each batch by using an attention mechanism processing network by taking the feature vectors as granularity; and the number of the first and second groups,
s202: and (4) respectively carrying out weight assignment on each batch by using the attention mechanism processing network with the batch as the granularity.
Here, each training video includes a plurality of events, and although the training video is divided into a plurality of batches, the division does not strictly guarantee that each batch contains training image frames of only one event. Therefore, in order to strengthen the weight occupied by the primary event among all events and weaken that of the secondary events, the feature vectors of the training image frames in each batch are first weighted by the attention mechanism processing network with the feature vector as the granularity; that is, within each batch, the weights of the training image frames corresponding to the primary event are increased and those corresponding to the secondary events are decreased.
Similarly, because the position and duration of each event in the training video are uncertain, some batches contain many training image frames corresponding to the primary event while others contain few. In order to further increase the weight of the training image frames corresponding to the primary event and decrease that of the secondary events, attention mechanism processing is performed across all batches after it has been performed within each batch: the weights of batches containing more primary-event training image frames are increased, and the weights of batches containing few or even no primary-event training image frames are decreased. Attention is thus further focused on the training image frames corresponding to the primary event, and the influence of the secondary events on the event detection model is further weakened. Meanwhile, when the weights of the training image frames corresponding to secondary events are reduced, many element values in their feature vectors become small and some are even set to zero, so training the event detection model on these down-weighted feature vectors simplifies the computation, reduces the amount of calculation required during training, and reduces the consumption of computing resources and training time.
Specifically, taking the feature vectors as the granularity, the feature vectors of the training image frames in each batch are weighted by the attention mechanism processing network, and the obtained weight assignment result a(i) of the i-th batch satisfies formula (1):
a(i) = tanh(W1F1 + W2F2 + … + WnFn + c)    (1)
wherein n represents the number of training image frames in the i-th batch; W1 to Wn respectively represent the weights corresponding to the 1st to the n-th training image frames in each batch; F1 to Fn respectively represent the feature vectors corresponding to the 1st to the n-th training image frames in each batch; c represents the bias term of this weight assignment; and tanh represents the activation function.
Taking the batches as the granularity, each batch is weighted by the attention mechanism processing network, and the obtained weight assignment result b(j) of the j-th batch satisfies formula (2):
b(j) = M1a(1) + M2a(2) + … + Mma(m) + d    (2)
wherein m represents the number of batches; M1 to Mm respectively represent the weights corresponding to the 1st to the m-th batches; and d represents the bias term used when each batch is weighted by the attention mechanism processing network at batch granularity.
Here, it should be noted that, according to actual needs, the attention mechanism processing network may be used to perform more rounds of weight assignment on the feature vectors of the training image frames in each batch, so as to further increase the weight of the training image frames corresponding to the primary event, reduce the weight of those corresponding to the secondary events, and further reduce the amount of calculation.
In another embodiment, after each batch has been weighted by the attention mechanism processing network at batch granularity, the method further includes normalizing the weight assignment result of each batch. This simplifies the feature vectors and further reduces the amount of calculation.
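The two rounds of weight assignment and the normalization can be sketched as a small PyTorch module; the weighted sums and the tanh follow formulas (1) and (2), while the parameter shapes, the einsum formulation, and the use of L2 normalization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AttentionProcessingNetwork(nn.Module):
    """Two rounds of weight assignment following formulas (1) and (2); shapes are assumed."""
    def __init__(self, feat_dim, n_frames, m_batches):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_frames))   # frame-level weights W1..Wn
        self.c = nn.Parameter(torch.zeros(feat_dim))   # bias term c of formula (1)
        self.M = nn.Parameter(torch.randn(m_batches))  # batch-level weights M1..Mm
        self.d = nn.Parameter(torch.zeros(feat_dim))   # bias term d of formula (2)

    def forward(self, feats):
        # feats: (m, n, feat_dim) - m batches of n frame feature vectors
        # Formula (1): a(i) = tanh(W1 F1 + ... + Wn Fn + c), computed for every batch i
        a = torch.tanh(torch.einsum('n,mnf->mf', self.W, feats) + self.c)   # (m, feat_dim)
        # Formula (2): b(j) = M1 a(1) + ... + Mm a(m) + d, as written in the patent
        b = torch.einsum('m,mf->f', self.M, a) + self.d                     # (feat_dim,)
        # Normalize the per-batch weight assignment results (L2 normalization is an
        # assumption; the patent does not fix the normalization scheme)
        a = nn.functional.normalize(a, dim=-1)
        return a, b
```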
S104: and inputting the feature vectors of the training image frames in each batch subjected to the weight assignment into a target classifier to obtain a classification result of the training video.
In a specific implementation, after the weight-assigned feature vectors of the training image frames in each batch are input to the classifier, the classifier can classify the training image frame represented by each feature vector based on that feature vector. For the feature vectors whose weights have been increased, the classifier learns more features; for those whose weights have been reduced, it learns less from them. The classification result of the whole training video is then obtained from the classification results of the individual feature vectors.
Specifically, referring to fig. 3, a third embodiment of the present application further provides a specific method for inputting the feature vectors of the training image frames in each batch subjected to the weight assignment to the classifier to obtain the classification result of the training video, including:
s301: and respectively inputting the weight-assigned feature vectors corresponding to each batch into the target classifier to obtain a classification result corresponding to each batch.
In a specific implementation, the classification result of each batch can be measured by the classification results of all training image frames in the batch: the more training image frames in the batch belong to a given class, the more likely the batch is to belong to that class rather than to the other classes.
Therefore, the classification result corresponding to each batch can be obtained by adopting the following method:
sequentially inputting the weight-assigned feature vectors corresponding to each batch into a target classifier respectively to obtain a classification result of each training image frame represented by the weight-assigned feature vectors; and taking the classification result corresponding to the maximum number of the training image frames as the classification result of the batch.
For example, suppose training video A includes a primary event and a secondary event, and after its training image frames are divided into batches, the batch numbered 1 contains 64 training image frames. After two rounds of weight assignment are performed on these 64 training image frames, their weight-assigned feature vectors are input into the classifier in turn; 50 training image frames are classified as the primary event and 14 as the secondary event. Since more training image frames belong to the primary event than to the secondary event, the classification result of batch 1 is determined to be the primary event.
S302: and taking the classification result with the largest batch number as the classification result of the training video.
After the classification result corresponding to each batch included in each training video is obtained according to the method of S301, the classification result corresponding to the largest number of batches is used as the classification result of the training video.
For example, training video A includes a primary event and a secondary event, and its training image frames are divided into 20 batches. The weight-assigned feature vectors corresponding to each batch are input into the target classifier to obtain a classification result for each batch: 16 batches are classified as the primary event and 4 batches as the secondary event, so the classification result of the training video is the primary event.
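A sketch of the two-level majority vote of S301 and S302 is given below; `classifier` is assumed to be a module mapping feature vectors to class logits, and the helper names are hypothetical rather than taken from the patent.

```python
from collections import Counter
import torch

def classify_batch(classifier, batch_feats):
    """batch_feats: (n, feat_dim). The batch label is the most common frame label."""
    with torch.no_grad():
        logits = classifier(batch_feats)             # (n, num_classes)
    frame_labels = logits.argmax(dim=-1).tolist()
    return Counter(frame_labels).most_common(1)[0][0]

def classify_video(classifier, video_batches_feats):
    """The video label is the label predicted for the largest number of batches."""
    batch_labels = [classify_batch(classifier, feats) for feats in video_batches_feats]
    return Counter(batch_labels).most_common(1)[0][0]
```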
S105: and training the target neural network, the attention mechanism processing network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video.
Specifically, a fourth embodiment of the present application further provides a specific method for training a target neural network and a target classifier according to a classification result of a training video and a comparison result between labels of the training video, including:
performing the following comparison operation until the classification result of the training video is consistent with the label of the training video;
referring to fig. 4, the alignment operation includes:
s401: comparing whether the classification result of the training video is consistent with the label of the training video; if yes, jumping to S402; if not, jumping to S403;
s402: completing the training of the target neural network, the attention mechanism processing network and the target classifier in the current round; the flow ends.
S403: adjusting parameters of a target neural network, an attention mechanism processing network and a target classifier;
s404: based on the adjusted parameters, extracting new feature vectors for the training image frames in all batches by using a target neural network, and performing at least two rounds of weight assignment on the new feature vectors of the training image frames in each batch by using an attention mechanism processing network; inputting new feature vectors of the training image frames in each batch subjected to the re-weight assignment into a classifier to obtain a new classification result of the training video; and S401 is performed again.
In a specific implementation, before weight assignment is performed on the feature vectors of the training image frames in each batch for the first time, the attention mechanism processing network is initialized by assigning weights randomly. An attention mechanism processing network initialized in this way may reduce the weight of the training image frames belonging to the primary event and increase the weight of those belonging to the secondary events, which affects the accuracy of the final classification result of the training video.
Meanwhile, if the target neural network cannot learn the features in the training image frames well, the accuracy of the final training video classification result is also affected, so the target neural network needs to be trained so that it develops toward better learning of the features in the training image frames. Similarly, the target classifier needs to be trained so that it develops toward classifying the feature vectors correctly.
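The comparison operation of S401 to S404 can be sketched as a simple loop; the use of cross-entropy loss with an Adam optimizer to "adjust parameters", the `max_rounds` cap, and the mean over batches for the video-level prediction are assumptions, since the patent only requires iterating until the classification result matches the label. The `backbone`, `attention`, and `classifier` objects are assumed to be PyTorch modules like those sketched above.

```python
import torch
import torch.nn.functional as F

def train_one_video(backbone, attention, classifier, frame_batches, label,
                    max_rounds=100, lr=1e-4):
    """frame_batches: tensor of shape (m, n, C, H, W); label: integer class index."""
    params = (list(backbone.parameters()) + list(attention.parameters())
              + list(classifier.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    target = torch.tensor([label])
    m, n = frame_batches.shape[:2]
    for _ in range(max_rounds):
        # S404: re-extract feature vectors and re-assign weights with the adjusted parameters
        feats = backbone(frame_batches.flatten(0, 1)).view(m, n, -1)
        weighted, _ = attention(feats)
        logits = classifier(weighted).mean(dim=0, keepdim=True)   # new classification result
        if logits.argmax(dim=-1).item() == label:                 # S401/S402: result matches label
            break
        loss = F.cross_entropy(logits, target)                    # S403: adjust parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```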
When training an event detection model with the training image frames in a training video, the embodiment of the application first divides the training image frames into a plurality of batches and then extracts feature vectors for the training image frames in all the batches using a target neural network. Next, at least two rounds of weight assignment are performed on the feature vectors of the training image frames in each batch using an attention mechanism processing network, so that the weight of the feature vectors corresponding to the training image frames belonging to the main event in the training video is increased, while the weight of the feature vectors corresponding to the training image frames not belonging to the main event is reduced. In the process of training the event detection model based on the weight-assigned feature vectors, the model can therefore learn the features of the training image frames belonging to the main event well, which ensures the accuracy of the finally obtained event detection model. Meanwhile, reducing the weight of the feature vectors corresponding to the training image frames not belonging to the main event means that the element values in those feature vectors are correspondingly reduced, with some elements even set directly to zero, so a large amount of calculation is saved when the event detection model is trained on these feature vectors. This reduces the computation required in the event detection model training process and the consumption of computing resources and training time.
Referring to fig. 5, a fifth embodiment of the present application further provides another event detection model training method, including:
s501: acquiring training image frames in a plurality of training videos with labels, and dividing the training image frames into a plurality of batches; each batch includes a preset number of training image frames.
Here, similar to S101, the details are described in S101, and are not described herein again.
S502: feature vectors are extracted for the training image frames in all batches using the target neural network.
Here, similar to S102, the details are described in S102, and are not described herein again.
S503: and respectively splicing the feature vectors of the training image frames in each batch to form spliced feature vectors.
In a specific implementation, the feature vectors of the training image frames in each batch are stitched together; the resulting stitched feature vector is in effect a higher-dimensional feature vector formed from the feature vectors of multiple training image frames.
Specifically, since the training image frames belonging to the same training video have the same size, the feature vectors of all the obtained training image frames have the same dimension. When the feature vectors of the training image frames in each batch are stitched to form a stitched feature vector, the stitching can be performed either transversely or longitudinally. For example, when the feature vector of a training image frame has dimension 1 × 512, longitudinally stitching the feature vectors of 10 training image frames yields a 10 × 512 result, while transversely stitching them yields a 1 × 5120 result.
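A minimal NumPy sketch reproducing the stitching example above, with randomly generated stand-in data:

```python
import numpy as np

frame_feats = [np.random.rand(1, 512) for _ in range(10)]   # ten 1x512 frame feature vectors

vertical = np.concatenate(frame_feats, axis=0)    # shape (10, 512): longitudinal stitching
horizontal = np.concatenate(frame_feats, axis=1)  # shape (1, 5120): transverse stitching
```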
S504: and performing at least two rounds of weight assignment on the splicing vectors corresponding to each batch by using an attention mechanism processing network.
Here, the method of performing at least two rounds of weight assignment on the stitching vectors corresponding to each batch is similar to S103, and is not described herein again.
S505: and inputting the splicing feature vectors subjected to weight assignment and corresponding to each batch into a target classifier to obtain a classification result of the training video.
Here, the classification result of the training video may be obtained by the method of S104 described above. In addition, unlike S104, since the weight-assigned stitched feature vector is in effect one large vector, the target classifier can directly classify the batch characterized by that stitched feature vector and obtain the classification result of the batch. Because the weights of the training image frames belonging to secondary events have already been reduced and those belonging to the primary event increased in S504, the interference of the secondary-event frames with the whole stitched feature vector is reduced, and the batch can be classified accurately. After the classification results of the batches are obtained, the classification result corresponding to the largest number of batches is taken as the classification result of the training video.
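The batch-level classification of S505 can be sketched as follows; `classifier` is assumed to accept the weight-assigned stitched vector of a batch directly (for example, a vector of shape (1, 5120)), and the names are illustrative only.

```python
from collections import Counter
import torch

def classify_video_from_stitched(classifier, stitched_per_batch):
    """stitched_per_batch: list of weight-assigned stitched vectors, one per batch.
    The video label is the class predicted for the largest number of batches."""
    batch_labels = []
    with torch.no_grad():
        for stitched in stitched_per_batch:
            batch_labels.append(classifier(stitched).argmax(dim=-1).item())
    return Counter(batch_labels).most_common(1)[0][0]
```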
S506: and training the target neural network, the attention mechanism processing network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video.
Here, similar to S105, the details are described in S105, and are not described herein again.
In the fifth embodiment of the application, the feature vectors of the training image frames in each batch are stitched into stitched feature vectors, and the subsequent operations are based on these stitched feature vectors. The stitched feature vectors better reflect the weight occupied by the training image frames within each batch, and classifying each batch directly from its stitched feature vector, instead of first classifying every training image frame from its own feature vector and then deriving the batch result from those per-frame results, reduces the number of classification operations and thus further reduces the amount of calculation.
Based on the same inventive concept, an event detection model training device corresponding to the event detection model training method is also provided in the embodiments of the present application, and because the principle of solving the problem of the device in the embodiments of the present application is similar to the event detection model training method described above in the embodiments of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 6, a sixth embodiment of the present application provides an event detection model training apparatus, including:
an obtaining module 61, configured to obtain training image frames in a plurality of training videos with labels, and divide the training image frames into a plurality of batches; each batch comprises a preset number of training image frames;
a feature extraction module 62, configured to extract feature vectors for the training image frames in all batches using the target neural network;
an attention mechanism processing module 63, configured to perform at least two rounds of weight assignment on the feature vectors of the training image frames in each batch using an attention mechanism processing network;
the classification module 64 is configured to input the feature vectors of the training image frames in each batch subjected to the weight assignment to the target classifier, so as to obtain a classification result of the training video;
and the training module 65 is configured to train the target neural network, the attention mechanism processing network, and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video.
When training an event detection model with the training image frames in a training video, the embodiment of the application first divides the training image frames into a plurality of batches and then extracts feature vectors for the training image frames in all the batches using a target neural network. Next, at least two rounds of weight assignment are performed on the feature vectors of the training image frames in each batch using an attention mechanism processing network, so that the weight of the feature vectors corresponding to the training image frames belonging to the main event in the training video is increased, while the weight of the feature vectors corresponding to the training image frames not belonging to the main event is reduced. In the process of training the event detection model based on the weight-assigned feature vectors, the model can therefore learn the features of the training image frames belonging to the main event well, which ensures the accuracy of the finally obtained event detection model. Meanwhile, reducing the weight of the feature vectors corresponding to the training image frames not belonging to the main event means that the element values in those feature vectors are correspondingly reduced, with some elements even set directly to zero, so a large amount of calculation is saved when the event detection model is trained on these feature vectors. This reduces the computation required in the event detection model training process and the consumption of computing resources and training time.
Optionally, the obtaining module 61 is specifically configured to: acquiring a plurality of training videos with labels;
sampling a training video according to a preset sampling frequency;
and taking the image obtained by sampling each training video as a training image frame in the training video.
Optionally, the attention mechanism processing module 63 is specifically configured to perform weight assignment on the feature vectors of the training image frames in each batch respectively by using the attention mechanism processing network with the feature vectors as the granularity, and perform weight assignment on each batch respectively by using the attention mechanism processing network with the batches as the granularity.
Optionally, with the feature vector as a granularity, performing weight assignment on the feature vectors of the training image frames in each batch respectively by using an attention mechanism processing network, and the obtained weight assignment result a (i) of the ith batch satisfies formula (1):
(1)a(i)=tanh(W1F1+W2F2+…+WnFn+c);
wherein n represents the number of training image frames in the ith batch; w1To WnRespectively representing the weight corresponding to the 1 st training image frame to the nth training image frame in each batch; f1To FnRepresenting the characteristic vectors corresponding to the 1 st to the nth training image frames in each batch respectively; c represents the bias execution items when the batches are taken as the granularity and the attention mechanism processing network is used for respectively carrying out weight assignment on each batch; tanh represents an activation function;
with the batches as the granularity, respectively carrying out weight assignment on each batch by using an attention mechanism processing network, and obtaining a weight assignment result b (j) of the jth batch, wherein the weight assignment result b (j) meets the formula (2):
(2)b(j)=M1a(1)+M2a(2)+…+Mma(m)+d;
M1to MmRepresenting the weight corresponding to the 1 st to the m-th batches respectively; d represents the bias execution item when the batch is taken as the granularity and the attention mechanism processing network is used for respectively carrying out weight assignment on each batch;
the attention mechanism processing module 63 is further configured to normalize the weight assignment result of each batch after performing weight assignment on each batch respectively using the attention mechanism processing network with the batch as the granularity.
Optionally, the classification module 64 is specifically configured to: respectively inputting the weight-assigned feature vectors corresponding to each batch into a target classifier to obtain a classification result corresponding to each batch;
and taking the classification result corresponding to the largest number of batches as the classification result of the training video.
Optionally, the classifying module 64 is specifically configured to input the weight-assigned feature vectors corresponding to each batch into the target classifier respectively through the following steps to obtain a classification result corresponding to each batch:
sequentially inputting the weight-assigned feature vectors corresponding to each batch into a target classifier respectively to obtain a classification result of each training image frame represented by the weight-assigned feature vectors;
and taking the classification result corresponding to the maximum number of the training image frames as the classification result of the batch.
Optionally, the method further comprises: the splicing module 66 is configured to splice the feature vectors of the training image frames in each batch to form spliced feature vectors;
attention mechanism processing module 63 is also operable to: performing at least two rounds of weight assignment on the splicing vectors corresponding to each batch by using an attention mechanism processing network;
the classification module 64 is further configured to: and inputting the splicing feature vectors subjected to weight assignment and corresponding to each batch into a target classifier to obtain a classification result of the training video.
Optionally, the classification module 64 is specifically configured to input the spliced feature vectors subjected to weight assignment and corresponding to each batch to the target classifier by using the following steps to obtain a classification result of the training video:
inputting the spliced eigenvectors subjected to weight assignment and corresponding to each batch into a target classifier respectively to obtain a classification result corresponding to each batch;
and taking the classification result corresponding to the largest number of batches as the classification result of the training video.
Optionally, the training module 65 is specifically configured to: performing the following comparison operation until the classification result of the training video is consistent with the label of the training video;
the comparison operation comprises the following steps:
comparing the classification result of the training video with the label of the training video;
if the classification result of the training video is inconsistent with the label of the training video, adjusting parameters of a target neural network, an attention mechanism processing network and a target classifier;
based on the adjusted parameters, extracting new feature vectors for the training image frames in all batches by using a target neural network, and performing at least two rounds of weight assignment on the new feature vectors of the training image frames in each batch by using an attention mechanism processing network;
inputting the new feature vectors of the training image frames in each batch subjected to the re-weight assignment into a classifier to obtain a new classification result of the training video;
and performing the comparison operation again.
Referring to fig. 7, a seventh embodiment of the present application further provides an event detection method, including:
s701: acquiring a video to be detected;
s702: inputting a video to be detected into an event detection model obtained by the event detection model training method of any one of the embodiments of the application, and obtaining a classification result of the video to be detected;
wherein the event detection model comprises: a target neural network, an attention mechanism processing network, and a target classifier.
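Purely as an illustrative sketch (the model wrapper and its method names are assumptions, not part of the method), detection on a video whose image frames have already been divided into batches could proceed as follows:

```python
def detect_event(frame_batches, model):
    """Classify a video to be detected from its batched image frames; `model`
    bundles the target neural network, the attention mechanism processing
    network and the target classifier (hypothetical wrapper)."""
    features = [model.extract_features(batch) for batch in frame_batches]
    weighted = model.assign_weights(features)
    return model.classify(weighted)   # classification result of the video
```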
An eighth embodiment of the present application further provides an event detection apparatus, including:
the to-be-detected video acquisition module is used for acquiring a to-be-detected video;
the event detection module is used for inputting the video to be detected into the event detection model obtained by the event detection model training method of any embodiment of the application to obtain the classification result of the video to be detected;
wherein the event detection model comprises: a target neural network, an attention mechanism processing network, and a target classifier.
Corresponding to the event detection model training method in fig. 1, an embodiment of the present invention further provides a computer device, as shown in fig. 8, the computer device includes a memory 1000, a processor 2000 and a computer program stored in the memory 1000 and executable on the processor 2000, where the processor 2000 implements the steps of the event detection model training method when executing the computer program.
Specifically, the memory 1000 and the processor 2000 may be a general-purpose memory and a general-purpose processor, which are not specifically limited herein. When the processor 2000 runs the computer program stored in the memory 1000, the event detection model training method is executed. This addresses the problem that directly training an event detection model on training videos requires a large amount of calculation and consumes excessive calculation resources and training time in order to ensure model accuracy, and thereby reduces the amount of calculation required in the training process and the consumption of calculation resources and training time without affecting model accuracy.
Corresponding to the event detection model training method in fig. 1, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above event detection model training method are performed.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the event detection model training method is executed. This addresses the problem that directly training an event detection model on training videos requires a large amount of calculation and consumes excessive calculation resources and training time in order to ensure model accuracy, and thereby reduces the amount of calculation required in the training process and the consumption of calculation resources and training time without affecting model accuracy.
The computer program product of the event detection model training method and the event detection method provided in the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments, and specific implementations may refer to the method embodiments and are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. An event detection model training method, comprising:
acquiring training image frames in a plurality of training videos with labels, and dividing the training image frames into a plurality of batches; each batch comprises a preset number of training image frames; each training video comprises at least one event, training image frames included by different events are divided into different batches, and the events comprise primary events and secondary events;
extracting feature vectors for the training image frames in all batches by using a target neural network;
performing at least two rounds of weight assignment on the feature vectors of the training image frames in each batch by using an attention mechanism processing network;
inputting the feature vectors of the training image frames in each batch subjected to weight assignment into a target classifier to obtain a classification result of the training video;
training the target neural network, the attention mechanism processing network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video;
wherein performing weight assignment on the feature vectors of the training image frames in each batch by using the attention mechanism processing network specifically comprises:
and respectively carrying out weight assignment on the feature vectors of the training image frames in each batch by using the attention mechanism processing network with the feature vectors as the granularity, and respectively carrying out weight assignment on each batch by using the attention mechanism processing network with the batches as the granularity.
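By way of a non-authoritative illustration (not part of the claim; the event annotations and batch size are assumptions introduced for the example), dividing training image frames into batches such that frames belonging to different events fall into different batches could be sketched as:

```python
def divide_into_batches(frames, events, batch_size):
    """frames: list of training image frames; events: the event (primary or
    secondary) each frame belongs to. Frames of different events are never
    mixed within one batch; each batch holds at most `batch_size` frames."""
    by_event = {}
    for frame, event in zip(frames, events):
        by_event.setdefault(event, []).append(frame)
    batches = []
    for event_frames in by_event.values():
        for i in range(0, len(event_frames), batch_size):
            batches.append(event_frames[i:i + batch_size])
    return batches
```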
2. The method according to claim 1, wherein the acquiring training image frames in the plurality of labeled training videos specifically comprises:
acquiring a plurality of training videos with labels;
sampling the training video according to a preset sampling frequency;
and taking the image obtained by sampling each training video as a training image frame in the training video.
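As an illustrative sketch only (not part of the claim; OpenCV is merely one possible tool and its use here is an assumption), sampling a training video at a preset sampling frequency could be done along the following lines:

```python
import cv2  # assumption: OpenCV is used only for illustration

def sample_frames(video_path, frames_per_second=1.0):
    """Sample a labelled training video at a preset frequency and return the
    sampled images as training image frames."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0          # fall back if FPS is unknown
    step = max(int(round(fps / frames_per_second)), 1)   # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```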
3. The method according to claim 1, wherein, with the feature vectors as the granularity, the feature vectors of the training image frames in each batch are respectively assigned weights by using the attention mechanism processing network, and the obtained weight assignment result a(i) of the i-th batch satisfies formula (1):
(1) a(i) = tanh(W_1·F_1 + W_2·F_2 + … + W_n·F_n + c);
wherein n denotes the number of training image frames in the i-th batch; W_1 to W_n respectively denote the weights corresponding to the 1st to the n-th training image frames in each batch; F_1 to F_n respectively denote the feature vectors corresponding to the 1st to the n-th training image frames in each batch; c denotes the bias term used when the feature vectors of the training image frames in each batch are assigned weights with the feature vectors as the granularity; tanh denotes the activation function;
with the batches as the granularity, each batch is respectively assigned a weight by using the attention mechanism processing network, and the obtained weight assignment result b(j) of the j-th batch satisfies formula (2):
(2) b(j) = M_1·a(1) + M_2·a(2) + … + M_m·a(m) + d;
wherein M_1 to M_m respectively denote the weights corresponding to the 1st to the m-th batches; d denotes the bias term used when each batch is assigned a weight with the batches as the granularity;
after each batch is respectively assigned a weight with the batches as the granularity by using the attention mechanism processing network, the method further comprises: normalizing the weight assignment result of each batch.
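For concreteness only (not part of the claim), a minimal NumPy sketch of the two-granularity weight assignment of formulas (1) and (2), followed by a normalization step (L2 normalization is one possible choice and an assumption), might read:

```python
import numpy as np

def assign_weights(batch_features, frame_weights, batch_weights, c=0.0, d=0.0):
    """batch_features: list of m arrays, each (n_i, feature_dim) -- F_1..F_n per batch.
    frame_weights:  list of m arrays, each (n_i,)               -- W_1..W_n per batch.
    batch_weights:  array of shape (m,)                          -- M_1..M_m."""
    # Formula (1): a(i) = tanh(W_1*F_1 + ... + W_n*F_n + c), feature vectors as granularity.
    a = [np.tanh(np.tensordot(W, F, axes=1) + c)
         for W, F in zip(frame_weights, batch_features)]
    # Formula (2): b = M_1*a(1) + ... + M_m*a(m) + d, batches as granularity.
    b = sum(M * a_i for M, a_i in zip(batch_weights, a)) + d
    # Normalize the batch-level assignment result (assumed choice: L2 normalization).
    b = b / (np.linalg.norm(b) + 1e-12)
    return a, b
```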
4. The method according to claim 1, wherein the inputting the feature vectors of the training image frames in each batch subjected to the weight assignment to a classifier to obtain the classification result of the training video comprises:
respectively inputting the weight-assigned feature vectors corresponding to each batch into a target classifier to obtain a classification result corresponding to each batch;
and taking the classification result obtained for the largest number of batches as the classification result of the training video.
5. The method according to claim 4, wherein the step of inputting the weight-assigned feature vectors corresponding to the respective batches into the target classifier respectively to obtain the classification result corresponding to each batch comprises:
sequentially inputting the weight-assigned feature vectors corresponding to each batch into the target classifier to obtain a classification result for each training image frame represented by the weight-assigned feature vectors;
and taking the classification result shared by the largest number of training image frames as the classification result of the batch.
6. The method of claim 1, further comprising:
respectively splicing the feature vectors of the training image frames in each batch to form spliced feature vectors;
the using of the attention mechanism processing network for performing at least two rounds of weight assignment on the feature vectors of the training image frames in each batch specifically comprises the following steps:
performing at least two rounds of weight assignment on the spliced feature vectors corresponding to each batch by using the attention mechanism processing network;
the inputting the feature vectors of the training image frames in each batch subjected to the weight assignment to a classifier to obtain the classification result of the training video specifically comprises:
inputting the weight-assigned spliced feature vectors corresponding to each batch into the target classifier to obtain the classification result of the training video.
7. The method according to claim 6, wherein the inputting the weight-assigned spliced feature vectors corresponding to each batch into a target classifier to obtain the classification result of the training video specifically comprises:
inputting the weight-assigned spliced feature vectors corresponding to each batch into the target classifier respectively to obtain a classification result corresponding to each batch;
and taking the classification result obtained for the largest number of batches as the classification result of the training video.
8. The method according to claim 1, wherein the training the target neural network and the target classifier according to the classification result of the training video and the comparison result between the labels of the training video specifically comprises:
performing the following comparison operation until the classification result of the training video is consistent with the label of the training video;
the comparison operation comprises the following steps:
comparing the classification result of the training video with the label of the training video;
if the classification result of the training video is inconsistent with the label of the training video, adjusting parameters of the target neural network, the attention mechanism processing network and the target classifier;
based on the adjusted parameters, extracting new feature vectors for the training image frames in all batches by using the target neural network, and performing at least two rounds of weight assignment again on the new feature vectors of the training image frames in each batch by using the attention mechanism processing network;
inputting the re-weighted new feature vectors of the training image frames in each batch into the target classifier to obtain a new classification result of the training video;
and performing the comparison operation again.
9. An event detection method, comprising:
acquiring a video to be detected;
inputting the video to be detected into an event detection model obtained by the event detection model training method according to any one of claims 1 to 8, and obtaining a classification result of the video to be detected;
wherein the event detection model comprises: the target neural network, the attention mechanism processing network, and the target classifier.
CN201810297702.9A 2018-03-30 2018-03-30 Event detection model training method and event detection method Active CN108334910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810297702.9A CN108334910B (en) 2018-03-30 2018-03-30 Event detection model training method and event detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810297702.9A CN108334910B (en) 2018-03-30 2018-03-30 Event detection model training method and event detection method

Publications (2)

Publication Number Publication Date
CN108334910A CN108334910A (en) 2018-07-27
CN108334910B true CN108334910B (en) 2020-11-03

Family

ID=62933866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810297702.9A Active CN108334910B (en) 2018-03-30 2018-03-30 Event detection model training method and event detection method

Country Status (1)

Country Link
CN (1) CN108334910B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101913A (en) * 2018-08-01 2018-12-28 北京飞搜科技有限公司 Pedestrian recognition methods and device again
CN110969066B (en) * 2018-09-30 2023-10-10 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN111222624B (en) * 2018-11-26 2022-04-29 深圳云天励飞技术股份有限公司 Parallel computing method and device
CN111950332B (en) * 2019-05-17 2023-09-05 杭州海康威视数字技术股份有限公司 Video time sequence positioning method, device, computing equipment and storage medium
CN110738103A (en) * 2019-09-04 2020-01-31 北京奇艺世纪科技有限公司 Living body detection method, living body detection device, computer equipment and storage medium
CN110782021B (en) * 2019-10-25 2023-07-14 浪潮电子信息产业股份有限公司 Image classification method, device, equipment and computer readable storage medium
CN111428771B (en) * 2019-11-08 2023-04-18 腾讯科技(深圳)有限公司 Video scene classification method and device and computer-readable storage medium
CN111767985B (en) * 2020-06-19 2022-07-22 深圳市商汤科技有限公司 Neural network training method, video identification method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN107341462A (en) * 2017-06-28 2017-11-10 电子科技大学 A kind of video classification methods based on notice mechanism
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model
CN107784293A (en) * 2017-11-13 2018-03-09 中国矿业大学(北京) A kind of Human bodys' response method classified based on global characteristics and rarefaction representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jeff Donahue et al.; "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description"; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2017-04-01; Vol. 39, No. 4; pp. 677-690 *

Also Published As

Publication number Publication date
CN108334910A (en) 2018-07-27

Similar Documents

Publication Publication Date Title
CN108334910B (en) Event detection model training method and event detection method
CN108491817B (en) Event detection model training method and device and event detection method
CN110633745B (en) Image classification training method and device based on artificial intelligence and storage medium
US11594070B2 (en) Face detection training method and apparatus, and electronic device
CN110533097B (en) Image definition recognition method and device, electronic equipment and storage medium
US11531874B2 (en) Regularizing machine learning models
US20170364742A1 (en) Lip-reading recognition method and apparatus based on projection extreme learning machine
CN110647829A (en) Bill text recognition method and system
CN110502976B (en) Training method of text recognition model and related product
CN108647571B (en) Video motion classification model training method and device and video motion classification method
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
CN110096617B (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
US20230259739A1 (en) Image detection method and apparatus, computer-readable storage medium, and computer device
CN112329476B (en) Text error correction method and device, equipment and storage medium
CN110175657B (en) Image multi-label marking method, device, equipment and readable storage medium
CN113505797B (en) Model training method and device, computer equipment and storage medium
CN109960791A (en) Judge the method and storage medium, terminal of text emotion
CN108549857B (en) Event detection model training method and device and event detection method
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN113239914A (en) Classroom student expression recognition and classroom state evaluation method and device
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN111770352B (en) Security detection method and device, electronic equipment and storage medium
CN112861601A (en) Method for generating confrontation sample and related equipment
CN108665455B (en) Method and device for evaluating image significance prediction result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 101-8, 1st floor, building 31, area 1, 188 South Fourth Ring Road West, Fengtai District, Beijing
Applicant after: Guoxin Youyi Data Co., Ltd
Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing
Applicant before: SIC YOUE DATA Co.,Ltd.
GR01 Patent grant