CN112507860A - Video annotation method, device, equipment and storage medium - Google Patents

Video annotation method, device, equipment and storage medium Download PDF

Info

Publication number
CN112507860A
CN112507860A (Application CN202011409894.1A)
Authority
CN
China
Prior art keywords
video
behavior
event
sequence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011409894.1A
Other languages
Chinese (zh)
Inventor
黎阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202011409894.1A priority Critical patent/CN112507860A/en
Publication of CN112507860A publication Critical patent/CN112507860A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video annotation method, apparatus, device and storage medium. The method comprises: performing frame extraction on a video to be annotated to form at least one video sequence; taking each video sequence as input data and performing target detection and event recognition through a pre-trained target behavior recognition model to obtain a candidate behavior data set; and determining the behavior event information contained in the video to be annotated according to each candidate behavior data in the candidate behavior data set, and annotating the video according to each piece of behavior event information. The method solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video: after the video data volume is compressed by frame extraction, target detection and event recognition are performed on the video content simultaneously, and the video to be annotated is labeled accurately according to the recognition results, thereby improving computational efficiency and annotation accuracy.

Description

Video annotation method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a video annotation method, a video annotation device, video annotation equipment and a storage medium.
Background
With the progress and development of society, public safety and public management have become some of the most pressing concerns. Video surveillance systems play a vital role in maintaining a stable and safe daily life. Ubiquitous cameras, pedestrians' mobile phones, vehicle dashboard cameras and the like can record what happens around them, and the captured videos can serve as references and evidence for handling emergencies and imposing penalties.
In the field of traffic management, the law enforcement recorder is a portable intelligent evidence-collection device integrating real-time video recording, photographing, audio recording and other functions. It is widely used in the law enforcement tasks of public security, traffic, urban management, judicial and other agencies, and plays an important role in recording the law enforcement process in real time and enabling accurate retrospective evidence collection afterwards. In practice, however, law enforcement recorders are numerous and the volume of video data is huge; moreover, because the recorders are wearable devices, the collected video pictures usually shake severely, which greatly increases the difficulty of labeling, managing and analyzing the video data. How to automatically analyze and organize the large amount of law enforcement recorder video is therefore a problem that urgently needs to be solved.
At present, many machine learning models classify target behaviors by recognizing the original video directly, a brute-force classification approach. However, real-life scenes are highly complex; this approach easily ignores much of the effective information in the video, requires extremely high computational cost, and yields models with weak robustness and generalization ability.
Disclosure of Invention
The invention provides a video labeling method, a video labeling device, video labeling equipment and a storage medium, which are used for quickly and accurately identifying and labeling behavior events in videos.
In a first aspect, an embodiment of the present invention provides a video annotation method, including:
performing frame extraction processing on a video to be annotated to form at least one video sequence;
taking each video sequence as input data, and carrying out target detection and event recognition through a pre-trained target behavior recognition model to obtain a candidate behavior data set;
and determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set, and labeling the video to be labeled according to each behavior event information.
Optionally, the frame extraction processing is performed on the video to be annotated to form at least one video sequence, and the method includes:
taking the first frame of the video to be annotated as an initial sequence frame, extracting sequence frames at a preset frame interval, and dividing all the extracted sequence frames equally to form at least one video sequence; or,
dividing the video to be annotated into at least one sub-video at equal intervals, dividing each sub-video into a plurality of equal video intervals according to a preset number of sequence frames, randomly extracting one sequence frame from each video interval, and forming one video sequence from the sequence frames of each sub-video respectively.
Optionally, the taking each video sequence as input data, performing target detection and event recognition through a pre-trained target behavior recognition model, and obtaining a candidate behavior data set includes:
for each video sequence, extracting shared three-dimensional features of the video sequence;
performing dimensionality reduction on the shared three-dimensional features to obtain target detection two-dimensional features, and performing target detection according to the target detection two-dimensional features to obtain target area information;
according to the shared three-dimensional feature and the target area information, carrying out event identification on the video sequence;
and when the video sequence contains a preset behavior event, forming corresponding candidate behavior data, and adding the candidate behavior data to a candidate behavior data set.
Optionally, the extracting shared three-dimensional features of the video sequence includes:
respectively carrying out high-frequency sampling and low-frequency sampling on the video sequence to obtain a high-sampling sequence and a low-sampling sequence, wherein the sampling frequency of the high-frequency sampling is greater than that of the low-frequency sampling;
inputting the high sampling sequence into a time sequence feature extraction channel to obtain a time sequence feature;
inputting the low sampling sequence into a spatial feature extraction channel to obtain spatial features;
and fusing the time sequence characteristic and the space characteristic to obtain a shared three-dimensional characteristic.
Optionally, the training process of the target behavior recognition model includes:
carrying out target area labeling and event type labeling on the training video image to obtain a standard target area and a standard event type;
inputting the training video image into a behavior recognition model to be trained to obtain an output actual target area and an actual event type;
obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area and the actual event type;
and performing back propagation on the behavior recognition model to be trained through the fitting loss function to obtain the target behavior recognition model.
Optionally, the obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area, and the actual event type includes:
respectively substituting the standard target area, the standard event type, the actual target area and the actual event type into at least two given loss function expressions to obtain corresponding loss functions;
and performing weighted fusion on each loss function to obtain a fitting loss function.
Optionally, the candidate behavior data includes a candidate event type and a candidate event occurrence time;
correspondingly, the determining, according to each candidate behavior data in the candidate behavior data set, behavior event information included in the video to be labeled, and labeling the video to be labeled according to each behavior event information includes:
combining the candidate behavior data with the same candidate event type and continuous candidate event occurrence time to form behavior event information corresponding to each candidate event type, wherein the behavior event information comprises corresponding candidate event types and event occurrence time periods;
for each behavior event information, determining a target event corresponding to the included candidate event type, and determining a target video segment corresponding to the video to be annotated in the event occurrence time period;
and marking the target event for the target video segment.
In a second aspect, an embodiment of the present invention further provides a video annotation device, where the device includes:
the video sequence determination module is used for performing frame extraction processing on a video to be annotated to form at least one video sequence;
the behavior recognition module is used for performing target detection and event recognition on each video sequence as input data through a pre-trained target behavior recognition model to obtain a candidate behavior data set;
and the video labeling module is used for determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set and labeling the video to be labeled according to each behavior event information.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the video annotation method according to any embodiment of the present invention.
In a fourth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the steps of the video annotation method according to any of the embodiments of the present invention.
According to the invention, frame extraction is performed on the video to be annotated, the resulting video sequences are used as input data, and target detection and event recognition are performed through a pre-trained target behavior recognition model to obtain a candidate behavior data set; the behavior event information contained in the video to be annotated is determined according to each candidate behavior data in the candidate behavior data set, and the video is annotated according to each piece of behavior event information. This solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video: frame extraction compresses the video data volume, target detection and event recognition are performed on the video content simultaneously, and the video to be annotated is labeled accurately according to the recognition results, thereby improving computational efficiency and annotation accuracy.
Drawings
Fig. 1 is a flowchart of a video annotation method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a training procedure of a target behavior recognition model in a video annotation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a video annotation method according to a second embodiment of the present invention;
fig. 4 is a flowchart illustrating steps of forming a shared three-dimensional feature in a video annotation method according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a video annotation method according to a second embodiment of the present invention;
FIG. 6 is a block diagram of a video annotation apparatus according to a third embodiment of the present invention;
fig. 7 is a block diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only a part of the structures related to the present invention, not all of the structures, are shown in the drawings, and furthermore, embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of a video annotation method according to an embodiment of the present invention, where the embodiment is applicable to a case of identifying and annotating a behavior event in a video, and the method may be executed by a video annotation device, and the device may be implemented by software and/or hardware.
As shown in fig. 1, the method specifically includes the following steps:
and 110, performing frame extraction processing on the video to be annotated to form at least one video sequence.
The video to be annotated can be understood as an original video requiring behavior recognition and labeling; it may be video data collected by any video acquisition device such as a surveillance camera, a mobile terminal, a dashboard camera or a law enforcement recorder. A video sequence may be understood as a set consisting of a certain number of video frames.
Specifically, with the popularization of electronic devices, more and more devices can shoot video, and the amount of video data grows correspondingly fast. In order to recognize and label the behavior in the video to be annotated more quickly, frame extraction may be performed on the video according to a preset frame extraction rule, and the extracted video frames may be divided to form a plurality of video sequences that are convenient for behavior recognition.
Optionally, the frame extraction operation in step 110 may be implemented in any one of the following two ways:
taking the first frame of the video to be annotated as an initial sequence frame, extracting sequence frames at a preset frame interval, and dividing all the extracted sequence frames equally to form at least one video sequence; or,
dividing the video to be annotated into at least one sub-video at equal intervals, dividing each sub-video into a plurality of equal video intervals according to a preset number of sequence frames, randomly extracting one sequence frame from each video interval, and forming one video sequence from the sequence frames of each sub-video respectively.
For example, suppose the video to be annotated contains 5000 video frames in total. In the first frame-extraction mode, sequence frames are extracted at a preset frame interval; with an interval of 10 frames, 500 sequence frames are extracted from the video. If each video sequence is preset to contain 20 sequence frames, the 500 extracted frames are divided into groups of 20, finally forming 25 video sequences. In the second frame-extraction mode, the video to be annotated is first divided into a certain number of sub-videos, for example 25 sub-videos of 200 video frames each. Each sub-video is then divided into video intervals according to the preset number of sequence frames per video sequence; with 20 sequence frames per sequence, the 200 frames of each sub-video are divided into 20 video intervals of 10 frames each. One sequence frame is randomly extracted from each video interval, so 20 sequence frames are extracted for each sub-video and form one video sequence; the 25 sub-videos thus correspond to 25 video sequences.
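As an illustration only, the two frame-extraction modes described above might be sketched as follows in Python, assuming OpenCV is available for decoding the video; the interval sizes, sequence lengths and function names are illustrative and not part of the patent:

```python
import random
import cv2  # OpenCV is assumed to be available for decoding the video


def extract_sequences_fixed_interval(video_path, interval=10, frames_per_sequence=20):
    """Mode 1: start from the first frame, take every `interval`-th frame,
    then divide the extracted frames equally into video sequences."""
    capture = cv2.VideoCapture(video_path)
    sampled, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % interval == 0:  # the first frame is the initial sequence frame
            sampled.append(frame)
        index += 1
    capture.release()
    # Divide the sampled frames equally into video sequences.
    return [sampled[i:i + frames_per_sequence]
            for i in range(0, len(sampled) - frames_per_sequence + 1, frames_per_sequence)]


def extract_sequences_random_in_intervals(frames, num_sub_videos=25, frames_per_sequence=20):
    """Mode 2: split the decoded frame list into equal sub-videos, split each
    sub-video into `frames_per_sequence` equal video intervals, and randomly
    pick one frame from every interval to build one sequence per sub-video."""
    sub_len = len(frames) // num_sub_videos
    sequences = []
    for s in range(num_sub_videos):
        sub_video = frames[s * sub_len:(s + 1) * sub_len]
        interval_len = len(sub_video) // frames_per_sequence
        sequence = [random.choice(sub_video[i * interval_len:(i + 1) * interval_len])
                    for i in range(frames_per_sequence)]
        sequences.append(sequence)
    return sequences
```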
And step 120, taking each video sequence as input data, and performing target detection and event recognition through a pre-trained target behavior recognition model to obtain a candidate behavior data set.
Specifically, the target behavior recognition model may be trained in advance on a large training data set and used to detect targets in a video image, recognize the corresponding events, and analyze the behavior type of the target person in the video image. In this embodiment, each video sequence is used as input data and the target behavior recognition model performs behavior recognition on each video sequence separately; note that when the video to be annotated is divided into video sequences, a complete behavior event may be split across several different video sequences.
Optionally, fig. 2 is a flowchart of a training step of a target behavior recognition model in the video annotation method according to an embodiment of the present invention. As shown in fig. 2, the training process of the target behavior recognition model can be implemented by the following steps:
step 1201, performing target area labeling and event type labeling on the training video image to obtain a standard target area and a standard event type.
Wherein, the training video image can be understood as an original sample video for training the behavior recognition model. The standard target area may be understood as an area where the behavioral subject is actually located in the behavioral event of interest in the user in the training video image. The standard event type may be understood as an event type of a behavioral event of interest to a user in the training video image.
In this embodiment, the video data collected by the law enforcement recorder can be taken as an example to classify and label the types of events common in the daily law enforcement process of policemen, such as drunk driving test, violation punishment, accident handling and other events.
Specifically, the training video images in the training data may be labeled. Since the behavior recognition model of this embodiment needs to implement target detection and event recognition at the same time, target area labeling and event type labeling of the target behavior events need to be performed on the training video images respectively. When a training video image is labeled, the target area of the target behavior event can be determined according to the event type of that behavior event; for example, if the target behavior event in the training video image is a drunk driving test, the breath alcohol tester can be taken as the target subject of the drunk driving test, and the area where the alcohol tester is located is labeled as the target area.
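For illustration, one training label produced by this annotation step might be organized as follows; this is only a sketch, and the field names and values are hypothetical rather than taken from the patent:

```python
# Hypothetical structure of one training label; field names and values are illustrative.
annotation = {
    "video_id": "recorder_0001",
    "frame_index": 1530,                           # frame in the training video image
    "standard_event_type": "drunk_driving_test",   # e.g. drunk driving test, violation penalty, accident handling
    "standard_target_area": [412, 228, 503, 310],  # [x_min, y_min, x_max, y_max] of the behavior subject
}
```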
Step 1202, inputting the training video image into the behavior recognition model to be trained, and obtaining an output actual target area and an actual event type.
The behavior recognition model to be trained can be understood as an initially constructed deep learning model for recognizing a target behavior event which is interested in a user in a video image. The actual target area can be understood as a target area in a target behavior event identified by the behavior recognition model to be trained. The actual event type can be understood as the event type of the target behavior event identified by the behavior recognition model to be trained.
Specifically, the set-up behavior recognition model to be trained can be used for recognizing the training video image, and the actual target area and the actual event type corresponding to the target behavior event which is interested in the user in the training video image are output.
Step 1203, obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area and the actual event type.
Specifically, the standard target area and the standard event type are labeled according to the actual video content, while the actual target area and the actual event type are identified and output by the not-yet-trained behavior recognition model, so errors necessarily exist between the standard and actual target areas and between the standard and actual event types; a fitting loss function can be formed from these errors in order to train and tune the parameters of the behavior recognition model.
Further, obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area, and the actual event type may include: respectively substituting the standard target area, the standard event type, the actual target area and the actual event type into at least two given loss function expressions to obtain corresponding loss functions; and performing weighted fusion on the loss functions to obtain a fitting loss function.
Specifically, a plurality of loss function expressions may be given in advance, and the standard target area, the standard event type, the actual target area, and the actual event type are respectively substituted into the corresponding loss function expressions to obtain corresponding loss functions. The weight of each loss function can be preset according to different scenes, and the fitting loss function is obtained after weighted fusion of each loss function.
For example, the given loss function expressions may include the following four, three for target detection and one for event recognition.
Heatmap loss function expression, where the ground-truth heatmap is built from a Gaussian kernel:
$$Y_{xy} = \exp\left(-\frac{(x - p_x)^2 + (y - p_y)^2}{2\sigma^2}\right)$$
$$L_{hm} = -\frac{1}{N}\sum_{x,y}\begin{cases}(1-\hat{Y}_{xy})^{\alpha}\log(\hat{Y}_{xy}), & Y_{xy}=1\\(1-Y_{xy})^{\beta}(\hat{Y}_{xy})^{\alpha}\log(1-\hat{Y}_{xy}), & \text{otherwise}\end{cases}$$
where Y may represent a Gaussian kernel function constructed from the coordinates, N may represent the number of detected key points, σ may represent a standard deviation, α and β are both hyper-parameters, (x, y) may represent the coordinate values of the standard target region, and p may represent the coordinate values of the predicted actual target region.
Embedding loss function expression, which may take a pull/push form over the key-point embeddings e:
$$L_{pull} = \frac{1}{N}\sum_{k=1}^{N}\left[(e_{x_1 y_1}^{(k)} - \hat{e}_k)^2 + (e_{x_2 y_2}^{(k)} - \hat{e}_k)^2\right]$$
$$L_{push} = \frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{j\neq k}\max\left(0,\ \Delta - \left|\hat{e}_k - \hat{e}_j\right|\right)$$
where (x_1, y_1) and (x_2, y_2) may represent the two key points of the predicted actual target area respectively, and $\hat{e}$ may represent a predicted value.
Offset loss function expression:
$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{p} - \left(\frac{p}{R} - \left\lfloor\frac{p}{R}\right\rfloor\right)\right|$$
where p may represent the standard target area, $\hat{p}$ may represent the (predicted) actual target region, and R may represent a sampling frequency, also a hyper-parameter that can be set in advance; generally R is set to 4.
Classification loss function expression:
$$L_{cls} = -\sum_{k=1}^{K} y_k \log(p_k)$$
where K may represent the number of event types, y may represent the label, and p may represent the predicted probability.
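As a rough sketch of the weighted fusion described above, assuming the four loss terms have already been computed as tensors or scalars; the weights and names are illustrative assumptions, not values given in the patent:

```python
def fitting_loss(loss_heatmap, loss_embedding, loss_offset, loss_classification,
                 weights=(1.0, 0.1, 1.0, 1.0)):
    """Weighted fusion of the three detection losses (heatmap, embedding, offset)
    and the event-classification loss into a single fitting loss.
    The weights are illustrative and would be preset per scene."""
    w1, w2, w3, w4 = weights
    return (w1 * loss_heatmap + w2 * loss_embedding
            + w3 * loss_offset + w4 * loss_classification)

# During training, the fitting loss drives back-propagation of the model, e.g.:
# optimizer.zero_grad(); loss = fitting_loss(l_hm, l_emb, l_off, l_cls)
# loss.backward(); optimizer.step()
```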
And 1204, performing back propagation on the behavior recognition model to be trained through a fitting loss function to obtain a target behavior recognition model.
Specifically, after the fitting loss function is obtained, back propagation can be performed on the behavior recognition model to be trained through the fitting loss function; the behavior recognition model is adjusted continuously, and finally the target behavior recognition model is obtained.
And step 130, determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set, and labeling the video to be labeled according to each behavior event information.
The behavior event information can be understood as information data of a complete behavior event in the video to be labeled.
Specifically, a plurality of video sequences are formed after the video to be annotated is frame-extracted and segmented, and the target behavior recognition model outputs the detected candidate behavior data after performing target detection and event recognition on each video sequence, so one complete piece of behavior event information may comprise one or more candidate behavior data. By analyzing each candidate behavior data in the candidate behavior data set, the candidate behavior data belonging to the same target behavior event can be merged into behavior event information; the target behavior event is then located in the video to be annotated according to the behavior event information and labeled accordingly.
In the technical scheme of this embodiment, frame extraction is performed on the video to be annotated, the resulting video sequences are used as input data, and target detection and event recognition are performed through a pre-trained target behavior recognition model to obtain a candidate behavior data set; the behavior event information contained in the video to be annotated is determined according to each candidate behavior data in the candidate behavior data set, and the video is annotated according to each piece of behavior event information. This solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video: frame extraction compresses the video data volume, target detection and event recognition are performed on the video content, and the video to be annotated is labeled accurately according to the recognition results, thereby improving computational efficiency and annotation accuracy.
Example two
Fig. 3 is a flowchart of a video annotation method according to a second embodiment of the present invention. On the basis of the above embodiments, the present embodiment further optimizes the video annotation method.
As shown in fig. 3, the method specifically includes:
step 210, performing frame extraction processing on the video to be annotated to form at least one video sequence.
And step 220, extracting the shared three-dimensional features of the video sequences aiming at each video sequence.
Wherein shared three-dimensional features may be understood as feature data used for both object detection and event recognition.
Specifically, for each video sequence, a preset feature extraction method may be used to extract the shared three-dimensional features corresponding to the video sequence.
Optionally, fig. 4 is a flowchart of a forming step of sharing three-dimensional features in a video annotation method according to a second embodiment of the present invention. As shown in fig. 4, step 220 may be implemented by:
step 2201, respectively performing high-frequency sampling and low-frequency sampling on the video sequence to obtain a high-frequency sampling sequence and a low-frequency sampling sequence, wherein the sampling frequency of the high-frequency sampling is greater than that of the low-frequency sampling.
Specifically, a video scene can mainly be divided into slowly changing static areas and rapidly changing dynamic areas. For example, in a scene where two people meet and shake hands, the images of the hand regions change quickly during the handshake, while the images of the other regions change slowly. In general, the static areas occupy a larger proportion of the whole image and the dynamic areas a smaller one; but for the detection process of behavior recognition, the dynamic areas better reflect the occurrence of the behavior, so recognizing the dynamic areas is relatively more important.
And 2202, inputting the high sampling sequence into a time sequence feature extraction channel to obtain time sequence features.
Specifically, the high-sampling sequence allows feature extraction to focus more on the dynamic change process, forming the corresponding time-sequence features that better reflect the dynamic change information in the video sequence.
Step 2203, inputting the low sampling sequence into the spatial feature extraction channel to obtain the spatial feature.
Specifically, the low-sampling sequence allows feature extraction to focus more on the static regions, forming the corresponding spatial features that better reflect the scene information in the video sequence.
And 2204, fusing the time sequence characteristics and the space characteristics to obtain shared three-dimensional characteristics.
Specifically, the time-sequence features, which embody dynamic change, and the spatial features, which embody the video scene, are fused to form the complete feature information of the video sequence, namely the shared three-dimensional feature.
For example, fig. 5 is a schematic diagram of a video annotation method according to the second embodiment of the present invention. As shown in fig. 5, a SlowFast model based on CenterNet can be used to extract the shared three-dimensional features of the video sequences. An input video sequence is processed with two different sampling frequencies, and the resulting sequences are fed into two channels, a spatial feature extraction channel and a time-sequence feature extraction channel; both channels can perform feature extraction based on a 3D ResNet model. In a continuous video sequence the static-region features change little, so the spatial channel uses a low sampling frequency and contains few sequence frames, while the dynamic-region features change greatly, so the time-sequence channel uses a high sampling frequency and contains many sequence frames; in other words, the spatial feature extraction channel processes fewer frames, and the time-sequence feature extraction channel focuses more on the dynamic change process. The sequence frames involved in the 3D convolutions of the time-sequence channel should be as continuous as possible so that the context information of the video sequence is included, and the number of convolution kernels used should be kept as small as possible to avoid excessive redundant information. In fig. 5, α may represent the multiple between the sampling frequencies of the two channels, β may represent the number of channel convolution kernels, and H, W, C and T may represent the height, width, number of channels and number of frames of the sequence frames, respectively. After the spatial features of the video sequence are extracted through the spatial feature extraction channel and the time-sequence features through the time-sequence feature extraction channel, a lateral connection module is added to combine the two kinds of features effectively and extract the motion information better. The time-sequence features from the time-sequence feature extraction channel can be sent to the spatial feature extraction channel through the lateral connection for feature fusion. Because the sampling frequencies and the numbers of convolution kernels used by the two channels are not consistent, a data dimension conversion is needed; a 3D convolution along the time-sequence direction can be used, which on the one hand reduces the time-sequence dimension while limiting the loss of features, and on the other hand fuses the spatial features with the temporal information. The result is finally sent through an average pooling layer to a fully connected layer, yielding the shared three-dimensional feature that represents the whole video sequence.
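A highly simplified PyTorch sketch of the two-channel sampling and lateral fusion described above is given below; the backbone stand-ins, channel widths and the values of α and β are assumptions for illustration and do not reproduce the exact network of the embodiment:

```python
import torch
import torch.nn as nn


class TwoPathwayFeatureExtractor(nn.Module):
    """Sketch of the dual-channel extraction: a temporal (high-sampling) pathway,
    a spatial (low-sampling) pathway, and a lateral connection that fuses them
    into a shared feature. Real implementations would use 3D ResNet backbones."""

    def __init__(self, alpha=4, beta=8, channels=64):
        super().__init__()
        self.alpha = alpha                      # sampling-rate ratio between the two pathways
        fast_c, slow_c = channels // beta, channels
        # stand-ins for the 3D ResNet backbones of the two channels
        self.fast_path = nn.Conv3d(3, fast_c, kernel_size=3, padding=1)
        self.slow_path = nn.Conv3d(3, slow_c, kernel_size=3, padding=1)
        # lateral connection: 3D conv along time that reduces the fast pathway's
        # temporal dimension so it can be concatenated with the slow pathway
        self.lateral = nn.Conv3d(fast_c, fast_c, kernel_size=(alpha, 1, 1),
                                 stride=(alpha, 1, 1))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(slow_c + fast_c, 256)

    def forward(self, frames):                  # frames: (B, 3, T, H, W)
        fast_in = frames                        # high-frequency sampling keeps all T frames
        slow_in = frames[:, :, ::self.alpha]    # low-frequency sampling keeps every alpha-th frame
        fast = self.fast_path(fast_in)          # time-sequence features
        slow = self.slow_path(slow_in)          # spatial features
        fused = torch.cat([slow, self.lateral(fast)], dim=1)  # lateral fusion
        shared = self.pool(fused).flatten(1)    # average pooling -> shared feature vector
        return self.fc(shared)                  # representation of the whole video sequence
```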
And step 230, performing dimensionality reduction on the shared three-dimensional features to obtain target detection two-dimensional features, and performing target detection according to the target detection two-dimensional features to obtain target area information.
In this embodiment, the shared three-dimensional features obtained through the lateral connection are used for both target detection and event recognition. However, target detection is performed on 2D images, while the obtained shared features are three-dimensional and contain time-series information, so key frames of the video sequence can be selected for target detection.
For example, as shown in fig. 5, after the shared three-dimensional features of a video sequence are extracted, they may be input into a target detection sub-model. The target detection sub-model may select the intermediate frame of each video sequence as the key frame, reduce the time dimension of the shared three-dimensional features to one through a 3D 1 × 1 convolution kernel, and remove that dimension to obtain the two-dimensional features used for target detection, so that the dynamic region of the key frame is detected using the features of the entire video sequence. On the one hand, this improves the model's attention to the dynamic region and steers the direction of feature optimization; on the other hand, combined with the loss function 1, loss function 2 and loss function 3 related to target detection, better event recognition features can be extracted and the generalization and robustness of the network model are increased.
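For illustration, the dimension reduction of the shared three-dimensional feature for key-frame detection might look like the following sketch; the shapes, channel counts and number of target classes are assumptions:

```python
import torch
import torch.nn as nn

# Sketch: collapse the temporal dimension of the shared 3D feature with a 3D
# convolution that is 1x1 spatially, so that a 2D detection head can run on
# the key frame of the sequence. Shapes and channel counts are illustrative.
shared_3d = torch.randn(1, 256, 8, 56, 56)                 # (B, C, T, H, W) shared feature

reduce_time = nn.Conv3d(256, 256, kernel_size=(8, 1, 1))   # spans all T steps, 1x1 spatially
detect_2d = reduce_time(shared_3d).squeeze(2)              # (B, 256, 1, 56, 56) -> (B, 256, 56, 56)

heatmap_head = nn.Conv2d(256, 5, kernel_size=1)            # one channel per target class (illustrative)
center_heatmap = torch.sigmoid(heatmap_head(detect_2d))
print(center_heatmap.shape)                                # torch.Size([1, 5, 56, 56])
```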
Because complete face information of the target person cannot be acquired in most real scenes captured by the law enforcement recorder, a face recognition and tracking algorithm can be introduced when processing video collected by the law enforcement recorder, so that face tracking can be performed on target persons such as traffic police officers and drivers undergoing alcohol tests.
And 240, performing event identification on the video sequence according to the shared three-dimensional characteristics and the target area information.
Specifically, event recognition can be performed according to the detected target area information combined with the shared three-dimensional features: the specific behavior actions occurring in the video sequence are analyzed and predicted, and it is judged whether they include one or more of the target behavior events preset during model training, such as drunk driving tests, violation penalties and accident handling.
For example, as shown in fig. 5, the event recognition sub-model may obtain a shared three-dimensional feature of the video sequence and target area information obtained by the target detection sub-model after performing target detection, and the event recognition sub-model may recognize a behavior event occurring in the video sequence based on the shared three-dimensional feature and the target area information in combination with a loss function 4 related to event recognition.
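A minimal sketch of such an event-recognition sub-model is given below, assuming the shared feature has been pooled to a vector and the target area is encoded as normalized box coordinates; both assumptions are illustrative rather than the exact structure of the embodiment:

```python
import torch
import torch.nn as nn


class EventRecognitionHead(nn.Module):
    """Sketch: combine the pooled shared feature of the sequence with a simple
    encoding of the detected target area and score K preset behavior events.
    The box encoding and layer sizes are illustrative assumptions."""

    def __init__(self, feature_dim=256, num_events=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim + 4, 128),   # shared feature + [x1, y1, x2, y2]
            nn.ReLU(),
            nn.Linear(128, num_events),
        )

    def forward(self, shared_feature, target_box):
        # shared_feature: (B, feature_dim); target_box: (B, 4), normalized to [0, 1]
        return self.classifier(torch.cat([shared_feature, target_box], dim=1))

# scores = EventRecognitionHead()(shared_feature, target_box)
# e.g. scores over drunk driving test / violation penalty / accident handling
```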
And step 250, when the video sequence contains the preset behavior event, forming corresponding candidate behavior data, and adding the candidate behavior data to the candidate behavior data set.
The candidate behavior data may include a candidate event type and a candidate event occurrence time, among others.
Specifically, when the behavior event occurring in the video sequence is one or more of the preset behavior events after the event identification is performed on the video sequence, corresponding one or more candidate behavior data are formed and added to the candidate behavior data set. In this embodiment, one video sequence may include two or more different preset behavior events, for example, two different behavior events of a drunk driving test and an accident handling may be detected simultaneously in one video sequence, so that two candidate behavior data are correspondingly formed, the occurrence time of the candidate event of the two candidate behavior data may be the same, and the candidate event type corresponds to the respective behavior event type.
And step 260, combining the candidate behavior data with consistent candidate event types and continuous candidate event occurrence time to form behavior event information corresponding to each candidate event type.
The behavior event information comprises corresponding candidate event types and event occurrence time periods.
Specifically, the video sequences are formed by frame extraction and segmentation of the video to be annotated, and one candidate behavior data is formed whenever a video sequence contains a preset behavior event; when an entire behavior event in the video to be annotated has been split across a plurality of video sequences, that behavior event therefore corresponds to a plurality of candidate behavior data. Candidate behavior data with consistent candidate event types and continuous candidate event occurrence times can thus be merged into one piece of behavior event information; the event type of the behavior event information is consistent with the candidate event types before merging, and its event occurrence time period can be the union of the corresponding candidate occurrence times before merging.
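For illustration, the merging rule described above might be sketched as follows, assuming each candidate behavior data carries its event type and the start and end time of the video interval it came from; this representation of time is an assumption for the sketch:

```python
def merge_candidates(candidates):
    """Merge candidate behavior data whose event types match and whose
    occurrence times are contiguous into behavior event information.
    Each candidate is assumed to be (event_type, start_time, end_time)."""
    events = []
    for event_type, start, end in sorted(candidates, key=lambda c: (c[0], c[1])):
        last = events[-1] if events else None
        if last and last["event_type"] == event_type and start <= last["end"]:
            last["end"] = max(last["end"], end)   # extend the occurrence period (union)
        else:
            events.append({"event_type": event_type, "start": start, "end": end})
    return events


# Example mirroring the worked example later in this description: the drunk
# driving test candidates from three consecutive sequences merge into one event.
candidates = [
    ("accident_handling", 40, 80),      # from video sequence 2
    ("violation_penalty", 80, 120),     # from video sequence 3
    ("drunk_driving_test", 560, 600),   # from video sequence 15
    ("drunk_driving_test", 600, 640),   # from video sequence 16
    ("drunk_driving_test", 640, 680),   # from video sequence 17
    ("violation_penalty", 640, 680),    # from video sequence 17
]
print(merge_candidates(candidates))     # four pieces of behavior event information
```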
And step 270, determining a target event corresponding to the included candidate event type and determining a target video segment corresponding to the video to be annotated in the event occurrence time period for each behavior event information.
The target event can be understood as a complete behavior event in the video to be labeled.
Specifically, the target event included in the video to be annotated can be determined according to the candidate event type corresponding to the behavior event information, and the target video segment where the target event appears can be positioned in the video to be annotated according to the event occurrence time period.
And step 280, marking the target event for the target video segment.
Specifically, the event type of the target event can be considered consistent with the candidate event type corresponding to the behavior event information, and after the target video segment is obtained, the corresponding event type can be labeled on the target video segment. For convenience of management, the target video segment can also be extracted from the video to be annotated and stored independently.
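As a sketch of extracting and storing one labeled target video segment, assuming OpenCV is used, times are in seconds, and the output format and paths are illustrative assumptions:

```python
import json
import cv2


def export_labeled_segment(video_path, event, output_path, fps=25.0):
    """Cut the target video segment for one behavior event out of the video to
    be annotated and store its label next to it. `event` is assumed to be the
    merged behavior event information {"event_type", "start", "end"}."""
    capture = cv2.VideoCapture(video_path)
    width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    capture.set(cv2.CAP_PROP_POS_FRAMES, int(event["start"] * fps))
    for _ in range(int((event["end"] - event["start"]) * fps)):
        ok, frame = capture.read()
        if not ok:
            break
        writer.write(frame)
    capture.release()
    writer.release()
    # Store the annotation (target event and occurrence period) alongside the clip.
    with open(output_path + ".json", "w") as f:
        json.dump(event, f)
```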
Exemplarily, after the video to be annotated is acquired, frame extraction can be performed on it. Assume the video contains 5000 video frames in total, one frame is extracted every 10 frames, and every 20 extracted frames form a video sequence; 25 video sequences are formed and can be recorded in order as video sequence 1, video sequence 2, ..., video sequence 25. Each video sequence is fed as input data into the pre-trained target behavior recognition model for target detection and event recognition. The model performs target detection on each video sequence, locates the target person or target object appearing in it, recognizes the behavior event of the target person or object over time, and outputs candidate behavior data if the recognized behavior event matches a preset event type. Suppose candidate behavior data 1 is output for video sequence 2 with candidate event type accident handling; candidate behavior data 2 is output for video sequence 3 with candidate event type violation penalty; candidate behavior data 3 is output for video sequence 15 with candidate event type drunk driving test; candidate behavior data 4 is output for video sequence 16 with candidate event type drunk driving test; and candidate behavior data 5 and candidate behavior data 6 are output for video sequence 17, with candidate event types drunk driving test and violation penalty respectively. After all candidate behavior data have been recognized and output, the candidate behavior data with consistent candidate event types and continuous candidate event occurrence times can be merged to form the behavior event information corresponding to each candidate event type. In the above example, no other candidate behavior data shares its event type and is continuous in time with candidate behavior data 1, so candidate behavior data 1 corresponds to behavior event information 1; likewise, candidate behavior data 2 corresponds to behavior event information 2. Candidate behavior data 3, 4 and 5 were determined from video sequences 15, 16 and 17, their occurrence times are continuous and their candidate event types are all drunk driving test, so candidate behavior data 3, 4 and 5 are merged into one piece of behavior event information, behavior event information 3; candidate behavior data 6 correspondingly forms behavior event information 4.
After the behavior event information is obtained, the corresponding target events can be determined and the target video segments labeled in the video to be annotated. In the above example, the target event corresponding to behavior event information 1 is accident handling, and its occurrence time period is the video time interval corresponding to video sequence 2; the target event of behavior event information 2 is a violation penalty, with the time interval of video sequence 3; the target event of behavior event information 3 is a drunk driving test, with the time intervals of video sequences 15, 16 and 17; and the target event of behavior event information 4 is a violation penalty, with the time interval of video sequence 17. The target video segments are located and labeled in the video to be annotated according to the event occurrence time period of each target event.
In the technical scheme of this embodiment, frame extraction is performed on the video to be annotated to form at least one video sequence; for each video sequence, the shared three-dimensional features are extracted and dimension-reduced for target detection; event recognition is performed on the video sequence by combining the obtained target area information with the shared three-dimensional features, forming the corresponding candidate behavior data; finally, the candidate behavior data are merged into corresponding behavior event information, a corresponding target event is determined for each piece of behavior event information, and the target video segment is labeled. This solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video.
EXAMPLE III
The video annotation device provided by the embodiment of the invention can execute the video annotation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. Fig. 6 is a block diagram of a video annotation apparatus according to a third embodiment of the present invention, and as shown in fig. 6, the apparatus includes: a video sequence determination module 310, a behavior recognition module 320, and a video annotation module 330.
The video sequence determining module 310 is configured to perform frame extraction processing on a video to be annotated to form at least one video sequence.
And a behavior identification module 320, configured to use each video sequence as input data, perform target detection and event identification through a pre-trained target behavior identification model, and obtain a candidate behavior data set.
The video tagging module 330 is configured to determine, according to each candidate behavior data in the candidate behavior data set, behavior event information included in the video to be tagged, and tag the video to be tagged according to each behavior event information.
In the technical scheme of this embodiment, frame extraction is performed on the video to be annotated, the resulting video sequences are used as input data, and target detection and event recognition are performed through a pre-trained target behavior recognition model to obtain a candidate behavior data set; the behavior event information contained in the video to be annotated is determined according to each candidate behavior data in the candidate behavior data set, and the video is annotated according to each piece of behavior event information. This solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video: frame extraction compresses the video data volume, target detection and event recognition are performed on the video content simultaneously, and the video to be annotated is labeled accurately according to the recognition results, thereby improving computational efficiency and annotation accuracy.
Optionally, the video sequence determining module 310 is specifically configured to:
taking the first frame of the video to be annotated as an initial sequence frame, extracting sequence frames at a preset frame interval, and dividing all the extracted sequence frames equally to form at least one video sequence; or,
dividing the video to be annotated into at least one sub-video at equal intervals, dividing each sub-video into a plurality of equal video intervals according to a preset number of sequence frames, randomly extracting one sequence frame from each video interval, and forming one video sequence from the sequence frames of each sub-video respectively.
Optionally, the behavior recognizing module 320 specifically includes:
a shared three-dimensional feature determination unit, configured to extract, for each video sequence, a shared three-dimensional feature of the video sequence;
the target area information determining unit is used for performing dimensionality reduction processing on the shared three-dimensional feature to obtain a target detection two-dimensional feature, and performing target detection according to the target detection two-dimensional feature to obtain target area information;
the event identification unit is used for carrying out event identification on the video sequence according to the shared three-dimensional feature and the target area information;
and the candidate behavior data set determining unit is used for forming corresponding candidate behavior data when the video sequence contains a preset behavior event, and adding the candidate behavior data to the candidate behavior data set.
Optionally, the shared three-dimensional feature determining unit is specifically configured to:
respectively carrying out high-frequency sampling and low-frequency sampling on the video sequence to obtain a high-sampling sequence and a low-sampling sequence, wherein the sampling frequency of the high-frequency sampling is greater than that of the low-frequency sampling;
inputting the high sampling sequence into a time sequence feature extraction channel to obtain a time sequence feature;
inputting the low sampling sequence into a spatial feature extraction channel to obtain spatial features;
and fusing the time sequence characteristic and the space characteristic to obtain a shared three-dimensional characteristic.
Optionally, the training process of the target behavior recognition model includes:
carrying out target area labeling and event type labeling on the training video image to obtain a standard target area and a standard event type;
inputting the training video image into a behavior recognition model to be trained to obtain an output actual target area and an actual event type;
obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area and the actual event type;
and performing back propagation on the behavior recognition model to be trained through the fitting loss function to obtain the target behavior recognition model.
Optionally, the obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area, and the actual event type includes:
respectively substituting the standard target area, the standard event type, the actual target area and the actual event type into at least two given loss function expressions to obtain corresponding loss functions;
and performing weighted fusion on each loss function to obtain a fitting loss function.
Optionally, the candidate behavior data includes a candidate event type and a candidate event occurrence time;
correspondingly, the video annotation module 330 is specifically configured to:
combining the candidate behavior data with the same candidate event type and continuous candidate event occurrence time to form behavior event information corresponding to each candidate event type, wherein the behavior event information comprises corresponding candidate event types and event occurrence time periods;
for each behavior event information, determining a target event corresponding to the included candidate event type, and determining a target video segment corresponding to the video to be annotated in the event occurrence time period;
and marking the target event for the target video segment.
In the technical scheme of this embodiment, frame extraction is performed on the video to be annotated to form at least one video sequence; for each video sequence, the shared three-dimensional features are extracted and dimension-reduced for target detection; event recognition is performed on the video sequence by combining the obtained target area information with the shared three-dimensional features, forming the corresponding candidate behavior data; finally, the candidate behavior data are merged into corresponding behavior event information, a corresponding target event is determined for each piece of behavior event information, and the target video segment is labeled. This solves the problems of high processing difficulty caused by the large volume of video data and severe picture jitter, and of low recognition accuracy when behavior recognition is performed directly on the original video.
Example four
Fig. 7 is a block diagram of a computer device according to a fourth embodiment of the present invention, as shown in fig. 7, the computer device includes a processor 410, a memory 420, an input device 430, and an output device 440; the number of the processors 410 in the computer device may be one or more, and one processor 410 is taken as an example in fig. 7; the processor 410, the memory 420, the input device 430 and the output device 440 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 7.
The memory 420 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the video annotation method in the embodiment of the present invention (e.g., the video sequence determination module 310, the behavior recognition module 320, and the video annotation module 330 in the video annotation device). The processor 410 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 420, so as to realize the video annotation method.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, and such remote memory may be connected to the computer device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output device 440 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a video annotation method, the method including:
performing frame extraction processing on a video to be annotated to form at least one video sequence;
taking each video sequence as input data, and carrying out target detection and event recognition through a pre-trained target behavior recognition model to obtain a candidate behavior data set;
and determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set, and labeling the video to be labeled according to each behavior event information.
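As an illustrative, non-limiting sketch of the frame extraction step, the two alternative strategies recited later in claim 2 could be written in Python roughly as follows; the frame-list representation and the parameter names are assumptions made only for the example:

    import random

    def extract_by_interval(frames, interval, seq_len):
        # Alternative 1: start from the first frame, sample every `interval`-th frame,
        # then divide the sampled frames equally into video sequences of `seq_len` frames.
        sampled = frames[::interval]
        return [sampled[i:i + seq_len] for i in range(0, len(sampled), seq_len)]

    def extract_by_random_pick(frames, num_sub_videos, frames_per_interval):
        # Alternative 2: split the video into equal sub-videos, split each sub-video
        # into intervals of a preset number of frames, randomly pick one frame per
        # interval, and let the picked frames of one sub-video form one video sequence.
        sub_len = max(1, len(frames) // num_sub_videos)
        sequences = []
        for s in range(num_sub_videos):
            sub = frames[s * sub_len:(s + 1) * sub_len]
            if not sub:
                continue
            seq = [random.choice(sub[i:i + frames_per_interval])
                   for i in range(0, len(sub), frames_per_interval)]
            sequences.append(seq)
        return sequences

A caller would typically pass the decoded frame list of the video to be annotated and feed each returned video sequence to the target behavior recognition model.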
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above and may also perform related operations of the video annotation method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention can be implemented by means of software plus necessary general-purpose hardware, and certainly can also be implemented by hardware, although the former is a preferred implementation in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the video annotation apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for video annotation, comprising:
performing frame extraction processing on a video to be annotated to form at least one video sequence;
taking each video sequence as input data, and carrying out target detection and event recognition through a pre-trained target behavior recognition model to obtain a candidate behavior data set;
and determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set, and labeling the video to be labeled according to each behavior event information.
2. The video annotation method of claim 1, wherein the performing frame extraction processing on the video to be annotated to form at least one video sequence comprises:
taking the first frame of the video to be annotated as an initial sequence frame, extracting sequence frames at a preset frame interval, and dividing all the extracted sequence frames equally to form at least one video sequence; or,
the method comprises the steps of dividing a video to be marked into at least one sub video to be marked at equal intervals, dividing each sub video to be marked into a plurality of video intervals with preset sequence frames at equal intervals, randomly extracting one frame of sequence frame in each video interval, and forming one video sequence based on each sequence frame in each sub video to be marked respectively.
3. The video annotation method of claim 1, wherein the obtaining of the candidate behavior data set by performing target detection and event recognition through a pre-trained target behavior recognition model using each video sequence as input data comprises:
for each video sequence, extracting shared three-dimensional features of the video sequence;
performing dimensionality reduction on the shared three-dimensional features to obtain target detection two-dimensional features, and performing target detection according to the target detection two-dimensional features to obtain target area information;
according to the shared three-dimensional feature and the target area information, carrying out event identification on the video sequence;
and when the video sequence contains a preset behavior event, forming corresponding candidate behavior data, and adding the candidate behavior data to a candidate behavior data set.
4. The video annotation method of claim 3, wherein the extracting the shared three-dimensional features of the video sequence comprises:
respectively carrying out high-frequency sampling and low-frequency sampling on the video sequence to obtain a high-sampling sequence and a low-sampling sequence, wherein the sampling frequency of the high-frequency sampling is greater than that of the low-frequency sampling;
inputting the high sampling sequence into a time sequence feature extraction channel to obtain a time sequence feature;
inputting the low sampling sequence into a spatial feature extraction channel to obtain spatial features;
and fusing the time sequence characteristic and the space characteristic to obtain a shared three-dimensional characteristic.
5. The video annotation method of claim 1, wherein the training process of the target behavior recognition model comprises:
carrying out target area labeling and event type labeling on the training video image to obtain a standard target area and a standard event type;
inputting the training video image into a behavior recognition model to be trained to obtain an output actual target area and an actual event type;
obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area and the actual event type;
and performing back propagation on the behavior recognition model to be trained through the fitting loss function to obtain the target behavior recognition model.
6. The video annotation method of claim 5, wherein the obtaining a fitting loss function according to the standard target area, the standard event type, the actual target area, and the actual event type comprises:
respectively substituting the standard target area, the standard event type, the actual target area and the actual event type into at least two given loss function expressions to obtain corresponding loss functions;
and performing weighted fusion on each loss function to obtain a fitting loss function.
7. The video annotation method of claim 1, wherein the candidate behavior data comprises a candidate event type and a candidate event occurrence time;
correspondingly, the determining, according to each candidate behavior data in the candidate behavior data set, behavior event information included in the video to be labeled, and labeling the video to be labeled according to each behavior event information includes:
combining the candidate behavior data with the same candidate event type and continuous candidate event occurrence time to form behavior event information corresponding to each candidate event type, wherein the behavior event information comprises corresponding candidate event types and event occurrence time periods;
for each behavior event information, determining a target event corresponding to the included candidate event type, and determining a target video segment corresponding to the video to be annotated in the event occurrence time period;
and marking the target event for the target video segment.
8. A video annotation apparatus, comprising:
the video sequence determination module is used for performing frame extraction processing on a video to be annotated to form at least one video sequence;
the behavior recognition module is used for performing target detection and event recognition on each video sequence as input data through a pre-trained target behavior recognition model to obtain a candidate behavior data set;
and the video labeling module is used for determining behavior event information included in the video to be labeled according to each candidate behavior data in the candidate behavior data set and labeling the video to be labeled according to each behavior event information.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the video annotation method according to any one of claims 1 to 7 when executing the program.
10. A storage medium containing computer-executable instructions for performing the steps of the video annotation method of any one of claims 1-7 when executed by a computer processor.
CN202011409894.1A 2020-12-03 2020-12-03 Video annotation method, device, equipment and storage medium Pending CN112507860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011409894.1A CN112507860A (en) 2020-12-03 2020-12-03 Video annotation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011409894.1A CN112507860A (en) 2020-12-03 2020-12-03 Video annotation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112507860A true CN112507860A (en) 2021-03-16

Family

ID=74971820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011409894.1A Pending CN112507860A (en) 2020-12-03 2020-12-03 Video annotation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112507860A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413330A (en) * 2013-08-30 2013-11-27 中国科学院自动化研究所 Method for reliably generating video abstraction in complex scene
CN110996138A (en) * 2019-12-17 2020-04-10 腾讯科技(深圳)有限公司 Video annotation method, device and storage medium
CN111951260A (en) * 2020-08-21 2020-11-17 苏州大学 Partial feature fusion based convolutional neural network real-time target counting system and method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657307A (en) * 2021-08-20 2021-11-16 北京市商汤科技开发有限公司 Data labeling method and device, computer equipment and storage medium
CN113723259A (en) * 2021-08-24 2021-11-30 罗家泳 Monitoring video processing method and device, computer equipment and storage medium
CN113642506A (en) * 2021-08-26 2021-11-12 北京复数健康科技有限公司 Method and system for image annotation based on data matching
CN114245167A (en) * 2021-11-08 2022-03-25 浙江大华技术股份有限公司 Video storage method and device and computer readable storage medium
CN117037049A (en) * 2023-10-10 2023-11-10 武汉博特智能科技有限公司 Image content detection method and system based on YOLOv5 deep learning
CN117037049B (en) * 2023-10-10 2023-12-15 武汉博特智能科技有限公司 Image content detection method and system based on YOLOv5 deep learning

Similar Documents

Publication Publication Date Title
CN112507860A (en) Video annotation method, device, equipment and storage medium
Manju et al. RETRACTED ARTICLE: Video analytics for semantic substance extraction using OpenCV in python
Singh et al. Visual big data analytics for traffic monitoring in smart city
CN111241343A (en) Road information monitoring and analyzing detection method and intelligent traffic control system
CN110795595A (en) Video structured storage method, device, equipment and medium based on edge calculation
CN111738218B (en) Human body abnormal behavior recognition system and method
CN112784724A (en) Vehicle lane change detection method, device, equipment and storage medium
CN111400550A (en) Target motion trajectory construction method and device and computer storage medium
Oraibi et al. Enhancement digital forensic approach for inter-frame video forgery detection using a deep learning technique
CN111091041A (en) Vehicle law violation judging method and device, computer equipment and storage medium
CN116189063B (en) Key frame optimization method and device for intelligent video monitoring
WO2022228325A1 (en) Behavior detection method, electronic device, and computer readable storage medium
Park et al. Intensity classification background model based on the tracing scheme for deep learning based CCTV pedestrian detection
Mahareek et al. Detecting anomalies in security cameras with 3DCNN and ConvLSTM
CN114550049A (en) Behavior recognition method, device, equipment and storage medium
Zahid et al. A data-driven approach for road accident detection in surveillance videos
Arshad et al. Anomalous situations recognition in surveillance images using deep learning
Ogawa et al. Identifying Parking Lot Occupancy with YOLOv5
Bennur et al. Face Mask Detection and Face Recognition of Unmasked People in Organizations
CN118135800B (en) Abnormal traffic event accurate identification warning method based on deep learning
Prabhakaran et al. Automated Non-Helmet Rider Detection using YOLO v7 and OCR for Enhanced Traffic Monitoring
Kanwal et al. Large Scale Hierarchical Anomaly Detection and Temporal Localization
Liu et al. Integrated multiscale appearance features and motion information prediction network for anomaly detection
Mahavishnu et al. Pattern Recognition Algorithm to Detect Suspicious Activities
Chandrasekara et al. Intelligent video surveillance mechanisms for abnormal activity recognition in real-time: a systematic literature review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination