CN110942011B - Video event identification method, system, electronic equipment and medium - Google Patents

Info

Publication number
CN110942011B
Authority
CN
China
Prior art keywords
features
video
event
information
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911154578.1A
Other languages
Chinese (zh)
Other versions
CN110942011A (en)
Inventor
徐宝函
姜育刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jilian Network Technology Co ltd
Original Assignee
Shanghai Jilian Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jilian Network Technology Co ltd filed Critical Shanghai Jilian Network Technology Co ltd
Priority to CN201911154578.1A priority Critical patent/CN110942011B/en
Publication of CN110942011A publication Critical patent/CN110942011A/en
Application granted granted Critical
Publication of CN110942011B publication Critical patent/CN110942011B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The embodiment of the invention discloses a video event identification method, a video event identification system, electronic equipment and a medium. The method comprises the following steps: acquiring a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip; determining video event features of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip; and determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip. The technical scheme of the embodiment of the invention realizes accurate identification of the event in the video.

Description

Video event identification method, system, electronic equipment and medium
Technical Field
The embodiment of the invention relates to the technical field of computer processing, in particular to a video event identification method, a video event identification system, electronic equipment and a medium.
Background
With the development of the internet and the popularization of mobile devices, the demand for understanding video content has increased in recent years, and event recognition in video is one of the important tasks of video understanding. Compared with the recognition of static content such as scenes or objects, video event recognition involves higher-dimensional semantic concepts and richer features. Events are high-level semantic concepts that are more diverse and complex, and different concepts correlate differently with different events. In addition, an event is often composed of multi-modal features, including human interaction with objects in a specific scene; the scene, objects, actions, sounds and other cues all affect the event.
Early research on video event recognition focused on traditional hand-crafted features, including visual features such as SIFT, gradient histograms, spatial interest points, or optical flow histograms. However, the characterization capability of manually designed features is limited, and so is the recognition capability of these methods on complex and large-scale videos.
In recent years, many researchers have applied neural networks to video event recognition, but most of this work focuses on single-modal tasks such as scene or object recognition, and the multi-modal attributes of events are often ignored by a single model.
Disclosure of Invention
The embodiment of the invention provides a video event identification method, a video event identification system, electronic equipment and a medium, so as to realize accurate identification of events in videos.
In a first aspect, an embodiment of the present invention provides a video event identification method, where the video event identification method includes:
acquiring a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip;
determining video event features of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
and determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip.
Further, determining video event features of the target video segment based on video frame information, optical flow picture information and audio information corresponding to the target video segment, comprising:
inputting video frame information, optical flow picture information and audio information corresponding to the target video clip into a preset feature extraction model, and outputting video frame features corresponding to the video frame information, optical flow picture features corresponding to the optical flow picture information and audio features corresponding to the audio information;
generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features;
wherein the video frame features comprise object features, motion features, and scene features, the optical flow picture features comprise motion features, and the audio features comprise sound features.
Further, generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features, comprising:
acquiring concept features and content features of the video frame features, concept features and content features of the optical flow picture features and concept features and content features of the audio features;
and respectively carrying out concept selection on the concept features of the video frame features, the optical flow picture features and the audio features, and fusing the selected concept features of the video frame features, the optical flow picture features and the audio features with the content features of the video frame features, the optical flow picture features and the audio features respectively to generate video event features of the target video clip.
Further, determining video event information corresponding to the target video segment according to the video event characteristics of the target video segment includes:
and inputting the video event characteristics of the target video clip into a pre-trained classifier, and outputting video event information corresponding to the target video clip.
Further, the method further includes:
generating a training sample set based on at least one historical video segment;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip;
and adjusting the parameters of the classifier according to the video event information and the expected video event information.
Further, the pre-established classifier comprises a deep learning network;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip, wherein the method comprises the following steps:
inputting the training sample set into the deep learning network, and directly carrying out event classification and ranking to obtain video event information of the historical video clip; or,
inputting the training sample set into the deep learning network, determining the video event information with probability greater than a preset probability threshold, and then obtaining the video event information of the historical video clip through event classification and ranking.
Further, the video event information is the probability of the event to which the target video clip of the event to be identified belongs; or,
the video event information is an event identified in the target video clip of the event to be identified, and the identified event is displayed on the terminal.
In a second aspect, an embodiment of the present invention further provides a video event recognition system, where the video event recognition system includes:
the information acquisition module is used for acquiring a target video clip of an event to be identified and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip;
the characteristic determining module is used for determining video event characteristics of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
and the event information determining module is used for determining the video event information corresponding to the target video clip according to the video event characteristics of the target video clip.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs;
when at least one of the programs is executed by the one or more processors, the one or more processors are caused to implement the video event recognition method provided in the embodiments of the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a video event identification method provided in the embodiment of the first aspect of the present invention.
According to the technical scheme of the embodiment of the invention, a target video clip of an event to be identified is obtained, and video frame information, optical flow picture information and audio information corresponding to the target video clip are determined according to the target video clip; video event features of the target video clip are determined based on this information; and video event information corresponding to the target video clip is determined according to the video event features. This solves the problems in the prior art that manually designed features have limited characterization capability and are unsuitable for complex and large-scale video recognition, and that the multi-modal attributes of events are ignored, thereby enabling accurate identification of events in video.
Drawings
Fig. 1 is a flowchart of a video event recognition method according to an embodiment of the present invention;
fig. 2 is a flowchart of a video event recognition method according to a second embodiment of the present invention;
FIG. 3 is a flowchart block diagram of an exemplary video event recognition method provided by an embodiment of the present invention;
fig. 4 is a block diagram of a video event recognition system according to a third embodiment of the present invention;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a video event recognition method according to an embodiment of the present invention. The present embodiment is applicable to event recognition tasks in video retrieval, monitoring analysis, or advertisement delivery. The method may be executed by a video event recognition system, which may be implemented in the form of software and/or hardware. The method specifically comprises the following steps:
s110, obtaining a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip.
The target video clip of the event to be identified refers to a clip on which video content understanding is performed; it may be a single clip cut from one piece of complete video content, or clips cut from two or more pieces of complete video content. The target video clip should preferably contain the event to be identified. To ensure that the event can be accurately determined, the clip can be cut from the middle time period of the complete video content, or a flag bit can be preset and a segment of video cut within the preset time period to serve as the target video clip.
Specifically, by preprocessing a target video segment of an event to be recognized, video frame information, optical flow picture information and audio information corresponding to the target video segment are output.
The video frame information refers to a static video frame of the target video clip, and the static video frame is used for analyzing a scene where the video occurs, an object or a person appearing in the video, and an action of the object or the person appearing in the video.
The optical flow picture information is used only for analyzing the motion of objects or persons appearing in the target video segment.
The audio information refers to the audio of the target video segment, and is used for analyzing the sound appearing in the video.
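As a concrete illustration of this preprocessing step, a minimal sketch follows. The use of OpenCV's Farneback optical flow and of ffmpeg for the audio track is an assumption chosen for illustration, not tooling prescribed by the embodiment; the sampling stride and output path are likewise illustrative.

```python
import subprocess
import cv2

def preprocess(video_path, frame_stride=30, audio_path="audio.wav"):
    """Return sampled video frames, optical flow fields, and an audio file."""
    frames, flows = [], []
    cap = cv2.VideoCapture(video_path)
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(frame)  # static video frame information
            if prev_gray is not None:
                # dense optical flow between consecutive sampled frames
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                flows.append(flow)
            prev_gray = gray
        idx += 1
    cap.release()
    # pull the audio track out as 16 kHz mono WAV (requires ffmpeg on PATH)
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-ac", "1", "-ar", "16000", audio_path], check=True)
    return frames, flows, audio_path
```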
It can be understood that the information extracted from the target video segment of the event to be recognized is not limited to the video frame information, optical flow picture information and audio information; other content related to video event recognition that is present in the target video segment, such as subtitle information used for analyzing the subtitles appearing in the video, may also be extracted.
S120, determining video event characteristics of the target video clip based on the video frame information, the optical flow picture information and the audio information corresponding to the target video clip.
The video event features comprise two parts: concept features and content features. Concept features refer to the event classification scores obtained after the video frames are processed by different models, i.e., the event classification probabilities output by the models. Content features generally refer to the features preceding a model's softmax layer. Content features are more general than concept features and are not tied to a particular task or category.
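The split between concept and content features can be illustrated on a single backbone. The sketch below assumes a torchvision ResNet-50 as an arbitrary stand-in (not one of the models named later in this embodiment): the post-softmax class probabilities play the role of concept features, and the global-average-pooled activations before the final classification layer play the role of content features.

```python
import torch
import torchvision.models as models

# assumes a recent torchvision; any classification backbone would do
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

def concept_and_content(batch):
    """batch: float tensor of shape (N, 3, 224, 224)."""
    with torch.no_grad():
        x = backbone.conv1(batch)
        x = backbone.maxpool(backbone.relu(backbone.bn1(x)))
        x = backbone.layer4(backbone.layer3(
            backbone.layer2(backbone.layer1(x))))
        content = torch.flatten(backbone.avgpool(x), 1)       # pre-softmax features
        concept = torch.softmax(backbone.fc(content), dim=1)  # class probabilities
    return concept, content
```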
Specifically, feature extraction is performed on video frame information, optical flow picture information, and audio information, respectively. Illustratively, the video frame information corresponds to the extracted video event features including object features, scene features and motion features, the optical flow picture information corresponds to the extracted video event features including motion features, and the audio information corresponds to the extracted video event features including sound features.
Because events are often determined by features of multiple modalities, the technical solution of the embodiment of the present invention adopts multi-modal fusion to classify and recognize events. Multi-modal fusion can be realized by directly concatenating features of different modalities, manually assigning different weights, or building a neural network.
Specifically, the video event features are composed of multi-modal content features and concept-selected concept features, and the video event features of the whole video are represented by averaging over all frames and the audio.
Further, after the features of each kind of information are extracted, feature selection and fusion are carried out to obtain the video event features of the target video clip.
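A minimal fusion sketch under the direct-concatenation option mentioned above might look as follows; the dictionary layout and function name are hypothetical, and per-frame features are averaged to form the whole-video representation as described.

```python
import numpy as np

def fuse_video_features(per_modality):
    """per_modality: {modality: [(selected_concept, content), ...]} per frame."""
    parts = []
    for modality in sorted(per_modality):  # fixed order for a stable layout
        pairs = per_modality[modality]
        concept = np.mean([c for c, _ in pairs], axis=0)  # average over frames
        content = np.mean([f for _, f in pairs], axis=0)
        parts.extend([concept, content])
    return np.concatenate(parts)  # whole-video event feature vector
```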
It can be understood that, in the technical solution of the embodiment of the present invention, the video frame information, optical flow picture information and audio information each correspond to extracted video event features, and the four modal features above are considered only as examples. In analyzing a video event, different information obtained by preprocessing the video clip can be used, according to the actual situation, to extract different event feature information, thereby completing the determination of the video event features of the video clip.
S130, determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip.
The video event information is the probability of an event to which a target video clip of the event to be identified belongs; or, the video event information is an event identified in the target video segment of the event to be identified, and the identified event is displayed on the terminal.
Specifically, the video event characteristics of the target video clip are classified by a preset neural network or other classifiers, such as a Support Vector Machine (SVM), so as to obtain the probability that the target video clip belongs to each type of event, and the category with the highest probability is used as the final classification result of the target video clip, that is, the video event information corresponding to the target video clip.
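For instance, with a linear SVM as the classifier, this final step might be sketched as below; scikit-learn's SVC is one concrete choice among many, and the training features and labels are assumed to be precomputed.

```python
from sklearn.svm import SVC

def classify_event(train_features, train_event_labels, video_event_feature):
    """Fit a linear SVM and return (predicted event, per-class probabilities)."""
    clf = SVC(kernel="linear", probability=True)
    clf.fit(train_features, train_event_labels)
    probs = clf.predict_proba(video_event_feature.reshape(1, -1))[0]
    return clf.classes_[probs.argmax()], probs  # highest-probability category wins
```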
It should be noted that the technical solution of the embodiment of the present invention can be implemented by a CPU or a GPU processor on a personal computer, a server, or a cloud server.
The terminal where the identified event is displayed may be a personal computer, a server, or a cloud server.
According to the technical scheme of the embodiment of the invention, a target video clip of an event to be identified is obtained, and video frame information, optical flow picture information and audio information corresponding to the target video clip are determined according to the target video clip; video event features of the target video clip are determined based on this information; and video event information corresponding to the target video clip is determined according to the video event features. This solves the problems in the prior art that manually designed features have limited characterization capability and are unsuitable for complex and large-scale video recognition, and that the multi-modal attributes of events are ignored, thereby enabling accurate identification of events in video.
Example two
Fig. 2 is a flowchart of a video event recognition method according to a second embodiment of the present invention. The present embodiment is optimized based on the above embodiments.
Correspondingly, the method of the embodiment specifically includes:
s210, obtaining a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip.
S220, inputting video frame information, optical flow picture information and audio information corresponding to the target video clip into a preset feature extraction model, and outputting video frame features corresponding to the video frame information, optical flow picture features corresponding to the optical flow picture information and audio features corresponding to the audio information.
Specifically, features of four modalities, i.e., motion features, object features, scene features, and sound features, are taken as examples.
Motion features can be extracted with a Temporal Segment Network (TSN) model pre-trained on the Kinetics data set, which contains 400 action classes. Specifically, video frame information and optical flow picture information can each be input into a model and the outputs finally fused, where the 400-dimensional concept prediction probabilities serve as the concept features and the model's last global pooling layer features serve as the content features.
Scene features can be extracted with a ResNet-based image classification model pre-trained on the Places365 data set, which contains 365 common scene categories. The 365-dimensional scene prediction probabilities serve as the concept features, and the average pooling layer features before the model's classification layer serve as the content features.
Objects are also often closely related to events, so object features can be characterized by an object model. Unlike most previous work, which directly extracts picture features with ImageNet-based models, the embodiment of the invention first extracts all object regions from a video frame using an object detection algorithm, and then uses an ImageNet-based model to classify object blocks of different scales and extract their features.
Another significant difference between video and still images is that video often contains audio. Therefore, acoustic model features are also considered in the embodiment of the invention. The sound features can adopt a VGGish-based sound event classification model pre-trained on AudioSet, covering 485 sound categories; the sound category prediction probabilities serve as the concept features, and the dimension-reduced sound features serve as the content features.
S230, generating a video event feature of the target video clip according to the video frame feature, the optical flow picture feature and the audio feature; wherein the video frame features comprise object features, motion features, and scene features, the optical flow picture features comprise motion features, and the audio features comprise sound features.
Wherein generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features comprises: acquiring concept features and content features of the video frame features, concept features and content features of the optical flow picture features and concept features and content features of the audio features; and respectively carrying out concept selection on the concept features of the video frame features, the optical flow picture features and the audio features, and fusing the selected concept features of the video frame features, the optical flow picture features and the audio features with the content features of the video frame features, the optical flow picture features and the audio features respectively to generate video event features of the target video clip.
The obtained concept features can be screened and retained according to two concept selection criteria.
Because not all high-dimensional concept features contribute equally to event classification, the embodiment of the present invention provides two concept selection criteria for high-dimensional concept features, so as to better distinguish different events.
Our first criterion is to select the most discriminative features, i.e., concepts that appear in only a small number of event categories. In addition, there are strong correlations between concepts, and any one of a group of correlated concepts may express a specific event. Therefore, as the second criterion, we want the selected subset of concepts to be diverse, which means selecting concepts with low correlation to each other.
Specifically, we define the training data set as

$$\mathrm{Tr} = \{(V_i, S(V_i), z_i)\}_{i=1,\dots,n}$$

where $S(V_i)$ is the concept score of video $V_i$ and $z_i$ is its event category.
We consider the concept selection problem as selecting a subset of points on a fully connected graph, where each point represents a concept and the edges between points represent the correlations between them. Each point also carries a latent variable $h_i \in \{0, 1\}$ indicating whether the concept is eventually selected. We then seek the selection of a given set of concepts that minimizes an energy function of the form

$$E(h) = \sum_{i} \phi(h_i) + \sum_{i,j} \psi(h_i, h_j)$$

where $\phi(h_i)$ represents the cost of selecting the $i$-th concept, i.e., its discriminative power, and $\psi(h_i, h_j)$ represents the cost of selecting concepts $i$ and $j$ simultaneously, i.e., their correlation and diversity.
For the discriminative power of a concept, we define a conditional entropy that favors concepts appearing in only a small fraction of event categories:

$$H(E \mid c) = -\sum_{e} p(e \mid c)\, \log p(e \mid c)$$

where $c$ denotes a specific concept and $p(e \mid c)$ is the conditional distribution over events given concept $c$, which can be obtained via Bayes' formula. When $h_i = 1$, $\phi(h_i) = H(E \mid i)$.
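A sketch of this discriminativeness term is given below, estimating $p(e \mid c)$ from a concept-event co-occurrence matrix; the counts-based estimate is an assumption about how the conditional distribution is obtained in practice.

```python
import numpy as np

def conditional_entropy(cooccur):
    """cooccur: (num_concepts, num_events) matrix of concept-event
    co-occurrence counts or summed concept scores; returns H(E|c) per concept."""
    p_e_given_c = cooccur / cooccur.sum(axis=1, keepdims=True)
    p = np.clip(p_e_given_c, 1e-12, 1.0)   # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)    # low value = highly discriminative
```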
Furthermore, for the correlation between concepts we use the conditional probabilities $p(e \mid c)$: if two concepts predict the same event categories, they should have similar conditional distributions. Therefore, when $h_i = 1$ and $h_j = 1$, $\psi(h_i, h_j) = \langle p(e \mid i), p(e \mid j) \rangle$.
Finally, concept selection is carried out with a greedy algorithm. First, the concept with the minimum conditional entropy, i.e., the concept appearing in the fewest event categories, is selected. Then we update the correlation terms of all remaining concepts with respect to the selected set and repeatedly select the concept that minimizes the energy function $E(h)$, i.e., adds the least cost. This process is applied separately to the model of each modality, looping until the number of selected concepts reaches the required threshold.
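The greedy loop can be sketched as follows. The unary term is the conditional entropy defined above and the pairwise term is the inner product of conditional-probability rows; the exact incremental cost used to pick the next concept, and the array shapes, are illustrative assumptions.

```python
import numpy as np

def select_concepts(p_e_given_c, num_select):
    """p_e_given_c: (num_concepts, num_events) conditional distributions."""
    p = np.clip(p_e_given_c, 1e-12, 1.0)
    phi = -(p * np.log(p)).sum(axis=1)    # unary cost: H(E|i)
    psi = p_e_given_c @ p_e_given_c.T     # pairwise cost: <p(e|i), p(e|j)>
    selected = [int(phi.argmin())]        # most discriminative concept first
    remaining = set(range(len(phi))) - set(selected)
    while len(selected) < num_select and remaining:
        # cost of adding concept j: its own entropy plus its correlation
        # with the already-selected set (lower is better)
        cost = {j: phi[j] + psi[j, selected].sum() for j in remaining}
        best = min(cost, key=cost.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```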
S240, inputting the video event characteristics of the target video clip into a pre-trained classifier, and outputting video event information corresponding to the target video clip.
It can be understood that, before the video event features of the target video segment are input into the pre-trained classifier, the method further includes training the classifier. This specifically comprises the following steps: generating a training sample set based on at least one historical video segment; inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip; and adjusting the parameters of the classifier according to the video event information and the expected video event information.
The pre-established classifier comprises a deep learning network. Optionally, inputting the training sample set into the pre-established classifier to obtain the video event information of the historical video clip includes: inputting the training sample set into the deep learning network and directly carrying out event classification and ranking to obtain the video event information of the historical video clip; or inputting the training sample set into the deep learning network, determining the video event information with probability greater than a preset probability threshold, and then obtaining the video event information of the historical video clip through event classification and ranking.
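A minimal training sketch for such a pre-established classifier is given below, assuming a small fully connected network in PyTorch; the layer sizes, optimizer, and number of epochs are illustrative, not prescribed by the embodiment.

```python
import torch
import torch.nn as nn

def train_classifier(features, labels, num_events, epochs=20):
    """features: (N, D) float tensor; labels: (N,) long tensor of event ids."""
    model = nn.Sequential(nn.Linear(features.shape[1], 512),
                          nn.ReLU(),
                          nn.Linear(512, num_events))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()  # compares predicted vs. expected event info
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)  # adjust classifier parameters
        loss.backward()
        opt.step()
    return model
```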
Fig. 3 is a flowchart of an exemplary video event recognition method according to an embodiment of the present invention. Referring to fig. 3, the video event recognition method can be divided into five components: video preprocessing, video feature extraction, concept selection, multi-modal feature fusion, and event classification. Video preprocessing extracts frames and audio from the target video segment, such as video frames (RGB frames), optical flow pictures, and audio. Video feature extraction feeds these inputs into several models and extracts content features and concept features for each modality: RGB frames are input into an object model, a scene model and a motion model, which output the corresponding object, scene and motion features; the optical flow pictures are input into the motion model, which outputs the corresponding motion features; and the audio is input into a sound model, which outputs the corresponding sound features. Concept selection screens and retains the input concept features according to the two concept selection criteria. Multi-modal feature fusion then fuses the content features and the selected concept features from the extraction module into a feature representation of the whole video. After this representation is obtained, event classification determines the scores and the most likely category of the target video, and the result is returned to the user.
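Wiring the five components together, a compact end-to-end sketch might read as follows. It reuses the hypothetical helpers sketched earlier (preprocess, concept_and_content, fuse_video_features), handles only the RGB modality for brevity, and omits ImageNet input normalization; all names are assumptions.

```python
import cv2
import torch

def recognize_event(video_path, clf, selected_idx):
    """selected_idx: concept dimensions kept by concept selection (assumed)."""
    frames, flows, audio_path = preprocess(video_path)       # 1. preprocessing
    pairs = []
    for frame in frames:                                     # 2. feature extraction
        rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
        x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        concept, content = concept_and_content(x)
        pairs.append((concept[0, selected_idx].numpy(),      # 3. concept selection
                      content[0].numpy()))
    feature = fuse_video_features({"rgb": pairs})            # 4. fusion (RGB only)
    probs = clf.predict_proba(feature.reshape(1, -1))[0]     # 5. classification
    return clf.classes_[probs.argmax()], probs
```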
EXAMPLE III
Fig. 4 is a structural diagram of a video event recognition system according to a third embodiment of the present invention, which is applicable to event recognition tasks in video retrieval, monitoring analysis, advertisement delivery, and the like.
As shown in fig. 4, the video event recognition system includes: an information acquisition module 410, a feature determination module 420, and an event information determination module 430, wherein:
the information acquisition module 410 is configured to acquire a target video segment of an event to be identified, and determine video frame information, optical flow picture information, and audio information corresponding to the target video segment according to the target video segment;
a feature determination module 420, configured to determine a video event feature of the target video segment based on the video frame information, the optical flow picture information, and the audio information corresponding to the target video segment;
and the event information determining module 430 is configured to determine, according to the video event feature of the target video segment, video event information corresponding to the target video segment.
The video event identification system of this embodiment acquires a target video clip of an event to be identified, and determines video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip; determines video event features of the target video clip based on this information; and determines video event information corresponding to the target video clip according to the video event features. This solves the problems in the prior art that manually designed features have limited characterization capability and are unsuitable for complex and large-scale video recognition, and that the multi-modal attributes of events are ignored, thereby enabling accurate identification of events in video.
On the basis of the above embodiments, determining the video event feature of the target video segment based on the video frame information, the optical flow picture information and the audio information corresponding to the target video segment includes:
inputting video frame information, optical flow picture information and audio information corresponding to the target video clip into a preset feature extraction model, and outputting video frame features corresponding to the video frame information, optical flow picture features corresponding to the optical flow picture information and audio features corresponding to the audio information;
generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features;
wherein the video frame features comprise object features, motion features, and scene features, the optical flow picture features comprise motion features, and the audio features comprise sound features.
On the basis of the above embodiments, generating the video event feature of the target video segment according to the video frame feature, the optical flow picture feature and the audio feature includes:
acquiring concept features and content features of the video frame features, concept features and content features of the optical flow picture features and concept features and content features of the audio features;
and respectively carrying out concept selection on the concept features of the video frame features, the optical flow picture features and the audio features, and fusing the selected concept features of the video frame features, the optical flow picture features and the audio features with the content features of the video frame features, the optical flow picture features and the audio features respectively to generate video event features of the target video clip.
On the basis of the foregoing embodiments, determining video event information corresponding to the target video segment according to the video event feature of the target video segment includes:
and inputting the video event characteristics of the target video clip into a pre-trained classifier, and outputting video event information corresponding to the target video clip.
On the basis of the above embodiments, the method further includes:
generating a training sample set based on at least one historical video segment;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip;
and adjusting the parameters of the classifier according to the video event information and the expected video event information.
On the basis of the above embodiments, the pre-established classifier includes a deep learning network;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip, wherein the method comprises the following steps:
inputting the training sample set into the deep learning network, and directly carrying out event classification and ranking to obtain video event information of the historical video clip; or,
inputting the training sample set into the deep learning network, determining the video event information with probability greater than a preset probability threshold, and then obtaining the video event information of the historical video clip through event classification and ranking.
On the basis of the above embodiments, the video event information is the probability of the event to which the target video clip of the event to be identified belongs; or,
the video event information is an event identified in the target video clip of the event to be identified, and the identified event is displayed on the terminal.
The video event recognition system provided by each embodiment can execute the video event recognition method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the video event recognition method.
Example four
Fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 5 is only an example and should not bring any limitation to the function and the scope of use of the embodiment of the present invention.
As shown in FIG. 5, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, implementing a video event recognition method provided by an embodiment of the present invention, the video event recognition method including:
acquiring a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip;
determining video event features of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
and determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip.
Of course, those skilled in the art can understand that the processor can also implement the technical solution of the video event recognition method provided by any embodiment of the present invention.
EXAMPLE five
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a video event recognition method provided in an embodiment of the present invention, where the video event recognition method includes:
acquiring a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip;
determining video event features of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
and determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the video event identification method provided by any embodiments of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (8)

1. A method for video event recognition, comprising:
acquiring a target video clip of an event to be identified, and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip; the target video clip is intercepted from the middle time period of the complete video content;
determining video event features of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip;
determining video event features of the target video clip based on video frame information, optical flow picture information, and audio information corresponding to the target video clip, including:
inputting video frame information, optical flow picture information and audio information corresponding to the target video clip into a preset feature extraction model, and outputting video frame features corresponding to the video frame information, optical flow picture features corresponding to the optical flow picture information and audio features corresponding to the audio information;
generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features;
wherein the video frame features comprise object features, motion features, and scene features, the optical flow picture features comprise motion features, and the audio features comprise sound features;
wherein generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features comprises:
acquiring concept features and content features of the video frame features, concept features and content features of the optical flow picture features and concept features and content features of the audio features; the conceptual features refer to event classification probabilities output by the preset feature extraction models, and the content features refer to features before the output layers of the preset feature extraction models;
and respectively carrying out concept selection on the concept features of the video frame features, the optical flow picture features and the audio features, and fusing the selected concept features of the video frame features, the optical flow picture features and the audio features with the content features of the video frame features, the optical flow picture features and the audio features respectively to generate video event features of the target video clip.
2. The method of claim 1, wherein determining video event information corresponding to the target video segment according to the video event characteristics of the target video segment comprises:
and inputting the video event characteristics of the target video clip into a pre-trained classifier, and outputting video event information corresponding to the target video clip.
3. The method of claim 2, further comprising:
generating a training sample set based on at least one historical video segment;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip;
and adjusting the parameters of the classifier according to the video event information and the expected video event information.
4. The method of claim 3, wherein the pre-established classifier comprises a deep learning network;
inputting the training sample set into a pre-established classifier to obtain video event information of the historical video clip, wherein the method comprises the following steps:
inputting the training sample set into the deep learning network, and directly carrying out event classification and ranking to obtain video event information of the historical video clip; or,
inputting the training sample set into the deep learning network, determining the video event information with probability greater than a preset probability threshold, and then obtaining the video event information of the historical video clip through event classification and ranking.
5. The method according to claim 1, wherein the video event information is a probability of an event to which a target video segment of the event to be identified belongs; or,
the video event information is an event identified in the target video segment of the event to be identified, and the identified event is displayed on the terminal.
6. A video event recognition system, comprising:
the information acquisition module is used for acquiring a target video clip of an event to be identified and determining video frame information, optical flow picture information and audio information corresponding to the target video clip according to the target video clip; the target video clip is intercepted from the middle time period of the complete video content;
the characteristic determining module is used for determining video event characteristics of the target video clip based on video frame information, optical flow picture information and audio information corresponding to the target video clip;
the event information determining module is used for determining video event information corresponding to the target video clip according to the video event characteristics of the target video clip;
the feature determination module includes:
inputting video frame information, optical flow picture information and audio information corresponding to the target video clip into a preset feature extraction model, and outputting video frame features corresponding to the video frame information, optical flow picture features corresponding to the optical flow picture information and audio features corresponding to the audio information;
generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features;
wherein the video frame features comprise object features, motion features, and scene features, the optical flow picture features comprise motion features, and the audio features comprise sound features;
wherein generating video event features of the target video segment from the video frame features, the optical flow picture features, and the audio features comprises:
acquiring concept features and content features of the video frame features, concept features and content features of the optical flow picture features and concept features and content features of the audio features; the conceptual features refer to event classification probabilities output by the preset feature extraction models, and the content features refer to features before the output layers of the preset feature extraction models;
and respectively carrying out concept selection on the concept features of the video frame features, the optical flow picture features and the audio features, and fusing the selected concept features of the video frame features, the optical flow picture features and the audio features with the content features of the video frame features, the optical flow picture features and the audio features respectively to generate video event features of the target video clip.
7. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the video event recognition method of any of claims 1-5.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the video event recognition method according to any one of claims 1 to 5.
CN201911154578.1A 2019-11-18 2019-11-18 Video event identification method, system, electronic equipment and medium Active CN110942011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911154578.1A CN110942011B (en) 2019-11-18 2019-11-18 Video event identification method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911154578.1A CN110942011B (en) 2019-11-18 2019-11-18 Video event identification method, system, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN110942011A CN110942011A (en) 2020-03-31
CN110942011B true CN110942011B (en) 2021-02-02

Family

ID=69907895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911154578.1A Active CN110942011B (en) 2019-11-18 2019-11-18 Video event identification method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN110942011B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507421A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Video-based emotion recognition method and device
CN112464898A (en) * 2020-12-15 2021-03-09 北京市商汤科技开发有限公司 Event detection method and device, electronic equipment and storage medium
CN112714362A (en) * 2020-12-25 2021-04-27 北京百度网讯科技有限公司 Method, apparatus, electronic device, medium, and program product for determining attributes
CN113012200B (en) * 2021-03-23 2023-01-13 北京灵汐科技有限公司 Method and device for positioning moving object, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751782A (en) * 2009-12-30 2010-06-23 北京大学深圳研究生院 Crossroad traffic event automatic detection system based on multi-source information fusion
CN101799876A (en) * 2010-04-20 2010-08-11 王巍 Video/audio intelligent analysis management control system
CN102098492A (en) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 Audio and video conjoint analysis-based fighting detection system and detection method thereof
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN108288035A (en) * 2018-01-11 2018-07-17 华南理工大学 The human motion recognition method of multichannel image Fusion Features based on deep learning
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN110110624A (en) * 2019-04-24 2019-08-09 江南大学 A kind of Human bodys' response method based on DenseNet network and the input of frame difference method feature

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9373174B2 (en) * 2014-10-21 2016-06-21 The United States Of America As Represented By The Secretary Of The Air Force Cloud based video detection and tracking system
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109977818A (en) * 2019-03-14 2019-07-05 上海极链网络科技有限公司 A kind of action identification method and system based on space characteristics and multi-target detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098492A (en) * 2009-12-11 2011-06-15 上海弘视通信技术有限公司 Audio and video conjoint analysis-based fighting detection system and detection method thereof
CN101751782A (en) * 2009-12-30 2010-06-23 北京大学深圳研究生院 Crossroad traffic event automatic detection system based on multi-source information fusion
CN101799876A (en) * 2010-04-20 2010-08-11 王巍 Video/audio intelligent analysis management control system
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN108288035A (en) * 2018-01-11 2018-07-17 华南理工大学 The human motion recognition method of multichannel image Fusion Features based on deep learning
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN110110624A (en) * 2019-04-24 2019-08-09 江南大学 A kind of Human bodys' response method based on DenseNet network and the input of frame difference method feature

Also Published As

Publication number Publication date
CN110942011A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110942011B (en) Video event identification method, system, electronic equipment and medium
US10354362B2 (en) Methods and software for detecting objects in images using a multiscale fast region-based convolutional neural network
CN109117777B (en) Method and device for generating information
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN109598231B (en) Video watermark identification method, device, equipment and storage medium
US9875445B2 (en) Dynamic hybrid models for multimodal analysis
US20180336439A1 (en) Novelty detection using discriminator of generative adversarial network
US20190034814A1 (en) Deep multi-task representation learning
EP3399460A1 (en) Captioning a region of an image
Dewi et al. Weight analysis for various prohibitory sign detection and recognition using deep learning
US11556302B2 (en) Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium
CN107273458B (en) Depth model training method and device, and image retrieval method and device
CN109993102B (en) Similar face retrieval method, device and storage medium
CN110232340B (en) Method and device for establishing video classification model and video classification
KR101617649B1 (en) Recommendation system and method for video interesting section
EP3886037A1 (en) Image processing apparatus and method for style transformation
CN110929802A (en) Information entropy-based subdivision identification model training and image identification method and device
CN111368878B (en) Optimization method based on SSD target detection, computer equipment and medium
WO2021034864A1 (en) Detection of moment of perception
JP5311899B2 (en) Pattern detector learning apparatus, learning method, and program
CN110263218B (en) Video description text generation method, device, equipment and medium
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium
CN114092746A (en) Multi-attribute identification method and device, storage medium and electronic equipment
WO2021090771A1 (en) Method, apparatus and system for training a neural network, and storage medium storing instructions
CN113792569A (en) Object identification method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant