CN110909655A - Method and equipment for identifying video event - Google Patents

Method and equipment for identifying video event

Info

Publication number
CN110909655A
CN110909655A
Authority
CN
China
Prior art keywords
picture
video
feature vector
feature
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911128938.0A
Other languages
Chinese (zh)
Inventor
周康明
李俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911128938.0A priority Critical patent/CN110909655A/en
Publication of CN110909655A publication Critical patent/CN110909655A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The method comprises: sampling an obtained video to be detected to obtain a picture sequence; detecting each picture in the picture sequence and determining a key point group of each picture; extracting feature information from the key point group of each picture and determining a feature vector sequence according to that feature information; and determining the event category of the video to be detected according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.

Description

Method and equipment for identifying video event
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for identifying a video event.
Background
In the prior art, it is difficult to judge the category of a video based on the key points of human bodies and objects in the video, and the judgment results are inaccurate. Moreover, identifying the video type usually requires extracting features from the whole of every picture in the sequence, which wastes a large amount of computational and time resources.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for identifying a video event, which solve the prior-art problems that identifying the video type is both inaccurate and time-consuming.
According to an aspect of the present application, there is provided a method of identifying a video event, the method comprising:
sampling the obtained video to be detected to obtain a picture sequence;
detecting each picture in the picture sequence, and determining a key point group of each picture;
extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture;
and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
Further, the determining the event category of the video to be detected according to each feature vector in the feature vector sequence includes:
obtaining a plurality of event categories and scores of the event categories according to each feature vector in the feature vector sequence;
and determining the event category of the video to be detected according to the scores of the event categories and the score threshold value.
Further, the detecting each picture in the picture sequence and determining the key point group of each picture includes:
and detecting a human body and an object for each picture in the picture sequence, and determining a key point group of each picture, wherein the key point group of each picture comprises key points of the human body and key points of the object.
Further, the extracting the feature information in the key point group of each picture and determining a feature vector sequence according to the feature information in the key point group of each picture includes:
expanding the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture into a vector of the human body and a vector of the object, respectively;
determining a characteristic vector of the human body and a characteristic vector of the object according to the vector of the human body and the vector of the object;
fusing all the characteristic vectors of the human body and the characteristic vectors of the object in each picture to obtain a fused characteristic matrix corresponding to each picture;
and determining a feature vector sequence according to the fusion feature matrixes corresponding to all the pictures.
Further, the fusing all the feature vectors of the human body and the feature vectors of the object in each picture to obtain a fused feature matrix corresponding to each picture includes:
arranging all the characteristic vectors of the human body in each picture to form a matrix;
and performing a Kronecker product on the matrix and the feature matrix of the object in the same picture to obtain a fusion feature matrix corresponding to each picture.
Further, the determining a feature vector sequence according to the fusion feature matrices corresponding to all the pictures includes:
inputting the fusion feature matrixes corresponding to all the pictures into a convolutional neural network to obtain a new feature map;
and determining a feature vector sequence according to the new feature map.
Further, the determining the event category of the video to be detected according to the scores of the multiple event categories and the score threshold includes:
if at least one score in the scores of the multiple event categories is larger than a score threshold, taking the maximum score among the scores higher than the score threshold, and taking the event category corresponding to the maximum score as the event category of the video to be detected;
and if the scores of the multiple event categories are all less than or equal to the score threshold, performing sequence recognition on the feature vector sequence by using a long short-term memory model to determine the event category of the video to be detected.
Further, the determining the event category of the video to be detected according to each feature vector in the feature vector sequence further includes:
and taking a two-norm over the components of each feature vector in the feature vector sequence, and determining the event category of the video to be detected according to the two-norms corresponding to the feature vector sequence.
According to another aspect of the present application, there is also provided an apparatus for identifying a video event, the apparatus including:
the sampling device is used for sampling the obtained video to be detected to obtain a picture sequence;
the detection device is used for detecting each picture in the picture sequence and determining a key point group of each picture;
the extracting device is used for extracting the characteristic information in the key point group of each picture and determining a characteristic vector sequence according to the characteristic information in the key point group of each picture;
and the identification device is used for determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
According to yet another aspect of the present application, there is also provided an apparatus for identifying a video event, the apparatus including:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of identifying video events described above.
Compared with the prior art, the present application samples the obtained video to be detected to obtain a picture sequence; detects each picture in the picture sequence and determines a key point group of each picture; extracts feature information from the key point group of each picture and determines a feature vector sequence according to that feature information; and determines the event category of the video to be detected according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 illustrates a flow diagram of a method of identifying video events provided in accordance with an aspect of the present application;
FIG. 2 is a diagram illustrating a method for detecting key points in a picture according to a preferred embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for identifying video events in accordance with a further preferred embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for identifying a video event according to another aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
Fig. 1 shows a schematic flow chart of a method for identifying a video event according to an aspect of the present application. The method comprises steps S11 to S14. In step S11, the obtained video to be detected is sampled to obtain a picture sequence; in step S12, each picture in the picture sequence is detected and a key point group of each picture is determined; in step S13, feature information in the key point group of each picture is extracted, and a feature vector sequence is determined according to that feature information; and in step S14, the event category of the video to be detected is determined according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.
Specifically, in step S11, the obtained video to be detected is sampled to obtain a picture sequence. The video to be detected can be acquired through a camera or similar device, and is then sampled at a preset fixed time interval to obtain a picture sequence arranged in sampling-time order.
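For illustration only, the following is a minimal sketch of this fixed-interval sampling step, assuming OpenCV (cv2) is available; the function name sample_frames and the default interval are assumptions, not part of the application.

```python
# A minimal sketch of fixed-interval frame sampling, assuming OpenCV.
import cv2

def sample_frames(video_path: str, interval_s: float = 1.0):
    """Sample one frame every interval_s seconds, in temporal order."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is missing
    step = max(1, int(round(fps * interval_s)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)               # keep every step-th frame
        idx += 1
    cap.release()
    return frames  # picture sequence S, ordered by sampling time
```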
Step S12, detecting each picture in the picture sequence, and determining a key point group of each picture. Here, each picture in the picture sequence may be detected using a deep learning model, for example an OpenPose model, and the set of all key points detected in each picture is taken as that picture's key point group.
And step S13, extracting the feature information in the key point group of each picture, and determining a feature vector sequence according to that feature information. Here, the key point group of each picture includes key points of people, key points of objects, and so on; the feature information in each key point group may be determined from the coordinates of the key points, the classification to which they belong, and the like; and the feature vector sequence may be determined from the feature information by some algorithm, for example by expanding the key point groups into vectors and deriving the feature vector sequence from those vectors.
And step S14, determining the event category of the video to be detected according to each feature vector in the feature vector sequence. Here, each feature vector in the feature vector sequence may be calculated by a classification network algorithm to obtain a corresponding score, and an event category corresponding to each feature vector is determined according to the corresponding score. Therefore, the type of the video is accurately and quickly judged based on the key points of the human body and the object in the video.
Preferably, in step S14, a plurality of event categories and scores for those event categories are obtained according to each feature vector in the feature vector sequence, and the event category of the video to be detected is determined according to the scores of the event categories and a score threshold. Here, a plurality of different event categories, and the score of each feature vector under each of them, are determined from each feature vector in the feature vector sequence. Score thresholds for the event categories can be preset; a classification neural network can be used to calculate the score of each feature vector under the different event categories, and the event category of the video to be detected is determined by comparing these scores with the preset thresholds.
Preferably, in step S12, human-body and object detection is performed on each picture in the picture sequence, and a key point group of each picture is determined, where the key point group includes key points of human bodies and key points of objects. As shown in fig. 2, an OpenPose model is used to detect the human bodies and objects in each picture and determine their key points: human-body key points can be, for example, the head and skeletal joints such as the knees, shoulders and ankles, while object key points can be corners, edges, or points of prominent color on the object. The motion trajectories of the people and objects in the video can be tracked and determined through these key points.
Preferably, in step S13, the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture are expanded into a vector of the human body and a vector of the object, respectively; a feature vector of the human body and a feature vector of the object are determined according to the vector of the human body and the vector of the object; all the feature vectors of the human body and the feature vectors of the object in each picture are fused to obtain a fused feature matrix corresponding to each picture; and a feature vector sequence is determined according to the fused feature matrices corresponding to all the pictures. In this method, key points of the human body and the object are detected in each picture, their coordinates in the picture are expanded into a vector of the human body and a vector of the object respectively, and these vectors are then passed through a neural network to obtain feature vectors. In a preferred embodiment of the present application, the coordinates of the human-body and object key points detected in each picture are expanded into a vector A, and the vector A is input into a neural network, preferably a fully connected neural network, to obtain a feature vector. The feature vector output by the fully connected neural network encodes an event category, and the encoding distinguishes different event categories by permutation. For example, for an output feature vector of length 7, the vectors (1,1,1,0,0,0,0), (1,1,0,1,0,0,0), (1,0,1,0,0,0,1) and (0,0,1,1,0,0,0) are permutations of the numbers 1 and 0 and can represent four different event categories. In the prior art, by contrast, each component can represent only one event category: for example, (1,0,0,0,0,0,0), (0,1,0,0,0,0,0) and (0,0,1,0,0,0,0) each represent one category, that is, exactly one component is set to 1 and the rest to 0. Thus, for a feature vector of total length N, the prior art can encode at most N categories, whereas the permutation-based encoding yields 2^N different feature vectors and hence 2^N different event categories. This allows a finer-grained classification of the event categories indicated by the key points: two codes such as (1,0,1,1,0,0,0) and (1,1,1,0,0,0,0) differ in only a few components, so nearby codes can be used to identify similar event categories. Compared with the prior design in which each component is the score of one category, the method also yields much richer codes.
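For illustration only, a hedged sketch of this encoding step, assuming PyTorch; the helper names keypoints_to_vector and KeypointEncoder, the hidden-layer size, and the output length of 7 are assumptions, not part of the application. Keypoint coordinates are flattened into a vector A and passed through a small fully connected network whose sigmoid outputs, thresholded at 0.5, give the 0/1 permutation-style codes described above.

```python
import torch
import torch.nn as nn

def keypoints_to_vector(keypoints):
    """Flatten a list of (x, y) keypoint coordinates into one vector A."""
    return torch.tensor([c for (x, y) in keypoints for c in (x, y)],
                        dtype=torch.float32)

class KeypointEncoder(nn.Module):
    """Fully connected network mapping vector A to a feature vector."""
    def __init__(self, in_dim: int, out_dim: int = 7):
        super().__init__()
        # Sigmoid keeps every component in (0, 1); thresholding at 0.5
        # recovers the 0/1 permutation-style event code described above.
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim), nn.Sigmoid())

    def forward(self, a):
        return self.net(a)

# Usage: 11 keypoints -> vector A of length 22 -> feature vector of length 7
enc = KeypointEncoder(in_dim=22)
a = keypoints_to_vector([(454.657, 214.517)] * 11)
print(enc(a).shape)  # torch.Size([7])
```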
Fig. 2 is a schematic diagram of detecting key points in a picture in a preferred embodiment of the present application. The vector A obtained by expanding the key points identified in fig. 2 is [454.657, 214.517, 495.127, 264.14, 463.858, 268.059, 462.543, 329.411, 452.101, 381.636, 525.143, 257.605, 529.12, 338.571, 516.005, 407.71, 496.439, 407.71, 475.61, 406.425, 483.394, 476.914, 0.0633343], and the feature vector determined from vector A is [0.1, 0.4, 0.6, 0.7, 0.4, 0.9].
Preferably, in step S13, the feature vectors of all the human bodies in each picture are arranged to form a matrix, and the Kronecker product of this matrix and the feature matrix of the objects in the same picture is taken to obtain the fused feature matrix corresponding to each picture. Here, all the human-body feature column vectors y1, y2, …, yn are arranged into a matrix Y = [y1, y2, …, yn], and a feature matrix W = [w1, w2, …, wn] of the objects is obtained in a similar manner. The Kronecker product of Y and W gives the new fused feature matrix R = W ⊗ Y, where "⊗" denotes the Kronecker product.
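For illustration, a short numeric sketch of the fusion step, assuming NumPy; the vector values are made up. np.kron computes the Kronecker product of the object matrix W and the human-body matrix Y.

```python
import numpy as np

y1, y2 = np.array([0.1, 0.4, 0.6]), np.array([0.7, 0.4, 0.9])
w1 = np.array([0.2, 0.5, 0.3])

Y = np.stack([y1, y2], axis=1)   # human-body feature matrix Y = [y1, y2]
W = np.stack([w1], axis=1)       # object feature matrix W = [w1]
R = np.kron(W, Y)                # fused feature matrix R = W ⊗ Y
print(R.shape)                   # (9, 2): Kronecker product of 3x1 and 3x2
```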
Preferably, in step S13, the fused feature matrices corresponding to all the pictures are input into a convolutional neural network to obtain new feature maps, and the feature vector sequence is determined from those feature maps. Here, each new feature map is resized and then flattened into a one-dimensional vector to form the feature vector sequence. Continuing the above embodiment, the fused feature matrices R obtained from the pictures are arranged into a sequence {Ri} in sampling-time order; {Ri} is input into a convolutional neural network to obtain new feature maps, which are resized to a fixed size s×s and each flattened into a one-dimensional vector, finally yielding the feature vector sequence {Ti}.
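A hedged sketch of deriving {Ti}, assuming PyTorch; the single-layer CNN and the fixed size s = 8 are assumptions chosen only to make the shape handling concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.ReLU())

def fused_matrices_to_sequence(R_seq, s: int = 8):
    """Map each fused matrix Ri to a flattened s*s feature vector Ti."""
    T_seq = []
    for R in R_seq:  # R: one 2-D fused feature matrix
        x = torch.as_tensor(R, dtype=torch.float32)[None, None]  # 1x1xHxW
        fmap = conv(x)                                  # new feature map
        fmap = F.interpolate(fmap, size=(s, s), mode="bilinear",
                             align_corners=False)       # resize to s x s
        T_seq.append(fmap.flatten())                    # 1-D vector Ti
    return T_seq
```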
Preferably, in step S14, if at least one score among the scores of the multiple event categories is greater than the score threshold, the maximum score among those exceeding the threshold is taken, and the event category corresponding to that maximum score is taken as the event category of the video to be detected; if all the scores are less than or equal to the score threshold, sequence recognition is performed on the feature vector sequence using a long short-term memory (LSTM) model to determine the event category. Here, the score threshold may be customized by the user, and the feature vector sequence is passed through a classification network to obtain its score under each event category. When at least one score exceeds the threshold, the corresponding event category is that of the video to be detected. When all scores are less than or equal to the threshold, no key frame exists, that is, no single picture is convincing enough to judge the category of the video, and the LSTM model performs sequence recognition on the feature vector sequence to determine the event category of the video to be detected.
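As a sketch of the LSTM fallback, assuming PyTorch (the hidden size and number of categories are illustrative), the sequence {Ti} can be run through an LSTM and the final hidden state classified:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128, n_categories: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_categories)

    def forward(self, t_seq):          # t_seq: (batch, time, in_dim)
        _, (h_n, _) = self.lstm(t_seq)
        return self.head(h_n[-1])      # one score per event category

# Usage: a sequence of 16 flattened 8x8 feature vectors
clf = SequenceClassifier(in_dim=64)
print(clf(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 10])
```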
Each feature vector in {Ti} is input into the fully connected classification network to calculate its score under each event category, with a preset score threshold of 0.95. When the score obtained for event category Q is higher than 0.95, the video to be detected is judged to belong to category Q. When the scores obtained by all the feature vectors under all event categories are below 0.95, no key frame exists and no picture is convincing enough to judge the event category of the video to be detected, so the feature sequence {Ti} must be input into an LSTM model for sequence recognition. For example, if the feature vector of a certain frame in the video to be detected, after calculation by the fully connected classification network, scores higher than 0.95 under the gathering category, the video is classified as a gathering event. If all pictures score below 0.95 under all event categories, the whole feature sequence {Ti} is input into the LSTM model to determine the corresponding event category of the video.
If, after the video to be detected has been processed by the fully connected classification network, the score under the gathering category is higher than 0.95 and the score under the running category is also higher than 0.95, the two scores are compared, and the event category with the higher score is taken as the event category of the video to be detected.
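Putting the two branches together, a minimal sketch of the decision rule; frame_scores and lstm_classify are hypothetical names for the per-frame classifier outputs and the LSTM wrapper, not part of the application.

```python
THRESHOLD = 0.95  # preset score threshold from the embodiment above

def decide_event(frame_scores, lstm_classify, t_seq):
    """frame_scores: one {category: score} dict per sampled frame."""
    best_cat, best = None, THRESHOLD
    for scores in frame_scores:
        for cat, score in scores.items():
            if score > best:           # keep the maximum qualifying score
                best_cat, best = cat, score
    if best_cat is not None:           # at least one key frame was found
        return best_cat
    return lstm_classify(t_seq)        # no key frame: fall back to the LSTM
```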
Preferably, in step S14, a two-norm is computed over the components of each feature vector in the feature vector sequence, and the event category of the video to be detected is determined according to the two-norms corresponding to the feature vector sequence. Here, the components of each feature vector obtained under the permutation-based encoding are not necessarily exactly 1 or 0 and may be fractional, so the two-norm is taken and the event category determined from the two-norms corresponding to the sequence, enabling accurate event classification of the video; the difference between event categories can be measured by the difference between the two-norms of their feature vectors.
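A small numeric sketch of this two-norm comparison, assuming NumPy; the fractional values are illustrative only.

```python
import numpy as np

t1 = np.array([0.1, 0.4, 0.6, 0.7, 0.4, 0.9])    # fractional code, frame 1
t2 = np.array([0.9, 0.1, 0.5, 0.2, 0.8, 0.3])    # fractional code, frame 2
n1, n2 = np.linalg.norm(t1), np.linalg.norm(t2)  # two-norms (Euclidean)
print(abs(n1 - n2))  # a small difference suggests similar event categories
```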
Fig. 3 is a schematic flow chart of a method for identifying a video event in yet another preferred embodiment of the present application. A video to be detected is obtained and sampled at a fixed time interval T to obtain a picture sequence S; each picture in S is detected using an OpenPose model to obtain a key point sequence P; feature information is extracted for each group of key points in P to determine the corresponding feature vector; the feature vectors y1, y2, y3, …, yn extracted from the key point groups in P are recorded as a matrix Y; and each feature vector in Y is input into a fully connected classification network to calculate its score under each output event category. When one of the scores is greater than the preset score threshold of 0.95, the event category corresponding to that score is taken as the event category of the video to be detected; when all the scores are less than or equal to 0.95, each feature vector in Y is input into the LSTM model, which then determines the event category of the video to be detected.
Fig. 4 is a schematic structural diagram illustrating an apparatus for identifying a video event according to another aspect of the present application, the apparatus including: the device comprises a sampling device 11, a detection device 12, an extraction device 13 and an identification device 14, wherein the sampling device 11 is used for sampling an obtained video to be detected to obtain a picture sequence; the detection device 12 is configured to detect each picture in the picture sequence, and determine a key point group of each picture; the extracting device 13 is configured to extract feature information in the key point group of each picture, and determine a feature vector sequence according to the feature information in the key point group of each picture; the identifying device 14 is configured to determine an event category of the video to be detected according to each feature vector in the feature vector sequence. Therefore, the type of the video is accurately and quickly judged based on the key points of the human body and the object in the video.
It should be noted that the operations performed by the sampling device 11, the detection device 12, the extracting device 13 and the identifying device 14 are respectively the same as or correspond to those of steps S11, S12, S13 and S14 above and, for brevity, are not repeated here.
Furthermore, a computer readable medium is provided, on which computer readable instructions are stored, and the computer readable instructions can be executed by a processor to implement the foregoing method for identifying a video event.
According to still another aspect of the present application, there is also provided an apparatus for identifying a video event, wherein the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the aforementioned method of identifying video events.
For example, the computer readable instructions, when executed, cause the one or more processors to:
sampling the obtained video to be detected to obtain a picture sequence; detecting each picture in the picture sequence, and determining a key point group of each picture; extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture; and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A method of identifying a video event, wherein the method comprises:
sampling the obtained video to be detected to obtain a picture sequence;
detecting each picture in the picture sequence, and determining a key point group of each picture;
extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture;
and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
2. The method according to claim 1, wherein the determining the event category of the video to be detected according to each feature vector in the feature vector sequence comprises:
obtaining a plurality of event categories and scores of the event categories according to each feature vector in the feature vector sequence;
and determining the event category of the video to be detected according to the scores of the event categories and the score threshold value.
3. The method of claim 1, wherein the detecting each picture in the sequence of pictures, determining the keypoint group for each picture, comprises:
and detecting a human body and an object for each picture in the picture sequence, and determining a key point group of each picture, wherein the key point group of each picture comprises key points of the human body and key points of the object.
4. The method according to claim 3, wherein the extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture comprises:
expanding the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture into a vector of the human body and a vector of the object, respectively;
determining a characteristic vector of the human body and a characteristic vector of the object according to the vector of the human body and the vector of the object;
fusing all the characteristic vectors of the human body and the characteristic vectors of the object in each picture to obtain a fused characteristic matrix corresponding to each picture;
and determining a feature vector sequence according to the fusion feature matrixes corresponding to all the pictures.
5. The method according to claim 4, wherein the fusing all the feature vectors of the human body and the feature vectors of the object in each picture to obtain a fused feature matrix corresponding to each picture comprises:
arranging all the characteristic vectors of the human body in each picture to form a matrix;
and performing a Kronecker product on the matrix and the feature matrix of the object in the same picture to obtain a fusion feature matrix corresponding to each picture.
6. The method according to claim 5, wherein the determining a feature vector sequence according to the fused feature matrix corresponding to all the pictures comprises:
inputting the fusion feature matrixes corresponding to all the pictures into a convolutional neural network to obtain a new feature map;
and determining a feature vector sequence according to the new feature map.
7. The method according to claim 2, wherein the determining the event category of the video to be detected according to the scores of the plurality of event categories and a score threshold comprises:
if at least one score in the scores of the multiple event categories is larger than a score threshold, taking the maximum score among the scores higher than the score threshold, and taking the event category corresponding to the maximum score as the event category of the video to be detected;
and if the scores of the multiple event categories are all less than or equal to the score threshold, performing sequence recognition on the feature vector sequence by using a long short-term memory model to determine the event category of the video to be detected.
8. The method according to claim 1, wherein the determining an event category of the video to be detected according to each feature vector in the feature vector sequence further comprises:
and taking a two-norm over the components of each feature vector in the feature vector sequence, and determining the event category of the video to be detected according to the two-norms corresponding to the feature vector sequence.
9. An apparatus for identifying video events, wherein the apparatus comprises:
the sampling device is used for sampling the obtained video to be detected to obtain a picture sequence;
the detection device is used for detecting each picture in the picture sequence and determining a key point group of each picture;
the extracting device is used for extracting the characteristic information in the key point group of each picture and determining a characteristic vector sequence according to the characteristic information in the key point group of each picture;
and the identification device is used for determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
10. An apparatus for identifying video events, wherein the apparatus comprises:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 8.
CN201911128938.0A 2019-11-18 2019-11-18 Method and equipment for identifying video event Pending CN110909655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911128938.0A CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911128938.0A CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Publications (1)

Publication Number Publication Date
CN110909655A (en) 2020-03-24

Family

ID=69816811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911128938.0A Pending CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Country Status (1)

Country Link
CN (1) CN110909655A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507288A (en) * 2020-04-22 2020-08-07 上海眼控科技股份有限公司 Image detection method, image detection device, computer equipment and storage medium
CN112001229A (en) * 2020-07-09 2020-11-27 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
WO2023139757A1 (en) * 2022-01-21 2023-07-27 Nec Corporation Pose estimation apparatus, pose estimation method, and non-transitory computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229489A (en) * 2016-12-30 2018-06-29 北京市商汤科技开发有限公司 Crucial point prediction, network training, image processing method, device and electronic equipment
CN109684920A (en) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 Localization method, image processing method, device and the storage medium of object key point
WO2019114405A1 (en) * 2017-12-13 2019-06-20 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device and medium
CN110110613A (en) * 2019-04-19 2019-08-09 北京航空航天大学 A kind of rail traffic exception personnel's detection method based on action recognition
CN110287938A (en) * 2019-07-02 2019-09-27 齐鲁工业大学 Event recognition method, system, equipment and medium based on critical segment detection
CN110348335A (en) * 2019-06-25 2019-10-18 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium of Activity recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229489A (en) * 2016-12-30 2018-06-29 北京市商汤科技开发有限公司 Crucial point prediction, network training, image processing method, device and electronic equipment
WO2019114405A1 (en) * 2017-12-13 2019-06-20 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device and medium
CN109684920A (en) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 Localization method, image processing method, device and the storage medium of object key point
CN110110613A (en) * 2019-04-19 2019-08-09 北京航空航天大学 A kind of rail traffic exception personnel's detection method based on action recognition
CN110348335A (en) * 2019-06-25 2019-10-18 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium of Activity recognition
CN110287938A (en) * 2019-07-02 2019-09-27 齐鲁工业大学 Event recognition method, system, equipment and medium based on critical segment detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaofang Wang et al., "Combined trajectories for action recognition based on saliency detection and motion boundary", Signal Processing: Image Communication *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507288A (en) * 2020-04-22 2020-08-07 上海眼控科技股份有限公司 Image detection method, image detection device, computer equipment and storage medium
CN112001229A (en) * 2020-07-09 2020-11-27 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
CN112001229B (en) * 2020-07-09 2021-07-20 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
WO2023139757A1 (en) * 2022-01-21 2023-07-27 Nec Corporation Pose estimation apparatus, pose estimation method, and non-transitory computer-readable storage medium

Similar Documents

Publication Publication Date Title
JP6893233B2 (en) Image-based data processing methods, devices, electronics, computer-readable storage media and computer programs
CN107358149B (en) Human body posture detection method and device
KR101507662B1 (en) Semantic parsing of objects in video
CN110909655A (en) Method and equipment for identifying video event
CN111046235A (en) Method, system, equipment and medium for searching acoustic image archive based on face recognition
CN111612822B (en) Object tracking method, device, computer equipment and storage medium
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN110348392B (en) Vehicle matching method and device
CN110969045B (en) Behavior detection method and device, electronic equipment and storage medium
KR101177626B1 (en) Object checking apparatus and method
CN109284700B (en) Method, storage medium, device and system for detecting multiple faces in image
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
KR20220098030A (en) Method for constructing target motion trajectory, device and computer storage medium
CN111507332A (en) Vehicle VIN code detection method and equipment
CN110728193B (en) Method and device for detecting richness characteristics of face image
CN112381092A (en) Tracking method, device and computer readable storage medium
Mahurkar Integrating yolo object detection with augmented reality for ios apps
EP3647997A1 (en) Person searching method and apparatus and image processing device
CN112150508A (en) Target tracking method, device and related equipment
CN111062385A (en) Network model construction method and system for image text information detection
CN110969138A (en) Human body posture estimation method and device
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN111860122A (en) Method and system for recognizing reading comprehensive behaviors in real scene
CN116129523A (en) Action recognition method, device, terminal and computer readable storage medium
CN111274602A (en) Image characteristic information replacement method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2023-10-13)