CN110909655A - Method and equipment for identifying video event - Google Patents

Method and equipment for identifying video event

Info

Publication number
CN110909655A
CN110909655A
Authority
CN
China
Prior art keywords
picture
video
feature vector
feature
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911128938.0A
Other languages
Chinese (zh)
Inventor
周康明
李俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911128938.0A priority Critical patent/CN110909655A/en
Publication of CN110909655A publication Critical patent/CN110909655A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The method comprises: sampling an obtained video to be detected to obtain a picture sequence; detecting each picture in the picture sequence and determining a key point group of each picture; extracting feature information from the key point group of each picture and determining a feature vector sequence according to that feature information; and determining the event category of the video to be detected according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.

Description

Method and equipment for identifying video event
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for identifying a video event.
Background
In the prior art, it is difficult to judge the category of a video based on the key points of human bodies and objects in the video, and the judgment results are inaccurate. Moreover, identifying the video type usually requires extracting features from the whole of every picture in the sequence, which wastes a large amount of computational and time resources.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for identifying a video event, which solve the prior-art problems that identifying the video type is both inaccurate and time-consuming.
According to an aspect of the present application, there is provided a method of identifying a video event, the method comprising:
sampling the obtained video to be detected to obtain a picture sequence;
detecting each picture in the picture sequence, and determining a key point group of each picture;
extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture;
and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
Further, the determining the event category of the video to be detected according to each feature vector in the feature vector sequence includes:
obtaining a plurality of event categories and scores of the event categories according to each feature vector in the feature vector sequence;
and determining the event category of the video to be detected according to the scores of the event categories and the score threshold value.
Further, the detecting each picture in the picture sequence and determining the key point group of each picture includes:
and detecting a human body and an object for each picture in the picture sequence, and determining a key point group of each picture, wherein the key point group of each picture comprises key points of the human body and key points of the object.
Further, the extracting the feature information in the key point group of each picture and determining a feature vector sequence according to the feature information in the key point group of each picture includes:
expanding the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture into a vector of the human body and a vector of the object, respectively;
determining a characteristic vector of the human body and a characteristic vector of the object according to the vector of the human body and the vector of the object;
fusing all the characteristic vectors of the human body and the characteristic vectors of the object in each picture to obtain a fused characteristic matrix corresponding to each picture;
and determining a feature vector sequence according to the fusion feature matrixes corresponding to all the pictures.
Further, the fusing all the feature vectors of the human body and the feature vectors of the object in each picture to obtain a fused feature matrix corresponding to each picture includes:
arranging all the characteristic vectors of the human body in each picture to form a matrix;
and performing a Kronecker product on the matrix and the feature matrix of the object in the same picture to obtain a fusion feature matrix corresponding to each picture.
Further, the determining a feature vector sequence according to the fusion feature matrices corresponding to all the pictures includes:
inputting the fusion feature matrixes corresponding to all the pictures into a convolutional neural network to obtain a new feature map;
and determining a feature vector sequence according to the new feature map.
Further, the determining the event category of the video to be detected according to the scores of the multiple event categories and the score threshold includes:
if at least one score in the scores of the multiple event categories is larger than a score threshold, taking the maximum score among the scores higher than the score threshold, and taking the event category corresponding to the maximum score as the event category of the video to be detected;
and if the scores of the multiple event categories are all less than or equal to the score threshold, performing sequence recognition on the feature vector sequence by using a long short-term memory model to determine the event category of the video to be detected.
Further, the determining the event category of the video to be detected according to each feature vector in the feature vector sequence further includes:
and taking a two-norm over the components of each feature vector in the feature vector sequence, and determining the event category of the video to be detected according to the two-norms corresponding to the feature vector sequence.
According to another aspect of the present application, there is also provided an apparatus for identifying a video event, the apparatus including:
the sampling device is used for sampling the obtained video to be detected to obtain a picture sequence;
the detection device is used for detecting each picture in the picture sequence and determining a key point group of each picture;
the extracting device is used for extracting the characteristic information in the key point group of each picture and determining a characteristic vector sequence according to the characteristic information in the key point group of each picture;
and the identification device is used for determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
According to yet another aspect of the present application, there is also provided an apparatus for identifying a video event, the apparatus including:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of identifying video events described above.
Compared with the prior art, the present application samples the obtained video to be detected to obtain a picture sequence; detects each picture in the picture sequence and determines a key point group of each picture; extracts feature information from the key point group of each picture and determines a feature vector sequence according to that feature information; and determines the event category of the video to be detected according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 illustrates a flow diagram of a method of identifying video events provided in accordance with an aspect of the present application;
FIG. 2 is a diagram illustrating a method for detecting key points in a picture according to a preferred embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for identifying video events in accordance with a further preferred embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for identifying a video event according to another aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
Fig. 1 shows a schematic flow chart of a method for identifying a video event according to an aspect of the present application. The method comprises steps S11 to S14. In step S11, the obtained video to be detected is sampled to obtain a picture sequence; in step S12, each picture in the picture sequence is detected and a key point group of each picture is determined; in step S13, feature information in the key point group of each picture is extracted, and a feature vector sequence is determined according to that feature information; and in step S14, the event category of the video to be detected is determined according to each feature vector in the feature vector sequence. In this way, the category of the video is judged accurately and quickly based on the key points of the human bodies and objects in the video.
Specifically, in step S11, the obtained video to be detected is sampled to obtain a picture sequence. The video to be detected can be acquired through a camera or similar device, and is then sampled at a preset fixed time interval to obtain a picture sequence arranged in sampling-time order.
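For illustration only, the following is a minimal sketch of this fixed-interval sampling step, assuming OpenCV (cv2) is available; the function name sample_frames and the default interval are assumptions, not part of the application.

```python
# A minimal sketch of fixed-interval frame sampling, assuming OpenCV.
import cv2

def sample_frames(video_path: str, interval_s: float = 1.0):
    """Sample one frame every interval_s seconds, in temporal order."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is missing
    step = max(1, int(round(fps * interval_s)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)               # keep every step-th frame
        idx += 1
    cap.release()
    return frames  # picture sequence S, ordered by sampling time
```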
Step S12, detecting each picture in the picture sequence, and determining a key point group of each picture. Here, each picture in the picture sequence may be detected using a deep learning model, for example an OpenPose model, and the set of all key points detected in each picture is taken as that picture's key point group.
And step S13, extracting the feature information in the key point group of each picture, and determining a feature vector sequence according to that feature information. Here, the key point group of each picture includes key points of people, key points of objects, and so on; the feature information in each key point group may be determined from the coordinates of the key points, the classification to which they belong, and the like; and the feature vector sequence may be determined from the feature information by some algorithm, for example by expanding the key point groups into vectors and deriving the feature vector sequence from those vectors.
And step S14, determining the event category of the video to be detected according to each feature vector in the feature vector sequence. Here, each feature vector in the feature vector sequence may be calculated by a classification network algorithm to obtain a corresponding score, and an event category corresponding to each feature vector is determined according to the corresponding score. Therefore, the type of the video is accurately and quickly judged based on the key points of the human body and the object in the video.
Preferably, in step S14, a plurality of event categories and scores for those event categories are obtained according to each feature vector in the feature vector sequence, and the event category of the video to be detected is determined according to the scores of the event categories and a score threshold. Here, a plurality of different event categories, and the score of each feature vector under each of them, are determined from each feature vector in the feature vector sequence. Score thresholds for the event categories can be preset; a classification neural network can be used to calculate the score of each feature vector under the different event categories, and the event category of the video to be detected is determined by comparing these scores with the preset thresholds.
Preferably, in step S12, human-body and object detection is performed on each picture in the picture sequence, and a key point group of each picture is determined, where the key point group includes key points of human bodies and key points of objects. As shown in fig. 2, an OpenPose model is used to detect the human bodies and objects in each picture and determine their key points: human-body key points can be, for example, the head and skeletal joints such as the knees, shoulders and ankles, while object key points can be corners, edges, or points of prominent color on the object. The motion trajectories of the people and objects in the video can be tracked and determined through these key points.
Preferably, in step S13, the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture are expanded into a vector of the human body and a vector of the object, respectively; a feature vector of the human body and a feature vector of the object are determined according to the vector of the human body and the vector of the object; all the feature vectors of the human body and the feature vectors of the object in each picture are fused to obtain a fused feature matrix corresponding to each picture; and a feature vector sequence is determined according to the fused feature matrices corresponding to all the pictures. In this method, key points of the human body and the object are detected in each picture, their coordinates in the picture are expanded into a vector of the human body and a vector of the object respectively, and these vectors are then passed through a neural network to obtain feature vectors. In a preferred embodiment of the present application, the coordinates of the human-body and object key points detected in each picture are expanded into a vector A, and the vector A is input into a neural network, preferably a fully connected neural network, to obtain a feature vector. The feature vector output by the fully connected neural network encodes an event category, and the encoding distinguishes different event categories by permutation. For example, for an output feature vector of length 7, the vectors (1,1,1,0,0,0,0), (1,1,0,1,0,0,0), (1,0,1,0,0,0,1) and (0,0,1,1,0,0,0) are permutations of the numbers 1 and 0 and can represent four different event categories. In the prior art, by contrast, each component can represent only one event category: for example, (1,0,0,0,0,0,0), (0,1,0,0,0,0,0) and (0,0,1,0,0,0,0) each represent one category, that is, exactly one component is set to 1 and the rest to 0. Thus, for a feature vector of total length N, the prior art can encode at most N categories, whereas the permutation-based encoding yields 2^N different feature vectors and hence 2^N different event categories. This allows a finer-grained classification of the event categories indicated by the key points: two codes such as (1,0,1,1,0,0,0) and (1,1,1,0,0,0,0) differ in only a few components, so nearby codes can be used to identify similar event categories. Compared with the prior design in which each component is the score of one category, the method also yields much richer codes.
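For illustration only, a hedged sketch of this encoding step, assuming PyTorch; the helper names keypoints_to_vector and KeypointEncoder, the hidden-layer size, and the output length of 7 are assumptions, not part of the application. Keypoint coordinates are flattened into a vector A and passed through a small fully connected network whose sigmoid outputs, thresholded at 0.5, give the 0/1 permutation-style codes described above.

```python
import torch
import torch.nn as nn

def keypoints_to_vector(keypoints):
    """Flatten a list of (x, y) keypoint coordinates into one vector A."""
    return torch.tensor([c for (x, y) in keypoints for c in (x, y)],
                        dtype=torch.float32)

class KeypointEncoder(nn.Module):
    """Fully connected network mapping vector A to a feature vector."""
    def __init__(self, in_dim: int, out_dim: int = 7):
        super().__init__()
        # Sigmoid keeps every component in (0, 1); thresholding at 0.5
        # recovers the 0/1 permutation-style event code described above.
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim), nn.Sigmoid())

    def forward(self, a):
        return self.net(a)

# Usage: 11 keypoints -> vector A of length 22 -> feature vector of length 7
enc = KeypointEncoder(in_dim=22)
a = keypoints_to_vector([(454.657, 214.517)] * 11)
print(enc(a).shape)  # torch.Size([7])
```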
Fig. 2 is a schematic diagram of detecting key points in a picture in a preferred embodiment of the present application. The vector A obtained by expanding the key points identified in fig. 2 is [454.657, 214.517, 495.127, 264.14, 463.858, 268.059, 462.543, 329.411, 452.101, 381.636, 525.143, 257.605, 529.12, 338.571, 516.005, 407.71, 496.439, 407.71, 475.61, 406.425, 483.394, 476.914, 0.0633343], and the feature vector determined from vector A is [0.1, 0.4, 0.6, 0.7, 0.4, 0.9].
Preferably, in step S13, the feature vectors of all the human bodies in each picture are arranged to form a matrix, and the Kronecker product of this matrix and the feature matrix of the objects in the same picture is taken to obtain the fused feature matrix corresponding to each picture. Here, all the human-body feature column vectors y1, y2, …, yn are arranged into a matrix Y = [y1, y2, …, yn], and a feature matrix W = [w1, w2, …, wn] of the objects is obtained in a similar manner. The Kronecker product of Y and W gives the new fused feature matrix R = W ⊗ Y, where "⊗" denotes the Kronecker product.
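For illustration, a short numeric sketch of the fusion step, assuming NumPy; the vector values are made up. np.kron computes the Kronecker product of the object matrix W and the human-body matrix Y.

```python
import numpy as np

y1, y2 = np.array([0.1, 0.4, 0.6]), np.array([0.7, 0.4, 0.9])
w1 = np.array([0.2, 0.5, 0.3])

Y = np.stack([y1, y2], axis=1)   # human-body feature matrix Y = [y1, y2]
W = np.stack([w1], axis=1)       # object feature matrix W = [w1]
R = np.kron(W, Y)                # fused feature matrix R = W ⊗ Y
print(R.shape)                   # (9, 2): Kronecker product of 3x1 and 3x2
```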
Preferably, in step S13, the fused feature matrices corresponding to all the pictures are input into a convolutional neural network to obtain new feature maps, and the feature vector sequence is determined from those feature maps. Here, each new feature map is resized and then flattened into a one-dimensional vector to form the feature vector sequence. Continuing the above embodiment, the fused feature matrices R obtained from the pictures are arranged into a sequence {Ri} in sampling-time order; {Ri} is input into a convolutional neural network to obtain new feature maps, which are resized to a fixed size s×s and each flattened into a one-dimensional vector, finally yielding the feature vector sequence {Ti}.
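A hedged sketch of deriving {Ti}, assuming PyTorch; the single-layer CNN and the fixed size s = 8 are assumptions chosen only to make the shape handling concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.ReLU())

def fused_matrices_to_sequence(R_seq, s: int = 8):
    """Map each fused matrix Ri to a flattened s*s feature vector Ti."""
    T_seq = []
    for R in R_seq:  # R: one 2-D fused feature matrix
        x = torch.as_tensor(R, dtype=torch.float32)[None, None]  # 1x1xHxW
        fmap = conv(x)                                  # new feature map
        fmap = F.interpolate(fmap, size=(s, s), mode="bilinear",
                             align_corners=False)       # resize to s x s
        T_seq.append(fmap.flatten())                    # 1-D vector Ti
    return T_seq
```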
Preferably, in step S14, if at least one score among the scores of the multiple event categories is greater than the score threshold, the maximum score among those exceeding the threshold is taken, and the event category corresponding to that maximum score is taken as the event category of the video to be detected; if all the scores are less than or equal to the score threshold, sequence recognition is performed on the feature vector sequence using a long short-term memory (LSTM) model to determine the event category. Here, the score threshold may be customized by the user, and the feature vector sequence is passed through a classification network to obtain its score under each event category. When at least one score exceeds the threshold, the corresponding event category is that of the video to be detected. When all scores are less than or equal to the threshold, no key frame exists, that is, no single picture is convincing enough to judge the category of the video, and the LSTM model performs sequence recognition on the feature vector sequence to determine the event category of the video to be detected.
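As a sketch of the LSTM fallback, assuming PyTorch (the hidden size and number of categories are illustrative), the sequence {Ti} can be run through an LSTM and the final hidden state classified:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128, n_categories: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_categories)

    def forward(self, t_seq):          # t_seq: (batch, time, in_dim)
        _, (h_n, _) = self.lstm(t_seq)
        return self.head(h_n[-1])      # one score per event category

# Usage: a sequence of 16 flattened 8x8 feature vectors
clf = SequenceClassifier(in_dim=64)
print(clf(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 10])
```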
Each feature vector in {Ti} is input into the fully connected classification network to calculate its score under each event category, with a preset score threshold of 0.95. When the score obtained for event category Q is higher than 0.95, the video to be detected is judged to belong to category Q. When the scores obtained by all the feature vectors under all event categories are below 0.95, no key frame exists and no picture is convincing enough to judge the event category of the video to be detected, so the feature sequence {Ti} must be input into an LSTM model for sequence recognition. For example, if the feature vector of a certain frame in the video to be detected, after calculation by the fully connected classification network, scores higher than 0.95 under the gathering category, the video is classified as a gathering event. If all pictures score below 0.95 under all event categories, the whole feature sequence {Ti} is input into the LSTM model to determine the corresponding event category of the video.
If, after the video to be detected has been processed by the fully connected classification network, the score under the gathering category is higher than 0.95 and the score under the running category is also higher than 0.95, the two scores are compared, and the event category with the higher score is taken as the event category of the video to be detected.
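Putting the two branches together, a minimal sketch of the decision rule; frame_scores and lstm_classify are hypothetical names for the per-frame classifier outputs and the LSTM wrapper, not part of the application.

```python
THRESHOLD = 0.95  # preset score threshold from the embodiment above

def decide_event(frame_scores, lstm_classify, t_seq):
    """frame_scores: one {category: score} dict per sampled frame."""
    best_cat, best = None, THRESHOLD
    for scores in frame_scores:
        for cat, score in scores.items():
            if score > best:           # keep the maximum qualifying score
                best_cat, best = cat, score
    if best_cat is not None:           # at least one key frame was found
        return best_cat
    return lstm_classify(t_seq)        # no key frame: fall back to the LSTM
```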
Preferably, in step S14, a two-norm is computed over the components of each feature vector in the feature vector sequence, and the event category of the video to be detected is determined according to the two-norms corresponding to the feature vector sequence. Here, the components of each feature vector obtained under the permutation-based encoding are not necessarily exactly 1 or 0 and may be fractional, so the two-norm is taken and the event category determined from the two-norms corresponding to the sequence, enabling accurate event classification of the video; the difference between event categories can be measured by the difference between the two-norms of their feature vectors.
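A small numeric sketch of this two-norm comparison, assuming NumPy; the fractional values are illustrative only.

```python
import numpy as np

t1 = np.array([0.1, 0.4, 0.6, 0.7, 0.4, 0.9])    # fractional code, frame 1
t2 = np.array([0.9, 0.1, 0.5, 0.2, 0.8, 0.3])    # fractional code, frame 2
n1, n2 = np.linalg.norm(t1), np.linalg.norm(t2)  # two-norms (Euclidean)
print(abs(n1 - n2))  # a small difference suggests similar event categories
```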
Fig. 3 is a schematic flow chart of a method for identifying a video event in yet another preferred embodiment of the present application. A video to be detected is obtained and sampled at a fixed time interval T to obtain a picture sequence S; each picture in S is detected using an OpenPose model to obtain a key point sequence P; feature information is extracted for each group of key points in P to determine the corresponding feature vector; the feature vectors y1, y2, y3, …, yn extracted from the key point groups in P are recorded as a matrix Y; and each feature vector in Y is input into a fully connected classification network to calculate its score under each output event category. When one of the scores is greater than the preset score threshold of 0.95, the event category corresponding to that score is taken as the event category of the video to be detected; when all the scores are less than or equal to 0.95, each feature vector in Y is input into the LSTM model, which then determines the event category of the video to be detected.
Fig. 4 is a schematic structural diagram illustrating an apparatus for identifying a video event according to another aspect of the present application, the apparatus including: the device comprises a sampling device 11, a detection device 12, an extraction device 13 and an identification device 14, wherein the sampling device 11 is used for sampling an obtained video to be detected to obtain a picture sequence; the detection device 12 is configured to detect each picture in the picture sequence, and determine a key point group of each picture; the extracting device 13 is configured to extract feature information in the key point group of each picture, and determine a feature vector sequence according to the feature information in the key point group of each picture; the identifying device 14 is configured to determine an event category of the video to be detected according to each feature vector in the feature vector sequence. Therefore, the type of the video is accurately and quickly judged based on the key points of the human body and the object in the video.
It should be noted that the operations performed by the sampling device 11, the detection device 12, the extracting device 13 and the identifying device 14 are respectively the same as or correspond to those of steps S11, S12, S13 and S14 above and, for brevity, are not repeated here.
Furthermore, a computer readable medium is provided, on which computer readable instructions are stored, and the computer readable instructions can be executed by a processor to implement the foregoing method for identifying a video event.
According to still another aspect of the present application, there is also provided an apparatus for identifying a video event, wherein the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the aforementioned method of identifying video events.
For example, the computer readable instructions, when executed, cause the one or more processors to:
sampling the obtained video to be detected to obtain a picture sequence; detecting each picture in the picture sequence, and determining a key point group of each picture; extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture; and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A method of identifying a video event, wherein the method comprises:
sampling the obtained video to be detected to obtain a picture sequence;
detecting each picture in the picture sequence, and determining a key point group of each picture;
extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture;
and determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
2. The method according to claim 1, wherein the determining the event category of the video to be detected according to each feature vector in the feature vector sequence comprises:
obtaining a plurality of event categories and scores of the event categories according to each feature vector in the feature vector sequence;
and determining the event category of the video to be detected according to the scores of the event categories and the score threshold value.
3. The method of claim 1, wherein the detecting each picture in the sequence of pictures, determining the keypoint group for each picture, comprises:
and detecting a human body and an object for each picture in the picture sequence, and determining a key point group of each picture, wherein the key point group of each picture comprises key points of the human body and key points of the object.
4. The method according to claim 3, wherein the extracting feature information in the key point group of each picture, and determining a feature vector sequence according to the feature information in the key point group of each picture comprises:
expanding the coordinates, in the corresponding picture, of the key points of the human body and the key points of the object in the key point group of each picture into a vector of the human body and a vector of the object, respectively;
determining a characteristic vector of the human body and a characteristic vector of the object according to the vector of the human body and the vector of the object;
fusing all the characteristic vectors of the human body and the characteristic vectors of the object in each picture to obtain a fused characteristic matrix corresponding to each picture;
and determining a feature vector sequence according to the fusion feature matrixes corresponding to all the pictures.
5. The method according to claim 4, wherein the fusing all the feature vectors of the human body and the feature vectors of the object in each picture to obtain a fused feature matrix corresponding to each picture comprises:
arranging all the characteristic vectors of the human body in each picture to form a matrix;
and performing a Kronecker product on the matrix and the feature matrix of the object in the same picture to obtain a fusion feature matrix corresponding to each picture.
6. The method according to claim 5, wherein the determining a feature vector sequence according to the fused feature matrix corresponding to all the pictures comprises:
inputting the fusion feature matrixes corresponding to all the pictures into a convolutional neural network to obtain a new feature map;
and determining a feature vector sequence according to the new feature map.
7. The method according to claim 2, wherein the determining the event category of the video to be detected according to the scores of the plurality of event categories and a score threshold comprises:
if at least one score in the scores of the multiple event categories is larger than a score threshold, taking the maximum score among the scores higher than the score threshold, and taking the event category corresponding to the maximum score as the event category of the video to be detected;
and if the scores of the multiple event categories are all less than or equal to the score threshold, performing sequence recognition on the feature vector sequence by using a long short-term memory model to determine the event category of the video to be detected.
8. The method according to claim 1, wherein the determining an event category of the video to be detected according to each feature vector in the feature vector sequence further comprises:
and taking a two-norm over the components of each feature vector in the feature vector sequence, and determining the event category of the video to be detected according to the two-norms corresponding to the feature vector sequence.
9. An apparatus for identifying video events, wherein the apparatus comprises:
the sampling device is used for sampling the obtained video to be detected to obtain a picture sequence;
the detection device is used for detecting each picture in the picture sequence and determining a key point group of each picture;
the extracting device is used for extracting the characteristic information in the key point group of each picture and determining a characteristic vector sequence according to the characteristic information in the key point group of each picture;
and the identification device is used for determining the event category of the video to be detected according to each feature vector in the feature vector sequence.
10. An apparatus for identifying video events, wherein the apparatus comprises:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 8.
CN201911128938.0A 2019-11-18 2019-11-18 Method and equipment for identifying video event Pending CN110909655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911128938.0A CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911128938.0A CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Publications (1)

Publication Number Publication Date
CN110909655A (en) 2020-03-24

Family

ID=69816811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911128938.0A Pending CN110909655A (en) 2019-11-18 2019-11-18 Method and equipment for identifying video event

Country Status (1)

Country Link
CN (1) CN110909655A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507288A (en) * 2020-04-22 2020-08-07 上海眼控科技股份有限公司 Image detection method, image detection device, computer equipment and storage medium
CN112001229A (en) * 2020-07-09 2020-11-27 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
WO2023139757A1 (en) * 2022-01-21 2023-07-27 Nec Corporation Pose estimation apparatus, pose estimation method, and non-transitory computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229489A (en) * 2016-12-30 2018-06-29 北京市商汤科技开发有限公司 Crucial point prediction, network training, image processing method, device and electronic equipment
CN109684920A (en) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 Localization method, image processing method, device and the storage medium of object key point
WO2019114405A1 (en) * 2017-12-13 2019-06-20 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device and medium
CN110110613A (en) * 2019-04-19 2019-08-09 北京航空航天大学 A kind of rail traffic exception personnel's detection method based on action recognition
CN110287938A (en) * 2019-07-02 2019-09-27 齐鲁工业大学 Event recognition method, system, equipment and medium based on critical segment detection
CN110348335A (en) * 2019-06-25 2019-10-18 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium of Activity recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229489A (en) * 2016-12-30 2018-06-29 北京市商汤科技开发有限公司 Crucial point prediction, network training, image processing method, device and electronic equipment
WO2019114405A1 (en) * 2017-12-13 2019-06-20 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device and medium
CN109684920A (en) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 Localization method, image processing method, device and the storage medium of object key point
CN110110613A (en) * 2019-04-19 2019-08-09 北京航空航天大学 A kind of rail traffic exception personnel's detection method based on action recognition
CN110348335A (en) * 2019-06-25 2019-10-18 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium of Activity recognition
CN110287938A (en) * 2019-07-02 2019-09-27 齐鲁工业大学 Event recognition method, system, equipment and medium based on critical segment detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaofang Wang et al., "Combined trajectories for action recognition based on saliency detection and motion boundary", Signal Processing: Image Communication *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507288A (en) * 2020-04-22 2020-08-07 上海眼控科技股份有限公司 Image detection method, image detection device, computer equipment and storage medium
CN112001229A (en) * 2020-07-09 2020-11-27 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
CN112001229B (en) * 2020-07-09 2021-07-20 浙江大华技术股份有限公司 Method, device and system for identifying video behaviors and computer equipment
WO2023139757A1 (en) * 2022-01-21 2023-07-27 Nec Corporation Pose estimation apparatus, pose estimation method, and non-transitory computer-readable storage medium

Similar Documents

Publication Publication Date Title
JP6893233B2 (en) Image-based data processing methods, devices, electronics, computer-readable storage media and computer programs
CN107358149B (en) Human body posture detection method and device
KR101507662B1 (en) Semantic parsing of objects in video
CN110909655A (en) Method and equipment for identifying video event
CN111046235A (en) Method, system, equipment and medium for searching acoustic image archive based on face recognition
CN111612822B (en) Object tracking method, device, computer equipment and storage medium
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN110348392B (en) Vehicle matching method and device
CN110969045B (en) Behavior detection method and device, electronic equipment and storage medium
KR101177626B1 (en) Object checking apparatus and method
CN109284700B (en) Method, storage medium, device and system for detecting multiple faces in image
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
KR20220098030A (en) Method for constructing target motion trajectory, device and computer storage medium
CN111507332A (en) Vehicle VIN code detection method and equipment
CN110728193B (en) Method and device for detecting richness characteristics of face image
CN112381092A (en) Tracking method, device and computer readable storage medium
Mahurkar Integrating yolo object detection with augmented reality for ios apps
EP3647997A1 (en) Person searching method and apparatus and image processing device
CN112150508A (en) Target tracking method, device and related equipment
CN111062385A (en) Network model construction method and system for image text information detection
CN110969138A (en) Human body posture estimation method and device
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN111860122A (en) Method and system for recognizing reading comprehensive behaviors in real scene
CN116129523A (en) Action recognition method, device, terminal and computer readable storage medium
CN111274602A (en) Image characteristic information replacement method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2023-10-13)