CN111310516B - Behavior recognition method and device - Google Patents

Behavior recognition method and device

Info

Publication number
CN111310516B
CN111310516B CN201811510291.3A
Authority
CN
China
Prior art keywords
convolutional neural
neural network
network
behavior recognition
video frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811510291.3A
Other languages
Chinese (zh)
Other versions
CN111310516A (en)
Inventor
卜英家
谭文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811510291.3A priority Critical patent/CN111310516B/en
Publication of CN111310516A publication Critical patent/CN111310516A/en
Application granted granted Critical
Publication of CN111310516B publication Critical patent/CN111310516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application provides a behavior recognition method and a behavior recognition device, wherein the method comprises the following steps: inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent features of the video frames; inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network to obtain space-time features of the video frames; and inputting the space-time features of the video frames into a pre-trained classification network to obtain a behavior recognition result for the video frames. The method can improve the efficiency of behavior recognition.

Description

Behavior recognition method and device
Technical Field
The present application relates to the field of computer vision, and in particular, to a behavior recognition method and apparatus.
Background
Computer vision is the simulation of biological vision using computers and related equipment. Its main task is to process acquired pictures or videos to obtain three-dimensional information about the corresponding scene. Behavior recognition is a popular research direction in the field of computer vision.
Current behavior recognition schemes mainly feed continuous video frames (typically RGB (Red, Green, Blue) images) into a 3D convolutional neural network, extract the space-time features of the video frames based on 3D convolution, and then feed the extracted features into a classifier for behavior recognition.
However, in practice this scheme involves a large amount of computation, and the 3D convolutional neural network model has many parameters and is difficult to train.
Disclosure of Invention
In view of the above, the present application provides a behavior recognition method and apparatus.
Specifically, the application is realized by the following technical scheme:
according to a first aspect of an embodiment of the present application, there is provided a behavior recognition method, including:
inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent characteristics of the video frames;
inputting the apparent characteristics of the video frames into a pre-trained 1D convolutional neural network to obtain the space-time characteristics of the video frames;
and inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain the behavior recognition result of the video frames.
According to a second aspect of an embodiment of the present application, there is provided a behavior recognition apparatus including:
the apparent feature extraction unit is used for inputting a preset number of continuous video frames into the pre-trained 2D convolutional neural network so as to obtain apparent features of the video frames;
the time sequence feature extraction unit is used for inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network so as to obtain the space-time features of the video frames;
and the behavior recognition unit is used for inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain behavior recognition results of the video frames.
According to a third aspect of an embodiment of the present application, there is provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for implementing the above behavior recognition method when executing the program stored in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a machine-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above-described behavior recognition method.
According to the behavior recognition method provided by the application, a preset number of continuous video frames are input into a pre-trained 2D convolutional neural network to obtain the apparent features of the video frames; the apparent features are input into a pre-trained 1D convolutional neural network to obtain the space-time features of the video frames; and the space-time features are then input into a pre-trained classification network to obtain the behavior recognition result of the video frames. Because the 2D convolutional neural network and the 1D convolutional neural network are combined, the space-time features can be extracted effectively while the model has few parameters and is easy to train, so the efficiency of behavior recognition can be improved.
Drawings
FIG. 1 is a flow chart of a behavior recognition method according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method of training a cascaded convolutional neural network in accordance with an exemplary embodiment of the present application;
FIG. 3A is a schematic diagram of an apparent feature extraction shown in accordance with an exemplary embodiment of the present application;
FIG. 3B is a schematic diagram of a timing feature extraction according to an exemplary embodiment of the present application;
FIG. 3C is a diagram illustrating a behavior recognition according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a behavior recognition apparatus according to an exemplary embodiment of the present application;
fig. 5 is a schematic structural view of a behavior recognition apparatus according to still another exemplary embodiment of the present application;
fig. 6 is a schematic structural view of a behavior recognition apparatus according to still another exemplary embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a behavior recognition method provided by an embodiment of the present application is shown, wherein the behavior recognition method may be applied to a background server for video monitoring, and as shown in fig. 1, the method may include the following steps:
step S100, inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent features of the preset number of continuous video frames.
In the embodiment of the application, when behavior recognition needs to be performed on a video, a preset number (which can be set according to actual requirements, e.g., 8 frames, 16 frames, etc.) of continuous video frames of the video can be input into the pre-trained 2D convolutional neural network to obtain the apparent features (i.e., spatial features) of the preset number of continuous video frames.
For one frame of video image, the spatial dimension of the apparent features extracted by the pre-trained 2D convolutional neural network is 1: after apparent feature extraction is performed on a frame of video image through the 2D convolutional neural network, a C-dimensional vector is obtained, which can be expressed as C*1*1 (i.e., the width and height are both 1), where C is the number of channels of the extracted apparent features.
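As an illustrative sketch only (the application does not disclose a concrete layer configuration; the PyTorch framework, class name and layer sizes below are assumptions made for this example), the following code shows how a 2D convolutional backbone followed by global average pooling can map each input frame of size Cin*H*W to a C*1*1 apparent feature vector:

```python
import torch
import torch.nn as nn

class ApparentFeature2D(nn.Module):
    """Hypothetical per-frame apparent (spatial) feature extractor."""
    def __init__(self, in_channels: int = 3, feat_channels: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapses the spatial dimensions to 1*1

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, Cin, H, W) -> apparent features: (N, C, 1, 1)
        return self.pool(self.backbone(frames))

# e.g. 8 consecutive RGB frames of size 224x224 -> 8 apparent feature vectors of 512*1*1
feats = ApparentFeature2D()(torch.randn(8, 3, 224, 224))
print(feats.shape)  # torch.Size([8, 512, 1, 1])
```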
Step S110, inputting the apparent features of the preset number of continuous video frames into a pre-trained 1D convolutional neural network to obtain the space-time features of the preset number of continuous video frames.
In the embodiment of the application, after the apparent features of the preset number of continuous video frames are extracted, in order to realize behavior recognition, the time sequence features of the preset number of continuous video frames are also required to be extracted.
In the embodiment of the application, in order to improve the extraction efficiency of the time sequence features, the time sequence features can be extracted through a 1D convolutional neural network.
Accordingly, after the apparent features of the preset number of continuous video frames are extracted, these apparent features can be input into the pre-trained 1D convolutional neural network to extract the time sequence features among them, thereby obtaining the space-time features of the preset number of continuous video frames.
Step S120, inputting the space-time characteristics of the preset number of continuous video frames into a pre-trained classification network to obtain the behavior recognition results corresponding to the preset number of continuous video frames.
In the embodiment of the application, after the space-time characteristics of the preset number of continuous video frames are extracted, the space-time characteristics of the preset number of continuous video frames can be input into a pre-trained classification network for behavior recognition, so that the behavior recognition result of the preset number of continuous video frames can be obtained.
In one embodiment of the present application, the inputting the spatio-temporal features of the preset number of continuous video frames into the pre-trained classification network to obtain the behavior recognition result of the preset number of continuous video frames may include:
inputting the space-time characteristics of the preset number of continuous video frames into a full-connection layer of a classification network to obtain a plurality of characteristic values of the preset number of continuous video frames; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature values is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition results of the preset number of continuous video frames is;
inputting a plurality of characteristic values of the preset number of continuous video frames into a softmax layer of the classification network to obtain the confidence coefficient of each behavior category in the behavior recognition result of the preset number of continuous video frames.
In this embodiment, the classification network may include a fully connected layer and a softmax layer.
After the space-time characteristics of the preset number of continuous video frames are extracted, the space-time characteristics of the preset number of continuous video frames can be input into a full connection layer of a classification network; the full connection layer of the classification network may output a plurality of characteristic values of the preset number of consecutive video frames.
For example, assuming that the classification network supports to identify 10 behavior categories, after the spatiotemporal features of the preset number of continuous video frames are input to the full-connection layer of the classification network, the full-connection layer of the classification network outputs 10 feature values, where the 10 feature values are respectively in one-to-one correspondence with the 10 behavior categories that the classification network supports to identify, and the larger the feature value, the larger the probability of the behavior category corresponding to the feature value in the behavior identification result of the preset number of continuous video frames.
For example, assuming that 10 feature values output by the full connection layer of the classification network are respectively T1 to T10, the behavior categories corresponding to the 10 feature values are respectively L1 to L10, and T3 is the largest and T5 is the smallest in T1 to T10, the probability that the behavior corresponding to the preset number of continuous video frames is L3 is the highest and the probability that L5 is the smallest.
In this embodiment, for a plurality of feature values output by the full connection layer of the classification network, the softmax layer of the classification network may be further input to perform normalization processing to obtain confidence degrees of each behavior category in the behavior recognition result of the preset number of continuous video frames.
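For illustration only, the fully connected layer plus softmax described above can be sketched as follows (PyTorch is assumed; the feature dimension of 512 and the 10 behavior categories follow the example above and are not prescribed by the application):

```python
import torch
import torch.nn as nn

num_classes, feat_dim = 10, 512                 # 10 supported behavior categories (example)
fc = nn.Linear(feat_dim, num_classes)           # full-connection layer of the classification network

clip_feature = torch.randn(1, feat_dim)         # space-time feature of one group of frames
feature_values = fc(clip_feature)               # one feature value per behavior category (T1..T10)
confidences = torch.softmax(feature_values, 1)  # softmax layer: normalized confidences
predicted = confidences.argmax(dim=1)           # category with the highest confidence
print(confidences, predicted)
```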
Therefore, in the method flow shown in fig. 1, the apparent features are extracted by a 2D convolutional neural network and the time sequence features are extracted by a 1D convolutional neural network. By combining the 2D convolutional neural network and the 1D convolutional neural network, the space-time features can be extracted effectively while the model has fewer parameters and is easy to train, thereby improving the efficiency of behavior recognition.
Referring to fig. 2, in one embodiment of the present application, the cascade of the 2D convolutional neural network, the 1D convolutional neural network, and the classification network is trained by:
step S100a, inputting any training sample in the training set into a 2D convolutional neural network to obtain the apparent characteristics of the training sample.
In the embodiment of the application, before performing behavior recognition through the 2D convolutional neural network, the 1D convolutional neural network and the classification network which are cascaded, the 2D convolutional neural network, the 1D convolutional neural network and the classification network are required to be trained by using a training set comprising a certain number of training samples (which can be set according to actual scenes) until the networks converge, and then performing a behavior recognition task.
Accordingly, in this embodiment, for any training sample in the training set, the apparent characteristics of each video frame in the training sample may be extracted using a 2D convolutional neural network.
The training samples may be a preset number of consecutive video frames labeled with actual behavior.
Step S100b, inputting the apparent characteristics of the training sample into a 1D convolutional neural network to obtain the space-time characteristics of the training sample.
In this embodiment, after the apparent features of each video frame in the training sample are extracted, the temporal features between each video frame may also be extracted through a 1D convolutional neural network to obtain the temporal-spatial features of each video frame in the training sample.
Step S100c, inputting the space-time characteristics of the training sample into a classification network to obtain the behavior recognition result of the training sample.
In this embodiment, after the spatio-temporal features of each video frame in the training sample are extracted, the spatio-temporal features of each video frame in the training sample may be input into a classification network to perform behavior recognition, so as to obtain a behavior recognition result of the training sample.
In this embodiment, after behavior recognition is performed on each training sample in the training set in the manner described in steps S100a to S100c, the accuracy of the behavior recognition results on the training set, i.e., the ratio of the number of training samples whose behavior is correctly recognized to the total number of training samples in the training set, can be calculated.
For any training sample in the training set, when the behavior category with the highest confidence in the recognition result obtained through the cascaded network combination of the 2D convolutional neural network, the 1D convolutional neural network and the classification network matches the actual behavior pre-labeled for the training sample, the behavior recognition of the training sample is determined to be correct; otherwise, it is determined to be incorrect.
When the accuracy of the behavior recognition results of the training samples in the training set meets the requirement, the 2D convolutional neural network, the 1D convolutional neural network and the classification network can be used for the behavior recognition task.
Further, in this embodiment, in order to improve the recognition accuracy of the 2D convolutional network, the 1D convolutional network, and the classification network, after the step S100c, the method may further include:
and carrying out parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is smaller than a preset threshold (which can be called a first threshold).
Specifically, in this embodiment, after performing behavior recognition on each training sample in the training set by using a cascade of a 2D convolutional neural network, a 1D convolutional neural network, and a network combination of a classification network, the accuracy of the behavior recognition result of the training sample in the training set may be counted.
For example, assume the training set includes 100 training samples and that, after behavior recognition is performed in the manner described in steps S100a to S100c, the recognition results of 90 of the 100 training samples match the actual behaviors pre-labeled for those samples; the accuracy of the behavior recognition results on the training set is then 90% (90/100 × 100% = 90%).
In this embodiment, the training samples in the training set can be repeatedly input into the above cascaded network combination of the 2D convolutional neural network, the 1D convolutional neural network and the classification network, and parameter optimization can be performed on the network combination according to the fed-back accuracy of the behavior recognition results of the training samples in the training set. The increase in the accuracy of the behavior recognition results after the current parameter optimization relative to that before the current parameter optimization is then determined; if the increase is greater than or equal to a preset threshold, parameter optimization may continue; if the increase is smaller than the preset threshold, it is determined that training of the network combination is completed.
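A minimal sketch of such a training loop is given below for illustration; the application does not prescribe an optimizer, loss function or threshold value, so the SGD optimizer, cross-entropy loss and the numeric values used here are assumptions:

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, improvement_threshold=0.001, max_rounds=100):
    """Optimize the cascaded network; stop once the accuracy gain of one round
    over the previous round drops below the preset (first) threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    prev_acc = 0.0
    for _ in range(max_rounds):
        correct, total = 0, 0
        for clips, labels in loader:             # each clip: one group of N consecutive frames
            scores = model(clips)                # behavior-category scores per sample
            loss = criterion(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (scores.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        accuracy = correct / total               # accuracy on the training set
        if accuracy - prev_acc < improvement_threshold:
            break                                # increase in accuracy below the threshold
        prev_acc = accuracy
    return model
```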
In one example, the above-described parameter optimization of the network combination of the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network may include:
model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network are optimized.
Further, in this embodiment, video-based training data is currently scarce and its labeling cost is high. To avoid the problem that the model is difficult to train due to insufficient video training data, and to reduce the training complexity of the cascaded combined network of the 2D convolutional neural network, the 1D convolutional neural network and the classification network, the 2D convolutional neural network may be pre-trained before the combined network is trained, so as to better initialize the model parameters of the 2D convolutional neural network.
Accordingly, in one example, before training the concatenated 2D convolutional neural network, 1D convolutional neural network, and classification network, it may further include:
the 2D convolutional neural network is pre-trained based on ImageNet (picture classification dataset).
In this example, to better initialize the model parameters of the 2D convolutional neural network and improve the training efficiency of the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network, the 2D convolutional neural network may first be trained on ImageNet, before the cascaded networks are trained, until the increase in its image classification accuracy is smaller than a preset threshold (which may be referred to as a second threshold).
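For illustration, one common way to obtain ImageNet-initialized parameters in practice (an assumption; the application only states that the 2D convolutional neural network is pre-trained on ImageNet, not how) is to start from a publicly available backbone pretrained on ImageNet and drop its classification head, for example:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Assumes a recent torchvision; the specific backbone (ResNet-18) is an illustrative choice.
backbone = models.resnet18(weights="IMAGENET1K_V1")   # parameters pretrained on ImageNet
backbone.fc = nn.Identity()                           # drop the ImageNet classification head

frames = torch.randn(8, 3, 224, 224)                  # 8 consecutive RGB frames
with torch.no_grad():
    feats = backbone(frames)                          # (8, 512) apparent features per frame
print(feats.shape)
```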
In the embodiment of the present application, after training of the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network is completed in the manner described in steps S100a to S100c, the trained 2D convolutional neural network, 1D convolutional neural network and classification network can be used to perform behavior recognition according to the method flow shown in steps S100 to S120.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, the technical solutions provided by the embodiments of the present application are described below with reference to specific examples.
In this embodiment, the behavior recognition process may include, in order, apparent feature extraction, time sequence feature extraction and behavior recognition, which are described below.
1. Apparent feature extraction
In this embodiment, referring to fig. 3A, a 2D convolutional neural network may be used to perform apparent feature extraction on each frame of image, finally reducing the spatial dimensions of each frame to 1*1: the original size of a frame of image is Cin*H*W, and after apparent feature extraction by the 2D convolutional neural network it becomes Cout*1*1.
Using this apparent feature extraction method, apparent feature extraction can be performed on N consecutive video frames (N is a positive integer greater than 1); the input to the 2D convolutional neural network is N*Cin*H*W, and the output is N*Cout*1*1.
Where N represents the number of frames of the input image, cin represents the number of channels per frame of the input image (e.g., the number of channels of the RGB image is 3), and Cout is the number of channels of the extracted apparent feature (which may be set according to practical requirements, e.g., 512 or 1024, etc.).
It should be noted that the more channels the extracted apparent features have, the greater the amount of computation in the subsequent flow and the higher the recognition accuracy may be; therefore, when setting the number of channels of the extracted apparent features, recognition accuracy and the amount of computation can be balanced against each other, and the specific implementation is not described here.
2. Sequential feature extraction
1D convolution is performed on the apparent features of each frame of the N consecutive video frames obtained in step 1 to extract the time sequence features, finally obtaining the space-time features of the N consecutive video frames; a schematic diagram is shown in fig. 3B.
To extract the time sequence features, the N*Cin*1*1 features output in step 1 (where Cin here equals the Cout of step 1) are reordered (reshaped) into Cin*N, and then the time sequence features are extracted using a 1D convolutional neural network; the space-time features of the N consecutive video frames are finally expressed as Cout*1.
Where N represents the number of input frames, Cin represents the number of channels per frame after apparent feature extraction (e.g., 512 or 1024, etc.), and Cout is the number of channels of the finally extracted space-time features, which can be set according to actual requirements (e.g., 512 or 1024, etc.).
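The dimension flow of this step can be illustrated with the following minimal sketch (PyTorch is assumed; a single 1D convolution whose kernel spans all N frames is used here only as an example, not as the concrete network of the application):

```python
import torch
import torch.nn as nn

N, Cin, Cout = 8, 512, 512
per_frame = torch.randn(N, Cin, 1, 1)                # step-1 output: N*Cin*1*1 apparent features
seq = per_frame.view(N, Cin).t().unsqueeze(0)        # reorder (reshape) to (1, Cin, N): channels x time
temporal_conv = nn.Conv1d(Cin, Cout, kernel_size=N)  # 1D convolution over the temporal axis
spatiotemporal = temporal_conv(seq)                  # (1, Cout, 1): space-time feature of the clip
print(spatiotemporal.shape)                          # torch.Size([1, 512, 1])
```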
3. Behavior recognition
Behavior recognition is performed on the extracted space-time features of the N consecutive video frames through the classification network.
The extracted space-time features can be input into a full-connection layer to obtain a plurality of feature values of the video frames, and these feature values are input into a softmax layer for normalization, so as to obtain the confidence of each behavior category in the behavior recognition result of the video frames.
The complete flow of behavior recognition for consecutive N frames of video frames can be seen in fig. 3C.
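Putting the three stages together, an end-to-end sketch of the cascade could look as follows (all module definitions, layer sizes and names are illustrative assumptions rather than the networks disclosed in the application):

```python
import torch
import torch.nn as nn

class BehaviorRecognizer(nn.Module):
    """Hypothetical cascade: 2D CNN -> 1D CNN -> classification network."""
    def __init__(self, in_channels=3, feat_channels=512, num_classes=10, num_frames=8):
        super().__init__()
        self.cnn2d = nn.Sequential(                       # apparent (spatial) features
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.cnn1d = nn.Conv1d(feat_channels, feat_channels, kernel_size=num_frames)
        self.fc = nn.Linear(feat_channels, num_classes)   # classification network

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, Cin, H, W), one group of N consecutive frames
        n = clip.shape[0]
        feats = self.cnn2d(clip).view(n, -1)              # (N, Cout): apparent features
        seq = feats.t().unsqueeze(0)                      # (1, Cout, N): channels x time
        st = self.cnn1d(seq).squeeze(-1)                  # (1, Cout): space-time feature
        return torch.softmax(self.fc(st), dim=1)          # (1, num_classes) confidences

conf = BehaviorRecognizer()(torch.randn(8, 3, 224, 224))
print(conf.shape, conf.argmax(dim=1))                     # predicted behavior category
```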
In the embodiment of the application, a preset number of continuous video frames are input into the pre-trained 2D convolutional neural network to obtain the apparent features of the preset number of continuous video frames, and these apparent features are input into the pre-trained 1D convolutional neural network to obtain the space-time features of the preset number of continuous video frames; the space-time features are then input into the pre-trained classification network to obtain the behavior recognition result of the preset number of continuous video frames. Because the 2D convolutional neural network and the 1D convolutional neural network are combined, the space-time features can be extracted effectively while the model has few parameters and is easy to train, so the efficiency of behavior recognition can be improved.
The method provided by the application is described above. The device provided by the application is described below:
referring to fig. 4, a schematic structural diagram of a behavior recognition device according to an embodiment of the present application is shown in fig. 4, where the behavior recognition device may include:
an apparent feature extraction unit 410, configured to input a preset number of continuous video frames into a pre-trained 2D convolutional neural network, so as to obtain apparent features of the video frames;
a time sequence feature extraction unit 420, configured to input the apparent features of the video frame into a pre-trained 1D convolutional neural network, so as to obtain the space-time features of the video frame;
the behavior recognition unit 430 is configured to input the spatiotemporal features of the video frame into a pre-trained classification network to obtain a behavior recognition result of the video frame.
In an alternative embodiment, the apparent feature extraction unit 410 is further configured to input, for any training sample in the training set, the training sample into the 2D convolutional neural network, so as to obtain an apparent feature of the training sample;
the time sequence feature extraction unit 420 is further configured to input the apparent feature of the training sample into the 1D convolutional neural network, so as to obtain a space-time feature of the training sample;
the behavior recognition unit 430 is further configured to input the spatiotemporal feature of the training sample into the classification network to obtain a behavior recognition result of the training sample.
In an alternative embodiment, as shown in fig. 5, the apparatus further comprises:
and the parameter optimization unit 440 is configured to perform parameter optimization on the cascaded network combinations of the 2D convolutional neural network, the 1D convolutional neural network, and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is less than a preset threshold.
In an alternative embodiment, the parameter optimization unit 440 is specifically configured to optimize model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network.
In an alternative embodiment, as shown in fig. 6, the apparatus further comprises:
a pre-training unit 450, configured to pre-train the 2D convolutional neural network based on ImageNet before training the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network.
In an alternative embodiment, the behavior recognition unit 430 is specifically configured to input the spatio-temporal features of the video frame into the fully connected layer of the classification network, so as to obtain a plurality of feature values of the video frame; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is; and inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain the confidence of each behavior category in the behavior recognition result of the video frame.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device may include a processor 701, a communication interface 702, a memory 703, and a communication bus 704. The processor 701, the communication interface 702, and the memory 703 perform communication with each other via the communication bus 704. Wherein the memory 703 has stored thereon a computer program; the processor 701 can execute the behavior recognition method described above by executing the program stored on the memory 703.
The memory 703 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, and the like. For example, the memory 703 may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof.
Embodiments of the present application also provide a machine-readable storage medium, such as memory 703 in fig. 7, storing a computer program executable by processor 701 in the electronic device shown in fig. 7 to implement the behavior recognition method described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the application.

Claims (10)

1. A method of behavior recognition, comprising:
inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent characteristics of the video frames;
inputting the apparent characteristics of the video frames into a pre-trained 1D convolutional neural network to obtain the space-time characteristics of the video frames;
inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain behavior recognition results of the video frames;
the step of inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain the behavior recognition result of the video frames comprises the following steps:
inputting the space-time characteristics of the video frames into a full connection layer of the classification network to obtain a plurality of characteristic values of the video frames; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is;
inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain confidence degrees of all behavior categories in a behavior recognition result of the video frame;
the 2D convolutional neural network, the 1D convolutional neural network, and the classification network are cascaded.
2. The method of claim 1, wherein the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network are trained by:
inputting any training sample in the training set into the 2D convolutional neural network to obtain the apparent characteristics of the training sample;
inputting the apparent characteristics of the training sample into the 1D convolutional neural network to obtain the space-time characteristics of the training sample;
and inputting the space-time characteristics of the training sample into the classification network to obtain the behavior recognition result of the training sample.
3. The method of claim 2, wherein after said inputting the spatiotemporal features of the training sample into the classification network, further comprising:
and according to the accuracy of the behavior recognition result of the training sample in the training set, carrying out parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network until the accuracy increase amplitude of the behavior recognition result of the training sample in the training set is smaller than a preset threshold value.
4. A method according to claim 3, wherein said parameter optimizing a network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network, and the classification network comprises:
model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network are optimized.
5. The method of claim 2, wherein prior to training the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network, further comprising:
the 2D convolutional neural network is pre-trained based on a picture classification dataset ImageNet.
6. A behavior recognition apparatus, comprising:
the apparent feature extraction unit is used for inputting a preset number of continuous video frames into the pre-trained 2D convolutional neural network so as to obtain apparent features of the video frames;
the time sequence feature extraction unit is used for inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network so as to obtain the space-time features of the video frames;
the behavior recognition unit is used for inputting the space-time characteristics of the video frames into a pre-trained classification network so as to obtain behavior recognition results of the video frames;
the behavior recognition unit is specifically configured to input the spatio-temporal features of the video frame into a full-connection layer of the classification network, so as to obtain a plurality of feature values of the video frame; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is; inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain confidence degrees of all behavior categories in a behavior recognition result of the video frame;
the 2D convolutional neural network, the 1D convolutional neural network, and the classification network are cascaded.
7. The apparatus of claim 6, wherein
the apparent feature extraction unit is further used for inputting any training sample in the training set into the 2D convolutional neural network so as to obtain the apparent feature of the training sample;
the time sequence feature extraction unit is further used for inputting the apparent features of the training sample into the 1D convolutional neural network so as to obtain the space-time features of the training sample;
the behavior recognition unit is further used for inputting the space-time characteristics of the training sample into the classification network so as to obtain a behavior recognition result of the training sample.
8. The apparatus of claim 7, wherein the apparatus further comprises:
and the parameter optimization unit is used for performing parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is smaller than a preset threshold value.
9. The apparatus of claim 8, wherein
the parameter optimization unit is specifically configured to optimize model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network.
10. The apparatus of claim 7, wherein the apparatus further comprises:
and the pre-training unit is used for pre-training the 2D convolutional neural network based on a picture classification data set ImageNet before training the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network.
CN201811510291.3A 2018-12-11 2018-12-11 Behavior recognition method and device Active CN111310516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811510291.3A CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811510291.3A CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Publications (2)

Publication Number Publication Date
CN111310516A CN111310516A (en) 2020-06-19
CN111310516B true CN111310516B (en) 2023-08-29

Family

ID=71159620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811510291.3A Active CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Country Status (1)

Country Link
CN (1) CN111310516B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177450A (en) * 2021-04-20 2021-07-27 北京有竹居网络技术有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN115346143A (en) * 2021-04-27 2022-11-15 中兴通讯股份有限公司 Behavior detection method, electronic device, and computer-readable medium
CN113688729B (en) * 2021-08-24 2023-04-07 上海商汤科技开发有限公司 Behavior recognition method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014049118A (en) * 2012-08-31 2014-03-17 Fujitsu Ltd Convolution neural network classifier system, training method for the same, classifying method, and usage
CN106845381A (en) * 2017-01-16 2017-06-13 西北工业大学 Sky based on binary channels convolutional neural networks composes united hyperspectral image classification method
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
EP3291146A1 (en) * 2016-09-05 2018-03-07 Fujitsu Limited Knowledge extraction from a convolutional neural network
CN108229240A (en) * 2016-12-09 2018-06-29 杭州海康威视数字技术股份有限公司 A kind of method and device of determining picture quality
CN108460342A (en) * 2018-02-05 2018-08-28 西安电子科技大学 Hyperspectral image classification method based on convolution net and Recognition with Recurrent Neural Network
WO2018157862A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Vehicle type recognition method and device, storage medium and electronic device
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060215929A1 (en) * 2005-03-23 2006-09-28 David Fresneau Methods and apparatus for image convolution
US10586153B2 (en) * 2016-06-16 2020-03-10 Qatar University Method and apparatus for performing motor-fault detection via convolutional neural networks
US10963506B2 (en) * 2016-11-15 2021-03-30 Evolv Technology Solutions, Inc. Data object creation and recommendation using machine learning based offline evolution
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium
US11379688B2 (en) * 2017-03-16 2022-07-05 Packsize Llc Systems and methods for keypoint detection with convolutional neural networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014049118A (en) * 2012-08-31 2014-03-17 Fujitsu Ltd Convolution neural network classifier system, training method for the same, classifying method, and usage
EP3291146A1 (en) * 2016-09-05 2018-03-07 Fujitsu Limited Knowledge extraction from a convolutional neural network
CN108229240A (en) * 2016-12-09 2018-06-29 杭州海康威视数字技术股份有限公司 A kind of method and device of determining picture quality
CN106845381A (en) * 2017-01-16 2017-06-13 西北工业大学 Sky based on binary channels convolutional neural networks composes united hyperspectral image classification method
WO2018157862A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Vehicle type recognition method and device, storage medium and electronic device
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108460342A (en) * 2018-02-05 2018-08-28 西安电子科技大学 Hyperspectral image classification method based on convolution net and Recognition with Recurrent Neural Network
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks

Also Published As

Publication number Publication date
CN111310516A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
US11182620B2 (en) Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
CN110032926B (en) Video classification method and device based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN111768432A (en) Moving target segmentation method and system based on twin deep neural network
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN111310516B (en) Behavior recognition method and device
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN111178120B (en) Pest image detection method based on crop identification cascading technology
US20180137630A1 (en) Image processing apparatus and method
CN108805151B (en) Image classification method based on depth similarity network
CN114463218B (en) Video deblurring method based on event data driving
CN111027347A (en) Video identification method and device and computer equipment
CN110827265A (en) Image anomaly detection method based on deep learning
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN114494981A (en) Action video classification method and system based on multi-level motion modeling
CN114663769B (en) Fruit identification method based on YOLO v5
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN111242176A (en) Computer vision task processing method and device and electronic system
CN112528077B (en) Video face retrieval method and system based on video embedding
CN112651267A (en) Recognition method, model training, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant