CN111310516B - Behavior recognition method and device - Google Patents

Behavior recognition method and device

Info

Publication number
CN111310516B
CN111310516B CN201811510291.3A
Authority
CN
China
Prior art keywords
convolutional neural
neural network
network
behavior recognition
video frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811510291.3A
Other languages
Chinese (zh)
Other versions
CN111310516A (en)
Inventor
卜英家
谭文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811510291.3A priority Critical patent/CN111310516B/en
Publication of CN111310516A publication Critical patent/CN111310516A/en
Application granted granted Critical
Publication of CN111310516B publication Critical patent/CN111310516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application provides a behavior recognition method and a behavior recognition device, wherein the method comprises the following steps: inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent features of the video frames; inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network to obtain space-time features of the video frames; and inputting the space-time features of the video frames into a pre-trained classification network to obtain a behavior recognition result for the video frames. The method can improve the efficiency of behavior recognition.

Description

Behavior recognition method and device
Technical Field
The present application relates to the field of computer vision, and in particular, to a behavior recognition method and apparatus.
Background
Computer vision is the simulation of biological vision using computers and related equipment. Its main task is to process acquired pictures or videos to obtain three-dimensional information about the corresponding scene. Behavior recognition is a popular research direction in the field of computer vision.
Current behavior recognition schemes mainly feed continuous video frames (typically RGB (Red, Green, Blue) images) into a 3D convolutional neural network, extract the space-time features of the video frames based on 3D convolution, and then feed the extracted features into a classifier for behavior recognition.
However, in practice this scheme involves a large amount of computation, and the 3D convolutional neural network model has many parameters and is difficult to train.
Disclosure of Invention
In view of the above, the present application provides a behavior recognition method and apparatus.
Specifically, the application is realized by the following technical scheme:
according to a first aspect of an embodiment of the present application, there is provided a behavior recognition method, including:
inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent characteristics of the video frames;
inputting the apparent characteristics of the video frames into a pre-trained 1D convolutional neural network to obtain the space-time characteristics of the video frames;
and inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain the behavior recognition result of the video frames.
According to a second aspect of an embodiment of the present application, there is provided a behavior recognition apparatus including:
the apparent feature extraction unit is used for inputting a preset number of continuous video frames into the pre-trained 2D convolutional neural network so as to obtain apparent features of the video frames;
the time sequence feature extraction unit is used for inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network so as to obtain the space-time features of the video frames;
and the behavior recognition unit is used for inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain behavior recognition results of the video frames.
According to a third aspect of an embodiment of the present application, there is provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for implementing the above behavior recognition method when executing the program stored in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a machine-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above-described behavior recognition method.
According to the behavior recognition method provided by the application, a preset number of continuous video frames are input into a pre-trained 2D convolutional neural network to obtain the apparent features of the video frames; the apparent features are input into a pre-trained 1D convolutional neural network to obtain the space-time features of the video frames; and the space-time features are then input into a pre-trained classification network to obtain the behavior recognition result of the video frames. Because the 2D convolutional neural network and the 1D convolutional neural network are combined, the space-time features can be extracted effectively while the model has few parameters and is easy to train, so the efficiency of behavior recognition can be improved.
Drawings
FIG. 1 is a flow chart of a behavior recognition method according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method of training a cascaded convolutional neural network in accordance with an exemplary embodiment of the present application;
FIG. 3A is a schematic diagram of an apparent feature extraction shown in accordance with an exemplary embodiment of the present application;
FIG. 3B is a schematic diagram of a timing feature extraction according to an exemplary embodiment of the present application;
FIG. 3C is a diagram illustrating a behavior recognition according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a behavior recognition apparatus according to an exemplary embodiment of the present application;
fig. 5 is a schematic structural view of a behavior recognition apparatus according to still another exemplary embodiment of the present application;
fig. 6 is a schematic structural view of a behavior recognition apparatus according to still another exemplary embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a behavior recognition method provided by an embodiment of the present application is shown, wherein the behavior recognition method may be applied to a background server for video monitoring, and as shown in fig. 1, the method may include the following steps:
step S100, inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent features of the preset number of continuous video frames.
In the embodiment of the application, when behavior recognition needs to be performed on a video, a preset number (which can be set according to actual requirements, e.g., 8 frames, 16 frames, etc.) of continuous video frames of the video can be input into the pre-trained 2D convolutional neural network to obtain the apparent features (i.e., spatial features) of the preset number of continuous video frames.
For one frame of video image, the spatial dimension of the apparent features extracted by the pre-trained 2D convolutional neural network is 1: after apparent feature extraction is performed on a frame of video image through the 2D convolutional neural network, a C-dimensional vector is obtained, which can be expressed as C*1*1 (i.e., the width and height are both 1), where C is the number of channels of the extracted apparent features.
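As an illustrative sketch only (the application does not disclose a concrete layer configuration; the PyTorch framework, class name and layer sizes below are assumptions made for this example), the following code shows how a 2D convolutional backbone followed by global average pooling can map each input frame of size Cin*H*W to a C*1*1 apparent feature vector:

```python
import torch
import torch.nn as nn

class ApparentFeature2D(nn.Module):
    """Hypothetical per-frame apparent (spatial) feature extractor."""
    def __init__(self, in_channels: int = 3, feat_channels: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapses the spatial dimensions to 1*1

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, Cin, H, W) -> apparent features: (N, C, 1, 1)
        return self.pool(self.backbone(frames))

# e.g. 8 consecutive RGB frames of size 224x224 -> 8 apparent feature vectors of 512*1*1
feats = ApparentFeature2D()(torch.randn(8, 3, 224, 224))
print(feats.shape)  # torch.Size([8, 512, 1, 1])
```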
Step S110, inputting the apparent features of the preset number of continuous video frames into a pre-trained 1D convolutional neural network to obtain the space-time features of the preset number of continuous video frames.
In the embodiment of the application, after the apparent features of the preset number of continuous video frames are extracted, in order to realize behavior recognition, the time sequence features of the preset number of continuous video frames are also required to be extracted.
In the embodiment of the application, in order to improve the extraction efficiency of the time sequence features, the time sequence features can be extracted through a 1D convolutional neural network.
Accordingly, after the apparent features of the preset number of continuous video frames are extracted, these apparent features can be input into the pre-trained 1D convolutional neural network to extract the time sequence features among them, thereby obtaining the space-time features of the preset number of continuous video frames.
Step S120, inputting the space-time characteristics of the preset number of continuous video frames into a pre-trained classification network to obtain the behavior recognition results corresponding to the preset number of continuous video frames.
In the embodiment of the application, after the space-time characteristics of the preset number of continuous video frames are extracted, the space-time characteristics of the preset number of continuous video frames can be input into a pre-trained classification network for behavior recognition, so that the behavior recognition result of the preset number of continuous video frames can be obtained.
In one embodiment of the present application, the inputting the spatio-temporal features of the preset number of continuous video frames into the pre-trained classification network to obtain the behavior recognition result of the preset number of continuous video frames may include:
inputting the space-time characteristics of the preset number of continuous video frames into a full-connection layer of a classification network to obtain a plurality of characteristic values of the preset number of continuous video frames; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature values is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition results of the preset number of continuous video frames is;
inputting a plurality of characteristic values of the preset number of continuous video frames into a softmax layer of the classification network to obtain the confidence coefficient of each behavior category in the behavior recognition result of the preset number of continuous video frames.
In this embodiment, the classification network may include a fully connected layer and a softmax layer.
After the space-time characteristics of the preset number of continuous video frames are extracted, the space-time characteristics of the preset number of continuous video frames can be input into a full connection layer of a classification network; the full connection layer of the classification network may output a plurality of characteristic values of the preset number of consecutive video frames.
For example, assuming that the classification network supports to identify 10 behavior categories, after the spatiotemporal features of the preset number of continuous video frames are input to the full-connection layer of the classification network, the full-connection layer of the classification network outputs 10 feature values, where the 10 feature values are respectively in one-to-one correspondence with the 10 behavior categories that the classification network supports to identify, and the larger the feature value, the larger the probability of the behavior category corresponding to the feature value in the behavior identification result of the preset number of continuous video frames.
For example, assuming that 10 feature values output by the full connection layer of the classification network are respectively T1 to T10, the behavior categories corresponding to the 10 feature values are respectively L1 to L10, and T3 is the largest and T5 is the smallest in T1 to T10, the probability that the behavior corresponding to the preset number of continuous video frames is L3 is the highest and the probability that L5 is the smallest.
In this embodiment, for a plurality of feature values output by the full connection layer of the classification network, the softmax layer of the classification network may be further input to perform normalization processing to obtain confidence degrees of each behavior category in the behavior recognition result of the preset number of continuous video frames.
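For illustration only, the fully connected layer plus softmax described above can be sketched as follows (PyTorch is assumed; the feature dimension of 512 and the 10 behavior categories follow the example above and are not prescribed by the application):

```python
import torch
import torch.nn as nn

num_classes, feat_dim = 10, 512                 # 10 supported behavior categories (example)
fc = nn.Linear(feat_dim, num_classes)           # full-connection layer of the classification network

clip_feature = torch.randn(1, feat_dim)         # space-time feature of one group of frames
feature_values = fc(clip_feature)               # one feature value per behavior category (T1..T10)
confidences = torch.softmax(feature_values, 1)  # softmax layer: normalized confidences
predicted = confidences.argmax(dim=1)           # category with the highest confidence
print(confidences, predicted)
```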
Therefore, in the method flow shown in fig. 1, the apparent features are extracted by a 2D convolutional neural network and the time sequence features are extracted by a 1D convolutional neural network. By combining the 2D convolutional neural network and the 1D convolutional neural network, the space-time features can be extracted effectively while the model has fewer parameters and is easy to train, thereby improving the efficiency of behavior recognition.
Referring to fig. 2, in one embodiment of the present application, the cascade of the 2D convolutional neural network, the 1D convolutional neural network, and the classification network is trained by:
step S100a, inputting any training sample in the training set into a 2D convolutional neural network to obtain the apparent characteristics of the training sample.
In the embodiment of the application, before performing behavior recognition through the 2D convolutional neural network, the 1D convolutional neural network and the classification network which are cascaded, the 2D convolutional neural network, the 1D convolutional neural network and the classification network are required to be trained by using a training set comprising a certain number of training samples (which can be set according to actual scenes) until the networks converge, and then performing a behavior recognition task.
Accordingly, in this embodiment, for any training sample in the training set, the apparent characteristics of each video frame in the training sample may be extracted using a 2D convolutional neural network.
The training samples may be a preset number of consecutive video frames labeled with actual behavior.
Step S100b, inputting the apparent characteristics of the training sample into a 1D convolutional neural network to obtain the space-time characteristics of the training sample.
In this embodiment, after the apparent features of each video frame in the training sample are extracted, the temporal features between each video frame may also be extracted through a 1D convolutional neural network to obtain the temporal-spatial features of each video frame in the training sample.
Step S100c, inputting the space-time characteristics of the training sample into a classification network to obtain the behavior recognition result of the training sample.
In this embodiment, after the spatio-temporal features of each video frame in the training sample are extracted, the spatio-temporal features of each video frame in the training sample may be input into a classification network to perform behavior recognition, so as to obtain a behavior recognition result of the training sample.
In this embodiment, after behavior recognition is performed on each training sample in the training set in the manner described in steps S100a to S100c, the accuracy of the behavior recognition results on the training set, i.e., the ratio of the number of training samples whose behavior is correctly recognized to the total number of training samples in the training set, can be calculated.
For any training sample in the training set, when the behavior category with the highest confidence in the recognition result obtained through the cascaded network combination of the 2D convolutional neural network, the 1D convolutional neural network and the classification network matches the actual behavior pre-labeled for the training sample, the behavior recognition of the training sample is determined to be correct; otherwise, it is determined to be incorrect.
When the accuracy of the behavior recognition results of the training samples in the training set meets the requirement, the 2D convolutional neural network, the 1D convolutional neural network and the classification network can be used for the behavior recognition task.
Further, in this embodiment, in order to improve the recognition accuracy of the 2D convolutional network, the 1D convolutional network, and the classification network, after the step S100c, the method may further include:
and carrying out parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is smaller than a preset threshold (which can be called a first threshold).
Specifically, in this embodiment, after performing behavior recognition on each training sample in the training set by using a cascade of a 2D convolutional neural network, a 1D convolutional neural network, and a network combination of a classification network, the accuracy of the behavior recognition result of the training sample in the training set may be counted.
For example, assume the training set includes 100 training samples and that, after behavior recognition is performed in the manner described in steps S100a to S100c, the recognition results of 90 of the 100 training samples match the actual behaviors pre-labeled for those samples; the accuracy of the behavior recognition results on the training set is then 90% (90/100 × 100% = 90%).
In this embodiment, the training samples in the training set can be repeatedly input into the above cascaded network combination of the 2D convolutional neural network, the 1D convolutional neural network and the classification network, and parameter optimization can be performed on the network combination according to the fed-back accuracy of the behavior recognition results of the training samples in the training set. The increase in the accuracy of the behavior recognition results after the current parameter optimization relative to that before the current parameter optimization is then determined; if the increase is greater than or equal to a preset threshold, parameter optimization may continue; if the increase is smaller than the preset threshold, it is determined that training of the network combination is completed.
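A minimal sketch of such a training loop is given below for illustration; the application does not prescribe an optimizer, loss function or threshold value, so the SGD optimizer, cross-entropy loss and the numeric values used here are assumptions:

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, improvement_threshold=0.001, max_rounds=100):
    """Optimize the cascaded network; stop once the accuracy gain of one round
    over the previous round drops below the preset (first) threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    prev_acc = 0.0
    for _ in range(max_rounds):
        correct, total = 0, 0
        for clips, labels in loader:             # each clip: one group of N consecutive frames
            scores = model(clips)                # behavior-category scores per sample
            loss = criterion(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (scores.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        accuracy = correct / total               # accuracy on the training set
        if accuracy - prev_acc < improvement_threshold:
            break                                # increase in accuracy below the threshold
        prev_acc = accuracy
    return model
```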
In one example, the above-described parameter optimization of the network combination of the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network may include:
model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network are optimized.
Further, in this embodiment, video-based training data is currently scarce and its labeling cost is high. To avoid the problem that the model is difficult to train due to insufficient video training data, and to reduce the training complexity of the cascaded combined network of the 2D convolutional neural network, the 1D convolutional neural network and the classification network, the 2D convolutional neural network may be pre-trained before the combined network is trained, so as to better initialize the model parameters of the 2D convolutional neural network.
Accordingly, in one example, before training the concatenated 2D convolutional neural network, 1D convolutional neural network, and classification network, it may further include:
the 2D convolutional neural network is pre-trained based on ImageNet (picture classification dataset).
In this example, to better initialize the model parameters of the 2D convolutional neural network and improve the training efficiency of the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network, the 2D convolutional neural network may first be trained on ImageNet, before the cascaded networks are trained, until the increase in its image classification accuracy is smaller than a preset threshold (which may be referred to as a second threshold).
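For illustration, one common way to obtain ImageNet-initialized parameters in practice (an assumption; the application only states that the 2D convolutional neural network is pre-trained on ImageNet, not how) is to start from a publicly available backbone pretrained on ImageNet and drop its classification head, for example:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Assumes a recent torchvision; the specific backbone (ResNet-18) is an illustrative choice.
backbone = models.resnet18(weights="IMAGENET1K_V1")   # parameters pretrained on ImageNet
backbone.fc = nn.Identity()                           # drop the ImageNet classification head

frames = torch.randn(8, 3, 224, 224)                  # 8 consecutive RGB frames
with torch.no_grad():
    feats = backbone(frames)                          # (8, 512) apparent features per frame
print(feats.shape)
```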
In the embodiment of the present application, after training of the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network is completed in the manner described in steps S100a to S100c, the trained 2D convolutional neural network, 1D convolutional neural network and classification network can be used to perform behavior recognition according to the method flow shown in steps S100 to S120.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, the technical solutions provided by the embodiments of the present application are described below with reference to specific examples.
In this embodiment, the behavior recognition process may include, in order, apparent feature extraction, time sequence feature extraction and behavior recognition, which are described below.
1. Apparent feature extraction
In this embodiment, referring to fig. 3A, a 2D convolutional neural network may be used to perform apparent feature extraction on each frame of image, finally reducing the spatial dimensions of each frame to 1*1: the original size of a frame of image is Cin*H*W, and after apparent feature extraction by the 2D convolutional neural network it becomes Cout*1*1.
Using this apparent feature extraction method, apparent feature extraction can be performed on N consecutive video frames (N is a positive integer greater than 1); the input to the 2D convolutional neural network is N*Cin*H*W, and the output is N*Cout*1*1.
Where N represents the number of frames of the input image, cin represents the number of channels per frame of the input image (e.g., the number of channels of the RGB image is 3), and Cout is the number of channels of the extracted apparent feature (which may be set according to practical requirements, e.g., 512 or 1024, etc.).
It should be noted that the more channels the extracted apparent features have, the greater the amount of computation in the subsequent flow and the higher the recognition accuracy may be; therefore, when setting the number of channels of the extracted apparent features, recognition accuracy and the amount of computation can be balanced against each other, and the specific implementation is not described here.
2. Sequential feature extraction
1D convolution is performed on the apparent features of each frame of the N consecutive video frames obtained in step 1 to extract the time sequence features, finally obtaining the space-time features of the N consecutive video frames; a schematic diagram is shown in fig. 3B.
To extract the time sequence features, the N*Cin*1*1 features output in step 1 (where Cin here equals the Cout of step 1) are reordered (reshaped) into Cin*N, and then the time sequence features are extracted using a 1D convolutional neural network; the space-time features of the N consecutive video frames are finally expressed as Cout*1.
Where N represents the number of input frames, Cin represents the number of channels per frame after apparent feature extraction (e.g., 512 or 1024, etc.), and Cout is the number of channels of the finally extracted space-time features, which can be set according to actual requirements (e.g., 512 or 1024, etc.).
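The dimension flow of this step can be illustrated with the following minimal sketch (PyTorch is assumed; a single 1D convolution whose kernel spans all N frames is used here only as an example, not as the concrete network of the application):

```python
import torch
import torch.nn as nn

N, Cin, Cout = 8, 512, 512
per_frame = torch.randn(N, Cin, 1, 1)                # step-1 output: N*Cin*1*1 apparent features
seq = per_frame.view(N, Cin).t().unsqueeze(0)        # reorder (reshape) to (1, Cin, N): channels x time
temporal_conv = nn.Conv1d(Cin, Cout, kernel_size=N)  # 1D convolution over the temporal axis
spatiotemporal = temporal_conv(seq)                  # (1, Cout, 1): space-time feature of the clip
print(spatiotemporal.shape)                          # torch.Size([1, 512, 1])
```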
3. Behavior recognition
Behavior recognition is performed on the extracted space-time features of the N consecutive video frames through the classification network.
The extracted space-time features can be input into a full-connection layer to obtain a plurality of feature values of the video frames, and these feature values are input into a softmax layer for normalization, so as to obtain the confidence of each behavior category in the behavior recognition result of the video frames.
The complete flow of behavior recognition for consecutive N frames of video frames can be seen in fig. 3C.
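Putting the three stages together, an end-to-end sketch of the cascade could look as follows (all module definitions, layer sizes and names are illustrative assumptions rather than the networks disclosed in the application):

```python
import torch
import torch.nn as nn

class BehaviorRecognizer(nn.Module):
    """Hypothetical cascade: 2D CNN -> 1D CNN -> classification network."""
    def __init__(self, in_channels=3, feat_channels=512, num_classes=10, num_frames=8):
        super().__init__()
        self.cnn2d = nn.Sequential(                       # apparent (spatial) features
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.cnn1d = nn.Conv1d(feat_channels, feat_channels, kernel_size=num_frames)
        self.fc = nn.Linear(feat_channels, num_classes)   # classification network

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, Cin, H, W), one group of N consecutive frames
        n = clip.shape[0]
        feats = self.cnn2d(clip).view(n, -1)              # (N, Cout): apparent features
        seq = feats.t().unsqueeze(0)                      # (1, Cout, N): channels x time
        st = self.cnn1d(seq).squeeze(-1)                  # (1, Cout): space-time feature
        return torch.softmax(self.fc(st), dim=1)          # (1, num_classes) confidences

conf = BehaviorRecognizer()(torch.randn(8, 3, 224, 224))
print(conf.shape, conf.argmax(dim=1))                     # predicted behavior category
```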
In the embodiment of the application, a preset number of continuous video frames are input into the pre-trained 2D convolutional neural network to obtain the apparent features of the preset number of continuous video frames, and these apparent features are input into the pre-trained 1D convolutional neural network to obtain the space-time features of the preset number of continuous video frames; the space-time features are then input into the pre-trained classification network to obtain the behavior recognition result of the preset number of continuous video frames. Because the 2D convolutional neural network and the 1D convolutional neural network are combined, the space-time features can be extracted effectively while the model has few parameters and is easy to train, so the efficiency of behavior recognition can be improved.
The method provided by the application is described above. The device provided by the application is described below:
referring to fig. 4, a schematic structural diagram of a behavior recognition device according to an embodiment of the present application is shown in fig. 4, where the behavior recognition device may include:
an apparent feature extraction unit 410, configured to input a preset number of continuous video frames into a pre-trained 2D convolutional neural network, so as to obtain apparent features of the video frames;
a time sequence feature extraction unit 420, configured to input the apparent features of the video frame into a pre-trained 1D convolutional neural network, so as to obtain the space-time features of the video frame;
the behavior recognition unit 430 is configured to input the spatiotemporal features of the video frame into a pre-trained classification network to obtain a behavior recognition result of the video frame.
In an alternative embodiment, the apparent feature extraction unit 410 is further configured to input, for any training sample in the training set, the training sample into the 2D convolutional neural network, so as to obtain an apparent feature of the training sample;
the time sequence feature extraction unit 420 is further configured to input the apparent feature of the training sample into the 1D convolutional neural network, so as to obtain a space-time feature of the training sample;
the behavior recognition unit 430 is further configured to input the spatiotemporal feature of the training sample into the classification network to obtain a behavior recognition result of the training sample.
In an alternative embodiment, as shown in fig. 5, the apparatus further comprises:
and the parameter optimization unit 440 is configured to perform parameter optimization on the cascaded network combinations of the 2D convolutional neural network, the 1D convolutional neural network, and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is less than a preset threshold.
In an alternative embodiment, the parameter optimization unit 440 is specifically configured to optimize model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network.
In an alternative embodiment, as shown in fig. 6, the apparatus further comprises:
a pre-training unit 450, configured to pre-train the 2D convolutional neural network based on ImageNet before training the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network.
In an alternative embodiment, the behavior recognition unit 430 is specifically configured to input the spatio-temporal features of the video frame into the fully connected layer of the classification network, so as to obtain a plurality of feature values of the video frame; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is; and inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain the confidence of each behavior category in the behavior recognition result of the video frame.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device may include a processor 701, a communication interface 702, a memory 703, and a communication bus 704. The processor 701, the communication interface 702, and the memory 703 perform communication with each other via the communication bus 704. Wherein the memory 703 has stored thereon a computer program; the processor 701 can execute the behavior recognition method described above by executing the program stored on the memory 703.
The memory 703 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, and the like. For example, the memory 703 may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof.
Embodiments of the present application also provide a machine-readable storage medium, such as memory 703 in fig. 7, storing a computer program executable by processor 701 in the electronic device shown in fig. 7 to implement the behavior recognition method described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the application.

Claims (10)

1. A method of behavior recognition, comprising:
inputting a preset number of continuous video frames into a pre-trained 2D convolutional neural network to obtain apparent characteristics of the video frames;
inputting the apparent characteristics of the video frames into a pre-trained 1D convolutional neural network to obtain the space-time characteristics of the video frames;
inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain behavior recognition results of the video frames;
the step of inputting the space-time characteristics of the video frames into a pre-trained classification network to obtain the behavior recognition result of the video frames comprises the following steps:
inputting the space-time characteristics of the video frames into a full connection layer of the classification network to obtain a plurality of characteristic values of the video frames; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is;
inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain confidence degrees of all behavior categories in a behavior recognition result of the video frame;
the 2D convolutional neural network, the 1D convolutional neural network, and the classification network are cascaded.
2. The method of claim 1, wherein the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network are trained by:
inputting any training sample in the training set into the 2D convolutional neural network to obtain the apparent characteristics of the training sample;
inputting the apparent characteristics of the training sample into the 1D convolutional neural network to obtain the space-time characteristics of the training sample;
and inputting the space-time characteristics of the training sample into the classification network to obtain the behavior recognition result of the training sample.
3. The method of claim 2, wherein after said inputting the spatiotemporal features of the training sample into the classification network, further comprising:
and according to the accuracy of the behavior recognition result of the training sample in the training set, carrying out parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network until the accuracy increase amplitude of the behavior recognition result of the training sample in the training set is smaller than a preset threshold value.
4. A method according to claim 3, wherein said parameter optimizing a network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network, and the classification network comprises:
model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network are optimized.
5. The method of claim 2, wherein prior to training the cascaded 2D convolutional neural network, 1D convolutional neural network, and classification network, further comprising:
the 2D convolutional neural network is pre-trained based on a picture classification dataset ImageNet.
6. A behavior recognition apparatus, comprising:
the apparent feature extraction unit is used for inputting a preset number of continuous video frames into the pre-trained 2D convolutional neural network so as to obtain apparent features of the video frames;
the time sequence feature extraction unit is used for inputting the apparent features of the video frames into a pre-trained 1D convolutional neural network so as to obtain the space-time features of the video frames;
the behavior recognition unit is used for inputting the space-time characteristics of the video frames into a pre-trained classification network so as to obtain behavior recognition results of the video frames;
the behavior recognition unit is specifically configured to input the spatio-temporal features of the video frame into a full-connection layer of the classification network, so as to obtain a plurality of feature values of the video frame; the feature values are in one-to-one correspondence with behavior categories supported by the classification network, and the larger the numerical value of the feature value is, the larger the probability of the behavior category corresponding to the feature value in the behavior recognition result of the video frame is; inputting a plurality of characteristic values of the video frame into a softmax layer of the classification network to obtain confidence degrees of all behavior categories in a behavior recognition result of the video frame;
the 2D convolutional neural network, the 1D convolutional neural network, and the classification network are cascaded.
7. The apparatus of claim 6, wherein
the apparent feature extraction unit is further used for inputting any training sample in the training set into the 2D convolutional neural network so as to obtain the apparent feature of the training sample;
the time sequence feature extraction unit is further used for inputting the apparent features of the training sample into the 1D convolutional neural network so as to obtain the space-time features of the training sample;
the behavior recognition unit is further used for inputting the space-time characteristics of the training sample into the classification network so as to obtain a behavior recognition result of the training sample.
8. The apparatus of claim 7, wherein the apparatus further comprises:
and the parameter optimization unit is used for performing parameter optimization on the network combination of the cascaded 2D convolutional neural network, the 1D convolutional neural network and the classification network according to the accuracy of the behavior recognition result of the training samples in the training set until the accuracy increase amplitude of the behavior recognition result of the training samples in the training set is smaller than a preset threshold value.
9. The apparatus of claim 8, wherein
the parameter optimization unit is specifically configured to optimize model parameters of the 2D convolutional neural network, the 1D convolutional neural network, and/or the classification network.
10. The apparatus of claim 7, wherein the apparatus further comprises:
and the pre-training unit is used for pre-training the 2D convolutional neural network based on a picture classification data set ImageNet before training the cascaded 2D convolutional neural network, 1D convolutional neural network and classification network.
CN201811510291.3A 2018-12-11 2018-12-11 Behavior recognition method and device Active CN111310516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811510291.3A CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811510291.3A CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Publications (2)

Publication Number Publication Date
CN111310516A CN111310516A (en) 2020-06-19
CN111310516B true CN111310516B (en) 2023-08-29

Family

ID=71159620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811510291.3A Active CN111310516B (en) 2018-12-11 2018-12-11 Behavior recognition method and device

Country Status (1)

Country Link
CN (1) CN111310516B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177450A (en) * 2021-04-20 2021-07-27 北京有竹居网络技术有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN115346143A (en) * 2021-04-27 2022-11-15 中兴通讯股份有限公司 Behavior detection method, electronic device, and computer-readable medium
CN113688729B (en) * 2021-08-24 2023-04-07 上海商汤科技开发有限公司 Behavior recognition method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014049118A (en) * 2012-08-31 2014-03-17 Fujitsu Ltd Convolution neural network classifier system, training method for the same, classifying method, and usage
CN106845381A (en) * 2017-01-16 2017-06-13 西北工业大学 Sky based on binary channels convolutional neural networks composes united hyperspectral image classification method
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
EP3291146A1 (en) * 2016-09-05 2018-03-07 Fujitsu Limited Knowledge extraction from a convolutional neural network
CN108229240A (en) * 2016-12-09 2018-06-29 杭州海康威视数字技术股份有限公司 A kind of method and device of determining picture quality
CN108460342A (en) * 2018-02-05 2018-08-28 西安电子科技大学 Hyperspectral image classification method based on convolution net and Recognition with Recurrent Neural Network
WO2018157862A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Vehicle type recognition method and device, storage medium and electronic device
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060215929A1 (en) * 2005-03-23 2006-09-28 David Fresneau Methods and apparatus for image convolution
US10586153B2 (en) * 2016-06-16 2020-03-10 Qatar University Method and apparatus for performing motor-fault detection via convolutional neural networks
US10963506B2 (en) * 2016-11-15 2021-03-30 Evolv Technology Solutions, Inc. Data object creation and recommendation using machine learning based offline evolution
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium
US11379688B2 (en) * 2017-03-16 2022-07-05 Packsize Llc Systems and methods for keypoint detection with convolutional neural networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014049118A (en) * 2012-08-31 2014-03-17 Fujitsu Ltd Convolution neural network classifier system, training method for the same, classifying method, and usage
EP3291146A1 (en) * 2016-09-05 2018-03-07 Fujitsu Limited Knowledge extraction from a convolutional neural network
CN108229240A (en) * 2016-12-09 2018-06-29 杭州海康威视数字技术股份有限公司 A kind of method and device of determining picture quality
CN106845381A (en) * 2017-01-16 2017-06-13 西北工业大学 Sky based on binary channels convolutional neural networks composes united hyperspectral image classification method
WO2018157862A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Vehicle type recognition method and device, storage medium and electronic device
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108460342A (en) * 2018-02-05 2018-08-28 西安电子科技大学 Hyperspectral image classification method based on convolution net and Recognition with Recurrent Neural Network
CN108596069A (en) * 2018-04-18 2018-09-28 南京邮电大学 Neonatal pain expression recognition method and system based on depth 3D residual error networks

Also Published As

Publication number Publication date
CN111310516A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
US11182620B2 (en) Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
CN110032926B (en) Video classification method and device based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN111768432A (en) Moving target segmentation method and system based on twin deep neural network
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN111310516B (en) Behavior recognition method and device
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN111178120B (en) Pest image detection method based on crop identification cascading technology
US20180137630A1 (en) Image processing apparatus and method
CN108805151B (en) Image classification method based on depth similarity network
CN114463218B (en) Video deblurring method based on event data driving
CN111027347A (en) Video identification method and device and computer equipment
CN110827265A (en) Image anomaly detection method based on deep learning
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN114494981A (en) Action video classification method and system based on multi-level motion modeling
CN114663769B (en) Fruit identification method based on YOLO v5
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN111242176A (en) Computer vision task processing method and device and electronic system
CN112528077B (en) Video face retrieval method and system based on video embedding
CN112651267A (en) Recognition method, model training, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant