CN117275093A - Subway driver driving action detection method and system

Subway driver driving action detection method and system

Info

Publication number
CN117275093A
Authority
CN
China
Prior art keywords: replacement, module, target object, action, feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311320382.1A
Other languages
Chinese (zh)
Inventor
魏秀琨
沈星
张唯
高利华
李欣
马垚
汤庆锋
高方庆
管青鸾
张慧贤
刘志强
胡新杨
葛承宇
蔡坤林
丁亚宁
吉杨
郭海鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202311320382.1A priority Critical patent/CN117275093A/en
Publication of CN117275093A publication Critical patent/CN117275093A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a system for detecting the driving actions of a subway driver, comprising: acquiring video data to be detected; performing target detection on a target object in the video data to be detected by using a pre-trained action detection model to obtain the region where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category. The invention can quickly and accurately realize region positioning as well as action recognition, saves the cost of manual review, has a high degree of automation and intelligence, and can meet ever-increasing operation safety demands.

Description

Subway driver driving action detection method and system
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a system for detecting driving actions of a subway driver.
Background
The rapid development of urban rail transit places higher requirements on train operation safety. The train driver plays an important role in ensuring safe train operation: the driver must confirm each completed step with a pointing gesture so that no step is missed, and the gestures indicate that the corresponding equipment is currently in a normal operating state. The category of action completed by the driver is therefore important for judging whether the corresponding equipment is currently operating normally. It is thus necessary to detect the driver's action category in real time, providing a basis for judging normal equipment operation and ensuring safe and reliable train operation.
At present, action detection for subway drivers relies mainly on manual review of monitoring videos, which suffers from high labor intensity, low efficiency, poor real-time performance and a low degree of automation and intelligence, making it difficult to meet ever-increasing operation safety requirements.
Disclosure of Invention
The embodiment of the invention provides a method and a system for detecting the driving action of a subway driver, which are used for overcoming the defects of the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In a first aspect, the present invention provides a method for detecting driving actions of a subway driver, including:
acquiring video data to be detected;
performing target detection on a target object in the video data to be detected by using a pre-trained motion detection model, and obtaining an area where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
Optionally, the pre-trained motion detection model includes a target detection network and a motion recognition network;
performing target detection on a target object in the video data to be detected by using a pre-trained motion detection model, and obtaining an area where the target object is located; and based on the region where the target object is located, performing motion recognition on the target object to obtain a motion category, including:
Performing target detection on a target object in the video data to be detected by using the target detection network to obtain an area where the target object is located;
and based on the area where the target object is located, performing action recognition on the target object by using the action recognition network to obtain an action category.
Optionally, the action recognition network comprises a first convolution module, a space enhancement module, a plurality of groups of replacement units and a second convolution module which are sequentially connected;
the step of performing motion recognition on the target object by using the motion recognition network based on the region where the target object is located to obtain a motion category includes:
carrying out convolution processing on the video data to be detected by using the first convolution module to obtain a first convolved feature map;
performing spatial enhancement on the first convolved feature map by using the spatial enhancement module to obtain an enhanced feature map;
carrying out space-time feature extraction on the enhanced feature map by utilizing the plurality of groups of replacement units to obtain space-time features;
carrying out convolution processing on the space-time characteristics by using the second convolution module to obtain a second convolved characteristic diagram;
performing ROI alignment and ROI pooling according to the second convolved feature map and an anchor frame corresponding to the region where the target object is located, and obtaining feature information corresponding to the anchor frame;
And performing action recognition on the characteristic information corresponding to the anchor frame by using the full connection layer to obtain an action category.
Optionally, the multiple groups of replacement units include a first replacement unit, a second replacement unit and a third replacement unit which are sequentially connected;
the first replacement unit comprises a first replacement module, a second replacement modules and a replacement attention module which are connected in sequence;
the second replacement unit comprises a first replacement module, b second replacement modules and a replacement attention module which are connected in sequence;
the third replacement unit comprises c second replacement modules and a replacement attention module which are connected in sequence, where a, b and c denote the numbers of second replacement modules in the respective replacement units;
the first replacement module comprises a first branch, a second branch, a third branch, a fourth branch, a splicing layer and a channel replacement layer of the first replacement module;
the second replacement module comprises a channel dividing layer, a first branch, a second branch, a third branch, a splicing layer and a channel replacement layer of the second replacement module;
the third branch comprises a 1x1x1 convolution layer, a 5x5x5 depth separable convolution layer and a 1x1x1 convolution layer which are connected in sequence;
the fourth branch comprises a depth separable convolution layer of 5x5x5 and a convolution layer of 1x1x1 connected in sequence.
Optionally, the method further comprises:
the replacement attention module is utilized to carry out channel grouping on the feature images output by the first replacement module or the second replacement module, so as to obtain a plurality of groups of feature images;
processing the plurality of groups of feature images by using a channel attention mechanism and a space attention mechanism in the replacement attention module respectively, and correspondingly obtaining a channel importance coefficient and a space importance coefficient of each group of feature images;
based on the channel importance coefficient and the space importance coefficient, the replacement attention module is utilized to splice and fuse the plurality of groups of feature images, channel replacement is adopted to carry out inter-group communication on the fused feature images, and the feature images output by the replacement attention module are obtained.
Optionally, the spatially enhancing the first convolved feature map by using the spatial enhancing module to obtain an enhanced feature map includes:
the space enhancement module is utilized to respectively carry out global average pooling and global maximum pooling on the first convolved feature map along the channel dimension, and splice the average pooled feature map and the maximum pooled feature map to obtain a spliced pooled feature map;
performing feature extraction on the pooling feature images after splicing by using the 3D convolution layer in the space enhancement module, and activating the feature images output by the 3D convolution layer by using an activation function in the space enhancement module to obtain activated feature images;
Multiplying the activated feature map with the first convolved feature map to obtain a multiplied feature map as an enhanced feature map.
Optionally, the target detection network is MobileNetV2-SSDLite.
Optionally, the target detection network includes a two-dimensional standard convolution layer, a plurality of first bottleneck layers, a two-dimensional standard convolution layer, and a plurality of second bottleneck layers connected in sequence.
Optionally, the pre-trained motion detection model is obtained by training based on training video data and corresponding tag data, wherein the training video data is motion video data of a driver in a cab.
In a second aspect, the present invention also provides a subway driver driving action detection system, including:
the data acquisition module, which is used for acquiring video data to be detected;
the region detection and action recognition module, which is used for performing target detection on a target object in the video data to be detected by using a pre-trained action detection model to obtain the region where the target object is located, and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
The invention has the beneficial effects that: according to the subway driver driving action detection method and system, a pre-trained action detection model is used to perform target detection on the target object in the video data to be detected and obtain the region where the target object is located, and action recognition is performed on the target object based on that region to obtain an action category. Region positioning and action recognition can therefore be realized quickly and accurately, the cost of manual review is saved, the degree of automation and intelligence is high, and ever-increasing operation safety requirements can be met.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for detecting driving actions of a subway driver according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a motion detection model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a first replacement module and a second replacement module according to an embodiment of the present invention;
FIG. 4a is a schematic flow chart of a channel replacement layer data processing according to an embodiment of the present invention;
FIG. 4b is a gray scale of FIG. 4 a;
FIG. 5 is a flow diagram of a replacement attention module process provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a processing flow of a spatial enhancement module according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a bottleneck layer according to an embodiment of the present invention;
FIG. 8 is a second flow chart of a method for detecting driving actions of a subway driver according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an action detection result according to an embodiment of the present invention;
fig. 10 is a schematic diagram of action categories according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Currently, subway driver action detection research can be divided into two categories: image processing techniques and deep learning techniques. Action recognition methods based on image processing can be divided into three types: template-based, spatio-temporal interest point-based, and motion trajectory-based. Template-based methods are suitable only for simple actions and place high demands on template selection. Methods based on spatio-temporal interest points need to detect interest points in the video, namely the points with the most intense change in the temporal and spatial dimensions; they can greatly improve action recognition accuracy, but interest point detection is computationally expensive and cannot achieve real-time detection. Motion trajectory-based methods represent an action by the motion trajectories of human skeleton key points and offer good robustness and strong anti-interference capability, but they depend heavily on accurate estimation and tracking of the skeleton key points, and the large computational cost of the model prevents real-time detection. Methods based on deep learning use deep convolutional neural networks to extract the spatio-temporal features of action videos and can be subdivided into three types: those based on three-dimensional convolutional neural networks, on two-stream convolutional neural networks, and on long short-term memory networks. Compared with action recognition methods based on traditional image processing, deep learning-based methods greatly improve recognition accuracy and detection speed, but the model size and computation are large and do not yet meet the requirement of real-time action detection.
The driver action recognition based on the deep learning specifically comprises the following steps:
step 1, various action videos are cut and marked: and cutting out the fragments containing each type of action in the video from the original long video by utilizing video clipping software, so that each action fragment contains only one action. After cutting out, the video clips of each type of action are marked by the same numbers.
Step 2, video frame extraction and sampling: for each video segment, it is extracted into a frame sequence at a fixed frame rate, and an equally spaced sampling strategy is employed, i.e. 1 frame is sampled per 8 frames, for a total of 16 frames of images from the frame sequence.
Step 3, data preprocessing: the sampled 16 frames are first scaled to 342 (width) × 256 (height), then cropped to 224 × 224 by a random cropping operation, and finally normalized (a sampling and preprocessing sketch is given after step 5 below).
Step 4, model training: the 16 input frames are used to train a convolutional neural network, and the trained model weights are finally obtained.
Step 5, model inference: an action video clip is input into the model for action recognition, and the model outputs the action category and the prediction score of the video.
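The sampling and preprocessing of steps 2 and 3 could be written roughly as in the sketch below; the OpenCV-based resizing and the per-channel mean/std values (e.g., ImageNet statistics passed as arrays of shape (3,)) are assumptions, not values given in the text.

```python
import numpy as np
import cv2

def sample_clip(frames, num_frames=16, stride=8):
    """Equally spaced sampling: take num_frames images, one every `stride` frames."""
    idx = [min(i * stride, len(frames) - 1) for i in range(num_frames)]
    return [frames[i] for i in idx]

def preprocess_clip(frames, mean, std, crop=224):
    """Resize each frame to 342x256, apply one shared random 224x224 crop, then normalize."""
    y0 = np.random.randint(0, 256 - crop + 1)   # shared crop offsets keep frames aligned
    x0 = np.random.randint(0, 342 - crop + 1)
    out = []
    for f in frames:
        f = cv2.resize(f, (342, 256))                        # dsize is (width, height)
        f = f[y0:y0 + crop, x0:x0 + crop].astype(np.float32)
        f = (f / 255.0 - mean) / std                         # per-channel normalization (assumed statistics)
        out.append(f)
    return np.stack(out)                                     # shape (num_frames, 224, 224, 3)
```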
However, deep learning-based driver action recognition has the following drawbacks: existing subway driver action recognition methods based on deep learning mainly recognize pre-cut action segments, each of which contains only one action category, and cannot perform multi-category action detection and recognition from a real-time video stream. Because of the large size and computation of the three-dimensional convolutional network models used, recognition is slow in practice. Furthermore, existing approaches perform video-level action recognition, i.e., video action classification, and cannot locate the driver region.
Since no real-time, high-precision and high-efficiency subway driver driving action detection method currently exists, the invention uses deep learning to provide a subway driver driving action detection method based on an improved 3D (three-dimensional) ShuffleNetV2 network, fully considering the driver's action categories and the multi-action labels of the driver in a real driving environment. The method can quickly and accurately locate the driver region and also recognize the action. In addition, based on the method, a subway driver action detection system is built that reads video data from the monitoring video in real time and performs real-time driver action detection. Providing the monitoring department with an automatic detection method for the subway driver's driving actions can reduce labor cost, improve detection efficiency, and improve both the safety and the automation level of urban rail transit. The driving action detection method for a subway driver provided by the invention is described below with reference to the accompanying drawings.
Example 1
Fig. 1 is a schematic flow chart of a method for detecting driving actions of a subway driver according to an embodiment of the present invention; as shown in fig. 1, a method for detecting driving actions of a subway driver includes the following steps:
s101, obtaining video data to be detected.
In this step, the video data to be detected is acquired. The data contains a target object in a dynamic state, which may be a person or an object, such as a vehicle, a driver or a pedestrian.
S102, performing target detection on a target object in the video data to be detected by using a pre-trained motion detection model, and obtaining an area where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
In this step, a single three-dimensional convolutional neural network may be used to perform target object region positioning and action recognition on the video data to be detected at the same time, or two sub-networks (a target detection sub-network and an action recognition sub-network) may be used to perform region positioning and action recognition respectively. In this way, the region where the target object is located and the action category corresponding to the target object can be obtained at the same time.
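As a minimal illustration of the two-sub-network option, the sketch below wires a detection sub-network and a recognition sub-network together. `detector` and `recognizer` are placeholder modules with assumed interfaces (boxes of shape (K, 4), scores of shape (K,), one score row per box), not the networks defined later in this document.

```python
import torch

def detect_actions(clip, detector, recognizer, score_thresh=0.5):
    """Two-stage sketch: locate the driver on a key frame, then classify the action on the clip.

    clip: float tensor of shape (1, C, T, H, W); detector and recognizer are placeholder
    modules standing in for the target detection and action recognition sub-networks.
    """
    with torch.no_grad():
        key_frame = clip[:, :, clip.shape[2] // 2]      # middle frame, shape (1, C, H, W)
        boxes, scores = detector(key_frame)             # assumed interface: boxes (K, 4), scores (K,)
        keep = scores > score_thresh                    # drop low-confidence candidate regions
        boxes = boxes[keep]
        action_logits = recognizer(clip, boxes)         # assumed interface: logits per kept box
        return boxes, action_logits.argmax(dim=-1)      # region + action category per detection
```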
According to the subway driver driving action detection method provided by the embodiment of the invention, a pre-trained action detection model is used to perform target detection on the target object in the video data to be detected and obtain the region where the target object is located, and action recognition is performed on the target object based on that region to obtain an action category. Region positioning and action recognition can therefore be realized quickly and accurately, the cost of manual review is saved, the degree of automation and intelligence is high, and ever-increasing operation safety requirements can be met.
Fig. 2 is a schematic structural diagram of an action detection model according to an embodiment of the present invention, and as shown in fig. 2, the pre-trained action detection model includes a target detection network and an action recognition network.
The method comprises the steps that a pre-trained action detection model is utilized to carry out target detection on a target object in video data to be detected, and an area where the target object is located is obtained; and based on the region where the target object is located, performing motion recognition on the target object to obtain a motion category, including:
performing target detection on the target object in the video data to be detected by using the target detection network to obtain the region where the target object is located. It should be noted that the regions directly output by the target detection network are a plurality of proposed candidate positioning regions; the single positioning region corresponding to the target object is obtained after non-maximum suppression is applied.
and, based on the region where the target object is located, performing action recognition on the target object by using the action recognition network to obtain an action category. Here, action recognition is performed on the positioning region obtained after non-maximum suppression.
In this embodiment, region positioning and action recognition are performed by the target detection network and the action recognition network, respectively. The target detection network may be from the SSD series, the YOLO series, the R-CNN series, or the like. To further improve the real-time performance of target detection, the target detection network is preferably a lightweight detection network such as NanoDet or MobileNetV2-SSDLite. The action recognition network may be a 3D CNN or a lightweight 3D convolutional network, such as the ShuffleNet series, the MobileNet series or the GhostNet series; using a lightweight 3D convolutional network further improves the real-time performance of action recognition.
Further, as shown in fig. 2, the action recognition network is an improved 3D ShuffleNetV2, which includes a first convolution module, a spatial enhancement module, a plurality of groups of replacement units (3 groups in this embodiment; the number can be adjusted according to model performance in other embodiments) and a second convolution module connected in sequence.
The step of performing motion recognition on the target object by using the motion recognition network based on the region where the target object is located to obtain a motion category includes:
And carrying out convolution processing on the video data to be detected by using the first convolution module to obtain a first convolved feature map.
And performing space enhancement on the first convolved feature map by using the space enhancement module to obtain an enhanced feature map.
And extracting the space-time characteristics of the enhanced characteristic map by utilizing the plurality of groups of replacement units to obtain the space-time characteristics.
And carrying out convolution processing on the space-time characteristics by using the second convolution module to obtain a second convolved characteristic diagram.
And performing ROI alignment and ROI pooling according to the second convolved feature map and the anchor frame corresponding to the region where the target object is located, and obtaining feature information corresponding to the anchor frame.
And performing action recognition on the characteristic information corresponding to the anchor frame by using the full connection layer to obtain an action category.
In this embodiment, the first convolution module in the improved 3D ShuffleNetV2 downsamples the input video data to be detected, reducing the feature map size; feature extraction is then performed by the spatial enhancement module, which reduces the influence of the spatial information loss caused by the downsampling of the first convolution module and thus improves the spatial information representation capability of the action recognition network. The plurality of groups of replacement units enlarge the receptive field of the enhanced feature map output by the spatial enhancement module, and the second convolution module convolves the spatio-temporal features output by the last group of replacement units to obtain the second convolved feature map. The anchor frames of the positioning region obtained by the target detection network are mapped onto the second convolved feature map, and ROI alignment and ROI pooling are performed so that each anchor frame produces a feature of fixed size; action recognition is finally performed through the fully connected layer. The region where the target object is located, the action category and the confidence of the action category are then displayed in the video.
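To make the last two steps concrete, the sketch below shows one common way (not necessarily the patent's exact implementation) to pool anchor-frame features from a 3D feature map and classify them: the temporal dimension is averaged, 2D ROI alignment extracts a region feature, ROI pooling reduces it to a fixed size, and a fully connected layer predicts the action category. The feature stride, pooling sizes and class count are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class ActionHead(nn.Module):
    """Pool anchor-frame features from a 3D feature map and classify the action (sketch)."""

    def __init__(self, in_channels, num_classes, pooled_size=7, spatial_scale=1 / 16.0):
        super().__init__()
        self.pooled_size = pooled_size
        self.spatial_scale = spatial_scale              # assumed backbone feature stride of 16
        self.fc = nn.Linear(in_channels * pooled_size * pooled_size, num_classes)

    def forward(self, feat_3d, boxes):
        # feat_3d: (N, C, T, H, W); boxes: list of (K_i, 4) tensors in image coordinates
        feat_2d = feat_3d.mean(dim=2)                   # collapse time before 2D ROI alignment
        roi = roi_align(feat_2d, boxes, output_size=14, spatial_scale=self.spatial_scale)
        roi = F.adaptive_max_pool2d(roi, self.pooled_size)   # ROI pooling to a fixed size
        return self.fc(roi.flatten(1))                  # (sum K_i, num_classes) action scores
```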
Further, the plurality of groups of replacement units comprise a first replacement unit, a second replacement unit and a third replacement unit which are sequentially connected;
the first replacement unit comprises a first replacement module, a second replacement modules and a replacement attention module which are connected in sequence;
the second replacement unit comprises a first replacement module, b second replacement modules and a replacement attention module which are connected in sequence;
the third replacement unit comprises c second replacement modules and a replacement attention module which are connected in sequence;
the first replacement module comprises a first branch, a second branch, a third branch, a fourth branch, a splicing layer and a channel replacement layer of the first replacement module;
the second replacement module comprises a channel dividing layer, a first branch, a second branch, a third branch, a splicing layer and a channel replacement layer of the second replacement module;
the third branch comprises a 1x1x1 convolution layer, a 5x5x5 depth separable convolution layer and a 1x1x1 convolution layer which are connected in sequence;
the fourth branch comprises a depth separable convolution layer of 5x5x5 and a convolution layer of 1x1x1 connected in sequence. It should be noted that the first branch of the first replacement module is not the same as the first branch of the second replacement module; refer to the structural diagram in fig. 3.
In this embodiment, because a large convolution kernel can greatly enlarge the receptive field of the model, and the receptive field of the action recognition network is very important for the subsequent recognition task, an improved replacement unit is proposed that adds branches with a convolution kernel size of 5x5x5 to the original replacement module (the existing replacement module comprises only a first branch and a second branch), as shown in fig. 3.
When the step size is 1 (i.e., the second replacement module), as shown in the left part of fig. 3, a branch 3 (i.e., the third branch) is added alongside the original branch 1 and branch 2 (i.e., the second branch, whose depth separable convolution uses a 3x3x3 kernel). The depth separable convolution of branch 3 has a kernel size of 5x5x5 to obtain a larger receptive field, and the other parts of branch 3 are consistent with branch 2. To keep the number of channels after concatenation unchanged, the number of output channels of the last 1x1x1 convolution in branches 2 and 3 is set to 1/4 of the number of input channels. Branch 1, branch 2 and branch 3 are then spliced together to obtain a feature map with the same number of channels as the input, followed by a channel replacement layer (as shown in fig. 4).
When the step size is 2 (i.e., the first replacement module), as shown in the right part of fig. 3, a branch 3 and a branch 4 (i.e., the fourth branch) are added on the basis of the original branch 1 (i.e., the first branch of the first replacement module) and branch 2. The overall structure of branch 4 is kept consistent with branch 1, the difference being that the depth separable convolution of branch 1 uses a 3x3x3 kernel while that of branch 4 uses a 5x5x5 kernel; similarly, the overall structure of branch 3 is kept consistent with branch 2, the difference being that the depth separable convolution of branch 2 uses a 3x3x3 kernel while that of branch 3 uses a 5x5x5 kernel. To keep the number of channels after concatenation unchanged, the output channels of the last 1x1x1 convolution of all branches are set to 1/4 of the original channel number. Branch 1, branch 2, branch 3 and branch 4 are then spliced together to obtain a feature map with the same number of channels as the original, followed by a channel replacement layer (as shown in fig. 4, the feature map channels are divided into G groups and then shuffled between the groups).
Fig. 5 is a schematic flow chart of a replacement attention module process according to an embodiment of the present invention, as shown in fig. 5:
channel grouping is carried out on the feature graphs (namely the original feature graphs in fig. 5) output by the first replacement module or the second replacement module by utilizing the replacement attention module, so as to obtain a plurality of groups of feature graphs;
processing the plurality of groups of feature images by using a channel attention mechanism and a space attention mechanism in the replacement attention module respectively, and correspondingly obtaining a channel importance coefficient and a space importance coefficient of each group of feature images;
based on the channel importance coefficient and the space importance coefficient, the replacement attention module is utilized to splice and fuse the plurality of groups of feature images, and channel replacement is utilized to perform inter-group communication on the fused feature images, so that the feature images output by the replacement attention module (i.e. the improved feature images in fig. 5) are obtained.
In this embodiment, in order to enable the action recognition network to pay more attention to important channels and spatial locations, a replacement attention module is introduced. By introducing this attention module, the action recognition network can learn the importance weights of the channels and spatial locations, multiplied by the original feature map, resulting in an improved feature map.
Specifically, the feature maps output by the first replacement module or the second replacement module are divided into g groups along the channel dimension, and each group is split into two branches, a channel attention branch and a spatial attention branch; after the feature maps of each group are processed by the channel attention branch and the spatial attention branch respectively, the channel importance coefficient and the spatial importance coefficient are obtained. The channel attention branch adopts a combination of global average pooling, scaling and Sigmoid activation. The spatial attention branch uses group normalization (GN) to obtain statistics of the spatial dimension, which are then enhanced with Fc().
After the channel importance coefficient and the spatial importance coefficient are obtained, the grouped feature maps are integrated according to these coefficients: the feature maps are first spliced and fused, and channel replacement is then adopted for inter-group communication, giving the final feature map output by the replacement attention module.
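The description above matches the usual shuffle attention (SA) design; a rough 3D sketch follows. The group count of 8, the learnable scale/bias parameters for both branches, and the per-channel GroupNorm are assumptions consistent with the common SA formulation, not values stated in the patent.

```python
import torch
import torch.nn as nn

class ShuffleAttention3D(nn.Module):
    """Replacement (shuffle) attention for 5D features (N, C, T, H, W), sketched after SA-Net."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                       # channels per half-group
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1, 1)) # channel-attention scale
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1, 1))
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1, 1)) # spatial-attention scale
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1, 1))
        self.gn = nn.GroupNorm(c, c)                       # statistics of the spatial dimension

    def forward(self, x):
        n, ch, t, h, w = x.shape
        x = x.view(n * self.groups, ch // self.groups, t, h, w)
        xc, xs = x.chunk(2, dim=1)
        # channel importance coefficient: global average pooling + scaling + sigmoid
        xc = xc * torch.sigmoid(self.cw * xc.mean(dim=(2, 3, 4), keepdim=True) + self.cb)
        # spatial importance coefficient: group normalization + scaling + sigmoid
        xs = xs * torch.sigmoid(self.sw * self.gn(xs) + self.sb)
        out = torch.cat([xc, xs], dim=1).view(n, self.groups, ch // self.groups, t, h, w)
        out = out.transpose(1, 2).contiguous().view(n, ch, t, h, w)   # channel replacement between groups
        return out
```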
Fig. 6 is a schematic process flow diagram of a spatial enhancement module provided in an embodiment of the present invention, as shown in fig. 6, where the performing spatial enhancement on the first convolved feature map by using the spatial enhancement module to obtain an enhanced feature map includes:
The space enhancement module is utilized to respectively carry out global average pooling and global maximum pooling on the first convolved feature map along the channel dimension, and splice the average pooled feature map and the maximum pooled feature map to obtain a spliced pooled feature map;
performing feature extraction on the pooling feature images after splicing by using the 3D convolution layer in the space enhancement module, and activating the feature images output by the 3D convolution layer by using an activation function in the space enhancement module to obtain activated feature images;
multiplying the activated feature map with the first convolved feature map to obtain a multiplied feature map as an enhanced feature map.
In this embodiment, the spatial enhancement module performs global average pooling and global maximum pooling on the first convolved feature map along the channel dimension, splices the two pooled feature maps, performs feature extraction through a 3D convolution with a kernel size of 7x7x7, improves the expression capability of the network with an activation function, and multiplies the result by the original feature map to obtain the enhanced feature map.
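A compact sketch of such a spatial enhancement block (essentially a 3D variant of CBAM-style spatial attention) might look as follows; the use of a sigmoid as the activation and the absence of a bias term are assumptions.

```python
import torch
import torch.nn as nn

class SpatialEnhancement3D(nn.Module):
    """Spatial enhancement sketch: channel-wise mean/max maps -> 7x7x7 conv -> sigmoid weight."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                   # x: (N, C, T, H, W)
        avg_map = x.mean(dim=1, keepdim=True)               # global average pooling along channels
        max_map = x.max(dim=1, keepdim=True).values         # global max pooling along channels
        weight = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weight                                   # multiply back onto the original features
```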
Based on the above description, in this embodiment, the network structure and parameters of the action recognition network are shown in table 1:
Table 1 action recognition network structure and parameters
As can be seen from table 1, the first replacement unit includes a first replacement module with a step size of 2, three second replacement modules with a step size of 1, and a replacement attention module; the second replacement unit includes a first replacement module with a step size of 2, seven second replacement modules with a step size of 1, and a replacement attention module; the third replacement unit includes four second replacement modules with a step size of 1 and a replacement attention module.
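For illustration, the three replacement units could be assembled from the counts above roughly as in the sketch below; `first_block`, `second_block` and `attention` stand for the stride-2 module, the stride-1 module and the attention module, and the channel widths in the usage comment are placeholders since Table 1 is not reproduced here.

```python
import torch.nn as nn

def make_replacement_unit(first_block, second_block, attention, in_ch, out_ch, num_second):
    """Build one replacement unit: optional stride-2 block, num_second stride-1 blocks, attention."""
    layers = []
    if first_block is not None:
        layers.append(first_block(in_ch, out_ch))            # stride-2 first replacement module
    layers += [second_block(out_ch) for _ in range(num_second)]
    layers.append(attention(out_ch))                         # replacement attention module
    return nn.Sequential(*layers)

# Counts from the text: 3, 7 and 4 second replacement modules in units 1-3, e.g.
# unit1 = make_replacement_unit(FirstBlock, ImprovedShuffleBlockS1, ShuffleAttention3D, 24, 116, 3)
# (FirstBlock is a hypothetical stride-2 module; the channel widths are placeholders.)
```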
Further, the target detection network is MobileNetV2-SSDLite. The target detection network includes a two-dimensional standard convolution layer (i.e., 2D convolution), a plurality of first bottleneck layers (i.e., bottleneck layer 1), a two-dimensional standard convolution layer and a plurality of second bottleneck layers (i.e., bottleneck layer 2) connected in sequence.
In this embodiment, the target detection network is MobileNetV2-SSDLite, whose structure and parameters are shown in table 2; when used for target detection, MobileNetV2-SSDLite has the advantages of high accuracy, good real-time performance, and a small number of model parameters and small computation. As can be seen from table 2, MobileNetV2-SSDLite consists of 2 standard 2-dimensional convolutions, 17 bottleneck layers 1 and 4 bottleneck layers 2.
During processing of the video data to be detected, the input size of MobileNetV2-SSDLite may be set to 320 x 320, and the IoU (intersection over union) threshold for non-maximum suppression is set to 0.45, i.e., when the IoU of two anchor boxes exceeds 0.45 they are considered to correspond to the same object. The confidence threshold is set to 0.5, i.e., the target object is considered detected when the confidence of a detected anchor box exceeds 0.5.
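The confidence filtering and non-maximum suppression described above can be expressed with torchvision's nms as a short sketch; the box format and tensor shapes are assumptions.

```python
from torchvision.ops import nms

def filter_detections(boxes, scores, iou_thresh=0.45, score_thresh=0.5):
    """Keep anchor boxes with confidence > 0.5, then suppress overlaps above IoU 0.45.

    boxes: (K, 4) tensor in (x1, y1, x2, y2) image coordinates; scores: (K,) confidences.
    """
    keep = scores > score_thresh                 # confidence threshold of 0.5
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_thresh)        # boxes with IoU > 0.45 are treated as one object
    return boxes[keep], scores[keep]
```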
Table 2 driver detection network structure
The structures of bottleneck layer 1 and bottleneck layer 2 are shown in fig. 7. Bottleneck layer 1 has a step-size-1 variant and a step-size-2 variant. When the step size is 1, bottleneck layer 1 comprises, connected in sequence, an input (i.e., the output of the previous network layer), a 1x1 convolution + ReLU6 activation, a 3x3 depth separable convolution + ReLU6 activation, and a 1x1 convolution + linear layer; in addition, the input is connected to the output of the 1x1 convolution + linear layer so that the two are added (a residual connection).
When the step size is 2, the network structure of bottleneck layer 1 comprises, connected in sequence, an input, a 1x1 convolution + ReLU6 activation, a 3x3 depth separable convolution + ReLU6 activation, and a 1x1 convolution + linear layer, without the residual connection.
The network structure of bottleneck layer 2 comprises, connected in sequence, an input, a 1x1 convolution + ReLU6 activation (number of channels here = C/2), a 3x3 depth separable convolution + ReLU6 activation (number of channels here = C), and an output (number of channels here = C).
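For reference, a standard MobileNetV2-style inverted residual matching the bottleneck layer 1 description might be sketched as below; the expansion factor is an assumption, as the text does not state it, and bottleneck layer 2 (with its C/2 intermediate width) is not shown.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Bottleneck layer 1 sketch: 1x1 expand + ReLU6, 3x3 depthwise + ReLU6, 1x1 linear."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6):    # expansion factor is an assumption
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch    # residual add only in the stride-1 case
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),  # depth separable
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),             # linear 1x1
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```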
Further, the pre-trained motion detection model is obtained by training based on training video data and corresponding tag data, wherein the training video data is motion video data of a driver in a cab.
Based on the target detection network and the action recognition network described above, during training the target detection network is taken directly as an already pre-trained network; only the pre-trained network and its weights need to be downloaded and saved. Therefore, within the action detection model, only the action recognition network needs to be trained.
The network training parameters of the action recognition network are as follows: the optimizer is stochastic gradient descent (SGD), the initial learning rate is 0.01, the momentum is set to 0.9, the weight decay is set to 0.00003, the learning rate update strategy is cosine decay, the loss function is the cross entropy loss function, the batch size during training is set to 32, and the remaining parameters use their default values.
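Under these settings, the optimizer, learning-rate schedule and loss could be configured as in the sketch below; the schedule length is an assumption since the number of training epochs is not given, and the batch size of 32 would be set on the data loader.

```python
import torch
import torch.nn as nn

def build_training_setup(model, num_epochs, steps_per_epoch):
    """Optimizer, schedule and loss matching the listed parameters (schedule length assumed)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=3e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * steps_per_epoch)
    criterion = nn.CrossEntropyLoss()   # cross entropy loss; batch size of 32 is set on the DataLoader
    return optimizer, scheduler, criterion
```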
After the action detection model is trained, the generated weight file is saved in .pth format; on this basis, inference and prediction can be performed simply by inputting a video.
According to the subway driver driving action detection method provided by the embodiment of the invention, the driver region can be automatically and accurately detected and the action category identified in real time through the action detection model. Compared with existing action recognition methods, the method considers region positioning and multi-label action categories, and performs action detection by reading video data from a monitoring camera in real time. It is easy to implement in software, convenient and practical, and has high economic and social benefits. In addition, because a lightweight network is adopted, the overall model parameters and computation are small, making the method easy to deploy on embedded and mobile devices with limited computing resources. The driver action detection system based on this model reads video data directly from the monitoring camera for action detection; it is real-time, accurate and efficient, the system algorithm is simple to implement, and it is convenient for practical application.
Fig. 8 is a second flow chart of a method for detecting driving actions of a subway driver according to an embodiment of the present invention, as shown in fig. 8, taking action recognition of a subway driver as an example, the method for detecting driving actions of a subway driver includes the following steps:
s801, a subway driver monitoring video is read, which may be recorded by 2 monitoring cameras (respectively located at the lower left corner and the upper right corner of the cab) installed in the cab, and the specific resolution may be 1280×720.
S802, video frame reading. In this step, since action video clips with a duration of 10 seconds were input when training the action detection model, action video clips with a duration of 10 seconds are cut out of the monitoring video (in other embodiments the clip length can be adjusted according to the actual situation; the invention is not limited in this respect). The frame sequence of each video clip is then extracted at a frame rate of 30 FPS.
S803, video frame sampling. In this step, since an action is represented by a certain number of video frames arranged in temporal order, inputting the entire frame sequence into the model would increase its computational cost; a certain number of frames, such as 8, 16 or 32 frames, can therefore be sampled from the video to reduce computation. Preferably, 8 frames are used in this embodiment, with a sampling interval of 8, i.e., 1 frame is sampled from every 8 frames.
S804, video frame preprocessing. In this step, the sampled 8 frames are normalized, that is, the pixel values are standardized from [0, 255] to a distribution with a mean of 0 and a standard deviation of 1. The specific formula is as follows:
x_new = (x - μ) / σ
wherein x is the original pixel value, μ is the mean of the current channel's pixel values in the image, σ is the standard deviation of the current channel's pixel values in the image, and x_new is the new pixel value after normalization.
S805, driver area positioning, in which the preprocessed video frames are used to perform driver area positioning by using the above-mentioned object detection network.
S806, driver motion recognition, in which the pre-processed video frame is used to perform driver motion recognition by using the above-mentioned motion recognition network.
S807, the output motion detection result is shown in fig. 9, including the motion category and the region where the driver is located. The action categories of the driver are as shown in fig. 10, and include sitting+pointing to the front window, sitting+no other actions, standing in the cab+pressing on/off button, standing outside the cab, going from outside to inside, going from inside to outside, sitting+pointing to instruments and screens, sitting+pointing to the lower left instrument, sitting+pushing instruments, and so on.
The invention provides: 1) a subway driver driving action detection method based on an improved 3D (three-dimensional) ShuffleNetV2 network, which realizes automatic, real-time and accurate detection of the driver region and identification of the action category; 2) compared with existing action recognition methods, consideration of driver region positioning and of multi-label driver action categories; 3) a model based mainly on deep learning that locates the driver's body region and identifies the driver's current action category; 4) a real-time driver action detection system that reads video data from the monitoring camera in real time and performs action detection; 5) a solution that is easy to implement in software, convenient and practical, and has high economic and social benefits.
Example 2
On the basis of embodiment 1, this embodiment 2 provides a subway driver driving action detection system corresponding to the above action detection method, the subway driver driving action detection system including:
the data acquisition module is used for acquiring video data to be detected;
the region detection and action recognition module is used for carrying out target detection on a target object in the video data to be detected by utilizing a pre-trained action detection model, and obtaining a region where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
In practical application, the subway driver driving action detection system further comprises a result display module, which is used to draw the action detection results, including the action categories and their corresponding confidences, on the images, synthesize a video at a fixed frame rate, and display it on the screen.
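A minimal OpenCV sketch of such a result display step is given below; the codec, output path and drawing style are assumptions.

```python
import cv2

def render_results(frames, box, label, score, fps=30, out_path="result.mp4"):
    """Draw the detected driver box and the action label with its confidence, then write a video."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    x1, y1, x2, y2 = map(int, box)
    text = f"{label}: {score:.2f}"
    for frame in frames:
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, text, (x1, max(y1 - 10, 0)), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        writer.write(frame)
    writer.release()
```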
The specific details refer to the description of the driving action detection method of the subway driver, and are not repeated here.
In summary, according to the method and system for detecting the driving actions of a subway driver provided by the embodiments of the invention, a pre-trained action detection model is used to detect the target object in the video data to be detected and obtain the region where the target object is located, and action recognition is performed on the target object based on that region to obtain an action category. Region positioning and action recognition can therefore be realized quickly and accurately, the cost of manual review is saved, the degree of automation and intelligence is high, and ever-increasing operation safety requirements can be met.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the others. In particular, for a system or apparatus embodiment, since it is substantially similar to a method embodiment, the description is relatively simple, and reference may be made to the description of the method embodiment for the relevant parts. The method and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting driving actions of a subway driver, comprising the steps of:
acquiring video data to be detected;
performing target detection on a target object in the video data to be detected by using a pre-trained motion detection model, and obtaining an area where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
2. The subway driver driving motion detection method according to claim 1, wherein the pre-trained motion detection model includes a target detection network and a motion recognition network;
performing target detection on a target object in the video data to be detected by using a pre-trained motion detection model, and obtaining an area where the target object is located; and based on the region where the target object is located, performing motion recognition on the target object to obtain a motion category, including:
performing target detection on a target object in the video data to be detected by using the target detection network to obtain an area where the target object is located;
and based on the area where the target object is located, performing action recognition on the target object by using the action recognition network to obtain an action category.
3. The subway driver driving motion detection method according to claim 2, wherein the motion recognition network comprises a first convolution module, a space enhancement module, a plurality of groups of replacement units and a second convolution module which are connected in sequence;
the step of performing motion recognition on the target object by using the motion recognition network based on the region where the target object is located to obtain a motion category includes:
carrying out convolution processing on the video data to be detected by using the first convolution module to obtain a first convolved feature map;
performing spatial enhancement on the first convolved feature map by using the spatial enhancement module to obtain an enhanced feature map;
carrying out space-time feature extraction on the enhanced feature map by utilizing the plurality of groups of replacement units to obtain space-time features;
carrying out convolution processing on the space-time characteristics by using the second convolution module to obtain a second convolved characteristic diagram;
performing ROI alignment and ROI pooling according to the second convolved feature map and an anchor frame corresponding to the region where the target object is located, and obtaining feature information corresponding to the anchor frame;
and performing action recognition on the characteristic information corresponding to the anchor frame by using the full connection layer to obtain an action category.
4. The subway driver driving motion detection method according to claim 3, wherein the plurality of groups of replacement units include a first replacement unit, a second replacement unit, and a third replacement unit which are sequentially connected;
the first replacement unit comprises a first replacement module, a second replacement modules and a replacement attention module which are connected in sequence;
the second replacement unit comprises a first replacement module, b second replacement modules and a replacement attention module which are connected in sequence;
the third replacement unit comprises c second replacement modules and a replacement attention module which are connected in sequence;
the first replacement module comprises a first branch, a second branch, a third branch, a fourth branch, a splicing layer and a channel replacement layer of the first replacement module;
the second replacement module comprises a channel dividing layer, a first branch, a second branch, a third branch, a splicing layer and a channel replacement layer of the second replacement module;
the third branch comprises a 1x1x1 convolution layer, a 5x5x5 depth separable convolution layer and a 1x1x1 convolution layer which are connected in sequence;
the fourth branch comprises a depth separable convolution layer of 5x5x5 and a convolution layer of 1x1x1 connected in sequence.
5. The subway driver driving motion detection method according to claim 4, characterized in that the method further comprises:
the replacement attention module is utilized to carry out channel grouping on the feature images output by the first replacement module or the second replacement module, so as to obtain a plurality of groups of feature images;
processing the plurality of groups of feature images by using a channel attention mechanism and a space attention mechanism in the replacement attention module respectively, and correspondingly obtaining a channel importance coefficient and a space importance coefficient of each group of feature images;
based on the channel importance coefficient and the space importance coefficient, the replacement attention module is utilized to splice and fuse the plurality of groups of feature images, channel replacement is adopted to carry out inter-group communication on the fused feature images, and the feature images output by the replacement attention module are obtained.
6. The method for detecting driving actions of a subway driver according to claim 3, wherein spatially enhancing the first convolved feature map by using the spatial enhancement module to obtain an enhanced feature map comprises:
the space enhancement module is utilized to respectively carry out global average pooling and global maximum pooling on the first convolved feature map along the channel dimension, and splice the average pooled feature map and the maximum pooled feature map to obtain a spliced pooled feature map;
Performing feature extraction on the pooling feature images after splicing by using the 3D convolution layer in the space enhancement module, and activating the feature images output by the 3D convolution layer by using an activation function in the space enhancement module to obtain activated feature images;
multiplying the activated feature map with the first convolved feature map to obtain a multiplied feature map as an enhanced feature map.
7. The subway driver driving motion detection method according to any one of claims 2 to 6, wherein the target detection network is MobileNetV2-SSDLite.
8. The subway driver driving motion detection method according to claim 7, wherein the target detection network comprises a two-dimensional standard convolution layer, a plurality of first bottleneck layers, a two-dimensional standard convolution layer and a plurality of second bottleneck layers which are sequentially connected.
9. The subway driver driving motion detection method according to claim 1, wherein the pre-trained motion detection model is trained based on training video data and corresponding tag data, and the training video data is motion video data of a driver in a cab.
10. A subway driver driving action detecting system, characterized by comprising:
The data acquisition module is used for acquiring video data to be detected;
the region detection and action recognition module is used for carrying out target detection on a target object in the video data to be detected by utilizing a pre-trained action detection model, and obtaining a region where the target object is located; and performing action recognition on the target object based on the region where the target object is located to obtain an action category.
CN202311320382.1A 2023-10-12 2023-10-12 Subway driver driving action detection method and system Pending CN117275093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311320382.1A CN117275093A (en) 2023-10-12 2023-10-12 Subway driver driving action detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311320382.1A CN117275093A (en) 2023-10-12 2023-10-12 Subway driver driving action detection method and system

Publications (1)

Publication Number Publication Date
CN117275093A true CN117275093A (en) 2023-12-22

Family

ID=89202398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311320382.1A Pending CN117275093A (en) 2023-10-12 2023-10-12 Subway driver driving action detection method and system

Country Status (1)

Country Link
CN (1) CN117275093A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination