WO2023138444A1 - Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device - Google Patents

Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Info

Publication number
WO2023138444A1
WO2023138444A1 · PCT/CN2023/071627
Authority
WO
WIPO (PCT)
Prior art keywords
frame
detection
frames
key
video clip
Prior art date
Application number
PCT/CN2023/071627
Other languages
French (fr)
Chinese (zh)
Inventor
孙叶纳
周军
Original Assignee
北京眼神智能科技有限公司
北京眼神科技有限公司
深圳爱酷智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京眼神智能科技有限公司, 北京眼神科技有限公司, 深圳爱酷智能科技有限公司
Publication of WO2023138444A1 publication Critical patent/WO2023138444A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the field of motion detection and recognition, in particular to a method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian motion.
  • Pedestrian action detection and recognition methods in the prior art usually detect and recognize only the key frames in video clips, so detection over the entire video depends entirely on how well the key frames are detected. Relying on key-frame detection alone easily biases the evaluation of the overall detection performance on the video. In addition, because only key frames are detected, the detection frames and recognition results are presented intermittently, resulting in a poor visual experience.
  • the present application provides a method, device, storage medium and equipment for continuous detection and recognition of pedestrian movements, which realize continuous detection and recognition of human body movements frame by frame in videos.
  • the present application provides a method for continuous detection and recognition of pedestrian actions, and the method includes the following steps.
  • The video to be detected is divided into multiple video segments, and each video segment includes multiple frames of images.
  • In each video clip, one frame image is selected as a key frame, and the remaining frame images are used as non-key frames.
  • For each video clip, all frame images of the video clip are input into the pedestrian detection model, and a certain number of detection frames are obtained on each frame image of the video clip.
  • For each video clip, all frame images of the video clip and their detection frames are input into the action recognition model to obtain the action category of each detection frame of the key frame.
  • For each video clip, all detection frames of each non-key frame are matched with all detection frames of the key frame; if the match passes, the action category of the detection frame of the non-key frame is set to the action category of the key-frame detection frame that matches it.
  • the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
  • inputting all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame includes the following steps.
  • the video clips are sampled according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the number of frames of images included in the first frame sequence is less than the number of frames of images included in the second frame sequence.
  • the first frame sequence and the second frame sequence are respectively input into the slow channel and the fast channel to extract features, and a first feature map matrix and a second feature map matrix are respectively obtained.
  • the feature of the slow channel is fused with the feature of the fast channel, and the fusion result is sequentially subjected to a full connection operation and a softmax operation to obtain the probability of each action category.
  • the action category corresponding to the maximum probability is used as the action category of the detection frame of the key frame.
  • the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
  • matching all detection frames of each non-key frame of the video clip with all detection frames of key frames includes the following steps.
  • For each non-key frame, the IOU distances between all of its detection frames and all detection frames of the key frame are calculated to obtain an IOU cost matrix, where IOU denotes the overlap ratio of two detection frames, that is, the ratio of the intersection of the two detection frames to their union.
  • Based on the IOU cost matrix, the Hungarian algorithm is used to match all the detection frames of the non-key frames with all the detection frames of the key frames. For a detection frame of a non-key frame, if a key-frame detection frame matches it and the IOU distance between the two is smaller than a set threshold, the match passes; otherwise, it fails.
  • inputting all frame images of the video clip into the pedestrian detection model, and obtaining a certain number of detection frames on each frame image of the video clip includes the following steps.
  • All the frame images of the video clips are input into the YOLOX detection model, and several candidate detection frames and the confidence scores for identifying the candidate detection frames as people are obtained on each frame image of the video clip.
  • A non-maximum suppression (NMS) operation is performed on the candidate detection frames according to a set NMS threshold, and the candidate detection frames whose confidence scores are lower than a set confidence threshold are filtered out of the NMS result to obtain the certain number of detection frames and their confidence scores.
  • the middle frame image is selected as a key frame, and the remaining frame images are selected as non-key frames.
  • the method further includes: displaying all detection frames of the key frames with their action categories and the matched detection frames of non-key frames with their action categories, and displaying only the detection frames for non-key-frame detection frames that did not pass the matching.
  • using the action category corresponding to the maximum probability as the action category of the detection frame of the key frame includes: filtering out the probabilities lower than a set score threshold, selecting the maximum value among the remaining probabilities, and using the action category corresponding to that maximum as the action category of the detection frame of the key frame.
  • the present application provides a device for continuous detection and recognition of pedestrian actions, said device comprising a video segmentation module, a key frame selection module, a pedestrian detection module, an action recognition module and a continuous recognition module.
  • the video segmentation module is used to divide the video to be detected into multiple video clips, and each video clip includes multiple frames of images.
  • the key frame selection module is used to select one frame of image in each video clip as a key frame, and the rest of the frame images as non-key frames.
  • the pedestrian detection module is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
  • the action recognition module is configured to, for each video clip, input all frame images of the video clip and their detection frames into the motion recognition model to obtain the action category of each detection frame of the key frame.
  • the continuous recognition module is used for each video clip, matching all the detection frames of each non-key frame of the video clip with all the detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
  • the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
  • the action recognition module includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit and a category determination unit.
  • the sampling unit is configured to sample the video segment according to different frame sampling rates to obtain a first frame sequence containing fewer frame images and a second frame sequence containing more frame images.
  • a feature map matrix extraction unit configured to input the first frame sequence and the second frame sequence into the slow channel and fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the feature calculation unit is configured to perform time-series pooling operations on the first feature map matrix and the second feature map matrix respectively, extract features of the region of interest based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • the probability calculation unit is used to fuse the features of the slow channel and the fast channel, perform a full connection operation and a softmax operation on the fusion result in sequence, and obtain the probability of each action category.
  • the category determination unit is configured to use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
  • the continuous identification module includes:
  • the IOU cost matrix calculation unit is used to calculate, for each non-key frame, the IOU distances between all of its detection frames and all detection frames of the key frame, to obtain the IOU cost matrix;
  • a matching unit configured to use the Hungarian algorithm to match all detection frames of the non-key frame with all detection frames of the key frame based on the IOU cost matrix;
  • the pedestrian detection module includes a candidate detection frame acquisition unit, an NMS unit, and a filtering unit.
  • the candidate detection frame acquisition unit is used to input all the frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and the confidence scores for identifying the candidate detection frames as people on each frame image of the video clip.
  • the NMS unit is configured to perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • a filtering unit configured to filter out candidate detection frames whose confidence scores are lower than a set confidence threshold from the result of the non-maximum value suppression operation, to obtain the certain number of detection frames and their confidence scores.
  • an intermediate frame image is selected as a key frame in each video segment, and other frame images are selected as non-key frames.
  • the device further includes a presentation module.
  • the presentation module is configured to display and present all detection frames of the key frames with their action categories, as well as the matched detection frames of non-key frames with their action categories, and to display only the detection frames of non-key frames that do not pass the matching.
  • the present application provides a non-transitory computer-readable storage medium, including a memory storing processor-executable instructions. When the instructions are executed by the processor, the steps of the method for continuous detection and recognition of pedestrian movements included in the first aspect are implemented.
  • the present application provides a computer device, including at least one processor and a memory storing computer-executable instructions.
  • the processor executes the instructions, the steps of the method for continuous detection and recognition of pedestrian movements described in the first aspect are implemented.
  • This application realizes the sharing of key frame recognition results to non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of videos. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • FIG. 1 is a flow chart of a method for continuous detection and recognition of pedestrian movements according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of an embodiment of an action recognition model of the present application
  • FIG. 3 is a schematic diagram of a matching process between a detection frame of a non-key frame and a detection frame of a key frame according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a device for continuous detection and recognition of pedestrian movements according to an embodiment of the present application
  • FIG. 5 is an internal structure diagram of a computer device according to an embodiment of the present application.
  • An embodiment of the present application provides a method for continuous detection and recognition of pedestrian actions. As shown in FIG. 1, the method includes steps S100 to S500.
  • S100 Divide the video to be detected into multiple video segments, each of which includes multiple frames of images.
  • For a given video V to be detected that contains a specified action, the video is first divided into frames and then divided into multiple video segments (clips), each of which includes multiple frames of images.
  • The division can be performed according to a certain clip length s; for example, when s is 16, each divided video segment includes 16 frames of images. To avoid loss of information, adjacent clips may also overlap by several frames, for example 5 frames, as sketched below.
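  • A minimal sketch of this clip-splitting step, assuming the video has already been decoded into a list of frame images; the function name and parameters are illustrative, not taken from the patent:

```python
def split_into_clips(frames, clip_len=16, overlap=5):
    """Split a frame list into clips of `clip_len` frames,
    with `overlap` frames shared between adjacent clips."""
    step = clip_len - overlap  # stride between clip start indices
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, step)]
```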
  • S200 In each video segment, an intermediate frame image is selected as a key frame (key_frame), and the other frame images are used as non-key frames (norm_frame).
  • S300 For each video segment, input all frame images of the video segment into the pedestrian detection model, and obtain a certain number of detection frames on each frame image of the video segment.
  • the pedestrian detection model is used to detect a certain number of detection frames representing people on each frame image, and this application does not limit the specific implementation of the pedestrian detection model.
  • Each frame image is processed by the pedestrian detection model to obtain a matrix of dimension N×5, where N is the number of detection frames, and the five dimensions correspond to the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the detection frame, together with the confidence score of the detection frame being recognized as a person.
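  • An illustrative layout of this per-image output, with made-up values:

```python
import numpy as np

# One row per detection frame: x1, y1, x2, y2, confidence score
detections = np.array([
    [120.0,  80.0, 210.0, 400.0, 0.97],
    [305.0,  95.0, 380.0, 390.0, 0.88],
])  # N = 2 detection frames on this frame image
```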
  • S400 For each video segment, input all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame.
  • the function of the action recognition model is to determine the action category pred of the detection frame of the key frame according to the information of all frame images of the video clip. This application does not limit the specific implementation of the action recognition model.
  • In contrast, the existing technology only detects and identifies the key frames in the video clips; the recognition result of a key frame stands for the entire video clip, and the detection of the whole video to be detected is evaluated from the key-frame recognition results of all video clips.
  • Existing methods thus make the detection of the entire video depend entirely on how well the key frames are detected, which easily biases the evaluation of the overall detection performance on the video.
  • Moreover, because only key frames are detected, the detection frames and recognition results are presented intermittently, resulting in a poor visual experience.
  • After the application obtains the detection and recognition results of the key frame (each detection frame of the key frame and its action category), all detection frames of each non-key frame are matched with all detection frames of the key frame, and for the non-key-frame detection frames that pass the matching, the recognition result pred of the matching key-frame detection frame is shared with them, realizing motion detection and recognition for every frame image.
  • the present application can also output and display the detection frames.
  • For key frames, all detection frames of the key frames and their action categories are displayed and presented.
  • For non-key frames, the detection frames that have been matched and their action categories are displayed and presented, and for detection frames that have not been matched, only the detection frames themselves are displayed and presented.
  • This application realizes the sharing of key frame recognition results with non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video; moreover, the detection frames and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the action recognition model of the present application includes a slow channel and a fast channel executed in parallel. Both the slow channel and the fast channel are convolutional neural networks, and the convolutional neural network of the fast channel has fewer channels than the convolutional neural network of the slow channel.
  • a series of frame images in a video scene usually includes two different parts: a static part that changes little or slowly and a dynamic part that is changing.
  • a video of an airplane taking off may contain a relatively static airport and a dynamic airplane moving rapidly within the static airport scene.
  • the handshake is usually quicker while the rest of the scene is relatively static.
  • Therefore, the present application designs the action recognition model to include slow and fast channels executed in parallel.
  • The slow channel is a slow, high-resolution convolutional neural network with a shorter input frame sequence and a larger number of channels, used to analyze the spatially static content in the video.
  • The fast channel is a fast, low-resolution convolutional neural network with a longer input frame sequence and a smaller number of channels, used to analyze the temporally dynamic content in the video.
  • The fast channel uses a smaller number of channels (i.e., fewer filters) to keep the network lightweight, at the cost of a weaker ability to represent static spatial semantics.
  • This design is similar to the principle of the retinal ganglion in primates, where about 80% of the cells (P-cells) operate at low frequency and can recognize static details, while about 20% of the cells (M-cells) operate at high frequency and are responsible for responding to rapid changes.
  • the aforementioned S400 includes steps S410 to S450.
  • S410 Sampling the video segment according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the number of image frames included in the first frame sequence is less than the number of image frames included in the second frame sequence.
  • For example, the frame sampling rates are set to 2 and 1; that is, a 16-frame video segment is sampled every two frames and every frame, respectively, to obtain a first frame sequence of 8 frames and a second frame sequence of 16 frames.
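  • In code, this dual-rate sampling amounts to simple strided indexing; a sketch under the 16-frame example above:

```python
clip = list(range(16))    # stand-in for the 16 frame images of one clip
first_seq = clip[::2]     # sampling rate 2 -> 8 frames, fed to the slow channel
second_seq = clip[::1]    # sampling rate 1 -> 16 frames, fed to the fast channel
assert len(first_seq) == 8 and len(second_seq) == 16
```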
  • S420 Input the first frame sequence and the second frame sequence into the slow channel and the fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the first frame sequence of 8 frames is input into the slow channel, and the feature map representing the static information of the video clip is extracted for each frame image, and the feature maps of all images in the first frame sequence form the first feature map matrix.
  • the second frame sequence of 16 frames is input into the fast channel, and a feature map representing the dynamic information of the video clip is extracted for each frame image, and the feature maps of all images in the second frame sequence form a second feature map matrix.
  • S430 Perform time-series pooling (pool) operations on the first feature map matrix and the second feature map matrix respectively, extract features of a region of interest (ROI) based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling (pool) operations respectively to obtain features of the slow channel and features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • Pooling is the process of compressing one or more matrices created by the previous convolutional layer into a smaller matrix.
  • In deep learning, pooling generally refers to spatial pooling.
  • the application of pooling on time series is called temporal pooling.
  • ROI Align operations are performed on the two time-series pooling results to complete regional feature aggregation, and then space pooling is performed to obtain the features of the fast channel and the slow channel.
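  • A hedged PyTorch sketch of this step (S430), assuming a batch of one clip, feature maps shaped (1, channels, time, height, width) as produced by a 3D CNN, and key-frame detection boxes given as an (N, 4) tensor of (x1, y1, x2, y2) image coordinates; spatial_scale maps image coordinates onto the feature map, and all names are illustrative:

```python
import torch
from torchvision.ops import roi_align

def channel_features(feat_5d, key_frame_boxes, spatial_scale):
    # Temporal pooling: average over the time axis -> (1, C, H, W)
    pooled_t = feat_5d.mean(dim=2)
    # ROI Align aggregates the region of interest of each key-frame
    # detection frame -> (N, C, 7, 7)
    rois = roi_align(pooled_t, [key_frame_boxes], output_size=7,
                     spatial_scale=spatial_scale, aligned=True)
    # Spatial pooling: average over H and W -> (N, C)
    return rois.mean(dim=(2, 3))
```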
  • S440 Fusing the features of the slow channel and the features of the fast channel, and sequentially performing a full connection operation and a softmax operation on the fusion result, to obtain the probability of each action category of each detection frame of the key frame.
  • Specifically, a concat operation is performed on the channel dimension; features of dimension num_classes are then obtained through a fully connected layer, where num_classes is the number of action categories, and the probability of being recognized as each action category is finally obtained through softmax activation.
  • S450 Use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • Optionally, the probabilities lower than a set score threshold can be filtered out first; the maximum value is then selected from the remaining probabilities, and its corresponding action category is taken as the action category (pred) of the detection frame of the key frame.
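  • A minimal sketch of this classification head (S440-S450), assuming per-box slow/fast features and an externally constructed fully connected layer; the score threshold is a placeholder:

```python
import torch
import torch.nn.functional as F

def classify(slow_feat, fast_feat, fc, score_thresh=0.5):
    fused = torch.cat([slow_feat, fast_feat], dim=1)      # concat on channel dim
    probs = F.softmax(fc(fused), dim=1)                   # (num_boxes, num_classes)
    probs = probs.masked_fill(probs < score_thresh, 0.0)  # drop low-score classes
    # argmax over the remaining probabilities; if every class of a box was
    # filtered out, a real implementation would mark that box as "no action"
    return probs.argmax(dim=1)
```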
  • the action recognition model also includes side connections from the fast channel to the slow channel, which feed data from the fast channel into the slow channel.
  • The lateral connection can be realized by a 3D convolution with a convolution kernel of 5×1² (that is, 5×1×1, with 5 on the temporal axis).
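  • A sketch of such a lateral connection, assuming a SlowFast-style design in which the fast channel's features are time-strided down to the slow channel's frame rate (ratio 2 in the example above) before being merged; the channel counts are illustrative:

```python
import torch.nn as nn

lateral = nn.Conv3d(in_channels=32, out_channels=64,
                    kernel_size=(5, 1, 1),  # the 5x1^2 kernel: 5 on time, 1x1 in space
                    stride=(2, 1, 1),       # temporal stride = fast/slow frame-rate ratio
                    padding=(2, 0, 0))      # preserves the (strided) temporal length
```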
  • the aforementioned S500 includes S510 and S520.
  • S510 For each non-key frame, calculate the IOU (Intersection-over-Union) distances between all of its detection frames and all detection frames of the key frame to obtain an IOU cost matrix.
  • S520 Based on the IOU cost matrix, use the Hungarian algorithm to match all the detection frames of the non-key frames with all the detection frames of the key frames.
  • Denote the set of detection boxes of the non-key frame as Bn, in which each detection box is marked as bbox_i, and denote the set of detection boxes of the key frame as B, in which each detection box is marked as Bbox_i.
  • Each detection box bbox_i in Bn is matched against each detection box Bbox_i in B, and the output obtained is:
  • a list of matched detection frame pairs, in which each element is (index_i, index_j), where index_i is the detection frame index of the non-key frame and index_j is the detection frame index of the key frame;
  • a detection frame list F of the unmatched non-key-frame boxes, in which each element is index_i, indicating the detection frame index of the non-key frame.
  • the Hungarian algorithm is used to complete the matching of the detection frame of the non-key frame and the detection frame of the key frame.
  • This application matches the detection frame of the non-key frame with the detection frame of the key frame based on the IOU cost matrix and the Hungarian algorithm, which improves the matching effect.
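  • A hedged sketch of this matching step (S510-S520) using NumPy and SciPy's Hungarian solver; the box format (x1, y1, x2, y2) follows the detection output above, and the distance threshold is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Overlap ratio of two boxes: intersection area over union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_frames(non_key_boxes, key_boxes, max_dist=0.7):
    # IOU distance = 1 - IOU; rows index non-key-frame boxes, columns key-frame boxes
    cost = np.array([[1.0 - iou(a, b) for b in key_boxes] for a in non_key_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matched = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]
    matched_rows = {i for i, _ in matched}
    unmatched = [i for i in range(len(non_key_boxes)) if i not in matched_rows]
    return matched, unmatched
```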
  • The prior art pedestrian action detection and recognition methods use Faster RCNN to detect pedestrians; in scenes with dense pedestrians and severe occlusion, this suffers from serious missed detections.
  • the present application adopts the following method for pedestrian detection, and the step S300 includes steps S310 to S330.
  • S310 Input all frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and confidence scores for identifying the candidate detection frames as people on each frame image of the video clip.
  • This application uses the YOLOX detection model to realize pedestrian detection.
  • The YOLOX detection model is obtained by retraining on pedestrian data.
  • The video clips are input into the YOLOX detection model to complete the detection of pedestrians in all frame images, outputting the detection frames of the pedestrians and the confidence score for recognition as "person".
  • S320 Perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • Non-Maximum Suppression is a post-processing method applied to object detection, which can remove redundant detection frames.
  • S330 Filter out candidate detection frames whose confidence scores are lower than a set confidence threshold from the result of the non-maximum value suppression operation, and obtain a certain number of detection frames and their confidence scores.
  • the candidate detection frames whose confidence scores are lower than the threshold are filtered out through the confidence threshold, and the remaining detection frames and their confidence scores in the image are finally output.
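  • An illustrative post-processing sketch for S320-S330, using torchvision's NMS; the thresholds are placeholders, and the YOLOX forward pass is assumed to have already produced `boxes` (N, 4) and `scores` (N,):

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, nms_thresh=0.45, conf_thresh=0.3):
    keep = nms(boxes, scores, nms_thresh)  # suppress redundant detection frames
    boxes, scores = boxes[keep], scores[keep]
    mask = scores >= conf_thresh           # drop low-confidence detection frames
    return boxes[mask], scores[mask]
```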
  • Although the steps in the flow charts involved in the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
  • Some embodiments of the present application provide a device for continuous detection and recognition of pedestrian actions. As shown in FIG. 4, the device includes a video segmentation module 1, a key frame selection module 2, a pedestrian detection module 3, an action recognition module 4 and a continuous recognition module 5.
  • the video segmentation module 1 is configured to divide the video to be detected into multiple video segments, each video segment including multiple frames of images.
  • the key frame selection module 2 is used to select one frame of image in each video segment as a key frame, and the rest of the frame images as non-key frames.
  • In each video clip, the middle frame image is selected as the key frame, and the remaining frame images are used as non-key frames.
  • the pedestrian detection module 3 is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
  • the action recognition module 4 is configured to input all frame images of the video segment and their detection frames into the action recognition model for each video segment, and obtain the action category of each detection frame of the key frame.
  • The continuous recognition module 5 is used, for each video clip, to match all detection frames of each non-key frame of the video clip with all detection frames of the key frame; if the match passes, the action category of the detection frame of the non-key frame is set to the action category of the key-frame detection frame that matches it.
  • the apparatus of the present application may further include a presentation module.
  • the presenting module is configured to display and present all detection frames of the key frames with their action categories, as well as the matched detection frames of non-key frames with their action categories, and to display only the detection frames of the non-key frames that fail to pass the matching.
  • This application realizes the sharing of key frame recognition results with non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the action recognition model includes a slow channel and a fast channel executed in parallel.
  • Both the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the convolutional neural network of the slow channel.
  • the action recognition module of the present application includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit and a category determination unit.
  • the sampling unit is configured to sample the video clips according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, the number of frames of images contained in the first frame sequence is less than the number of frames of images contained in the second frame sequence.
  • the feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow channel and the fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the feature calculation unit is configured to perform time-series pooling operations on the first feature map matrix and the second feature map matrix respectively, extract features of the region of interest based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • the probability calculation unit is used to fuse the features of the slow channel and the features of the fast channel, and perform a full connection operation and a softmax operation on the fusion results in order to obtain the probability of each action category.
  • the category determination unit is configured to use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • the action recognition model also includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel to the slow channel.
  • the continuous identification module includes an IOU cost matrix calculation unit and a matching unit.
  • the IOU cost matrix calculation unit is used to calculate the IOU distance between all the detection frames of each non-key frame and all the detection frames of the key frame to obtain the IOU cost matrix.
  • the matching unit is configured to use the Hungarian algorithm to match all the detection frames of the non-key frame with all the detection frames of the key frame based on the IOU cost matrix.
  • the pedestrian detection module of the present application includes a candidate detection frame acquisition unit, an NMS unit and a filtering unit.
  • the candidate detection frame acquisition unit is used to input all the frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and the confidence scores that the candidate detection frames are identified as people on each frame image of the video clip.
  • the NMS unit is used to perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • the filtering unit is used to filter out the candidate detection frames whose confidence scores are lower than the set confidence threshold from the result of the non-maximum value suppression operation, and obtain a certain number of detection frames and their confidence scores.
  • the above-mentioned method embodiments provided in this application can implement business logic through computer programs and record them on storage media.
  • the storage media can be read and executed by computers to achieve the effects of the solutions described in the method embodiments of this specification. Therefore, the present application also provides a non-volatile computer-readable storage medium for continuous detection and recognition of pedestrian actions.
  • the non-volatile computer-readable storage medium includes a memory for storing processor-executable instructions. When the instructions are executed by the processor, the steps of the method for continuous detection and recognition of pedestrian movements in the above-mentioned embodiments are implemented.
  • This application realizes the sharing of recognition results between key frames and non-key frames through detection frame matching, thereby realizing the frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • The storage medium may include a physical device for storing information; information is usually digitized and then stored in an electrical, magnetic, or optical medium. The storage medium may include: devices that store information electrically, such as various memories, e.g., RAM and ROM; devices that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, there are also other readable storage media, such as quantum memories and graphene memories.
  • the storage medium described above may also include other implementations according to the descriptions of the method embodiments.
  • the implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments.
  • the present application also provides a computer device for continuous detection and recognition of pedestrian movements.
  • the computer device may be a single computer, or may be an actual operating device that uses one or more of the methods or one or more of the devices of the embodiments described in this specification.
  • the computer device for continuous detection and recognition of pedestrian movements may include at least one processor and a memory storing computer-executable instructions. When the processor executes the instructions, the steps of the method for continuous detection and recognition of pedestrian movements in any one or more of the above embodiments are implemented.
  • This application realizes the sharing of key frame recognition results to non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of videos; moreover, the detection frames and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 5 .
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, mobile cellular network, NFC (Near Field Communication) or other technologies.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen
  • the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the casing of the computer device, or an external keyboard, touch pad or mouse.
  • FIG. 5 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation to the computer equipment to which the solution of the application is applied.
  • the specific computer equipment may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer equipment described above may also include other implementations according to the description of the method embodiments.
  • the implementation principle and technical effects of this embodiment are the same as those of the foregoing method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian action continuous detection and recognition method and apparatus, a storage medium, and a device. The method comprises: segmenting a video to be detected into a plurality of video clips, each video clip comprising a plurality of frame images; in each video clip, selecting a frame image as a key frame, and taking the remaining frame images as non-key frames; for each video clip, inputting all the frame images of the video clip into a pedestrian detection model, and obtaining a certain number of detection boxes on each frame image of the video clip; for each video clip, inputting all the frame images of the video clip and the detection boxes thereof into an action recognition model, to obtain an action category of each detection box of the key frame; and for each video clip, matching all the detection boxes of each non-key frame with all the detection boxes of the key frame, and setting action categories of the detection boxes of the non-key frames as the action categories of the detection boxes of the matched key frame.

Description

行人动作连续检测识别方法、装置、存储介质及计算机设备Method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian movements
相关申请的交叉引用Cross References to Related Applications
本申请要求于2022年1月22日提交中国专利局,申请号为202210075002.1,申请名称为“行人动作连续检测识别方法、装置、存储介质及设备”的中国专利申请的优先权,在此将其全文引入作为参考。This application claims the priority of the Chinese patent application with the application number 202210075002.1 and the application title "Method, Device, Storage Medium and Equipment for Continuous Detection and Recognition of Pedestrian Movements" submitted to the China Patent Office on January 22, 2022, which is hereby incorporated by reference in its entirety.
技术领域technical field
本申请涉及动作检测识别领域,特别是指一种行人动作连续检测识别方法、装置、存储介质及计算机设备。The present application relates to the field of motion detection and recognition, in particular to a method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian motion.
背景技术Background technique
现有技术的行人动作检测识别方法通常只是对视频片段中的关键帧进行检测和识别,而对整个视频的检测则完全依赖于关键帧的检测效果。如果仅依赖关键帧检测,容易对视频整体检测效果的评估带来偏差。此外,仅对关键帧进行检测,检测框与识别结果以间断的方式呈现,视觉体验感差。Pedestrian action detection and recognition methods in the prior art usually only detect and recognize key frames in video clips, while the detection of the entire video completely depends on the detection effect of the key frames. If only relying on key frame detection, it is easy to bring bias to the evaluation of the overall detection effect of the video. In addition, only key frames are detected, and the detection frame and recognition results are presented intermittently, resulting in a poor visual experience.
发明内容Contents of the invention
为解决现有技术的缺陷,本申请提供一种行人动作连续检测识别方法、装置、存储介质及设备,实现了视频中人体动作的逐帧连续检测识别。In order to solve the defects of the prior art, the present application provides a method, device, storage medium and equipment for continuous detection and recognition of pedestrian movements, which realize continuous detection and recognition of human body movements frame by frame in videos.
本申请提供技术方案如下。The technical solution provided by this application is as follows.
第一方面,本申请提供一种行人动作连续检测识别方法,所述方法包括以下步骤。In a first aspect, the present application provides a method for continuous detection and recognition of pedestrian actions, and the method includes the following steps.
将待检测视频分割成多个视频片段,每个视频片段均包括多帧图像。The video to be detected is divided into multiple video segments, and each video segment includes multiple frames of images.
在每个视频片段选中选取一帧图像作为关键帧,其余帧图像作为非关键帧。Select one frame of image in each video clip as a key frame, and the rest of the frames as non-key frames.
对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框。For each video clip, input all frame images of the video clip into the pedestrian detection model, and obtain a certain number of detection frames on each frame image of the video clip.
对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识 别模型,得到所述关键帧的每个检测框的动作类别。For each video clip, all frame images of the video clip and their detection frames are input into the action recognition model to obtain the action category of each detection frame of the key frame.
对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,若匹配通过,则将非关键帧的检测框的动作类别设置为与之匹配的关键帧的检测框的动作类别。For each video clip, all detection frames of each non-key frame of the video clip are matched with all detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
在一些实施例中,所述动作识别模型包括平行执行的慢速通道和快速通道,所述慢速通道和快速通道均为卷积神经网络,所述快速通道的卷积神经网络的通道数少于慢速通道的卷积神经网络的通道数。In some embodiments, the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
所述对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识别模型,得到所述关键帧的每个检测框的动作类别,包括以下步骤。For each video segment, inputting all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame includes the following steps.
按照不同的帧采样率对所述视频片段进行采样,得到第一帧序列和第二帧序列,其中,第一帧序列包含的图像的帧数少于第二帧序列包含的图像的帧数。The video clips are sampled according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the number of frames of images included in the first frame sequence is less than the number of frames of images included in the second frame sequence.
将所述第一帧序列和第二帧序列分别输入所述慢速通道和快速通道提取特征,分别得到第一特征图矩阵和第二特征图矩阵。The first frame sequence and the second frame sequence are respectively input into the slow channel and the fast channel to extract features, and a first feature map matrix and a second feature map matrix are respectively obtained.
分别对所述第一特征图矩阵和第二特征图矩阵进行时序池化操作,在得到的两个时序池化结果上分别基于所述关键帧的检测框提取感兴趣区域的特征,并分别进行空间池化操作,得到所述慢速通道的特征和所述快速通道的特征,所述慢速通道的特征表征所述视频片段的静态信息,所述快速通道的特征表征所述视频片段的动态信息。Carrying out a time series pooling operation on the first feature map matrix and the second feature map matrix respectively, extracting features of the region of interest based on the detection frame of the key frame on the two obtained time series pooling results, and performing space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel, the features of the slow channel represent the static information of the video segment, and the features of the fast channel represent the dynamic information of the video segment.
将所述慢速通道的特征和所述快速通道的特征进行融合,对融合结果依次进行全连接操作和softmax操作,得到各个动作类别的概率。The feature of the slow channel is fused with the feature of the fast channel, and the fusion result is sequentially subjected to a full connection operation and a softmax operation to obtain the probability of each action category.
将概率最大值对应的动作类别作为关键帧的检测框的动作类别。The action category corresponding to the maximum probability is used as the action category of the detection frame of the key frame.
在一些实施例中,所述动作识别模型还包括从所述快速通道到慢速通道的侧向连接,所述侧向连接将所述快速通道的数据送入所述慢速通道。In some embodiments, the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
在一些实施例中,所述对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,包括以下步骤。In some embodiments, for each video clip, matching all detection frames of each non-key frame of the video clip with all detection frames of key frames includes the following steps.
对每个非关键帧,计算其所有检测框与关键帧的所有检测框的IOU距离,得到IOU代价矩阵,其中IOU表示两个检测框的交叠率,为两个检测框的交集与两个检测框的并集的比值。For each non-key frame, calculate the IOU distance between all the detection frames and all the detection frames of the key frame, and obtain the IOU cost matrix, where IOU represents the overlap rate of the two detection frames, which is the ratio of the intersection of the two detection frames and the union of the two detection frames.
基于所述IOU代价矩阵,利用匈牙利算法对所述非关键帧的所有检测框与关键帧的所有检测框进行匹配。Based on the IOU cost matrix, the Hungarian algorithm is used to match all the detection frames of the non-key frames with all the detection frames of the key frames.
其中,对所述非关键帧的一个检测框,若存在一个关键帧的检测框与其匹配,且两者的IOU距离小于设定的阈值,则匹配通过,否则未匹配通过。Wherein, for a detection frame of the non-key frame, if there is a detection frame of a key frame matching it, and the IOU distance between the two is smaller than the set threshold, the matching is passed; otherwise, the matching is not passed.
在一些实施例中,所述对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框,包括以下步骤。In some embodiments, for each video clip, inputting all frame images of the video clip into the pedestrian detection model, and obtaining a certain number of detection frames on each frame image of the video clip includes the following steps.
将所述视频片段的所有帧图像输入YOLOX检测模型,在所述视频片段的每张帧图像上得到若干候选检测框以及所述候选检测框识别为人的置信度分数。All the frame images of the video clips are input into the YOLOX detection model, and several candidate detection frames and the confidence scores for identifying the candidate detection frames as people are obtained on each frame image of the video clip.
根据设定的非极大值抑制(Non-Maximum Suppression,NMS)阈值对所述候选检测框进行非极大值抑制操作。Perform a non-maximum suppression operation on the candidate detection frame according to the set non-maximum suppression (Non-Maximum Suppression, NMS) threshold.
从非极大值抑制操作的结果中过滤掉置信度分数低于设定的置信度阈值的候选检测框,得到所述一定数量的检测框及其置信度分数。Filter out the candidate detection frames whose confidence scores are lower than the set confidence threshold from the results of the non-maximum value suppression operation, and obtain the certain number of detection frames and their confidence scores.
在一些实施例中,在每个视频片段选中选取中间帧图像作为关键帧,其余帧图像作为非关键帧。In some embodiments, in each video segment, the middle frame image is selected as a key frame, and the remaining frame images are selected as non-key frames.
在一些实施例中,所述方法还包括:将所述关键帧的所有检测框及其动作类别以及匹配通过的非关键帧的检测框及其动作类别显示呈现,将未匹配通过的非关键帧的检测框显示呈现。In some embodiments, the method further includes: displaying and presenting all the detection frames and their action categories of the key frames and the matching detection frames and their action categories of non-key frames, and displaying and presenting the detection frames of non-key frames that did not pass.
在一些实施例中,相邻两个视频片段间有重叠的图像。In some embodiments, there are overlapping images between two adjacent video segments.
在一些实施例中,每个视频片段均包括16帧图像;所述按照不同的帧采样率对所述视频片段进行采样,得到第一帧序列和第二帧序列,具体为:按照帧采样率2和1对所述视频片段分别进行采样,得到包含8帧图像的第一帧序列和包含16帧图像的第二帧序列。In some embodiments, each video clip includes 16 frames of images; said sampling the video clips according to different frame sampling rates to obtain a first frame sequence and a second frame sequence is specifically: sampling the video clips according to frame sampling rates of 2 and 1, respectively, to obtain a first frame sequence containing 8 frame images and a second frame sequence containing 16 frame images.
在一些实施例中,所述将概率最大值对应的动作类别作为关键帧的检测框的动作类别,包括:通过设定的分数阈值筛选掉低于分数阈值的概率,再从剩余的概率中选择最大值,将概率最大值对应的动作类别做为关键帧的检测框的动作类别。In some embodiments, the action category corresponding to the maximum value of the probability is used as the action category of the detection frame of the key frame, including: filtering out the probability lower than the score threshold through the set score threshold, and then selecting the maximum value from the remaining probabilities, and using the action category corresponding to the maximum probability value as the action category of the detection frame of the key frame.
第二方面,本申请提供一种行人动作连续检测识别装置,所述装置包括视频 分割模块,关键帧选取模块,行人检测模块,动作识别模块和连续识别模块。In a second aspect, the present application provides a device for continuous detection and recognition of pedestrian actions, said device comprising a video segmentation module, a key frame selection module, a pedestrian detection module, an action recognition module and a continuous recognition module.
视频分割模块,用于将待检测视频分割成多个视频片段,每个视频片段均包括多帧图像。The video segmentation module is used to divide the video to be detected into multiple video clips, and each video clip includes multiple frames of images.
关键帧选取模块,用于在每个视频片段选中选取一帧图像作为关键帧,其余帧图像作为非关键帧。The key frame selection module is used to select one frame of image in each video clip as a key frame, and the rest of the frame images as non-key frames.
行人检测模块,用于对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框。The pedestrian detection module is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
动作识别模块,用于对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识别模型,得到所述关键帧的每个检测框的动作类别。The action recognition module is configured to, for each video clip, input all frame images of the video clip and their detection frames into the motion recognition model to obtain the action category of each detection frame of the key frame.
连续识别模块,用于对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,若匹配通过,则将非关键帧的检测框的动作类别设置为与之匹配的关键帧的检测框的动作类别。The continuous recognition module is used for each video clip, matching all the detection frames of each non-key frame of the video clip with all the detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
In some embodiments, the action recognition model includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

In some embodiments, the action recognition module includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit, and a category determination unit.

The sampling unit is configured to sample the video clip at different frame sampling rates to obtain a first frame sequence containing fewer frame images and a second frame sequence containing more frame images.

The feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

The feature calculation unit is configured to perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

The probability calculation unit is configured to fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category.

The category determination unit is configured to take the action category with the maximum probability as the action category of the detection box of the key frame.

In some embodiments, the action recognition model further includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.
In some embodiments, the continuous recognition module includes:

an IOU cost matrix calculation unit, configured to compute, for each non-key frame, the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix;

a matching unit, configured to match, based on the IOU cost matrix, all detection boxes of the non-key frame with all detection boxes of the key frame using the Hungarian algorithm;

where, for a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is below the set threshold, the match succeeds; otherwise, it fails.
In some embodiments, the pedestrian detection module includes a candidate detection box acquisition unit, an NMS unit, and a filtering unit.

The candidate detection box acquisition unit is configured to input all frame images of the video clip into a YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The NMS unit is configured to perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

The filtering unit is configured to filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining the certain number of detection boxes and their confidence scores.
In some embodiments, the key frame selection module selects the middle frame image of each video clip as the key frame, with the remaining frame images as non-key frames.

Further, the apparatus also includes a presentation module.

The presentation module is configured to display all detection boxes of the key frame with their action categories, display the detection boxes of matched non-key frames with their action categories, and display the detection boxes of unmatched non-key frames.
In a third aspect, the present application provides a non-volatile computer-readable storage medium, including a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the first aspect.

In a fourth aspect, the present application provides a computer device, including at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the first aspect.
The present application has the following beneficial effects:

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
Description of Drawings
FIG. 1 is a flowchart of a pedestrian action continuous detection and recognition method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an embodiment of the action recognition model of the present application;

FIG. 3 is a schematic diagram of the process of matching the detection boxes of a non-key frame with the detection boxes of the key frame according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a pedestrian action continuous detection and recognition apparatus according to an embodiment of the present application;

FIG. 5 is a diagram of the internal structure of a computer device according to an embodiment of the present application.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings and specific embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application but merely represents selected embodiments of it. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
An embodiment of the present application provides a pedestrian action continuous detection and recognition method. As shown in FIG. 1, the method includes steps S100 to S500.

S100: Divide the video to be detected into multiple video clips, each video clip including multiple frames of images.
In this step, a given video V to be detected, which contains the specified actions, is first split into frames and then divided into multiple video clips, each including multiple frames of images. The division can be performed with a fixed length s; for example, when s is 16, each resulting video clip includes 16 frames of images. To avoid losing information, adjacent clips may also overlap by several frames, e.g., 5 frames.
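As an illustrative sketch only, the clip segmentation described above might be implemented as follows; the clip length of 16 and overlap of 5 follow the example values in this step rather than being fixed by the application, and the function name is hypothetical:

```python
def split_into_clips(frames, clip_len=16, overlap=5):
    """Split a frame list into clips of clip_len frames, with `overlap`
    frames shared between adjacent clips (trailing frames that do not
    fill a whole clip are dropped in this sketch)."""
    stride = clip_len - overlap  # e.g. 16 - 5 = 11 new frames per clip
    return [frames[start:start + clip_len]
            for start in range(0, len(frames) - clip_len + 1, stride)]
```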
S200: In each video clip, select one frame image as the key frame and the remaining frame images as non-key frames.

For example, the middle frame image is selected as the key frame (key_frame), and the remaining frame images serve as non-key frames (norm_frame).
S300: For each video clip, input all frame images of the video clip into the pedestrian detection model and obtain a certain number of detection boxes on each frame image of the video clip.

The pedestrian detection model is used to detect, on each frame image, a certain number of detection boxes representing people; the present application does not restrict the specific implementation of the pedestrian detection model.

For example, each frame image processed by the pedestrian detection model yields a matrix of dimension N×5, where N is the number of detection boxes and the five dimensions correspond to the top-left corner coordinates (x1, y1) and bottom-right corner coordinates (x2, y2) of a detection box, together with the confidence score that the detection box contains a person. The detection box lists of all non-key frames of the video clip are denoted Bn, and the detection box list of the middle frame, i.e., the key frame, is denoted B.
S400: For each video clip, input all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame.

The role of the action recognition model is to determine, from the information in all frame images of the video clip, the action category pred of each detection box of its key frame; the present application does not restrict the specific implementation of the action recognition model.

S500: For each video clip, match all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame; if the match succeeds, set the action category of the non-key frame's detection box to the action category of the matching key-frame detection box.
The prior art can only detect and recognize the key frame of a video clip, letting the key frame's recognition result stand for the entire clip and evaluating the detection of the whole video to be detected from the recognition results of the key frames of all clips. Such methods make the detection of the entire video depend entirely on how well the key frames are detected, which readily biases the evaluation of the overall detection performance. In addition, because only key frames are detected, the detection boxes and recognition results are presented intermittently, giving a poor visual experience.

After obtaining the detection and recognition results of the key frame (each detection box of the key frame and its action category), the present application matches all detection boxes of each non-key frame against all detection boxes of the key frame. A non-key frame detection box that passes the match shares the recognition result pred of the matching key-frame detection box, achieving action detection and recognition for every frame image.

Furthermore, the present application can also output and display the detection boxes. For a key frame, all of its detection boxes and their action categories are displayed; for a non-key frame, the matched detection boxes and their action categories are displayed, while unmatched detection boxes are displayed without a category.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some embodiments of the present application, as shown in FIG. 2, the action recognition model of the present application includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

The inventors found through research that the frame images of a video scene usually contain two distinct parts: a static part that changes little or slowly, and a dynamic part that is changing. For example, a video of an airplane taking off contains a relatively static airport and a dynamic airplane moving rapidly through that static scene. Likewise, in daily life, when two people meet, the handshake is usually quick while the rest of the scene remains relatively static.

Based on this finding, the present application designs the action recognition model with a slow pathway and a fast pathway executed in parallel. The slow pathway is a slow, high-resolution convolutional neural network with a shorter input frame sequence and more channels, used to analyze the static spatial content of the video. The fast pathway is a fast, low-resolution convolutional neural network with a longer input frame sequence and fewer channels, used to analyze the temporal dynamic content of the video. The fast pathway uses fewer channels (i.e., fewer filters) to keep the network lightweight, so its ability to represent static spatial semantics is weaker.
This is analogous to the retinal ganglion of primates, in which about 80% of the cells (P-cells) operate at low frequency and recognize static detail, while about 20% (M-cells) operate at high frequency and respond to rapid change.
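A minimal PyTorch skeleton of this two-pathway design might look as follows; the layer shapes, channel counts, and class name are illustrative assumptions rather than the application's fixed architecture:

```python
import torch.nn as nn

class TwoPathwayBackbone(nn.Module):
    """Two parallel 3D-convolutional stems with asymmetric channel counts."""
    def __init__(self, slow_channels=64, fast_channels=8):
        super().__init__()
        # slow pathway: fewer input frames, more channels (static content)
        self.slow = nn.Conv3d(3, slow_channels, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        # fast pathway: more input frames, fewer channels (dynamic content)
        self.fast = nn.Conv3d(3, fast_channels, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))

    def forward(self, slow_frames, fast_frames):
        # slow_frames: (B, 3, T_slow, H, W); fast_frames: (B, 3, T_fast, H, W)
        return self.slow(slow_frames), self.fast(fast_frames)
```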
In some embodiments, based on this action recognition model, the aforementioned S400 includes steps S410 to S450.

S410: Sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the first frame sequence contains fewer frame images than the second frame sequence.
For example, with frame sampling rates of 2 and 1, a 16-frame video clip is sampled every two frames and every frame, respectively, yielding an 8-frame first frame sequence and a 16-frame second frame sequence.
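A brief sketch of this sampling step, assuming the example rates of 2 and 1 above (the function name is hypothetical):

```python
def sample_two_sequences(clip, slow_rate=2, fast_rate=1):
    """Return (first_seq, second_seq): the first sequence keeps every
    slow_rate-th frame (fewer frames, for the slow pathway); the second
    keeps every fast_rate-th frame (more frames, for the fast pathway)."""
    return clip[::slow_rate], clip[::fast_rate]  # 16 frames -> 8 and 16
```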
S420: Input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

For example, the 8-frame first frame sequence is input into the slow pathway, which extracts from each frame image a feature map characterizing the static information of the video clip; the feature maps of all images in the first frame sequence form the first feature map matrix.

Meanwhile, the 16-frame second frame sequence is input into the fast pathway, which extracts from each frame image a feature map characterizing the dynamic information of the video clip; the feature maps of all images in the second frame sequence form the second feature map matrix.
S430: Perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest (ROI) from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

Pooling is the process of compressing one or more matrices created by a preceding convolutional layer into a smaller matrix. In deep learning, pooling generally refers to spatial pooling; applying pooling along the time axis is called temporal pooling.

Taking a 16-frame video clip as an example, after each frame image passes through the convolutional neural network, 16 frame-level feature maps are obtained. Since action category recognition is usually performed at the video level rather than the frame level, a temporal aggregation method (i.e., temporal pooling) is needed to convert the frame-level features into video-level features.
After temporal pooling, an ROI Align operation is performed on each of the two temporal pooling results to aggregate the regional features, followed by spatial pooling, yielding the fast pathway features and the slow pathway features.
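One way this step could be realized, sketched in PyTorch with torchvision's roi_align; the tensor layout, mean-based pooling, and output size are assumptions about one possible implementation rather than the application's prescribed design:

```python
import torch
from torchvision.ops import roi_align

def pathway_features(feat_maps, boxes, out_size=7):
    """feat_maps: (C, T, H, W) feature map matrix of one pathway.
    boxes: (N, 4) key-frame detection boxes in feature-map coordinates.
    Returns (N, C): one pooled feature vector per detection box."""
    pooled_t = feat_maps.mean(dim=1)                 # temporal pooling -> (C, H, W)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # batch index 0
    roi_feats = roi_align(pooled_t.unsqueeze(0), rois,
                          output_size=out_size)      # (N, C, out_size, out_size)
    return roi_feats.mean(dim=(2, 3))                # spatial pooling -> (N, C)
```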
S440: Fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result, obtaining the probability of each action category for each detection box of the key frame.

During fusion, the features are concatenated along the channel dimension (a concat operation); a fully connected layer then produces a num_classes-dimensional feature, where num_classes is the number of action categories, and a softmax activation yields the probability of each action category.
S450: Take the action category with the maximum probability as the action category of the detection box of the key frame.

In this step, the probabilities below a set score threshold can be filtered out first; the maximum is then selected from the remaining probabilities, and its corresponding action category is the action category (pred) of the detection box of the key frame.
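A minimal NumPy sketch of S440 and S450, assuming placeholder fully connected weights and a deployer-chosen score threshold (all names below are hypothetical):

```python
import numpy as np

def classify_boxes(slow_feat, fast_feat, fc_w, fc_b, score_threshold=0.5):
    """slow_feat, fast_feat: (N, C_slow) and (N, C_fast) per-box features.
    fc_w: (C_slow + C_fast, num_classes); fc_b: (num_classes,).
    Returns one predicted class index per box, or -1 when every class
    probability falls below the score threshold."""
    fused = np.concatenate([slow_feat, fast_feat], axis=1)  # channel-wise concat
    logits = fused @ fc_w + fc_b                            # fully connected layer
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)            # softmax
    probs = np.where(probs < score_threshold, 0.0, probs)   # drop low scores
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) == 0.0] = -1                    # no class survived
    return preds
```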
The action recognition model also includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.

Because the information of the fast pathway and the slow pathway is fused, each path needs to know the representation learned by the other, so the fast pathway's data is fed into the slow pathway through a lateral connection. Illustratively, the lateral connection can be implemented as a 3D convolution with a 5×1² kernel (i.e., a kernel of size 5×1×1).
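For illustration, such a lateral connection might be a single 3D convolution like the one below; the channel counts and the temporal stride of 2 (matching the example sampling-rate ratio above) are assumptions:

```python
import torch.nn as nn

# 5x1x1 kernel: mixes 5 neighboring time steps with no spatial mixing;
# stride (2, 1, 1) downsamples the fast pathway's longer frame sequence
# to the slow pathway's temporal length before the two are fused.
lateral = nn.Conv3d(in_channels=8, out_channels=16,
                    kernel_size=(5, 1, 1), stride=(2, 1, 1),
                    padding=(2, 0, 0))
# fused_slow = torch.cat([slow_feats, lateral(fast_feats)], dim=1)
```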
In some embodiments of the present application, as shown in FIG. 3, the aforementioned S500 includes S510 and S520.

S510: For each non-key frame, compute the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix.
IOU (Intersection-over-Union) denotes the overlap rate of two detection boxes, i.e., the ratio of the intersection of the two boxes to their union.
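A sketch of the IOU cost matrix computation; treating the IOU distance as 1 − IOU is an assumption consistent with its use as a cost, and the function names are hypothetical:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def iou_cost_matrix(boxes_n, boxes_k):
    """boxes_n: N non-key-frame boxes; boxes_k: M key-frame boxes.
    Returns the N x M matrix of IOU distances (1 - IOU)."""
    return np.array([[1.0 - iou(a, b) for b in boxes_k] for a in boxes_n])
```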
S520: Based on the IOU cost matrix, use the Hungarian algorithm to match all detection boxes of the non-key frame with all detection boxes of the key frame.

For a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is smaller than the set threshold, the match succeeds; otherwise, it fails.
For example, denote the detection box list of a non-key frame as Bn and the detection box list of the key frame as B. Each detection box in Bn is written bbox_i, and each detection box in B is written Bbox_j. Matching each detection box bbox_i in Bn with each detection box Bbox_j in B produces the following outputs:

A. A list M of matched non-key-frame and key-frame detection boxes, each element being (index_i, index_j), where index_i is the detection box index in the non-key frame and index_j is the detection box index in the key frame.

B. A list F of unmatched non-key-frame detection boxes, each element being index_i, the detection box index in the non-key frame.
The specific matching steps are:

1) Initialization: let M ← ∅, F ← ∅.
2) Compute the IOU distances between the detection boxes in Bn (N of them) and the detection boxes in B (M of them), obtaining an N×M IOU cost matrix C.

3) Based on the IOU cost matrix C, use the Hungarian algorithm to complete the matching between the non-key frame's detection boxes and the key frame's detection boxes.

4) Set a threshold max_distance and go through the detection boxes bbox_i in Bn one by one: if a detection box Bbox_j in B matches bbox_i and the IOU distance between bbox_i and Bbox_j is less than max_distance, share pred_j with bbox_i, add (bbox_i, Bbox_j) to list M, and add "None" to list F.
If a detection box Bbox_j in B matches bbox_i but the IOU distance between bbox_i and Bbox_j is greater than max_distance, or no detection box in B matches bbox_i, add bbox_i to list F and add (None, None) to list M.
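As a sketch of steps 1) to 4), assuming SciPy's Hungarian solver and an iou_cost_matrix helper like the one above; max_distance and the function name are illustrative assumptions:

```python
from scipy.optimize import linear_sum_assignment

def match_boxes(cost, preds_key, max_distance=0.7):
    """cost: (N, M) IOU cost matrix between non-key-frame and key-frame
    boxes. preds_key: length-M action categories of the key-frame boxes.
    Returns (matched, unmatched): (i, j, shared category) triples for
    boxes that pass the match, and indices of boxes that do not."""
    matched, unmatched = [], []                  # 1) initialization
    rows, cols = linear_sum_assignment(cost)     # 3) Hungarian matching
    assignment = dict(zip(rows.tolist(), cols.tolist()))
    for i in range(cost.shape[0]):               # 4) per-box threshold check
        j = assignment.get(i)
        if j is not None and cost[i, j] < max_distance:
            matched.append((i, j, preds_key[j]))  # share pred_j with bbox_i
        else:
            unmatched.append(i)                   # shown without a category
    return matched, unmatched
```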
Based on the IOU cost matrix and the Hungarian algorithm, the present application matches the detection boxes of the non-key frames with the detection boxes of the key frame, improving the matching quality.

Prior art pedestrian action detection and recognition methods use Faster RCNN for pedestrian detection and suffer from serious missed detections in scenes with dense pedestrians and heavy occlusion.
To improve the pedestrian detection model's performance in scenes with dense pedestrians and heavy occlusion, in some embodiments the present application performs pedestrian detection as follows, with step S300 including steps S310 to S330.

S310: Input all frame images of the video clip into a YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The present application uses a YOLOX detection model for pedestrian detection. To improve its performance in complex, crowded scenes, the YOLOX detection model is retrained on pedestrian data. During detection, the video clip is input into the YOLOX detection model, which detects the pedestrians in all frame images and outputs the pedestrian detection boxes together with the confidence score of being recognized as a person.
S320: Perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

Non-maximum suppression (NMS) is a post-processing method applied in object detection that removes redundant detection boxes.

S330: Filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining a certain number of detection boxes and their confidence scores.
This step filters out the candidate detection boxes whose confidence scores are below the confidence threshold and finally outputs the detection boxes retained in each image together with their confidence scores.
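One possible sketch of the S320 to S330 post-processing, reusing the iou helper above; the NMS and confidence thresholds are example values a deployer would tune, not values fixed by the application:

```python
import numpy as np

def postprocess(boxes, scores, nms_threshold=0.45, conf_threshold=0.3):
    """boxes: (K, 4) candidate boxes as (x1, y1, x2, y2); scores: (K,)
    person confidences. Returns the surviving boxes and their scores."""
    order = scores.argsort()[::-1]               # highest confidence first
    keep = []
    while order.size > 0:                        # greedy non-maximum suppression
        i = order[0]
        keep.append(i)
        rest = order[1:]
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious <= nms_threshold]      # drop boxes overlapping box i
    keep = np.array(keep, dtype=int)
    mask = scores[keep] >= conf_threshold        # confidence filtering
    return boxes[keep][mask], scores[keep][mask]
```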
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times, and whose execution order need not be sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Some embodiments of the present application provide a pedestrian action continuous detection and recognition apparatus. As shown in FIG. 4, the apparatus includes a video segmentation module 1, a key frame selection module 2, a pedestrian detection module 3, an action recognition module 4, and a continuous recognition module 5.

The video segmentation module 1 is configured to divide the video to be detected into multiple video clips, each video clip including multiple frames of images.

The key frame selection module 2 is configured to select one frame image in each video clip as a key frame, with the remaining frame images as non-key frames.

For example, the middle frame image of each video clip is selected as the key frame and the remaining frame images as non-key frames.

The pedestrian detection module 3 is configured to, for each video clip, input all frame images of the video clip into the pedestrian detection model and obtain a certain number of detection boxes on each frame image of the video clip.

The action recognition module 4 is configured to, for each video clip, input all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame.

The continuous recognition module 5 is configured to, for each video clip, match all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame; if the match succeeds, the action category of the non-key frame's detection box is set to the action category of the matching key-frame detection box.
The apparatus of the present application may further include a presentation module.

The presentation module is configured to display all detection boxes of the key frame with their action categories, display the detection boxes of matched non-key frames with their action categories, and display the detection boxes of unmatched non-key frames.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some embodiments of the present application, the action recognition model includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

In some embodiments, based on the above action recognition model, the action recognition module of the present application includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit, and a category determination unit.

The sampling unit is configured to sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the first frame sequence contains fewer frame images than the second frame sequence.

The feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

The feature calculation unit is configured to perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

The probability calculation unit is configured to fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category.

The category determination unit is configured to take the action category with the maximum probability as the action category of the detection box of the key frame.
The action recognition model also includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.

In some embodiments of the present application, the continuous recognition module includes an IOU cost matrix calculation unit and a matching unit.

The IOU cost matrix calculation unit is configured to compute, for each non-key frame, the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix.

The matching unit is configured to match, based on the IOU cost matrix, all detection boxes of the non-key frame with all detection boxes of the key frame using the Hungarian algorithm.

For a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is below the set threshold, the match succeeds; otherwise, it fails.
To improve the pedestrian detection model's performance in scenes with dense pedestrians and heavy occlusion, in some embodiments the pedestrian detection module of the present application includes a candidate detection box acquisition unit, an NMS unit, and a filtering unit.

The candidate detection box acquisition unit is configured to input all frame images of the video clip into the YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The NMS unit is configured to perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

The filtering unit is configured to filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining a certain number of detection boxes and their confidence scores.
The apparatus provided by the embodiments of the present application has the same implementation principles and technical effects as the foregoing method embodiments. For brevity, where this apparatus embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiments. Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may be found in the corresponding processes of the above method embodiments and are not repeated here.
The method embodiments provided in the present application may implement their business logic in a computer program recorded on a storage medium, which can be read and executed by a computer to achieve the effects of the solutions described in the method embodiments of this specification. Accordingly, the present application also provides a non-volatile computer-readable storage medium for pedestrian action continuous detection and recognition, including a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the above embodiments.

Through detection box matching, the present application shares recognition results between key frames and non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
The storage medium may include a physical device for storing information; the information is usually digitized and then stored in a medium using electrical, magnetic, optical, or similar means. The storage medium may include: devices that store information electrically, e.g., various memories such as RAM and ROM; devices that store information magnetically, e.g., hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, and USB flash drives; and devices that store information optically, e.g., CDs or DVDs. Of course, there are also other forms of readable storage media, such as quantum memories and graphene memories.

The storage medium described above may also admit other implementations consistent with the description of the method embodiments. The implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments; for details, refer to the description of the related method embodiments, which is not repeated here.
The present application also provides a computer device for pedestrian action continuous detection and recognition. The computer device may be a standalone computer, or may include an actual operating apparatus using one or more of the methods of this specification or the apparatus of one or more embodiments. The computer device for pedestrian action continuous detection and recognition may include at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of any one or more of the above embodiments.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some implementations of the present application, the computer device may be a terminal whose internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a communication interface, a display screen, and an input apparatus connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the computer device is used for wired or wireless communication with external terminals; the wireless mode may be implemented through WIFI, a mobile cellular network, NFC (near-field communication), or other technologies. When the computer program is executed by the processor, the pedestrian action continuous detection and recognition method is implemented. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input apparatus may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

The computer device described above may also admit other implementations consistent with the description of the method embodiments. The implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments; for details, refer to the description of the related method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are merely specific implementations of the present application, used to illustrate rather than limit its technical solutions, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may still, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not depart in essence from the spirit and scope of the technical solutions of the embodiments of the present application and shall all be covered within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (15)

  1. A pedestrian action continuous detection and recognition method, characterized in that the method comprises:
    dividing a video to be detected into multiple video clips, each video clip including multiple frames of images;
    selecting one frame image in each video clip as a key frame, with the remaining frame images as non-key frames;
    for each video clip, inputting all frame images of the video clip into a pedestrian detection model and obtaining a certain number of detection boxes on each frame image of the video clip;
    for each video clip, inputting all frame images of the video clip and their detection boxes into an action recognition model to obtain the action category of each detection box of the key frame;
    for each video clip, matching all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame, and if the match succeeds, setting the action category of the non-key frame's detection box to the action category of the matching key-frame detection box.
  2. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that the action recognition model comprises a slow pathway and a fast pathway executed in parallel, both pathways being convolutional neural networks, the fast pathway's convolutional neural network having fewer channels than the slow pathway's convolutional neural network;
    said inputting, for each video clip, all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame comprises:
    sampling the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the first frame sequence contains fewer frame images than the second frame sequence;
    inputting the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix;
    performing a temporal pooling operation on each of the first and second feature map matrices, extracting features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and performing a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features, the slow pathway features representing the static information of the video clip and the fast pathway features representing the dynamic information of the video clip;
    fusing the slow pathway features with the fast pathway features and applying a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category;
    taking the action category with the maximum probability as the action category of the detection box of the key frame.
  3. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that the action recognition model further comprises a lateral connection from the fast pathway to the slow pathway, the lateral connection feeding the fast pathway's data into the slow pathway.
  4. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that said matching, for each video clip, all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame comprises:
    for each non-key frame, computing the IOU distances between all of its detection boxes and all detection boxes of the key frame to obtain an IOU cost matrix, where IOU denotes the overlap rate of two detection boxes, i.e., the ratio of the intersection of the two boxes to their union;
    based on the IOU cost matrix, using the Hungarian algorithm to match all detection boxes of the non-key frame with all detection boxes of the key frame;
    wherein, for a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is smaller than a set threshold, the match succeeds; otherwise, it fails.
  5. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that said inputting, for each video clip, all frame images of the video clip into the pedestrian detection model and obtaining a certain number of detection boxes on each frame image of the video clip comprises:
    inputting all frame images of the video clip into a YOLOX detection model and obtaining, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate detection box contains a person;
    performing a non-maximum suppression (NMS) operation on the candidate detection boxes according to a set NMS threshold;
    filtering out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, to obtain the certain number of detection boxes and their confidence scores.
  6. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that the middle frame image of each video clip is selected as the key frame, with the remaining frame images as non-key frames.
  7. The pedestrian action continuous detection and recognition method according to any one of claims 1-6, characterized in that the method further comprises:
    displaying all detection boxes of the key frame with their action categories and the detection boxes of matched non-key frames with their action categories, and displaying the detection boxes of unmatched non-key frames.
  8. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that two adjacent video clips have overlapping images.
  9. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that:
    each video clip includes 16 frames of images;
    said sampling the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence is specifically: sampling the video clip at frame sampling rates of 2 and 1, respectively, to obtain a first frame sequence containing 8 frames of images and a second frame sequence containing 16 frames of images.
  10. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that said taking the action category with the maximum probability as the action category of the detection box of the key frame comprises: filtering out the probabilities below a set score threshold, then selecting the maximum among the remaining probabilities, and taking the action category corresponding to that maximum as the action category of the detection box of the key frame.
  11. A pedestrian action continuous detection and recognition apparatus, characterized in that the apparatus comprises:
    a video segmentation module, configured to divide the video to be detected into a plurality of video clips, each video clip including multiple frame images;
    a key frame selection module, configured to select one frame image in each video clip as the key frame, with the remaining frame images as non-key frames;
    a pedestrian detection module, configured to, for each video clip, input all frame images of the video clip into a pedestrian detection model and obtain a certain number of detection frames on each frame image of the video clip;
    an action recognition module, configured to, for each video clip, input all frame images of the video clip and their detection frames into an action recognition model to obtain the action category of each detection frame of the key frame;
    a continuous recognition module, configured to, for each video clip, match all detection frames of each non-key frame of the video clip against all detection frames of the key frame, and, if the match passes, set the action category of the non-key frame's detection frame to the action category of the key frame's detection frame matched with it.
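The claims do not fix the matching criterion used by the continuous recognition module; greedy IoU matching against the key frame's boxes is one plausible reading, sketched below. The 0.5 match threshold and all function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def propagate_labels(key_boxes, key_actions, frame_boxes, match_thresh=0.5):
    """For each non-key-frame detection frame, find the best-overlapping
    key-frame detection frame; if the match passes, copy its action category."""
    labels = []
    for box in frame_boxes:
        ious = [iou(box, kb) for kb in key_boxes]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else None
        passed = best is not None and ious[best] >= match_thresh
        labels.append(key_actions[best] if passed else None)
    return labels
```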
  12. The pedestrian action continuous detection and recognition apparatus according to claim 11, characterized in that it further comprises a presentation module, configured to display all detection frames of the key frame together with their action categories and the detection frames of the matched non-key frames together with their action categories, and to display the detection frames of the unmatched non-key frames without action categories.
  13. The pedestrian action continuous detection and recognition apparatus according to claim 11, characterized in that the action recognition module comprises:
    a sampling unit, configured to sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the first frame sequence contains fewer frame images than the second frame sequence;
    a feature map matrix extraction unit, configured to input the first frame sequence and the second frame sequence into the slow channel and the fast channel respectively to extract features, obtaining a first feature map matrix and a second feature map matrix respectively;
    a feature calculation unit, configured to perform a temporal pooling operation on the first feature map matrix and the second feature map matrix respectively, extract, on the two temporal pooling results, the features of the regions of interest based on the detection frames of the key frame, and perform a spatial pooling operation respectively, to obtain the features of the slow channel and the features of the fast channel, the features of the slow channel representing the static information of the video clip and the features of the fast channel representing the dynamic information of the video clip;
    a probability calculation unit, configured to fuse the features of the slow channel with the features of the fast channel and perform, in sequence, a fully connected operation and a softmax operation on the fusion result to obtain the probability of each action category;
    a category determination unit, configured to take the action category corresponding to the maximum probability value as the action category of the detection frame of the key frame.
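A PyTorch sketch of the head described by claim 13's feature calculation and probability calculation units: temporal pooling of each channel's feature map matrix, region-of-interest extraction at the key frame's detection frames, spatial pooling, fusion, then a fully connected layer and softmax. The `(1, C, T, H, W)` shapes, mean pooling, the 7×7 ROI size, and the use of `torchvision.ops.roi_align` with boxes already in feature-map coordinates are all assumptions for illustration, not details fixed by the claim.

```python
import torch
from torchvision.ops import roi_align

def head_probs(slow_maps, fast_maps, key_boxes, fc):
    """slow_maps / fast_maps: (1, C, T, H, W) feature map matrices from the
    slow and fast channels; key_boxes: float tensor (N, 4) of key-frame
    detection frames in feature-map coordinates; fc: Linear layer mapping
    the fused feature to num_classes logits."""
    feats = []
    for maps in (slow_maps, fast_maps):
        pooled_t = maps.mean(dim=2)                 # temporal pooling -> (1, C, H, W)
        # Prepend batch index 0 to each box; roi_align expects (N, 5) rois.
        rois = torch.cat([torch.zeros(len(key_boxes), 1), key_boxes], dim=1)
        roi = roi_align(pooled_t, rois, output_size=(7, 7))  # (N, C, 7, 7)
        feats.append(roi.mean(dim=(2, 3)))          # spatial pooling -> (N, C)
    fused = torch.cat(feats, dim=1)                 # fuse slow + fast features
    return torch.softmax(fc(fused), dim=1)          # probability per action category
```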
  14. A non-volatile computer-readable storage medium, characterized by comprising a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method according to any one of claims 1-10.
  15. A computer device, characterized by comprising at least one processor and a memory storing computer-executable instructions, the processor implementing, when executing the instructions, the steps of the pedestrian action continuous detection and recognition method according to any one of claims 1-10.
PCT/CN2023/071627 2022-01-22 2023-01-10 Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device WO2023138444A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210075002.1 2022-01-22
CN202210075002.1A CN116563881A (en) 2022-01-22 2022-01-22 Pedestrian action continuous detection and recognition method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
WO2023138444A1 (en)

Family

ID=87347801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071627 WO2023138444A1 (en) 2022-01-22 2023-01-10 Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN116563881A (en)
WO (1) WO2023138444A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609497A (en) * 2017-08-31 2018-01-19 武汉世纪金桥安全技术有限公司 Real-time video face recognition method and system based on visual tracking technology
CN108256506A (en) * 2018-02-14 2018-07-06 北京市商汤科技开发有限公司 Object detection method and apparatus in video, and computer storage medium
US20190102908A1 (en) * 2017-10-04 2019-04-04 Nvidia Corporation Iterative spatio-temporal action detection in video
CN110427800A (en) * 2019-06-17 2019-11-08 平安科技(深圳)有限公司 Video object accelerated detection method and apparatus, server, and storage medium
CN111461010A (en) * 2020-04-01 2020-07-28 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN112651292A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based human body action recognition method and apparatus, medium, and electronic device

Also Published As

Publication number Publication date
CN116563881A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
Johnston et al. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
Zhou et al. Global and local-contrast guides content-aware fusion for RGB-D saliency prediction
EP3961485A1 (en) Image processing method, apparatus and device, and storage medium
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
WO2018177379A1 (en) Gesture recognition, gesture control and neural network training methods and apparatuses, and electronic device
TW202012885A (en) Electronic apparatus, method for controlling thereof, and method for controlling a server
GB2555136A (en) A method for analysing media content
CN111783620A (en) Expression recognition method, device, equipment and storage medium
CN113221743B (en) Table analysis method, apparatus, electronic device and storage medium
KR102576344B1 (en) Method and apparatus for processing video, electronic device, medium and computer program
Wen et al. CF-SIS: Semantic-instance segmentation of 3D point clouds by context fusion with self-attention
US11894021B2 (en) Data processing method and system, storage medium, and computing device
KR20200010993A (en) Electronic apparatus for recognizing facial identity and facial attributes in image through complemented convolutional neural network
Yu et al. A content-adaptively sparse reconstruction method for abnormal events detection with low-rank property
JP7242994B2 (en) Video event identification method, apparatus, electronic device and storage medium
WO2023273173A1 (en) Target segmentation method and apparatus, and electronic device
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN108537109B (en) OpenPose-based monocular camera sign language identification method
Li et al. Deep reasoning with multi-scale context for salient object detection
Dong et al. Holistic and Deep Feature Pyramids for Saliency Detection.
WO2023282847A1 (en) Detecting objects in a video using attention models
Gündüz et al. Turkish sign language recognition based on multistream data fusion
Pavlov et al. Application for video analysis based on machine learning and computer vision algorithms
WO2023138444A1 (en) Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23742759

Country of ref document: EP

Kind code of ref document: A1