WO2023138444A1 - Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device - Google Patents

Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Info

Publication number
WO2023138444A1
WO2023138444A1 · PCT/CN2023/071627
Authority
WO
WIPO (PCT)
Prior art keywords
frame
detection
frames
key
video clip
Prior art date
Application number
PCT/CN2023/071627
Other languages
French (fr)
Chinese (zh)
Inventor
孙叶纳
周军
Original Assignee
北京眼神智能科技有限公司
北京眼神科技有限公司
深圳爱酷智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京眼神智能科技有限公司, 北京眼神科技有限公司, 深圳爱酷智能科技有限公司
Publication of WO2023138444A1 publication Critical patent/WO2023138444A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the field of motion detection and recognition, in particular to a method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian motion.
  • Pedestrian action detection and recognition methods in the prior art usually detect and recognize only the key frames in video clips, so detection over the entire video depends entirely on how well the key frames are detected. Relying on key-frame detection alone easily biases the evaluation of the overall detection performance on the video. In addition, because only key frames are detected, the detection frames and recognition results are presented intermittently, resulting in a poor visual experience.
  • the present application provides a method, device, storage medium and equipment for continuous detection and recognition of pedestrian movements, which realize continuous detection and recognition of human body movements frame by frame in videos.
  • the present application provides a method for continuous detection and recognition of pedestrian actions, and the method includes the following steps.
  • The video to be detected is divided into multiple video segments, and each video segment includes multiple frames of images.
  • In each video clip, one frame image is selected as a key frame, and the remaining frame images are used as non-key frames.
  • For each video clip, all frame images of the video clip are input into the pedestrian detection model, and a certain number of detection frames are obtained on each frame image of the video clip.
  • For each video clip, all frame images of the video clip and their detection frames are input into the action recognition model to obtain the action category of each detection frame of the key frame.
  • For each video clip, all detection frames of each non-key frame are matched with all detection frames of the key frame; if the match passes, the action category of the detection frame of the non-key frame is set to the action category of the key-frame detection frame that matches it.
  • the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
  • inputting all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame includes the following steps.
  • the video clips are sampled according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the number of frames of images included in the first frame sequence is less than the number of frames of images included in the second frame sequence.
  • the first frame sequence and the second frame sequence are respectively input into the slow channel and the fast channel to extract features, and a first feature map matrix and a second feature map matrix are respectively obtained.
  • the feature of the slow channel is fused with the feature of the fast channel, and the fusion result is sequentially subjected to a full connection operation and a softmax operation to obtain the probability of each action category.
  • the action category corresponding to the maximum probability is used as the action category of the detection frame of the key frame.
  • the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
  • matching all detection frames of each non-key frame of the video clip with all detection frames of key frames includes the following steps.
  • For each non-key frame, the IOU distances between all of its detection frames and all detection frames of the key frame are calculated to obtain an IOU cost matrix, where IOU denotes the overlap ratio of two detection frames, that is, the ratio of the intersection of the two detection frames to their union.
  • Based on the IOU cost matrix, the Hungarian algorithm is used to match all the detection frames of the non-key frames with all the detection frames of the key frames. For a detection frame of a non-key frame, if a key-frame detection frame matches it and the IOU distance between the two is smaller than a set threshold, the match passes; otherwise, it fails.
  • inputting all frame images of the video clip into the pedestrian detection model, and obtaining a certain number of detection frames on each frame image of the video clip includes the following steps.
  • All the frame images of the video clips are input into the YOLOX detection model, and several candidate detection frames and the confidence scores for identifying the candidate detection frames as people are obtained on each frame image of the video clip.
  • A non-maximum suppression (NMS) operation is performed on the candidate detection frames according to a set NMS threshold, and the candidate detection frames whose confidence scores are lower than a set confidence threshold are filtered out of the NMS result to obtain the certain number of detection frames and their confidence scores.
  • the middle frame image is selected as a key frame, and the remaining frame images are selected as non-key frames.
  • the method further includes: displaying all detection frames of the key frames with their action categories and the matched detection frames of non-key frames with their action categories, and displaying only the detection frames for non-key-frame detection frames that did not pass the matching.
  • using the action category corresponding to the maximum probability as the action category of the detection frame of the key frame includes: filtering out the probabilities lower than a set score threshold, selecting the maximum value among the remaining probabilities, and using the action category corresponding to that maximum as the action category of the detection frame of the key frame.
  • the present application provides a device for continuous detection and recognition of pedestrian actions, said device comprising a video segmentation module, a key frame selection module, a pedestrian detection module, an action recognition module and a continuous recognition module.
  • the video segmentation module is used to divide the video to be detected into multiple video clips, and each video clip includes multiple frames of images.
  • the key frame selection module is used to select one frame of image in each video clip as a key frame, and the rest of the frame images as non-key frames.
  • the pedestrian detection module is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
  • the action recognition module is configured to, for each video clip, input all frame images of the video clip and their detection frames into the motion recognition model to obtain the action category of each detection frame of the key frame.
  • the continuous recognition module is used for each video clip, matching all the detection frames of each non-key frame of the video clip with all the detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
  • the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
  • the action recognition module includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit and a category determination unit.
  • the sampling unit is configured to sample the video segment according to different frame sampling rates to obtain a first frame sequence containing fewer frame images and a second frame sequence containing more frame images.
  • a feature map matrix extraction unit configured to input the first frame sequence and the second frame sequence into the slow channel and fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the feature calculation unit is configured to perform time-series pooling operations on the first feature map matrix and the second feature map matrix respectively, extract features of the region of interest based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • the probability calculation unit is used to fuse the features of the slow channel and the fast channel, perform a full connection operation and a softmax operation on the fusion result in sequence, and obtain the probability of each action category.
  • the category determination unit is configured to use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
  • the continuous identification module includes:
  • the IOU cost matrix calculation unit is used to calculate, for each non-key frame, the IOU distances between all of its detection frames and all detection frames of the key frame, to obtain the IOU cost matrix;
  • a matching unit configured to use the Hungarian algorithm to match all detection frames of the non-key frame with all detection frames of the key frame based on the IOU cost matrix;
  • the pedestrian detection module includes a candidate detection frame acquisition unit, an NMS unit, and a filtering unit.
  • the candidate detection frame acquisition unit is used to input all the frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and the confidence scores for identifying the candidate detection frames as people on each frame image of the video clip.
  • the NMS unit is configured to perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • a filtering unit configured to filter out candidate detection frames whose confidence scores are lower than a set confidence threshold from the result of the non-maximum value suppression operation, to obtain the certain number of detection frames and their confidence scores.
  • an intermediate frame image is selected as a key frame in each video segment, and other frame images are selected as non-key frames.
  • the device further includes a presentation module.
  • the presentation module is configured to display and present all detection frames of the key frames with their action categories, as well as the matched detection frames of non-key frames with their action categories, and to display only the detection frames of non-key frames that do not pass the matching.
  • the present application provides a non-transitory computer-readable storage medium, including a memory storing processor-executable instructions. When the instructions are executed by the processor, the steps of the method for continuous detection and recognition of pedestrian movements included in the first aspect are implemented.
  • the present application provides a computer device, including at least one processor and a memory storing computer-executable instructions.
  • the processor executes the instructions, the steps of the method for continuous detection and recognition of pedestrian movements described in the first aspect are implemented.
  • This application realizes the sharing of key frame recognition results to non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of videos. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • FIG. 1 is a flow chart of a method for continuous detection and recognition of pedestrian movements according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of an embodiment of an action recognition model of the present application
  • FIG. 3 is a schematic diagram of a matching process between a detection frame of a non-key frame and a detection frame of a key frame according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a device for continuous detection and recognition of pedestrian movements according to an embodiment of the present application
  • FIG. 5 is an internal structure diagram of a computer device according to an embodiment of the present application.
  • An embodiment of the present application provides a method for continuous detection and recognition of pedestrian actions. As shown in FIG. 1, the method includes steps S100 to S500.
  • S100 Divide the video to be detected into multiple video segments, each of which includes multiple frames of images.
  • For a given video V to be detected that contains a specified action, the video is first divided into frames and then divided into multiple video segments (clips), each of which includes multiple frames of images.
  • The division can be performed according to a certain clip length s; for example, when s is 16, each divided video segment includes 16 frames of images. To avoid loss of information, adjacent clips may also overlap by several frames, for example 5 frames, as sketched below.
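  • A minimal sketch of this clip-splitting step, assuming the video has already been decoded into a list of frame images; the function name and parameters are illustrative, not taken from the patent:

```python
def split_into_clips(frames, clip_len=16, overlap=5):
    """Split a frame list into clips of `clip_len` frames,
    with `overlap` frames shared between adjacent clips."""
    step = clip_len - overlap  # stride between clip start indices
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, step)]
```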
  • S200 In each video segment, an intermediate frame image is selected as a key frame (key_frame), and the other frame images are used as non-key frames (norm_frame).
  • S300 For each video segment, input all frame images of the video segment into the pedestrian detection model, and obtain a certain number of detection frames on each frame image of the video segment.
  • the pedestrian detection model is used to detect a certain number of detection frames representing people on each frame image, and this application does not limit the specific implementation of the pedestrian detection model.
  • Each frame image is processed by the pedestrian detection model to obtain a matrix of dimension N×5, where N is the number of detection frames, and the five dimensions correspond to the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the detection frame, together with the confidence score of the detection frame being recognized as a person.
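  • An illustrative layout of this per-image output, with made-up values:

```python
import numpy as np

# One row per detection frame: x1, y1, x2, y2, confidence score
detections = np.array([
    [120.0,  80.0, 210.0, 400.0, 0.97],
    [305.0,  95.0, 380.0, 390.0, 0.88],
])  # N = 2 detection frames on this frame image
```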
  • S400 For each video segment, input all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame.
  • the function of the action recognition model is to determine the action category pred of the detection frame of the key frame according to the information of all frame images of the video clip. This application does not limit the specific implementation of the action recognition model.
  • In contrast, the existing technology only detects and identifies the key frames in the video clips; the recognition result of a key frame stands for the entire video clip, and the detection of the whole video to be detected is evaluated from the key-frame recognition results of all video clips.
  • Existing methods thus make the detection of the entire video depend entirely on how well the key frames are detected, which easily biases the evaluation of the overall detection performance on the video.
  • Moreover, because only key frames are detected, the detection frames and recognition results are presented intermittently, resulting in a poor visual experience.
  • After the application obtains the detection and recognition results of the key frame (each detection frame of the key frame and its action category), all detection frames of each non-key frame are matched with all detection frames of the key frame, and for the non-key-frame detection frames that pass the matching, the recognition result pred of the matching key-frame detection frame is shared with them, realizing motion detection and recognition for every frame image.
  • the present application can also output and display the detection frames.
  • For key frames, all detection frames of the key frames and their action categories are displayed and presented.
  • For non-key frames, the detection frames that have been matched and their action categories are displayed and presented, and for detection frames that have not been matched, only the detection frames themselves are displayed and presented.
  • This application realizes the sharing of key frame recognition results with non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video; moreover, the detection frames and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the action recognition model of the present application includes a slow channel and a fast channel executed in parallel. Both the slow channel and the fast channel are convolutional neural networks, and the convolutional neural network of the fast channel has fewer channels than the convolutional neural network of the slow channel.
  • a series of frame images in a video scene usually includes two different parts: a static part that changes little or slowly and a dynamic part that is changing.
  • a video of an airplane taking off may contain a relatively static airport and a dynamic airplane moving rapidly within the static airport scene.
  • the handshake is usually quicker while the rest of the scene is relatively static.
  • Therefore, the present application designs the action recognition model to include slow and fast channels executed in parallel.
  • The slow channel is a slow, high-resolution convolutional neural network with a shorter input frame sequence and a larger number of channels, used to analyze the spatially static content in the video.
  • The fast channel is a fast, low-resolution convolutional neural network with a longer input frame sequence and a smaller number of channels, used to analyze the temporally dynamic content in the video.
  • The fast channel uses a smaller number of channels (i.e., fewer filters) to keep the network lightweight, at the cost of a weaker ability to represent static spatial semantics.
  • This design is similar to the principle of the retinal ganglion in primates, where about 80% of the cells (P-cells) operate at low frequency and can recognize static details, while about 20% of the cells (M-cells) operate at high frequency and are responsible for responding to rapid changes.
  • the aforementioned S400 includes steps S410 to S450.
  • S410 Sampling the video segment according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the number of image frames included in the first frame sequence is less than the number of image frames included in the second frame sequence.
  • For example, the frame sampling rates are set to 2 and 1; that is, a 16-frame video segment is sampled every two frames and every frame, respectively, to obtain a first frame sequence of 8 frames and a second frame sequence of 16 frames.
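  • In code, this dual-rate sampling amounts to simple strided indexing; a sketch under the 16-frame example above:

```python
clip = list(range(16))    # stand-in for the 16 frame images of one clip
first_seq = clip[::2]     # sampling rate 2 -> 8 frames, fed to the slow channel
second_seq = clip[::1]    # sampling rate 1 -> 16 frames, fed to the fast channel
assert len(first_seq) == 8 and len(second_seq) == 16
```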
  • S420 Input the first frame sequence and the second frame sequence into the slow channel and the fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the first frame sequence of 8 frames is input into the slow channel, and the feature map representing the static information of the video clip is extracted for each frame image, and the feature maps of all images in the first frame sequence form the first feature map matrix.
  • the second frame sequence of 16 frames is input into the fast channel, and a feature map representing the dynamic information of the video clip is extracted for each frame image, and the feature maps of all images in the second frame sequence form a second feature map matrix.
  • S430 Perform time-series pooling (pool) operations on the first feature map matrix and the second feature map matrix respectively, extract features of a region of interest (ROI) based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling (pool) operations respectively to obtain features of the slow channel and features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • Pooling is the process of compressing one or more matrices created by the previous convolutional layer into a smaller matrix.
  • In deep learning, pooling generally refers to spatial pooling.
  • the application of pooling on time series is called temporal pooling.
  • ROI Align operations are performed on the two time-series pooling results to complete regional feature aggregation, and then space pooling is performed to obtain the features of the fast channel and the slow channel.
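  • A hedged PyTorch sketch of this step (S430), assuming a batch of one clip, feature maps shaped (1, channels, time, height, width) as produced by a 3D CNN, and key-frame detection boxes given as an (N, 4) tensor of (x1, y1, x2, y2) image coordinates; spatial_scale maps image coordinates onto the feature map, and all names are illustrative:

```python
import torch
from torchvision.ops import roi_align

def channel_features(feat_5d, key_frame_boxes, spatial_scale):
    # Temporal pooling: average over the time axis -> (1, C, H, W)
    pooled_t = feat_5d.mean(dim=2)
    # ROI Align aggregates the region of interest of each key-frame
    # detection frame -> (N, C, 7, 7)
    rois = roi_align(pooled_t, [key_frame_boxes], output_size=7,
                     spatial_scale=spatial_scale, aligned=True)
    # Spatial pooling: average over H and W -> (N, C)
    return rois.mean(dim=(2, 3))
```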
  • S440 Fusing the features of the slow channel and the features of the fast channel, and sequentially performing a full connection operation and a softmax operation on the fusion result, to obtain the probability of each action category of each detection frame of the key frame.
  • Specifically, a concat operation is performed on the channel dimension; features of dimension num_classes are then obtained through a fully connected layer, where num_classes is the number of action categories, and the probability of being recognized as each action category is finally obtained through softmax activation.
  • S450 Use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • Optionally, the probabilities lower than a set score threshold can be filtered out first; the maximum value is then selected from the remaining probabilities, and its corresponding action category is taken as the action category (pred) of the detection frame of the key frame.
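  • A minimal sketch of this classification head (S440-S450), assuming per-box slow/fast features and an externally constructed fully connected layer; the score threshold is a placeholder:

```python
import torch
import torch.nn.functional as F

def classify(slow_feat, fast_feat, fc, score_thresh=0.5):
    fused = torch.cat([slow_feat, fast_feat], dim=1)      # concat on channel dim
    probs = F.softmax(fc(fused), dim=1)                   # (num_boxes, num_classes)
    probs = probs.masked_fill(probs < score_thresh, 0.0)  # drop low-score classes
    # argmax over the remaining probabilities; if every class of a box was
    # filtered out, a real implementation would mark that box as "no action"
    return probs.argmax(dim=1)
```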
  • the action recognition model also includes side connections from the fast channel to the slow channel, which feed data from the fast channel into the slow channel.
  • The lateral connection can be realized by a 3D convolution with a convolution kernel of 5×1² (that is, 5×1×1, with 5 on the temporal axis).
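  • A sketch of such a lateral connection, assuming a SlowFast-style design in which the fast channel's features are time-strided down to the slow channel's frame rate (ratio 2 in the example above) before being merged; the channel counts are illustrative:

```python
import torch.nn as nn

lateral = nn.Conv3d(in_channels=32, out_channels=64,
                    kernel_size=(5, 1, 1),  # the 5x1^2 kernel: 5 on time, 1x1 in space
                    stride=(2, 1, 1),       # temporal stride = fast/slow frame-rate ratio
                    padding=(2, 0, 0))      # preserves the (strided) temporal length
```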
  • the aforementioned S500 includes S510 and S520.
  • S510 For each non-key frame, calculate the IOU (Intersection-over-Union) distances between all of its detection frames and all detection frames of the key frame to obtain an IOU cost matrix.
  • S520 Based on the IOU cost matrix, use the Hungarian algorithm to match all the detection frames of the non-key frames with all the detection frames of the key frames.
  • Denote the set of detection boxes of the non-key frame as Bn, in which each detection box is marked as bbox_i, and denote the set of detection boxes of the key frame as B, in which each detection box is marked as Bbox_i.
  • Each detection box bbox_i in Bn is matched against each detection box Bbox_i in B, and the output obtained is:
  • a list of matched detection frame pairs, in which each element is (index_i, index_j), where index_i is the detection frame index of the non-key frame and index_j is the detection frame index of the key frame;
  • a detection frame list F of the unmatched non-key-frame boxes, in which each element is index_i, indicating the detection frame index of the non-key frame.
  • the Hungarian algorithm is used to complete the matching of the detection frame of the non-key frame and the detection frame of the key frame.
  • This application matches the detection frame of the non-key frame with the detection frame of the key frame based on the IOU cost matrix and the Hungarian algorithm, which improves the matching effect.
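  • A hedged sketch of this matching step (S510-S520) using NumPy and SciPy's Hungarian solver; the box format (x1, y1, x2, y2) follows the detection output above, and the distance threshold is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Overlap ratio of two boxes: intersection area over union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_frames(non_key_boxes, key_boxes, max_dist=0.7):
    # IOU distance = 1 - IOU; rows index non-key-frame boxes, columns key-frame boxes
    cost = np.array([[1.0 - iou(a, b) for b in key_boxes] for a in non_key_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matched = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]
    matched_rows = {i for i, _ in matched}
    unmatched = [i for i in range(len(non_key_boxes)) if i not in matched_rows]
    return matched, unmatched
```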
  • The prior art pedestrian action detection and recognition methods use Faster RCNN to detect pedestrians; in scenes with dense pedestrians and severe occlusion, this suffers from serious missed detections.
  • the present application adopts the following method for pedestrian detection, and the step S300 includes steps S310 to S330.
  • S310 Input all frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and confidence scores for identifying the candidate detection frames as people on each frame image of the video clip.
  • This application uses the YOLOX detection model to realize pedestrian detection.
  • The YOLOX detection model is obtained by retraining on pedestrian data.
  • The video clips are input into the YOLOX detection model to complete the detection of pedestrians in all frame images, outputting the detection frames of the pedestrians and the confidence score for recognition as "person".
  • S320 Perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • Non-Maximum Suppression is a post-processing method applied to object detection, which can remove redundant detection frames.
  • S330 Filter out candidate detection frames whose confidence scores are lower than a set confidence threshold from the result of the non-maximum value suppression operation, and obtain a certain number of detection frames and their confidence scores.
  • the candidate detection frames whose confidence scores are lower than the threshold are filtered out through the confidence threshold, and the remaining detection frames and their confidence scores in the image are finally output.
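  • An illustrative post-processing sketch for S320-S330, using torchvision's NMS; the thresholds are placeholders, and the YOLOX forward pass is assumed to have already produced `boxes` (N, 4) and `scores` (N,):

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, nms_thresh=0.45, conf_thresh=0.3):
    keep = nms(boxes, scores, nms_thresh)  # suppress redundant detection frames
    boxes, scores = boxes[keep], scores[keep]
    mask = scores >= conf_thresh           # drop low-confidence detection frames
    return boxes[mask], scores[mask]
```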
  • Although the steps in the flow charts involved in the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
  • Some embodiments of the present application provide a device for continuous detection and recognition of pedestrian actions. As shown in FIG. 4, the device includes a video segmentation module 1, a key frame selection module 2, a pedestrian detection module 3, an action recognition module 4 and a continuous recognition module 5.
  • the video segmentation module 1 is configured to divide the video to be detected into multiple video segments, each video segment including multiple frames of images.
  • the key frame selection module 2 is used to select one frame of image in each video segment as a key frame, and the rest of the frame images as non-key frames.
  • In each video clip, the middle frame image is selected as the key frame, and the remaining frame images are used as non-key frames.
  • the pedestrian detection module 3 is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
  • the action recognition module 4 is configured to input all frame images of the video segment and their detection frames into the action recognition model for each video segment, and obtain the action category of each detection frame of the key frame.
  • The continuous recognition module 5 is used, for each video clip, to match all detection frames of each non-key frame of the video clip with all detection frames of the key frame; if the match passes, the action category of the detection frame of the non-key frame is set to the action category of the key-frame detection frame that matches it.
  • the apparatus of the present application may further include a presentation module.
  • the presenting module is configured to display and present all detection frames of the key frames with their action categories, as well as the matched detection frames of non-key frames with their action categories, and to display only the detection frames of the non-key frames that fail to pass the matching.
  • This application realizes the sharing of key frame recognition results with non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the action recognition model includes a slow channel and a fast channel executed in parallel.
  • Both the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the convolutional neural network of the slow channel.
  • the action recognition module of the present application includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit and a category determination unit.
  • the sampling unit is configured to sample the video clips according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, the number of frames of images contained in the first frame sequence is less than the number of frames of images contained in the second frame sequence.
  • the feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow channel and the fast channel to extract features, respectively, to obtain a first feature map matrix and a second feature map matrix.
  • the feature calculation unit is configured to perform time-series pooling operations on the first feature map matrix and the second feature map matrix respectively, extract features of the region of interest based on the detection frame of the key frame on the two obtained time-series pooling results, and perform space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel.
  • the features of the slow channel represent the static information of the video segment
  • the features of the fast channel represent the dynamic information of the video segment.
  • the probability calculation unit is used to fuse the features of the slow channel and the features of the fast channel, and perform a full connection operation and a softmax operation on the fusion results in order to obtain the probability of each action category.
  • the category determination unit is configured to use the action category corresponding to the maximum probability as the action category of the detection frame of the key frame.
  • the action recognition model also includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel to the slow channel.
  • the continuous identification module includes an IOU cost matrix calculation unit and a matching unit.
  • the IOU cost matrix calculation unit is used to calculate the IOU distance between all the detection frames of each non-key frame and all the detection frames of the key frame to obtain the IOU cost matrix.
  • the matching unit is configured to use the Hungarian algorithm to match all the detection frames of the non-key frame with all the detection frames of the key frame based on the IOU cost matrix.
  • the pedestrian detection module of the present application includes a candidate detection frame acquisition unit, an NMS unit and a filtering unit.
  • the candidate detection frame acquisition unit is used to input all the frame images of the video clip into the YOLOX detection model, and obtain several candidate detection frames and the confidence scores that the candidate detection frames are identified as people on each frame image of the video clip.
  • the NMS unit is used to perform a non-maximum value suppression operation on the candidate detection frame according to the set NMS threshold.
  • the filtering unit is used to filter out the candidate detection frames whose confidence scores are lower than the set confidence threshold from the result of the non-maximum value suppression operation, and obtain a certain number of detection frames and their confidence scores.
  • the above-mentioned method embodiments provided in this application can implement business logic through computer programs and record them on storage media.
  • the storage media can be read and executed by computers to achieve the effects of the solutions described in the method embodiments of this specification. Therefore, the present application also provides a non-volatile computer-readable storage medium for continuous detection and recognition of pedestrian actions.
  • the non-volatile computer-readable storage medium includes a memory for storing processor-executable instructions. When the instructions are executed by the processor, the steps of the method for continuous detection and recognition of pedestrian movements in the above-mentioned embodiments are implemented.
  • This application realizes the sharing of recognition results between key frames and non-key frames through detection frame matching, thereby realizing the frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of the video. Moreover, the detection frame and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • The storage medium may include a physical device for storing information; information is usually digitized and then stored in an electrical, magnetic, or optical medium. The storage medium may include: devices that store information electrically, such as various memories, e.g., RAM and ROM; devices that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, there are also other readable storage media, such as quantum memories and graphene memories.
  • the storage medium described above may also include other implementations according to the descriptions of the method embodiments.
  • the implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments.
  • the present application also provides a computer device for continuous detection and recognition of pedestrian movements.
  • the computer device may be a single computer, or may be an actual operating device that uses one or more of the methods or one or more of the devices of the embodiments described in this specification.
  • the computer device for continuous detection and recognition of pedestrian movements may include at least one processor and a memory storing computer-executable instructions. When the processor executes the instructions, the steps of the method for continuous detection and recognition of pedestrian movements in any one or more of the above embodiments are implemented.
  • This application realizes the sharing of key frame recognition results to non-key frames through detection frame matching, thereby realizing frame-by-frame continuous detection and recognition of human body movements in videos, and solving the problem of deviation caused by only relying on key frame detection to the overall detection and evaluation of videos; moreover, the detection frames and recognition results of each frame of image are presented in a continuous manner, which improves the visual experience of presentation.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 5 .
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, mobile cellular network, NFC (Near Field Communication) or other technologies.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen
  • the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the casing of the computer device, or an external keyboard, touch pad or mouse.
  • FIG. 5 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation to the computer equipment to which the solution of the application is applied.
  • the specific computer equipment may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer equipment described above may also include other implementations according to the description of the method embodiments.
  • the implementation principle and technical effects of this embodiment are the same as those of the foregoing method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian action continuous detection and recognition method and apparatus, a storage medium, and a device. The method comprises: segmenting a video to be detected into a plurality of video clips, each video clip comprising a plurality of frame images; in each video clip, selecting a frame image as a key frame, and taking the remaining frame images as non-key frames; for each video clip, inputting all the frame images of the video clip into a pedestrian detection model, and obtaining a certain number of detection boxes on each frame image of the video clip; for each video clip, inputting all the frame images of the video clip and the detection boxes thereof into an action recognition model, to obtain an action category of each detection box of the key frame; and for each video clip, matching all the detection boxes of each non-key frame with all the detection boxes of the key frame, and setting action categories of the detection boxes of the non-key frames as the action categories of the detection boxes of the matched key frame.

Description

行人动作连续检测识别方法、装置、存储介质及计算机设备Method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian movements
相关申请的交叉引用Cross References to Related Applications
本申请要求于2022年1月22日提交中国专利局,申请号为202210075002.1,申请名称为“行人动作连续检测识别方法、装置、存储介质及设备”的中国专利申请的优先权,在此将其全文引入作为参考。This application claims the priority of the Chinese patent application with the application number 202210075002.1 and the application title "Method, Device, Storage Medium and Equipment for Continuous Detection and Recognition of Pedestrian Movements" submitted to the China Patent Office on January 22, 2022, which is hereby incorporated by reference in its entirety.
技术领域technical field
本申请涉及动作检测识别领域,特别是指一种行人动作连续检测识别方法、装置、存储介质及计算机设备。The present application relates to the field of motion detection and recognition, in particular to a method, device, storage medium and computer equipment for continuous detection and recognition of pedestrian motion.
背景技术Background technique
现有技术的行人动作检测识别方法通常只是对视频片段中的关键帧进行检测和识别,而对整个视频的检测则完全依赖于关键帧的检测效果。如果仅依赖关键帧检测,容易对视频整体检测效果的评估带来偏差。此外,仅对关键帧进行检测,检测框与识别结果以间断的方式呈现,视觉体验感差。Pedestrian action detection and recognition methods in the prior art usually only detect and recognize key frames in video clips, while the detection of the entire video completely depends on the detection effect of the key frames. If only relying on key frame detection, it is easy to bring bias to the evaluation of the overall detection effect of the video. In addition, only key frames are detected, and the detection frame and recognition results are presented intermittently, resulting in a poor visual experience.
发明内容Contents of the invention
为解决现有技术的缺陷,本申请提供一种行人动作连续检测识别方法、装置、存储介质及设备,实现了视频中人体动作的逐帧连续检测识别。In order to solve the defects of the prior art, the present application provides a method, device, storage medium and equipment for continuous detection and recognition of pedestrian movements, which realize continuous detection and recognition of human body movements frame by frame in videos.
本申请提供技术方案如下。The technical solution provided by this application is as follows.
第一方面,本申请提供一种行人动作连续检测识别方法,所述方法包括以下步骤。In a first aspect, the present application provides a method for continuous detection and recognition of pedestrian actions, and the method includes the following steps.
将待检测视频分割成多个视频片段,每个视频片段均包括多帧图像。The video to be detected is divided into multiple video segments, and each video segment includes multiple frames of images.
在每个视频片段选中选取一帧图像作为关键帧,其余帧图像作为非关键帧。Select one frame of image in each video clip as a key frame, and the rest of the frames as non-key frames.
对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框。For each video clip, input all frame images of the video clip into the pedestrian detection model, and obtain a certain number of detection frames on each frame image of the video clip.
对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识 别模型,得到所述关键帧的每个检测框的动作类别。For each video clip, all frame images of the video clip and their detection frames are input into the action recognition model to obtain the action category of each detection frame of the key frame.
对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,若匹配通过,则将非关键帧的检测框的动作类别设置为与之匹配的关键帧的检测框的动作类别。For each video clip, all detection frames of each non-key frame of the video clip are matched with all detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
在一些实施例中,所述动作识别模型包括平行执行的慢速通道和快速通道,所述慢速通道和快速通道均为卷积神经网络,所述快速通道的卷积神经网络的通道数少于慢速通道的卷积神经网络的通道数。In some embodiments, the action recognition model includes a slow channel and a fast channel executed in parallel, both of the slow channel and the fast channel are convolutional neural networks, and the number of channels of the convolutional neural network of the fast channel is less than that of the slow channel.
所述对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识别模型,得到所述关键帧的每个检测框的动作类别,包括以下步骤。For each video segment, inputting all frame images of the video segment and their detection frames into the action recognition model to obtain the action category of each detection frame of the key frame includes the following steps.
按照不同的帧采样率对所述视频片段进行采样,得到第一帧序列和第二帧序列,其中,第一帧序列包含的图像的帧数少于第二帧序列包含的图像的帧数。The video clips are sampled according to different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the number of frames of images included in the first frame sequence is less than the number of frames of images included in the second frame sequence.
将所述第一帧序列和第二帧序列分别输入所述慢速通道和快速通道提取特征,分别得到第一特征图矩阵和第二特征图矩阵。The first frame sequence and the second frame sequence are respectively input into the slow channel and the fast channel to extract features, and a first feature map matrix and a second feature map matrix are respectively obtained.
分别对所述第一特征图矩阵和第二特征图矩阵进行时序池化操作,在得到的两个时序池化结果上分别基于所述关键帧的检测框提取感兴趣区域的特征,并分别进行空间池化操作,得到所述慢速通道的特征和所述快速通道的特征,所述慢速通道的特征表征所述视频片段的静态信息,所述快速通道的特征表征所述视频片段的动态信息。Carrying out a time series pooling operation on the first feature map matrix and the second feature map matrix respectively, extracting features of the region of interest based on the detection frame of the key frame on the two obtained time series pooling results, and performing space pooling operations respectively to obtain the features of the slow channel and the features of the fast channel, the features of the slow channel represent the static information of the video segment, and the features of the fast channel represent the dynamic information of the video segment.
将所述慢速通道的特征和所述快速通道的特征进行融合,对融合结果依次进行全连接操作和softmax操作,得到各个动作类别的概率。The feature of the slow channel is fused with the feature of the fast channel, and the fusion result is sequentially subjected to a full connection operation and a softmax operation to obtain the probability of each action category.
将概率最大值对应的动作类别作为关键帧的检测框的动作类别。The action category corresponding to the maximum probability is used as the action category of the detection frame of the key frame.
在一些实施例中,所述动作识别模型还包括从所述快速通道到慢速通道的侧向连接,所述侧向连接将所述快速通道的数据送入所述慢速通道。In some embodiments, the action recognition model further includes a side connection from the fast channel to the slow channel, and the side connection sends the data of the fast channel into the slow channel.
在一些实施例中,所述对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,包括以下步骤。In some embodiments, for each video clip, matching all detection frames of each non-key frame of the video clip with all detection frames of key frames includes the following steps.
对每个非关键帧,计算其所有检测框与关键帧的所有检测框的IOU距离,得到IOU代价矩阵,其中IOU表示两个检测框的交叠率,为两个检测框的交集与两个检测框的并集的比值。For each non-key frame, calculate the IOU distance between all the detection frames and all the detection frames of the key frame, and obtain the IOU cost matrix, where IOU represents the overlap rate of the two detection frames, which is the ratio of the intersection of the two detection frames and the union of the two detection frames.
基于所述IOU代价矩阵,利用匈牙利算法对所述非关键帧的所有检测框与关键帧的所有检测框进行匹配。Based on the IOU cost matrix, the Hungarian algorithm is used to match all the detection frames of the non-key frames with all the detection frames of the key frames.
其中,对所述非关键帧的一个检测框,若存在一个关键帧的检测框与其匹配,且两者的IOU距离小于设定的阈值,则匹配通过,否则未匹配通过。Wherein, for a detection frame of the non-key frame, if there is a detection frame of a key frame matching it, and the IOU distance between the two is smaller than the set threshold, the matching is passed; otherwise, the matching is not passed.
在一些实施例中,所述对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框,包括以下步骤。In some embodiments, for each video clip, inputting all frame images of the video clip into the pedestrian detection model, and obtaining a certain number of detection frames on each frame image of the video clip includes the following steps.
将所述视频片段的所有帧图像输入YOLOX检测模型,在所述视频片段的每张帧图像上得到若干候选检测框以及所述候选检测框识别为人的置信度分数。All the frame images of the video clips are input into the YOLOX detection model, and several candidate detection frames and the confidence scores for identifying the candidate detection frames as people are obtained on each frame image of the video clip.
根据设定的非极大值抑制(Non-Maximum Suppression,NMS)阈值对所述候选检测框进行非极大值抑制操作。Perform a non-maximum suppression operation on the candidate detection frame according to the set non-maximum suppression (Non-Maximum Suppression, NMS) threshold.
从非极大值抑制操作的结果中过滤掉置信度分数低于设定的置信度阈值的候选检测框,得到所述一定数量的检测框及其置信度分数。Filter out the candidate detection frames whose confidence scores are lower than the set confidence threshold from the results of the non-maximum value suppression operation, and obtain the certain number of detection frames and their confidence scores.
在一些实施例中,在每个视频片段选中选取中间帧图像作为关键帧,其余帧图像作为非关键帧。In some embodiments, in each video segment, the middle frame image is selected as a key frame, and the remaining frame images are selected as non-key frames.
在一些实施例中,所述方法还包括:将所述关键帧的所有检测框及其动作类别以及匹配通过的非关键帧的检测框及其动作类别显示呈现,将未匹配通过的非关键帧的检测框显示呈现。In some embodiments, the method further includes: displaying and presenting all the detection frames and their action categories of the key frames and the matching detection frames and their action categories of non-key frames, and displaying and presenting the detection frames of non-key frames that did not pass.
在一些实施例中,相邻两个视频片段间有重叠的图像。In some embodiments, there are overlapping images between two adjacent video segments.
在一些实施例中,每个视频片段均包括16帧图像;所述按照不同的帧采样率对所述视频片段进行采样,得到第一帧序列和第二帧序列,具体为:按照帧采样率2和1对所述视频片段分别进行采样,得到包含8帧图像的第一帧序列和包含16帧图像的第二帧序列。In some embodiments, each video clip includes 16 frames of images; said sampling the video clips according to different frame sampling rates to obtain a first frame sequence and a second frame sequence is specifically: sampling the video clips according to frame sampling rates of 2 and 1, respectively, to obtain a first frame sequence containing 8 frame images and a second frame sequence containing 16 frame images.
在一些实施例中,所述将概率最大值对应的动作类别作为关键帧的检测框的动作类别,包括:通过设定的分数阈值筛选掉低于分数阈值的概率,再从剩余的概率中选择最大值,将概率最大值对应的动作类别做为关键帧的检测框的动作类别。In some embodiments, the action category corresponding to the maximum value of the probability is used as the action category of the detection frame of the key frame, including: filtering out the probability lower than the score threshold through the set score threshold, and then selecting the maximum value from the remaining probabilities, and using the action category corresponding to the maximum probability value as the action category of the detection frame of the key frame.
第二方面,本申请提供一种行人动作连续检测识别装置,所述装置包括视频 分割模块,关键帧选取模块,行人检测模块,动作识别模块和连续识别模块。In a second aspect, the present application provides a device for continuous detection and recognition of pedestrian actions, said device comprising a video segmentation module, a key frame selection module, a pedestrian detection module, an action recognition module and a continuous recognition module.
视频分割模块,用于将待检测视频分割成多个视频片段,每个视频片段均包括多帧图像。The video segmentation module is used to divide the video to be detected into multiple video clips, and each video clip includes multiple frames of images.
关键帧选取模块,用于在每个视频片段选中选取一帧图像作为关键帧,其余帧图像作为非关键帧。The key frame selection module is used to select one frame of image in each video clip as a key frame, and the rest of the frame images as non-key frames.
行人检测模块,用于对每一个视频片段,将所述视频片段的所有帧图像输入行人检测模型,在所述视频片段的每张帧图像上得到一定数量的检测框。The pedestrian detection module is configured to input all frame images of the video clip into the pedestrian detection model for each video clip, and obtain a certain number of detection frames on each frame image of the video clip.
动作识别模块,用于对每一个视频片段,将所述视频片段的所有帧图像及其检测框输入动作识别模型,得到所述关键帧的每个检测框的动作类别。The action recognition module is configured to, for each video clip, input all frame images of the video clip and their detection frames into the motion recognition model to obtain the action category of each detection frame of the key frame.
连续识别模块,用于对每一个视频片段,将所述视频片段的每个非关键帧的所有检测框与关键帧的所有检测框进行匹配,若匹配通过,则将非关键帧的检测框的动作类别设置为与之匹配的关键帧的检测框的动作类别。The continuous recognition module is used for each video clip, matching all the detection frames of each non-key frame of the video clip with all the detection frames of the key frame, if the match is passed, the action category of the detection frame of the non-key frame is set to the action category of the detection frame of the key frame matched therewith.
In some embodiments, the action recognition model includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

In some embodiments, the action recognition module includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit, and a category determination unit.

The sampling unit is configured to sample the video clip at different frame sampling rates to obtain a first frame sequence containing fewer frame images and a second frame sequence containing more frame images.

The feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

The feature calculation unit is configured to perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

The probability calculation unit is configured to fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category.

The category determination unit is configured to take the action category with the maximum probability as the action category of the detection box of the key frame.

In some embodiments, the action recognition model further includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.
In some embodiments, the continuous recognition module includes:

an IOU cost matrix calculation unit, configured to compute, for each non-key frame, the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix;

a matching unit, configured to match, based on the IOU cost matrix, all detection boxes of the non-key frame with all detection boxes of the key frame using the Hungarian algorithm;

where, for a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is below the set threshold, the match succeeds; otherwise, it fails.
In some embodiments, the pedestrian detection module includes a candidate detection box acquisition unit, an NMS unit, and a filtering unit.

The candidate detection box acquisition unit is configured to input all frame images of the video clip into a YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The NMS unit is configured to perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

The filtering unit is configured to filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining the certain number of detection boxes and their confidence scores.
In some embodiments, the key frame selection module selects the middle frame image of each video clip as the key frame, with the remaining frame images as non-key frames.

Further, the apparatus also includes a presentation module.

The presentation module is configured to display all detection boxes of the key frame with their action categories, display the detection boxes of matched non-key frames with their action categories, and display the detection boxes of unmatched non-key frames.
In a third aspect, the present application provides a non-volatile computer-readable storage medium, including a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the first aspect.

In a fourth aspect, the present application provides a computer device, including at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the first aspect.
The present application has the following beneficial effects:

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
Description of Drawings
FIG. 1 is a flowchart of a pedestrian action continuous detection and recognition method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an embodiment of the action recognition model of the present application;

FIG. 3 is a schematic diagram of the process of matching the detection boxes of a non-key frame with the detection boxes of the key frame according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a pedestrian action continuous detection and recognition apparatus according to an embodiment of the present application;

FIG. 5 is a diagram of the internal structure of a computer device according to an embodiment of the present application.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings and specific embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application but merely represents selected embodiments of it. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
An embodiment of the present application provides a pedestrian action continuous detection and recognition method. As shown in FIG. 1, the method includes steps S100 to S500.

S100: Divide the video to be detected into multiple video clips, each video clip including multiple frames of images.
In this step, a given video V to be detected, which contains the specified actions, is first split into frames and then divided into multiple video clips, each including multiple frames of images. The division can be performed with a fixed length s; for example, when s is 16, each resulting video clip includes 16 frames of images. To avoid losing information, adjacent clips may also overlap by several frames, e.g., 5 frames.
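As an illustrative sketch only, the clip segmentation described above might be implemented as follows; the clip length of 16 and overlap of 5 follow the example values in this step rather than being fixed by the application, and the function name is hypothetical:

```python
def split_into_clips(frames, clip_len=16, overlap=5):
    """Split a frame list into clips of clip_len frames, with `overlap`
    frames shared between adjacent clips (trailing frames that do not
    fill a whole clip are dropped in this sketch)."""
    stride = clip_len - overlap  # e.g. 16 - 5 = 11 new frames per clip
    return [frames[start:start + clip_len]
            for start in range(0, len(frames) - clip_len + 1, stride)]
```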
S200: In each video clip, select one frame image as the key frame and the remaining frame images as non-key frames.

For example, the middle frame image is selected as the key frame (key_frame), and the remaining frame images serve as non-key frames (norm_frame).
S300: For each video clip, input all frame images of the video clip into the pedestrian detection model and obtain a certain number of detection boxes on each frame image of the video clip.

The pedestrian detection model is used to detect, on each frame image, a certain number of detection boxes representing people; the present application does not restrict the specific implementation of the pedestrian detection model.

For example, each frame image processed by the pedestrian detection model yields a matrix of dimension N×5, where N is the number of detection boxes and the five dimensions correspond to the top-left corner coordinates (x1, y1) and bottom-right corner coordinates (x2, y2) of a detection box, together with the confidence score that the detection box contains a person. The detection box lists of all non-key frames of the video clip are denoted Bn, and the detection box list of the middle frame, i.e., the key frame, is denoted B.
S400: For each video clip, input all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame.

The role of the action recognition model is to determine, from the information in all frame images of the video clip, the action category pred of each detection box of its key frame; the present application does not restrict the specific implementation of the action recognition model.

S500: For each video clip, match all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame; if the match succeeds, set the action category of the non-key frame's detection box to the action category of the matching key-frame detection box.
The prior art can only detect and recognize the key frame of a video clip, letting the key frame's recognition result stand for the entire clip and evaluating the detection of the whole video to be detected from the recognition results of the key frames of all clips. Such methods make the detection of the entire video depend entirely on how well the key frames are detected, which readily biases the evaluation of the overall detection performance. In addition, because only key frames are detected, the detection boxes and recognition results are presented intermittently, giving a poor visual experience.

After obtaining the detection and recognition results of the key frame (each detection box of the key frame and its action category), the present application matches all detection boxes of each non-key frame against all detection boxes of the key frame. A non-key frame detection box that passes the match shares the recognition result pred of the matching key-frame detection box, achieving action detection and recognition for every frame image.

Furthermore, the present application can also output and display the detection boxes. For a key frame, all of its detection boxes and their action categories are displayed; for a non-key frame, the matched detection boxes and their action categories are displayed, while unmatched detection boxes are displayed without a category.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some embodiments of the present application, as shown in FIG. 2, the action recognition model of the present application includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

The inventors found through research that the frame images of a video scene usually contain two distinct parts: a static part that changes little or slowly, and a dynamic part that is changing. For example, a video of an airplane taking off contains a relatively static airport and a dynamic airplane moving rapidly through that static scene. Likewise, in daily life, when two people meet, the handshake is usually quick while the rest of the scene remains relatively static.

Based on this finding, the present application designs the action recognition model with a slow pathway and a fast pathway executed in parallel. The slow pathway is a slow, high-resolution convolutional neural network with a shorter input frame sequence and more channels, used to analyze the static spatial content of the video. The fast pathway is a fast, low-resolution convolutional neural network with a longer input frame sequence and fewer channels, used to analyze the temporal dynamic content of the video. The fast pathway uses fewer channels (i.e., fewer filters) to keep the network lightweight, so its ability to represent static spatial semantics is weaker.
This is analogous to the retinal ganglion of primates, in which about 80% of the cells (P-cells) operate at low frequency and recognize static detail, while about 20% (M-cells) operate at high frequency and respond to rapid change.
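A minimal PyTorch skeleton of this two-pathway design might look as follows; the layer shapes, channel counts, and class name are illustrative assumptions rather than the application's fixed architecture:

```python
import torch.nn as nn

class TwoPathwayBackbone(nn.Module):
    """Two parallel 3D-convolutional stems with asymmetric channel counts."""
    def __init__(self, slow_channels=64, fast_channels=8):
        super().__init__()
        # slow pathway: fewer input frames, more channels (static content)
        self.slow = nn.Conv3d(3, slow_channels, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        # fast pathway: more input frames, fewer channels (dynamic content)
        self.fast = nn.Conv3d(3, fast_channels, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))

    def forward(self, slow_frames, fast_frames):
        # slow_frames: (B, 3, T_slow, H, W); fast_frames: (B, 3, T_fast, H, W)
        return self.slow(slow_frames), self.fast(fast_frames)
```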
In some embodiments, based on this action recognition model, the aforementioned S400 includes steps S410 to S450.

S410: Sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the first frame sequence contains fewer frame images than the second frame sequence.
For example, with frame sampling rates of 2 and 1, a 16-frame video clip is sampled every two frames and every frame, respectively, yielding an 8-frame first frame sequence and a 16-frame second frame sequence.
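A brief sketch of this sampling step, assuming the example rates of 2 and 1 above (the function name is hypothetical):

```python
def sample_two_sequences(clip, slow_rate=2, fast_rate=1):
    """Return (first_seq, second_seq): the first sequence keeps every
    slow_rate-th frame (fewer frames, for the slow pathway); the second
    keeps every fast_rate-th frame (more frames, for the fast pathway)."""
    return clip[::slow_rate], clip[::fast_rate]  # 16 frames -> 8 and 16
```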
S420: Input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

For example, the 8-frame first frame sequence is input into the slow pathway, which extracts from each frame image a feature map characterizing the static information of the video clip; the feature maps of all images in the first frame sequence form the first feature map matrix.

Meanwhile, the 16-frame second frame sequence is input into the fast pathway, which extracts from each frame image a feature map characterizing the dynamic information of the video clip; the feature maps of all images in the second frame sequence form the second feature map matrix.
S430: Perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest (ROI) from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

Pooling is the process of compressing one or more matrices created by a preceding convolutional layer into a smaller matrix. In deep learning, pooling generally refers to spatial pooling; applying pooling along the time axis is called temporal pooling.

Taking a 16-frame video clip as an example, after each frame image passes through the convolutional neural network, 16 frame-level feature maps are obtained. Since action category recognition is usually performed at the video level rather than the frame level, a temporal aggregation method (i.e., temporal pooling) is needed to convert the frame-level features into video-level features.
After temporal pooling, an ROI Align operation is performed on each of the two temporal pooling results to aggregate the regional features, followed by spatial pooling, yielding the fast pathway features and the slow pathway features.
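One way this step could be realized, sketched in PyTorch with torchvision's roi_align; the tensor layout, mean-based pooling, and output size are assumptions about one possible implementation rather than the application's prescribed design:

```python
import torch
from torchvision.ops import roi_align

def pathway_features(feat_maps, boxes, out_size=7):
    """feat_maps: (C, T, H, W) feature map matrix of one pathway.
    boxes: (N, 4) key-frame detection boxes in feature-map coordinates.
    Returns (N, C): one pooled feature vector per detection box."""
    pooled_t = feat_maps.mean(dim=1)                 # temporal pooling -> (C, H, W)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # batch index 0
    roi_feats = roi_align(pooled_t.unsqueeze(0), rois,
                          output_size=out_size)      # (N, C, out_size, out_size)
    return roi_feats.mean(dim=(2, 3))                # spatial pooling -> (N, C)
```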
S440: Fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result, obtaining the probability of each action category for each detection box of the key frame.

During fusion, the features are concatenated along the channel dimension (a concat operation); a fully connected layer then produces a num_classes-dimensional feature, where num_classes is the number of action categories, and a softmax activation yields the probability of each action category.
S450: Take the action category with the maximum probability as the action category of the detection box of the key frame.

In this step, the probabilities below a set score threshold can be filtered out first; the maximum is then selected from the remaining probabilities, and its corresponding action category is the action category (pred) of the detection box of the key frame.
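A minimal NumPy sketch of S440 and S450, assuming placeholder fully connected weights and a deployer-chosen score threshold (all names below are hypothetical):

```python
import numpy as np

def classify_boxes(slow_feat, fast_feat, fc_w, fc_b, score_threshold=0.5):
    """slow_feat, fast_feat: (N, C_slow) and (N, C_fast) per-box features.
    fc_w: (C_slow + C_fast, num_classes); fc_b: (num_classes,).
    Returns one predicted class index per box, or -1 when every class
    probability falls below the score threshold."""
    fused = np.concatenate([slow_feat, fast_feat], axis=1)  # channel-wise concat
    logits = fused @ fc_w + fc_b                            # fully connected layer
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)            # softmax
    probs = np.where(probs < score_threshold, 0.0, probs)   # drop low scores
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) == 0.0] = -1                    # no class survived
    return preds
```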
The action recognition model also includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.

Because the information of the fast pathway and the slow pathway is fused, each path needs to know the representation learned by the other, so the fast pathway's data is fed into the slow pathway through a lateral connection. Illustratively, the lateral connection can be implemented as a 3D convolution with a 5×1² kernel (i.e., a kernel of size 5×1×1).
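For illustration, such a lateral connection might be a single 3D convolution like the one below; the channel counts and the temporal stride of 2 (matching the example sampling-rate ratio above) are assumptions:

```python
import torch.nn as nn

# 5x1x1 kernel: mixes 5 neighboring time steps with no spatial mixing;
# stride (2, 1, 1) downsamples the fast pathway's longer frame sequence
# to the slow pathway's temporal length before the two are fused.
lateral = nn.Conv3d(in_channels=8, out_channels=16,
                    kernel_size=(5, 1, 1), stride=(2, 1, 1),
                    padding=(2, 0, 0))
# fused_slow = torch.cat([slow_feats, lateral(fast_feats)], dim=1)
```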
In some embodiments of the present application, as shown in FIG. 3, the aforementioned S500 includes S510 and S520.

S510: For each non-key frame, compute the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix.
IOU (Intersection-over-Union) denotes the overlap rate of two detection boxes, i.e., the ratio of the intersection of the two boxes to their union.
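A sketch of the IOU cost matrix computation; treating the IOU distance as 1 − IOU is an assumption consistent with its use as a cost, and the function names are hypothetical:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def iou_cost_matrix(boxes_n, boxes_k):
    """boxes_n: N non-key-frame boxes; boxes_k: M key-frame boxes.
    Returns the N x M matrix of IOU distances (1 - IOU)."""
    return np.array([[1.0 - iou(a, b) for b in boxes_k] for a in boxes_n])
```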
S520: Based on the IOU cost matrix, use the Hungarian algorithm to match all detection boxes of the non-key frame with all detection boxes of the key frame.

For a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is smaller than the set threshold, the match succeeds; otherwise, it fails.
For example, denote the detection box list of a non-key frame as Bn and the detection box list of the key frame as B. Each detection box in Bn is written bbox_i, and each detection box in B is written Bbox_j. Matching each detection box bbox_i in Bn with each detection box Bbox_j in B produces the following outputs:

A. A list M of matched non-key-frame and key-frame detection boxes, each element being (index_i, index_j), where index_i is the detection box index in the non-key frame and index_j is the detection box index in the key frame.

B. A list F of unmatched non-key-frame detection boxes, each element being index_i, the detection box index in the non-key frame.
The specific matching steps are:

1) Initialization: let M ← ∅, F ← ∅.
2) Compute the IOU distances between the detection boxes in Bn (N of them) and the detection boxes in B (M of them), obtaining an N×M IOU cost matrix C.

3) Based on the IOU cost matrix C, use the Hungarian algorithm to complete the matching between the non-key frame's detection boxes and the key frame's detection boxes.

4) Set a threshold max_distance and go through the detection boxes bbox_i in Bn one by one: if a detection box Bbox_j in B matches bbox_i and the IOU distance between bbox_i and Bbox_j is less than max_distance, share pred_j with bbox_i, add (bbox_i, Bbox_j) to list M, and add "None" to list F.
If a detection box Bbox_j in B matches bbox_i but the IOU distance between bbox_i and Bbox_j is greater than max_distance, or no detection box in B matches bbox_i, add bbox_i to list F and add (None, None) to list M.
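As a sketch of steps 1) to 4), assuming SciPy's Hungarian solver and an iou_cost_matrix helper like the one above; max_distance and the function name are illustrative assumptions:

```python
from scipy.optimize import linear_sum_assignment

def match_boxes(cost, preds_key, max_distance=0.7):
    """cost: (N, M) IOU cost matrix between non-key-frame and key-frame
    boxes. preds_key: length-M action categories of the key-frame boxes.
    Returns (matched, unmatched): (i, j, shared category) triples for
    boxes that pass the match, and indices of boxes that do not."""
    matched, unmatched = [], []                  # 1) initialization
    rows, cols = linear_sum_assignment(cost)     # 3) Hungarian matching
    assignment = dict(zip(rows.tolist(), cols.tolist()))
    for i in range(cost.shape[0]):               # 4) per-box threshold check
        j = assignment.get(i)
        if j is not None and cost[i, j] < max_distance:
            matched.append((i, j, preds_key[j]))  # share pred_j with bbox_i
        else:
            unmatched.append(i)                   # shown without a category
    return matched, unmatched
```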
Based on the IOU cost matrix and the Hungarian algorithm, the present application matches the detection boxes of the non-key frames with the detection boxes of the key frame, improving the matching quality.

Prior art pedestrian action detection and recognition methods use Faster RCNN for pedestrian detection and suffer from serious missed detections in scenes with dense pedestrians and heavy occlusion.
To improve the pedestrian detection model's performance in scenes with dense pedestrians and heavy occlusion, in some embodiments the present application performs pedestrian detection as follows, with step S300 including steps S310 to S330.

S310: Input all frame images of the video clip into a YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The present application uses a YOLOX detection model for pedestrian detection. To improve its performance in complex, crowded scenes, the YOLOX detection model is retrained on pedestrian data. During detection, the video clip is input into the YOLOX detection model, which detects the pedestrians in all frame images and outputs the pedestrian detection boxes together with the confidence score of being recognized as a person.
S320: Perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

Non-maximum suppression (NMS) is a post-processing method applied in object detection that removes redundant detection boxes.

S330: Filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining a certain number of detection boxes and their confidence scores.
This step filters out the candidate detection boxes whose confidence scores are below the confidence threshold and finally outputs the detection boxes retained in each image together with their confidence scores.
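One possible sketch of the S320 to S330 post-processing, reusing the iou helper above; the NMS and confidence thresholds are example values a deployer would tune, not values fixed by the application:

```python
import numpy as np

def postprocess(boxes, scores, nms_threshold=0.45, conf_threshold=0.3):
    """boxes: (K, 4) candidate boxes as (x1, y1, x2, y2); scores: (K,)
    person confidences. Returns the surviving boxes and their scores."""
    order = scores.argsort()[::-1]               # highest confidence first
    keep = []
    while order.size > 0:                        # greedy non-maximum suppression
        i = order[0]
        keep.append(i)
        rest = order[1:]
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious <= nms_threshold]      # drop boxes overlapping box i
    keep = np.array(keep, dtype=int)
    mask = scores[keep] >= conf_threshold        # confidence filtering
    return boxes[keep][mask], scores[keep][mask]
```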
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times, and whose execution order need not be sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Some embodiments of the present application provide a pedestrian action continuous detection and recognition apparatus. As shown in FIG. 4, the apparatus includes a video segmentation module 1, a key frame selection module 2, a pedestrian detection module 3, an action recognition module 4, and a continuous recognition module 5.

The video segmentation module 1 is configured to divide the video to be detected into multiple video clips, each video clip including multiple frames of images.

The key frame selection module 2 is configured to select one frame image in each video clip as a key frame, with the remaining frame images as non-key frames.

For example, the middle frame image of each video clip is selected as the key frame and the remaining frame images as non-key frames.

The pedestrian detection module 3 is configured to, for each video clip, input all frame images of the video clip into the pedestrian detection model and obtain a certain number of detection boxes on each frame image of the video clip.

The action recognition module 4 is configured to, for each video clip, input all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame.

The continuous recognition module 5 is configured to, for each video clip, match all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame; if the match succeeds, the action category of the non-key frame's detection box is set to the action category of the matching key-frame detection box.
The apparatus of the present application may further include a presentation module.

The presentation module is configured to display all detection boxes of the key frame with their action categories, display the detection boxes of matched non-key frames with their action categories, and display the detection boxes of unmatched non-key frames.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some embodiments of the present application, the action recognition model includes a slow pathway and a fast pathway executed in parallel. Both pathways are convolutional neural networks, and the fast pathway's network has fewer channels than the slow pathway's network.

In some embodiments, based on the above action recognition model, the action recognition module of the present application includes a sampling unit, a feature map matrix extraction unit, a feature calculation unit, a probability calculation unit, and a category determination unit.

The sampling unit is configured to sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, where the first frame sequence contains fewer frame images than the second frame sequence.

The feature map matrix extraction unit is configured to input the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix.

The feature calculation unit is configured to perform a temporal pooling operation on each of the first and second feature map matrices, extract features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and perform a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features. The slow pathway features represent the static information of the video clip; the fast pathway features represent its dynamic information.

The probability calculation unit is configured to fuse the slow pathway features with the fast pathway features and apply a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category.

The category determination unit is configured to take the action category with the maximum probability as the action category of the detection box of the key frame.
The action recognition model also includes a lateral connection from the fast pathway to the slow pathway, which feeds the fast pathway's data into the slow pathway.

In some embodiments of the present application, the continuous recognition module includes an IOU cost matrix calculation unit and a matching unit.

The IOU cost matrix calculation unit is configured to compute, for each non-key frame, the IOU distances between all of its detection boxes and all detection boxes of the key frame, obtaining an IOU cost matrix.

The matching unit is configured to match, based on the IOU cost matrix, all detection boxes of the non-key frame with all detection boxes of the key frame using the Hungarian algorithm.

For a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is below the set threshold, the match succeeds; otherwise, it fails.
To improve the pedestrian detection model's performance in scenes with dense pedestrians and heavy occlusion, in some embodiments the pedestrian detection module of the present application includes a candidate detection box acquisition unit, an NMS unit, and a filtering unit.

The candidate detection box acquisition unit is configured to input all frame images of the video clip into the YOLOX detection model and obtain, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate box contains a person.

The NMS unit is configured to perform a non-maximum suppression operation on the candidate detection boxes according to a set NMS threshold.

The filtering unit is configured to filter out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, obtaining a certain number of detection boxes and their confidence scores.
The apparatus provided by the embodiments of the present application has the same implementation principles and technical effects as the foregoing method embodiments. For brevity, where this apparatus embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiments. Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may be found in the corresponding processes of the above method embodiments and are not repeated here.
The method embodiments provided in the present application may implement their business logic in a computer program recorded on a storage medium, which can be read and executed by a computer to achieve the effects of the solutions described in the method embodiments of this specification. Accordingly, the present application also provides a non-volatile computer-readable storage medium for pedestrian action continuous detection and recognition, including a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of the above embodiments.

Through detection box matching, the present application shares recognition results between key frames and non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
The storage medium may include a physical device for storing information; the information is usually digitized and then stored in a medium using electrical, magnetic, optical, or similar means. The storage medium may include: devices that store information electrically, e.g., various memories such as RAM and ROM; devices that store information magnetically, e.g., hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories, and USB flash drives; and devices that store information optically, e.g., CDs or DVDs. Of course, there are also other forms of readable storage media, such as quantum memories and graphene memories.

The storage medium described above may also admit other implementations consistent with the description of the method embodiments. The implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments; for details, refer to the description of the related method embodiments, which is not repeated here.
The present application also provides a computer device for pedestrian action continuous detection and recognition. The computer device may be a standalone computer, or may include an actual operating apparatus using one or more of the methods of this specification or the apparatus of one or more embodiments. The computer device for pedestrian action continuous detection and recognition may include at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method of any one or more of the above embodiments.

Through detection box matching, the present application shares the key frame recognition results with the non-key frames, thereby achieving frame-by-frame continuous detection and recognition of human actions in a video and eliminating the bias that relying only on key frame detection introduces into the evaluation of the video as a whole. Moreover, the detection boxes and recognition results of every frame image are presented in a continuous manner, improving the visual experience.
In some implementations of the present application, the computer device may be a terminal whose internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a communication interface, a display screen, and an input apparatus connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the computer device is used for wired or wireless communication with external terminals; the wireless mode may be implemented through WIFI, a mobile cellular network, NFC (near-field communication), or other technologies. When the computer program is executed by the processor, the pedestrian action continuous detection and recognition method is implemented. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input apparatus may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

The computer device described above may also admit other implementations consistent with the description of the method embodiments. The implementation principles and technical effects of this embodiment are the same as those of the foregoing method embodiments; for details, refer to the description of the related method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are merely specific implementations of the present application, used to illustrate rather than limit its technical solutions, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may still, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not depart in essence from the spirit and scope of the technical solutions of the embodiments of the present application and shall all be covered within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (15)

  1. A pedestrian action continuous detection and recognition method, characterized in that the method comprises:
    dividing a video to be detected into multiple video clips, each video clip including multiple frames of images;
    selecting one frame image in each video clip as a key frame, with the remaining frame images as non-key frames;
    for each video clip, inputting all frame images of the video clip into a pedestrian detection model and obtaining a certain number of detection boxes on each frame image of the video clip;
    for each video clip, inputting all frame images of the video clip and their detection boxes into an action recognition model to obtain the action category of each detection box of the key frame;
    for each video clip, matching all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame, and if the match succeeds, setting the action category of the non-key frame's detection box to the action category of the matching key-frame detection box.
  2. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that the action recognition model comprises a slow pathway and a fast pathway executed in parallel, both pathways being convolutional neural networks, the fast pathway's convolutional neural network having fewer channels than the slow pathway's convolutional neural network;
    said inputting, for each video clip, all frame images of the video clip and their detection boxes into the action recognition model to obtain the action category of each detection box of the key frame comprises:
    sampling the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the first frame sequence contains fewer frame images than the second frame sequence;
    inputting the first frame sequence and the second frame sequence into the slow pathway and the fast pathway, respectively, to extract features, obtaining a first feature map matrix and a second feature map matrix;
    performing a temporal pooling operation on each of the first and second feature map matrices, extracting features of the regions of interest from the two temporal pooling results based on the detection boxes of the key frame, and performing a spatial pooling operation on each, obtaining the slow pathway features and the fast pathway features, the slow pathway features representing the static information of the video clip and the fast pathway features representing the dynamic information of the video clip;
    fusing the slow pathway features with the fast pathway features and applying a fully connected operation followed by a softmax operation to the fused result to obtain the probability of each action category;
    taking the action category with the maximum probability as the action category of the detection box of the key frame.
  3. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that the action recognition model further comprises a lateral connection from the fast pathway to the slow pathway, the lateral connection feeding the fast pathway's data into the slow pathway.
  4. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that said matching, for each video clip, all detection boxes of each non-key frame of the video clip against all detection boxes of the key frame comprises:
    for each non-key frame, computing the IOU distances between all of its detection boxes and all detection boxes of the key frame to obtain an IOU cost matrix, where IOU denotes the overlap rate of two detection boxes, i.e., the ratio of the intersection of the two boxes to their union;
    based on the IOU cost matrix, using the Hungarian algorithm to match all detection boxes of the non-key frame with all detection boxes of the key frame;
    wherein, for a detection box of the non-key frame, if a detection box of the key frame matches it and the IOU distance between the two is smaller than a set threshold, the match succeeds; otherwise, it fails.
  5. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that said inputting, for each video clip, all frame images of the video clip into the pedestrian detection model and obtaining a certain number of detection boxes on each frame image of the video clip comprises:
    inputting all frame images of the video clip into a YOLOX detection model and obtaining, on each frame image of the video clip, a number of candidate detection boxes together with the confidence score that each candidate detection box contains a person;
    performing a non-maximum suppression (NMS) operation on the candidate detection boxes according to a set NMS threshold;
    filtering out, from the result of the non-maximum suppression operation, candidate detection boxes whose confidence scores fall below a set confidence threshold, to obtain the certain number of detection boxes and their confidence scores.
  6. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that the middle frame image of each video clip is selected as the key frame, with the remaining frame images as non-key frames.
  7. The pedestrian action continuous detection and recognition method according to any one of claims 1-6, characterized in that the method further comprises:
    displaying all detection boxes of the key frame with their action categories and the detection boxes of matched non-key frames with their action categories, and displaying the detection boxes of unmatched non-key frames.
  8. The pedestrian action continuous detection and recognition method according to claim 1, characterized in that two adjacent video clips have overlapping images.
  9. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that:
    each video clip includes 16 frames of images;
    said sampling the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence is specifically: sampling the video clip at frame sampling rates of 2 and 1, respectively, to obtain a first frame sequence containing 8 frames of images and a second frame sequence containing 16 frames of images.
  10. The pedestrian action continuous detection and recognition method according to claim 2, characterized in that said taking the action category with the maximum probability as the action category of the detection box of the key frame comprises: filtering out the probabilities below a set score threshold, then selecting the maximum among the remaining probabilities, and taking the action category corresponding to that maximum as the action category of the detection box of the key frame.
  11. A pedestrian action continuous detection and recognition apparatus, characterized in that the apparatus comprises:
    a video segmentation module, configured to divide the video to be detected into a plurality of video clips, each video clip including multiple frame images;
    a key frame selection module, configured to select one frame image in each video clip as the key frame, with the remaining frame images as non-key frames;
    a pedestrian detection module, configured to, for each video clip, input all frame images of the video clip into a pedestrian detection model and obtain a certain number of detection frames on each frame image of the video clip;
    an action recognition module, configured to, for each video clip, input all frame images of the video clip and their detection frames into an action recognition model to obtain the action category of each detection frame of the key frame;
    a continuous recognition module, configured to, for each video clip, match all detection frames of each non-key frame of the video clip against all detection frames of the key frame, and, if the match passes, set the action category of the non-key frame's detection frame to the action category of the key frame's detection frame matched with it.
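The claims do not fix the matching criterion used by the continuous recognition module; greedy IoU matching against the key frame's boxes is one plausible reading, sketched below. The 0.5 match threshold and all function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def propagate_labels(key_boxes, key_actions, frame_boxes, match_thresh=0.5):
    """For each non-key-frame detection frame, find the best-overlapping
    key-frame detection frame; if the match passes, copy its action category."""
    labels = []
    for box in frame_boxes:
        ious = [iou(box, kb) for kb in key_boxes]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else None
        passed = best is not None and ious[best] >= match_thresh
        labels.append(key_actions[best] if passed else None)
    return labels
```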
  12. The pedestrian action continuous detection and recognition apparatus according to claim 11, characterized in that it further comprises a presentation module, configured to display all detection frames of the key frame together with their action categories and the detection frames of the matched non-key frames together with their action categories, and to display the detection frames of the unmatched non-key frames without action categories.
  13. The pedestrian action continuous detection and recognition apparatus according to claim 11, characterized in that the action recognition module comprises:
    a sampling unit, configured to sample the video clip at different frame sampling rates to obtain a first frame sequence and a second frame sequence, wherein the first frame sequence contains fewer frame images than the second frame sequence;
    a feature map matrix extraction unit, configured to input the first frame sequence and the second frame sequence into the slow channel and the fast channel respectively to extract features, obtaining a first feature map matrix and a second feature map matrix respectively;
    a feature calculation unit, configured to perform a temporal pooling operation on the first feature map matrix and the second feature map matrix respectively, extract, on the two temporal pooling results, the features of the regions of interest based on the detection frames of the key frame, and perform a spatial pooling operation respectively, to obtain the features of the slow channel and the features of the fast channel, the features of the slow channel representing the static information of the video clip and the features of the fast channel representing the dynamic information of the video clip;
    a probability calculation unit, configured to fuse the features of the slow channel with the features of the fast channel and perform, in sequence, a fully connected operation and a softmax operation on the fusion result to obtain the probability of each action category;
    a category determination unit, configured to take the action category corresponding to the maximum probability value as the action category of the detection frame of the key frame.
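A PyTorch sketch of the head described by claim 13's feature calculation and probability calculation units: temporal pooling of each channel's feature map matrix, region-of-interest extraction at the key frame's detection frames, spatial pooling, fusion, then a fully connected layer and softmax. The `(1, C, T, H, W)` shapes, mean pooling, the 7×7 ROI size, and the use of `torchvision.ops.roi_align` with boxes already in feature-map coordinates are all assumptions for illustration, not details fixed by the claim.

```python
import torch
from torchvision.ops import roi_align

def head_probs(slow_maps, fast_maps, key_boxes, fc):
    """slow_maps / fast_maps: (1, C, T, H, W) feature map matrices from the
    slow and fast channels; key_boxes: float tensor (N, 4) of key-frame
    detection frames in feature-map coordinates; fc: Linear layer mapping
    the fused feature to num_classes logits."""
    feats = []
    for maps in (slow_maps, fast_maps):
        pooled_t = maps.mean(dim=2)                 # temporal pooling -> (1, C, H, W)
        # Prepend batch index 0 to each box; roi_align expects (N, 5) rois.
        rois = torch.cat([torch.zeros(len(key_boxes), 1), key_boxes], dim=1)
        roi = roi_align(pooled_t, rois, output_size=(7, 7))  # (N, C, 7, 7)
        feats.append(roi.mean(dim=(2, 3)))          # spatial pooling -> (N, C)
    fused = torch.cat(feats, dim=1)                 # fuse slow + fast features
    return torch.softmax(fc(fused), dim=1)          # probability per action category
```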
  14. A non-volatile computer-readable storage medium, characterized by comprising a memory storing processor-executable instructions which, when executed by the processor, implement the steps of the pedestrian action continuous detection and recognition method according to any one of claims 1-10.
  15. A computer device, characterized by comprising at least one processor and a memory storing computer-executable instructions, the processor implementing, when executing the instructions, the steps of the pedestrian action continuous detection and recognition method according to any one of claims 1-10.
PCT/CN2023/071627 2022-01-22 2023-01-10 Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device WO2023138444A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210075002.1 2022-01-22
CN202210075002.1A CN116563881A (en) 2022-01-22 2022-01-22 Pedestrian action continuous detection and recognition method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
WO2023138444A1 (en)

Family

ID=87347801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071627 WO2023138444A1 (en) 2022-01-22 2023-01-10 Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN116563881A (en)
WO (1) WO2023138444A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609497A (en) * 2017-08-31 2018-01-19 武汉世纪金桥安全技术有限公司 Real-time video face recognition method and system based on visual tracking technology
CN108256506A (en) * 2018-02-14 2018-07-06 北京市商汤科技开发有限公司 Object detection method and apparatus in video, and computer storage medium
US20190102908A1 (en) * 2017-10-04 2019-04-04 Nvidia Corporation Iterative spatio-temporal action detection in video
CN110427800A (en) * 2019-06-17 2019-11-08 平安科技(深圳)有限公司 Video object accelerated detection method and apparatus, server, and storage medium
CN111461010A (en) * 2020-04-01 2020-07-28 贵州电网有限责任公司 Power equipment identification efficiency optimization method based on template tracking
CN112651292A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based human body action recognition method and apparatus, medium, and electronic device

Also Published As

Publication number Publication date
CN116563881A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
Johnston et al. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
Zhou et al. Global and local-contrast guides content-aware fusion for RGB-D saliency prediction
EP3961485A1 (en) Image processing method, apparatus and device, and storage medium
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
WO2018177379A1 (en) Gesture recognition, gesture control and neural network training methods and apparatuses, and electronic device
TW202012885A (en) Electronic apparatus, method for controlling thereof, and method for controlling a server
GB2555136A (en) A method for analysing media content
CN111783620A (en) Expression recognition method, device, equipment and storage medium
CN113221743B (en) Table analysis method, apparatus, electronic device and storage medium
KR102576344B1 (en) Method and apparatus for processing video, electronic device, medium and computer program
Wen et al. CF-SIS: Semantic-instance segmentation of 3D point clouds by context fusion with self-attention
US11894021B2 (en) Data processing method and system, storage medium, and computing device
KR20200010993A (en) Electronic apparatus for recognizing facial identity and facial attributes in image through complemented convolutional neural network
Yu et al. A content-adaptively sparse reconstruction method for abnormal events detection with low-rank property
JP7242994B2 (en) Video event identification method, apparatus, electronic device and storage medium
WO2023273173A1 (en) Target segmentation method and apparatus, and electronic device
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN108537109B (en) OpenPose-based monocular camera sign language identification method
Li et al. Deep reasoning with multi-scale context for salient object detection
Dong et al. Holistic and Deep Feature Pyramids for Saliency Detection.
WO2023282847A1 (en) Detecting objects in a video using attention models
Gündüz et al. Turkish sign language recognition based on multistream data fusion
Pavlov et al. Application for video analysis based on machine learning and computer vision algorithms
WO2023138444A1 (en) Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23742759

Country of ref document: EP

Kind code of ref document: A1