WO2017129020A1 - Human behaviour recognition method and apparatus in video, and computer storage medium

Info

Publication number: WO2017129020A1
Authority: WIPO (PCT)
Application number: PCT/CN2017/071574
Other languages: French (fr), Chinese (zh)
Inventors: 姜育刚, 张殿凯, 沈琳, 瞿广财, 赵瑞伟, 雷晨雨
Original Assignee: 中兴通讯股份有限公司 (ZTE Corporation)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Definitions

  • the present invention relates to the field of video recognition technologies, and in particular, to a method, device and computer storage medium for human behavior recognition in video.
  • the existing video behavior analysis technology mainly includes three steps of detection, tracking and recognition.
  • the traditional method mainly extracts manually defined visual features, such as color histograms, SIFT and HoG, and then detects, tracks and classifies the target based on these features.
  • because the descriptive power of such manually defined features is limited, the recognition performance that can be achieved is often limited.
  • deep network models are also used to perform behavior detection and recognition in pictures or videos, since deep network models can learn better representations; examples include temporal models such as 3D-CNN, RCNN and two-stream networks.
  • these existing deep-network-based video classification methods are mainly general-purpose algorithms.
  • the prior art therefore has certain deficiencies and room for improvement; for example, in a monitoring scene, the behaviors of different types of people should be treated differently during identification.
  • some behaviors can be quickly identified from static images, such as fighting and cycling.
  • other behaviors are more regular over time, and analysis of continuous image frames is more helpful in distinguishing them, such as walking and (slow) running.
  • using a single model, as in the prior art, cannot take both aspects into account simultaneously, which affects real-time performance and accuracy.
  • embodiments of the present invention provide a method, an apparatus, and a computer storage medium for human body behavior recognition in a video.
  • for the human body region whose predicted value is a human body category, a behavior category score of the target in the human body region is calculated;
  • according to the behavior category score, the corresponding behavior category is output.
  • outputting the corresponding behavior category according to the behavior category score includes:
  • if the behavior category score is not higher than a threshold of a preset behavior category, calculating and outputting the corresponding behavior category in combination with the human body running track information.
  • calculating the behavior category score of the target in the human body region whose predicted value is a human body category includes:
  • calculating the behavior category score of the target of the human body region by combining the background area information corresponding to the background image and the neighboring target information.
  • calculating and outputting the corresponding behavior category in combination with the human body running track information includes:
  • weighting and summing the behavior category score and the result of the sequential superposition, and outputting the corresponding behavior category.
  • calculating the predicted value corresponding to the human body region according to the human body region and filtering the human body region whose predicted value is a non-human body category includes:
  • if the predicted value is a non-human body category, filtering the human body region whose predicted value is a non-human body category out of the acquired human body regions;
  • if the predicted value is a human body category, performing the step of calculating the behavior category score of the target in the human body region of the human body category.
  • detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region includes: acquiring the to-be-identified video, detecting the human body region in it, and tracking pedestrians in the human body region to obtain the human body running track information.
  • the embodiment of the invention further provides a device for recognizing human behavior in a video, the device comprising:
  • the detecting module is configured to detect a human body region in the to-be-identified video, and acquire information about the human body running track in the human body region;
  • the filtering module is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, to obtain the human body region whose predicted value is a human body category;
  • a calculation module configured to calculate, according to the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category;
  • An output module configured to output a corresponding behavior category according to the behavior category score.
  • the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category; and, if the behavior category score is not higher than the threshold of the preset behavior category, to calculate and output the corresponding behavior category in combination with the human body running track information.
  • the calculation module is configured to acquire a background image of the human body region whose predicted value is a human body category, obtain description information of the background image, calculate the background area information corresponding to the background image according to that description information, calculate the neighboring target information corresponding to the background image, and calculate the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information.
  • the output module is configured to acquire a current-time image of the to-be-identified video and a tracking area image corresponding to the human body running track information; sequentially superimpose the current-time image and the tracking area image; weight and sum the behavior category score and the sequentially superimposed result; and output the corresponding behavior category.
  • the filtering module is configured to acquire the human body region and perform analysis, and output a predicted value corresponding to the human body region; if the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered out of the acquired human body regions; if the predicted value is a human body category, the behavior category score of the target in the human body region of the human body category is calculated.
  • the detecting module is configured to acquire the to-be-identified video, detect a human body region in the to-be-identified video, and track pedestrians in the human body region to obtain the human body running track information in the human body region.
  • Embodiments of the present invention also provide a computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform a method of human behavior recognition in the video described above.
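  • To make the division of labor concrete, the following is a minimal Python sketch of how the four modules above could be wired together; the class, method and parameter names are illustrative assumptions, not from the patent:

```python
class BehaviorRecognizer:
    """Sketch of the device embodiment: detecting, filtering,
    calculation and output modules chained into one pipeline."""

    def __init__(self, detecting, filtering, calculation, output):
        self.detecting = detecting      # detects human regions and tracks them
        self.filtering = filtering      # drops regions predicted non-human (M1)
        self.calculation = calculation  # scores behavior categories (M2)
        self.output = output            # thresholds or fuses with track info (M3)

    def run(self, video):
        regions, tracks = self.detecting(video)
        human_regions = self.filtering(regions)
        scores = self.calculation(human_regions)
        return self.output(scores, tracks)
```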
  • Embodiments of the present invention provide a method, a device, and a computer storage medium for recognizing human behavior in a video.
  • the human body region in the video to be identified is detected and the human body running track information in the human body region is acquired; the predicted value corresponding to the human body region is calculated according to the human body region; the human body regions whose predicted value is a non-human body category are filtered out, obtaining the human body region whose predicted value is a human body category; that region is calculated to obtain the behavior category score of the target in it; and, according to the behavior category score, the corresponding behavior category is output. This solves the prior-art problems of poor human behavior recognition performance in video, poor real-time performance and low accuracy, and realizes real-time and accurate video recognition.
  • FIG. 1 is a schematic flow chart of a first embodiment of a method for human behavior recognition in the video of the present invention
  • FIG. 2 is a schematic structural diagram of a deep network model based on non-sequential input in an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a behavior recognition network model based on non-sequential input, fusion background and neighboring target features according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a behavior recognition network model based on time series input, fusion background and neighboring target features according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of a step of outputting a corresponding behavior category according to the behavior category score in the embodiment of the present invention
  • FIG. 6 is a schematic flowchart of a step of calculating, according to an embodiment of the present invention, the behavior category score of a target in a human body region whose predicted value is a human body category;
  • FIG. 7 is a schematic flowchart of a step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention
  • FIG. 8 is a schematic flowchart of a step of filtering a body region that is predicted to be a non-human body type according to the predicted value corresponding to the human body region calculated according to the human body region in the embodiment of the present invention
  • FIG. 9 is a schematic flowchart of a step of detecting a human body region in a video to be recognized and acquiring human body running track information in the human body region according to an embodiment of the present invention
  • FIG. 10 is a schematic diagram of functional modules of a first embodiment of a device for human behavior recognition in the video of the present invention.
  • the human body region in the video to be identified is detected, and the human body running track information in the human body region is acquired; the predicted value corresponding to the human body region is calculated according to the human body region, and the human body region whose predicted value is a non-human body category is filtered out, obtaining the human body region whose predicted value is a human body category; that human body region is calculated to obtain the behavior category score of the target in it; according to the behavior category score, the corresponding behavior category is output.
  • a first embodiment of the present invention provides a method for human behavior recognition in a video, including:
  • Step S1 detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region.
  • the executor of the method of the embodiment of the present invention may be a video monitoring device or a video identification device.
  • this embodiment takes a video monitoring device as an example, but is of course not limited thereto; any other device capable of recognizing human behavior in video may be used.
  • the video monitoring device detects the human body region in the to-be-identified video, and acquires the human body running track information in the human body region.
  • the video monitoring device obtains the to-be-identified video and detects the human body region in it; in a specific implementation, the video surveillance device can obtain the original video to be identified through a front-end video capture device and detect the human body region in the video using a detector based on traditional feature classification.
  • after acquiring the to-be-identified video and detecting the human body region in it, the video monitoring device tracks the pedestrians in the human body region to obtain the human body running track information; in a specific implementation, a pedestrian tracking algorithm based on detection area matching can be used to track pedestrians in the picture and obtain the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
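  • As a concrete illustration, here is a minimal Python sketch of how such a detection-and-tracking record could be stored; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetObservation:
    """O(i, t): the record of target i at time t."""
    target_id: int     # i, the tracked target's ID
    time: int          # t, the frame index
    image: np.ndarray  # I_t(i): image content of the detection area
    region: tuple      # R_t(i) = (x, y, w, h): upper-left corner, width, height

# A track is the time-ordered list of observations sharing one target ID.
tracks: dict[int, list[TargetObservation]] = {}
```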
  • Step S2 Calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, and obtain the human body region whose predicted value is a human body category.
  • the video monitoring device calculates the predicted value corresponding to the human body region according to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body region whose predicted value is a human body category.
  • the video monitoring device acquires and analyzes the human body region and outputs a predicted value corresponding to it; the predicted value is either a human body category or a non-human body category. In a specific implementation, after acquiring a human body region in the current frame, the video monitoring device inputs the image of the region into the background-filtering M1 network model for analysis.
  • the structure of the M1 network model is shown in FIG. 2: it is a deep convolutional network model based on single-frame image input. The input is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation.
  • the last layer of the M1 network has dimension 2; after a sigmoid transformation, the two outputs correspond to the category scores for the human body category and the non-human body category.
  • if the predicted value is a non-human body category, the human body region is filtered out of the acquired human body regions; through this classification by the M1 network model, regions that the earlier detection and tracking algorithms misdetected as the human body category can be removed. Since the network at this stage is computed only on the foreground images generated by the detection step (instead of the entire image), it introduces no significant computational overhead and improves detection accuracy while satisfying the real-time requirements of the entire system.
  • the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
  • a deep network model with a relatively simple structure is used to further filter the detected foreground regions. In the earlier detection step, the algorithm intentionally lowers the threshold for foreground prediction so that as many foreground areas as possible are returned, minimizing the missed-detection rate. Since the filtering network is computed only on the foreground images generated by the detection step (instead of the entire image), its computational overhead is small, detection accuracy is improved, and the real-time requirements of the entire system are well satisfied.
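  • The following PyTorch-style sketch illustrates such a single-frame background-filtering network; the number of layers, channel counts and input size are illustrative assumptions (the patent itself notes that layer counts should be tuned to the monitoring picture size and deployment hardware):

```python
import torch
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Single-frame CNN that scores a detected foreground crop
    as human body category vs. non-human body category."""
    def __init__(self):
        super().__init__()
        # Several CONV + ReLU + pooling blocks (depth is adjustable).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Several fully connected layers; the final layer is 2-dimensional.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),  # assumes 64x64 input crops
            nn.Linear(128, 2),
        )

    def forward(self, crop):  # crop: (N, 3, 64, 64) foreground image
        logits = self.classifier(self.features(crop))
        return torch.sigmoid(logits)  # scores for (human, non-human)
```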
  • step S3: the human body region whose predicted value is the human body category is calculated to obtain the behavior category score of the target in the human body region of the human body category.
  • after the filtering step, the video monitoring device calculates, for the human body region whose predicted value is the human body category, the behavior category score of the target in that region.
  • the video monitoring device obtains a background image of the human body region whose predicted value is a human body category and obtains description information of the background image; in a specific implementation, if the prediction obtained from the M1 network model is the human body category (i.e., a foreground in the image), the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and fuses background and neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the M2 network model is shown in FIG. 3: its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the area where the target is located can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • the video monitoring device calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features of the first hidden layer of the model come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
  • the video monitoring device calculates the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information. In a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
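  • A minimal PyTorch-style sketch of this hidden-layer fusion follows; concatenating the three FC1 features and the layer sizes are assumptions (the patent states only that the fusion happens at the first fully connected layer):

```python
import torch
import torch.nn as nn

class M2FusionHead(nn.Module):
    """Sketch of M2 fusion at the first FC layer: the target crop's FC1
    feature, its background crop's FC1 feature, and the element-wise max
    over neighboring targets' FC1 features."""
    def __init__(self, conv: nn.Module, fc1: nn.Linear, num_classes: int, hidden: int = 256):
        super().__init__()
        self.conv, self.fc1 = conv, fc1      # shared conv stack + first FC layer
        self.tail = nn.Sequential(           # subsequent fully connected layers
            nn.Linear(fc1.out_features * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),  # one score per behavior category
        )

    def fc1_feat(self, img):                 # FC_1(x) = f_1(c_m(...c_1(x)))
        return self.fc1(self.conv(img).flatten(1))

    def forward(self, target_img, bg_img, neighbor_imgs):
        h_t = self.fc1_feat(target_img)      # FC_1(I_t(i))
        h_b = self.fc1_feat(bg_img)          # FC_1(B_t(i))
        if neighbor_imgs:                    # element-wise max over neighbors
            h_n = torch.stack([self.fc1_feat(n) for n in neighbor_imgs]).max(dim=0).values
        else:
            h_n = torch.zeros_like(h_t)      # no neighbors: zero feature (assumption)
        fused = torch.cat([h_t, h_b, h_n], dim=-1)
        return torch.softmax(self.tail(fused), dim=-1)  # per-category probabilities
```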
  • Step S4 outputting a corresponding behavior category according to the behavior category score.
  • after completing the calculation of the behavior category score of the target in the human body region whose predicted value is the human body category, the video monitoring device outputs the corresponding behavior category according to that score.
  • specifically, for some categories with obvious static characteristics, if the output category score is higher than a certain threshold, that category is directly output as the final prediction result.
  • embodiments of the present invention treat the different behavior types in surveillance video according to their different static and dynamic characteristics, analyzing the images with sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures, and finally fusing the outputs of the two networks to obtain the final behavior recognition result. In particular, for behavior categories with clear static characteristics, such as fighting and cycling, the embodiment mainly relies on a sufficiently complex non-sequential-input network model for fast prediction, because these motion features are obvious and, once they appear, can generally be judged accurately from a single frame. For behavior categories that are difficult to judge from a single frame image, such as walking and jogging, the recognition proceeds as follows.
  • the corresponding behavior category is calculated and output in combination with the human body running track information.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's images at the current and previous times as the input of the multi-frame sequential-input behavior recognition network model M3, which fuses background and neighboring target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • the video monitoring device sequentially superimposes the current-time image and the tracking area images; in a specific implementation, the video monitoring device uses the M3 network model with the motion trajectory information, taking as the model input the sequential superposition of the same target's tracking area images at the current time and several previous times, i.e., the stacked sequence (I_{t-n}(i), ..., I_{t-1}(i), I_t(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
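  • A minimal sketch of building this sequential input by stacking a target's tracking-area crops over time; the window length, resizing and padding policy are illustrative assumptions:

```python
import numpy as np
import cv2

def build_m3_input(track: list[np.ndarray], n_frames: int = 8, size: int = 64) -> np.ndarray:
    """Stack the same target's tracking-area images at the current and
    previous moments into one tensor, the sequential input of M3."""
    crops = track[-n_frames:]                       # most recent crops I_{t-k}(i) .. I_t(i)
    crops = [cv2.resize(c, (size, size)) for c in crops]
    while len(crops) < n_frames:                    # pad short tracks by repeating oldest crop
        crops.insert(0, crops[0])
    return np.stack(crops, axis=0)                  # shape: (n_frames, size, size, 3)
```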
  • after sequentially superimposing the current-time image and the tracking area images, the video monitoring device weights and sums the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category; in a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected.
  • the fusion method can be a weighted sum of the two networks' results, with the weights obtained by fitting on the training set.
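  • A minimal sketch of this weighted fusion; the single scalar weight and the argmax decision are assumptions (the patent says only that the weights are fitted on a training set):

```python
import numpy as np

def fuse_scores(m2_scores: np.ndarray, m3_scores: np.ndarray, w: float = 0.5) -> int:
    """Weighted sum of the M2 (single-frame) and M3 (sequential) category
    scores; w would be chosen by fitting on a training set."""
    combined = w * m2_scores + (1.0 - w) * m3_scores
    return int(np.argmax(combined))  # index of the predicted behavior category
```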
  • the embodiment of the present invention, drawing on the characteristics of behaviors appearing in surveillance video, designs a fusion method based on hidden-layer features in the single-frame-input and multi-frame-input networks, combining the current target's foreground, background image information and neighboring target information.
  • this enriches the information available to the classification network, so that the deep model used for classification can simultaneously utilize the background area of the current target and the behavior information of other targets in the adjacent area; this is very valuable auxiliary information for behavior recognition in surveillance video and improves the behavior recognition performance of the entire system.
  • the embodiment of the invention provides a method for human body behavior recognition in a video, which improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 5, the process of outputting a corresponding behavior category according to the behavior category score is as follows.
  • step S4 includes:
  • Step S41 If the behavior category score is higher than a threshold of the preset behavior category, the behavior category is output.
  • after completing the calculation of the behavior category score of the target in the human body region whose predicted value is the human body category, the video monitoring device outputs the corresponding behavior category according to that score.
  • specifically, for some categories with obvious static characteristics, if the output category score is higher than a certain threshold, that category is directly output as the final prediction result.
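  • A minimal sketch of this two-branch decision (steps S41 and S42); the per-category threshold layout and the fallback signature are illustrative assumptions:

```python
def decide(category_scores, thresholds, temporal_branch):
    """Step S41: if the best static (M2) score clears its preset
    per-category threshold, output that category directly.
    Step S42: otherwise fall back to the trajectory-based (M3) branch."""
    best = max(range(len(category_scores)), key=lambda k: category_scores[k])
    if category_scores[best] > thresholds[best]:
        return best
    return temporal_branch()
```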
  • step S42 if the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output according to the human body running track information.
  • the corresponding behavior category is calculated and output in combination with the human body running track information.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's images at the current and previous times as the input of the multi-frame sequential-input behavior recognition network model M3, which fuses background and neighboring target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • after acquiring the current-time image of the to-be-identified video and the tracking area image corresponding to the human body running track information, the video monitoring device sequentially superimposes the current-time image and the tracking area images; in a specific implementation, the video monitoring device uses the M3 network model with the motion trajectory information, taking the sequential superposition of the same target's tracking area images at the current and previous times as the input of the model.
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • after sequentially superimposing the current-time image and the tracking area images, the video monitoring device weights and sums the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category; in a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected.
  • the fusion method can be a weighted sum of the two networks' results, with the weights obtained by fitting on the training set.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 6, the human body region whose predicted value is the human body category is calculated to obtain the behavior category score of the target in it.
  • step S3 includes:
  • Step S31 Obtain a background image of the human body region whose predicted value is a human body category, and obtain description information of the background image.
  • after the non-human-target filtering step outputs the predicted value corresponding to the human body region and the human body regions whose predicted value is a non-human body category are filtered out, the video monitoring device acquires the background image of the human body region whose predicted value is the human body category and obtains the description information of the background image.
  • the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and is based on neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the network model is shown in Figure 3.
  • the hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in Figure 3; the background image of the target region can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • Step S32 Calculate background area information corresponding to the background image according to the description information of the background image, and calculate neighboring target information corresponding to the background image.
  • the video monitoring device calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image.
  • the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features composing the model's first hidden layer come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
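  • A minimal sketch of selecting adjacent targets by center distance against a preset threshold; the function and parameter names are illustrative:

```python
import numpy as np

def neighbor_ids(regions: dict[int, tuple], i: int, dist_thresh: float) -> list[int]:
    """Select adjacent targets: those whose center lies within dist_thresh
    of target i's center; centers come from R_t = (x, y, w, h)."""
    def center(r):
        x, y, w, h = r
        return np.array([x + w / 2.0, y + h / 2.0])
    ci = center(regions[i])
    return [j for j, r in regions.items()
            if j != i and np.linalg.norm(center(r) - ci) < dist_thresh]
```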
  • Step S33 calculating the behavior category score of the target of the human body region in combination with the background region information corresponding to the background image and the neighboring target information.
  • after calculating the background area information corresponding to the background image according to its description information and calculating the neighboring target information corresponding to the background image, the video monitoring device combines the background area information with the neighboring target information and calculates the behavior category score of the target of the human body region.
  • in a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • FIG. 7 is a schematic flowchart of the step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention.
  • step S42 includes:
  • Step S421 Acquire a current time image of the video and a tracking area image corresponding to the human body running track information.
  • the video monitoring device acquires a current time image of the to-be-identified video and a tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's previous-time images as the input of the multi-frame sequential-input behavior recognition network model M3, based on background and adjacent target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • Step S422 sequentially superimposing the current time image and the tracking area image.
  • the video monitoring device sequentially superimposes the current time image and the tracking area image.
  • the video monitoring device uses the M3 network model with the motion trajectory information, taking the sequential superposition of the same target's tracking images at the current and previous times as the input of the model.
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • Step S423 weighting and summing the behavior category score and the result of performing the sequential superposition, and outputting the corresponding behavior category.
  • the video monitoring device weights and sums the behavior category score and the sequentially superimposed result, and outputs the corresponding behavior category.
  • the video monitoring device combines the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method may be a weighted sum of the two networks' results, and the weights may be obtained by fitting on the training set.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • FIG. 8 is a schematic flowchart of the step of calculating the predicted value corresponding to the human body region according to the human body region and filtering the human body region whose predicted value is a non-human body category.
  • step S2 includes:
  • Step S21 acquiring the human body region and performing analysis, and outputting a predicted value corresponding to the human body region.
  • after detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region, the video monitoring device acquires the human body region, performs analysis, and outputs the predicted value corresponding to the human body region.
  • the video monitoring device inputs the image of the human body region into the background filtering network M1 network model for analysis, and the structure of the M1 network model is as shown in FIG. 2 .
  • the M1 network model is a deep convolutional network model based on single-frame image input; the input of the network is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation.
  • the last layer of the network has dimension 2; after a sigmoid transformation, the two outputs correspond to the category scores for the human body category and the non-human body category.
  • Step S22 If the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered from the acquired human body region.
  • through classification by the M1 network model, the video monitoring device can filter out regions that the earlier detection and tracking algorithms misdetected as the human body category. Since the network at this stage is computed only on the foreground images generated by the detection step (instead of the entire image), it introduces no significant computational overhead and improves detection accuracy while satisfying the real-time requirements of the entire system. At the same time, the number of convolutional and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring picture and the hardware performance of the deployed device.
  • if the predicted value is a human body category, the step of calculating the behavior category score of the target in the human body region of the human body category is performed.
  • the video monitoring device performs the above step S3 to calculate the behavior category score of the target in the human body region in which the predicted value is the human body category.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 9, the human body region in the video to be identified is detected and the human body running track information in the human body region is acquired.
  • step S1 includes:
  • Step S11 Acquire the video to be identified, and detect a human body region in the target video.
  • the video monitoring device acquires the to-be-identified video and detects the human body region in the target video.
  • the video surveillance device can obtain the original video to be identified through the front-end video capture device, and detect the human body region in the video by using a detector based on the traditional feature classification.
  • Step S12 Tracking pedestrians in the human body region to obtain human body running track information in the human body region.
  • the video monitoring device tracks the pedestrian in the human body region to obtain the human body running track information in the human body region.
  • the video monitoring device may track the pedestrians in the picture using a pedestrian tracking algorithm based on detection area matching, thereby obtaining the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • the present invention also provides a corresponding apparatus embodiment.
  • a first embodiment of the present invention provides a device for recognizing human behavior in a video, including:
  • the detecting module 100 is configured to detect a human body region in the video to be identified and acquire the human body running track information in the human body region.
  • the device in the embodiment of the present invention may be a video monitoring device or a video identification device.
  • this embodiment takes a video monitoring device as an example, but is of course not limited thereto; any other device capable of recognizing human behavior in video may be used.
  • the detecting module 100 detects a human body region in the video to be identified, and acquires human body running track information in the human body region.
  • the video monitoring device obtains the to-be-identified video and detects the human body region in it; in a specific implementation, the video surveillance device can obtain the original video to be identified through a front-end video capture device and detect the human body region in the video using a detector based on traditional feature classification.
  • the detecting module 100 tracks the pedestrians in the human body region to obtain the human body running track information in the human body region; in a specific implementation, the video monitoring device can use a pedestrian tracking algorithm based on detection area matching to track pedestrians in the picture and obtain the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
  • the filtering module 200 is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category to obtain a human body region whose predicted value is a human body category.
  • the filtering module 200 calculates the predicted value corresponding to the human body region according to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body region whose predicted value is a human body category.
  • the video monitoring device acquires and analyzes the human body region and outputs a predicted value corresponding to it; the predicted value is either a human body category or a non-human body category. In a specific implementation, after acquiring a human body region in the current frame, the video monitoring device inputs the image of the region into the background-filtering M1 network model for analysis.
  • the structure of the M1 network model is shown in Figure 2.
  • the M1 network model is a deep convolutional network model based on single-frame image input. The input of the network is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation. The last layer of the network is the output layer; its dimension is 2, and after a sigmoid transformation the two outputs correspond to the category scores for the human body category and the non-human body category.
  • the filtering module 200 filters the human body regions whose predicted value is a non-human body category out of the acquired human body regions; through classification by the M1 network model, regions that the earlier detection and tracking algorithms misdetected as the human body category can be filtered out.
  • the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
  • a deep network model with a relatively simple structure is used to further filter the detected foreground regions. In the earlier detection step, the algorithm intentionally lowers the threshold for foreground prediction so that as many foreground areas as possible are returned, minimizing the missed-detection rate. Since the filtering network is computed only on the foreground images generated by the detection step (instead of the entire image), its computational overhead is small, detection accuracy is improved, and the real-time requirements of the entire system are well satisfied.
  • the calculation module 300 is configured to calculate, for the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category.
  • the calculation module 300 calculates, for the human body region whose predicted value is the human body category, the behavior category score of the target in that region.
  • the video monitoring device obtains a background image of the human body region whose predicted value is a human body category and obtains description information of the background image; in a specific implementation, if the prediction obtained from the M1 network model is the human body category (i.e., a foreground in the image), the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and fuses background and neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the M2 network model is shown in FIG. 3: its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the area where the target is located can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • the calculation module 300 calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features of the first hidden layer of the model come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
  • the calculation module 300 calculates the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information. In a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
  • the output module 400 is configured to output a corresponding behavior category according to the behavior category score.
  • In a specific implementation, the output module 400 outputs the corresponding behavior category according to the behavior category score.
  • If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; that is, if the output score on some category with obvious static characteristics is higher than a certain threshold, that category is directly output as the final prediction result.
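  • In code, this early-exit rule could be sketched as follows; the category names and the threshold value are placeholders, not values given by the patent:

```python
def decide_from_single_frame(m2_scores, static_threshold=0.8):
    """Return a category directly if the M2 (single-frame) model's best score
    falls on a statically obvious category and exceeds the threshold;
    otherwise return None to defer to the sequential M3 model."""
    static_categories = {"fighting", "cycling"}   # behaviors judged per frame
    best = max(m2_scores, key=m2_scores.get)      # m2_scores: {category: score}
    if best in static_categories and m2_scores[best] > static_threshold:
        return best
    return None
```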
  • The embodiments of the present invention distinguish among the types of behaviors that appear in surveillance video and, according to their different static and dynamic characteristics, analyze the images with sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures, finally combining the outputs of the two different networks to obtain the final behavior recognition result.
  • In particular, for behavior categories with clear static characteristics, such as fighting and cycling, the embodiments of the present invention rely mainly on a sufficiently complex non-sequential input network model for fast prediction, because such actions are visually obvious and, once they appear, can generally be judged accurately from a single frame image. For behavior categories that are difficult to judge from a single frame image, such as walking and jogging, the sequential processing described below is used.
  • If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body running track information to calculate and output the corresponding behavior category.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • In a specific implementation, the video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • The output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the tracking area images of the same target at the current time and the preceding times, in temporal order, as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
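  • A minimal sketch of assembling this sequential input by stacking the same target's time-ordered patches along the channel axis; the stacking axis is an assumption of this example, since the patent only specifies an ordered superposition:

```python
import numpy as np

def stack_track_patches(patches):
    """Given the same target's tracking-area patches ordered from earliest
    to current time, e.g. k+1 arrays of shape (H, W, 3) resized upstream
    to a common size, stack them along the channel axis into
    (H, W, 3*(k+1)) as the M3 network input."""
    return np.concatenate(list(patches), axis=-1)
```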
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • After the current time image and the tracking area images are sequentially superimposed, the output module 400 performs a weighted summation of the behavior category score and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected.
  • The fusion method can be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
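  • A sketch of this weighted fusion of the two score vectors; the default weight here is a placeholder that would in practice be fitted on the training set:

```python
import numpy as np

def fuse_scores(m2_scores: np.ndarray, m3_scores: np.ndarray, w: float = 0.5):
    """Weighted sum of the M2 and M3 category score vectors; returns the
    index of the winning category and the fused scores."""
    fused = w * m2_scores + (1.0 - w) * m3_scores
    return int(np.argmax(fused)), fused
```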
  • The invention takes into account the characteristics of the behaviors that appear in surveillance video and designs a hidden-layer feature fusion method for both the single-frame-input and the multi-frame-input networks. The new implicit feature, formed by combining the current target foreground, the background image information and the neighboring target information, enriches the information available to the classification network, so that the depth model used for classification can simultaneously utilize the background region of the current target and the behavior information of other targets in the adjacent area. This provides very valuable auxiliary information for behavior recognition in surveillance video and enhances the behavior recognition performance of the entire system.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The output module 400 is further configured to output the behavior category if the behavior category score is higher than a threshold of the preset behavior category, and, if the behavior category score is not higher than the threshold of the preset behavior category, to calculate and output the corresponding behavior category in combination with the human body running track information.
  • In a specific implementation, the output module 400 outputs the corresponding behavior category according to the behavior category score.
  • If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; that is, if the output score on some category with obvious static characteristics is higher than a certain threshold, that category is directly output as the final prediction result.
  • If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body running track information to calculate and output the corresponding behavior category.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • In a specific implementation, the video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • The output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the tracking area images of the same target at the current time and the preceding times, in temporal order, as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • After the current time image and the tracking area images are sequentially superimposed, the output module 400 performs a weighted summation of the behavior category score and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected.
  • The fusion method can be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The calculation module 300 is further configured to acquire a background image of the human body region whose predicted value is the human body category and obtain description information of the background image; to calculate, according to the description information of the background image, the background region information corresponding to the background image, and to calculate the neighboring target information corresponding to the background image; and to combine the background region information and the neighboring target information corresponding to the background image to obtain the behavior category score of the target of the human body region.
  • In a specific implementation, the calculation module 300 acquires the background image of the human body region whose predicted value is the human body category, and obtains the description information of the background image.
  • If the prediction result of the M1 network model is the human body category, the video monitoring device can use the M2 network model, a structurally more complex and more discriminative non-sequential-input behavior recognition network that fuses neighboring target features, to recognize the behavior of each human body region in a single frame image.
  • The structure of the M2 network model is shown in FIG. 3. Its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets; the feature fusion takes place at the first fully connected layer of the network, shown as the first FC layer in FIG. 3. The background image of the target region can be obtained from a preset clean background image, by taking the portion corresponding to the position of the detection region.
  • The complete background image can be obtained from a preset standard background image or from a dynamically updated background model. Denote the background image obtained for a target i at time t as B_t^(i); then for a target region, its description information can be expressed as:
  • O(i,t) = (I_t^(i), R_t^(i), B_t^(i));
  • where I_t^(i) and B_t^(i) share the same location region R_t^(i).
  • The calculation module 300 calculates the background region information corresponding to the background image according to the description information of the background image, and calculates the neighboring target information corresponding to the background image.
  • In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimensionality is the same as that of the first hidden-layer feature obtained from the target image.
  • For the target image, the feature calculation process of its first hidden layer can be expressed as:
  • FC_1(I_t^(i)) = f_1(c_m(...c_1(I_t^(i))));
  • where c(·) denotes a convolution operation on an image and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image, the feature of its first hidden layer is:
  • FC_1(B_t^(i)) = f_1(c_m(...c_1(B_t^(i))));
  • some of the features of the first hidden layer of the model are features from adjacent targets, which are mainly from the target features in the adjacent regions of the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • The central location of the current target is:
  • (x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);
  • where x_t^(i) and y_t^(i) are the horizontal and vertical coordinates of the upper-left corner of the target region and w_t^(i) and h_t^(i) are its width and height. The center points of the other foreground targets in the same picture are computed in the same way; when the Euclidean distance d_ij between two centers is smaller than a certain threshold D, or the two regions overlap, that foreground target is counted as a valid neighboring target of the current target.
  • The calculating module 300 combines the background region information corresponding to the background image with the neighboring target information to calculate the behavior category score of the target of the human body region. In a specific implementation, the video monitoring device can denote the set of first fully connected layer features computed from all valid neighboring target regions as {FC_1(I_t^(j))}, and separately take the per-dimension maximum FC_1^max and the weighted mean FC_1^avg of these feature values as the two components of the neighboring target description; if the current target has no neighboring target in the picture, these values are all set to zero.
  • After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as the concatenation [FC_1(I_t^(i)), FC_1(B_t^(i)), N_t^(i)]. This fused feature passes through the subsequent fully connected layers, so that the entire network model naturally exploits the background information and context information of the current target during recognition.
  • the M2 network model output is a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output represents the prediction probability on the category.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The output module 400 is further configured to acquire the current time image of the to-be-identified video and the tracking area images corresponding to the human body running track information; to sequentially superimpose the current time image and the tracking area images; and to perform a weighted summation of the behavior category score and the result of the sequential superposition, outputting the corresponding behavior category.
  • In a specific implementation, the output module 400 acquires the current time image of the video to be identified and the tracking area images corresponding to the human body running track information.
  • The video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • After acquiring the current time image of the video and the tracking area images corresponding to the human body running track information, the output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the sequential superposition of the same target's tracking images at the current time and the preceding times as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • The output module 400 performs a weighted summation of the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category.
  • In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method may be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
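  • The patent leaves the fitting procedure open; one simple possibility, shown here purely as an assumption, is a grid search over candidate weights on held-out training predictions:

```python
import numpy as np

def fit_fusion_weight(m2_val, m3_val, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the M2/M3 fusion weight that maximizes accuracy on a validation
    split. m2_val and m3_val are (num_samples, num_categories) score arrays,
    labels the ground-truth category indices."""
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        pred = (w * m2_val + (1.0 - w) * m3_val).argmax(axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```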
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The filtering module 200 is further configured to acquire the human body region, analyze it, and output the predicted value corresponding to the human body region; if the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered out of the acquired human body regions; if the predicted value is the human body category, the behavior category score of the target in the human body region whose predicted value is the human body category is calculated.
  • the filtering module 200 acquires the human body region and performs analysis, and outputs the predicted value corresponding to the human body region.
  • the video monitoring device inputs the image of the human body region into the background filtering network M1 network model for analysis, and the structure of the M1 network model is as shown in FIG. 2 .
  • The M1 network model is a deep convolutional network model based on single-frame image input. The input of the network is the detected foreground region image, followed by several convolution layers (CONV) with accompanying ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation.
  • The output dimension of the last layer of the network is 2; after a sigmoid transformation, the two dimensions correspond to the behavior category scores of the human body category and the non-human body category, respectively.
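  • The following PyTorch sketch shows a network of this shape; the specific layer counts and widths are illustrative assumptions, which matches the patent's remark that they can be tuned to the monitoring picture size and deployment hardware:

```python
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Single-frame CNN over a detected foreground patch with a 2-way
    output (human / non-human category scores) after a sigmoid."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(             # CONV + ReLU + pooling
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(           # fully connected layers
            nn.Flatten(), nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Sigmoid(),
        )

    def forward(self, patch):
        return self.classifier(self.features(patch))
```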
  • The filtering module 200 filters the human body regions whose predicted value is a non-human body category out of the acquired human body regions. In a specific implementation, after classification by the M1 network model, the video monitoring device can filter out regions that the earlier detection and tracking algorithms misdetected as the human body category. Since the network at this stage is computed only on the foreground images produced by the detection step (rather than on the whole image), it does not introduce significant computational overhead, and it improves detection accuracy while satisfying the real-time requirements of the entire system. Moreover, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring picture and the hardware performance of the deployed device.
  • If the predicted value is the human body category, the filtering module 200 performs the step of calculating the behavior category score of the target in the human body region whose predicted value is the human body category.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The detecting module 100 is further configured to acquire the to-be-identified video and detect the human body region in the to-be-identified video, and to track the human body in the human body region to obtain the human body running track information in the human body region.
  • the detecting module 100 acquires a video to be identified, and detects a human body region in the target video.
  • the video surveillance device can obtain the original video to be identified through the front-end video capture device, and detect the human body region in the video by using a detector based on the traditional feature classification.
  • the detecting module 100 tracks the pedestrian in the human body region to obtain the human body running track information in the human body region.
  • In a specific implementation, the video monitoring device may track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion track information of the human bodies in the picture.
  • The result of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection area images, namely:
  • O(i,t) = (I_t^(i), R_t^(i));
  • where O(i,t) represents the information of target i at time t, I_t^(i) is the image content detected for the target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, in vector form (x, y, w, h), the horizontal and vertical coordinates of the upper-left corner of the region together with its width and height.
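  • For illustration, such a record could be represented by a small Python data class; the field names are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class TrackObservation:
    """One detection-and-tracking record O(i, t)."""
    target_id: int                       # i
    time: int                            # t
    image: np.ndarray                    # I_t^(i): cropped detection-area patch
    region: Tuple[int, int, int, int]    # R_t^(i) = (x, y, w, h)
```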
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • In practical applications, the detection module 100, the filtering module 200, the calculation module 300, and the output module 400 in the device for human behavior recognition in video can be implemented by a central processing unit (CPU), a micro control unit (MCU), a digital signal processor (DSP), or a field-programmable gate array (FPGA).
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention can take the form of a hardware embodiment, a software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device then provide the steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • an embodiment of the present invention further provides a computer storage medium, the computer storage medium comprising a set of instructions, when the instruction is executed, causing at least one processor to perform the method for recognizing human behavior in the video.
  • The solution provided by the embodiments of the present invention detects the human body region in the to-be-identified video and acquires the human body running track information in the human body region; calculates the predicted value corresponding to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body regions whose predicted value is the human body category; calculates, for the human body regions whose predicted value is the human body category, the behavior category score of the target in each such region; and outputs the corresponding behavior category according to the behavior category score, thereby improving the real-time performance and accuracy of video recognition.
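  • Putting the pieces together, the per-target decision flow described above can be sketched as follows; every interface, threshold and weight in this sketch is an illustrative assumption rather than part of the patent:

```python
import numpy as np

def recognize_target(m1, m2, m3, target, static_threshold=0.8, w=0.5):
    """M1 filters non-human regions; M2 scores the single frame; statically
    obvious behaviors are emitted immediately; otherwise M3 scores the
    stacked track and the two score vectors are fused by a weighted sum.
    m1/m2/m3 are callables returning NumPy score vectors."""
    human_score, _ = m1(target.patch)            # (human, non-human) scores
    if human_score < 0.5:
        return None                              # filtered as background
    m2_scores = m2(target.patch, target.background, target.neighbors)
    if m2_scores.max() > static_threshold:
        return int(m2_scores.argmax())           # fast single-frame verdict
    m3_scores = m3(target.track, target.background_track, target.neighbors)
    return int((w * m2_scores + (1.0 - w) * m3_scores).argmax())
```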

Abstract

A human behaviour recognition method and apparatus in a video, and a computer storage medium. The method comprises: detecting a human region in a video to be recognised, and acquiring human moving track information in the human region (S1); according to the human region, obtaining a prediction value corresponding to the human region by means of calculation, and filtering a human region, the prediction value of which is a non-human category, so as to obtain a human region, the prediction value of which is a human category (S2); calculating the human region, the prediction value of which is a human category, so as to obtain a behaviour category score of a target in the human region, the prediction value of which is a human category (S3); and according to the behaviour category score, outputting a corresponding behaviour category (S4).

Description

视频中人体行为识别的方法、装置和计算机存储介质Method, device and computer storage medium for human behavior recognition in video 技术领域Technical field
本发明涉及视频识别技术领域,尤其涉及一种视频中人体行为识别的方法、装置和计算机存储介质。The present invention relates to the field of video recognition technologies, and in particular, to a method, device and computer storage medium for human behavior recognition in video.
背景技术Background technique
现有的视频行为分析技术主要包括检测、跟踪和识别三个步骤。传统的方法主要是提取一些人工定义的视觉特征,比如颜色直方图、SIFT、HoG等,然后根据这些特征进行目标的检测、跟踪和分类等。然而由于这些传统特征的计算方法是通过人为定义的,特征的描述能力比较有限。实际应用中如果全部依赖传统的方法实现检测、跟踪及识别系统,所能达到的识别性能往往比较有限。The existing video behavior analysis technology mainly includes three steps of detection, tracking and recognition. The traditional method is mainly to extract some manually defined visual features, such as color histogram, SIFT, HoG, etc., and then based on these features to detect, track and classify the target. However, since the calculation methods of these traditional features are artificially defined, the ability to describe features is limited. In practical applications, if all rely on traditional methods to implement detection, tracking and identification systems, the recognition performance that can be achieved is often limited.
与传统方法相对的是使用深度网络模型完成图片或视频中的行为检测与识别。通过深度网络的模型能够学习到更好的特征描述,目前已经有一些使用基于深度学习的方法在视频分析中的工作成果,包括3D-CNN、RCNN、two-streams等时序模型的应用。这些现有的基于深度网络的视频分类方法主要是一些通用的算法,在对于监控视频中的人体行为识别这一特定的应用场景,现有技术存在一定的不足与改善空间,例如,在监控的场景中对于不同类型的人的行为,在识别的过程中应该区别对待。有些行为通过静态的画面就能够迅速识别,比如打架、骑车等,有些动作则时序性上的规律较强,借助连续图像帧分析更有助于区分,比如走路与(慢)跑等行为。现有技术中使用单一的模型不能同时兼顾以上两个方面,影响实时性和准确性。In contrast to traditional methods, deep network models are used to perform behavior detection and recognition in pictures or videos. The model of the deep network can learn better characterization. At present, there are some work results in the video analysis using the deep learning-based method, including the application of time series models such as 3D-CNN, RCNN, and two-streams. These existing deep network-based video classification methods are mainly general-purpose algorithms. In the specific application scenario for human behavior recognition in surveillance video, the prior art has certain deficiencies and improvement spaces, for example, in monitoring. The behavior of different types of people in the scene should be treated differently in the process of identification. Some behaviors can be quickly identified through static images, such as fighting, cycling, etc. Some actions are more regular in timing, and continuous image frame analysis is more helpful in distinguishing, such as walking and (slow) running. The use of a single model in the prior art cannot simultaneously take into account the above two aspects, affecting real-time and accuracy.
发明内容Summary of the invention
为解决现有存在的技术问题,本发明实施例提供一种视频中人体行为识别的方法、装置和计算机存储介质。 In order to solve the existing technical problems, embodiments of the present invention provide a method, an apparatus, and a computer storage medium for human body behavior recognition in a video.
本发明实施例的技术方案是这样实现的:The technical solution of the embodiment of the present invention is implemented as follows:
本发明实施例提供的视频中人体行为识别的方法,包括:A method for human behavior recognition in a video provided by an embodiment of the present invention includes:
检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息;Detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region;
根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域;Calculating a predicted value corresponding to the human body region according to the human body region, and filtering the human body region whose predicted value is a non-human body category, and obtaining the human body region whose predicted value is a human body category;
对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分;Calculating, by the body region of the human body category, the predicted value is a behavior category score of the target in the human body region of the human body category;
根据所述行为类别得分,输出相应的行为类别。According to the behavior category score, the corresponding behavior category is output.
优选地,根据所述行为类别得分,输出相应的行为类别,包括:Preferably, according to the behavior category score, the corresponding behavior category is output, including:
若所述行为类别得分高于预设行为类别的阈值,则输出所述行为类别;Outputting the behavior category if the behavior category score is higher than a threshold of a preset behavior category;
若所述行为类别得分不高于预设行为类别的阈值,则结合所述人体运行轨迹信息,计算并输出相应的行为类别。If the behavior category score is not higher than a threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body running trajectory information.
优选地,所述对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分,包括:Preferably, the calculating, by the body area of the human body category, the predicted value is a behavior category score of the target in the human body region of the human body category, including:
获取所述预测值为人体类别的人体区域的背景图像,得到所述背景图像的描述信息;Obtaining a background image of the human body region whose predicted value is a human body category, and obtaining description information of the background image;
根据所述背景图像的描述信息,计算所述背景图像对应的背景区域信息,并计算所述背景图像对应的邻近目标信息;Calculating background region information corresponding to the background image according to the description information of the background image, and calculating neighboring target information corresponding to the background image;
结合所述背景图像对应的背景区域信息和邻近目标信息,计算得到所述人体区域的目标的行为类别得分。Combining the background region information corresponding to the background image and the neighboring target information, the behavior category score of the target of the human body region is calculated.
优选地,所述结合所述人体运行轨迹信息,计算并输出相应的行为类别,包括:Preferably, the combining the human body running track information, calculating and outputting a corresponding behavior category, including:
获取所述待识别视频的当前时刻图像和所述人体运行轨迹信息对应的跟踪区域图像;Obtaining a current time image of the to-be-identified video and a tracking area image corresponding to the human body running track information;
将所述当前时刻图像和所述跟踪区域图像进行顺序叠加;And sequentially superimposing the current time image and the tracking area image;
对所述行为类别得分和所述进行顺序叠加后的结果进行加权求和,输出对应的行为类别。The behavior category score and the result of the sequential superposition are weighted and summed, and the corresponding behavior category is output.
优选地,所述根据所述人体区域计算得到所述人体区域对应的预测值, 对所述预测值为非人体类别的人体区域进行过滤,包括:Preferably, the predicting value corresponding to the human body region is calculated according to the human body region, Filtering the human body area whose predicted value is non-human body category, including:
获取所述人体区域并进行分析,输出所述人体区域对应的预测值;Obtaining and analyzing the human body region, and outputting a predicted value corresponding to the human body region;
若所述预测值为非人体类别,则将所述预测值为非人体类别的人体区域从所述获取的人体区域中进行过滤;If the predicted value is a non-human body category, filtering the human body region whose predicted value is a non-human body category from the acquired human body region;
若所述预测值为人体类别,则执行计算所述预测值为人体类别的人体区域中的目标的行为类别得分的步骤。If the predicted value is a human body category, a step of calculating the behavior category score of the target in the human body region of the human body category is performed.
优选地,所述检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息,包括:Preferably, the detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region includes:
获取所述待识别视频,对所述待识别视频中的人体区域进行检测;Obtaining the to-be-identified video, and detecting a human body region in the to-be-identified video;
对所述人体区域中的行人进行跟踪,得到所述人体区域中的人体运行轨迹信息。Tracking pedestrians in the human body region to obtain human body running track information in the human body region.
本发明实施例还提出一种视频中人体行为识别的装置,所述装置包括:The embodiment of the invention further provides a device for recognizing human behavior in a video, the device comprising:
检测模块,配置为检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息;The detecting module is configured to detect a human body region in the to-be-identified video, and acquire information about the human body running track in the human body region;
过滤模块,配置为根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域;The filtering module is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, to obtain the human body region whose predicted value is a human body category;
计算模块,配置为对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分;a calculation module configured to calculate, according to the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category;
输出模块,配置为根据所述行为类别得分,输出相应的行为类别。An output module configured to output a corresponding behavior category according to the behavior category score.
优选地,所述输出模块,配置为若所述行为类别得分高于预设行为类别的阈值,则输出所述行为类别;若所述行为类别得分不高于预设行为类别的阈值,则结合所述人体运行轨迹信息,计算并输出相应的行为类别。Preferably, the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category; if the behavior category score is not higher than a threshold of a preset behavior category, The human body runs track information, and calculates and outputs a corresponding behavior category.
优选地,所述计算模块,配置为获取所述预测值为人体类别的人体区域的背景图像,得到所述背景图像的描述信息;根据所述背景图像的描述信息,计算所述背景图像对应的背景区域信息,并计算所述背景图像对应的邻近目标信息;结合所述背景图像对应的背景区域信息和邻近目标信息,计算得到所述人体区域的目标的行为类别得分。Preferably, the calculating module is configured to acquire a background image of the human body region whose predicted value is a human body category, to obtain description information of the background image, and calculate corresponding to the background image according to the description information of the background image. The background area information is calculated, and the neighboring target information corresponding to the background image is calculated; and the behavior category score of the target of the human body area is calculated according to the background area information corresponding to the background image and the neighboring target information.
优选地,所述输出模块,配置为获取所述待识别视频的当前时刻图像 和所述人体运行轨迹信息对应的跟踪区域图像;将所述当前时刻图像和所述跟踪区域图像进行顺序叠加;对所述行为类别得分和所述进行顺序叠加后的结果进行加权求和,输出对应的行为类别。Preferably, the output module is configured to acquire an image of a current moment of the to-be-identified video a tracking area image corresponding to the human body running track information; sequentially superimposing the current time image and the tracking area image; weighting and summing the behavior category score and the sequentially superimposed result, and outputting The corresponding behavior category.
优选地,所述过滤模块,配置为获取所述人体区域并进行分析,输出所述人体区域对应的预测值;若所述预测值为非人体类别,则将所述预测值为非人体类别的人体区域从所述获取的人体区域中进行过滤;若所述预测值为人体类别,则计算所述预测值为人体类别的人体区域中的目标的行为类别得分。Preferably, the filtering module is configured to acquire the human body region and perform analysis, and output a predicted value corresponding to the human body region; if the predicted value is a non-human body category, the predicted value is a non-human body category. The human body region is filtered from the acquired human body region; if the predicted value is a human body category, the predicted value is calculated as a behavioral category score of the target in the human body region of the human body category.
优选地,所述检测模块,配置为获取所述待识别视频,对所述待识别视频中的人体区域进行检测;对所述人体区域中的行人进行跟踪,得到所述人体区域中的人体运行轨迹信息。Preferably, the detecting module is configured to acquire the to-be-identified video, detect a human body region in the to-be-identified video, and track a pedestrian in the human body region to obtain a human body operation in the human body region. Track information.
本发明实施例还提供一种计算机存储介质,所述计算机存储介质包括一组指令,当执行所述指令时,引起至少一个处理器执行上述的视频中人体行为识别的方法。Embodiments of the present invention also provide a computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform a method of human behavior recognition in the video described above.
本发明实施例提供了一种视频中人体行为识别的方法、装置和计算机存储介质,通过检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息;根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域;对预测值为人体类别的人体区域进行计算得到预测值为人体类别的人体区域中的目标的行为类别得分;根据行为类别得分,输出相应的行为类别,解决了现有技术中识别视频中人体行为性能较差,实时性和准确性较低的问题。实现了提升视频识别的实时性和准确性。Embodiments of the present invention provide a method, a device, and a computer storage medium for recognizing a human body in a video. The human body region in the human body region is acquired by detecting a human body region in the video to be identified, and the human body region corresponding to the human body region is calculated according to the human body region. The predicted value is used to filter the human body region whose predicted value is non-human body type, and the human body region whose predicted value is the human body type is obtained; and the human body region whose predicted value is the human body type is calculated to obtain the target value in the human body region whose predicted value is the human body category. The behavior category score; according to the behavior category score, the corresponding behavior category is output, which solves the problem that the human behavior performance in the identified video in the prior art is poor, real-time and low accuracy. Realize the real-time and accuracy of video recognition.
附图说明DRAWINGS
图1是本发明视频中人体行为识别的方法第一实施例的流程示意图;1 is a schematic flow chart of a first embodiment of a method for human behavior recognition in the video of the present invention;
图2是本发明实施例中基于非时序输入深度网络模型结构示意图;2 is a schematic structural diagram of a network model based on a non-sequential input depth in an embodiment of the present invention;
图3是本发明实施例中基于非时序输入,融合背景与邻近目标特征的行为识别网络模型结构示意图; 3 is a schematic structural diagram of a behavior recognition network model based on non-sequential input, fusion background and neighboring target features according to an embodiment of the present invention;
图4是本发明实施例中基于时序输入,融合背景与邻近目标特征的行为识别网络模型结构示意图;4 is a schematic structural diagram of a behavior recognition network model based on time series input, fusion background and neighboring target features according to an embodiment of the present invention;
图5是本发明实施例中根据所述行为类别得分,输出相应的行为类别的步骤的一种流程示意图;FIG. 5 is a schematic flowchart of a step of outputting a corresponding behavior category according to the behavior category score in the embodiment of the present invention; FIG.
图6是本发明实施例中对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分的步骤的一种流程示意图;6 is a schematic flowchart of a step of calculating, according to an embodiment of the present invention, a behavior type score of a target in a human body region in which a predicted value is a human body region;
图7是本发明实施例中结合所述人体运行轨迹信息,计算并输出相应的行为类别的步骤的一种流程示意图;FIG. 7 is a schematic flowchart of a step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention; FIG.
图8是本发明实施例中根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤的步骤的一种流程示意图;FIG. 8 is a schematic flowchart of a step of filtering a body region that is predicted to be a non-human body type according to the predicted value corresponding to the human body region calculated according to the human body region in the embodiment of the present invention; FIG.
图9是本发明实施例中检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息的步骤的一种流程示意图;9 is a schematic flowchart of a step of detecting a human body region in a video to be recognized and acquiring human body running track information in the human body region according to an embodiment of the present invention;
图10是本发明视频中人体行为识别的装置第一实施例的功能模块示意图。FIG. 10 is a schematic diagram of functional modules of a first embodiment of a device for human behavior recognition in the video of the present invention.
具体实施方式detailed description
本发明目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The implementation, functional features, and advantages of the present invention will be further described in conjunction with the embodiments.
应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
在本发明的各种实施例中:检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息;根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域;对预测值为人体类别的人体区域进行计算得到预测值为人体类别的人体区域中的目标的行为类别得分;根据行为类别得分,输出相应的行为类别。In various embodiments of the present invention, the human body area in the video to be identified is detected, and the human body running track information in the human body area is acquired; the predicted value corresponding to the human body area is calculated according to the human body area, and the predicted value is a non-human body type. The region is filtered to obtain a human body region whose predicted value is a human body category; the body region region whose predicted value is a human body category is calculated to obtain a behavior category score of a target in a human body region whose predicted value is a human body category; according to the behavior category score, the corresponding output is output. Behavior category.
由此,解决了现有技术中识别视频中人体行为性能较差,实时性和准确性较低的问题。实现了提升视频识别的实时性和准确性。 Therefore, the problem that the performance of the human body in the recognized video in the prior art is poor, real-time and low accuracy is solved. Realize the real-time and accuracy of video recognition.
如图1所示,本发明第一实施例提出一种视频中人体行为识别的方法,包括:As shown in FIG. 1 , a first embodiment of the present invention provides a method for human behavior recognition in a video, including:
步骤S1,检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息。Step S1: detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region.
本发明实施例方法的执行主体可以为一种视频监控设备或视频识别设备,本实施例以视频监控设备进行举例,当然也不限定于其他能够实现识别视频中人体行为的设备。The executor of the method of the embodiment of the present invention may be a video monitoring device or a video identification device. This embodiment is exemplified by a video monitoring device, and is of course not limited to other devices capable of realizing human behavior in the video.
具体地,视频监控设备检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息。Specifically, the video monitoring device detects the human body region in the to-be-identified video, and acquires the human body running track information in the human body region.
其中,视频监控设备获取待识别视频,对目标视频中的人体区域进行检测;在具体实现时,视频监控设备可以通过前端视频采集设备来获取待识别的原始视频,并使用基于传统特征分类的检测器对视频中的人体区域进行检测。The video monitoring device obtains the to-be-identified video and detects the human body region in the target video. In a specific implementation, the video surveillance device can obtain the original video to be identified through the front-end video capture device, and use the detection based on the traditional feature classification. The device detects the human body area in the video.
其中,在完成获取待识别视频,对目标视频中的人体区域进行检测后,视频监控设备对人体区域中的行人进行跟踪,得到人体区域中的人体运行轨迹信息;在具体实现时,视频监控设备可使用基于检测区域匹配的跟踪算法对画面中的行人进行跟踪,从而得到画面中的人体的运动轨迹信息。After the acquisition of the to-be-identified video and the detection of the human body region in the target video, the video monitoring device tracks the pedestrians in the human body region to obtain the human body running track information in the human body region; in specific implementation, the video monitoring device A pedestrian tracking algorithm based on detection area matching can be used to track pedestrians in the picture to obtain motion trajectory information of the human body in the picture.
其中,人体检测与跟踪的结果可以以目标ID与检测区域图像序列的形式保存,即:The result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely:
O(i,t)=It (i),Rt (i)O (i, t) = I t (i), R t (i);
其中O(i,t)代表目标i在t时刻的信息,It (i)是该目标在t时刻检测到的图像内容,Rt (i)是该目标在t时刻所在区域的位置,Rt (i)中使用向量(x,y,w,h)的形式记录区域的左上角横、纵坐标位置与长、宽值。Where O(i,t) represents the information of the target i at time t, I t (i) is the image content detected by the target at time t, and R t (i) is the position of the target at the time t, R In t (i) , the horizontal and vertical coordinate positions and the length and width values of the upper left corner of the recording area are used in the form of vectors (x, y, w, h).
步骤S2,根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域。Step S2: Calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, and obtain the human body region whose predicted value is a human body category.
具体地,在完成检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息后,视频监控设备根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体 类别的人体区域。Specifically, after detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region, the video monitoring device calculates the predicted value corresponding to the human body region according to the human body region, and the human body region with the predicted value is a non-human body category. Filtering to get the predicted value of the human body The human body area of the category.
其中,视频监控设备获取人体区域并进行分析,输出人体区域对应的预测值,预测值包括人体类别和非人体类别;在具体实现时,当获取到当前帧中某一个人体区域后,视频监控设备将该人体区域的图像输入到背景过滤网络M1网络模型中进行分析,M1网络模型的结构如图2所示,M1网络模型是一个基于单帧图像输入的深度卷积网络模型;其中,网络的输入为检测到的前景区域图像,后接若干个附带ReLU层和pooling层的卷积层(Convolution Layers,CONV),再接上若干个全连通层(Fully Connection Layers,FC)进行深度的特征计算,M1网络的最后一层输出层的维数为2维,经过sigmoid变换后分别对应人体类别与非人体类别上的行为类别得分。The video monitoring device acquires and analyzes the human body region, and outputs a predicted value corresponding to the human body region, and the predicted value includes a human body category and a non-human body category; in a specific implementation, after acquiring a certain human body region in the current frame, the video monitoring device The image of the human body region is input into the background filtering network M1 network model for analysis. The structure of the M1 network model is shown in FIG. 2, and the M1 network model is a deep convolution network model based on single frame image input; The input is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU layer and pooling layer, and then connected with several Fully Connection Layers (FC) for depth feature calculation. The dimension of the last layer of the M1 network is 2 dimensions. After sigmoid transformation, it corresponds to the behavior category scores of the human body category and the non-human body category.
其中,若预测值为非人体类别,则将预测值为非人体类别的人体区域从获取的人体区域中进行过滤;通过M1网络模型的分类后,可以过滤掉前期检测与跟踪算法误测为人体类别的区域。由于此时的网络仅在检测环节产生的前景图像上进行计算(而非整张图像上),所以并不会产生明显的计算开销,在提高检测准确率的同时,能够满足整个系统实时性上的要求。同时,M1网络模型中的卷积层、全连通层的个数可以根据监控画面的大小与所部署设备的硬件性能等因素进行调整。Wherein, if the predicted value is a non-human body type, the human body region whose predicted value is a non-human body category is filtered from the acquired human body region; after the classification by the M1 network model, the pre-detection and tracking algorithm may be filtered out to be misdetected as a human body. The area of the category. Since the network at this time is only calculated on the foreground image generated by the detection link (instead of the entire image), it does not generate significant computational overhead, and can improve the detection accuracy while satisfying the real-time performance of the entire system. Requirements. At the same time, the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
其中,本发明实施例在检测与跟踪环节后首先使用了一个结构相对简单的深度网络模型对检测到的前景区域进行进一步的过滤处理;在前期的检测环节,实现时有意降低算法对于前景预测的阈值,使算法尽可能返回更多的前景区域,尽量减少漏检率的产生。由于此时的网络仅在检测环节产生的前景图像上进行计算(而非整张图像上),所以大大减少了算法的计算开销,在提高检测准确率的同时,很好地满足了整个系统实时性上的要求。In the embodiment of the present invention, after the detection and tracking step, a deep network model with relatively simple structure is used to further filter the detected foreground region; in the early detection link, the algorithm intentionally reduces the algorithm for foreground prediction. Thresholds allow the algorithm to return as many foreground areas as possible, minimizing the rate of missed detection. Since the network at this time is only calculated on the foreground image generated by the detection link (instead of the entire image), the computational overhead of the algorithm is greatly reduced, and the detection accuracy is improved, and the entire system is well satisfied. Sexual requirements.
步骤S3,对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分。In step S3, the human body region whose predicted value is the human body type is calculated to obtain the behavior category score of the target in the human body region of the human body category.
具体地,在完成根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域后,视频监控设备对预测值为人体类别的人体区域进行计算得到预测值 为人体类别的人体区域中的目标的行为类别得分。Specifically, after the predicted value corresponding to the human body region is calculated according to the human body region, the human body region whose predicted value is not the human body type is filtered, and after the predicted human body region of the human body category is obtained, the video monitoring device predicts the human body region. The body area is calculated to obtain a predicted value The behavior category score for the target in the body region of the human body category.
其中,视频监控设备获取预测值为人体类别的人体区域的背景图像,得到背景图像的描述信息;在具体实现时,如果M1网络模型得到的预测结果是人体类别(即画面中的前景),视频监控设备可以使用一个结构更复杂、识别能力更强的基于邻近目标特征的非时序输入行为识别M2网络模型对单帧图像内的每个人体区域进行行为的识别,该网络模型的结构如图3所示;M2网络模型的隐藏层中加入了当前人体目标所在背景图像和邻近目标隐藏层的特征信息,特征融合的位置在于网络的第一个全连通层,如图3中的第一个FC层所示;其中目标所在区域的背景图像可以从预先设定的纯净的背景图像中获得,只要取其中对应检测区域位置的部分即可。完整的背景图像可以通过预先设定的标准背景图像获得,或通过动态更新的背景模型获得。记某一目标i在t时刻得到的背景图像为Bt (i),那么对于一个目标区域,可以将它的描述信息表示为:The video monitoring device obtains a background image of the human body region whose predicted value is a human body type, and obtains description information of the background image; in a specific implementation, if the predicted result obtained by the M1 network model is a human body category (ie, a foreground in the image), the video The monitoring device can identify the behavior of each human body region in a single frame image by using a non-sequential input behavior based on neighboring target features with more complex structure and more recognizable capability. The structure of the network model is shown in FIG. 3. As shown in the figure, the hidden layer of the M2 network model adds the background image of the current human target and the feature information of the adjacent target hidden layer. The location of the feature fusion lies in the first fully connected layer of the network, as shown in the first FC in FIG. The layer is shown; the background image of the area where the target is located can be obtained from a preset pure background image, as long as the portion corresponding to the position of the detection area is taken. The complete background image can be obtained from a pre-set standard background image or via a dynamically updated background model. Note that the background image obtained by a target i at time t is B t (i) , then for a target area, its description information can be expressed as:
O(i,t)=It (i),Rt (i),Bt (i)O(i,t)=I t (i) , R t (i) , B t (i) ;
其中,It (i)和Bt (i)共用同一个位置区域Rt (i)Where I t (i) and B t (i) share the same location area R t (i) .
其中,在完成获取预测值为人体类别的人体区域的背景图像,得到背景图像的描述信息后,视频监控设备根据背景图像的描述信息,计算背景图像对应的背景区域信息,并计算背景图像对应的邻近目标信息;在具体实现时,背景图像会经过若干个卷积层得到它的视觉特征描述,然后经过一个全连通层得到它对应的第一个隐含层特征,它的维度与目标图像得到的第一个隐含层的维度相同。对于目标图像,它的第一个隐含层的特征计算过程可以表示为:After the background image of the body region of the human body category is obtained and the description information of the background image is obtained, the video monitoring device calculates the background region information corresponding to the background image according to the description information of the background image, and calculates the background image corresponding to the background image. Adjacent to the target information; in the specific implementation, the background image will obtain its visual feature description through several convolutional layers, and then obtain a corresponding first hidden layer feature through a fully connected layer, and its dimension and target image are obtained. The first hidden layer has the same dimensions. For the target image, the feature calculation process of its first hidden layer can be expressed as:
FC1(It (i))=f1(cm(...c1It (i)));FC 1 (I t (i) )=f 1 (c m (...c 1 I t (i) ));
其中,c(·)代表对于图像的卷积运算,f(·)代表全连接层的矩阵乘法操作与偏置量操作。类似的,对于背景位置图像,记它的第一个隐含层的特征为:Where c(·) represents a convolution operation for an image, and f(·) represents a matrix multiplication operation and an offset amount operation of the fully connected layer. Similarly, for a background position image, the characteristics of its first hidden layer are:
FC1(Bt (i))=f1(cm(...c1Bt (i)));FC 1 (B t (i) )=f 1 (c m (...c 1 B t (i) ));
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:
(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
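A minimal sketch of this neighbor-selection rule, assuming regions are stored as (x, y, w, h) rows as defined earlier; the function name and array layout are illustrative:

```python
import numpy as np

def valid_neighbors(regions, i, D):
    """Return the indices of valid neighboring targets of target i.

    regions: array of shape (K, 4), one (x, y, w, h) row per foreground
    target in the frame (upper-left corner plus width and height).
    A target j != i is a valid neighbor when the Euclidean distance d_ij
    between the two region centers is below the threshold D, or when the
    two boxes intersect.
    """
    x, y, w, h = regions[i]
    cx, cy = x + w / 2.0, y + h / 2.0
    neighbors = []
    for j, (xj, yj, wj, hj) in enumerate(regions):
        if j == i:
            continue
        d_ij = np.hypot(xj + wj / 2.0 - cx, yj + hj / 2.0 - cy)
        intersect = (x < xj + wj and xj < x + w and
                     y < yj + hj and yj < y + h)  # axis-aligned overlap test
        if d_ij < D or intersect:
            neighbors.append(j)
    return neighbors
```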
After the background region information corresponding to the background image and the neighboring target information have been calculated from the description information of the background image, the video monitoring device combines the two to calculate the behavior category score of the target in the human body region. In a specific implementation, the video monitoring device can record the set of first fully connected layer features computed from all neighboring target regions as {FC_1(I_t^(j))}, j ∈ N(i), and separately compute, over every dimension of these feature values, the maximum

F_max = max_{j ∈ N(i)} FC_1(I_t^(j))

and the weighted average

F_avg = Σ_{j ∈ N(i)} w_j · FC_1(I_t^(j)),

as components of the feature description of the neighboring targets. Concatenating these two groups of features yields the overall feature representation of the neighboring target description:

F_nb = [F_max, F_avg].

If the current target has no neighboring target in the frame, the values of F_nb are all set to zero. After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:

FC_1' = [FC_1(I_t^(i)), FC_1(B_t^(i)), F_nb].
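A sketch of this aggregation step, assuming the notation above (F_max, F_avg, and F_nb are reconstructed symbol names, and normalization of the weights is an assumption):

```python
import numpy as np

def neighbor_feature(fc1_neighbors, weights, dim):
    """Build F_nb = [F_max, F_avg] from the FC1 features of the valid
    neighboring targets; returns zeros when there are no neighbors.

    fc1_neighbors: list of 1-D arrays of length dim (possibly empty).
    weights: one weight per neighbor for the weighted average.
    """
    if not fc1_neighbors:
        return np.zeros(2 * dim)                 # no neighbors: all zeros
    feats = np.stack(fc1_neighbors)              # shape (K, dim)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalized weights (assumed)
    f_max = feats.max(axis=0)                    # per-dimension maximum
    f_avg = (w[:, None] * feats).sum(axis=0)     # weighted average
    return np.concatenate([f_max, f_avg])

# The fused first-layer feature is then the concatenation
# [FC1(I_t), FC1(B_t), F_nb], fed to the remaining FC layers.
```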
This fused feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and the context information of the current target during recognition.

The output of the M2 network model is a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
Step S4: output the corresponding behavior category according to the behavior category score.
Specifically, after the behavior category scores of the targets in the human body regions whose predicted value is the human category have been calculated, the video monitoring device outputs the corresponding behavior category according to those scores.

If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output. In particular, if the score output at this stage on a category with obvious static characteristics exceeds a certain threshold, that category is output directly as the final prediction result.
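A sketch of this early-exit decision in the cascade, assuming per-category thresholds; the names and the dict-based interface are illustrative:

```python
def cascade_decision(m2_scores, static_categories, thresholds):
    """If the M2 (single-frame) score of any category with obvious static
    characteristics exceeds its preset threshold, output that category
    immediately; otherwise return None and defer to the temporal M3 model."""
    for c in static_categories:                 # e.g. fighting, cycling
        if m2_scores[c] > thresholds[c]:
            return c                            # early exit: final prediction
    return None                                 # fall through to M3 / fusion
```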
For the different types of behavior occurring in surveillance video, the embodiments of the present invention employ, according to their different static and dynamic characteristics, sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures to analyze the extracted images, and finally fuse the outputs of the two different networks to obtain the final behavior recognition result. Specifically, for behavior categories with clear static characteristics, such as fighting or cycling, the embodiments rely mainly on a sufficiently complex non-sequential input network model for fast prediction: the features of these actions are obvious, so once they appear they can generally be judged accurately from a single frame. For behavior categories that are difficult to judge from a single frame, such as walking and jogging, a deep network that takes temporally stacked images as input is used for further analysis, providing more reliable recognition performance than a network fed a single static image. In addition, the fusion strategy between the sequential-input and non-sequential-input deep classification models is designed around the idea of a cascade classifier, which improves the operating efficiency of the whole classification system and meets the requirement of real-time behavior recognition.
If the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body trajectory information.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can take the superposition of the images of the same target at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.

After the current-time image and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:
S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
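A minimal sketch of this temporal stacking, assuming the crops are stacked along the channel axis; the axis choice and the number of frames n are assumptions:

```python
import numpy as np

def stack_track_crops(crops):
    """Stack the tracked region crops of one target in temporal order as
    the M3 input: oldest frame first, current frame last.

    crops: list of n arrays shaped (H, W, 3); returns shape (H, W, 3 * n).
    """
    return np.concatenate(crops, axis=-1)
```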
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.
The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After the current-time image and the tracking region images have been superimposed in order, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.
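A sketch of this score-level fusion; alpha is the assumed M2 weight, fitted on a training set as described:

```python
import numpy as np

def fuse_scores(m2_scores, m3_scores, alpha):
    """Weighted sum of the M2 and M3 category score vectors; returns the
    index of the predicted category and the fused scores."""
    fused = alpha * np.asarray(m2_scores) + (1.0 - alpha) * np.asarray(m3_scores)
    return int(np.argmax(fused)), fused
```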
Combining the characteristics of the behaviors that appear in surveillance video, the embodiments of the present invention design a fusion method based on hidden-layer features in single-frame-input and multi-frame-input networks, using the combination of the current target's foreground, its background image information, and the neighboring target information as new implicit features. This enriches the information available to the classification network, so that the deep model used for classification can exploit both the information of the background region where the current target is located and the behavior information of other targets in the adjacent region. Such auxiliary information is very valuable for behavior recognition in surveillance video and improves the behavior recognition performance of the whole system.
Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 5 is a schematic flowchart of the step of outputting the corresponding behavior category according to the behavior category score in a specific implementation of the present invention.

As an implementation, the foregoing step S4 includes:
Step S41: if the behavior category score is higher than the threshold of a preset behavior category, output that behavior category.

Specifically, after the behavior category scores of the targets in the human body regions whose predicted value is the human category have been calculated, the video monitoring device outputs the corresponding behavior category according to those scores.

If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; in particular, if the score output at this stage on a category with obvious static characteristics exceeds a certain threshold, that category is output directly as the final prediction result.
Step S42: if the behavior category score is not higher than the threshold of the preset behavior category, calculate and output the corresponding behavior category in combination with the human body trajectory information.

Specifically, if the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body trajectory information.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, it can take the superposition of the images of the same target at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.

After the current-time image and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:

S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.

The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.

After the current-time image and the tracking region images have been superimposed in order, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 6 is a schematic flowchart of the step of calculating the behavior category scores of the targets in the human body regions whose predicted value is the human category in a specific implementation of the present invention.

As an implementation, the foregoing step S3 includes:
Step S31: acquire the background image of the human body region whose predicted value is the human category, and obtain the description information of the background image.

Specifically, after the non-human-target filtering algorithm has been applied, the predicted values corresponding to the human body regions output, and the regions whose predicted value is the non-human category filtered out, the video monitoring device acquires the background image of each human body region whose predicted value is the human category and obtains its description information.

In a specific implementation, if the prediction produced by the M1 network model is the human category (i.e., foreground in the frame), the video monitoring device can use the structurally more complex, more discriminative M2 network model, a non-sequential-input behavior recognition model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in FIG. 3. The hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets, and feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the region where the target is located can be obtained from a preset clean background image by taking the part corresponding to the position of the detection region. A complete background image can be obtained from a preset standard background image, or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as B_t^(i), the description information of a target region can be expressed as:

O(i, t) = (I_t^(i), R_t^(i), B_t^(i));

where I_t^(i) and B_t^(i) share the same location region R_t^(i).
Step S32: calculate the background region information corresponding to the background image according to the description information of the background image, and calculate the neighboring target information corresponding to the background image.

Specifically, after the background image of the human body region whose predicted value is the human category has been acquired and its description information obtained, the video monitoring device calculates the background region information corresponding to the background image according to that description information, and also calculates the corresponding neighboring target information.

In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimension is the same as that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:

FC_1(I_t^(i)) = f_1(c_m(... c_1(I_t^(i))));

where c(·) denotes a convolution operation on the image, and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image at the same position, the feature of its first hidden layer is:

FC_1(B_t^(i)) = f_1(c_m(... c_1(B_t^(i)))).
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:

(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
Step S33: calculate the behavior category score of the target in the human body region by combining the background region information corresponding to the background image with the neighboring target information.

Specifically, after the background region information corresponding to the background image and the neighboring target information have been calculated from the description information of the background image, the video monitoring device combines the two to calculate the behavior category score of the target in the human body region.
In a specific implementation, the video monitoring device can record the set of first fully connected layer features computed from all neighboring target regions as {FC_1(I_t^(j))}, j ∈ N(i), and separately compute, over every dimension of these feature values, the maximum

F_max = max_{j ∈ N(i)} FC_1(I_t^(j))

and the weighted average

F_avg = Σ_{j ∈ N(i)} w_j · FC_1(I_t^(j)),

as components of the feature description of the neighboring targets. Concatenating these two groups of features yields the overall feature representation of the neighboring target description:

F_nb = [F_max, F_avg].

If the current target has no neighboring target in the frame, the values of F_nb are all set to zero. After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:

FC_1' = [FC_1(I_t^(i)), FC_1(B_t^(i)), F_nb].
This fused feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and the context information of the current target during recognition.

The output of the M2 network model is a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 7 is a schematic flowchart of the step of calculating and outputting the corresponding behavior category in combination with the human body trajectory information in a specific implementation of the present invention.

As an implementation, the foregoing step S42 includes:
Step S421: acquire the current-time image of the video and the tracking region images corresponding to the human body trajectory information.

Specifically, the video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information.

In a specific implementation, the video monitoring device can take the superposition of the images of the same target at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition network model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.
Step S422: superimpose the current-time image and the tracking region images in temporal order.

Specifically, after the current-time image of the video and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order.

In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:

S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.

The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
Step S423: perform a weighted summation of the behavior category scores and the result of the sequential superposition, and output the corresponding behavior category.

Specifically, after the current-time image and the tracking region images have been superimposed in order and the multi-frame stacked input has been processed, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category.

In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 8 is a schematic flowchart of the step of calculating the predicted values corresponding to the human body regions from those regions and filtering out the regions whose predicted value is the non-human category in a specific implementation of the present invention.

As an implementation, the foregoing step S2 includes:
Step S21: acquire the human body regions, analyze them, and output the predicted value corresponding to each human body region.

Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the video monitoring device acquires and analyzes the human body regions and outputs the predicted value corresponding to each region.

In a specific implementation, when a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filter M1 network model for analysis; the structure of the M1 network model is shown in FIG. 2. The M1 network model is a deep convolutional network model based on single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transformation its two components correspond to the scores on the human category and the non-human category, respectively.
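A minimal sketch of such a single-frame background-filter model in PyTorch; the layer counts and channel widths are illustrative assumptions rather than the configuration disclosed here:

```python
import torch
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Sketch of M1: CONV + ReLU + pooling blocks, then FC layers, with a
    2-dimensional output that scores the human and non-human categories
    after a sigmoid transformation."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),                 # [human, non-human]
        )

    def forward(self, x):                      # x: detected foreground crops
        logits = self.classifier(self.features(x).flatten(1))
        return torch.sigmoid(logits)           # category scores in [0, 1]
```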
Step S22: if the predicted value is the non-human category, filter that human body region out of the acquired human body regions.

Specifically, if the predicted value is the non-human category, the human body region with that predicted value is filtered out of the acquired regions. In a specific implementation, the classification by the M1 network model allows the video monitoring device to filter out regions that the earlier detection and tracking algorithms misreported as the human category. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it adds no significant computational overhead; it improves detection accuracy while still meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored picture and the hardware capability of the deployed device.
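A sketch of applying the M1 scores to drop false positives, assuming the model above and an illustrative decision threshold tau:

```python
def filter_regions(m1, crops, tau=0.5):
    """Keep only the detected regions that M1 scores as human.

    m1: a scoring model such as M1BackgroundFilter above.
    crops: batch tensor of detected foreground region images.
    Returns a boolean mask over the batch.
    """
    scores = m1(crops)          # shape (N, 2): [human, non-human] scores
    return scores[:, 0] > tau   # True where the human score clears tau
```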
If the predicted value is the human category, the step of calculating the behavior category scores of the targets in the human body regions whose predicted value is the human category is performed.

Specifically, if the predicted value is the human category, the video monitoring device performs the foregoing step S3 to calculate the behavior category scores of the targets in those human body regions.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 9 is a schematic flowchart of the step of detecting the human body regions in the video to be recognized and acquiring the human body trajectory information in those regions in a specific implementation of the present invention.

As an implementation, the foregoing step S1 includes:
Step S11: acquire the video to be recognized, and detect the human body regions in the target video.

Specifically, the video monitoring device acquires the video to be recognized and detects the human body regions in the target video.

In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.
Step S12: track the pedestrians in the human body regions to obtain the human body trajectory information in those regions.

Specifically, after the video to be recognized has been acquired and the human body regions in the target video detected, the video monitoring device tracks the pedestrians in the human body regions to obtain the human body trajectory information in those regions.

In a specific implementation, the video monitoring device can track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the picture.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, i.e.:
O(i, t) = (I_t^(i), R_t^(i));
where O(i, t) denotes the information of target i at time t, I_t^(i) is the image content detected for that target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
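A sketch of how such records might be stored; the dataclass and the dict keyed by target ID are illustrative choices, not structures specified in this disclosure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetObservation:
    """One record O(i, t) = (I_t, R_t): the crop detected for target i at
    time t plus its region vector (x, y, w, h)."""
    target_id: int
    t: int
    image: np.ndarray                              # I_t: detection-region pixels
    region: tuple[float, float, float, float]      # R_t: (x, y, w, h)

# A track is then the time-ordered list of observations for one target ID.
tracks: dict[int, list[TargetObservation]] = {}
```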
Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
Based on the implementation of the foregoing method embodiments for human behavior recognition in video, the present invention further provides corresponding apparatus embodiments.

As shown in FIG. 10, a first embodiment of the present invention provides an apparatus for human behavior recognition in video, including:
a detection module 100, configured to detect the human body regions in the video to be recognized and to acquire the human body trajectory information in those regions.

The executing entity of the apparatus in this embodiment of the present invention may be a video monitoring device or a video recognition device; this embodiment takes a video monitoring device as an example, but is of course not limited to it, and other devices capable of recognizing human behavior in video may also be used.
Specifically, the detection module 100 detects the human body regions in the video to be recognized and acquires the human body trajectory information in those regions.

The video monitoring device acquires the video to be recognized and detects the human body regions in the target video. In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.

After the video to be recognized has been acquired and the human body regions in the target video detected, the detection module 100 tracks the pedestrians in the human body regions to obtain the human body trajectory information in those regions. In a specific implementation, the video monitoring device can track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the picture.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, i.e.:
O(i, t) = (I_t^(i), R_t^(i));
where O(i, t) denotes the information of target i at time t, I_t^(i) is the image content detected for that target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
a filtering module 200, configured to calculate the predicted value corresponding to each human body region from that region and to filter out the regions whose predicted value is the non-human category, obtaining the human body regions whose predicted value is the human category.

Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the filtering module 200 calculates the predicted value corresponding to each human body region, filters out the regions whose predicted value is the non-human category, and obtains the regions whose predicted value is the human category.
The video monitoring device acquires and analyzes the human body regions and outputs the predicted value corresponding to each region; the predicted value is either the human category or the non-human category. In a specific implementation, when a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filter M1 network model for analysis; the structure of the M1 network model is shown in FIG. 2. The M1 network model is a deep convolutional network model based on single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transformation its two components correspond to the scores on the human category and the non-human category, respectively.

If the predicted value is the non-human category, the filtering module 200 filters that human body region out of the acquired regions. The classification by the M1 network model makes it possible to filter out regions that the earlier detection and tracking algorithms misreported as the human category. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it adds no significant computational overhead; it improves detection accuracy while still meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored picture and the hardware capability of the deployed device.

In this embodiment of the present invention, a deep network model of relatively simple structure is first used after the detection and tracking stage to further filter the detected foreground regions. In the earlier detection stage, the algorithm's threshold for foreground prediction is deliberately lowered so that the algorithm returns as many foreground regions as possible, minimizing missed detections. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), the computational overhead of the algorithm is greatly reduced; it improves detection accuracy while readily satisfying the real-time requirements of the whole system.
a calculation module 300, configured to calculate, for the human body regions whose predicted value is the human category, the behavior category scores of the targets in those regions.

Specifically, after the predicted values corresponding to the human body regions have been calculated from those regions and the regions whose predicted value is the non-human category filtered out, leaving the regions whose predicted value is the human category, the calculation module 300 calculates the behavior category scores of the targets in those regions.

The video monitoring device acquires the background image of each human body region whose predicted value is the human category and obtains its description information. In a specific implementation, if the prediction produced by the M1 network model is the human category (i.e., foreground in the frame), the video monitoring device can use the structurally more complex, more discriminative M2 network model, a non-sequential-input behavior recognition model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in FIG. 3. The hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets, and feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the region where the target is located can be obtained from a preset clean background image by taking the part corresponding to the position of the detection region. A complete background image can be obtained from a preset standard background image, or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as B_t^(i), the description information of a target region can be expressed as:
O(i, t) = (I_t^(i), R_t^(i), B_t^(i));

where I_t^(i) and B_t^(i) share the same location region R_t^(i).
After the background image of the human body region whose predicted value is the human category has been acquired and its description information obtained, the calculation module 300 calculates the background region information corresponding to the background image according to that description information, and also calculates the corresponding neighboring target information. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimension is the same as that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:
FC_1(I_t^(i)) = f_1(c_m(... c_1(I_t^(i))));

where c(·) denotes a convolution operation on the image, and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image at the same position, the feature of its first hidden layer is:

FC_1(B_t^(i)) = f_1(c_m(... c_1(B_t^(i)))).
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:

(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
After computing the background region information corresponding to the background image according to its description information, together with the corresponding neighboring target information, the calculation module 300 combines the two to compute the behavior category score of the target in the human body region. In a specific implementation, the video monitoring device records the set of first fully connected layer features computed over all neighboring target regions, written here as $\{FC_1(I_t^{(j)})\}_{j \in N(i)}$ where $N(i)$ denotes the valid neighboring targets of target i, and computes the per-dimension maximum of these feature values:
$FC_1^{\max} = \max_{j \in N(i)} FC_1(I_t^{(j)})$ (element-wise);
and the weighted average:
$FC_1^{\mathrm{avg}} = \sum_{j \in N(i)} w_j\, FC_1(I_t^{(j)})$, with $\sum_j w_j = 1$;
as the components of the feature description of the neighboring targets. Concatenating the above two groups of features yields the overall feature representation describing the neighboring targets, namely:
$FC_1^{\mathrm{ctx}} = \big[FC_1^{\max},\; FC_1^{\mathrm{avg}}\big]$;
If the current target has no neighboring target in the frame, the values of $FC_1^{\mathrm{ctx}}$ are all set to zero. After integrating the background region information and the neighboring target information, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:
$FC_1 = \big[FC_1(I_t^{(i)}),\; FC_1(B_t^{(i)}),\; FC_1^{\mathrm{ctx}}\big]$;
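The sketch below assembles the fused FC1 feature as just described: a per-dimension maximum and a weighted average over the neighbors' FC1 vectors form the context part, which is zeroed when no neighbor exists and is then concatenated with the target and background FC1 features. The uniform weighting is an assumption, since the description leaves the weights unspecified:

```python
import numpy as np

def fused_fc1(fc1_target, fc1_background, neighbor_fc1, weights=None):
    """Concatenate target, background, and neighbor-context FC1 features."""
    dim = fc1_target.shape[0]
    if len(neighbor_fc1) == 0:
        ctx = np.zeros(2 * dim)                 # no neighbors: all zeros
    else:
        feats = np.stack(neighbor_fc1)          # (num_neighbors, dim)
        if weights is None:                     # assumed: uniform weights
            weights = np.full(len(feats), 1.0 / len(feats))
        fc1_max = feats.max(axis=0)             # per-dimension maximum
        fc1_avg = weights @ feats               # weighted average
        ctx = np.concatenate([fc1_max, fc1_avg])
    return np.concatenate([fc1_target, fc1_background, ctx])

rng = np.random.default_rng(0)
f_t, f_b = rng.random(256), rng.random(256)
neighbors = [rng.random(256) for _ in range(3)]
print(fused_fc1(f_t, f_b, neighbors).shape)     # (1024,)
```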
This feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and context information of the current target during recognition.
The output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
The output module 400 is configured to output the corresponding behavior category according to the behavior category score.
Specifically, after the behavior category score of the target in the human body region predicted as the human body category has been computed, the output module 400 outputs the corresponding behavior category according to that score.
If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output; that is, when scoring the above behavior categories, if the output score on a category with obvious static characteristics exceeds a certain threshold, that category is directly output as the final prediction result.
For the different behavior types that appear in surveillance video, the embodiments of the present invention employ temporal (multi-frame image) and non-temporal (single-frame image) input networks of different structures, chosen according to the behaviors' static and dynamic characteristics, to analyze the extracted images, and finally fuse the outputs of the two networks to obtain the final behavior recognition result. Specifically, for behavior categories with clear static characteristics, such as fighting or cycling, the embodiments rely mainly on a sufficiently complex non-temporal input network model for fast prediction, because such actions have obvious features and, once they appear, can generally be judged accurately from a single frame. For behavior categories that are hard to judge from a single frame, such as walking versus jogging, a deep network that takes temporally stacked images as input is used for further analysis, providing more reliable recognition performance than a network with a single static image input. In addition, the fusion strategy for the temporal-input and non-temporal-input deep classification models adopts the idea of a cascade classifier, improving the operating efficiency of the whole classification system and meeting the requirement of real-time behavior recognition.
If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body trajectory information to compute and output the corresponding behavior category.
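A minimal sketch of this cascade rule (the threshold values and the choice of "static" categories are assumptions): if the single-frame M2 score on a statically obvious category clears its threshold, that category is emitted at once; otherwise the trajectory-based temporal path is taken:

```python
def cascade_predict(m2_scores, static_thresholds, m3_fallback):
    """m2_scores: dict category -> probability from the single-frame model.
    static_thresholds: per-category thresholds for static classes (assumed).
    m3_fallback: callable running the temporal model when needed."""
    for category, threshold in static_thresholds.items():
        if m2_scores.get(category, 0.0) > threshold:
            return category                 # fast path: static cue suffices
    return m3_fallback()                    # slow path: temporal analysis

scores = {"fighting": 0.92, "cycling": 0.03, "walking": 0.40}
thresholds = {"fighting": 0.8, "cycling": 0.8}       # assumed values
print(cascade_predict(scores, thresholds, lambda: "walking"))  # -> fighting
```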
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can acquire these images and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
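A minimal sketch of this superposition step (the crop size, stack depth, and channel-wise stacking axis are assumptions; the description does not fix how the frames are concatenated):

```python
import numpy as np

def stack_track(crops):
    """Stack k RGB crops (each H x W x 3) of one target, oldest first,
    into a single (H, W, 3k) temporal input for the M3 model."""
    return np.concatenate(crops, axis=2)

track = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(5)]
print(stack_track(track).shape)   # (64, 64, 15)
```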
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
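A minimal sketch of this fusion step; the weight value below is an assumption standing in for one fitted on the training set:

```python
import numpy as np

def fuse_scores(m2_scores, m3_scores, alpha=0.4):
    """Weighted sum of the M2 (single-frame) and M3 (temporal) category
    score vectors; alpha is assumed here and would be fitted in practice."""
    fused = alpha * np.asarray(m2_scores) + (1.0 - alpha) * np.asarray(m3_scores)
    return int(fused.argmax()), fused

m2 = [0.1, 0.6, 0.3]
m3 = [0.2, 0.2, 0.6]
best, fused = fuse_scores(m2, m3)
print(best, fused.round(2))   # 2 [0.16 0.36 0.48]
```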
Combining the characteristics of behaviors that appear in surveillance video, the present invention designs a fusion method based on hidden layer features in single-frame-input and multi-frame-input networks, using the combination of the current target's foreground, background image information, and neighboring target information as new implicit features. This enriches the information available to the classification network, so that the deep model used for classification can simultaneously exploit the information of the background region where the current target lies and the behavior information of other targets in the neighboring region, which is highly valuable auxiliary information for behavior recognition in surveillance video and improves the behavior recognition performance of the whole system.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the output module 400 is further configured to output the behavior category if the behavior category score is higher than the threshold of the preset behavior category, and, if the score is not higher than that threshold, to combine the human body trajectory information to compute and output the corresponding behavior category.
Specifically, after the behavior category score of the target in the human body region predicted as the human body category has been computed, the output module 400 outputs the corresponding behavior category according to that score.
If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output; that is, when scoring the above behavior categories, if the output score on a category with obvious static characteristics exceeds a certain threshold, that category is directly output as the final prediction result.
If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body trajectory information to compute and output the corresponding behavior category.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can acquire these images and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the calculation module 300 is further configured to acquire the background image of the human body region whose predicted value is the human body category and obtain its description information; to compute, according to that description information, the background region information corresponding to the background image and the corresponding neighboring target information; and to combine the background region information and the neighboring target information to compute the behavior category score of the target in the human body region.
Specifically, after the predicted value corresponding to the human body region has been computed from that region, and human body regions whose predicted value is a non-human category have been filtered out so that only regions predicted as the human body category remain, the calculation module 300 acquires the background image of each such region and obtains its description information.
In a specific implementation, if the prediction obtained by the M1 network model is the human body category (i.e., foreground in the frame), the video monitoring device can use M2, a structurally more complex and more discriminative non-temporal-input behavior recognition network model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in Figure 3. The hidden layer of the M2 network model incorporates the background image of the current human target's location and the hidden layer features of neighboring targets; the feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 3. The background image of the target's region can be taken from a preset clean background image, simply by cropping the part corresponding to the position of the detection region. The complete background image can be obtained from a preset standard background image or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as $B_t^{(i)}$, the description information of a target region can be expressed as:
$O(i,t) = \big(I_t^{(i)},\, R_t^{(i)},\, B_t^{(i)}\big)$;
where $I_t^{(i)}$ and $B_t^{(i)}$ share the same location region $R_t^{(i)}$.
After obtaining the background image of the human body region predicted as the human body category and its description information, the calculation module 300 computes, according to that description information, the background region information corresponding to the background image and the corresponding neighboring target information.
In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden layer feature, whose dimension equals that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:
$FC_1(I_t^{(i)}) = f_1(c_m(\cdots c_1(I_t^{(i)})))$;
where c(·) denotes a convolution operation on the image and f(·) denotes the matrix multiplication and bias operations of the fully connected layer. Similarly, for the background position image, the feature of its first hidden layer is:
$FC_1(B_t^{(i)}) = f_1(c_m(\cdots c_1(B_t^{(i)})))$;
In addition, part of the feature composition of the model's first hidden layer comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The range of the neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:
$(x_c^{(i)}, y_c^{(i)}) = \big(x_t^{(i)} + w_t^{(i)}/2,\; y_t^{(i)} + h_t^{(i)}/2\big)$;
where $x_t^{(i)}$ is the abscissa of the upper-left corner of the target region, $y_t^{(i)}$ is the ordinate of the upper-left corner, $w_t^{(i)}$ is the width of the target region, and $h_t^{(i)}$ is its height. The center positions $(x_c^{(j)}, y_c^{(j)})$ of the other foreground targets in the same frame are computed at the same time; when the Euclidean distance $d_{ij}$ between $(x_c^{(i)}, y_c^{(i)})$ and $(x_c^{(j)}, y_c^{(j)})$ is less than a certain threshold D, or the two regions intersect, that foreground is classified as a valid neighboring target of the current target.
After computing the background region information corresponding to the background image according to its description information, together with the corresponding neighboring target information, the calculation module 300 combines the two to compute the behavior category score of the target in the human body region.
In a specific implementation, the video monitoring device records the set of first fully connected layer features computed over all neighboring target regions, written here as $\{FC_1(I_t^{(j)})\}_{j \in N(i)}$ where $N(i)$ denotes the valid neighboring targets of target i, and computes the per-dimension maximum of these feature values:
$FC_1^{\max} = \max_{j \in N(i)} FC_1(I_t^{(j)})$ (element-wise);
and the weighted average:
$FC_1^{\mathrm{avg}} = \sum_{j \in N(i)} w_j\, FC_1(I_t^{(j)})$, with $\sum_j w_j = 1$;
as the components of the feature description of the neighboring targets. Concatenating the above two groups of features yields the overall feature representation describing the neighboring targets, namely:
$FC_1^{\mathrm{ctx}} = \big[FC_1^{\max},\; FC_1^{\mathrm{avg}}\big]$;
If the current target has no neighboring target in the frame, the values of $FC_1^{\mathrm{ctx}}$ are all set to zero. After integrating the background region information and the neighboring target information, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:
$FC_1 = \big[FC_1(I_t^{(i)}),\; FC_1(B_t^{(i)}),\; FC_1^{\mathrm{ctx}}\big]$;
This feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and context information of the current target during recognition.
The output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the output module 400 is further configured to acquire the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information; to sequentially superimpose the current-time image and the tracking region images; and to perform a weighted summation of the behavior category score and the result obtained from the superimposed input, outputting the corresponding behavior category.
Specifically, the output module 400 acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information.
In a specific implementation, the video monitoring device can acquire the current-time image and the tracking region images corresponding to the human body trajectory information, and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition network model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them.
In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category.
In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the filtering module 200 is further configured to acquire and analyze the human body region and output the predicted value corresponding to it; if the predicted value is a non-human category, to filter that region out of the acquired human body regions; and if the predicted value is the human body category, to compute the behavior category score of the target in that human body region.
Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the filtering module 200 acquires and analyzes each human body region and outputs the corresponding predicted value.
In a specific implementation, after a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filtering M1 network model for analysis; the structure of the M1 network model is shown in Figure 2. The M1 network model is a deep convolutional network model with single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transform its two values correspond to the behavior category scores of the human body category and the non-human category.
If the predicted value is a non-human category, the filtering module 200 filters that human body region out of the acquired human body regions. In a specific implementation, the classification by the M1 network model lets the video monitoring device filter out regions that the earlier detection and tracking algorithms mistakenly labeled as the human body category. Since the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it incurs no significant computational overhead; it improves detection accuracy while meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored frame and the hardware performance of the deployed device.
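A rough sketch of such an M1 filter (the layer counts, sizes, and 64x64 crop resolution are illustrative assumptions) and of how its two-way scores would be used to drop non-human regions:

```python
import torch
import torch.nn as nn

class M1Filter(nn.Module):
    """Single-frame background-filtering net: conv+ReLU+pooling blocks,
    fully connected layers, and a 2-way output (human / non-human).
    Layer counts and sizes are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 2),                 # human vs. non-human
        )

    def forward(self, crop):                   # crop: (N, 3, 64, 64)
        h = self.features(crop).flatten(1)
        return torch.sigmoid(self.classifier(h))

m1 = M1Filter()
regions = torch.rand(4, 3, 64, 64)             # detected foreground crops
scores = m1(regions)                           # (4, 2) category scores
kept = regions[scores[:, 0] > scores[:, 1]]    # keep human-class regions only
```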
If the predicted value is the human body category, the filtering module 200 computes the behavior category score of the target in that human body region.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the detection module 100 is further configured to acquire the video to be recognized and detect the human body regions in it, and to track the human bodies in those regions to obtain the human body trajectory information.
Specifically, the detection module 100 acquires the video to be recognized and detects the human body regions in it.
In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.
After acquiring the video to be recognized and detecting the human body regions in it, the detection module 100 tracks the pedestrians in those regions to obtain the human body trajectory information.
In a specific implementation, the video monitoring device can track the pedestrians in the frame using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the frame.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, namely:
$O(i,t) = \big(I_t^{(i)},\, R_t^{(i)}\big)$;
where $O(i,t)$ denotes the information of target i at time t, $I_t^{(i)}$ is the image content detected for that target at time t, and $R_t^{(i)}$ is the position of the region the target occupies at time t; $R_t^{(i)}$ records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
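A minimal sketch of this record format (the field and class names are illustrative):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class TrackRecord:
    """O(i, t): the appearance and location of target i at time t."""
    target_id: int          # i
    t: int                  # frame time
    image: Any              # I_t^(i): image content of the detected crop
    region: tuple           # R_t^(i): (x, y, w, h) of the region

track: List[TrackRecord] = [
    TrackRecord(7, 0, None, (120, 80, 40, 90)),
    TrackRecord(7, 1, None, (124, 81, 40, 90)),
]
```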
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
It should be noted that, in practical applications, the detection module 100, the filtering module 200, the calculation module 300, and the output module 400 may be implemented by a central processing unit (CPU), a microcontroller unit (MCU), a digital signal processor (DSP), or a field-programmable gate array (FPGA) in the apparatus for human behavior recognition in video.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Based on this, an embodiment of the present invention further provides a computer storage medium including a set of instructions that, when executed, cause at least one processor to perform the above method for human behavior recognition in video.
The above are merely preferred embodiments of the present invention and do not limit the patent scope of the invention; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Industrial Applicability
The solution provided by the embodiments of the present invention detects the human body regions in the video to be recognized and acquires the human body trajectory information in those regions; computes the predicted value corresponding to each human body region and filters out regions whose predicted value is a non-human category, leaving the regions predicted as the human body category; computes, for those regions, the behavior category score of the target in each; and outputs the corresponding behavior category according to the behavior category score, thereby improving the real-time performance and accuracy of video recognition.

Claims (13)

  1. A method for human behavior recognition in video, the method comprising:
    detecting a human body region in a video to be recognized, and acquiring human body trajectory information in the human body region;
    calculating a predicted value corresponding to the human body region according to the human body region, and filtering out human body regions whose predicted value is a non-human category, to obtain human body regions whose predicted value is a human body category;
    calculating, for a human body region whose predicted value is the human body category, a behavior category score of a target in the human body region; and
    outputting a corresponding behavior category according to the behavior category score.
  2. The method according to claim 1, wherein outputting the corresponding behavior category according to the behavior category score comprises:
    outputting the behavior category if the behavior category score is higher than a threshold of a preset behavior category; and
    if the behavior category score is not higher than the threshold of the preset behavior category, computing and outputting the corresponding behavior category in combination with the human body trajectory information.
  3. The method according to claim 2, wherein calculating, for the human body region whose predicted value is the human body category, the behavior category score of the target in the human body region comprises:
    acquiring a background image of the human body region whose predicted value is the human body category, and obtaining description information of the background image;
    calculating, according to the description information of the background image, background region information corresponding to the background image, and calculating neighboring target information corresponding to the background image; and
    calculating the behavior category score of the target in the human body region by combining the background region information and the neighboring target information corresponding to the background image.
  4. The method according to claim 2, wherein computing and outputting the corresponding behavior category in combination with the human body trajectory information comprises:
    acquiring a current-time image of the video to be recognized and tracking region images corresponding to the human body trajectory information;
    sequentially superimposing the current-time image and the tracking region images; and
    performing a weighted summation of the behavior category score and the result of the sequential superimposition, and outputting the corresponding behavior category.
  5. The method according to claim 1, wherein calculating the predicted value corresponding to the human body region according to the human body region, and filtering out human body regions whose predicted value is a non-human category, comprises:
    acquiring and analyzing the human body region, and outputting the predicted value corresponding to the human body region;
    if the predicted value is a non-human category, filtering the human body region whose predicted value is the non-human category out of the acquired human body regions; and
    if the predicted value is the human body category, performing the step of calculating the behavior category score of the target in the human body region whose predicted value is the human body category.
  6. The method according to claim 1, wherein detecting the human body region in the video to be recognized and acquiring the human body trajectory information in the human body region comprises:
    acquiring the video to be recognized, and detecting the human body region in the video to be recognized; and
    tracking pedestrians in the human body region to obtain the human body trajectory information in the human body region.
  7. An apparatus for human behavior recognition in video, the apparatus comprising:
    a detection module configured to detect a human body region in a video to be recognized and acquire human body trajectory information in the human body region;
    a filtering module configured to calculate a predicted value corresponding to the human body region according to the human body region, and to filter out human body regions whose predicted value is a non-human category, obtaining human body regions whose predicted value is a human body category;
    a calculation module configured to calculate, for a human body region whose predicted value is the human body category, a behavior category score of a target in the human body region; and
    an output module configured to output a corresponding behavior category according to the behavior category score.
  8. The apparatus according to claim 7, wherein
    the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category, and, if the behavior category score is not higher than the threshold of the preset behavior category, to compute and output the corresponding behavior category in combination with the human body trajectory information.
  9. The apparatus according to claim 8, wherein
    the calculation module is configured to acquire a background image of the human body region whose predicted value is the human body category and obtain description information of the background image; to calculate, according to the description information, background region information corresponding to the background image and neighboring target information corresponding to the background image; and to calculate the behavior category score of the target in the human body region by combining the background region information and the neighboring target information.
  10. The apparatus according to claim 7, wherein
    the output module is configured to acquire a current-time image of the video to be recognized and tracking region images corresponding to the human body trajectory information; to sequentially superimpose the current-time image and the tracking region images; and to perform a weighted summation of the behavior category score and the result of the sequential superimposition, outputting the corresponding behavior category.
  11. The apparatus according to claim 7, wherein
    the filtering module is configured to acquire and analyze the human body region and output the predicted value corresponding to the human body region; if the predicted value is a non-human category, to filter the human body region out of the acquired human body regions; and if the predicted value is the human body category, to calculate the behavior category score of the target in the human body region.
  12. The apparatus according to claim 7, wherein
    the detection module is configured to acquire the video to be recognized, detect the human body region in the video to be recognized, and track pedestrians in the human body region to obtain the human body trajectory information in the human body region.
  13. A computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform the method for human behavior recognition in video according to any one of claims 1 to 6.
PCT/CN2017/071574 2016-01-29 2017-01-18 Human behaviour recognition method and apparatus in video, and computer storage medium WO2017129020A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610067817.X 2016-01-29
CN201610067817.XA CN107025420A (en) 2016-01-29 2016-01-29 The method and apparatus of Human bodys' response in video

Publications (1)

Publication Number Publication Date
WO2017129020A1 true WO2017129020A1 (en) 2017-08-03

Family

ID=59397442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/071574 WO2017129020A1 (en) 2016-01-29 2017-01-18 Human behaviour recognition method and apparatus in video, and computer storage medium

Country Status (2)

Country Link
CN (1) CN107025420A (en)
WO (1) WO2017129020A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808139B (en) * 2017-11-01 2021-08-06 电子科技大学 Real-time monitoring threat analysis method and system based on deep learning
CN108229407A (en) * 2018-01-11 2018-06-29 武汉米人科技有限公司 A kind of behavioral value method and system in video analysis
CN110321761B (en) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 Behavior identification method, terminal equipment and computer readable storage medium
CN109508698B (en) * 2018-12-19 2023-01-10 中山大学 Human behavior recognition method based on binary tree
CN111325292B (en) * 2020-03-11 2023-05-02 中国电子工程设计院有限公司 Object behavior recognition method and device


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859234A (en) * 2017-11-29 2019-06-07 Shenzhen TCL New Technology Co., Ltd. Video human trajectory tracking method, device and storage medium
CN112149454A (en) * 2019-06-26 2020-12-29 Hangzhou Hikvision Digital Technology Co., Ltd. Behavior recognition method, device and equipment
CN110414421A (en) * 2019-07-25 2019-11-05 University of Electronic Science and Technology of China Behavior recognition method based on consecutive frame images
CN110414421B (en) * 2019-07-25 2023-04-07 University of Electronic Science and Technology of China Behavior recognition method based on consecutive frame images
CN111061945A (en) * 2019-11-11 2020-04-24 Hanhai Information Technology (Shanghai) Co., Ltd. Recommendation method and device, electronic device and storage medium
CN111061945B (en) * 2019-11-11 2023-06-27 Hanhai Information Technology (Shanghai) Co., Ltd. Recommendation method and device, electronic device and storage medium
CN110826702A (en) * 2019-11-18 2020-02-21 Fang Yuming Abnormal event detection method based on a multi-task deep network
CN111242007A (en) * 2020-01-10 2020-06-05 Shanghai Chongming District Ecological Agriculture Science and Technology Innovation Center Farming behavior supervision method
CN112016461A (en) * 2020-08-28 2020-12-01 Shenzhen Xinyi Technology Co., Ltd. Multi-target behavior recognition method and system
CN112232142A (en) * 2020-09-27 2021-01-15 Zhejiang Dahua Technology Co., Ltd. Seat belt identification method and device, and computer-readable storage medium
CN112818881A (en) * 2021-02-07 2021-05-18 State Grid Fujian Electric Power Co., Ltd. Marketing Service Center Human behavior recognition method
CN112818881B (en) * 2021-02-07 2023-12-22 State Grid Fujian Electric Power Co., Ltd. Marketing Service Center Human behavior recognition method

Also Published As

Publication number Publication date
CN107025420A (en) 2017-08-08

Similar Documents

Publication Publication Date Title
WO2017129020A1 (en) Human behaviour recognition method and apparatus in video, and computer storage medium
Seemanthini et al. Human detection and tracking using HOG for action recognition
Wen et al. Detection, tracking, and counting meets drones in crowds: A benchmark
CN107967451B Method for crowd counting in still images
CN103824070B Rapid pedestrian detection method based on computer vision
US9569531B2 (en) System and method for multi-agent event detection and recognition
WO2017150032A1 (en) Method and system for detecting actions of object in scene
Avgerinakis et al. Recognition of activities of daily living for smart home environments
Cheng et al. A self-constructing cascade classifier with AdaBoost and SVM for pedestrian detection
CN104992453A (en) Target tracking method under complicated background based on extreme learning machine
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
Bour et al. Crowd behavior analysis from fixed and moving cameras
Yang et al. Single shot multibox detector with kalman filter for online pedestrian detection in video
David An intellectual individual performance abnormality discovery system in civic surroundings
Singh et al. A deep learning based technique for anomaly detection in surveillance videos
Fradi et al. Spatio-temporal crowd density model in a human detection and tracking framework
Yi et al. Human action recognition based on action relevance weighted encoding
Zaidi et al. Video anomaly detection and classification for human activity recognition
Hu et al. AVMSN: An audio-visual two stream crowd counting framework under low-quality conditions
JP2021089717A (en) Method of subject re-identification
Garcia-Cobo et al. Human skeletons and change detection for efficient violence detection in surveillance videos
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
Hashmi et al. GAIT analysis: 3D pose estimation and prediction in defence applications using pattern recognition
Rashidan et al. Detection of different classes moving object in public surveillance using artificial neural network (ANN)
Puchała et al. Feature engineering techniques for skeleton-based two-person interaction classification in video

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 17743635
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 17743635
    Country of ref document: EP
    Kind code of ref document: A1