CN117636454A - Intelligent video behavior analysis method based on computer vision


Info

Publication number: CN117636454A
Application number: CN202310526954.5A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 陈洪军, 陈方晔
Applicant and current assignee: Guangzhou College Of Commerce
Legal status: Pending
Prior art keywords: target, video, behavior, samples, analysis
Classification: Image Analysis (AREA)

Abstract

The invention discloses an intelligent video behavior analysis method based on computer vision, relating to the technical field of computer vision, and comprising the following steps. S1: first learning a detector from a standard data set by transfer learning, and then migrating the detector from the standard data set to the actual monitoring data according to the characteristics of that data, completing target detection. S2: determining the spatial position of a target, and recognizing and analyzing the target behavior based on multi-task deep learning. S3: after target detection and behavior recognition and analysis, performing summary analysis and retrieval on the behaviors in the video. By embedding dedicated functional modules for up-front target detection, behavior analysis and recognition, and summary retrieval into an existing intelligent video analyzer, the method realizes real-time early warning of behavior beforehand and evidence collection afterwards; by extracting moving targets from the video, it analyzes and predicts behaviors that may occur, realizing intelligent video processing.

Description

Intelligent video behavior analysis method based on computer vision
Technical Field
The invention relates to the technical field of computer vision, in particular to an intelligent video behavior analysis method based on computer vision.
Background
Computer vision is the science of studying how to make machines "see": cameras and computers replace human eyes to identify, track and measure targets, and the resulting images are further processed so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision research seeks to build artificial-intelligence systems that can obtain "information" from images or multidimensional data. The world has fully entered the information age; vast amounts of visual data are generated every day, and with the development of computer vision and artificial intelligence, such data are widely applied in many aspects of daily life and have become an important medium for understanding and reconstructing the world. With the construction of safe cities in China, the world's largest security monitoring network has been built; however, compared with the pace of infrastructure construction, China's monitoring-video processing and analysis technology clearly lags behind. How to effectively process and analyze this massive volume of video data is therefore of great practical significance.
In the intelligent analysis and processing of video monitoring data, behavior analysis is a key problem. For example, behavior analysis of people can help distinguish abnormal passenger behavior in places such as airports and stations, and can detect illegal pedestrian behavior at traffic intersections. In anti-theft applications for shopping malls and houses, timely behavior analysis and prediction of suspicious persons can prevent crimes, and suspicious behaviors can be automatically identified and retrieved after a case occurs, avoiding the inefficiency of manual comparison. When the monitored object is a vehicle, suspicious vehicle behavior can likewise be predicted, with applications in fields such as intelligent transportation (management of vehicle violations) and smart cities (illegal dumping of muck). Behavior recognition is therefore a very important link in the processing of video data.
Although the demand for intelligent monitoring systems grows increasingly urgent with the wide application of video surveillance, most existing processing of video data still remains at the stage of single, simple video analysis, such as face recognition and license plate recognition. These lower-level applications can hardly meet the requirements of existing video monitoring and fall short of the demand for intelligence: moving targets must be extracted from the video so that their behavior can be analyzed and likely future behaviors predicted by inference, rather than simply extracting features from video frames; only then can intelligent video processing be realized. Against this background, an intelligent video behavior analysis method based on computer vision is provided.
Disclosure of Invention
The invention aims to provide an intelligent video behavior analysis method based on computer vision to solve the problems raised in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An intelligent video behavior analysis method based on computer vision comprises the following steps:
S1: first learning a detector from a standard data set by transfer learning, and then migrating the detector from the standard data set to the actual monitoring data according to the characteristics of that data, completing target detection;
S2: determining the spatial position of a target, and recognizing and analyzing the target behavior based on multi-task deep learning;
S3: after target detection and behavior recognition and analysis, performing summary analysis and retrieval on the behaviors in the video.
As a further scheme of the invention: in S1, for application in a video monitoring system, a target detection algorithm based on transfer learning is used: an existing human-body data set serves as the source domain, video data obtained from actual surveillance video serves as the target domain, and transfer learning is performed from the source domain to the target domain. Thus, even if the number of labeled samples from the actual scene is small, effective transfer information can be obtained from the source-domain data (the existing labeled data set), yielding an accurate classifier and detector.
As a still further aspect of the invention: in S1, target detection is a binary classification problem: the goal is to determine with a classifier whether a given detection window contains a target. A transfer-learning-based target detection model must first train a detector in the source domain, then run the trained detector on the unlabeled samples of the target domain, and compute the weights of positive and negative target-domain samples from the reliable samples obtained by detection, thereby training a target detector better suited to the target domain. The different sample distributions of the target domain and the source domain must be considered: the weights of the labeled source-domain samples are computed according to the target-domain sample distribution, and this weight information is brought into the target detector, so that samples are selected and the detector updated. The specific implementation steps are as follows. First: adopting a model-based transfer learning strategy, the samples are clustered according to the similarity of the training data in feature space and divided into different subsets, so that the samples within a subset are close to each other in feature space. Second: a corresponding detection model is trained for each subset, for example standing pedestrians in one class and walking pedestrians in another, so that different detectors are trained for different postures. Third: the detection model is thereby refined, and the finally obtained training model conveys more useful information; on the basis of these sub-models, the samples are screened and weighted in combination with transfer learning, realizing accurate classification of the samples in the target domain. The detector is formed by connecting a group of classifiers in series: the first-layer detector rapidly screens candidate windows, ensuring that positive samples pass directly while eliminating as many negative samples as possible; subsequent layers add different feature-based decision conditions to judge the candidate windows, so that the detector eliminates most negative samples at an early stage. Even though the features used for detection in the last few layers are complex, the number of candidate samples has been reduced, so detection labels are still obtained quickly; complex feature extraction is not performed on all candidate windows from the start, shortening decision time.
As a still further aspect of the invention: in S2, after the spatial position of the specific target is determined (the target detection of S1), the specific behavior of the target is analyzed and identified, and the target behavior is recognized and analyzed based on the video data.
As a still further aspect of the invention: in S2, a 3D convolutional neural network is adopted for the convolutional network training, and features are computed on the video data with a 3D convolutional neural network (3D CNN) model. First: N video frames are input and passed through preset filters to obtain 5 different features: gray scale, x-direction gradient, y-direction gradient, x-direction optical flow and y-direction optical flow; extracting these preset prior features is more targeted than random initialization and gives better results. Second: the 3D convolutional network performs convolution and downsampling on the extracted features, with each convolution layer followed by a sampling layer. Third: a fully connected layer is output at the end. The 3D CNN model extends the traditional CNN into the time dimension: the convolution operation describes the features of the video data in the x direction, the y direction and the time-axis direction, so motion information is encoded into the feature space and more information is available than with an image-based CNN. For the analysis of actual surveillance video data, the 3D CNN model is improved by adding a multi-task learning mode that learns additional tasks, such as multiple kinds of labels, so that it better serves the subsequent behavior retrieval task. Multi-task learning shares the associations between different tasks and thus achieves a better generalization effect; in behavior recognition, multi-task learning is all the more necessary to avoid overfitting the target task when the number of labelable samples is small.
As a still further aspect of the invention: in S2, the key to multi-task learning is information sharing, which in a convolutional network takes the form of network parameter sharing. Parameter sharing has two modes. In the hard-sharing mode, parameters are shared in every hidden layer of the network model, and only the task-specific output layers are kept unshared; the output fully connected layer is designed according to the specific task. In the soft-sharing mode, each task has its own model and parameters, and the sharing across tasks is reflected in the regularization of the distance between model parameters. For the behavior recognition task, the hard-sharing mode is adopted: on the one hand, the tasks of recognizing and analyzing a person's behavior are highly correlated, so each task does not need its own model and parameters as in soft sharing, and the hard-sharing mode has a small parameter scale and is easy to train; on the other hand, hard sharing learns multiple tasks simultaneously, forcing the model to capture a unified representation of each target behavior, which greatly reduces the risk of overfitting.
As a still further aspect of the invention: in S3, accurate condensation and summarization of video helps the supervision department quickly mine useful information from massive video data, saving considerable labor and material costs and increasing supervision efficiency. Existing video summarization techniques target traditional single-camera video data and can hardly process the multi-camera video data of current complex scenes effectively: on the one hand, existing single-view summary analysis does not consider effects such as illumination and occlusion on targets in complex scenes; on the other hand, single-view summary analysis ignores the spatio-temporal overlap between videos, leading to redundant, repeated summaries. Video behaviors are therefore summarized in a multi-view manner. In multi-view behavior summary analysis, the temporal topology of the multi-camera network imposes corresponding spatio-temporal constraints on the targets in the video data, enabling multi-view summary analysis based on the network topology of the multiple cameras. Using the correlated redundant information provided by the multiple cameras, a behavior summarization model fusing overlapping camera views is developed: target actions and behaviors under multiple views are expressed and summarized collaboratively, and target motion information across multiple spatio-temporal scales is fused, realizing hierarchical summary analysis and description of the video data.
As a still further aspect of the invention: in S3, data from different viewpoints are associated by constructing a spatio-temporal hypergraph. The construction must consider the different attribute relationships between the multi-view videos, such as temporal proximity, content similarity and high-level semantic feature relationships. The specific construction method is as follows. First: each node in the hypergraph represents a picture extracted from the video, and each hyperedge corresponds to one type of attribute relationship between pictures. Second: the hypergraph is converted into a weighted spatio-temporal shot graph, whose edge weights quantitatively measure the relationships between the multi-view videos; the complex and voluminous multi-view video data are thereby converted into a graph problem. On this basis, low-level visual features of the video, such as color and motion vectors, are computed and combined with the indices calculated in the earlier behavior analysis to evaluate the importance of video pictures, so that features are extracted in a more targeted way. The spatio-temporal hypergraph is a very good model for analyzing multi-view video: it contains not only the information of the video pictures but also reflects the relevance of data from different views. Once the spatio-temporal hypergraph and the weights of the different video behavior pictures are determined, the behavior pictures can be classified based on temporal similarity and other algorithms (such as random walk) applied; video pictures are finally selected as candidates for the summary structure, a multi-objective optimization algorithm generates the final summary, and the importance weights can be adjusted according to user requirements to adjust the generated summary.
As a still further aspect of the invention: in S3, by fusing visual information over multiple dimensions with focused study, starting from the correlation of the data in each dimension, high-level features rich in semantic information are obtained, realizing accurate and effective retrieval. During retrieval, the different semantic tags attached to the video behaviors in the multi-task learning of the earlier video behavior analysis can be integrated into independent retrieval modules to form sub-retrieval modules; a random forest strategy can then be adopted, treating the different sub-retrieval modules as weak classifiers, with model design and optimization carried out on the principle of decision trees of different depths.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention migrates the target detector trained on a standard data set to the actual monitoring data by transfer learning: the network is pre-trained on the standard data set to obtain a general image representation capability and then fine-tuned with actual data, which solves the overfitting problem caused by insufficient training data. By adopting multi-task learning combined with actual demands, behaviors are given multiple kinds of labels; on the one hand this trains a ranking of behavior importance or anomaly-detection indices and provides more effective information for the subsequent summary retrieval task, and at the same time multi-task learning reduces the risk of overfitting to a certain extent. The association between videos from multiple viewpoints is described with a spatio-temporal hypergraph, whose edges can carry weights; using the hypergraph, target behaviors under multiple views are analyzed collaboratively, spatio-temporal adjacency and content relevance are constructed uniformly in the hypergraph, and a summary description of the multi-viewpoint data is realized effectively. High-level semantic information is fused on top of the existing low-level visual features, and analyses such as behavior scoring can be integrated into the video retrieval module through the earlier multi-label annotation of behaviors. By embedding dedicated functional modules for up-front target detection, behavior analysis and recognition, and summary retrieval into the existing intelligent video analyzer, real-time early warning of behavior beforehand and convenient, rapid evidence collection afterwards are realized; by extracting moving targets from the video, behaviors that may occur are analyzed and predicted, realizing intelligent video processing, promoting intelligent social governance and meeting the demand for intelligence.
2. Aiming at the current situation that existing intelligent video data are underused, the intelligent behavior recognition technology is embedded into the existing intelligent video analyzer, providing a common key technology for real-time public-safety monitoring, early warning and forecasting, and emergency response. Embedding new modules into the original intelligent video analysis system can meet the demands of various industries for intelligent analysis and processing of video big data, improve their level of intelligent information management, optimize the industry chain of intelligent video big-data analysis and processing, and better realize its large-scale, industrialized application.
Drawings
Fig. 1 is a flow chart of a method of intelligent video behavior analysis based on computer vision.
Fig. 2 is a system topology diagram in a computer vision based intelligent video behavior analysis method.
Fig. 3 is a schematic diagram of a migration screening algorithm in a computer vision-based intelligent video behavior analysis method.
Fig. 4 is a schematic diagram of a 3D CNN-based multi-task learning framework in a computer vision-based intelligent video behavior analysis method.
Fig. 5 is a schematic diagram of a parameter sharing model in an intelligent video behavior analysis method based on computer vision.
Fig. 6 is a schematic diagram of a hypergraph model of multi-view video construction in a computer vision-based intelligent video behavior analysis method.
Fig. 7 is a schematic diagram of the flow of the intelligent video monitoring algorithm in the intelligent video behavior analysis method based on computer vision.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 to 7, in an embodiment of the present invention, a computer vision-based intelligent video behavior analysis method includes the following steps:
S1: first learning a detector from a standard data set by transfer learning, and then migrating the detector from the standard data set to the actual monitoring data according to the characteristics of that data, completing target detection;
S2: determining the spatial position of a target, and recognizing and analyzing the target behavior based on multi-task deep learning;
S3: after target detection and behavior recognition and analysis, performing summary analysis and retrieval on the behaviors in the video.
In S1, the learning process ordinarily needs enough labeled samples to guarantee the accuracy of the trained classifier, and this approach performs well in most application scenarios. In many practical fields, however, the assumption of sufficient labeled samples does not necessarily hold, and the lack of training samples leads to poor learning. Transfer learning uses a small number of labeled samples to assist the learning of new samples and does not require the assumption that the data are identically distributed. For application in a video monitoring system, an existing human-body data set is used as the source domain and video data obtained from actual surveillance video as the target domain, and transfer learning is performed from the source domain to the target domain; thus, even with few labeled samples from the actual scene, effective transfer information can be obtained from the source-domain data (the existing labeled data set), yielding an accurate classifier and detector.
Target detection is a binary classification problem: the goal is to decide with a classifier whether a given detection window contains a target. A transfer-learning-based detection model first trains a detector in the source domain, then runs it on the unlabeled samples of the target domain and computes the weights of positive and negative target-domain samples from the reliable detections, thereby training a detector better suited to the target domain. The different sample distributions of the two domains must be considered: the weights of the labeled source-domain samples are computed according to the target-domain distribution and brought into the target detector, so that samples are selected and the detector is updated. Referring to fig. 3, the specific implementation steps are as follows. First: adopting a model-based transfer learning strategy, the samples are clustered by the similarity of the training data in feature space and divided into subsets whose members are close in feature space. Second: a corresponding detection model is trained for each subset, for example standing pedestrians in one class and walking pedestrians in another, so that different detectors are trained for different postures. Third: because the detection model is refined, the final training model conveys more useful information; on the basis of these sub-models, the samples are progressively screened and weighted in combination with transfer learning, realizing accurate classification of target-domain samples.
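For illustration, the cluster-then-weight scheme above can be sketched as follows, assuming feature vectors (e.g. HOG) have already been extracted; the use of scikit-learn, the function names and the confidence threshold are implementation assumptions, not prescribed by the method:

```python
# Illustrative sketch of subset clustering, per-posture detectors and
# target-domain sample weighting; library choice and names are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_subset_detectors(src_feats, src_labels, n_subsets=3):
    """Steps 1-2: cluster positive source samples in feature space and
    train one detector per subset (e.g. standing vs. walking pedestrians)."""
    pos, neg = src_feats[src_labels == 1], src_feats[src_labels == 0]
    clusters = KMeans(n_clusters=n_subsets, n_init=10).fit_predict(pos)
    detectors = []
    for k in range(n_subsets):
        X = np.vstack([pos[clusters == k], neg])
        y = np.hstack([np.ones((clusters == k).sum()), np.zeros(len(neg))])
        detectors.append(SVC(probability=True).fit(X, y))
    return detectors

def reweight_target_samples(detectors, tgt_feats, conf_thresh=0.9):
    """Step 3: run the source detectors on unlabeled target-domain samples,
    keep reliable detections, and derive per-sample weights for retraining."""
    probs = np.max([d.predict_proba(tgt_feats)[:, 1] for d in detectors], axis=0)
    reliable = (probs > conf_thresh) | (probs < 1 - conf_thresh)
    pseudo_labels = (probs > 0.5).astype(int)
    weights = np.abs(probs - 0.5) * 2          # confidence as sample weight
    return tgt_feats[reliable], pseudo_labels[reliable], weights[reliable]
```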
Because target detection in a monitoring system must run in real time, a multi-layer detector model is adopted: the detector is a series connection of classifiers. The first-layer detector rapidly screens candidate windows, ensuring that positive samples pass directly while eliminating as many negative samples as possible; subsequent layers add different feature-based decision conditions to judge the candidate windows, so that the detector eliminates most negative samples at an early stage. Even though the features used for detection in the last few layers are complex, the number of remaining candidates is small, so detection labels are still obtained quickly; complex feature extraction is not performed on all candidate windows up front, shortening decision time.
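The early-rejection logic of such a cascade can be sketched as follows; the stage interface is an illustrative assumption:

```python
# Minimal sketch of the cascaded detector: cheap classifiers reject most
# negative windows early; expensive features run only on the survivors.
from typing import Callable, List, Sequence

def cascade_detect(windows: Sequence,
                   stages: List[Callable[[object], float]],
                   thresholds: List[float]):
    """Each stage returns a score; a window is rejected as soon as it falls
    below that stage's threshold, so later (costlier) stages see few windows."""
    survivors = list(windows)
    for stage, thr in zip(stages, thresholds):
        survivors = [w for w in survivors if stage(w) >= thr]
        if not survivors:            # everything rejected early
            break
    return survivors                 # windows accepted by all stages
```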
In S2, after the spatial position of the specific target is determined (the target detection of S1), the target's specific behavior is analyzed and recognized. Existing behavior recognition falls into two modes, image-based and video-based. Behavior recognition based on a single image lacks motion information, and behavior in a static image cannot be encoded with traditional spatio-temporal features; video-based behavior recognition can extract low-level features such as spatio-temporal interest points (STIP) from spatio-temporal volumes and classify different behaviors. For behaviors such as the high jump and the long jump, a single frame may not distinguish one from the other, while the temporal context of video makes the target behavior easy to recognize and analyze.
Since the processed object is video data, a 3D convolutional neural network is used for the convolutional network training, and features are computed on the video with a 3D convolutional neural network (3D CNN) model. First: N video frames are input and passed through preset filters to obtain 5 kinds of features: gray scale, x-direction gradient, y-direction gradient, x-direction optical flow and y-direction optical flow; extracting these preset prior features is more targeted than random initialization and gives better results. Second: the 3D convolutional network performs convolution and downsampling on the extracted features, with each convolution layer followed by a sampling layer. Third: a fully connected layer is output at the end (for the structure, refer to fig. 4). The 3D CNN model extends the traditional CNN into the time dimension: the convolution operation describes the video data in the x direction, the y direction and the time-axis direction, so motion information is encoded into the feature space and the model carries more information than an image-based CNN. For the analysis of actual surveillance video data, a multi-task learning mode is added, because in practice merely recognizing the behavior class of a target is not enough: the behavior must also be analyzed. Learning additional tasks, such as multiple kinds of labels, better serves the subsequent behavior retrieval task; the simplest example of an auxiliary task is judging, alongside the behavior classification, whether the behavior is suspicious.
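A rough PyTorch sketch of such a 3D CNN feature extractor is given below; the channel sizes, kernel shapes and classifier head are assumptions, while the five hardwired input channels (gray scale, x/y gradients, x/y optical flow) follow the text:

```python
# Sketch of a 3D CNN over video clips: convolution and downsampling act on
# space and time, ending in a fully connected layer.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            # 5 hardwired feature maps per frame enter as channels.
            nn.Conv3d(5, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # downsample space, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),          # now also downsample time
        )
        self.head = nn.LazyLinear(n_classes)  # final fully connected layer

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 5, n_frames, height, width)
        x = self.features(clips)
        return self.head(x.flatten(1))

# Usage: logits = Simple3DCNN(n_classes=10)(torch.randn(2, 5, 7, 60, 40))
```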
Multi-task learning is a machine learning method opposed to single-task learning: several tasks are learned in parallel and their results influence one another, so the learning is joint. Most machine learning tasks are single-task learning: a complex problem is decomposed into independent sub-problems whose results are then combined into the result of the original problem. This seems reasonable, but it is not quite right in practice, because in the real world learning tasks on the same behavioral data are interrelated. Multi-task learning shares the associations between different tasks and so achieves a better generalization effect; in behavior recognition it is all the more necessary to avoid overfitting the target task when the number of labelable samples is small.
The key to multi-task learning is information sharing, which in a convolutional network takes the form of parameter sharing. There are two modes, hard sharing and soft sharing (refer to fig. 5). Hard sharing shares parameters in every hidden layer of the network model, keeping only the task-specific output layers unshared; the output fully connected layers are designed per task. In soft sharing, each task has its own model and parameters, and the sharing across tasks is reflected in regularizing the distance between model parameters. For the behavior recognition task, hard sharing is adopted: on the one hand, the tasks of recognizing and analyzing a person's behavior are highly correlated, so each task does not need its own model and parameters as in soft sharing, and the hard-sharing model has few parameters and is easy to train; on the other hand, hard sharing learns multiple tasks simultaneously, forcing the model to capture a unified representation of each target behavior, which greatly reduces the risk of overfitting.
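Hard parameter sharing can be sketched as one shared trunk with per-task output layers; the two task heads (behavior class and suspicious/normal) are illustrative:

```python
# Sketch of hard parameter sharing: a shared feature extractor feeds
# separate fully connected heads, one per task.
import torch
import torch.nn as nn

class HardSharedModel(nn.Module):
    def __init__(self, trunk: nn.Module, feat_dim: int, n_behaviors: int):
        super().__init__()
        self.trunk = trunk                      # shared hidden layers
        self.behavior_head = nn.Linear(feat_dim, n_behaviors)
        self.suspicious_head = nn.Linear(feat_dim, 2)

    def forward(self, clips: torch.Tensor):
        feats = self.trunk(clips)               # unified representation
        return self.behavior_head(feats), self.suspicious_head(feats)

def multitask_loss(behavior_logits, susp_logits, y_behavior, y_susp, alpha=0.5):
    """Joint loss: the tasks share gradients through the trunk."""
    ce = nn.functional.cross_entropy
    return ce(behavior_logits, y_behavior) + alpha * ce(susp_logits, y_susp)
```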
In S3, accurate condensation and summarization of video helps the supervision department quickly mine useful information from massive video data, saving considerable labor and material costs and increasing supervision efficiency. Most existing video summarization techniques target traditional single-camera video and can hardly process the multi-camera video of current complex scenes effectively: on the one hand, existing single-view summary analysis does not consider effects such as illumination and occlusion on targets in complex scenes; on the other hand, single-view summary analysis ignores the spatio-temporal overlap between videos, producing redundant, repeated summaries.
Video behaviors are therefore summarized in a multi-view manner. In multi-view behavior summary analysis, the temporal topology of the multi-camera network imposes corresponding spatio-temporal constraints on targets in the video data, enabling multi-view summary analysis based on the topology of the camera network. Using the correlated redundant information provided by the multiple cameras, a behavior summarization model fusing overlapping camera views is developed: target actions and behaviors under multiple views are expressed and summarized collaboratively, and target motion information across multiple spatio-temporal scales is fused, realizing hierarchical summary analysis and description of the video data.
Data from different viewpoints are associated by constructing a spatio-temporal hypergraph (refer to fig. 6). The construction must consider the different attribute relationships between the multi-view videos, such as temporal proximity, content similarity and high-level semantic feature relationships. The specific construction method is as follows. First: each node in the hypergraph represents a picture extracted from the video, and each hyperedge corresponds to one type of attribute relationship between pictures. Second: the hypergraph is converted into a weighted spatio-temporal shot graph, whose edge weights quantitatively measure the relationships between the multi-view videos; the complex and voluminous multi-view video data are thereby converted into a graph problem. On this basis, low-level visual features of the video, such as color and motion vectors, are computed and combined with the indices from the earlier behavior analysis to evaluate the importance of video pictures, so that features are extracted in a more targeted way. The spatio-temporal hypergraph is a very good model for analyzing multi-view video: it contains not only the information of the video pictures but also reflects the relevance of data from different views. Once the hypergraph and the weights of the different behavior pictures are determined, the pictures can be classified by temporal similarity and other algorithms (such as random walk) applied; video pictures are finally selected as candidates for the summary structure, a multi-objective optimization algorithm generates the final summary, and the importance weights can be adjusted according to user requirements to adjust the generated summary.
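A toy version of the shot-graph stage may look as follows; the use of networkx, the similarity callbacks and the weight mix are assumptions, with PageRank standing in for the random-walk scoring:

```python
# Frames become nodes, attribute relations become weighted edges, and a
# random walk (PageRank) scores frame importance for summary selection.
import networkx as nx

def build_shot_graph(frames, temporal_sim, content_sim, w_time=0.4, w_content=0.6):
    """Collapse the attribute relations (temporal proximity, content
    similarity) into one weighted spatio-temporal shot graph."""
    g = nx.Graph()
    g.add_nodes_from(range(len(frames)))
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            w = w_time * temporal_sim(i, j) + w_content * content_sim(frames[i], frames[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    return g

def select_summary_frames(graph, top_k=10):
    """Random-walk importance over the weighted graph; the top-k frames
    become candidate pictures for the summary."""
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```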
Most existing video retrieval mines various features from the raw data as cues; however, a single content-based retrieval model can hardly exploit the rich semantic information contained in video, and thus can hardly reach accurate retrieval results. By fusing visual information over multiple dimensions and starting from the correlation of the data in each dimension, high-level features richer in semantic information are obtained, realizing accurate and effective retrieval. During retrieval, the different semantic tags attached to video behaviors in the earlier multi-task learning can be integrated into independent retrieval modules, forming sub-retrieval modules; a random forest strategy is then adopted, treating the different sub-retrieval modules as weak classifiers, with model design and optimization on the principle of decision trees of different depths.
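Combining the sub-retrieval modules as weak classifiers can be sketched as a simple vote; the sub-module interface is an illustrative assumption:

```python
# Random-forest-style combination of sub-retrieval modules: each semantic
# tag gets a weak scorer, and their votes are averaged.
from typing import Callable, Dict, List

def ensemble_retrieve(query_feats,
                      sub_modules: Dict[str, Callable],
                      videos: List,
                      top_k: int = 5):
    """Each sub-module scores every video for the query on its own semantic
    tag (e.g. 'running', 'suspicious'); scores are averaged like the votes
    of weak classifiers in a random forest."""
    ranked = []
    for v in videos:
        votes = [score(query_feats, v) for score in sub_modules.values()]
        ranked.append((sum(votes) / len(votes), v))
    ranked.sort(key=lambda t: t[0], reverse=True)
    return [v for _, v in ranked[:top_k]]
```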
Through up-front target detection, behavior analysis and recognition, and later retrieval, dedicated functional modules are embedded into the existing intelligent video analyzer, realizing real-time early warning of behavior beforehand and evidence collection afterwards. The embedded intelligent behavior analysis algorithms must take scalability, real-time performance and robustness into account; moreover, to perform behavior analysis more efficiently in practical applications, they are fused with the existing intelligent video analyzer, making full use of its original recognition features for collaborative analysis.
Example 1
By studying the framework and flow of intelligent video monitoring algorithms, the question is how to extract semantic understanding that conforms to human cognition from raw video data; that is, the computer is expected to analyze and understand the video automatically as a human would, for example judging which targets of interest are in the scene, their historical motion trajectories, what actions they perform, and the relationships between targets. In general, the processing of video images in intelligent video monitoring research can be divided into three layers: bottom, middle and high (refer to fig. 7). Bottom layer: an image sequence is acquired from the video capture terminal and targets of interest are detected and tracked for subsequent processing and analysis, mainly answering where the target is; target detection divides into target modeling and background modeling, and target tracking obtains a moving target's active time, position, direction and speed of motion, size and appearance (color, shape, texture), divided into single-scene tracking and cross-scene tracking. Middle layer: on the basis of the bottom layer, various information about the moving target is extracted and related judgments are made; the content includes target recognition, which classifies targets and identifies their identity, divided into target classification and individual recognition. Middle-layer analysis builds a bridge from bottom-layer processing to the understanding of high-level behavior, filling the semantic gap between the bottom and high layers and mainly answering what the target is. High layer: high-level processing analyzes and understands the target's behavior; high-level semantics involve concrete semantic scenes and are often closely tied to specific applications. Behavior analysis divides into gesture recognition, behavior recognition and event analysis, mainly answering what the target is doing.
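The three layers can be read as a pipeline skeleton such as the following; all class and function names are illustrative placeholders:

```python
# Skeletal three-layer pipeline mirroring the bottom/middle/high split
# described above; each layer's body is left as a stub.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Track:
    track_id: int
    positions: List[tuple] = field(default_factory=list)  # historical trajectory
    label: str = "unknown"                                 # filled by middle layer
    behavior: str = "unknown"                              # filled by high layer

def bottom_layer(frames) -> List[Track]:
    """Where is the target? Detect and track moving targets."""
    raise NotImplementedError("background modeling + tracking goes here")

def middle_layer(tracks: List[Track]) -> List[Track]:
    """What is the target? Classify and identify each tracked target."""
    raise NotImplementedError("target classification / individual recognition")

def high_layer(tracks: List[Track]) -> List[Track]:
    """What is the target doing? Recognize behaviors and events."""
    raise NotImplementedError("behavior recognition / event analysis")

def analyze(frames):
    return high_layer(middle_layer(bottom_layer(frames)))
```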
Example 2
By modeling and analyzing pedestrian behavior in surveillance video, normal and abnormal behaviors in a crowd are distinguished and disasters and accidents discovered in time. Early research on crowd abnormal-behavior detection and its current state are reviewed; crowd abnormal-behavior detection algorithms based on deep learning are analyzed and summarized, reconsidering research based on convolutional neural networks, auto-encoder networks and generative adversarial networks. The performance of deep learning methods on the UCSD pedestrian data set is compared and analyzed, and the difficulties of the crowd abnormal-behavior detection task are summarized.
Example 3
Several video target detection methods are studied, including target detection based on background modeling and target detection based on target modeling, the latter covering: rigid global template models, visual-dictionary-based models, part-based models and deep-learning-based models. Target detection extracts the motion foreground or target of interest from a video or image, i.e. determines the target's position and size in the current frame; it therefore occupies a foundational position in intelligent video monitoring algorithms, and its quality directly affects the performance of subsequent target tracking and of target classification and recognition. According to the data objects processed, target detection divides into moving-target detection based on background modeling and detection based on target modeling. Background modeling requires that the target of interest keeps moving while the background stays unchanged: when the background changes, the method may misdetect the changed background as moving foreground, and a moving target that stays still for a period of time is reclassified as background, so it is hard to use in scenes with a changing background, such as shooting with a hand-held or vehicle-mounted camera. Since it generally meets real-time requirements, it is widely used in applications with fixed cameras.
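A minimal fixed-camera background-modeling example using OpenCV's MOG2 background subtractor follows; the video path and area threshold are placeholders:

```python
# Background subtraction for moving-target detection with a fixed camera:
# model the background, extract the foreground mask, box the moving blobs.
import cv2

cap = cv2.VideoCapture("surveillance.mp4")        # placeholder input
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                # foreground mask
    mask = cv2.medianBlur(mask, 5)                # suppress noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:              # ignore tiny blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```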
Example 4
Advanced research on deep-learning-based crowd abnormal-behavior detection is explored: by extracting high-level features of pedestrian appearance and motion from video, algorithms that effectively distinguish normal from abnormal behavior are obtained. The deep learning methods divide into a supervised mode and an unsupervised/weakly supervised mode for the experiments. In the supervised mode, a deep neural network extracts high-level features from labeled images and a classifier separates them into normal and abnormal behavior. Because enough labeled abnormal-behavior data is hard to obtain, the unsupervised/weakly supervised mode builds a behavior model from video data containing only normal behavior and treats behavior that does not fit the model as abnormal: high-level behavior features are extracted from video sequences free of abnormal behavior, a normal-behavior model is built with a deep neural network in the training stage, the same kind of features are extracted in the test stage, behaviors deviating from the model are screened with a suitable detection method, and the region where the abnormal behavior occurs is usually located.
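The unsupervised route can be sketched with an auto-encoder trained only on normal clips, flagging clips with high reconstruction error; the architecture and threshold are assumptions:

```python
# Normal-behavior modeling by reconstruction: an auto-encoder fit on normal
# clip features; poorly reconstructed clips at test time are flagged abnormal.
import torch
import torch.nn as nn

class ClipAutoencoder(nn.Module):
    def __init__(self, feat_dim: int = 512, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def is_abnormal(model: ClipAutoencoder, clip_feats: torch.Tensor,
                threshold: float) -> torch.Tensor:
    """Clips whose features the model (trained on normal behavior only)
    cannot reconstruct well are flagged as abnormal."""
    with torch.no_grad():
        err = ((model(clip_feats) - clip_feats) ** 2).mean(dim=1)
    return err > threshold
```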
Example 5
Experimental comparison of some algorithms for computer-vision-based intelligent video behavior analysis is completed, mainly including: comparison of common data sets for abnormal-behavior detection, and target-tracking scene experiments. The data sets most commonly used for video abnormal-behavior detection at present fall into two classes: one class serves individual abnormal-behavior detection, such as UCSD, BEHAVE, CAVIAR, UCF-Crime and Avenue; the other class serves group abnormal-behavior detection, as shown in the comparison in table 1:
Table 1: abnormal behavior detection common dataset contrast
From the data in table 1 it can be seen that data sets for group abnormalities are few; most public data sets serve the detection and recognition of low-density crowds or single-person behavior.
The target-tracking scene experiments compare the continuous position of the target, i.e. where the target is located. Target tracking is a basic problem in computer vision and an important link in intelligent video monitoring with broad application value: it records the historical motion trajectory and motion parameters of the target of interest and lays the foundation for higher-level behavior analysis and understanding. According to the application scene, target tracking algorithms divide into single-scene and multi-scene tracking; single-scene tracking includes single-target and multi-target tracking, and multi-scene tracking divides into overlapping-scene and non-overlapping-scene tracking. Summarizing the characteristics of the single-scene, overlapping-scene and non-overlapping-scene algorithms: in a single scene, the spatial positions of the same target in two consecutive frames are very close; in overlapping-scene tracking, a target passes from one scene to another through the overlap, and the continuous spatial relationship determines the identity of the target entering the new scene; in non-overlapping-scene tracking, blind areas between scenes mean that observations of the same target by different scenes differ greatly in time and space. The classification and characteristics of target tracking algorithms are shown in table 2:
Table 2: target tracking algorithm classification and characteristics
Target tracking in a single scene aims at continuous tracking of a specified single target: only one specified target is tracked in the video shot by one camera. The relationship between target tracking and target detection takes two forms. In one, tracking builds an appearance model of the foreground target on the basis of detection and then finds the target's current optimal position according to some tracking strategy (also called generative tracking). In the other, tracking and detection proceed simultaneously, i.e. tracking-by-detection: the tracking problem is treated as a foreground-background classification problem, and a learned classifier searches the current frame for the foreground region best distinguished from the background (also called discriminative tracking). Multi-scene tracking establishes a unique identity for each moving target under a multi-camera monitoring network, ensuring continuous tracking overall; each camera runs a single-scene tracking algorithm, and these algorithms are both independent and interdependent. Independence means each camera detects and tracks targets by itself until they leave its view; dependence means that when a camera detects a new target, it exchanges information with the other cameras to determine the target's identity, whether newly entered into the system or already seen in other scenes. Overlapping-scene tracking observes the same region from different viewing angles with several cameras, and the spatial relationship provides favorable conditions for continuous cross-scene tracking; there are two main solutions. One first determines the field-of-view boundary between camera views, i.e. the position of the overlapping area between two scenes, and identifies a newly entered target from the fact that a target appearing at the boundary corresponds to a target that has just entered the view in the other scene. The other establishes correspondences between targets observed in different scenes via a homography matrix, computing target positions in the corresponding scenes to associate targets across scenes. Non-overlapping-scene tracking differs from the overlapping case and even more from traditional single-scene tracking: the monitoring blind area between scenes makes the times and positions at which different cameras observe the same target discontinuous, i.e. severe spatio-temporal information loss, which increases the difficulty of the problem. Compared with single-scene tracking, it has two distinctive research topics: camera network topology estimation and cross-camera target re-identification.
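The homography-based association between two overlapping views can be sketched as follows; the matched calibration points and the distance threshold are assumptions:

```python
# Homography-based target association between two overlapping camera views:
# estimate the homography once from matched ground-plane points, then map a
# target's position into the other view and pick the nearest candidate.
import cv2
import numpy as np

# Matched ground-plane points seen in camera A and camera B (calibration).
pts_a = np.array([[100, 400], [520, 410], [80, 230], [500, 220]], np.float32)
pts_b = np.array([[60, 380], [480, 400], [70, 200], [470, 210]], np.float32)
H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC)

def associate(target_a_xy, candidates_b, max_dist=30.0):
    """Project a target's foot point from view A into view B and pick the
    nearest candidate track within max_dist pixels, if any."""
    if not candidates_b:
        return None
    p = np.array([[target_a_xy]], np.float32)            # shape (1, 1, 2)
    projected = cv2.perspectiveTransform(p, H)[0, 0]
    dists = [np.linalg.norm(projected - np.array(c)) for c in candidates_b]
    best = int(np.argmin(dists))
    return best if dists[best] < max_dist else None
```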
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the described embodiments may be modified or their elements replaced by equivalents; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. An intelligent video behavior analysis method based on computer vision, characterized by comprising the following steps:
S1: first learning a detector from a standard data set by transfer learning, and then migrating the detector from the standard data set to the actual monitoring data according to the characteristics of that data, completing target detection;
S2: determining the spatial position of a target, and recognizing and analyzing the target behavior based on multi-task deep learning;
S3: after target detection and behavior recognition and analysis, performing summary analysis and retrieval on the behaviors in the video.
2. The computer-vision-based intelligent video behavior analysis method according to claim 1, wherein: in S1, for application in a video monitoring system, a target detection algorithm based on transfer learning is used: an existing human-body data set serves as the source domain, video data obtained from actual surveillance video serves as the target domain, and transfer learning is performed from the source domain to the target domain, so that even if the number of labeled samples from the actual scene is small, effective transfer information can be obtained from the source-domain data (the existing labeled data set), yielding an accurate classifier and detector.
3. The computer-vision-based intelligent video behavior analysis method according to claim 1, wherein: in S1, target detection is a binary classification problem: the goal is to determine with a classifier whether a given detection window contains a target. A transfer-learning-based target detection model must first train a detector in the source domain, then run the trained detector on the unlabeled samples of the target domain, and compute the weights of positive and negative target-domain samples from the reliable samples obtained by detection, thereby training a target detector better suited to the target domain; the different sample distributions of the target domain and the source domain must be considered, the weights of the labeled source-domain samples computed according to the target-domain sample distribution, and this weight information brought into the target detector, so that samples are selected and the detector updated. The specific implementation steps include: first, adopting a model-based transfer learning strategy, clustering the samples according to the similarity of the training data in feature space and dividing them into different subsets, so that the samples within a subset are close in feature space; second, training a corresponding detection model for each subset, for example standing pedestrians in one class and walking pedestrians in another, so that different detectors are trained for different postures; third, refining the detection model so that the finally obtained training model conveys more useful information, and, on the basis of these sub-models, screening and weighting the samples in combination with transfer learning, realizing accurate classification of the samples in the target domain. The detector is formed by connecting a group of classifiers in series: the first-layer detector rapidly screens candidate windows, ensuring that positive samples pass directly while eliminating as many negative samples as possible; subsequent layers add different feature-based decision conditions to judge the candidate windows, so that the detector eliminates most negative samples at an early stage; even though the features used for detection in the last few layers are complex, the number of candidate samples has been reduced, so detection labels are still obtained quickly, complex feature extraction is not performed on all candidate windows, and decision time is shortened.
4. The computer-vision-based intelligent video behavior analysis method according to claim 1, wherein: in S2, after the spatial position of the specific target is determined (the target detection of S1), the specific behavior of the target is analyzed and identified, and the target behavior is recognized and analyzed based on the video data.
5. The computer-vision-based intelligent video behavior analysis method according to claim 1, wherein: in S2, a 3D convolutional neural network is adopted for the convolutional network training, and features are computed on the video data with a 3D convolutional neural network (3D CNN) model. First: N video frames are input and passed through preset filters to obtain 5 different features: gray scale, x-direction gradient, y-direction gradient, x-direction optical flow and y-direction optical flow; extracting these preset prior features is more targeted than random initialization and gives better results. Second: the 3D convolutional network performs convolution and downsampling on the extracted features, with each convolution layer followed by a sampling layer. Third: a fully connected layer is output at the end. The 3D CNN model extends the traditional CNN into the time dimension: the convolution operation describes the features of the video data in the x direction, the y direction and the time-axis direction, so motion information is encoded into the feature space and more information is available than with an image-based CNN. For the analysis of actual surveillance video data, the 3D CNN model is improved by adding a multi-task learning mode that learns additional tasks, such as multiple kinds of labels, so that it better serves the subsequent behavior retrieval task; multi-task learning shares the associations between different tasks and thus achieves a better generalization effect, and in behavior recognition, multi-task learning is needed to avoid overfitting the target task when the number of labelable samples is small.
6. The computer vision-based intelligent video behavior analysis method according to claim 1, wherein: in the step S2, the key to multi-task learning is information sharing, which in a convolutional network takes the form of parameter sharing. Parameter sharing has two modes: in hard sharing, the parameters of every hidden layer of the network model are shared and only the task-specific output layers are kept separate, with each output fully connected layer designed for its task; in soft sharing, each task keeps its own model and parameters, and the sharing across tasks is expressed by regularizing the distance between model parameters. For the behavior recognition task, hard sharing is adopted: on one hand, the per-person behavior recognition and analysis tasks are highly correlated, so they do not need independent models and parameters as in soft sharing, and the smaller parameter scale of hard sharing makes training easier; on the other hand, hard sharing learns multiple tasks simultaneously, forcing the model to capture a unified representation of each target behavior, which greatly reduces the risk of overfitting. A sketch of this structure follows.
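The hard-sharing choice can be sketched as one shared trunk with only the per-task output heads unshared; the task names, dimensions, and joint loss below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HardSharedMultiTask(nn.Module):
    def __init__(self, task_sizes=None):
        super().__init__()
        task_sizes = task_sizes or {"action": 10, "pose": 4, "interaction": 3}
        self.trunk = nn.Sequential(               # shared hidden layers
            nn.Conv3d(5, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        # only the task-specific output layers are kept separate
        self.heads = nn.ModuleDict({t: nn.Linear(32, n) for t, n in task_sizes.items()})

    def forward(self, clip):
        z = self.trunk(clip)                      # unified representation
        return {t: head(z) for t, head in self.heads.items()}

# Joint training sums per-task cross-entropy losses over the shared trunk.
model = HardSharedMultiTask()
out = model(torch.randn(2, 5, 16, 64, 64))
targets = {"action": torch.tensor([1, 3]), "pose": torch.tensor([0, 2]),
           "interaction": torch.tensor([2, 1])}
loss = sum(nn.functional.cross_entropy(out[t], targets[t]) for t in targets)
loss.backward()
```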
7. The computer vision-based intelligent video behavior analysis method according to claim 1, wherein: in the step S3, accurately condensing and summarizing videos helps supervision departments quickly mine useful information from massive video data, saves substantial manpower and material costs, and increases supervision efficiency. Existing video summarization techniques target traditional single-camera data and have difficulty handling the multi-camera video of today's complex scenes: on one hand, existing single-view summarization does not consider the effects of illumination, occlusion, and the like on targets in complex scenes; on the other hand, single-view analysis ignores the spatio-temporal overlap between videos, producing redundant, repetitive summaries. Video behaviors are therefore summarized in a multi-view manner. During multi-view behavior summarization, the temporal topology of the multi-camera network imposes spatio-temporal constraints on the targets in the video data, enabling multi-view summarization based on the camera network topology; a behavior summarization model that fuses the overlapping fields of view of multiple cameras is developed from the correlated, redundant information those cameras provide, target actions under multiple views are represented and summarized, and motion information is analyzed and summarized hierarchically across the multiple views, as sketched below.
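One hedged reading of the spatio-temporal constraint is that the camera-network topology supplies an expected transition-time window between camera pairs, and cross-view target associations falling outside that window are discarded before summarization. The topology values and data layout below are invented for illustration.

```python
from dataclasses import dataclass

# Expected travel-time window (seconds) between camera pairs: the topology.
TOPOLOGY = {("cam1", "cam2"): (5.0, 20.0), ("cam2", "cam3"): (2.0, 10.0)}

@dataclass
class Sighting:
    camera: str
    target_id: int
    t: float            # timestamp in seconds

def consistent(a: Sighting, b: Sighting) -> bool:
    """True if sighting b can follow sighting a under the network topology."""
    window = TOPOLOGY.get((a.camera, b.camera))
    if window is None or a.target_id != b.target_id:
        return False
    lo, hi = window
    return lo <= (b.t - a.t) <= hi

# Keep only cross-view links that satisfy the spatio-temporal constraint.
sightings = [Sighting("cam1", 7, 100.0), Sighting("cam2", 7, 112.0),
             Sighting("cam3", 7, 113.0)]
links = [(a, b) for a in sightings for b in sightings if consistent(a, b)]
print(len(links))       # only cam1 -> cam2 passes (12 s inside the 5-20 s window)
```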
8. The computer vision-based intelligent video behavior analysis method according to claim 1, wherein: in the step S3, data from different viewpoints are associated by constructing a spatio-temporal hypergraph, and the construction must consider the different attribute relationships between the multi-view videos, such as temporal proximity, content similarity, and high-level semantic feature relations. The specific construction method is: the first step: each node of the hypergraph represents a picture extracted from the video, and each hyperedge corresponds to one type of attribute relationship between pictures; the second step: convert the hypergraph into a weighted spatio-temporal shot graph whose edge weights quantitatively measure the relations between the multi-view videos, turning the large and complex multi-view video data into a solvable graph problem. On this basis, low-level visual features of the video, such as color and motion vectors, are computed and combined with the indicators obtained in the earlier behavior analysis to evaluate the importance of video pictures, so that features can be extracted in a more targeted way. The spatio-temporal hypergraph is a very good model for multi-view video analysis: it contains the information of the video pictures and also reflects the relevance between data from different views. Once the hypergraph is determined and the weights of the different behavior pictures are fixed, the pictures can be grouped by temporal similarity, other algorithms (such as a random walk) select pictures as candidates for the summary, and a multi-objective optimization algorithm generates the final summary; the importance weights can be adjusted according to user requirements to tune the generated summary. A sketch of the random-walk scoring follows.
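A minimal sketch of the second step and the random-walk selection, assuming the hyperedge relations (temporal proximity, content similarity, and so on) have already been collapsed into a single pairwise weight per frame pair: a PageRank-style random walk scores frames, and the top-scoring ones become summary candidates. The affinity matrix, damping factor, and candidate count are illustrative.

```python
import numpy as np

def frame_importance(W, damping=0.85, iters=50):
    """Stationary random-walk scores over a weighted frame-affinity matrix W."""
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)          # row-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (P.T @ r)
    return r

# Toy affinity between four frames; each entry merges the attribute relations
# (e.g., temporal proximity * content similarity) into one weight.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.8, 0.1],
              [0.1, 0.8, 0.0, 0.7],
              [0.0, 0.1, 0.7, 0.0]]) + 1e-6       # small floor keeps rows valid

scores = frame_importance(W)
candidates = np.argsort(scores)[::-1][:2]          # top-2 frames as summary candidates
print(candidates, scores.round(3))
```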
9. The computer vision-based intelligent video behavior analysis method according to claim 1, wherein: in the step S3, high-level features rich in semantic information are obtained by fusing the visual information of the studied dimensions, starting from the correlations among the data of each dimension, so that accurate and effective retrieval is achieved. During retrieval, the different semantic tags attached to video behaviors by the multi-task learning of the earlier behavior analysis are integrated into independent retrieval modules, forming sub-retrieval modules; a random forest strategy can then be adopted, treating the different sub-retrieval modules as weak classifiers, with model design and optimization carried out using decision trees of different depths, as in the sketch below.
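A hedged sketch of the ensemble this claim describes: each sub-retrieval module is modeled as a decision tree of its own depth over a subset of the fused features, and the trees vote as in a random forest. The feature layout, tree depths, and toy relevance label are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                 # fused multi-dimensional features
y = (X[:, 0] + X[:, 3] > 0).astype(int)        # toy "relevant to query" label

# One weak sub-retrieval module per semantic tag group, each with its own depth.
modules = []
for depth in (3, 5, 8):
    idx = rng.choice(len(X), size=len(X), replace=True)      # bootstrap sample
    cols = rng.choice(X.shape[1], size=6, replace=False)     # per-module features
    tree = DecisionTreeClassifier(max_depth=depth).fit(X[idx][:, cols], y[idx])
    modules.append((tree, cols))

def retrieve_score(x):
    """Average vote of the sub-retrieval modules for one query vector."""
    return np.mean([t.predict(x[cols].reshape(1, -1))[0] for t, cols in modules])

print(retrieve_score(X[0]))
```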
CN202310526954.5A 2023-05-10 2023-05-10 Intelligent video behavior analysis method based on computer vision Pending CN117636454A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310526954.5A CN117636454A (en) 2023-05-10 2023-05-10 Intelligent video behavior analysis method based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310526954.5A CN117636454A (en) 2023-05-10 2023-05-10 Intelligent video behavior analysis method based on computer vision

Publications (1)

Publication Number Publication Date
CN117636454A true CN117636454A (en) 2024-03-01

Family

ID=90029252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310526954.5A Pending CN117636454A (en) 2023-05-10 2023-05-10 Intelligent video behavior analysis method based on computer vision

Country Status (1)

Country Link
CN (1) CN117636454A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097199A (en) * 2024-04-29 2024-05-28 菏泽单州数字产业发展有限公司 Global view-coupled sensing system based on neural network and control method

Similar Documents

Publication Publication Date Title
Singh et al. Deep spatio-temporal representation for detection of road accidents using stacked autoencoder
Santhosh et al. Anomaly detection in road traffic using visual surveillance: A survey
Huang et al. Intelligent intersection: Two-stream convolutional networks for real-time near-accident detection in traffic video
Hashemzadeh et al. Fire detection for video surveillance applications using ICA K-medoids-based color model and efficient spatio-temporal visual features
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
Fan et al. A survey of crowd counting and density estimation based on convolutional neural network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Zhang et al. Mining semantic context information for intelligent video surveillance of traffic scenes
AU2014240213A1 (en) System and Method for object re-identification
Xie et al. Deep learning-based computer vision for surveillance in ITS: Evaluation of state-of-the-art methods
Wechsler et al. Automatic video-based person authentication using the RBF network
Li et al. A review of deep learning methods for pixel-level crack detection
Liang et al. Methods of moving target detection and behavior recognition in intelligent vision monitoring.
Amosa et al. Multi-camera multi-object tracking: a review of current trends and future advances
An Anomalies detection and tracking using Siamese neural networks
Shen et al. Infrared multi-pedestrian tracking in vertical view via siamese convolution network
CN117636454A (en) Intelligent video behavior analysis method based on computer vision
Ji et al. A hybrid model of convolutional neural networks and deep regression forests for crowd counting
Ansari et al. A fusion of dolphin swarm optimization and improved sine cosine algorithm for automatic detection and classification of objects from surveillance videos
Ul Amin et al. An Efficient Attention-Based Strategy for Anomaly Detection in Surveillance Video.
Wang et al. Instantly telling what happens in a video sequence using simple features
Li et al. Multi-branch gan-based abnormal events detection via context learning in surveillance videos
Park et al. Intensity classification background model based on the tracing scheme for deep learning based CCTV pedestrian detection
Bravi et al. Detection of stop sign violations from dashcam data
Lee et al. STDP-Net: Improved pedestrian attribute recognition using Swin transformer and semantic self-attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination