CN111325279A - Pedestrian and personal sensitive article tracking method fusing visual relationship


Info

Publication number
CN111325279A
Authority
CN
China
Prior art keywords: visual, sensitive, image, features, pictures
Prior art date
Legal status: Granted
Application number
CN202010121414.5A
Other languages
Chinese (zh)
Other versions
CN111325279B (en)
Inventor
柯逍
黄新恩
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202010121414.5A
Publication of CN111325279A
Application granted
Publication of CN111325279B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06Q50/265: ICT specially adapted for government or public services; personal security, identity or safety


Abstract

The invention relates to a method for tracking pedestrians and the sensitive articles they carry that fuses visual relationships. Sensitive articles are first detected with a RetinaNet target detection model incorporating deformable convolution, visual relationships of the sensitive articles are then detected with a visual relationship detection model that fuses multiple features, and finally the visual relationships are tracked with the Deep Sort multi-target tracking algorithm. The invention can effectively discover potential threats in surveillance video and remedies shortcomings of existing intelligent security systems.

Description

Pedestrian and personal sensitive article tracking method fusing visual relationship
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for tracking pedestrians and the sensitive articles they carry that fuses visual relationships.
Background
The handling of social security incidents is an important component of public safety governance in China. Such incidents are mostly sudden, everyday safety events, and many of them are serious in nature and have an extremely severe impact. At the same time, because of China's very large population, it faces more potential safety hazards than less populous countries.
To guard against such hazards to some extent, video surveillance systems have been widely deployed. In recent years the video surveillance market in China has grown steadily, and coverage of major cities has expanded. In urban public security surveillance, pedestrians, as the main participants in social activity, are the primary monitoring targets: public security surveillance systems monitor people and their behavior in all kinds of public places and can directly reflect the connection between criminal suspects and criminal activity. The behavior of people in public places affects the social order of cities, and video surveillance of public places is necessary in order to handle sudden public security events.
The construction of video surveillance systems and their applications play an increasingly important role in fighting crime and maintaining stability, and have become a new means of investigation for public security organs. After a case occurs, security officers usually review the surveillance videos around the suspect's crime scene and around the time of the incident to obtain the activity trajectory and behavior of a specific target, so as to further lock onto and track that target. In current practice this process is still largely manual; taking Shenzhen as an example, the city has a resident population of over ten million, a managed population of over twenty million, and more than 1.3 million surveillance cameras, a volume of surveillance video that cannot be processed manually alone. Manual processing also has serious limitations. On the one hand, humans, as the agents processing the surveillance video, have physiological limits; on the other hand, humans are not perfectly reliable observers and often fail to notice security threats in the video or overlook them, so that no warning is issued in time. Responding immediately to any abnormal behavior in the footage would require security personnel to watch the monitors continuously and closely, which causes fatigue; especially when viewing multiple video streams, human eyes cannot capture several streams of information at once, and it is difficult to make a correct judgment immediately. Traditional video surveillance systems only provide basic functions such as storage and playback of recorded video: they can only record what has happened, cannot raise intelligent alarms, and provide no early warning. Moreover, surveillance video often serves only as evidence after the fact; even when the appearance of a suspect is obtained afterwards in this way, the suspect has often already fled and cannot be caught in time. Intelligent security technology is therefore urgently needed to assist security personnel and to build a more complete security system.
With the development of computer technology, a new generation of intelligent security surveillance systems has appeared on the market. Their core is to monitor people through computer vision, one of the main branches of artificial intelligence research, which understands the content of video through digital image analysis. If the surveillance camera is the human eye, the intelligent surveillance system is the brain that processes the information: relying on computing power far beyond the human brain, it analyzes and filters the large amount of data in the surveillance video and presents only the useful, key information to the operator. Intelligent video surveillance can run around the clock; the computer compensates for human physiological shortcomings, processes the video in real time all day long, and can raise an alarm at any moment. A good intelligent surveillance system can also improve alarm precision and reduce the probability of missed and false alarms; with strong image processing capability and well-designed algorithms it can recognize the abnormal behavior of a single target in the picture and analyze the group behavior of many people. It can analyze and judge such behavior and raise alarms: when abnormal behavior such as loitering, running or jumping occurs, or a crowd crush happens, security personnel are alerted in time. For example, some intelligent security systems set a virtual warning line and trigger an alarm when a pedestrian in the video crosses it, which greatly improves the practicality of video surveillance. Different intelligent surveillance systems can also recognize various abnormal behaviors and phenomena, such as objects left behind in public places or objects lingering too long in certain special areas, so that security personnel can be reminded to check the relevant footage before a dangerous event occurs, achieving a degree of prevention.
These intelligent security systems greatly speed up event handling, but they still have shortcomings. In particular, they ignore the possibility that a criminal suspect carries dangerous articles: in most public security incidents the suspect carries crime tools such as guns and knives, which are undoubtedly dangerous articles, and bags, suitcases and other items carried by pedestrians may conceal such tools. Detecting these sensitive articles helps to discover dangerous situations in time: when a potential offender in the footage is holding a dangerous article, it can be found promptly and handled by security personnel, which remedies, to some extent, the shortcomings of existing intelligent security systems. Meanwhile, for personal articles such as bags and suitcases that may conceal crime tools, a warning should be issued only when corresponding suspicious behavior is also present.
Considering the relationships between pedestrians and objects in the surveillance scene therefore gives security personnel richer semantic information from the surveillance video and can further improve the detection of public security events. In summary, sensitive-article detection that fuses visual relationships has strong research significance and application value.
Disclosure of Invention
In view of the above, the invention aims to provide a method for tracking pedestrians and carry-on sensitive articles by fusing visual relationships, which can effectively discover potential threats in a monitoring video and overcome the defects of the existing intelligent security system.
The invention is realized by the following scheme: a method for tracking pedestrians and the sensitive articles they carry that fuses visual relationships, in which sensitive articles are first detected with a RetinaNet target detection model incorporating deformable convolution, visual relationships of the sensitive articles are then detected with a visual relationship detection model that fuses multiple features, and finally the visual relationships are tracked with the Deep Sort multi-target tracking algorithm.
Further, the method specifically comprises the following steps:
step S1: constructing an image set of security sensitive articles;
step S2: performing data enhancement on the image set obtained in step S1 to obtain a data-enhanced image set; using the data-enhanced image set, repeating steps S3-S4 to train the RetinaNet target detection model and the visual relationship detection model; during actual prediction, performing steps S2-S5 with the two trained models;
step S3: detecting sensitive articles with a RetinaNet target detection model that incorporates deformable convolution;
step S4: detecting visual relationships of the sensitive articles with a visual relationship detection model that fuses multiple features;
step S5: performing Deep Sort-based multi-sensitive-article visual relationship tracking.
Further, step S1 specifically includes the following steps:
step S11: analyzing the types of objects that need close attention in security scenes, and listing pedestrians, knives, firearms, suitcases, backpacks, handbags and water bottles as sensitive objects;
step S12: downloading related pictures from the Internet with a crawler;
step S13: screening out pictures that contain knives, firearms, suitcases, backpacks, handbags and water bottles from the open-source COCO dataset, and converting the json-format annotation files into xml format;
step S14: manually drawing a bounding box around every object in the images acquired in steps S12 and S13 with the labeling software labelImg, storing the position and class information of the rectangular boxes in xml files, and taking the labeled image set as the security-sensitive-article image set.
Further, step S2 specifically includes the following steps:
step S21: performing contrast stretching on all pictures in the data set obtained in step S1, leaving the corresponding annotation information in the xml unchanged, and adding the contrast-stretched pictures to a new data set;
step S22: performing multi-scale transformation on all pictures in the data set obtained in step S1, scaling the length and width of each picture to 1/2 and 2 times the original size, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S23: cropping all pictures in the data set obtained in step S1, cutting away 1/10 of each picture at the edges while keeping the center, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S24: adding random noise to all pictures in the data set obtained in step S1, keeping the corresponding annotation information in the xml unchanged, and adding the processed pictures to the new data set;
step S25: merging the new data set with the data obtained in step S1 to obtain the data-enhanced image set.
Further, in step S3, the RetinaNet target detection model with deformable convolution comprises a ResNet-50 residual network, an upsampling-and-add module, a deformable convolution module, a feature pyramid, and a classification sub-network and a regression sub-network; step S3 specifically comprises the following steps:
step S31: using the ResNet-50 residual network as the backbone of the RetinaNet target detection model, and inputting the sensitive-article image into the backbone to extract image features;
step S32: naming the feature maps output by the last 5 convolutional layers of the backbone [C3, C4, C5, C6, C7];
step S33: applying a 1 × 1 convolution to [C3, C4, C5] to change the dimension of the feature maps, the output feature maps being 256-dimensional;
step S34: upsampling C5, adding the upsampled result C5_up to C4 to obtain C4_add, then upsampling C4_add and adding the result to C3 to obtain C3_add;
step S35: applying deformable convolution layers with a 3 × 3 kernel to C5, C4_add and C3_add to further extract features, obtaining [P5, P4, P3], i.e. the bottom three levels of the feature pyramid FPN;
step S36: convolving C6 to obtain P6 and then convolving P6 to obtain P7, which completes the feature pyramid [P3, P4, P5, P6, P7];
step S37: feeding the feature pyramid outputs into the classification sub-network and the regression sub-network, each of which contains four 256-channel 3 × 3 convolution layers, to obtain the class information of the sensitive objects in the input picture and the coordinates of their corresponding detection boxes.
Further, the multi-feature-fusion visual relationship detection model in step S4 comprises a spatial position module, a word embedding module, a local visual appearance feature module, a global context information module and a feature fusion layer; step S4 specifically comprises the following steps:
step S41: let P denote the set of all annotated person-object pairs, where in each pair (s, o) ∈ P, s denotes the subject and o denotes the object; let P(s, o) denote the set of all visual relationships, i.e. the set of predicates, of the pair (s, o); then R = {(s, p, o) | (s, o) ∈ P ∧ p ∈ P(s, o)} denotes all visual-relationship combinations of the sensitive articles in one image, where p is the predicate;
step S42: for all the sensitive articles detected in step S3, calculating their spatial position features relative to one another in the spatial position feature module; the spatial position feature is a scale-invariant four-dimensional vector computed from the quantities x, y, w and h, which denote the abscissa and ordinate of the upper-left corner of an object's detection box and the width and height of the corresponding bounding box, with the subscripts s and o denoting the subject s and the object o, respectively;
step S43: in the word embedding feature module, representing the object categories of the subject s and the object o each as a vector with a word2vec word embedding model, concatenating the two vectors, and passing them through a fully connected layer to obtain the corresponding semantic embedding feature, which represents semantic prior information about the subject s and the object o;
step S44: feeding the sensitive-article image into the ResNet-50 residual network of the RetinaNet target detection model with deformable convolution to extract the feature map of the whole image, i.e. the global feature map, used as global context information;
step S45: in the local visual appearance feature module, for a relation instance (s, p, o), using an ROI Pooling operation to extract the local visual features of the regions occupied by its three elements, where the visual feature of the predicate p is taken from the joint region of the subject s and the object o, and directly concatenating the three extracted region features to obtain the local visual appearance feature of each sensitive article;
step S46: feeding the spatial position feature obtained in step S42, the semantic embedding feature obtained in step S43, the global feature map obtained in step S44 and the local visual appearance feature of each sensitive article obtained in step S45 into the feature fusion layer of the visual relationship detection model; the fusion layer consists of fully connected layers: four fully connected layers change the dimensions of the spatial position, semantic embedding, global and local visual appearance features, their outputs are concatenated and sent into two further fully connected layers that output the predicate and the confidence of each relation instance, and relation instances whose confidence is below a preset value are rejected.
Further, in step S46, the preset value is set to 0.3.
Further, step S5 specifically comprises the following steps:
step S51: after obtaining the sensitive-article detection box information output for the previous frame of the video, performing trajectory prediction for the sensitive articles with Kalman filtering to obtain the corresponding prediction boxes, i.e. the predicted position and size of each target in the next frame;
step S52: calculating the Mahalanobis distance M between the current detection box and the prediction box;
step S53: calculating the cosine distance cos θ of the appearance features between the current detection box and the prediction box, and using the minimum cosine distance to express the similarity between different feature vectors;
step S54: calculating the weighted fusion value Z of the Mahalanobis distance M and the cosine distance cos θ;
step S55: matching the prediction boxes with the Hungarian algorithm, judging through Hungarian matching whether the image is associated with the previous frame while performing Kalman filtering prediction on the image; when the image is associated with the previous frame, judging whether the visual relationship corresponding to the sensitive article has changed, and issuing an alarm if it has.
Further, in step S54, the weighted fusion value Z is calculated as Z = 0.5 × M + 0.5 × cos θ.
Compared with the prior art, the invention has the following beneficial effects: the proposed sensitive-article detection method effectively handles deformation of articles during detection and further improves accuracy while maintaining real-time performance. Compared with traditional visual relationship detection methods, the proposed multi-feature-fusion visual relationship detection for sensitive articles is more robust, and the proposed multi-sensitive-article visual relationship tracking model responds to the potential threat that arises when a visual relationship changes. In summary, the invention can effectively discover potential threats in surveillance video and remedies the shortcomings of existing intelligent security systems.
Drawings
FIG. 1 is a schematic diagram of the method of the embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the deformable convolution RetinaNet target detection model according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a visual relationship detection model according to an embodiment of the invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a method for tracking pedestrians and the sensitive articles they carry that fuses visual relationships: sensitive articles are first detected with a RetinaNet target detection model incorporating deformable convolution, visual relationships of the sensitive articles are then detected with a visual relationship detection model that fuses multiple features, and finally the visual relationships are tracked with the Deep Sort multi-target tracking algorithm.
In this embodiment, the method specifically includes the following steps:
step S1: constructing an image set of security sensitive articles;
step S2: performing data enhancement on the image set obtained in step S1 to obtain a data-enhanced image set; using the data-enhanced image set, repeating steps S3-S4 to train the RetinaNet target detection model and the visual relationship detection model; during actual prediction, performing steps S2-S5 with the two trained models;
step S3: detecting sensitive articles with a RetinaNet target detection model that incorporates deformable convolution;
step S4: detecting visual relationships of the sensitive articles with a visual relationship detection model that fuses multiple features;
step S5: performing Deep Sort-based multi-sensitive-article visual relationship tracking.
In this embodiment, step S1 specifically includes the following steps:
step S11: analyzing the types of objects that need close attention in security scenes, and listing pedestrians, knives, firearms, suitcases, backpacks, handbags and water bottles as sensitive objects;
step S12: downloading related pictures from the Internet with a crawler, for example from Taobao and Baidu image search;
step S13: screening out pictures that contain knives, firearms, suitcases, backpacks, handbags and water bottles from the open-source COCO dataset, and converting the json-format annotation files into xml format (a conversion sketch follows this list);
step S14: manually drawing a bounding box around every object in the images acquired in steps S12 and S13 with the labeling software labelImg, storing the position and class information of the rectangular boxes in xml files, and taking the labeled image set as the security-sensitive-article image set.
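As an illustration of step S13, the following minimal sketch converts COCO-style json annotations of the selected categories into Pascal VOC-style xml files of the kind produced by labelImg. The category names, file paths and the decision to skip images without sensitive articles are assumptions of this sketch, not requirements of the patent (COCO has no firearm category, so only the classes present in COCO are filtered here).

```python
import json
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

# COCO class names closest to the sensitive articles of step S11 (assumed mapping)
SENSITIVE = {"knife", "suitcase", "backpack", "handbag", "bottle"}

def coco_to_voc(coco_json_path, out_dir):
    with open(coco_json_path) as f:
        coco = json.load(f)
    names = {c["id"]: c["name"] for c in coco["categories"]}
    images = {im["id"]: im for im in coco["images"]}
    boxes = defaultdict(list)
    for ann in coco["annotations"]:
        if names[ann["category_id"]] in SENSITIVE:
            boxes[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))
    for img_id, anns in boxes.items():
        im = images[img_id]
        root = ET.Element("annotation")
        ET.SubElement(root, "filename").text = im["file_name"]
        size = ET.SubElement(root, "size")
        ET.SubElement(size, "width").text = str(im["width"])
        ET.SubElement(size, "height").text = str(im["height"])
        for name, (x, y, w, h) in anns:            # COCO bbox is [x, y, width, height]
            obj = ET.SubElement(root, "object")
            ET.SubElement(obj, "name").text = name
            bnd = ET.SubElement(obj, "bndbox")
            ET.SubElement(bnd, "xmin").text = str(int(x))
            ET.SubElement(bnd, "ymin").text = str(int(y))
            ET.SubElement(bnd, "xmax").text = str(int(x + w))
            ET.SubElement(bnd, "ymax").text = str(int(y + h))
        stem = os.path.splitext(im["file_name"])[0]
        ET.ElementTree(root).write(os.path.join(out_dir, stem + ".xml"))
```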
In this embodiment, step S2 specifically includes the following steps:
step S21: performing contrast stretching on all pictures in the data set obtained in step S1, leaving the corresponding annotation information in the xml unchanged, and adding the contrast-stretched pictures to a new data set;
step S22: performing multi-scale transformation on all pictures in the data set obtained in step S1, scaling the length and width of each picture to 1/2 and 2 times the original size, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S23: cropping all pictures in the data set obtained in step S1, cutting away 1/10 of each picture at the edges while keeping the center, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S24: adding random noise to all pictures in the data set obtained in step S1, keeping the corresponding annotation information in the xml unchanged, and adding the processed pictures to the new data set;
step S25: merging the new data set with the data obtained in step S1 to obtain the data-enhanced image set (a code sketch of these augmentations follows this list).
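A minimal sketch of the augmentations in steps S21-S24, using NumPy and OpenCV. Bounding boxes are assumed to be float arrays of (xmin, ymin, xmax, ymax); the percentile limits of the contrast stretch, the noise level, and the interpretation of the 1/10 crop as a border removed on every side are illustrative assumptions, since the patent does not fix them.

```python
import cv2
import numpy as np

def contrast_stretch(img):
    """S21: linear contrast stretch; annotation boxes stay unchanged."""
    lo, hi = np.percentile(img, (2, 98))                        # stretch limits are assumed
    out = (img.astype(np.float32) - lo) * 255.0 / max(hi - lo, 1e-6)
    return np.clip(out, 0, 255).astype(np.uint8)

def rescale(img, boxes, factor):
    """S22: multi-scale transform with factor 0.5 or 2.0; boxes scale with the image."""
    h, w = img.shape[:2]
    out = cv2.resize(img, (int(w * factor), int(h * factor)))
    return out, boxes * factor

def center_crop(img, boxes):
    """S23: remove a 1/10 border, keep the centre, shift and clip the boxes."""
    h, w = img.shape[:2]
    dx, dy = w // 10, h // 10
    out = img[dy:h - dy, dx:w - dx]
    shifted = boxes - np.array([dx, dy, dx, dy], dtype=boxes.dtype)
    shifted[:, 0::2] = shifted[:, 0::2].clip(0, out.shape[1] - 1)   # x coordinates
    shifted[:, 1::2] = shifted[:, 1::2].clip(0, out.shape[0] - 1)   # y coordinates
    return out, shifted

def add_noise(img, sigma=8.0):
    """S24: additive Gaussian noise; annotation boxes stay unchanged."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```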
In this embodiment, in step S3, the RetinaNet target detection model combined with deformable convolution comprises a ResNet-50 residual network, an upsampling-and-add module, a deformable convolution module, a feature pyramid, and a classification sub-network and a regression sub-network, as shown in fig. 2; step S3 specifically includes the following steps:
step S31: using the ResNet-50 residual network as the backbone of the RetinaNet target detection model, and inputting the sensitive-article image into the backbone to extract image features;
step S32: naming the feature maps output by the last 5 convolutional layers of the backbone [C3, C4, C5, C6, C7];
step S33: applying a 1 × 1 convolution to [C3, C4, C5] to change the dimension of the feature maps, the output feature maps being 256-dimensional;
step S34: upsampling C5, adding the upsampled result C5_up to C4 to obtain C4_add, then upsampling C4_add and adding the result to C3 to obtain C3_add;
step S35: applying deformable convolution layers with a 3 × 3 kernel to C5, C4_add and C3_add to further extract features, obtaining [P5, P4, P3], i.e. the bottom three levels of the feature pyramid FPN (see the sketch following this list);
step S36: convolving C6 to obtain P6 and then convolving P6 to obtain P7, which completes the feature pyramid [P3, P4, P5, P6, P7];
step S37: feeding the feature pyramid outputs into the classification sub-network and the regression sub-network, each of which contains four 256-channel 3 × 3 convolution layers, to obtain the class information of the sensitive objects in the input picture and the coordinates of their corresponding detection boxes.
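The following PyTorch sketch illustrates steps S33-S36: 1 × 1 lateral convolutions, the top-down upsample-and-add path, and 3 × 3 deformable convolutions producing P3-P5, with P6 and P7 obtained by further convolutions. It relies on torchvision.ops.DeformConv2d, which needs a companion convolution that predicts its sampling offsets; that offset branch, the channel widths other than the stated 256, and the kernels and strides of the P6/P7 convolutions are implementation assumptions, and the classification and regression sub-networks of step S37 are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformFPN(nn.Module):
    """Top-down feature pyramid with deformable 3x3 convolutions (steps S33-S36)."""
    def __init__(self, c3_ch, c4_ch, c5_ch, c6_ch, out_ch=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)                  # S33: 1x1 convs to 256-d
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        # a 3x3 deformable kernel needs a 2*3*3 = 18-channel offset map
        self.offsets = nn.ModuleList([nn.Conv2d(out_ch, 18, 3, padding=1) for _ in range(3)])
        self.deforms = nn.ModuleList([DeformConv2d(out_ch, out_ch, 3, padding=1) for _ in range(3)])
        self.p6_conv = nn.Conv2d(c6_ch, out_ch, 3, padding=1)    # S36: P6 from C6
        self.p7_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # S36: P7 from P6

    def forward(self, c3, c4, c5, c6):
        c3, c4, c5 = self.lat3(c3), self.lat4(c4), self.lat5(c5)
        c4_add = c4 + F.interpolate(c5, size=c4.shape[-2:], mode="nearest")    # S34
        c3_add = c3 + F.interpolate(c4_add, size=c3.shape[-2:], mode="nearest")
        pyramid = [dcn(f, off(f))                                              # S35: P3, P4, P5
                   for f, off, dcn in zip((c3_add, c4_add, c5), self.offsets, self.deforms)]
        p6 = self.p6_conv(c6)
        return pyramid + [p6, self.p7_conv(p6)]                                # [P3, P4, P5, P6, P7]
```

In torchvision the offset tensor must carry two values per kernel sample, which is why the offset convolutions output 18 channels for a 3 × 3 kernel.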
In this embodiment, the multi-feature-fusion visual relationship detection model of step S4 comprises a spatial position module, a word embedding module, a local visual appearance feature module, a global context information module and a feature fusion layer, as shown in fig. 3; step S4 specifically includes the following steps:
step S41: let P denote the set of all annotated person-object pairs, where in each pair (s, o) ∈ P, s denotes the subject and o denotes the object; let P(s, o) denote the set of all visual relationships, i.e. the set of predicates, of the pair (s, o); then R = {(s, p, o) | (s, o) ∈ P ∧ p ∈ P(s, o)} denotes all visual-relationship combinations of the sensitive articles in one image, where p is the predicate;
step S42: for all the sensitive articles detected in step S3, calculating their spatial position features relative to one another in the spatial position feature module; the spatial position feature is a scale-invariant four-dimensional vector computed from the quantities x, y, w and h, which denote the abscissa and ordinate of the upper-left corner of an object's detection box and the width and height of the corresponding bounding box, with the subscripts s and o denoting the subject s and the object o, respectively;
step S43: in the word embedding feature module, representing the object categories of the subject s and the object o each as a vector with a word2vec word embedding model, concatenating the two vectors, and passing them through a fully connected layer to obtain the corresponding semantic embedding feature, which represents semantic prior information about the subject s and the object o;
step S44: feeding the sensitive-article image into the ResNet-50 residual network of the RetinaNet target detection model with deformable convolution to extract the feature map of the whole image, i.e. the global feature map, used as global context information;
step S45: in the local visual appearance feature module, for a relation instance (s, p, o), using an ROI Pooling operation to extract the local visual features of the regions occupied by its three elements, where the visual feature of the predicate p is taken from the joint region of the subject s and the object o, and directly concatenating the three extracted region features to obtain the local visual appearance feature of each sensitive article (see the sketch after this list);
step S46: feeding the spatial position feature obtained in step S42, the semantic embedding feature obtained in step S43, the global feature map obtained in step S44 and the local visual appearance feature of each sensitive article obtained in step S45 into the feature fusion layer of the visual relationship detection model; the feature fusion layer consists of fully connected layers: four fully connected layers change the dimensions of the spatial position, semantic embedding, global and local visual appearance features, their outputs are concatenated and sent into two further fully connected layers that output the predicate and the confidence of each relation instance, and relation instances whose confidence is below a preset value are rejected.
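A sketch of the local visual appearance features of step S45 (referenced above): for one relation instance (s, p, o) the subject box, the union (joint) box standing for the predicate region, and the object box are pooled from the global feature map with torchvision's roi_align and concatenated. The pooled output size, the use of roi_align rather than another ROI pooling variant, and the box format are assumptions of this illustration; the spatial position feature of step S42 is also shown in one common scale-invariant formulation, which is likewise an assumption because the patent gives its exact formula only as an image.

```python
import torch
from torchvision.ops import roi_align

def union_box(bs, bo):
    """Joint region of subject and object boxes (x1, y1, x2, y2): the predicate region."""
    return torch.cat([torch.minimum(bs[:2], bo[:2]), torch.maximum(bs[2:], bo[2:])])

def spatial_feature(bs, bo):
    """One common scale-invariant 4-d relative position feature (assumed formulation)."""
    xs, ys, ws, hs = bs[0], bs[1], bs[2] - bs[0], bs[3] - bs[1]
    xo, yo, wo, ho = bo[0], bo[1], bo[2] - bo[0], bo[3] - bo[1]
    return torch.stack([(xs - xo) / wo, (ys - yo) / ho,
                        torch.log(ws / wo), torch.log(hs / ho)])

def local_appearance(feat_map, box_s, box_o, spatial_scale, out_size=7):
    """feat_map: (1, C, H, W) backbone feature map; boxes in image coordinates."""
    box_p = union_box(box_s, box_o)
    rois = torch.stack([box_s, box_p, box_o]).float()
    rois = torch.cat([torch.zeros(3, 1), rois], dim=1)          # prepend batch index 0
    pooled = roi_align(feat_map, rois, output_size=out_size,
                       spatial_scale=spatial_scale)             # (3, C, out_size, out_size)
    return pooled.flatten(1).reshape(1, -1)                     # concatenated s, p, o features
```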
In this embodiment, in step S46, the preset value is set to 0.3.
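A sketch of the feature fusion layer of step S46 together with the 0.3 confidence threshold: four fully connected branches bring the spatial position, semantic embedding, global and local appearance features to a common width, the concatenation passes through a further fully connected layer, and two heads output predicate scores and a confidence used to reject weak relation instances. All hidden widths and the use of a sigmoid confidence head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationFusion(nn.Module):
    def __init__(self, dims, n_predicates, hidden=256):
        """dims = (spatial_dim, semantic_dim, global_dim, local_dim); widths are assumed."""
        super().__init__()
        self.branches = nn.ModuleList([nn.Linear(d, hidden) for d in dims])   # 4 FC layers
        self.fuse = nn.Linear(4 * hidden, hidden)
        self.predicate_head = nn.Linear(hidden, n_predicates)
        self.confidence_head = nn.Linear(hidden, 1)

    def forward(self, spatial, semantic, global_feat, local_feat):
        parts = [torch.relu(fc(x)) for fc, x in
                 zip(self.branches, (spatial, semantic, global_feat, local_feat))]
        h = torch.relu(self.fuse(torch.cat(parts, dim=1)))
        return self.predicate_head(h), torch.sigmoid(self.confidence_head(h))

# keep only relation instances whose confidence reaches the preset value of step S46:
# logits, conf = model(sp, sem, glb, loc); kept = conf.squeeze(1) >= 0.3
```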
In this embodiment, step S5 specifically includes the following steps:
step S51: after obtaining the sensitive-article detection box information output for the previous frame of the video, performing trajectory prediction for the sensitive articles with Kalman filtering to obtain the corresponding prediction boxes, i.e. the predicted position and size of each target in the next frame (see the Kalman filter sketch after this list);
step S52: calculating the Mahalanobis distance M between the current detection box and the prediction box; the Mahalanobis distance M is calculated as:
M = √( (e − μ)^T Σ^(−1) (e − μ) )
where Σ is the covariance matrix and μ is the mean of the random variables e;
step S53: calculating the cosine distance cos θ of the appearance features between the current detection box and the prediction box, and using the minimum cosine distance to express the similarity between different feature vectors;
specifically, let the two vectors be A = [a1, a2, a3, …, an] and B = [b1, b2, b3, …, bn]; the cosine distance between them can be represented by the cosine of the angle between them, computed as:
cos θ = (A · B) / (‖A‖ ‖B‖) = Σ(ai · bi) / ( √(Σ ai²) · √(Σ bi²) )
the minimum cosine distance is then used to express the similarity d2(i, j) between different feature vectors, as follows:
d2(i, j) = min{ 1 − rj^T ri }
where d2(i, j) is the appearance match degree between the j-th detection box and the i-th track, rj and ri are normalized feature vectors extracted by the deep learning network, rj^T ri is the cosine similarity of the two vectors, and the superscript T denotes the transpose of a vector;
step S54: calculating the weighted fusion value Z of the Mahalanobis distance M and the cosine distance cos θ;
step S55: matching the prediction boxes with the Hungarian algorithm, judging through Hungarian matching whether the image is associated with the previous frame while performing Kalman filtering prediction on the image; when the image is associated with the previous frame, judging whether the visual relationship corresponding to the sensitive article has changed, and issuing an alarm if it has.
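A sketch of the Kalman filtering of step S51 (referenced above), using a constant-velocity model over the box state (x, y, w, h) and its velocities and built on the filterpy library. The state layout, the noise settings and the choice of filterpy are assumptions of this illustration.

```python
import numpy as np
from filterpy.kalman import KalmanFilter

def make_box_filter(box):
    """box = (x, y, w, h) of the latest sensitive-article detection."""
    kf = KalmanFilter(dim_x=8, dim_z=4)
    kf.F = np.eye(8)
    kf.F[:4, 4:] = np.eye(4)          # position and size advance by their velocities
    kf.H = np.eye(4, 8)               # only (x, y, w, h) is observed
    kf.R *= 10.0                      # measurement noise (assumed value)
    kf.Q *= 0.01                      # process noise (assumed value)
    kf.x[:4, 0] = box
    return kf

# per frame: kf.predict() yields the prediction box used in steps S52-S55;
# kf.update(detected_box) corrects the track once a detection has been associated with it.
```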
In the present embodiment, in step S54, the weighted fusion value Z is calculated as Z = 0.5 × M + 0.5 × cos θ.
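A sketch of the motion and appearance terms of steps S52-S54 and their weighted fusion. The text writes Z = 0.5 × M + 0.5 × cos θ; here the minimum cosine distance d2 over normalized appearance features stands in for the appearance term, and the per-track feature gallery is an assumption of this illustration.

```python
import numpy as np

def mahalanobis(e, mu, cov):
    """Step S52: distance between a detection e and the Kalman prediction (mean mu, covariance cov)."""
    d = e - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def min_cosine_distance(det_feat, track_feats):
    """Step S53: det_feat (D,), track_feats (K, D), all L2-normalized; d2 = min{1 - rj^T ri}."""
    return float(np.min(1.0 - track_feats @ det_feat))

def fused_cost(e, mu, cov, det_feat, track_feats):
    """Step S54: equal-weight fusion of the motion and appearance terms."""
    return 0.5 * mahalanobis(e, mu, cov) + 0.5 * min_cosine_distance(det_feat, track_feats)
```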
In this embodiment, in step S55, the Hungarian algorithm is used to match the prediction boxes. The basic flow of the algorithm is as follows: create a target matching matrix C(L×J); subtract from every element of each row the minimum element of that row, so that each row contains at least one zero, and call the resulting matrix C1; then subtract from every element of each column of C1 the minimum element of that column, so that each column contains at least one zero, and call the resulting matrix C2. Next, cover all zero elements of C2 with as few straight lines as possible, and let the number of lines be m. If m equals min{L, J}, start assigning from the rows and columns with the fewest zero elements until the assignment is complete, which yields the optimal matching scheme. Otherwise, find the elements of C2 not covered by any line, denote the minimum of these elements by mi, subtract mi from every row that contains uncovered elements, add mi to every column covered by a line, and then repeat the line-covering operation on the zero elements of C2.
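The row- and column-reduction procedure described above is the classical Hungarian algorithm; in practice the same optimal matching can be obtained with an off-the-shelf solver. A minimal sketch using SciPy, where the gating threshold is an assumption of this illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(cost):
    """cost[i, j] = fused value Z between track i and detection j."""
    rows, cols = linear_sum_assignment(cost)      # optimal assignment, equivalent to Hungarian
    max_cost = 1.0                                # gating threshold (assumed)
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
    unmatched_tracks = set(range(cost.shape[0])) - {i for i, _ in matches}
    unmatched_dets = set(range(cost.shape[1])) - {j for _, j in matches}
    return matches, unmatched_tracks, unmatched_dets
```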
In summary, this embodiment adopts a pedestrian and carry-on sensitive article tracking method that fuses visual relationships, addressing the problem that most security systems on the market pay no attention to sensitive articles and ignore the interaction between people and articles. For an input image, sensitive article detection is performed with Deform-RetinaNet; the detection algorithm uses Focal Loss as its loss function, combines ResNet-50 as the feature extraction network, introduces deformable convolution, and obtains the object coordinates and classification information through two separate fully convolutional sub-networks. For a detected visual relationship subject (pedestrian) and object (sensitive article), the union is the joint region of the subject and the object and represents the region of the predicate in the visual relationship. In the spatial position feature module, the relative position feature of the subject and the object is calculated; it is a four-dimensional vector and is scale invariant. In the word embedding feature module, word2vec is used as an off-the-shelf language prior module to obtain the semantic embedding features of the words. In parallel, the original image is fed into ResNet-50 to extract features, the returned global feature map is used as the global context information of the network, and ROI Pooling is further used to obtain the features at the positions of the objects in the feature map, i.e. the local appearance features of the subject, the predicate and the object, which serve as the local visual appearance features. The spatial position features, word embedding features, local visual appearance features and global context information are fed together into the fusion layer of the network, which outputs the visual relationships in the image. The association matrix for tracking is then obtained by weighted fusion of the cosine distance and the Mahalanobis distance computed on the feature vectors; Hungarian matching judges whether the image is associated with the previous frame, while Kalman filtering prediction is performed on the image. When the image is associated with the previous frame, whether the visual relationship corresponding to the sensitive article has changed is judged, and an alarm is issued if it has.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (9)

1. A method for tracking pedestrians and the sensitive articles they carry that fuses visual relationships, characterized in that sensitive articles are first detected with a RetinaNet target detection model incorporating deformable convolution, visual relationships of the sensitive articles are then detected with a visual relationship detection model that fuses multiple features, and finally the visual relationships are tracked with the Deep Sort multi-target tracking algorithm.
2. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 1, comprising the following steps:
step S1: constructing an image set of security sensitive articles;
step S2: performing data enhancement on the image set obtained in step S1 to obtain a data-enhanced image set; using the data-enhanced image set, repeating steps S3-S4 to train the RetinaNet target detection model and the visual relationship detection model; during actual prediction, performing steps S2-S5 with the two trained models;
step S3: detecting sensitive articles with a RetinaNet target detection model that incorporates deformable convolution;
step S4: detecting visual relationships of the sensitive articles with a visual relationship detection model that fuses multiple features;
step S5: performing Deep Sort-based multi-sensitive-article visual relationship tracking.
3. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 2, wherein step S1 comprises the following steps:
step S11: analyzing the types of objects that need close attention in security scenes, and listing pedestrians, knives, firearms, suitcases, backpacks, handbags and water bottles as sensitive objects;
step S12: downloading related pictures from the Internet with a crawler;
step S13: screening out pictures that contain knives, firearms, suitcases, backpacks, handbags and water bottles from the open-source COCO dataset, and converting the json-format annotation files into xml format;
step S14: manually drawing a bounding box around every object in the images acquired in steps S12 and S13 with the labeling software labelImg, storing the position and class information of the rectangular boxes in xml files, and taking the labeled image set as the security-sensitive-article image set.
4. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 2, wherein step S2 comprises the following steps:
step S21: performing contrast stretching on all pictures in the data set obtained in step S1, leaving the corresponding annotation information in the xml unchanged, and adding the contrast-stretched pictures to a new data set;
step S22: performing multi-scale transformation on all pictures in the data set obtained in step S1, scaling the length and width of each picture to 1/2 and 2 times the original size, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S23: cropping all pictures in the data set obtained in step S1, cutting away 1/10 of each picture at the edges while keeping the center, applying the corresponding coordinate transformation to the annotation information in the xml, and adding the processed pictures to the new data set;
step S24: adding random noise to all pictures in the data set obtained in step S1, keeping the corresponding annotation information in the xml unchanged, and adding the processed pictures to the new data set;
step S25: merging the new data set with the data obtained in step S1 to obtain the data-enhanced image set.
5. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 2, wherein in step S3 the RetinaNet target detection model combined with deformable convolution comprises a ResNet-50 residual network, an upsampling-and-add module, a deformable convolution module, a feature pyramid, and a classification sub-network and a regression sub-network; step S3 specifically comprises the following steps:
step S31: using the ResNet-50 residual network as the backbone of the RetinaNet target detection model, and inputting the sensitive-article image into the backbone to extract image features;
step S32: naming the feature maps output by the last 5 convolutional layers of the backbone [C3, C4, C5, C6, C7];
step S33: applying a 1 × 1 convolution to [C3, C4, C5] to change the dimension of the feature maps, the output feature maps being 256-dimensional;
step S34: upsampling C5, adding the upsampled result C5_up to C4 to obtain C4_add, then upsampling C4_add and adding the result to C3 to obtain C3_add;
step S35: applying deformable convolution layers with a 3 × 3 kernel to C5, C4_add and C3_add to further extract features, obtaining [P5, P4, P3], i.e. the bottom three levels of the feature pyramid FPN;
step S36: convolving C6 to obtain P6 and then convolving P6 to obtain P7, which completes the feature pyramid [P3, P4, P5, P6, P7];
step S37: feeding the feature pyramid outputs into the classification sub-network and the regression sub-network, each of which contains four 256-channel 3 × 3 convolution layers, to obtain the class information of the sensitive objects in the input picture and the coordinates of their corresponding detection boxes.
6. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 2, wherein the multi-feature-fusion visual relationship detection model in step S4 comprises a spatial position module, a word embedding module, a local visual appearance feature module, a global context information module and a feature fusion layer; step S4 specifically comprises the following steps:
step S41: let P denote the set of all annotated person-object pairs, where in each pair (s, o) ∈ P, s denotes the subject and o denotes the object; let P(s, o) denote the set of all visual relationships, i.e. the set of predicates, of the pair (s, o); then R = {(s, p, o) | (s, o) ∈ P ∧ p ∈ P(s, o)} denotes all visual-relationship combinations of the sensitive articles in one image, where p is the predicate;
step S42: for all the sensitive articles detected in step S3, calculating their spatial position features relative to one another in the spatial position feature module; the spatial position feature is a scale-invariant four-dimensional vector computed from the quantities x, y, w and h, which denote the abscissa and ordinate of the upper-left corner of an object's detection box and the width and height of the corresponding bounding box, with the subscripts s and o denoting the subject s and the object o, respectively;
step S43: in the word embedding feature module, representing the object categories of the subject s and the object o each as a vector with a word2vec word embedding model, concatenating the two vectors, and passing them through a fully connected layer to obtain the corresponding semantic embedding feature, which represents semantic prior information about the subject s and the object o;
step S44: feeding the sensitive-article image into the ResNet-50 residual network of the RetinaNet target detection model with deformable convolution to extract the feature map of the whole image, i.e. the global feature map, used as global context information;
step S45: in the local visual appearance feature module, for a relation instance (s, p, o), using an ROI Pooling operation to extract the local visual features of the regions occupied by its three elements, where the visual feature of the predicate p is taken from the joint region of the subject s and the object o, and directly concatenating the three extracted region features to obtain the local visual appearance feature of each sensitive article;
step S46: feeding the spatial position feature obtained in step S42, the semantic embedding feature obtained in step S43, the global feature map obtained in step S44 and the local visual appearance feature of each sensitive article obtained in step S45 into the feature fusion layer of the visual relationship detection model; the feature fusion layer consists of fully connected layers: four fully connected layers change the dimensions of the spatial position, semantic embedding, global and local visual appearance features, their outputs are concatenated and sent into two further fully connected layers that output the predicate and the confidence of each relation instance, and relation instances whose confidence is below a preset value are rejected.
7. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 6, wherein in step S46 the preset value is set to 0.3.
8. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 2, wherein step S5 comprises the following steps:
step S51: after obtaining the sensitive-article detection box information output for the previous frame of the video, performing trajectory prediction for the sensitive articles with Kalman filtering to obtain the corresponding prediction boxes, i.e. the predicted position and size of each target in the next frame;
step S52: calculating the Mahalanobis distance M between the current detection box and the prediction box;
step S53: calculating the cosine distance cos θ of the appearance features between the current detection box and the prediction box, and using the minimum cosine distance to express the similarity between different feature vectors;
step S54: calculating the weighted fusion value Z of the Mahalanobis distance M and the cosine distance cos θ;
step S55: matching the prediction boxes with the Hungarian algorithm, judging through Hungarian matching whether the image is associated with the previous frame while performing Kalman filtering prediction on the image; when the image is associated with the previous frame, judging whether the visual relationship corresponding to the sensitive article has changed, and issuing an alarm if it has.
9. The method for tracking pedestrians and personal sensitive articles fusing visual relationships according to claim 8, wherein in step S54 the weighted fusion value Z is calculated as Z = 0.5 × M + 0.5 × cos θ.
CN202010121414.5A 2020-02-26 2020-02-26 Pedestrian and personal sensitive article tracking method fusing visual relationship Active CN111325279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121414.5A CN111325279B (en) 2020-02-26 2020-02-26 Pedestrian and personal sensitive article tracking method fusing visual relationship

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121414.5A CN111325279B (en) 2020-02-26 2020-02-26 Pedestrian and personal sensitive article tracking method fusing visual relationship

Publications (2)

Publication Number Publication Date
CN111325279A (en) 2020-06-23
CN111325279B CN111325279B (en) 2022-06-10

Family

ID=71169130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121414.5A Active CN111325279B (en) 2020-02-26 2020-02-26 Pedestrian and personal sensitive article tracking method fusing visual relationship

Country Status (1)

Country Link
CN (1) CN111325279B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180153A1 (en) * 2015-08-14 2019-06-13 Elucid Bioimaging Inc. Methods and systems for utilizing quantitative imaging
CN109829429A (en) * 2019-01-31 2019-05-31 福州大学 Security protection sensitive articles detection method under monitoring scene based on YOLOv3
CN109829436A (en) * 2019-02-02 2019-05-31 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
CN110544268A (en) * 2019-07-29 2019-12-06 燕山大学 Multi-target tracking method based on structured light and SiamMask network
CN110598592A (en) * 2019-08-29 2019-12-20 南京理工大学 Intelligent real-time video monitoring method suitable for nursing places
CN110826406A (en) * 2019-10-08 2020-02-21 赵奕焜 Child high-altitude protection method based on deep learning model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Séverine Dubuisson et al., "Facial expression recognition by combination of classifiers", 2002 11th European Signal Processing Conference. *
Yude Shi et al., "Improved Quality Aware Network Using Quality and Max Pooling Based Late Feature Fusion for Video-Based Person Re-Identification", 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT). *
Yuzhong Chen et al., "Pyramid Context Contrast for Semantic Segmentation", IEEE Access, vol. 7. *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753805A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Method and device for detecting wearing of safety helmet
CN111753805B (en) * 2020-07-08 2024-06-07 深延科技(北京)有限公司 Method and device for detecting wearing of safety helmet
CN111882581A (en) * 2020-07-21 2020-11-03 青岛科技大学 Multi-target tracking method for depth feature association
CN111882581B (en) * 2020-07-21 2022-10-28 青岛科技大学 Multi-target tracking method for depth feature association
CN111783475A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on phrase relation propagation
CN111783475B (en) * 2020-07-28 2021-05-11 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on phrase relation propagation
CN111862171B (en) * 2020-08-04 2021-04-13 万申(北京)科技有限公司 CBCT and laser scanning point cloud data tooth registration method based on multi-view fusion
CN111862171A (en) * 2020-08-04 2020-10-30 万申(北京)科技有限公司 CBCT and laser scanning point cloud data tooth registration method based on multi-view fusion
CN111985505A (en) * 2020-08-21 2020-11-24 南京大学 Interest visual relationship detection method and device based on interest propagation network
CN111985505B (en) * 2020-08-21 2024-02-13 南京大学 Interest visual relation detection method and device based on interest propagation network
CN112381092A (en) * 2020-11-20 2021-02-19 深圳力维智联技术有限公司 Tracking method, device and computer readable storage medium
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video
CN112528925A (en) * 2020-12-21 2021-03-19 深圳云天励飞技术股份有限公司 Pedestrian tracking and image matching method and related equipment
CN112528925B (en) * 2020-12-21 2024-05-07 深圳云天励飞技术股份有限公司 Pedestrian tracking and image matching method and related equipment
CN112633162A (en) * 2020-12-22 2021-04-09 重庆大学 Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition
CN112633162B (en) * 2020-12-22 2024-03-22 重庆大学 Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN113327239A (en) * 2021-06-10 2021-08-31 温州大学 Small sample target detection method for attention-enhancing area generation network
CN116863728A (en) * 2023-07-21 2023-10-10 重庆交通大学 Signal timing method and system based on pedestrian pace classification

Also Published As

Publication number Publication date
CN111325279B (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN111325279B (en) Pedestrian and personal sensitive article tracking method fusing visual relationship
Liu et al. Intelligent video systems and analytics: A survey
Singh et al. Visual big data analytics for traffic monitoring in smart city
Gawande et al. Pedestrian detection and tracking in video surveillance system: issues, comprehensive review, and challenges
Nam et al. Intelligent video surveillance system: 3-tier context-aware surveillance system with metadata
Roman et al. Violence detection and localization in surveillance video
Guillermo et al. Detection and classification of public security threats in the philippines using neural networks
Manikandan et al. A neural network aided attuned scheme for gun detection in video surveillance images
Martínez-Mascorro et al. Suspicious behavior detection on shoplifting cases for crime prevention by using 3D convolutional neural networks
CN114140745A (en) Method, system, device and medium for detecting personnel attributes of construction site
CN115294519A (en) Abnormal event detection and early warning method based on lightweight network
Miao et al. Abnormal behavior learning based on edge computing toward a crowd monitoring system
Biswas et al. State-of-the-art violence detection techniques: a review
Jang et al. Detection of dangerous situations using deep learning model with relational inference
KR101547255B1 (en) Object-based Searching Method for Intelligent Surveillance System
Arshad et al. Anomalous situations recognition in surveillance images using deep learning
Aqeel et al. Detection of anomaly in videos using convolutional autoencoder and generative adversarial network model
Rachna et al. Real-time violence detection using deep neural networks and DTW
Damera et al. Normalized Attention Neural Network with Adaptive Feature Recalibration for Detecting the Unusual Activities Using Video Surveillance Camera.
Nijim Multitasking intelligent surveillance and first response system
Siddiqui et al. IoT based Human Activity Recognition using Deep learning
Vaishnavi et al. Implementation of Abnormal Event Detection using Automated Surveillance System
Nyajowi et al. CNN real-time detection of vandalism using a hybrid-LSTM deep learning neural networks
Cohen et al. Behavior recognition architecture for surveillance applications
Akole et al. Real time cctv violence detection system using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant