
CN108921083B - Illegal mobile vendor identification method based on deep learning target detection

Info

Publication number
CN108921083B
Authority
CN
China
Prior art keywords
booth
count
pedestrian
pedestrians
booths
Prior art date
Legal status
Active
Application number
CN201810688380.0A
Other languages
Chinese (zh)
Other versions
CN108921083A (en)
Inventor
陈晋音
龚鑫
方航
俞露
王诗铭
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810688380.0A
Publication of CN108921083A
Application granted
Publication of CN108921083B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention aims to provide an illegal mobile vendor identification method based on deep learning target detection, which comprises the following steps: acquiring a road monitoring video and cutting it into frame images; detecting the positions of booths and pedestrians in the frame images with a target detection model; filtering out moving booths according to the booth positions and keeping the fixed booths; clustering the pedestrians with the K-means method, based on the positions and number of the fixed booths, to obtain the pedestrians corresponding to each fixed booth; using a pedestrian recognition model and a booth recognition model to distinguish individual pedestrians and booths; and judging whether the pedestrians classified to the same fixed booth are vendors. The method enables automatic evidence collection against illegal mobile vendors within the road monitoring coverage, effectively improves the efficiency of the urban management department, and reduces labor cost.

Description

Illegal mobile vendor identification method based on deep learning target detection
Technical Field
The invention belongs to the field of intelligent city management applications, and particularly relates to an illegal mobile vendor identification method based on deep learning target detection.
Background
A mobile vendor is a merchant who sells goods in a city in an itinerant fashion, without a fixed place of business. Most mobile vendors hold no operating license, and the quality of the goods they sell cannot be guaranteed. Moreover, mobile stalls often roast and fry food over open flames and generate a large amount of waste, which mars the appearance of the city and causes pollution. The goods sold are typically breakfast items, cooked food, fruit and other foodstuffs; if their sanitary conditions and quality are not assured, they can pose a health hazard.
Mobile vendors have therefore become one of the main targets of urban management enforcement. Because they are highly mobile and range widely, the relevant departments find them difficult to manage. With the rapid development of artificial intelligence, the related techniques can be used to identify mobile vendors and thereby achieve automatic snapshot evidence collection. An illegal mobile vendor identification system based on deep learning can automatically detect whether mobile vendors are present in a surveillance camera's view, saving manpower for the urban management department and improving city management efficiency.
Identifying illegal mobile vendors requires detecting pedestrians and booths in images and, from their relative positions and movement trajectories, inferring which pedestrians are mobile vendors before capturing evidence. This calls for a target detection method that locates and identifies objects of interest in an image. Current mainstream target detection methods are based on deep learning and include Faster R-CNN, YOLO and SSD.
A prior disclosure describes a deep-learning-based method and system for fast retrieval of vehicles in checkpoint images. It extracts vehicle feature information with a deep neural network, using an inception_resnet_v2 network so that network weights are shared and a large amount of repeated computation is avoided; its loss function is trained on triplet samples and directly produces 128-dimensional vectors; and in the retrieval stage it builds indexes over the features by feature clustering, which improves query speed. That method speeds up image feature extraction, responds quickly in real time, and effectively screens illegal vehicles such as those with cloned or fake plates.
Disclosure of Invention
The invention aims to provide an illegal mobile vendor identification method based on deep learning target detection, so as to automatically collect evidence against illegal mobile vendors within the road monitoring coverage, effectively improve the efficiency of the urban management department, and reduce labor cost.
An illegal mobile vendor identification method based on deep learning target detection comprises the following steps:
(1) acquiring a road monitoring video, and cutting the road monitoring video into frame images;
(2) detecting the positions of the booth and the pedestrians from the frame image by using the target detection model;
(3) filtering the moving booth in the image according to the position of the booth, and keeping a fixed booth;
(4) based on the positions and the number of the fixed booths, clustering the pedestrians by using a K-means clustering method to obtain the pedestrians corresponding to each fixed booth;
(5) distinguishing whether pedestrians or booths in different frame images are the same pedestrian or booths by utilizing a pedestrian recognition model and a booth recognition model;
(6) judging whether the pedestrians classified to the same fixed booth are vendors or not;
the target detection model is obtained by training a learning network consisting of an increment Resnet v2 network and a Faster R-CNN network; the pedestrian identification model and the vendor identification model are obtained by network training of the inclusion Resnet v 2.
The learning network corresponding to the target detection model comprises:
the Inception ResNet v2 network is used for extracting features from the input frame image and outputs a feature map to the RPN network and the RoI pooling layer;
the RPN network receives the feature map output by the Inception ResNet v2 network, extracts rectangular candidate regions that may contain targets, and outputs them to the RoI pooling layer;
the RoI pooling layer receives the feature map output by the Inception ResNet v2 network and the rectangular candidate regions output by the RPN network, maps the rectangular candidate regions onto the feature map, and outputs the result to the fully connected layer;
the fully connected layer receives the feature map output by the RoI pooling layer, outputs the category to which the object in each rectangular candidate region belongs together with its classification confidence, adjusts the boundary of the object in the rectangular candidate region, and outputs its coordinate information.
Pedestrians and booths in the images are labeled with their respective class labels to form the training samples used to train the target detection model.
The Inception ResNet v2 network corresponding to the pedestrian recognition model and the booth recognition model comprises:
the first layer is a Reshape function layer;
the second and third layers are 3 × 3 convolution layers;
the fourth layer is a max pooling layer;
the fifth and sixth layers are 3 × 3 convolution layers;
the seventh layer is a max pooling layer;
the eighth to thirteenth layers alternate Reduction network modules and Inception network modules;
the fourteenth layer is a 3 × 3 convolution layer;
the fifteenth layer is an average pooling layer;
the sixteenth layer is an output layer;
the seventeenth layer is a 1 × 1024 fully connected layer, which outputs a feature vector of dimension 1 × 1024;
the eighteenth layer is a 1 × N fully connected layer, which classifies the object represented by the 1 × 1024-dimensional vector and outputs the object class and classification confidence, where N is the number of classes.
The eighth to thirteenth layers of the Inception ResNet v2 network are, in order: a Reduction-A module, 5 Inception-A modules in series, a Reduction-B module, 10 Inception-B modules in series, a Reduction-C module, and 5 Inception-C modules in series.
The Reduction-A module consists of four parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the fourth is a 1 × 1 convolution layer followed by an average pooling layer; the four branches are output in parallel. The Reduction-B module consists of three parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is an average pooling layer; the three branches are joined by a Concat layer and output after splicing. The Reduction-C module consists of four parallel branches: the first is two 1 × 1 convolution layers; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the fourth is an average pooling layer; the four branches are joined by a Concat layer and output after splicing.
The Inception-A module consists of three parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the three branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual (shortcut) connection. The Inception-B module consists of two parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the two branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual connection. The Inception-C module consists of two parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the two branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual connection.
Pedestrian images are extracted and all images of the same pedestrian are given the same class label, with different pedestrians receiving different labels; these samples train the pedestrian recognition model. Likewise, images of mobile booths are extracted and all images of the same booth are given the same class label, with different booths receiving different labels; these samples train the booth recognition model.
In step (3), the method for keeping the fixed booth is as follows:
storing the position and feature vector of each detected booth in a database, together with a counting variable COUNT; each time a booth is detected, comparing its feature vector with the stored targets; if the same target is stored in the database and its coordinate change is smaller than a preset value, increasing the count (COUNT = COUNT + n1) and updating the corresponding target's information in the database; if the target is not stored in the database, storing it in the database; if a target in the database does not appear in a frame, decreasing the count (COUNT = COUNT - n2); giving a highest threshold COUNT_MAX and a lowest threshold COUNT_MIN; if COUNT is greater than COUNT_MAX, setting COUNT to the highest value COUNT_MAX; if COUNT is less than COUNT_MIN, deleting the current target.
The preset value for the coordinate change is adjusted according to the actual situation.
If COUNT is greater than COUNT_MAX, COUNT is set to the highest value COUNT_MAX; this prevents data overflow from an excessively large count value while ensuring that data in the database is not deleted excessively.
The method for obtaining the pedestrians corresponding to each fixed booth in step (4) is as follows: given the number n of fixed booths, the center points of the n fixed booths are taken as the initial sample points; the pedestrians are then classified by the K-means clustering method according to the distance between each pedestrian's center position and the centroid of each cluster, finally yielding n clusters corresponding to the n fixed booths.
In step (5), the method for distinguishing different pedestrians and booths with the pedestrian recognition model and the booth recognition model, i.e. for judging whether pedestrians or booths in different frame images are the same pedestrian or booth, is as follows: the pedestrian recognition model extracts features from the pedestrian image to obtain the pedestrian's feature vector; the booth recognition model extracts features from the booth image to obtain the booth's feature vector; the newly extracted feature vectors are then compared with the stored feature vectors of pedestrians and booths.
The feature distance D under the Euclidean metric is calculated from the feature vectors, and a threshold T is given: if D is greater than T, the pedestrians or booths in the different frame images are not the same booth or pedestrian; if D is less than or equal to T, they are the same booth or pedestrian.
The feature distance under the Euclidean metric is

$$D = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$$

where D denotes the Euclidean distance, n = 1024 is the feature vector dimension, and a_i and b_i denote the value of the i-th dimension of feature vectors a and b, which represent pedestrians or booths in different frame images.
In step (6), the method for determining whether the pedestrians classified to the same fixed booth are vendors is as follows: a database is established for pedestrians, storing each pedestrian's feature information, historical classification information, and a counting variable COUNT. The historical classification information is the record of the clusters to which a pedestrian has been assigned by the K-means clustering method over the processed frames. Each time a pedestrian is detected, it is compared with the pedestrians in the database: if the same pedestrian is found, the count is increased (COUNT = COUNT + n1) and the current classification is appended to the historical classification information; if the same pedestrian is not found, the pedestrian's information is added to the database. If a pedestrian does not appear in the current frame, that pedestrian's count is decreased (COUNT = COUNT - n2). Given a count threshold parameter C_THRESHOLD and a percentage threshold parameter P_THRESHOLD, a pedestrian is determined to be a mobile vendor if the pedestrian's historical classification information is sufficiently long, i.e. greater than C_THRESHOLD, and the percentage of records assigning the pedestrian to one particular cluster is greater than P_THRESHOLD. A highest threshold COUNT_MAX and a lowest threshold COUNT_MIN are also given: if COUNT is greater than COUNT_MAX, COUNT is set to the highest value COUNT_MAX; if COUNT is less than COUNT_MIN, the corresponding pedestrian is deleted from the database.
The invention adopts Faster R-CNN (Faster Region-based Convolutional Neural Network), a mainstream deep learning framework for target detection whose recognition accuracy is higher than that of other methods. Analyzing the positions of pedestrians and booths requires a clustering algorithm; K-means is a simple and effective unsupervised clustering algorithm that randomly selects initial sample points and divides samples into categories according to their distances in feature space.
The method provided by the invention obtains the positions of pedestrians and booths from road monitoring video, analyzes the target features, screens and filters the data to obtain the positions and number of fixed booths, and identifies vendors among the pedestrians by a K-means-clustering-based method, thereby collecting evidence automatically.
The practical benefits of the invention are chiefly these: by combining deep learning techniques, the system automatically collects evidence against illegal mobile vendors using the existing urban road video surveillance network, effectively improving the efficiency of urban management departments and reducing labor cost.
Drawings
FIG. 1 is a flow chart of the illegal mobile vendor identification method provided by the invention;
FIG. 2 is the structure of the Inception ResNet v2 network provided by the invention;
FIG. 3 shows the Reduction network modules in the Inception ResNet v2 network;
FIG. 4 shows the Inception network modules in the Inception ResNet v2 network;
FIG. 5 shows the Inception-C network module in the Inception ResNet v2 network;
FIG. 6 is the network structure of the target detection model provided by the invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
As shown in FIG. 1, the method for identifying illegal mobile vendors based on deep learning target detection includes the following steps:
(1) A road monitoring video is acquired and cut into frame images.
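For illustration, the frame cutting in step (1) can be done with OpenCV. The following is a minimal sketch; the video path and the sampling interval every_n are assumptions for the example, since the invention does not fix a sampling rate:

```python
import cv2

def cut_video_to_frames(video_path, every_n=25):
    """Yield every n-th frame of a road monitoring video.

    video_path and every_n are illustrative; the patent does not
    prescribe a container format or a sampling rate.
    """
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx, frame
        idx += 1
    cap.release()

# Usage sketch: iterate the frames and pass each one to the target
# detection model of step (2).
# for idx, frame in cut_video_to_frames("road_camera.mp4"):
#     ...
```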
(2) The positions of booths and pedestrians are detected from the frame images using the target detection model.
The target detection model is obtained by training a learning network consisting of an Inception ResNet v2 network and a Faster R-CNN network; the pedestrian recognition model and the booth recognition model are obtained by training Inception ResNet v2 networks.
As shown in fig. 6, the learning network corresponding to the target detection model includes:
the Inception ResNet v2 network is used for extracting features from the input frame image and outputs a feature map to the RPN network and the RoI pooling layer;
the RPN network receives the feature map output by the Inception ResNet v2 network, extracts rectangular candidate regions that may contain targets, and outputs them to the RoI pooling layer;
the RoI pooling layer receives the feature map output by the Inception ResNet v2 network and the rectangular candidate regions output by the RPN network, maps the rectangular candidate regions onto the feature map, and outputs the result to the fully connected layer;
the fully connected layer receives the feature map output by the RoI pooling layer, outputs the category to which the object in each rectangular candidate region belongs together with its classification confidence, adjusts the boundary of the object in the rectangular candidate region, and outputs its coordinate information.
Pedestrians and booths in the images are labeled with their respective class labels to form the training samples used to train the target detection model.
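As a rough sketch of how such a detector is driven in code, the snippet below uses torchvision's stock Faster R-CNN (with a ResNet-50 FPN backbone) as a stand-in, since torchvision does not ship an Inception ResNet v2 backbone; the three-class label set (background, pedestrian, booth) and the score threshold are assumptions, and the model would first have to be trained on the labeled samples described above:

```python
import torch
import torchvision

# Stand-in detector: torchvision's Faster R-CNN with a ResNet-50 FPN
# backbone (the patent pairs Faster R-CNN with Inception ResNet v2;
# this substitution is only for illustration).
NUM_CLASSES = 3  # background + pedestrian + booth (assumed label set)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)
model.eval()  # weights are untrained here; training on labeled frames comes first

def detect(frame_tensor, score_thresh=0.5):
    """Run detection on one frame tensor (C x H x W, float values in [0, 1]).

    Returns the boxes, class labels and confidences above score_thresh.
    """
    with torch.no_grad():
        out = model([frame_tensor])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```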
As shown in FIG. 2, the Inception ResNet v2 network corresponding to the pedestrian recognition model and the booth recognition model comprises:
the first layer is a Reshape function layer;
the second and third layers are 3 × 3 convolution layers;
the fourth layer is a max pooling layer;
the fifth and sixth layers are 3 × 3 convolution layers;
the seventh layer is a max pooling layer;
the eighth to thirteenth layers alternate Reduction network modules and Inception network modules;
the fourteenth layer is a 3 × 3 convolution layer;
the fifteenth layer is an average pooling layer;
the sixteenth layer is an output layer;
the seventeenth layer is a 1 × 1024 fully connected layer, which outputs a feature vector of dimension 1 × 1024;
the eighteenth layer is a 1 × N fully connected layer, which classifies the object represented by the 1 × 1024-dimensional vector and outputs the object class and classification confidence, where N is the number of classes.
The eighth to thirteenth layers of the Inception ResNet v2 network are, in order: a Reduction-A module, 5 Inception-A modules in series, a Reduction-B module, 10 Inception-B modules in series, a Reduction-C module, and 5 Inception-C modules in series.
As shown in FIG. 3, the Reduction-A module consists of four parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the fourth is a 1 × 1 convolution layer followed by an average pooling layer; the four branches are output in parallel. The Reduction-B module consists of three parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is an average pooling layer; the three branches are joined by a Concat layer and output after splicing. The Reduction-C module consists of four parallel branches: the first is two 1 × 1 convolution layers; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the fourth is an average pooling layer; the four branches are joined by a Concat layer and output after splicing.
As shown in FIGS. 4 and 5, the Inception-A module consists of three parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the third is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the three branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual (shortcut) connection. The Inception-B module consists of two parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the two branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual connection. The Inception-C module consists of two parallel branches: the first is a 1 × 1 convolution layer; the second is a 1 × 1 convolution layer followed by a 3 × 3 convolution layer; the two branches are joined by a Concat layer and, after a 3 × 3 convolution layer, form the output together with the residual connection.
The Inception ResNet v2 network also incorporates the ResNet structure: residual (shortcut) connections pass the input directly to the output, bypassing the intermediate modules, which counteracts the accuracy degradation that can occur as network depth increases.
Pedestrian images are extracted and all images of the same pedestrian are given the same class label, with different pedestrians receiving different labels; these samples train the pedestrian recognition model. Likewise, images of mobile booths are extracted and all images of the same booth are given the same class label, with different booths receiving different labels; these samples train the booth recognition model.
(3) Moving booths are filtered out of the image according to the booth positions, and the fixed booths are kept.
During the analysis of mobile booths, the target detection network cannot guarantee that every pedestrian and booth is detected in every frame: a pedestrian or booth detected in one frame may be missed in the next, which complicates the analysis. Booths that are in motion therefore need to be removed.
Specifically, the position and feature vector of each detected booth are stored in a database, together with a counting variable COUNT. Each time a booth is detected, its feature vector is compared with the stored targets. If the same target is stored in the database and its coordinate change is smaller than a preset value, the count is increased (COUNT = COUNT + n1) and the corresponding target's information in the database is updated; if the target is not stored in the database, it is added to the database. If a target in the database does not appear in a frame, its count is decreased (COUNT = COUNT - n2). A highest threshold COUNT_MAX and a lowest threshold COUNT_MIN are given: if COUNT is greater than COUNT_MAX, COUNT is set to the highest value COUNT_MAX; if COUNT is less than COUNT_MIN, the current target is deleted.
The preset value for the coordinate change is adjusted according to the actual situation.
If COUNT is greater than COUNT_MAX, COUNT is set to the highest value COUNT_MAX; this prevents data overflow from an excessively large count value while ensuring that data in the database is not deleted excessively.
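A minimal sketch of this bookkeeping follows; the increments n1 and n2, the thresholds, and the matching tolerances are placeholders to be tuned, not values fixed by the invention:

```python
import numpy as np

N1, N2 = 2, 1                   # count increment / decrement per frame (assumed)
COUNT_MAX, COUNT_MIN = 50, 0    # highest / lowest thresholds (assumed)
COORD_EPS = 20.0                # preset max coordinate change, in pixels (tune per scene)
FEAT_T = 0.8                    # feature-distance threshold for "same target" (assumed)

booths = []  # each entry: {"pos": np.ndarray, "feat": np.ndarray, "count": int}

def update_booths(detections):
    """detections: list of (pos, feat) pairs for the booths in the current frame."""
    seen = set()
    for pos, feat in detections:
        match = None
        for i, b in enumerate(booths):
            if (np.linalg.norm(feat - b["feat"]) <= FEAT_T
                    and np.linalg.norm(pos - b["pos"]) < COORD_EPS):
                match = i
                break
        if match is None:                       # target not in database: store it
            booths.append({"pos": pos, "feat": feat, "count": N1})
            seen.add(len(booths) - 1)
        else:                                   # known target: bump and cap COUNT
            b = booths[match]
            b["count"] = min(b["count"] + N1, COUNT_MAX)
            b["pos"], b["feat"] = pos, feat
            seen.add(match)
    for i, b in enumerate(booths):              # decay targets missing this frame
        if i not in seen:
            b["count"] -= N2
    booths[:] = [b for b in booths if b["count"] >= COUNT_MIN]  # drop below COUNT_MIN

# Booths whose COUNT stays high over many frames are treated as fixed booths.
```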
(4) Based on the positions and number of the fixed booths, the pedestrians are clustered by the K-means method to obtain the pedestrians corresponding to each fixed booth.
Specifically, given the number n of fixed booths, the center points of the n fixed booths are taken as the initial sample points; the pedestrians are then classified by the K-means clustering method according to the distance between each pedestrian's center position and the centroid of each cluster, finally yielding n clusters corresponding to the n fixed booths.
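A sketch of this seeded clustering with scikit-learn is shown below; n_init=1 keeps the booth centers as the sole initialization, as the description requires:

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_pedestrians_to_booths(booth_centers, pedestrian_centers):
    """Cluster pedestrian center points with K-means, seeding the initial
    cluster centers at the n fixed booth centers.

    booth_centers: (n, 2) array; pedestrian_centers: (m, 2) array of
    image coordinates (m >= n is assumed).
    """
    init = np.asarray(booth_centers, dtype=float)
    km = KMeans(n_clusters=len(init), init=init, n_init=1)
    labels = km.fit_predict(np.asarray(pedestrian_centers, dtype=float))
    return labels  # labels[i] is the booth-cluster index for pedestrian i
```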
(5) The pedestrian recognition model and the booth recognition model are used to judge whether pedestrians or booths in different frame images are the same pedestrian or booth.
Taking pedestrians as an example: the target detection model detects and locates the pedestrians present in each frame, but cannot judge whether two detections in consecutive frames are the same person. Therefore, each time a frame is processed, the pedestrian positions are obtained with the target detection model and the features of each pedestrian image are extracted with the pedestrian recognition model, yielding a feature vector for each pedestrian.
Specifically, whether two objects are the same pedestrian or booth is judged from the distance in feature space between the feature vectors produced for them by the pedestrian recognition model and the booth recognition model respectively.
Specifically, the pedestrian recognition model extracts features from the pedestrian image to obtain the pedestrian's feature vector, and the booth recognition model extracts features from the booth image to obtain the booth's feature vector; the newly extracted feature vectors are then compared with the stored feature vectors of pedestrians and booths.
The feature distance D under the Euclidean metric is calculated from the feature vectors, and a threshold T is given: if D is greater than T, the pedestrians or booths in the different frame images are not the same booth or pedestrian; if D is less than or equal to T, they are the same booth or pedestrian.
The feature distance under the Euclidean metric is

$$D = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$$

where D denotes the Euclidean distance, n = 1024 is the feature vector dimension, and a_i and b_i denote the value of the i-th dimension of feature vectors a and b, which represent pedestrians or booths in different frame images.
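In code, the comparison reduces to a few lines; the threshold value below is an assumption to be tuned on validation data:

```python
import numpy as np

T = 0.9  # distance threshold; an illustrative value, not fixed by the invention

def same_target(feat_a, feat_b, threshold=T):
    """Return True if two 1024-dimensional feature vectors (from the
    pedestrian or booth recognition model) describe the same target."""
    d = np.sqrt(np.sum((np.asarray(feat_a) - np.asarray(feat_b)) ** 2))
    return d <= threshold
```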
(6) Whether the pedestrians classified to the same fixed booth are vendors is judged.
Specifically, a database is established for pedestrians, storing each pedestrian's feature information, historical classification information, and a counting variable COUNT. The historical classification information is the record of the clusters to which a pedestrian has been assigned by the K-means clustering method over the processed frames. Each time a pedestrian is detected, it is compared with the pedestrians in the database: if the same pedestrian is found, the count is increased (COUNT = COUNT + n1) and the current classification is appended to the historical classification information; if the same pedestrian is not found, the pedestrian's information is added to the database. If a pedestrian does not appear in the current frame, that pedestrian's count is decreased (COUNT = COUNT - n2). Given a count threshold parameter C_THRESHOLD and a percentage threshold parameter P_THRESHOLD, a pedestrian is determined to be a mobile vendor if the pedestrian's historical classification information is sufficiently long, i.e. greater than C_THRESHOLD, and the percentage of records assigning the pedestrian to one particular cluster is greater than P_THRESHOLD. A highest threshold COUNT_MAX and a lowest threshold COUNT_MIN are also given: if COUNT is greater than COUNT_MAX, COUNT is set to the highest value COUNT_MAX; if COUNT is less than COUNT_MIN, the corresponding pedestrian is deleted from the database.
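The final decision rule can be sketched as follows; the two threshold values are placeholders, as the patent leaves them to be configured:

```python
from collections import Counter

C_THRESHOLD = 30    # minimum amount of classification history (assumed value)
P_THRESHOLD = 0.8   # minimum fraction assigned to one booth cluster (assumed value)

def is_mobile_vendor(history):
    """history: the booth-cluster labels assigned to one pedestrian over the
    processed frames (the 'historical classification information').

    The pedestrian is flagged as a mobile vendor once enough history has
    accumulated and a single booth cluster dominates it.
    """
    if len(history) <= C_THRESHOLD:
        return False
    _, freq = Counter(history).most_common(1)[0]
    return freq / len(history) > P_THRESHOLD
```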

Claims (6)

1. An illegal mobile vendor identification method based on deep learning target detection comprises the following steps:
(1) acquiring a road monitoring video, and cutting the road monitoring video into frame images;
(2) detecting the positions of the booth and the pedestrians from the frame image by using the target detection model;
(3) filtering the moving booth in the image according to the position of the booth, and keeping a fixed booth;
(4) based on the positions and the number of the fixed booths, clustering the pedestrians by using a K-means clustering method to obtain the pedestrians corresponding to each fixed booth;
(5) distinguishing whether pedestrians or booths in different frame images are the same pedestrian or booths by utilizing a pedestrian recognition model and a booth recognition model;
(6) judging whether the pedestrians classified to the same fixed booth are vendors or not;
the target detection model is obtained by training a learning network consisting of an Inception ResNet v2 network and a Faster R-CNN network; the pedestrian recognition model and the booth recognition model are obtained by training Inception ResNet v2 networks;
the method for keeping the fixed booths in step (3) comprises: storing the position and feature vector of each detected booth in a database, together with a counting variable COUNT; each time a booth is detected, comparing its feature vector with the stored targets; if the same target is stored in the database and its coordinate change is smaller than a preset value, increasing the count (COUNT = COUNT + n1) and updating the corresponding target's information in the database; if the target is not stored in the database, storing it in the database; if a target in the database does not appear in a frame, decreasing the count (COUNT = COUNT - n2).
2. The method of claim 1, wherein the step (3) of keeping the fixed booths further comprises: giving a highest threshold COUNT_MAX and a lowest threshold COUNT_MIN; if COUNT is greater than COUNT_MAX, setting COUNT to the highest value COUNT_MAX; if COUNT is less than COUNT_MIN, deleting the current target.
3. The method for identifying illegal mobile vendors based on deep learning target detection of claim 1, wherein the step (4) of obtaining the pedestrians corresponding to each fixed booth comprises: according to the number n of fixed booths, taking the center points of the n fixed booths as the initial sample points; classifying the pedestrians by the K-means clustering method according to the distance between each pedestrian's center position and the centroid of each cluster, finally yielding n clusters corresponding to the n fixed booths.
4. The method for identifying illegal mobile vendors based on deep learning target detection of claim 1, wherein the step (5) of distinguishing whether pedestrians or booths in different frame images are the same pedestrian or booth comprises:
extracting features from the pedestrian image with the pedestrian recognition model to obtain the pedestrian's feature vector; extracting features from the booth image with the booth recognition model to obtain the booth's feature vector; and comparing the newly extracted feature vectors with the stored feature vectors of pedestrians and booths;
calculating the feature distance D under the Euclidean metric from the feature vectors; giving a threshold T: if D is greater than T, the pedestrians or booths in the different frame images are not the same booth or pedestrian; if D is less than or equal to T, they are the same booth or pedestrian.
5. The method for identifying illegal mobile vendors based on deep learning target detection of claim 1, wherein the step (6) of determining whether the pedestrians classified to the same fixed booth are vendors comprises:
establishing a database for pedestrians that stores each pedestrian's feature information, historical classification information, and a counting variable COUNT, the historical classification information being the record of the clusters to which a pedestrian has been assigned by the K-means clustering method over the processed frames; each time a pedestrian is detected, comparing it with the pedestrians in the database; if the same pedestrian is found, increasing the count (COUNT = COUNT + n1) and appending the current classification to the historical classification information; if the same pedestrian is not found, adding the pedestrian's information to the database; if a pedestrian does not appear in the current frame, decreasing that pedestrian's count (COUNT = COUNT - n2);
giving a count threshold parameter C_THRESHOLD and a percentage threshold parameter P_THRESHOLD: a pedestrian is determined to be a mobile vendor if the pedestrian's historical classification information is sufficiently long, i.e. greater than C_THRESHOLD, and the percentage of records assigning the pedestrian to one particular cluster is greater than P_THRESHOLD.
6. The method for identifying illegal mobile vendors based on deep learning target detection of claim 1, wherein the step (6) of determining whether the pedestrians classified to the same fixed booth are vendors further comprises: giving a highest threshold COUNT_MAX and a lowest threshold COUNT_MIN; if COUNT is greater than COUNT_MAX, setting COUNT to the highest value COUNT_MAX; and if COUNT is less than COUNT_MIN, deleting the corresponding pedestrian from the database.
CN201810688380.0A 2018-06-28 2018-06-28 Illegal mobile vendor identification method based on deep learning target detection Active CN108921083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810688380.0A CN108921083B (en) 2018-06-28 2018-06-28 Illegal mobile vendor identification method based on deep learning target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810688380.0A CN108921083B (en) 2018-06-28 2018-06-28 Illegal mobile vendor identification method based on deep learning target detection

Publications (2)

Publication Number Publication Date
CN108921083A CN108921083A (en) 2018-11-30
CN108921083B (en) 2021-07-27

Family

ID=64422018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810688380.0A Active CN108921083B (en) 2018-06-28 2018-06-28 Illegal mobile vendor identification method based on deep learning target detection

Country Status (1)

Country Link
CN (1) CN108921083B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345435A (en) * 2018-12-07 2019-02-15 山东晴天环保科技有限公司 Occupy-street-exploit managing device and method
CN109726717B (en) * 2019-01-02 2022-03-01 西南石油大学 Vehicle comprehensive information detection system
CN109977782B (en) * 2019-02-27 2021-01-08 浙江工业大学 Cross-store operation behavior detection method based on target position information reasoning
CN110276254A (en) * 2019-05-17 2019-09-24 恒锋信息科技股份有限公司 No peddler region street pedlar's automatic identification method for early warning based on unmanned plane
CN110287207A (en) * 2019-06-30 2019-09-27 北京健康有益科技有限公司 A kind of quality of food estimating and measuring method based on density meter
CN110458082B (en) * 2019-08-05 2022-05-03 城云科技(中国)有限公司 Urban management case classification and identification method
CN110992645A (en) * 2019-12-06 2020-04-10 江西洪都航空工业集团有限责任公司 Mobile vendor detection and alarm system in dynamic scene
CN111553321A (en) * 2020-05-18 2020-08-18 城云科技(中国)有限公司 Mobile vendor target detection model, detection method and management method thereof
CN114255409A (en) * 2020-09-23 2022-03-29 中兴通讯股份有限公司 Man-vehicle information association method, device, equipment and storage medium
CN112464015A (en) * 2020-12-17 2021-03-09 郑州信大先进技术研究院 Image electronic evidence screening method based on deep learning
CN113163334A (en) * 2021-02-19 2021-07-23 合肥海赛信息科技有限公司 Intelligent mobile vendor detection method based on video analysis
CN112949510A (en) * 2021-03-08 2021-06-11 香港理工大学深圳研究院 Human detection method based on fast R-CNN thermal infrared image

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034212A (en) * 2010-06-21 2011-04-27 艾浩军 City management system based on video analysis
CN104166841B (en) * 2014-07-24 2017-06-23 浙江大学 The quick detection recognition methods of pedestrian or vehicle is specified in a kind of video surveillance network
CN106845325B (en) * 2015-12-04 2019-10-22 杭州海康威视数字技术股份有限公司 A kind of information detecting method and device
CN107679078B (en) * 2017-08-29 2020-01-10 银江股份有限公司 Bayonet image vehicle rapid retrieval method and system based on deep learning

Also Published As

Publication number Publication date
CN108921083A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108921083B (en) Illegal mobile vendor identification method based on deep learning target detection
CN108830188B (en) Vehicle detection method based on deep learning
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN104700099B (en) The method and apparatus for recognizing traffic sign
CN103034836B (en) Road sign detection method and road sign checkout equipment
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN110866430B (en) License plate recognition method and device
KR101697161B1 (en) Device and method for tracking pedestrian in thermal image using an online random fern learning
CN105574550A (en) Vehicle identification method and device
CN106295532B (en) A kind of human motion recognition method in video image
CN101980245B (en) Adaptive template matching-based passenger flow statistical method
CN110781836A (en) Human body recognition method and device, computer equipment and storage medium
CN110298297A (en) Flame identification method and device
CN104537647A (en) Target detection method and device
CN111274886B (en) Deep learning-based pedestrian red light running illegal behavior analysis method and system
CN104615986A (en) Method for utilizing multiple detectors to conduct pedestrian detection on video images of scene change
Yang et al. Improved lane detection with multilevel features in branch convolutional neural networks
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN103679187A (en) Image identifying method and system
CN111738300A (en) Optimization algorithm for detecting and identifying traffic signs and signal lamps
CN112084890A (en) Multi-scale traffic signal sign identification method based on GMM and CQFL
CN104915642A (en) Method and apparatus for measurement of distance to vehicle ahead
CN109543498B (en) Lane line detection method based on multitask network
CN108073940A (en) A kind of method of 3D object instance object detections in unstructured moving grids
CN101216886A (en) A shot clustering method based on spectral segmentation theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant