CN109344688A - Method for automatically recognizing people in surveillance video based on a convolutional neural network - Google Patents
- Publication number
- CN109344688A (application number CN201810890872.8A)
- Authority
- CN
- China
- Prior art keywords
- people
- image
- neural networks
- convolutional neural
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Abstract
The invention discloses a method for automatically recognizing people in surveillance video based on a convolutional neural network. The method processes the collected video for training by converting it into a sequence of frames, and proposes a convolutional neural network that, for frames containing a person, extracts the person's silhouette. Feature values are extracted from each detected person to build a database, and the information in the database is labelled. The features of a test image are then compared and classified against the information in the server database, and a similarity score is computed for the classification result. The recognition pipeline is, in order: image acquisition, image segmentation, image feature extraction, database construction, and image matching, which together realize one recognition pass. The invention offers a flexible design and strong functionality, the system can be kept small, it meets future demands for various new scenarios, and it can be applied in many network-security monitoring settings.
Description
Technical field
The present invention relates to a method for recognizing people in surveillance video, and in particular to a method, within deep-learning video surveillance based on convolutional neural networks, for comparing and recognizing features of a person such as silhouette and shape (including front, side, and back views). It belongs to the technical field of network security.
Background technique
With growing awareness of personal security, video surveillance has become an important component of security systems and an effective means of fighting crime and safeguarding the normal order of people's lives, with vast market prospects. A traditional video surveillance system captures the entire monitored picture, including both people and background. An operator must then spend time and effort analyzing the collected surveillance footage for the specific information of interest. Technical means are therefore needed to save image-recognition time and improve recognition accuracy. Automatically identifying the people in a video is thus an important task: finding the people a scene contains is of great significance for scene analysis and for networked person identification. Driven by the needs of network security and the Internet of Things, the present invention uses cameras to realize network monitoring, processes the video captured by an IP camera by splitting it into individual frames, analyzes each frame, and finds the people it contains. Common methods for person recognition include geometric-feature methods, template matching, subspace methods, hidden Markov models, neural-network methods, elastic graph matching, and flexible (deformable) model methods.
Existing public-security identification technology is essentially face recognition, for example application number 201510471210.3, "A real-time face recognition method and system based on deep learning", and application number 201510063023.1, "A face recognition system based on deep learning under surveillance scenes"; these patents all perform face recognition with deep learning. In practice, however, collecting usable faces through video surveillance is relatively difficult and requires people to cooperate deliberately; in addition, factors such as aging and occlusion can greatly increase the difficulty of face detection and recognition. By contrast, it is easy to capture video segments that contain a person; such a segment can be decomposed into individual frames, and from the person information each frame contains, features such as the person's shape or silhouette can be extracted. Whereas face recognition targets only the face region, the present invention targets detection and recognition of the whole person, carrying out identification from the person's overall silhouette, gait, or motion.
Summary of the invention
To overcome the deficiencies of the existing techniques described above, the main purpose of the present invention is to provide a method for recognizing people in IP-camera surveillance based on a convolutional neural network. The invention proposes a new deep-learning convolutional neural network that realizes the segmentation and recognition of people in images, together with a method of building the results into a database.
To achieve the above aims, the following technical scheme is adopted.

The design of the recognition method of the invention is divided into four parts. The first part performs semantic segmentation of people based on a convolutional neural network: the convolutional neural network model proposed by the invention carries out semantic segmentation of the image and separates out the portrait. The second part extracts a deep-learning feature from each person; this feature can be expressed as that person's feature vector, and the invention extracts image features with a VGG model. The third part builds the database: the feature vector corresponding to the person in a given image is saved to the database as one record. The fourth part, when identifying a person, extracts that person's deep-learning feature, matches it against the deep-learning features of the various people saved in the database, performs pair matching, and automatically displays similar people.
In the first part, semantic segmentation of people based on a convolutional neural network, the video collected by the camera is first converted into individual frames. To extract the people in each frame effectively, the invention uses a pretrained convolutional neural network model to segment each frame of the camera's video. The segmentation result is obtained with the invention's convolutional network; every background pixel value in the image outside the person region is then set to 0, and only the pixel values of the person region are kept.
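The background-zeroing step can be sketched as follows. This is an illustrative NumPy sketch assuming the segmentation network has already produced a binary person mask; the function and array names are not from the patent:

```python
import numpy as np

def keep_person_pixels(frame, person_mask):
    """Set every background pixel to 0, keeping only the person region.

    frame:       H x W x 3 uint8 image (one video frame)
    person_mask: H x W array, 1 where the segmentation model
                 predicts "person", 0 elsewhere
    """
    mask = person_mask.astype(frame.dtype)[..., None]  # H x W x 1
    return frame * mask                                # broadcast over channels
```

Multiplying by the broadcast mask leaves the person's pixel values untouched and zeroes everything else, matching the description above.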
In the second part, deep-learning features are extracted with an improved convolutional neural network VGG model. The person image produced by the segmentation network, as described above, serves as the input to the second part's feature extraction. The image is fed into the VGG model, which is used as a feature extractor: at the model's fully connected layer, an image feature of dimension 4096 is extracted, converting the information about the person in one image into a feature vector.
In the third part, the person feature database is built. Each feature vector obtained in the second part is saved in the database as one record, with a class label attached to the corresponding data to facilitate later comparison and matching. The database is built with proprietary database system software and stored on a dedicated server to ensure data safety and confidentiality, and this server is networked with the surveillance system.
In the fourth part, person identification and matching are performed. An arbitrary test video is chosen and first converted into individual frames. Semantic segmentation is again performed with the convolutional neural network model to extract the people in each frame. The convolutional neural network VGG model then extracts the deep-learning feature of each frame, yielding a feature vector for the person in every frame. Finally, the obtained feature vectors are compared and classified against the information in the feature database built in the third part, and a similarity is computed for the classification result; the invention computes it using Euclidean distance. The computed similarity value lies roughly between 0 and 1: the closer to 1, the smaller the distance, i.e. the better the match, while values further from 1 indicate greater difference. All matched person records whose similarity exceeds a given threshold are output.
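The matching step can be sketched as below. The patent specifies Euclidean distance and a similarity roughly in [0, 1] but not the exact mapping, so the 1 / (1 + d) normalisation and the names here are assumptions:

```python
import numpy as np

def match_person(query, database, threshold=0.8):
    """Match a query feature vector against stored (label, vector) records.

    Similarity is mapped from Euclidean distance d into (0, 1] via
    1 / (1 + d): identical vectors give 1, distant vectors approach 0.
    Returns all records above the threshold, best match first.
    """
    hits = []
    for label, vec in database:
        d = np.linalg.norm(np.asarray(query) - np.asarray(vec))
        sim = 1.0 / (1.0 + d)
        if sim > threshold:
            hits.append((label, sim))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Every record above the threshold is returned, mirroring the requirement that all sufficiently similar people be output.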
Beneficial effects of the present invention:
The invention provides a convolutional neural network and a person recognition method based on it, distinct from traditional face recognition. The invention introduces deep-learning technology: the convolutional neural network semantic segmentation model designed by the invention and the improved VGG model together segment the person out of each frame of the video and extract deep-learning features, which are then compared with the information in the database. The invention can realize network monitoring with a camera while processing the video the IP camera captures, splitting it into individual frames, analyzing each, and finding the people it contains. No two people in the world are exactly alike; each specific person corresponds to an independent feature vector, so a feature-vector database of people can be built from attributes such as silhouette and shape (including front, side, and back views). This lays a foundation for the further development of networked camera security.
Detailed description of the invention
Fig. 1 shows the convolutional neural network proposed by the present invention.
Fig. 2 is a flow chart of the implementation of the method.
Specific embodiment
The present invention will be further explained below with reference to the attached drawings.
Fig. 1 is the convolutional neural network model proposed by the invention for realizing person segmentation.

The model contains three pooling layers together with convolutional layers; features from different layers are fused through skip connections, the feature maps are merged, and the final confidence map, i.e. the segmentation output, is obtained. For preprocessing of the model input, a pretrained residual network ResNet-18 is used with its last layers (other than the output layer) removed to extract features from the original image, giving a feature map of shape (channel x height x width). The 7th-, 6th-, and 5th-from-last layers of the pretrained ResNet-18 (excluding the output layer), denoted ResNet18[-7], ResNet18[-6], and ResNet18[-5], then serve as the subsequent pooling layers: the three pooling layers pool3, pool4, and pool5 correspond to ResNet18[-7], ResNet18[-6], and ResNet18[-5] respectively, finally yielding three intermediate feature maps of different sizes.

Convolution and sampling operations are then applied to the different intermediate maps to obtain the final feature maps of the different layers. The process is as follows, taking an input image of size 3 x 320 x 480 as an example, with pool3, pool4, and pool5 the three pooling layers. First, the 21 x 10 x 15 feature after pool3 is upsampled 32x, giving a 21 x 320 x 480 confidence map. Next, the feature after pool4 is passed through a 1 x 1 convolution to give a 21 x 20 x 30 feature map, which is upsampled 16x to a 21 x 320 x 480 confidence map. Finally, the feature after pool5 is likewise passed through a 1 x 1 convolution and the resulting feature map is upsampled 8x to a 21 x 320 x 480 confidence map. The three feature maps are then fused to give the final feature-map output, yielding the segmentation result of the corresponding regions and realizing person segmentation based on a convolutional neural network.
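The three-branch skip fusion described above can be sketched as follows in PyTorch. The channel widths of the pooled maps (128/256/512) and the module names are illustrative assumptions; the 21-channel score maps and the 320 x 480 output follow the example in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Score each pooled feature map with a 1x1 convolution, upsample all
    three to the input resolution, and sum them into one confidence map."""

    def __init__(self, c3=128, c4=256, c5=512, n_classes=21):
        super().__init__()
        self.score3 = nn.Conv2d(c3, n_classes, kernel_size=1)
        self.score4 = nn.Conv2d(c4, n_classes, kernel_size=1)
        self.score5 = nn.Conv2d(c5, n_classes, kernel_size=1)

    def forward(self, p3, p4, p5, out_size=(320, 480)):
        def up(t):
            return F.interpolate(t, size=out_size, mode="bilinear",
                                 align_corners=False)
        # Fuse the three upsampled confidence maps by elementwise addition.
        return up(self.score3(p3)) + up(self.score4(p4)) + up(self.score5(p5))
```

Bilinear interpolation stands in for the upsampling step; an FCN-style implementation could equally use learned transposed convolutions.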
Fig. 2 shows the overall functional design of the invention. The specific implementation is as follows:
(1) Use an external camera to collect video, divide the video into individual pieces of video data, and attach labels for later identification and comparison.
(2) Pre-process the collected video frames, converting them from their various video file formats into jpg or bmp format. Screen the images, keeping only the image files that contain person information.
(3) Process the resulting image collection with the convolutional neural network of the invention shown in Fig. 1, segmenting the portrait out of each image.
(4) Mapping operation. Set every background pixel value in the image outside the person region to 0, keeping only the pixel values of the person region, and return the segmented image.
(5) Return the images that contain only the color portrait.
(6) Extract features from the processed images with the improved deep-learning VGG model. The model has 15 convolutional layers, each configured with a ReLU activation. Downsampling is performed in the model by max pooling. The model adopts the first 13 layers of the VGG network structure, the better to accommodate the spatial coherence of the image. The invention uses only convolutional layers and no fully connected layers in this model, because fully convolutional operation shares convolutions during feature extraction and reduces information redundancy. The VGG network model has multiple convolutional and downsampling layers. The output of a convolutional node is expressed as $x_n^l = f\big(\sum_m x_m^{l-1} * k_{mn}^l + b_n^l\big)$, where $x^l$ and $x^{l-1}$ are the feature maps of layers $l$ and $l-1$, $k_{mn}^l$ is the convolution kernel from the m-th feature map of layer $l-1$ to the current n-th feature map of layer $l$, $f(x) = 1/[1+\exp(-x)]$ is the activation function of each layer, and $b_n^l$ is the bias. The output of a downsampling node can be expressed as $x_n^l = f\big(\beta_n^l \,\mathrm{down}_{s\times s}(x_n^{l-1}) + b_n^l\big)$, where $s \times s$ is the downsampling size and $\beta_n^l$ is a weight. After feature extraction by this VGG model, a 1 x 4096 feature matrix is obtained for every picture.
(7) Compare the collected person's feature vector with the test set in the database system, computing image similarity via Euclidean distance.
(8) For the selected test image, return the label of the most similar picture in the database; or store the extracted feature-vector matrix in the database system and attach a class label to the picture corresponding to that feature vector.
(9) Finally, display and output the comparison result.
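Steps (7)-(8) above (store labelled feature vectors, then return the label of the nearest record) can be sketched with Python's built-in sqlite3. The schema and function names are illustrative assumptions; the patent says only that proprietary database system software is used:

```python
import sqlite3
import numpy as np

def open_feature_db(path=":memory:"):
    """One row per person record: a class tag plus the 1 x 4096
    feature vector serialised as a float32 blob."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS features (tag TEXT, vec BLOB)")
    return con

def add_record(con, tag, vec):
    """Step (8): store a labelled feature vector in the database."""
    blob = np.asarray(vec, dtype=np.float32).tobytes()
    con.execute("INSERT INTO features VALUES (?, ?)", (tag, blob))
    con.commit()

def nearest_tag(con, query):
    """Step (7)/(8): return the tag of the stored vector with the
    smallest Euclidean distance to the query feature vector."""
    q = np.asarray(query, dtype=np.float32)
    best, best_d = None, float("inf")
    for tag, blob in con.execute("SELECT tag, vec FROM features"):
        d = float(np.linalg.norm(q - np.frombuffer(blob, dtype=np.float32)))
        if d < best_d:
            best, best_d = tag, d
    return best
```

A linear scan over all records suffices at this scale; a production system would index or batch the distance computation.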
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention and are not intended to limit its scope of protection; all equivalent implementations or changes that do not depart from the technical spirit of the invention shall be included within its scope of protection.
Claims (8)
1. A method for automatically recognizing people in surveillance video based on a convolutional neural network, characterized by comprising four parts:
a first part that performs semantic segmentation of people based on a convolutional neural network, segmenting out the portrait in an image;
a second part that extracts a deep-learning feature for each person, the deep-learning feature being expressible as that person's feature vector;
a third part that builds a database, saving the feature vector corresponding to the person in a given image to the database as one record;
a fourth part that, to identify a person, extracts that person's deep-learning feature, matches it for recognition against the deep-learning features of the various people saved in the database, and automatically displays similar people.
2. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 1, characterized in that the first part performs semantic segmentation of the people in the image with a convolutional neural network model, specifically: for the video collected by the camera, first convert the video into individual frames; then segment each frame of the camera's video with a pretrained convolutional neural network model; then set every background pixel value in the image outside the person region to 0, keeping only the pixel values of the person region.
3. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 2, characterized in that the convolutional neural network model is constructed as follows: the model contains three pooling layers together with convolutional layers; a pretrained residual network ResNet-18 model extracts features from the original image, giving a feature map of shape (channel x height x width); intermediate feature maps of different sizes are obtained after the three pooling layers; and after convolution and sampling operations are applied to each intermediate feature map, the three resulting feature maps are fused to give the final feature-map output.
4. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 1, characterized in that the second part extracts the deep-learning feature of the image with an improved convolutional neural network VGG model, specifically: the person image segmented out by the convolutional neural network of the first part serves as the input to the second part's deep-learning feature extraction; the image is input into the improved convolutional neural network VGG model, which is used as a feature extractor; and at the model's fully connected layer an image feature of dimension 4096 is extracted, converting the information about the person in one image into a feature vector.
5. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 1, characterized in that every record in the third part carries a class label, the database is stored on a dedicated server, and the server is networked with the surveillance system.
6. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 1, characterized in that the fourth part is specifically implemented as: first converting the test video into individual frames; performing semantic segmentation with the convolutional neural network model to extract the people in each image; then extracting the deep-learning feature of each image with the convolutional neural network VGG model to obtain the feature vector of the person in each frame; finally comparing and classifying the obtained feature vectors against the information recorded in the feature database built in the third part and computing a similarity for the classification result, the computed similarity lying roughly between 0 and 1, where closer to 1 means smaller distance, i.e. a better match, and conversely greater difference; all matched person records whose similarity exceeds a threshold are output.
7. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to claim 6, characterized in that the similarity is computed using Euclidean distance.
8. The method for automatically recognizing people in surveillance video based on a convolutional neural network according to any one of claims 1 to 7, characterized in that the method is realized by the following specific steps:
(1) use an external camera to collect video, divide the video into individual pieces of video data, and attach labels for later identification and comparison;
(2) pre-process the collected video frames, converting them from their various video file formats into jpg or bmp format, and screen the images;
(3) process the resulting image collection, segmenting the portrait out of each image with the convolutional neural network;
(4) mapping operation: set every background pixel value in the image outside the person region to 0, keep only the pixel values of the person region, and return the segmented image;
(5) return the images containing only the color portrait;
(6) extract features from the processed images with the improved deep-learning VGG model, the VGG model containing 15 convolutional layers, each configured with a ReLU activation; downsampling is performed in the model by max pooling; the model adopts the first 13 layers of the VGG network structure; only convolutional layers and no fully connected layers are used in the model; the VGG model has several convolutional and downsampling layers, where the output of a convolutional node is expressed as $x_n^l = f\big(\sum_m x_m^{l-1} * k_{mn}^l + b_n^l\big)$ and the output of a downsampling node as $x_n^l = f\big(\beta_n^l \,\mathrm{down}_{s\times s}(x_n^{l-1}) + b_n^l\big)$;
after feature extraction by the VGG model, a 1 x 4096 feature matrix is obtained for every picture;
(7) compare the collected person's feature vector with the test set in the database system, computing image similarity via Euclidean distance;
(8) for the selected test image, return the label of the most similar picture in the database, or store the extracted feature-vector matrix in the database system and attach a class label to the picture corresponding to that feature vector;
(9) display and output the comparison result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810890872.8A CN109344688A (en) | 2018-08-07 | 2018-08-07 | The automatic identifying method of people in a kind of monitor video based on convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810890872.8A CN109344688A (en) | 2018-08-07 | 2018-08-07 | The automatic identifying method of people in a kind of monitor video based on convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109344688A true CN109344688A (en) | 2019-02-15 |
Family
ID=65296802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810890872.8A Pending CN109344688A (en) | 2018-08-07 | 2018-08-07 | The automatic identifying method of people in a kind of monitor video based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344688A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978918A (en) * | 2019-03-21 | 2019-07-05 | 腾讯科技(深圳)有限公司 | A kind of trajectory track method, apparatus and storage medium |
CN110044964A (en) * | 2019-04-25 | 2019-07-23 | 湖南科技大学 | Architectural coating layer debonding defect recognition methods based on unmanned aerial vehicle thermal imaging video |
CN110222587A (en) * | 2019-05-13 | 2019-09-10 | 杭州电子科技大学 | A kind of commodity attribute detection recognition methods again based on characteristic pattern |
CN110402840A (en) * | 2019-07-25 | 2019-11-05 | 深圳市阿龙电子有限公司 | Live pig monitoring terminal and live pig monitoring system based on image recognition |
CN110688884A (en) * | 2019-02-28 | 2020-01-14 | 成都通甲优博科技有限责任公司 | Passenger flow statistical method and device |
CN110702042A (en) * | 2019-10-15 | 2020-01-17 | 河海大学常州校区 | Thickness estimation method under signal aliasing condition in ultrasonic pulse echo thickness measurement |
CN111734885A (en) * | 2020-06-30 | 2020-10-02 | 全球能源互联网研究院有限公司 | Converter valve on-line monitoring and evaluating method and system |
CN111845647A (en) * | 2019-04-28 | 2020-10-30 | 上海汽车集团股份有限公司 | Automobile camera cleaning system and method |
WO2020220951A1 (en) * | 2019-04-29 | 2020-11-05 | 杭州海康威视数字技术股份有限公司 | Video recording data storage method and device |
CN113177478A (en) * | 2021-04-29 | 2021-07-27 | 西华大学 | Short video semantic annotation method based on transfer learning |
CN113657169A (en) * | 2021-07-19 | 2021-11-16 | 浙江大华技术股份有限公司 | Gait recognition method, device, system and computer readable storage medium |
CN117315591A (en) * | 2023-11-13 | 2023-12-29 | 安徽光谷智能科技股份有限公司 | Intelligent campus safety monitoring prediction management system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105631427A (en) * | 2015-12-29 | 2016-06-01 | 北京旷视科技有限公司 | Suspicious personnel detection method and system |
CN105631413A (en) * | 2015-12-23 | 2016-06-01 | 中通服公众信息产业股份有限公司 | Cross-scene pedestrian searching method based on depth learning |
US20170243058A1 (en) * | 2014-10-28 | 2017-08-24 | Watrix Technology | Gait recognition method based on deep learning |
CN107169974A (en) * | 2017-05-26 | 2017-09-15 | 中国科学技术大学 | It is a kind of based on the image partition method for supervising full convolutional neural networks more |
CN107273836A (en) * | 2017-06-07 | 2017-10-20 | 深圳市深网视界科技有限公司 | A kind of pedestrian detection recognition methods, device, model and medium |
US9940539B2 (en) * | 2015-05-08 | 2018-04-10 | Samsung Electronics Co., Ltd. | Object recognition apparatus and method |
- 2018-08-07: Application CN201810890872.8A filed in China; published as CN109344688A (status: pending)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170243058A1 (en) * | 2014-10-28 | 2017-08-24 | Watrix Technology | Gait recognition method based on deep learning |
US9940539B2 (en) * | 2015-05-08 | 2018-04-10 | Samsung Electronics Co., Ltd. | Object recognition apparatus and method |
CN105631413A (en) * | 2015-12-23 | 2016-06-01 | 中通服公众信息产业股份有限公司 | Cross-scene pedestrian searching method based on depth learning |
CN105631427A (en) * | 2015-12-29 | 2016-06-01 | 北京旷视科技有限公司 | Suspicious personnel detection method and system |
CN107169974A (en) * | 2017-05-26 | 2017-09-15 | 中国科学技术大学 | It is a kind of based on the image partition method for supervising full convolutional neural networks more |
CN107273836A (en) * | 2017-06-07 | 2017-10-20 | 深圳市深网视界科技有限公司 | A kind of pedestrian detection recognition methods, device, model and medium |
Non-Patent Citations (3)
Title |
---|
He, Kaiming: "Identity Mappings in Deep Residual Networks", Lecture Notes in Computer Science * |
Third Research Institute of the Ministry of Public Security: "Multi-Camera Collaborative Target Detection and Tracking Technology", Southeast University Press, 30 June 2017 * |
Liu Jian: "Research on Pedestrian Detection Methods Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688884A (en) * | 2019-02-28 | 2020-01-14 | Chengdu Tongjia Youbo Technology Co., Ltd. | Passenger flow statistics method and device |
CN109978918A (en) * | 2019-03-21 | 2019-07-05 | Tencent Technology (Shenzhen) Co., Ltd. | Trajectory tracking method, apparatus, and storage medium |
CN110044964A (en) * | 2019-04-25 | 2019-07-23 | Hunan University of Science and Technology | Architectural coating debonding defect recognition method based on unmanned aerial vehicle thermal imaging video |
CN111845647A (en) * | 2019-04-28 | 2020-10-30 | SAIC Motor Corporation Limited | Automobile camera cleaning system and method |
WO2020220951A1 (en) * | 2019-04-29 | 2020-11-05 | Hangzhou Hikvision Digital Technology Co., Ltd. | Video recording data storage method and device |
CN110222587A (en) * | 2019-05-13 | 2019-09-10 | Hangzhou Dianzi University | Commodity attribute detection and re-identification method based on feature maps |
CN110402840A (en) * | 2019-07-25 | 2019-11-05 | Shenzhen Along Electronics Co., Ltd. | Live pig monitoring terminal and live pig monitoring system based on image recognition |
CN110402840B (en) * | 2019-07-25 | 2021-12-17 | Shenzhen Along Electronics Co., Ltd. | Live pig monitoring terminal and live pig monitoring system based on image recognition |
CN110702042A (en) * | 2019-10-15 | 2020-01-17 | Hohai University, Changzhou Campus | Thickness estimation method under signal aliasing conditions in ultrasonic pulse-echo thickness measurement |
CN110702042B (en) * | 2019-10-15 | 2021-07-02 | Hohai University, Changzhou Campus | Thickness estimation method under signal aliasing conditions in ultrasonic pulse-echo thickness measurement |
CN111734885A (en) * | 2020-06-30 | 2020-10-02 | Global Energy Interconnection Research Institute Co., Ltd. | Converter valve online monitoring and evaluation method and system |
CN111734885B (en) * | 2020-06-30 | 2021-11-30 | Global Energy Interconnection Research Institute Co., Ltd. | Converter valve online monitoring and evaluation method and system |
CN113177478A (en) * | 2021-04-29 | 2021-07-27 | Xihua University | Short video semantic annotation method based on transfer learning |
CN113657169A (en) * | 2021-07-19 | 2021-11-16 | Zhejiang Dahua Technology Co., Ltd. | Gait recognition method, device, system, and computer-readable storage medium |
CN117315591A (en) * | 2023-11-13 | 2023-12-29 | Anhui Guanggu Intelligent Technology Co., Ltd. | Intelligent campus safety monitoring and prediction management system |
CN117315591B (en) * | 2023-11-13 | 2024-03-22 | Anhui Guanggu Intelligent Technology Co., Ltd. | Intelligent campus safety monitoring and prediction management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344688A (en) | Automatic identification method for people in surveillance video based on convolutional neural networks | |
CN109829443B (en) | Video behavior identification method based on image enhancement and 3D convolution neural network | |
Bazzani et al. | Multiple-shot person re-identification by hpe signature | |
CN104050471B (en) | Natural scene character detection method and system | |
CN106529467B (en) | Group behavior recognition method based on multi-feature fusion | |
CN104573111B (en) | Structured storage and pre-indexing method for pedestrian data in surveillance video | |
CN106096568A (en) | Pedestrian re-identification method based on CNN and convolutional LSTM networks | |
CN109522853A (en) | Face detection and search method for surveillance video | |
Avgerinakis et al. | Recognition of activities of daily living for smart home environments | |
CN112836646B (en) | Video pedestrian re-identification method based on channel attention mechanism and application | |
CN106355154B (en) | Method for detecting frequent passing of people in surveillance video | |
CN106407966B (en) | Face recognition method applied to attendance checking | |
CN108416780A (en) | Object detection and matching method based on a twin region-of-interest pooling model | |
CN109190456A (en) | Overhead-view pedestrian detection method based on multi-feature fusion of aggregated channel features and gray-level co-occurrence matrices | |
CN111539445B (en) | Object classification method and system for semi-supervised feature fusion | |
CN112801037A (en) | Face tampering detection method based on continuous inter-frame difference | |
Diyasa et al. | Multi-face Recognition for the Detection of Prisoners in Jail using a Modified Cascade Classifier and CNN | |
CN117095471B (en) | Face counterfeiting tracing method based on multi-scale characteristics | |
CN111582195B (en) | Construction method of Chinese lip language monosyllabic recognition classifier | |
CN117710888A (en) | Method and system for re-identifying occluded pedestrians | |
Wang et al. | Human detection based on a sequence of thermal images using deep learning | |
CN112488165A (en) | Infrared pedestrian identification method and system based on deep learning model | |
CN115393788B (en) | Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement | |
CN115690658B (en) | Semi-supervised video abnormal behavior detection method fusing prior knowledge | |
CN111008601A (en) | Fighting detection method based on video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190215 |