CN108280233A - A kind of VideoGIS data retrieval method based on deep learning - Google Patents


Info

Publication number
CN108280233A
CN108280233A (application CN201810162847.8A)
Authority
CN
China
Prior art keywords
videogis
frame
layer
data
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810162847.8A
Other languages
Chinese (zh)
Inventor
邹志强
戴海宏
吴家皋
何旭
熊俊杰
索玉聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810162847.8A
Publication of CN108280233A
Legal status: Pending


Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/73: Querying
    • G06F16/735: Filtering based on additional data, e.g. user or group profiles
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06N: Computing arrangements based on specific computational models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based video GIS data retrieval method, comprising: first, after spatial and temporal sampling of the video GIS data, computing the Euclidean distance of inter-frame differences and extracting key frames from each video shot; then, establishing a deep convolutional neural network model composed of alternating convolutional, activation, and pooling layers, which maps the input video GIS frame images layer by layer to realize a deep feature representation of the frame images; finally, performing hierarchical retrieval: the first layer performs coarse retrieval with a hashing method and the Hamming distance, and the second layer filters the first-layer coarse results to realize top-m fine retrieval of video GIS frame images from the candidate pool. By extracting key frames with the frame-difference Euclidean distance, the present invention greatly improves retrieval efficiency; by training a deep convolutional neural network model to extract higher-level feature representations, it greatly reduces retrieval time and storage overhead.

Description

A video GIS data retrieval method based on deep learning
Technical field
The present invention relates to a deep-learning-based video GIS (Geographic Information System) data retrieval method, and belongs to the technical field of computer vision.
Background technology
Video GIS is a new form of video produced by fusing geographic video with GIS, and its retrieval brings great convenience to government administration and people's daily life. As the breadth and depth of its applications keep growing, video-GIS-related industries have become a new point of industrial growth. Meanwhile, with the development of smart-city construction and the rising requirements of urban security, accurately finding and obtaining the data a user needs from massive video GIS data faces a series of bottleneck problems. On the one hand, an enormous amount of video GIS data has already been accumulated, and huge sums continue to be invested in producing more; on the other hand, the sheer volume of these data and the lack of effective analysis limit the breadth and depth of their application. Analyzing and exploiting these data has therefore become key, and how to quickly and effectively retrieve the required data from video GIS data has become a recent research hotspot.
Traditional video retrieval approaches are text-keyword-based video retrieval and content-based video retrieval (CBVR). Because of limited descriptive power, strong subjectivity, heavy manual workload, and other reasons, text-keyword-based video retrieval is powerless for the typical applications above and cannot meet the demand for deep retrieval of video GIS data. CBVR retrieves identical or similar video clips or key frames from a video database according to content supplied by the user (an image, etc.). In content-based video retrieval, the object of retrieval is often no longer the video data itself, but data describing the "content" of the video, such as color features and texture features.
Video retrieval generally comprises two steps: video preprocessing and feature extraction. The most critical part of video preprocessing is key-frame extraction. A key frame is an image that captures the key content of a video shot; low-level features such as color, texture, and shape can be extracted from key frames and used as the data source for video summarization and database indexing. If every frame of a video were extracted, the data volume would be enormous and would contain repeated and redundant video frames, so key-frame extraction is very important for building a video index.
As for feature extraction, traditional hand-crafted features for video retrieval (color, texture, shape, etc.) require considerable domain knowledge to describe. Deep learning, by contrast, simulates the structure of the human brain: using the basic building blocks of convolutional neural networks (convolutional layers, pooling layers, and fully connected layers), the network can learn and extract the relevant features by itself. Features extracted with deep learning therefore describe video GIS images more accurately, which substantially narrows the retrieval range of video GIS data and achieves the goal of accurate, fast retrieval.
In the prior art, video features can be represented efficiently either as real-valued feature vectors or as binary hash codes. The real-valued method takes the real-valued feature vector extracted from a video frame image as its representation, but comparison at retrieval time is costly and the vectors occupy considerable storage, so it cannot meet the demands of large-scale video GIS data retrieval. The binary-hash method encodes each video frame image as a binary vector; at equal representation length, its storage requirement is far smaller than that of the real-valued method. For example, if one real-valued video feature vector occupies 1024 bytes, then 100 million video features require about 100 GB of storage, whereas if each video feature is represented by a 128-bit hash code, all the video hash codes together need only about 1.6 GB. Moreover, similar video frame images have similar binary codes, and measuring the similarity between binary codes with the Hamming distance is extremely fast.
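The storage figures quoted above, and the Hamming-distance comparison, can be checked with a few lines of stand-alone Python (a sketch for illustration only; the variable names are not from the patent):

```python
# Back-of-the-envelope check of the storage comparison, plus a
# Hamming distance between two binary codes held as Python ints.
n_features = 100_000_000

real_valued_gb = n_features * 1024 / 1e9        # 1024 bytes per real-valued vector
hash_gb = n_features * (128 // 8) / 1e9         # 128 bits = 16 bytes per hash code

print(f"real-valued: {real_valued_gb:.1f} GB")  # ~102.4 GB ("100 GB" in the text)
print(f"128-bit hash: {hash_gb:.1f} GB")        # 1.6 GB

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

print(hamming(0b10110, 0b10011))  # 2 differing bits
```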
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a deep-learning-based video GIS data retrieval method, so as to solve the problems that accurate retrieval results are difficult to obtain in video GIS data retrieval, storage consumption is large, and retrieval speed is slow.
To realize the above, several key problems must be solved: (1) for the repeated and redundant video GIS frames present in the video GIS library, design an efficient key-frame extraction method; (2) for the weak expressive power of low-level video GIS image features in the prior art, use deep learning to realize a feature extraction algorithm based on deep convolutional neural networks; (3) for the problem of retrieval speed, design a hierarchical video GIS data retrieval method that meets the retrieval requirements of large-scale video GIS data in terms of retrieval speed, precision, and so on.
Technical solution: to achieve the above object, the present invention adopts the following technical solution:
A deep-learning-based video GIS data retrieval method, characterized by comprising the following steps:
a. Key-frame extraction
To guarantee the validity of the key frames (i.e., that the number of key frames suffices to represent the video shot) and the efficiency of video GIS data retrieval, and to reflect the temporal characteristics of the video, the present invention computes the Euclidean distance of video GIS inter-frame differences after spatial and temporal sampling of the video, and extracts key frames from each video shot;
The frame difference between consecutive frames is computed with the Euclidean distance. Under normal circumstances, the inter-frame differences of video GIS frames within one shot fluctuate around their mean and vary little. Suppose the inter-frame differences are (D1, D2, ..., Dn-1), where n is the total number of frames in the shot. Since video GIS frames are color images, they must first be converted to grayscale; let the converted frames be (X[1], X[2], ..., X[n]). Formula 1 then gives the frame-difference calculation between all video GIS frames in the shot.
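Formula 1 itself is not reproduced in this text (it appears as an image in the original patent). From the surrounding definitions, a plausible reconstruction is the pixel-wise Euclidean distance between consecutive grayscale frames:

```latex
D_i = \left\| X[i+1] - X[i] \right\|_2
    = \sqrt{\sum_{x,y} \bigl( X[i+1](x,y) - X[i](x,y) \bigr)^2},
\qquad i = 1, \dots, n-1
```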
It should be pointed out that, because video GIS data are high-definition, the key frames have relatively high resolution; this yields too many key points in subsequent extraction, slows feature matching, and hurts video GIS retrieval efficiency. The present invention therefore downsamples the shot before saving the key frames, reducing key-frame resolution while preserving the key-frame information as completely as possible.
b. Deep feature extraction
A deep convolutional neural network model composed of alternating convolutional, activation, and pooling layers is established; the input video GIS frame images are mapped layer by layer through the network, each layer yielding a different representation of the frame image, thereby realizing the deep feature representation of video GIS frame images;
c. Hierarchical retrieval
The retrieval process comprises coarse retrieval and fine retrieval: first, the high-dimensional feature vectors learned by the deep network model are converted to binary codes, and the Hamming distance is used to measure the similarity between binary codes, yielding a candidate pool of similar key frames; then the Euclidean distance is used to measure the similarity between the video GIS frame image to be retrieved and the frame images in the candidate pool, finally producing the top m similar retrieval results.
Further, the key-frame extraction of step a specifically comprises:
Input: video shot V = {V1, V2, ..., Vn}; number of key frames to select: K;
Output: the key frames of the video;
a1. Compute the frame difference between adjacent frames using the Euclidean distance; the loop variable i runs from 1 to n-2, where n is the total number of frames in the shot;
a2. When i = n-2, all video GIS frames of the shot have been traversed; output the Euclidean distances of the frame differences and end the loop; otherwise continue executing a1;
a3. Compute the extrema, maximum, minimum, and median of the frame-difference Euclidean distances;
a4. Retain the extrema greater than the median; the extremum points less than or equal to the median are deleted;
a5. If the chosen key-frame number K is greater than the number of retained extremum points, take all the retained extrema as key frames; otherwise take the first K frames among the retained extrema as key frames.
Further, the deep feature extraction of step b specifically comprises:
b1. Unify the image size before training: use the center-crop method to unify the size to 224*224, i.e., first scale the image proportionally so that its shorter side becomes 224, then crop the longer side symmetrically about the center, retaining a length of 224; this keeps the image undistorted while preserving its main subject;
b2. Establish the deep convolutional neural network model: it comprises 5 convolutional sections and 3 fully connected layers; each convolutional section contains 2-3 convolutional layers, and the tail of each section connects to a max-pooling layer to reduce the picture size; each convolutional layer uses 3*3 filters followed by the Rectified Linear Unit (ReLU) activation function, which performs the nonlinear transformation and enhances the model's ability to learn features;
b3. Loss function and optimization method: after the model is constructed it must be trained; the multi-class logarithmic loss (categorical_crossentropy) is chosen as the loss function, and the parameters are optimized by stochastic gradient descent to minimize it, with learning rate 0.1, decay term 1e-6, and momentum 0.9, using the Nesterov accelerated-gradient optimization algorithm;
b4. Extract features based on the model: when extracting features, images are scaled to the unified size per b1 and fed into the above model for computation while the convolutional neural network is trained, finally obtaining a high-dimensional feature vector. In the initial stage, feature extraction is first performed on the video GIS key-frame library to generate high-dimensional real values, thereby constructing a feature database; when retrieving video GIS data, feature extraction is performed on the video GIS frame image to be retrieved to generate the feature to be retrieved.
Further, the deep convolutional neural network specifically comprises:
Section 1: 2 convolutional layers and one pooling layer. The input is 224 × 224 × 3 image data, processed by convolutional layers with 64 filters of window size 3*3 followed by ReLU activation, giving an output feature of 224 × 224 × 64; max pooling with a 2*2 kernel and stride 2 then yields 112 × 112 × 64 data;
Section 2: 2 convolutional layers and one pooling layer. The input data 112 × 112 × 64 are processed by convolutional layers with 128 filters of window size 3*3 followed by ReLU activation, giving 112 × 112 × 128; max pooling with a 2*2 kernel and stride 2 yields 56 × 56 × 128 data;
Section 3: 3 convolutional layers and one pooling layer. The input data 56 × 56 × 128 are processed by convolutional layers with 256 filters of window size 3*3 followed by ReLU activation, giving 56 × 56 × 256; max pooling with a 2*2 kernel and stride 2 yields 28 × 28 × 256 data;
Section 4: 3 convolutional layers and one pooling layer. The input data 28 × 28 × 256 are processed by convolutional layers with 512 filters of window size 3*3 followed by ReLU activation, giving 28 × 28 × 512; max pooling with a 2*2 kernel and stride 2 yields 14 × 14 × 512 data;
Section 5: 3 convolutional layers and one pooling layer. The input data 14 × 14 × 512 are processed by convolutional layers with 512 filters of window size 3*3 followed by ReLU activation, giving 14 × 14 × 512; max pooling with a 2*2 kernel and stride 2 yields 7 × 7 × 512 data;
Section 6: the input data 7 × 7 × 512 are fully connected to 4096 features, followed by ReLU activation; after Dropout processing (to prevent overfitting of the model), 4096 data are finally obtained;
Section 7: the input data 4096 are fully connected to 4096 features, followed by ReLU activation; after Dropout processing, 4096 data are finally obtained;
Section 8: the input data 4096 are fully connected, obtaining 1000 feature data.
Further, the first-layer coarse retrieval specifically comprises:
To retrieve video GIS data efficiently, the high-dimensional feature vectors learned by the deep network model are converted into binary codes, and the Hamming distance is used to measure the similarity between binary codes, yielding a candidate pool of similar key frames.
To learn the feature representation and a set of hash functions simultaneously, a new fully connected layer is inserted between sections 7 and 8 of the pre-trained convolutional neural network; this layer uses the sigmoid activation function (the S-shaped growth curve) to convert the feature vector output by section 7 of the model into a binary code. The initial parameters of the deep convolutional neural network are obtained by training on the ImageNet dataset (an existing image database), while the initial parameters of the new fully connected layer are built as hash values by means of random projection transforms;
For a video GIS frame to be retrieved, what is extracted first is the output feature of the new fully connected layer, and its binary code is obtained by thresholding (binarizing) the activations; finally, those video GIS frame images in the feature database whose binary codes lie within a given Hamming-distance threshold of the binary code of the frame to be retrieved are placed into the candidate pool.
Further, the second-layer fine retrieval specifically comprises:
In coarse retrieval, the video GIS frame images whose binary hash codes lie within the given Hamming-distance threshold are placed into the candidate pool; to obtain more accurate retrieval results, a precise retrieval method is further applied on the basis of the coarse retrieval.
For the video GIS frame image to be retrieved and the candidate-pool images obtained in coarse retrieval, the similarity between them is computed with the Euclidean distance over the features extracted from section 7 of the convolutional neural network, so as to determine the top m retrieval results among the video GIS frame images in the candidate pool. The smaller the Euclidean distance, the higher the similarity of the two images; the top m similar retrieval results are determined accordingly.
Advantageous effects: compared with the prior art, the deep-learning-based video GIS data retrieval method provided by the invention has the following advantages: 1. because key frames are extracted with the Euclidean distance of frame differences, the problem of repeated and redundant frames in the video GIS library is well solved, memory occupancy is reduced, and video indexing is accelerated; 2. because features are extracted with a deep convolutional neural network, the feature vectors describe video GIS frame images more precisely, give good experimental results, and realize feature extraction for every key frame in large-scale video GIS data; 3. in hierarchical retrieval, the idea of binary hashing improves retrieval speed while guaranteeing precision, achieving retrieval that is both fast and accurate and meeting the retrieval requirements of large-scale video GIS data.
Description of the drawings
Fig. 1 is a flow chart of the deep-learning-based video GIS data retrieval method of the present invention;
Fig. 2 is a flow chart of the key-frame extraction of step a in the present invention;
Fig. 3 is a structure chart of the deep convolutional neural network model in the present invention.
Specific embodiments
The present invention is further described below in conjunction with the accompanying drawings.
As shown in Fig. 1, a deep-learning-based video GIS data retrieval method mainly comprises the following steps:
a. Key-frame extraction
Video GIS data contain much repeated and redundant information; without preprocessing, the data volume would be very large and retrieval efficiency would drop sharply. For example, video GIS data may contain static pictures; if every frame of the video were extracted, there would be repeated or redundant video GIS frames. Therefore the video GIS data must first be preprocessed: shots are segmented and the valuable information representing the main content of each video shot, i.e., the key frames, is selected.
Meanwhile, because video GIS data are high-definition, the key frames have relatively high resolution; this yields too many key points in subsequent extraction, slows feature matching, and hurts video GIS retrieval efficiency. The present invention therefore first downsamples the shot before saving the key frames, reducing key-frame resolution while preserving the key-frame information as completely as possible. In key-frame extraction, the color video GIS frame images are converted to grayscale, and the Euclidean distance between adjacent frame differences is computed, so as to obtain the key frames of the video GIS data and build the key-frame library.
Fig. 2 shows the flow chart of key-frame extraction; the specific steps are as follows:
Input: video shot V = {V1, V2, ..., Vn}; number of key frames to select: K = 5;
Output: the key frames of the video;
a1. Compute the frame difference between adjacent frames using the Euclidean distance; the loop variable i runs from 1 to n-2, where n is the total number of frames in the shot;
a2. When i = n-2, all video GIS frames of the shot have been traversed; output the Euclidean distances of the frame differences and end the loop; otherwise continue executing a1;
a3. Compute the extrema, maximum, minimum, and median of the frame-difference Euclidean distances;
a4. Retain the extrema greater than the median; the extremum points less than or equal to the median are deleted;
a5. If the chosen key-frame number K is greater than the number of retained extremum points, take all the retained extrema as key frames; otherwise take the first K frames among the retained extrema as key frames.
b. Deep feature extraction
Deep networks have strong feature-abstraction capability and can extract semantically rich feature representations from video GIS data. Therefore, to make the obtained hash codes more discriminative, deep feature extraction is used to obtain the deep feature representation of the video GIS data.
The present invention describes the features of video GIS frame images with the VGGNet (deep convolutional neural network) architecture. The deep feature extraction method is designed as 5 convolutional sections, each with its attached pooling layer and nonlinear activation layer, and a global pooling layer is added behind the last convolutional layer to quantize the features, as shown in Fig. 3. It specifically comprises:
b1. Unify the image size before training: use the center-crop method to unify the size to 224*224, i.e., first scale the image proportionally so that its shorter side becomes 224, then crop the longer side symmetrically about the center, retaining a size of 224;
b2. Establish the deep convolutional neural network model: it comprises 5 convolutional sections and 3 fully connected layers; each convolutional section contains 2-3 convolutional layers, and the tail of each section connects to a max-pooling layer to reduce the picture size; each convolutional layer uses 3*3 filters followed by the ReLU activation function, which performs the nonlinear transformation and enhances the model's ability to learn features;
b3. Loss function and optimization method: after the model is constructed it must be trained; the categorical_crossentropy loss function is chosen, and the parameters are optimized by stochastic gradient descent to minimize the loss, with learning rate 0.1, decay term 1e-6, and momentum 0.9, using the Nesterov accelerated-gradient optimization algorithm;
b4. Extract features based on the model: when extracting features, images are scaled to the unified size per b1 and fed into the above model for computation while the convolutional neural network is trained, finally obtaining a high-dimensional feature vector. In the initial stage, feature extraction is first performed on the video GIS key-frame library to generate high-dimensional real values, thereby constructing a feature database; when retrieving video GIS data, feature extraction is performed on the video GIS frame image to be retrieved to generate the feature to be retrieved.
The deep convolutional neural network model specifically comprises:
Section 1: 2 convolutional layers and one pooling layer. The input is 224 × 224 × 3 image data, processed by convolutional layers with 64 filters of window size 3*3 followed by ReLU activation, giving an output feature of 224 × 224 × 64; max pooling with a 2*2 kernel and stride 2 then yields 112 × 112 × 64 data;
Section 2: 2 convolutional layers and one pooling layer. The input data 112 × 112 × 64 are processed by convolutional layers with 128 filters of window size 3*3 followed by ReLU activation, giving 112 × 112 × 128; max pooling with a 2*2 kernel and stride 2 yields 56 × 56 × 128 data;
Section 3: 3 convolutional layers and one pooling layer. The input data 56 × 56 × 128 are processed by convolutional layers with 256 filters of window size 3*3 followed by ReLU activation, giving 56 × 56 × 256; max pooling with a 2*2 kernel and stride 2 yields 28 × 28 × 256 data;
Section 4: 3 convolutional layers and one pooling layer. The input data 28 × 28 × 256 are processed by convolutional layers with 512 filters of window size 3*3 followed by ReLU activation, giving 28 × 28 × 512; max pooling with a 2*2 kernel and stride 2 yields 14 × 14 × 512 data;
Section 5: 3 convolutional layers and one pooling layer. The input data 14 × 14 × 512 are processed by convolutional layers with 512 filters of window size 3*3 followed by ReLU activation, giving 14 × 14 × 512; max pooling with a 2*2 kernel and stride 2 yields 7 × 7 × 512 data;
Section 6: the input data 7 × 7 × 512 are fully connected to 4096 features, followed by ReLU activation; after Dropout processing, 4096 data are finally obtained;
Section 7: the input data 4096 are fully connected to 4096 features, followed by ReLU activation; after Dropout processing, 4096 data are finally obtained;
Section 8: the input data 4096 are fully connected, obtaining 1000 feature data.
C. Hierarchical retrieval
The retrieval process is divided into two levels: coarse retrieval and fine retrieval. The first level performs coarse retrieval with a hashing method and the Hamming distance; the second level filters the results of the first-level coarse retrieval, realizing fine retrieval of the top m VideoGIS frame images from the candidate pool.
1) Coarse retrieval with a hashing method and the Hamming distance
For efficient VideoGIS data retrieval, the high-dimensional feature vectors learned by this model are first converted into binary codes; the Hamming distance is then used to measure the similarity between binary codes, yielding a candidate pool of similar key frames.
To learn the feature representation and a set of hash functions simultaneously, a new fully connected layer is inserted between the 7th and 8th sections of the pre-trained convolutional neural network; this layer uses the sigmoid activation function (the S-shaped growth curve) to convert the feature vector output by the 7th section of the model into a binary code. The initial parameters of the deep convolutional neural network are obtained by training on the ImageNet dataset, while the initial parameters of the new fully connected layer are built as hash values by means of random projection transforms;
For a VideoGIS frame to be retrieved, the feature output by the new fully connected layer is extracted first, and the binary code is obtained by thresholding the activations, the threshold being 0.5; finally, those VideoGIS frame images in the feature database whose binary codes lie within a given Hamming-distance threshold of the binary code of the frame to be retrieved are placed into the candidate pool.
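As a sketch of the coarse-retrieval step (assuming NumPy arrays of sigmoid activations; the function names are illustrative, not from the patent), binarization at threshold 0.5 and Hamming-distance screening could look like:

```python
import numpy as np

def binarize(activations, threshold=0.5):
    # threshold the sigmoid outputs of the inserted hash layer into a binary code
    return (np.asarray(activations) > threshold).astype(np.uint8)

def hamming(a, b):
    # number of differing bits between two binary codes
    return int(np.count_nonzero(a != b))

def candidate_pool(query_code, db_codes, max_dist):
    # keep the indices of database frames within the Hamming-distance threshold
    return [i for i, code in enumerate(db_codes)
            if hamming(query_code, code) <= max_dist]
```

Because the codes are short bit vectors, this screening is far cheaper than comparing the high-dimensional real-valued features directly.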
2) Fine retrieval of the top m VideoGIS frame images from the candidate pool
In the coarse retrieval, the VideoGIS frame images whose binary hash codes lie within the Hamming-distance threshold are placed into the candidate pool; to obtain more accurate retrieval results, a fine retrieval method is further applied on the basis of the coarse retrieval.
For the VideoGIS frame image to be retrieved and the candidate-pool images obtained in the coarse retrieval, the similarity between them is computed with the Euclidean distance over the features extracted from the 7th section of the convolutional neural network, so as to determine the top m retrieval results for the VideoGIS frame image from the candidate pool. The smaller the Euclidean distance, the higher the similarity between two images; the top m similar retrieval results are determined accordingly.
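The fine-retrieval step admits an equally short sketch (NumPy assumed; the function name is illustrative): rank the candidate-pool features by Euclidean distance to the query feature and keep the m closest.

```python
import numpy as np

def top_m(query_feat, candidate_feats, m):
    # Euclidean distance between the query feature and every candidate feature;
    # a smaller distance means a higher similarity
    dists = np.linalg.norm(np.asarray(candidate_feats) - np.asarray(query_feat), axis=1)
    # indices of the m closest candidates, most similar first
    return np.argsort(dists)[:m].tolist()
```

Since the candidate pool is already small after the Hamming-distance screening, this exhaustive distance computation stays cheap.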
Compared with the prior art, the deep-learning-based VideoGIS data retrieval method provided in the present invention extracts key frames using the frame differences of VideoGIS frames, so that retrieval efficiency is greatly improved; a deep convolutional neural network model is trained to extract higher-level feature representations; meanwhile, the idea of binary hashing raises retrieval speed while preserving precision, so that retrieval time and storage overhead are greatly reduced.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A VideoGIS data retrieval method based on deep learning, characterized by comprising the following steps:
A. Key-frame extraction
Spatial and temporal sampling is performed on the VideoGIS data, the Euclidean distances of the VideoGIS frame differences are computed, and key frames are extracted from the video shots;
B. Depth-feature extraction
A deep convolutional neural network model composed of alternating convolutional, activation and pooling layers is established; the input VideoGIS frame image is mapped layer by layer, each layer yielding a different representation of the VideoGIS frame image, thereby realizing a depth-feature representation of the VideoGIS frame image;
C. Hierarchical retrieval
The retrieval process comprises coarse retrieval and fine retrieval: the first level converts the high-dimensional feature vector learned by the deep convolutional neural network model into a binary code, then measures the similarity between binary codes with the Hamming distance, obtaining a candidate pool of similar key frames; the second level measures the similarity between the VideoGIS frame image to be retrieved and the VideoGIS frame images in the candidate pool with the Euclidean distance, finally obtaining the top m similar retrieval results.
2. The VideoGIS data retrieval method based on deep learning according to claim 1, characterized in that step A, key-frame extraction, specifically comprises:
Input: video shot V = {V1, V2, … Vn}, number of key frames to select: K;
Output: the key frames of the video;
a1. Compute the frame differences of adjacent frames using the Euclidean distance, with a loop variable i running from 1 to n-2, where n denotes the total number of frames in the shot;
a2. When i = n-2, all VideoGIS frames of the shot have been traversed: output the Euclidean distances of the VideoGIS frame differences and end the loop; otherwise continue executing a1;
a3. Compute the extrema, maximum, minimum and median of the frame-difference Euclidean distances;
a4. If an extremum is greater than the median, retain it in the screened set; extremum points less than or equal to the median are deleted;
a5. If the chosen number of key frames K is greater than the number of screened extremum points, take the screened extrema as key frames; otherwise, take the first K frames among the screened extrema as key frames.
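One plausible reading of steps a1-a5 can be sketched as follows (a sketch under assumptions, since the screening rule is stated tersely; NumPy assumed, function names illustrative): compute consecutive frame differences, keep local maxima of the difference curve that exceed the median, and cap the selection at K frames.

```python
import numpy as np

def frame_diffs(frames):
    # a1/a2: Euclidean distance between each pair of adjacent frames
    return [float(np.linalg.norm(frames[i + 1].astype(float) - frames[i].astype(float)))
            for i in range(len(frames) - 1)]

def select_keyframes(frames, k):
    diffs = frame_diffs(frames)
    median = float(np.median(diffs))
    # a3/a4: local maxima of the difference curve that exceed the median
    peaks = [i for i in range(1, len(diffs) - 1)
             if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1] and diffs[i] > median]
    # a5: keep at most k key frames, strongest differences first
    peaks.sort(key=lambda i: diffs[i], reverse=True)
    return sorted(peaks[:k])
```

The median screening discards the many near-zero differences between visually similar frames, so only frames at genuine content changes survive as key-frame candidates.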
3. The VideoGIS data retrieval method based on deep learning according to claim 1, characterized in that step B, depth-feature extraction, specifically comprises:
b1. Unify the image size before training: the image size is unified to 224 × 224 by a centerCrop method, i.e., the image is first scaled as a whole so that its shorter side reaches 224, and the longer side is then cropped equally on both sides about the center, retaining a size of 224;
b2. Establish the deep convolutional neural network model: it comprises 5 convolutional sections and 3 fully connected layers; each convolutional section contains 2-3 convolutional layers and ends in a max-pooling layer that reduces the image size; each convolutional layer has 3 × 3 filters and is followed by the ReLU activation function, whose nonlinear transformation enhances the model's ability to learn features;
b3. Loss function and optimization method: after the above model is constructed, it needs to be trained; the categorical_crossentropy loss function is selected, and the parameters are optimized by stochastic gradient descent to minimize the loss function, with a learning rate of 0.1, a decay term of 1e-6 and a momentum of 0.9, using the Nesterov accelerated-gradient optimization algorithm;
b4. Extract features with the model: when extracting features, the images are scaled to a unified size as in b1 and fed into the above model for computation while the convolutional neural network is trained, finally obtaining high-dimensional feature vectors; in the initialization stage, feature extraction is first applied to the VideoGIS key-frame library to generate high-dimensional real-valued vectors, thereby constructing a feature database; when performing VideoGIS data retrieval, feature extraction is applied to the VideoGIS frame image to be retrieved, generating the feature to be retrieved.
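The optimizer named in b3 can be illustrated with a single-parameter Nesterov-momentum update (a sketch under the stated hyperparameters, not the training loop; `grad_fn` is an assumed stand-in for the network gradient, and the 1e-6 learning-rate decay is omitted for brevity):

```python
def nesterov_step(w, v, grad_fn, lr=0.1, momentum=0.9):
    # Nesterov accelerated gradient: evaluate the gradient at the look-ahead point
    lookahead = w + momentum * v
    v_new = momentum * v - lr * grad_fn(lookahead)
    return w + v_new, v_new
```

For the loss w**2 (gradient 2w), one step from w = 1.0 with zero initial velocity moves the parameter to w = 0.8.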
4. The VideoGIS data retrieval method based on deep learning according to claim 3, characterized in that the deep convolutional neural network model specifically comprises:
1st section: comprising 2 convolutional layers and one pooling layer; the input is 224 × 224 × 3 image data, processed by convolutional layers with 64 filters of window size 3×3, followed by ReLU activation, producing a 224 × 224 × 64 output feature map; a pooling layer then applies a 2×2 max-pooling kernel with stride 2, yielding 112 × 112 × 64 data;
2nd section: comprising 2 convolutional layers and one pooling layer; the 112 × 112 × 64 input data are processed by convolutional layers with 128 filters of window size 3×3, followed by ReLU activation, producing a 112 × 112 × 128 output feature map; a pooling layer then applies a 2×2 max-pooling kernel with stride 2, yielding 56 × 56 × 128 data;
3rd section: comprising 3 convolutional layers and one pooling layer; the 56 × 56 × 128 input data are processed by convolutional layers with 256 filters of window size 3×3, followed by ReLU activation, producing a 56 × 56 × 256 output feature map; a pooling layer then applies a 2×2 max-pooling kernel with stride 2, yielding 28 × 28 × 256 data;
4th section: comprising 3 convolutional layers and one pooling layer; the 28 × 28 × 256 input data are processed by convolutional layers with 512 filters of window size 3×3, followed by ReLU activation, producing a 28 × 28 × 512 output feature map; a pooling layer then applies a 2×2 max-pooling kernel with stride 2, yielding 14 × 14 × 512 data;
5th section: comprising 3 convolutional layers and one pooling layer; the 14 × 14 × 512 input data are processed by convolutional layers with 512 filters of window size 3×3, followed by ReLU activation, producing a 14 × 14 × 512 output feature map; a pooling layer then applies a 2×2 max-pooling kernel with stride 2, yielding 7 × 7 × 512 data;
6th section: the 7 × 7 × 512 input data are fully connected to obtain 4096 features, followed by ReLU activation; the 4096-dimensional output then passes through Dropout, finally giving 4096 data;
7th section: the 4096-dimensional input is fully connected to obtain 4096 features, followed by ReLU activation; the 4096-dimensional output then passes through Dropout, finally giving 4096 data;
8th section: the 4096-dimensional input is fully connected to obtain 1000 feature values.
5. The VideoGIS data retrieval method based on deep learning according to claim 4, characterized in that the first-level coarse retrieval specifically comprises:
A new fully connected layer is inserted between the 7th and 8th sections of the pre-trained deep convolutional neural network model; this layer uses the sigmoid activation function to convert the feature vector output by the 7th section of the model into a binary code; the initial parameters of the deep convolutional neural network are obtained by training on the ImageNet dataset, while the initial parameters of the new fully connected layer are built as hash values by means of random projection transforms;
For a VideoGIS frame to be retrieved, the feature output by the new fully connected layer is extracted first, and the binary code is obtained by thresholding the activations; finally, those VideoGIS frame images in the feature database whose binary codes lie within a given Hamming-distance threshold of the binary code of the frame to be retrieved are placed into the candidate pool.
6. The VideoGIS data retrieval method based on deep learning according to claim 5, characterized in that the second-level fine retrieval specifically comprises:
For the VideoGIS frame image to be retrieved and the candidate-pool images obtained in the coarse retrieval, the similarity between them is computed with the Euclidean distance over the features extracted from the 7th section of the convolutional neural network, so as to determine the top m retrieval results for the VideoGIS frame image from the candidate pool.
CN201810162847.8A 2018-02-26 2018-02-26 A kind of VideoGIS data retrieval method based on deep learning Pending CN108280233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810162847.8A CN108280233A (en) 2018-02-26 2018-02-26 A kind of VideoGIS data retrieval method based on deep learning


Publications (1)

Publication Number Publication Date
CN108280233A true CN108280233A (en) 2018-07-13

Family

ID=62808720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810162847.8A Pending CN108280233A (en) 2018-02-26 2018-02-26 A kind of VideoGIS data retrieval method based on deep learning

Country Status (1)

Country Link
CN (1) CN108280233A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156464A (en) * 2014-08-20 2014-11-19 中国科学院重庆绿色智能技术研究院 Micro-video retrieval method and device based on micro-video feature database
CN105718890A (en) * 2016-01-22 2016-06-29 北京大学 Method for detecting specific videos based on convolution neural network
CN106227851A (en) * 2016-07-29 2016-12-14 汤平 Based on the image search method searched for by depth of seam division that degree of depth convolutional neural networks is end-to-end


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAIHONG DAI et al.: "VideoGIS Data Retrieval Based on Multi-feature Fusion", 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920720A (en) * 2018-07-30 2018-11-30 电子科技大学 The large-scale image search method accelerated based on depth Hash and GPU
CN108985457A (en) * 2018-08-22 2018-12-11 北京大学 A kind of deep neural network construction design method inspired by optimization algorithm
CN108985457B (en) * 2018-08-22 2021-11-19 北京大学 Deep neural network structure design method inspired by optimization algorithm
CN109492129B (en) * 2018-10-26 2020-08-07 武汉理工大学 Similar video searching method and system based on double-flow neural network
CN109492129A (en) * 2018-10-26 2019-03-19 武汉理工大学 A kind of similar video searching method and system based on double-current neural network
CN110163061B (en) * 2018-11-14 2023-04-07 腾讯科技(深圳)有限公司 Method, apparatus, device and computer readable medium for extracting video fingerprint
CN110163061A (en) * 2018-11-14 2019-08-23 腾讯科技(深圳)有限公司 For extracting the method, apparatus, equipment and computer-readable medium of video finger print
CN109753582A (en) * 2018-12-27 2019-05-14 西北工业大学 The method of magnanimity photoelectricity ship images quick-searching based on Web and database
CN109783691A (en) * 2018-12-29 2019-05-21 四川远鉴科技有限公司 A kind of video retrieval method of deep learning and Hash coding
CN111382287A (en) * 2018-12-30 2020-07-07 浙江宇视科技有限公司 Picture searching method and device, storage medium and electronic equipment
WO2020147857A1 (en) * 2019-01-18 2020-07-23 上海极链网络科技有限公司 Method and system for extracting, storing and retrieving mass video features
CN111767204A (en) * 2019-04-02 2020-10-13 杭州海康威视数字技术股份有限公司 Overflow risk detection method, device and equipment
CN111767204B (en) * 2019-04-02 2024-05-28 杭州海康威视数字技术股份有限公司 Spill risk detection method, device and equipment
CN110110113A (en) * 2019-05-20 2019-08-09 重庆紫光华山智安科技有限公司 Image search method, system and electronic device
CN110221979A (en) * 2019-06-04 2019-09-10 广州虎牙信息科技有限公司 Performance test methods, device, equipment and the storage medium of application program
CN110717068A (en) * 2019-08-27 2020-01-21 中山大学 Video retrieval method based on deep learning
CN110717068B (en) * 2019-08-27 2023-04-18 中山大学 Video retrieval method based on deep learning
CN111078993A (en) * 2019-09-24 2020-04-28 上海依图网络科技有限公司 Method and system for improving retrieval recall rate through extended query
CN112528077A (en) * 2020-11-10 2021-03-19 山东大学 Video face retrieval method and system based on video embedding
CN112528077B (en) * 2020-11-10 2022-12-16 山东大学 Video face retrieval method and system based on video embedding
CN113297899A (en) * 2021-03-23 2021-08-24 上海理工大学 Video hash algorithm based on deep learning
CN113032372B (en) * 2021-05-24 2021-09-28 南京北斗创新应用科技研究院有限公司 ClickHouse database-based space big data management method
CN113032372A (en) * 2021-05-24 2021-06-25 南京北斗创新应用科技研究院有限公司 ClickHouse database-based space big data management method
CN117011766A (en) * 2023-07-26 2023-11-07 中国信息通信研究院 Artificial intelligence detection method and system based on intra-frame differentiation
CN117011766B (en) * 2023-07-26 2024-02-13 中国信息通信研究院 Artificial intelligence detection method and system based on intra-frame differentiation

Similar Documents

Publication Publication Date Title
CN108280233A (en) A kind of VideoGIS data retrieval method based on deep learning
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
CN113254648B (en) Text emotion analysis method based on multilevel graph pooling
CN110378334B (en) Natural scene text recognition method based on two-dimensional feature attention mechanism
CN107330364B (en) A kind of people counting method and system based on cGAN network
CN106407352B (en) Traffic image search method based on deep learning
CN108171701B (en) Significance detection method based on U network and counterstudy
CN110134946B (en) Machine reading understanding method for complex data
CN108615036A (en) A kind of natural scene text recognition method based on convolution attention network
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN108122035B (en) End-to-end modeling method and system
CN107527318A (en) A kind of hair style replacing options based on generation confrontation type network model
CN107066973A (en) A kind of video content description method of utilization spatio-temporal attention model
CN109543722A (en) A kind of emotion trend forecasting method based on sentiment analysis model
CN106960206A (en) Character identifying method and character recognition system
CN110263659A (en) A kind of finger vein identification method and system based on triple loss and lightweight network
CN104951554B (en) It is that landscape shines the method for mixing the verse for meeting its artistic conception
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109840322A (en) It is a kind of based on intensified learning cloze test type reading understand analysis model and method
CN104239420A (en) Video fingerprinting-based video similarity matching method
CN107169106A (en) Video retrieval method, device, storage medium and processor
Fu et al. Machine learning techniques for ontology-based leaf classification
CN109886072A (en) Face character categorizing system based on two-way Ladder structure
CN107767416A (en) The recognition methods of pedestrian's direction in a kind of low-resolution image
CN107480723A (en) Texture Recognition based on partial binary threshold learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180713