CN114972506A - Image positioning method based on deep learning and street view image - Google Patents

Image positioning method based on deep learning and street view image

Info

Publication number
CN114972506A
CN114972506A (application CN202210478747.2A)
Authority
CN
China
Prior art keywords
image
street view
features
feature
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210478747.2A
Other languages
Chinese (zh)
Other versions
CN114972506B (en)
Inventor
陈玉敏
褚天佑
徐真珍
陈国栋
陈娒杰
陈玥君
苏恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210478747.2A priority Critical patent/CN114972506B/en
Priority claimed from CN202210478747.2A external-priority patent/CN114972506B/en
Publication of CN114972506A publication Critical patent/CN114972506A/en
Application granted granted Critical
Publication of CN114972506B publication Critical patent/CN114972506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V10/77 Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Arrangements using neural networks
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Abstract

The invention provides an image positioning method based on deep learning and street view images. To extract image features that carry geographic position information, a deep-learning feature extraction network is constructed, and a landmark data set is used to increase the network's weighting of features containing position information. A feature aggregation method then extracts aggregated features of the street view images to speed up feature matching, and a feature similarity calculation method reduces the influence of repeated texture features. Finally, the geographic coordinates of the image to be positioned are determined from the local peaks of a kernel density estimate, which raises the ranking of the correct position in the results, further improves overall positioning accuracy, and supports analysis of the spatial distribution and development trend of the events reflected in the image.

Description

Image positioning method based on deep learning and street view image
Technical Field
The invention belongs to the field of visual position localization of images, and particularly relates to an image positioning method based on deep learning and street view images.
Background
With the rise of social media, presenting news or events in the form of images has become commonplace, and the internet is now a major channel for news distribution and dissemination. Spatial location is important information conveyed by the news events shown in images, so locating the geographic position of an image can support analysis of the spatial distribution and development trend of an event and the implementation of intervention measures.
However, it remains difficult to obtain location information directly or automatically from news or social media pictures. On the one hand, out of security or privacy concerns, users often hide their own geographic location or show only a vague semantic location when posting public information, and may delete EXIF information when sharing images to avoid exposing the shooting location. On the other hand, picture position information may be lost during uploading, compression, or copying. As a result, the large number of images in the network that lack position information cannot be effectively analyzed and utilized.
Images related to an event are usually taken at the place where the event occurs, and the image content implies clues to the geographic location, which provides a precondition for image position localization. Buildings, street layouts, and similar elements can generally express the geographic location of an image, while street view images, with their latitude and longitude information, wide coverage, dense distribution in cities, and depiction of the urban environment from various shooting angles, can provide a visual reference and coordinate positioning for image localization. Image features carrying geographic position cues can be extracted through image retrieval; the image is then matched against a street view data set with a similarity matching algorithm; finally, the position of the image to be positioned is determined from the returned street view results and their coordinate information.
However, because images in the network vary in shooting viewpoint, shooting time, and depicted content, the effective position information in an image is often not prominent and is hard to recognize and extract automatically, which makes image position localization more difficult. With the development of deep learning, extracting representative image features has become a research hotspot. Deep-learning image features can be divided into deep global features and deep local features: global features, extracted through convolution and pooling layers, express the overall information of an image, while local features are usually extracted from a fully convolutional network, after which representative features are selected with a feature selection method; local features express information about local regions of an image. In an image to be positioned, a building or street scene often appears only as a background occupying a small and unremarkable part of the whole image, and the street view images used as the reference data set also contain interference from pedestrians, vehicles, billboards, and the like. Compared with deep global features, deep local features therefore express local-region information better, and how to extract and select image features containing position information is a current technical difficulty.
Although street view images can serve as a reference data set for image matching and position localization, many technical problems remain. Street views are usually stored in a panoramic format covering the full 360° surroundings, and each panorama typically carries only latitude and longitude information. When building a data set, the panorama is usually converted into several perspective views consistent with ordinary camera imaging using projection and back-projection. Because the perspective views of a street view differ across directions, and perspective views of adjacent street views contain similar scenes, it is difficult to obtain training, validation, and test sets by cleaning and classifying the data using the street view coordinates alone, and as a result a deep neural network cannot learn enough features containing geographic position information.
Moreover, the amount of street view data is large, and fast matching with deep local features is difficult in large-scale city street view image retrieval; the time complexity must be reduced with an aggregation method or a data organization method. The building-facade structures in street views also contain a great deal of repeated texture, and image features carrying this information cause a visual burstiness ("visual explosion") phenomenon during image matching: the repeated texture features contribute disproportionately to the image similarity, which easily produces wrong retrieval results. How to reduce the influence of repeated texture features while aggregating, organizing, and matching features is another technical difficulty.
In summary, in the image positioning process it is not yet possible to effectively extract the geographic location information in an image and to effectively retrieve and position the image with street view images in a large-scale urban environment. It is therefore desirable to provide an image positioning method based on deep learning and street view images, which supplies spatial location information for images and in turn supports analysis of the spatial distribution and development trend of events.
Disclosure of Invention
The invention aims to provide an image positioning method based on deep learning and street view images, so as to solve the problem of retrieving an image to be positioned with street view images in a large-scale urban environment and positioning it using the retrieved street view results and their longitude and latitude coordinates.
The technical solution adopted by the invention comprises the following steps:
Step 1: acquire and process the street view data and the image data to be positioned. An image to be positioned and street view panoramas of the corresponding city are obtained, and the street view data are preprocessed, including stitching, cropping, and projecting the street view images to obtain undistorted planar perspective street views; the coordinates and other information of the street views are also obtained and recorded as the reference data set.
Step 2: generate a training data set. A landmark data set is collected, the landmark images are downloaded and managed according to their metadata tags, a certain number of images from a certain number of categories are selected at random, and abnormal images in each category are then removed by data cleaning and filtering to generate the training data set.
Step 3: construct the feature extraction network. An end-to-end deep convolutional neural network is established to extract features of the street views and of the image to be positioned. The front part of the network is a fully convolutional neural network responsible for extracting dense image features. After the fully convolutional network, a feature screening module consisting of a smoothing layer, an attention layer, and a whitening layer is added to screen the dense features output by the front part; representative features are selected according to the scores the attention layer assigns to the features.
Step 4: train the feature extraction network and extract the image features of the street views. The network in step 3 is trained with the training data set. Before the training data are input into the network, a series of tuples is randomly generated according to the image labels; each tuple consists of a reference image, a positive sample, and several negative samples. During training, the local image features are aggregated into a global pooled feature that serves as the network output, and a contrastive loss function is used to compute the network loss and iteratively optimize the network until it converges, yielding the feature extraction network model. Local image features of the street view images are extracted with this model; during extraction, multi-scale local features of each image are obtained by image scaling, and the extracted feature file contains the local feature values, local feature weights, image scales, and feature description positions.
Step 5: generate a feature codebook and compute the aggregated features. Local features of some of the street view images are selected at random, the number of cluster centers to generate is set, and feature clustering is performed to produce a feature codebook. Aggregated feature vectors of the image to be positioned and of all street views in the reference data set are computed from the feature codebook; each image corresponds to one aggregated feature.
Step 6: build an inverted index and perform street view matching. According to the one-to-one correspondence between aggregated features and street view images, an inverted index table is built for querying street view images by feature. The similarity between the feature vector of the image to be positioned and the feature vectors of the street view images is computed, the street view features with high similarity are returned and ranked, and the corresponding street view images are obtained by lookup in the inverted index table.
Step 7: return position coordinates from the retrieval results. Taking into account both the latitude and longitude information and the similarity ranking of the retrieval results, the peaks of the similarity distribution in space are estimated with a kernel density estimation method and used as candidate positioning results, and the coordinate position of the image to be positioned is returned according to the peak magnitudes.
In the above image positioning method based on deep learning and street view images, in step 1, the image to be positioned is usually taken in an outdoor scene, and its acquisition channels include, but are not limited to, news websites, social media, and camera shooting. Street view acquisition channels include, but are not limited to, web street view map services and street view vehicle collection. Street view images differ in the format and distortion of their original data; the preprocessing steps include, but are not limited to, image stitching, image cropping, image matching, and image projection, and each street view panorama is usually converted into several planar perspective views with different orientations.
In the above image positioning method based on deep learning and street view images, in step 2, a commonly used landmark data set is the Google Landmarks Dataset v2; the San Francisco Landmark, Tokyo 24/7, or Pitts250k data sets may also be used. Data cleaning identifies landmark images that do not belong to their category by image matching or image retrieval, for example removing images with few matched feature points using SIFT, SURF, or deep-learning-based image features.
In the above image positioning method based on deep learning and street view images, in step 3, the fully convolutional neural network may be formed from a ResNet network with its last pooling layer and fully connected layer removed, followed by a feature screening module that scores and selects the dense features; the feature screening module consists of a smoothing layer, an attention layer, and a whitening layer. The smoothing layer aggregates the larger activation values of adjacent channels in the dense features and consists of an average pooling layer of size M × M. The attention layer scores the dense features and the n local features with the highest scores are kept; the scores are processed with an l2 normalization function. The whitening layer performs dimensionality reduction and decorrelation on the features; it consists of a 1 × 1 convolution layer with bias, whose parameters are obtained before network training by training on local image features extracted with a pre-trained network.
In the above image positioning method based on deep learning and street view images, in step 4, during network training the global pooled feature of the network is extracted by a pooling method; the feature has dimension 1 × 1 × D and is computed as:
F = \sum_{h=1}^{H} \sum_{w=1}^{W} w(v_{h,w}) \, f(v'_{h,w})
where v denotes the convolution feature map output by the network, w(v) is the weight output by the attention layer function, f(v') is the local feature obtained after the convolution feature v output by the network passes through the smoothing layer and the whitening layer, H is the length of the feature map, and W is its width.
The loss of the network is computed from the global pooled features; the contrastive loss function used is:
L = \frac{1}{2N} \sum_{i=1}^{N} \left[ y_i \, d_i^{2} + (1 - y_i) \, \max(\mathrm{margin} - d_i, 0)^{2} \right]
where d is the Euclidean distance between the features of the samples in a tuple, y indicates whether the samples in the tuple belong to the same class (1 if so, 0 otherwise), N is the number of samples, and margin is a set threshold. Before each iterative optimization of the network, a series of tuples is randomly generated from the image labels of the training data; each tuple consists of a reference image, one positive sample, and several negative samples. The positive sample is randomly selected from images with the same class label. The negative samples are re-selected before each iteration: a number of images are first selected and their pooled aggregated features extracted to form a negative sample pool, the pool is matched against the reference image and ranked, and when each tuple is generated the first n images whose class differs from that of the reference image are taken from the pool as the negative samples.
In the feature extraction stage, multi-scale features of the image are extracted by image scaling, the network output is taken directly, and the top n local features are kept in descending order of weight value. The local feature weight is the weight value output by the attention layer, the image scale is the scaling factor applied to the image when it is input to the network, and the feature description position is the coordinate of the center of the feature's receptive field, computed from the receptive field size of the fully convolutional neural network.
In the above image localization method based on deep learning and street view images, in step 5, one aggregated feature is generated for each image. The aggregation method aggregates the n × d-dimensional local features of an image into a k × d-dimensional aggregated feature, where k is the number of cluster centers. The specific implementation is as follows:
Step 5.1: randomly select a subset of the extracted image features, set the clustering parameters, generate k cluster centers with the K-means clustering method, and construct the clustering codebook, denoted C = {c_1, ..., c_k}.
Step 5.2: in the aggregation process, the n local features of an image are assigned to the k cluster centers: each local feature is assigned to the cluster center closest to it, and the residual between the local feature and the cluster center is computed, which can be expressed as:
r(x) = v_i - q(x)
where r(x) denotes the residual between the local feature and the cluster center, v_i denotes the i-th local feature, and q(x) denotes the cluster center assigned to that local feature.
Step 5.3: if several local features are assigned to one cluster center, their residuals are computed and summed to obtain a 1 × d-dimensional feature; the residual sum of a cluster center is computed as:
V(X_c) = \sum_{x \in X_c} r(x)
where V(X_c) is the aggregated feature for that cluster center and X_c denotes the representation of the local features of image X after quantization with the feature codebook.
Step 5.4: combine the features of the k cluster centers to form the k × d-dimensional aggregated feature.
In the above image positioning method based on deep learning and street view images, in step 6, the inverted index table for querying street view images by feature is created by generating a key-value dictionary that maps each feature to its street view image. The similarity between the feature vector of the image to be positioned and the feature vector of a street view image is computed as:
\mathrm{Similarity}(X_c, Y_c) = \sum_{c \in C} \sigma_{\alpha}\!\left( \hat{V}(X_c)^{\top} \hat{V}(Y_c) \right)
\sigma_{\alpha}(u) = \operatorname{sign}(u)\,|u|^{\alpha} \ \text{if} \ |u| > \tau, \ \text{and } 0 \ \text{otherwise}
where Similarity(X_c, Y_c) is the similarity value of image X and image Y, \hat{V}(X_c) = V(X_c)/\lVert V(X_c) \rVert is the normalized aggregated feature of image X at a cluster center, V(X_c) is the aggregated feature of image X, σ_α(u) is the similarity calculation function, u is the dot product of the features of the two images at a given cluster center, sign is the sign function (1 when u is greater than zero, -1 otherwise), |u| is the modulus of u, and α and τ are constants. For an image to be positioned, the feature similarity with all images in the reference data set is computed and ranked, and the corresponding street view images are then queried from the inverted index table by feature to obtain the street view retrieval result.
In the above image localization method based on deep learning and street view images, in step 7, the kernel density estimation method considers the first N retrieval results from step 6 and can be expressed as:
f(x, y) = \frac{1}{r^{2}} \sum_{i=1}^{n} \frac{3}{\pi}\, S(x_i, y_i) \left( 1 - \frac{d_i^{2}}{r^{2}} \right)^{2}
d_i = \sqrt{(x - x_i)^{2} + (y - y_i)^{2}}
where S(x_i, y_i) is the similarity value between the query image and the i-th street view in the step 6 results, located at coordinate (x_i, y_i); d_i is the distance from (x, y) to (x_i, y_i); r is the query radius; and n is the number of samples within radius r of the center (x, y). The local peaks of the kernel density analysis are extracted and ranked as the position localization result.
The image positioning method based on deep learning and street view images can extract local geographic position feature information from an image, can rapidly retrieve images at large city scale through feature aggregation and similarity matching, and can effectively locate the shooting position of an image through kernel-density-based position estimation over the retrieval results.
Drawings
FIG. 1 is a flow chart of the technical solution of the present invention.
Fig. 2 is a schematic diagram of a feature extraction network and training and extraction stages constructed by the present invention.
Detailed description of the preferred embodiment
To facilitate understanding and practice of the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are for illustration and explanation only and are not intended to limit the invention.
The core problem addressed by the invention is as follows: the spatial position of an image can support analysis of the distribution and trend of events, but the content of an image that effectively conveys geographic location information is difficult to extract, so the image cannot be retrieved and located. The method establishes a deep convolutional neural network combined with a feature selection module to extract representative local image features, retrieves candidate results from street views in a large-scale urban environment using aggregated features, and extracts position information from the latitude and longitude of the results, thereby realizing position localization of the image.
Referring to fig. 1, the image positioning method based on deep learning and street view images provided by the invention comprises the following steps:
Step 1: acquire and process the street view data and the image data to be positioned. The specific steps are as follows:
Step 1.1: the image to be positioned can be obtained from news websites, social media, camera shooting, or similar channels. Street view images can be acquired through web street view map services, street view vehicle collection, and similar methods. In addition to collecting and storing the street view images, the metadata corresponding to the street views, such as latitude and longitude, must also be collected.
Step 1.2: preprocess the street views. For equidistant (equirectangular) panoramas, the street view tiles are stitched to obtain a complete panorama, which is then cropped to remove invalid regions at the top, bottom, or sides so that the image keeps a 2:1 aspect ratio.
Step 1.3: generate street view perspective views. According to the set projection parameters, each panorama is converted into several undistorted planar perspective street views. The projection has two steps: the panorama is first projected onto a sphere, and suitable projection parameters are then set to project it onto a plane. The projection parameters are FOV: 60°, Pitch: [5°, 20°, 35°], Yaw: [0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°], where FOV is the field of view, Pitch is the pitch angle, and Yaw is the heading angle. Combining these three parameters, each panorama generates 24 street view images of size 480 × 640.
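For illustration, the following is a minimal sketch of the equirectangular-to-perspective projection of step 1.3, written with NumPy and OpenCV; the function name, rotation conventions, and sampling details are assumptions made for this sketch rather than the patent's exact implementation.

```python
import cv2
import numpy as np

def panorama_to_perspective(pano, fov_deg, pitch_deg, yaw_deg, out_h=480, out_w=640):
    """Render one undistorted perspective view from an equirectangular panorama."""
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))        # focal length in pixels
    # Pixel grid of the virtual pinhole camera, centered on the optical axis.
    xs, ys = np.meshgrid(np.arange(out_w) - 0.5 * out_w,
                         np.arange(out_h) - 0.5 * out_h)
    rays = np.stack([xs, ys, np.full_like(xs, f, dtype=np.float64)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays by pitch (about the x-axis), then yaw (about the y-axis).
    p, q = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(q), 0, np.sin(q)], [0, 1, 0], [-np.sin(q), 0, np.cos(q)]])
    rays = rays @ (Ry @ Rx).T
    # Convert the rays to longitude/latitude and sample the panorama there.
    lon = np.arctan2(rays[..., 0], rays[..., 2])                # in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))           # in [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    map_x = ((lon / (2 * np.pi) + 0.5) * pw).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * ph).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)

# 24 views per panorama: 3 pitch angles x 8 yaw angles at a 60 degree FOV.
# pano = cv2.imread("panorama.jpg")
# views = [panorama_to_perspective(pano, 60, p, yw)
#          for p in (5, 20, 35) for yw in range(0, 360, 45)]
```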
Step 2: generate the training data set. The specific steps are as follows:
Step 2.1: collect the Google Landmarks Dataset v2, download and store the data according to the metadata tags, and randomly select images of 1500 categories from the data.
Step 2.2: clean the landmark data set and generate the training set. SIFT image features are extracted for the images in the 1500 categories, each image is matched against the other images of its category, and an image is removed if the total number of matched feature points falls below a set threshold; otherwise it is kept. The cleaned landmark data set is used to generate the training set.
Step 3: construct the feature extraction network. Referring to fig. 2, an end-to-end deep convolutional neural network is established to extract image features. In the front part of the network, a pre-trained ResNet101 with its last pooling layer and fully connected layer removed forms the fully convolutional network responsible for extracting dense image features. A feature screening module is added after the fully convolutional network to screen the dense features output by the front part and extract representative features. The feature screening module consists of a smoothing layer, an attention layer, and a whitening layer. The smoothing layer is a 3 × 3 average pooling layer. The attention layer scores the dense features, and the 1000 local features with the highest scores are kept; the scores are processed with an l2 normalization function. The whitening layer performs dimensionality reduction and decorrelation on the features; it is a 1 × 1 convolution layer with bias, whose parameters are obtained before network training by training on features of 5000 randomly selected images extracted with a pre-trained network.
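A minimal PyTorch sketch of such a network, assuming a 1 × 1 convolution with softplus activation for the attention scores and l2 normalization of the local descriptors; the class name and these activation choices are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class LocalFeatureNet(nn.Module):
    """Backbone plus feature-screening module (smoothing, attention, whitening)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        self.fcn = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc
        self.smooth = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.attention = nn.Conv2d(2048, 1, kernel_size=1)           # score per location (assumed form)
        self.whiten = nn.Conv2d(2048, feat_dim, kernel_size=1, bias=True)

    def forward(self, x):
        v = self.fcn(x)                           # B x 2048 x H x W dense features
        w = F.softplus(self.attention(v))         # attention weights w(v); softplus is an assumption
        f = self.whiten(self.smooth(v))           # f(v'): smoothed then whitened local features
        f = F.normalize(f, dim=1)                 # l2-normalize the local descriptors
        pooled = (w * f).sum(dim=(2, 3))          # weighted global pooling used as training output
        return pooled, f, w
```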
Step 4: train the feature extraction network and extract the image features of the street views. Referring to fig. 2, the specific steps are as follows:
Step 4.1: generate the image samples. The network in step 3 is trained with the training data set. Before each iterative optimization of the network, a series of tuples is randomly generated from the image labels of the training data; each tuple consists of a reference image, one positive sample, and several negative samples, with 2500 tuples in total and a batch size of 5. The positive sample is randomly selected from images with the same class label. For the negative samples, 20,000 images are randomly selected before each iteration and their pooled aggregated features are extracted to form a negative sample pool; the pool is matched against the reference image and ranked, and when each tuple is generated the first 5 images whose class differs from that of the reference image are taken from the pool as the negative samples. Before each iteration, the images of the negative sample pool are randomly re-selected and the tuples are rebuilt.
Step 4.2: and training the feature extraction network. And calculating the global pooling characteristics of the images during training, wherein the method comprises the following steps:
F = \sum_{h=1}^{H} \sum_{w=1}^{W} w(v_{h,w}) \, f(v'_{h,w})
where v denotes the convolution feature map output by the network, w(v) is the weight output by the attention layer function, f(v') is the local feature obtained after the convolution feature v output by the network passes through the smoothing layer and the whitening layer, H is the length of the feature map, and W is its width. The network is iteratively optimized with a contrastive loss function of the following form:
L = \frac{1}{2N} \sum_{i=1}^{N} \left[ y_i \, d_i^{2} + (1 - y_i) \, \max(\mathrm{margin} - d_i, 0)^{2} \right]
where d is the Euclidean distance between the features of the samples in a tuple and y indicates whether the samples in the tuple belong to the same class (1 if so, 0 otherwise). The margin is 0.8, the optimizer is Adam, the learning rate is 1 × 10^-5, the weight decay is 1 × 10^-4, and the learning rate decays exponentially with a factor of 0.99; training runs for 100 iterations until the network converges, yielding the feature extraction network model.
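A sketch of the loss and optimizer setup with the hyperparameters listed above, reusing the hypothetical `LocalFeatureNet` from the step 3 sketch; the 1/2 factor and the reduction over the tuple are assumptions of this sketch.

```python
import torch

def contrastive_loss(anchor, others, same_class, margin=0.8):
    """Contrastive loss over one tuple of pooled features; same_class is a 0/1 tensor
    marking which rows of `others` share the anchor's class."""
    d = torch.norm(anchor.unsqueeze(0) - others, dim=1)               # Euclidean distances
    loss = same_class * d.pow(2) + (1 - same_class) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * loss.mean()

model = LocalFeatureNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```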
Step 4.3: extract image features. The image is scaled at several factors to generate multiple inputs to the network, and the network output is then taken directly; the scale factors are [2.0, 1.414, 1.0, 0.707, 0.5, 0.353, 0.25]. The top 1000 local features are kept in descending order of weight value, each feature has dimension 128, and the scale, position, and weight value corresponding to each feature are recorded.
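A sketch of the multi-scale extraction, again assuming the `LocalFeatureNet` interface from the step 3 sketch; for brevity it records feature-map coordinates directly, whereas the patent maps each feature to the center of its receptive field in the original image.

```python
import torch
import torch.nn.functional as F

SCALES = [2.0, 1.414, 1.0, 0.707, 0.5, 0.353, 0.25]

@torch.no_grad()
def extract_local_features(model, image, top_n=1000):
    """Run the network at several scales and keep the top_n highest-weighted local
    features with their scale, feature-map position and attention weight."""
    records = []
    for s in SCALES:
        scaled = F.interpolate(image.unsqueeze(0), scale_factor=s,
                               mode="bilinear", align_corners=False)
        _, f, w = model(scaled)                        # f: 1 x D x H x W, w: 1 x 1 x H x W
        h, wid = f.shape[2], f.shape[3]
        feats = f[0].reshape(f.shape[1], -1).t()       # (H*W) x D local descriptors
        weights = w[0].reshape(-1)
        for i in range(h * wid):
            pos = (i % wid, i // wid)                  # (x, y) on the feature map
            records.append((weights[i].item(), feats[i], s, pos))
    records.sort(key=lambda r: r[0], reverse=True)     # descending attention weight
    return records[:top_n]
```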
Step 5: generate the feature codebook and compute the aggregated features. The aggregation method aggregates the 1000 × 128-dimensional local features of an image into a k × 128-dimensional aggregated feature. The specific steps are as follows:
Step 5.1: randomly select 1/10 of the street view image features, set the number of cluster centers to be generated to 262,144, and then perform feature clustering to build the feature codebook. The clustering method is K-means, and the generated k cluster centers serve as the clustering codebook, denoted C = {c_1, ..., c_k}.
Step 5.2: in the aggregation process, the 1000 local features of an image are assigned to the k cluster centers: each local feature is assigned to the cluster center closest to it, and the residual between the local feature and the cluster center is computed:
r(x) = v_i - q(x)
where r(x) denotes the residual between the local feature and the cluster center, v_i denotes the i-th local feature, and q(x) denotes the cluster center assigned to that local feature.
Step 5.3: if a plurality of local features exist in a cluster center, calculating and summing residual errors of the cluster center to obtain a feature with dimensions of 1 × 128, wherein the sum of the residual errors of the cluster center is calculated as:
V(X_c) = \sum_{x \in X_c} r(x)
where V(X_c) is the aggregated feature for that cluster center and X_c denotes the representation of the local features of image X after quantization with the feature codebook.
Step 5.4: combine the features of the k cluster centers to form the k × 128-dimensional aggregated feature.
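A sketch of steps 5.1 to 5.4 using scikit-learn's MiniBatchKMeans; at a codebook size of 262,144 a dedicated library such as faiss would normally be used, and the per-center l2 normalization at the end is an assumption made so that the dot products used in step 6 stay in a bounded range.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(sampled_descriptors, k=262_144):
    """Cluster a random subset of local descriptors into k centers."""
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10_000)
    kmeans.fit(sampled_descriptors)                    # array of shape (num_samples, 128)
    return kmeans

def aggregate(descriptors, kmeans):
    """Sum the residuals of one image's local descriptors per cluster center (k x d)."""
    centers = kmeans.cluster_centers_
    k, d = centers.shape
    assignments = kmeans.predict(descriptors)          # nearest center for each descriptor
    agg = np.zeros((k, d), dtype=np.float32)
    for c, x in zip(assignments, descriptors):
        agg[c] += x - centers[c]                       # residual r(x) = v_i - q(x)
    # l2-normalize each non-empty cluster's residual sum (an assumed normalization step)
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    return np.where(norms > 0, agg / np.maximum(norms, 1e-12), agg)
```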
Step 6: build the inverted index and perform street view matching. The specific steps are as follows:
Step 6.1: build the inverted index table for querying street view images by feature, based on the one-to-one correspondence between aggregated features and street view images. The inverted index table is implemented as a key-value dictionary that maps each feature to its street view image.
Step 6.2: compute the similarity between the feature vector of the image to be positioned and the feature vectors of the street view images, return and rank the street view features with high similarity, and obtain the corresponding street view retrieval results from the inverted index table.
The similarity between the feature vector of the image to be positioned and the feature vector of a street view image is computed as:
\mathrm{Similarity}(X_c, Y_c) = \sum_{c \in C} \sigma_{\alpha}\!\left( \hat{V}(X_c)^{\top} \hat{V}(Y_c) \right)
\sigma_{\alpha}(u) = \operatorname{sign}(u)\,|u|^{\alpha} \ \text{if} \ |u| > \tau, \ \text{and } 0 \ \text{otherwise}
where Similarity(X_c, Y_c) is the similarity value of image X and image Y, \hat{V}(X_c) = V(X_c)/\lVert V(X_c) \rVert is the normalized aggregated feature of image X at a cluster center, V(X_c) is the aggregated feature of image X, σ_α(u) is the similarity calculation function, u is the dot product of the features of the two images at a given cluster center, sign is the sign function (1 when u is greater than zero, -1 otherwise), |u| is the modulus of u, α = 3, and τ = 0. For a query image, the feature similarity with all images in the reference data set is computed and sorted in descending order, the street view images are then obtained by lookup in the inverted index table, and the first 100 retrieval results are kept for the position localization in step 7.
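A sketch of the similarity computation and retrieval with α = 3 and τ = 0; for clarity it scans all index entries linearly, whereas a true inverted index would only visit street views that share non-empty cluster centers with the query, and the dictionary layout is an assumption.

```python
import numpy as np

def selectivity(u, alpha=3.0, tau=0.0):
    """sigma_alpha(u): signed power of the per-center dot product, zeroed below tau."""
    s = np.sign(u) * np.abs(u) ** alpha
    return np.where(np.abs(u) > tau, s, 0.0)

def similarity(agg_query, agg_ref, alpha=3.0, tau=0.0):
    """Sum the selectivity function over the per-cluster-center dot products (k x d inputs)."""
    dots = np.sum(agg_query * agg_ref, axis=1)          # dot product at each cluster center
    return float(np.sum(selectivity(dots, alpha, tau)))

def retrieve(agg_query, index, top_k=100):
    """index: dict mapping a feature key to a (street_view_id, aggregated_feature) pair."""
    scored = [(similarity(agg_query, agg), img_id) for img_id, agg in index.values()]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]
```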
Step 7: return position coordinates from the retrieval results. The position of the image is located by a kernel density estimation method over the first 100 retrieval results from step 6, with the query radius set to 150 meters; the kernel density estimate is expressed as:
f(x, y) = \frac{1}{r^{2}} \sum_{i=1}^{n} \frac{3}{\pi}\, S(x_i, y_i) \left( 1 - \frac{d_i^{2}}{r^{2}} \right)^{2}
d_i = \sqrt{(x - x_i)^{2} + (y - y_i)^{2}}
where S(x_i, y_i) is the similarity value between the query image and the i-th street view in the step 6 results, located at coordinate (x_i, y_i); d_i is the distance from (x, y) to (x_i, y_i); r is the query radius; and n is the number of samples within radius r of the center (x, y). A local-maximum matrix of the kernel density result is obtained by focal statistics, the two matrices are subtracted, and the cells where the difference is zero and the kernel density value is nonzero are extracted as local peaks and ranked by peak value to give the position localization result.
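A sketch of the kernel density estimation and peak extraction; the quartic kernel is an assumption (it is the common default in GIS kernel-density tools), and `maximum_filter` stands in for the focal statistics described above.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def kernel_density(points, sims, grid_x, grid_y, radius=150.0):
    """Similarity-weighted kernel density on a grid; points are the (x, y) coordinates
    of the retrieved street views, sims their similarity values."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    density = np.zeros_like(gx, dtype=np.float64)
    for (px, py), s in zip(points, sims):
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        inside = d2 < radius ** 2
        density += np.where(inside, s * (3.0 / np.pi) * (1.0 - d2 / radius ** 2) ** 2, 0.0)
    return density / radius ** 2

def local_peaks(density, window=3):
    """Focal-statistics style peak picking: a cell is a peak if it equals the maximum of
    its neighborhood and its density is nonzero; peaks are returned sorted by value."""
    mask = (density == maximum_filter(density, size=window)) & (density > 0)
    ys, xs = np.nonzero(mask)
    order = np.argsort(-density[ys, xs])
    return [(xs[i], ys[i], density[ys[i], xs[i]]) for i in order]
```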
It should be understood that the above description is for the purpose of illustrating preferred embodiments of the present invention and is not to be construed as limiting the scope of the invention, which is defined in the appended claims, and all changes and modifications that can be made therein by those skilled in the art are intended to be embraced therein.

Claims (8)

1. An image positioning method based on deep learning and street view images is characterized by comprising the following steps:
step 1, obtaining and processing street view and image data to be positioned;
step 2, generating a training data set;
step 3, constructing a feature extraction network: establishing an end-to-end deep convolutional neural network to extract characteristics of streetscapes and images to be positioned, wherein the front part of the network consists of a full convolutional neural network and is responsible for extracting dense characteristics of the images, and a characteristic screening network module is added behind the full convolutional network and consists of a smooth layer, an attention layer and a whitening layer and is used for screening the dense characteristics output by the front part;
step 4, training a feature extraction network and extracting local image features of streetscapes: training the feature extraction network in the step 3 by using a training data set, randomly generating a series of binary group pairs according to image labels before training data are input into the feature extraction network, wherein each binary group comprises a reference image, a positive sample and a plurality of negative samples, and during training, performing iterative optimization on the network by using a loss function until the network converges to obtain a feature extraction network model, and extracting local image features of the street view image through the model;
step 5, generating a feature codebook and calculating aggregated features: randomly selecting local image features of some street view images, setting the number of cluster centers to be generated, then performing feature clustering to generate a feature codebook, and calculating aggregated feature vectors of the image to be positioned and of all street view images in the reference data set according to the feature codebook, wherein each image corresponds to one aggregated feature;
step 6, establishing an inverted index and carrying out street view matching: establishing a reverse index table for inquiring street view images through the features according to the one-to-one correspondence of the aggregation features and the street view images, performing similarity calculation on the feature vectors of the images to be positioned and the feature vectors of the street view images, returning and sorting street view features with high similarity, and inquiring and searching the street view features according to the reverse index table to obtain corresponding street view images;
and step 7, returning position coordinates according to the retrieval results: estimating the peaks of the similarity distribution in space by a kernel density estimation method, considering both the longitude and latitude information and the similarity ranking of the retrieval results, taking the peaks as candidate positioning results, and returning the coordinate position of the image to be positioned according to the magnitude of the peaks.
2. The image positioning method based on deep learning and streetscape images as claimed in claim 1, characterized in that: the specific implementation manner of the step 1 is as follows;
step 1.1, the image to be positioned can be obtained through a news website, a social media or a camera shooting method, the street view image can be obtained through a network street view map service and a street view vehicle acquisition method, and metadata corresponding to the street view, including longitude and latitude information, is collected;
step 1.2, preprocessing the street view image; for the processing of the equidistant panorama, firstly splicing the street view images to obtain a complete street view panorama, then cutting the street view panorama, and removing invalid values at the upper side, the lower side or the left side and the right side so as to keep the aspect ratio of the images as 2: 1;
step 1.3, generating a street view perspective; according to the set projection parameters, each street view panoramic picture is converted into a plurality of plane perspective street view pictures without deformation, the projection method comprises two steps, firstly, the panoramic picture is projected onto a spherical surface, then, the proper projection parameters are set to project the panoramic picture on a plane, and the projection parameters are set as FOV: 60 °, Pitch: [5 ° 20 ° 35 ° ], Yaw: [0 ° 45 ° 90 ° 135 ° 180 ° 225 ° 270 ° 315 ° ], wherein the FOV is the field angle, Pitch is the Pitch angle, and Yaw is the course angle, and a plurality of street view maps of a certain size can be generated from each panoramic view according to the combination of the three parameters.
3. The image positioning method based on deep learning and streetscape images as claimed in claim 1, characterized in that: the specific implementation manner of the step 2 is as follows;
step 2.1, collecting the Google Landmarks Dataset v2, downloading and storing the data according to the metadata tags, and randomly selecting images of N categories from the data;
step 2.2, cleaning the landmark data set and generating the training set: extracting SIFT image features of the images in the N categories, matching each image in a category against the other images in that category, removing an image if the total number of its matched feature points is less than a set threshold and otherwise keeping it, and generating the training set from the cleaned landmark data set.
4. The image positioning method based on deep learning and street view images as claimed in claim 1, characterized in that: in step 3, the fully convolutional neural network is formed from a ResNet network with its last pooling layer and fully connected layer removed, followed by a feature screening module that scores and selects the dense features, the feature screening module consisting of a smoothing layer, an attention layer, and a whitening layer; the smoothing layer aggregates the larger activation values of adjacent channels in the dense features and consists of an average pooling layer of size M × M; the attention layer scores the dense features, the n local features with the highest scores are screened out, and the scores are processed with an l2 normalization function; the whitening layer performs dimensionality reduction and decorrelation on the features, it consists of a 1 × 1 convolution layer with bias, and its parameters are obtained before network training by training on local image features extracted with a pre-trained network.
5. The image positioning method based on deep learning and streetscape images as claimed in claim 1, characterized in that: in step 4, during network training, extracting global pooling characteristics of the network through a pooling method, wherein the characteristics are 1 × 1 × D dimensions, and the calculation method is as follows:
F = \sum_{h=1}^{H} \sum_{w=1}^{W} w(v_{h,w}) \, f(v'_{h,w})
wherein v represents the convolution feature map output by the network, w(v) is the weight output by the attention layer function, f(v') is the local feature obtained after the convolution feature v output by the network passes through a smoothing layer and a whitening layer, H is the length of the feature map, and W is the width of the feature map;
computing the loss of the network using the global pooled features, wherein the loss function used is represented as follows:
L = \frac{1}{2N} \sum_{i=1}^{N} \left[ y_i \, d_i^{2} + (1 - y_i) \, \max(\mathrm{margin} - d_i, 0)^{2} \right]
wherein d is the Euclidean distance between the characteristics of the samples in the tuple, y is whether the samples in the tuple belong to the same class, if yes, the value is 1, otherwise, the value is 0, N is the number of the samples, and margin is a set threshold;
before each iterative optimization of the network, a series of tuples is randomly generated from the image labels of the training data, each tuple consisting of a reference image, one positive sample, and several negative samples; the positive sample is randomly selected from images with the same class label; the negative samples are re-selected before each iteration: a number of images are first selected and their pooled aggregated features extracted to form a negative sample pool, the pool is then matched against the reference image and ranked, and when each tuple is generated the first n images whose class differs from that of the reference image are taken from the pool as the negative samples; in the feature extraction stage, multi-scale features of the image are extracted by image scaling, the network output is taken directly, and the top n local features are kept in descending order of weight value; the local feature weight is the weight value output by the attention layer, the image scale is the scaling factor applied to the image when it is input to the network, and the feature description position is the coordinate of the center of the feature's receptive field, determined from the receptive field size of the fully convolutional neural network.
6. The image positioning method based on deep learning and streetscape images as claimed in claim 1, characterized in that: step 5, generating an aggregation characteristic for each image, and aggregating the n × d-dimensional local characteristics of one image into k × d-dimensional aggregation characteristics by using an aggregation method, wherein k is an aggregation center number; the specific implementation method comprises the following steps:
step 5.1, randomly selecting a part of the extracted image features, setting the clustering parameters, generating k cluster centers with the K-means clustering method, and constructing a clustering codebook, denoted C = {c_1, ..., c_k};
Step 5.2, in the aggregation process, respectively allocating n local features of one image to k clustering centers, allocating any one local feature of each image to the clustering center closest to the local feature, and calculating the residual error between the local feature and the clustering center, which can be expressed as:
r(x) = v_i - q(x)
where r(x) denotes the residual between the local feature and the cluster center, v_i represents the i-th local feature, and q(x) represents the cluster center corresponding to the local feature;
step 5.3, if a plurality of local features exist in a cluster center, calculating residual errors of the cluster center and summing the residual errors to obtain 1 × d dimensional features, wherein the step of calculating the residual error sum of the cluster center can be expressed as:
V(X_c) = \sum_{x \in X_c} r(x)
wherein V(X_c) is the aggregated feature for that cluster center and X_c represents the representation of the local features of image X after quantization with the feature codebook;
and step 5.4, combining the features of the k cluster centers to form the k × d-dimensional aggregated feature.
7. The image positioning method based on deep learning and street view images as claimed in claim 6, characterized in that: in step 6, the inverted index table for querying street view images by feature is created by generating a key-value dictionary that maps each feature to its street view image, and the similarity between the feature vector of the image to be positioned and the feature vector of a street view image is computed as follows:
\mathrm{Similarity}(X_c, Y_c) = \sum_{c \in C} \sigma_{\alpha}\!\left( \hat{V}(X_c)^{\top} \hat{V}(Y_c) \right)
\sigma_{\alpha}(u) = \operatorname{sign}(u)\,|u|^{\alpha} \ \text{if} \ |u| > \tau, \ \text{and } 0 \ \text{otherwise}
wherein Similarity(X_c, Y_c) is the similarity value of image X and image Y, \hat{V}(X_c) = V(X_c)/\lVert V(X_c) \rVert is the normalized aggregated feature of image X at a cluster center, V(X_c) is the aggregated feature of image X, σ_α(u) is the similarity calculation function, u is the dot product of the features of the two images at a given cluster center, sign is the sign function (1 when u is greater than zero, -1 otherwise), |u| is the modulus of u, and α and τ are constants;
and for a certain image to be positioned, calculating feature similarity with all reference data sets and sequencing, and then inquiring the corresponding street view image according to the image features according to the inverted index table so as to obtain a street view retrieval result.
8. The image positioning method based on deep learning and street view images as claimed in claim 1, characterized in that: in step 7, the kernel density estimation method considers the first N search results in step 6, extracts a local peak of the analysis result as a positioning result, and may be represented as:
f(x, y) = \frac{1}{r^{2}} \sum_{i=1}^{n} \frac{3}{\pi}\, S(x_i, y_i) \left( 1 - \frac{d_i^{2}}{r^{2}} \right)^{2}
d_i = \sqrt{(x - x_i)^{2} + (y - y_i)^{2}}
wherein S(x_i, y_i) is the similarity value between the query image and the i-th street view in the step 6 results, located at coordinate (x_i, y_i), d_i is the distance from (x, y) to (x_i, y_i), r is the query radius, and n is the number of samples within radius r of the center (x, y); the local peaks of the kernel density analysis are extracted and ranked as the position localization result.
CN202210478747.2A 2022-05-05 Image positioning method based on deep learning and street view image Active CN114972506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210478747.2A CN114972506B (en) 2022-05-05 Image positioning method based on deep learning and street view image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210478747.2A CN114972506B (en) 2022-05-05 Image positioning method based on deep learning and street view image

Publications (2)

Publication Number Publication Date
CN114972506A true CN114972506A (en) 2022-08-30
CN114972506B CN114972506B (en) 2024-04-30

Family


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641499A (en) * 2022-10-19 2023-01-24 感知天下(北京)信息科技有限公司 Photographing real-time positioning method and device based on street view feature library and storage medium
CN116309811A (en) * 2022-10-19 2023-06-23 感知天下(北京)信息科技有限公司 Internet streetscape photo geographic position identification positioning method, storage medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN111144388A (en) * 2020-04-03 2020-05-12 速度时空信息科技股份有限公司 Monocular image-based road sign line updating method
CN111666434A (en) * 2020-05-26 2020-09-15 武汉大学 Streetscape picture retrieval method based on depth global features
CN111898618A (en) * 2020-07-07 2020-11-06 常熟理工学院 Method, device and program storage medium for identifying ancient graphics and characters

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN111144388A (en) * 2020-04-03 2020-05-12 速度时空信息科技股份有限公司 Monocular image-based road sign line updating method
CN111666434A (en) * 2020-05-26 2020-09-15 武汉大学 Streetscape picture retrieval method based on depth global features
CN111898618A (en) * 2020-07-07 2020-11-06 常熟理工学院 Method, device and program storage medium for identifying ancient graphics and characters

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Hao; WU Jianxin: "A Survey of Research on Unsupervised Image Retrieval Based on Deep Features", Journal of Computer Research and Development, no. 09, 15 September 2018 (2018-09-15) *
GUO Rui: "Place Recognition Based on Visual Feature Matching", China Master's Theses Full-text Database, Information Science and Technology, 31 May 2020 (2020-05-31), pages 138-127 *
CHEN Yumin; GONG Jianya; SHI Wenzhong: "Research on Distance Matching Algorithms for Multi-scale Road Networks", Acta Geodaetica et Cartographica Sinica, 28 February 2007 (2007-02-28) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641499A (en) * 2022-10-19 2023-01-24 感知天下(北京)信息科技有限公司 Photographing real-time positioning method and device based on street view feature library and storage medium
CN116309811A (en) * 2022-10-19 2023-06-23 感知天下(北京)信息科技有限公司 Internet streetscape photo geographic position identification positioning method, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN107679250B (en) Multi-task layered image retrieval method based on deep self-coding convolutional neural network
CN107256262B (en) Image retrieval method based on object detection
US9489402B2 (en) Method and system for generating a pictorial reference database using geographical information
CN110263659B (en) Finger vein recognition method and system based on triplet loss and lightweight network
EP2551792B1 (en) System and method for computing the visual profile of a place
CN104036012B (en) Dictionary learning, vision bag of words feature extracting method and searching system
CN111666434B (en) Streetscape picture retrieval method based on depth global features
Joshi et al. Inferring generic activities and events from image content and bags of geo-tags
Weyand et al. Visual landmark recognition from internet photo collections: A large-scale evaluation
Qian et al. Landmark summarization with diverse viewpoints
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
Chen et al. Integrated content and context analysis for mobile landmark recognition
CN109522434A (en) Social image geographic positioning and system based on deep learning image retrieval
CN110956213A (en) Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
CN114510594A (en) Traditional pattern subgraph retrieval method based on self-attention mechanism
Djenouri et al. Deep learning based decomposition for visual navigation in industrial platforms
CN113032613B (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
Cinaroglu et al. Long-term image-based vehicle localization improved with learnt semantic descriptors
Alfarrarjeh et al. A data-centric approach for image scene localization
Li et al. Global-scale location prediction for social images using geo-visual ranking
CN114972506B (en) Image positioning method based on deep learning and street view image
CN114972506A (en) Image positioning method based on deep learning and street view image
Liu et al. Robust and accurate mobile visual localization and its applications
CN112015937B (en) Picture geographic positioning method and system
Joly et al. Unsupervised individual whales identification: spot the difference in the ocean

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant