CN110175251A

CN110175251A - Zero-sample sketch retrieval method based on a semantic adversarial network

Info

Publication number: CN110175251A
Application number: CN201910442481.4A
Authority: CN
Inventors: 杨延华; 许欣勋; 张啸哲; 邓成
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-05-25
Filing date: 2019-05-25
Publication date: 2019-08-27

Abstract

The invention proposes a zero-sample sketch retrieval method based on a semantic adversarial network. It mainly addresses two problems of the prior art: the large intra-class variance of sketches, and the difficulty of transferring visual knowledge from seen classes to unseen classes under the zero-sample setting. The scheme is as follows: obtain a training sample set; construct a semantic adversarial network and extract RGB image features with a VGG16 network; construct a generation network that generates discriminative RGB image features; input the sketch to be retrieved into the semantic adversarial network to generate semantic features; input the semantic features together with random Gaussian noise into the generation network to generate RGB image features; and retrieve from the image retrieval library the top 200 images most similar to the generated RGB image features as the retrieval result. The invention reduces the intra-class variance of sketch image features, ensures that the RGB image features generated from sketch images are discriminative within each class, and improves the performance of zero-sample sketch retrieval. It can be used in e-commerce, medical diagnosis and remote sensing imaging.

Description

Zero-sample sketch retrieval method based on a semantic adversarial network

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a zero-sample sketch retrieval method which can be used for electronic commerce, medical diagnosis and remote sensing imaging.

Background

Sketch retrieval refers to retrieving real natural images from a hand-drawn sketch. Zero-sample sketch retrieval retrieves real natural images for hand-drawn sketches of unseen classes. Existing sketch retrieval methods fall into two main types: methods based on hand-crafted features and methods based on deep learning. Methods based on hand-crafted features include the gradient field HOG descriptor and the SIFT descriptor, while methods based on deep learning include Siamese networks, triplet networks, deep sketch hashing and the like. The main idea of these methods is to extract discriminative features from images or text and project them into a common feature space for similarity measurement. However, existing sketch retrieval methods assume that all categories are known in the training stage. In real scenarios the training data cannot be guaranteed to cover all categories, so retrieval performance drops sharply when a test category was not seen during training. Meanwhile, different people understand and draw sketches differently, so the intra-class variance of sketches is large, which makes sketch retrieval more challenging.

Zero-sample sketch retrieval transfers visual knowledge from seen categories to unseen categories under the zero-sample setting, thereby addressing this limitation of existing sketch retrieval. Researchers have so far proposed two methods for zero-sample sketch retrieval. The article "Zero-Shot Sketch-Image Hashing", published by Yuming Shen, Li Liu et al. at the 2018 Conference on Computer Vision and Pattern Recognition, discloses a zero-sample sketch hashing retrieval method. It constructs an end-to-end three-network framework in which the first two networks are binary encoders and the third network uses a Kronecker fusion layer and graph convolution to reduce the heterogeneity between sketches and images and to strengthen the semantic relations among data; it also proposes a hash generation method that reconstructs the semantic knowledge representation for zero-sample retrieval. The article "A Zero-Shot Framework for Sketch-Based Image Retrieval", published by Sasi Kiran Yelamarthi et al. at the 2018 European Conference on Computer Vision, discloses deep conditional generative models based on an adversarial autoencoder and a variational autoencoder, which take a sketch feature vector as input, use the generative model to stochastically fill in the missing information and generate natural image feature vectors, and then retrieve images from a database using these generated feature vectors. Although both methods achieve good performance, neither takes into account the large intra-class variance of sketches, so the semantic information extracted by a pre-trained convolutional neural network has weak discriminative power, and it is difficult to accurately transfer the visual knowledge of sketches from seen classes to unseen classes.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a zero-sample sketch retrieval method based on a semantic adversarial network, so that more discriminative semantic information is extracted through a pre-trained convolutional neural network and the visual knowledge of sketches is accurately transferred from seen classes to unseen classes.

The technical idea of the invention is as follows: the semantic features of the sketch are learned by the semantic adversarial module of an end-to-end semantic adversarial network, which reduces the intra-class variance of the sketch features; a triplet loss is added to the generation module, which ensures the discriminability of the RGB image features generated within each category. Together these address the difficulty of transferring visual knowledge from seen categories to unseen categories under the zero-sample setting.

Based on this idea, the implementation steps of the invention are as follows:

(1) obtaining a training sample set:

(1a) extracting 10,400 RGB images and the 10,400 corresponding binary sketch images from the Sketchy sketch retrieval database to form paired first training samples; extracting 138,839 RGB images and 138,839 binary sketch images of the corresponding categories from the TU-Berlin sketch retrieval database to form paired second training samples;

(1b) applying random horizontal flipping to all 298,478 extracted images to obtain 298,478 randomly flipped images;

(1c) resizing the 298,478 randomly flipped images to 224 × 224, and forming from them a training sample set $S_1$ containing the first training samples and a training sample set $S_2$ containing the second training samples;

(2) constructing a semantic adversarial network:

setting up a semantic adversarial network consisting of a semantic feature extraction network, a word embedding network and a semantic discriminator, wherein:

the semantic feature extraction network is used for extracting semantic features of the binary sketch image;

the word embedding network is used for extracting word vectors of the category information corresponding to the binary sketch image;

the semantic discriminator is used for performing adversarial learning between the semantic features extracted from the sketch image and the word vectors of the corresponding class labels, updating the parameters of the semantic feature extraction network through the adversarial loss $L_{adv}(\theta_S,\theta_D)$, and thereby improving the discriminability of the semantic features output for the sketch image;

the outputs of the semantic feature extraction network and of the word embedding network in the semantic adversarial network are input into the semantic discriminator for adversarial learning;

(3) performing feature extraction on the RGB images in the training sample set:

(3a) performing feature extraction on the RGB images in the first training sample set by using a VGG16 network pre-trained on an ImageNet data set, and selecting the output of a second full-connection layer in the network as the final RGB image feature of the first training sample set, wherein the dimension of the image feature is 4096;

(3b) performing feature extraction on the RGB images in the second training sample set by using a VGG16 network pre-trained on the ImageNet data set, and selecting the output of a second full connection layer in the network as the final RGB image feature of the second training sample set, wherein the dimension of the image feature is 4096;

(4) constructing a generating network:

constructing a generation network consisting, in order, of a concatenation layer, a conditional encoder, a triplet loss layer, a KL loss layer, a decoder, an image reconstruction loss layer, a regressor and a semantic reconstruction loss layer, wherein:

the concatenation layer is used for concatenating, along the feature dimension, the sketch semantic feature vector $x^{sem}$ output by the semantic feature extraction network and the RGB image feature vector $x^{img}$;

the conditional encoder takes the output of the concatenation layer as input, maps the data distribution $P(x^{img}, x^{sem})$ to the prior distribution $P(z)$ of the latent variable $z$, and computes the mean vector $\mu$ and the standard deviation vector $\sigma$ of the prior distribution $P(z)$;

the triplet loss layer is used for keeping the generated features discriminative within each training class; it takes the mean vector $\mu$ output by the conditional encoder as input and trains the encoder with a triplet loss function, the loss function of this layer being $L_{tri}$;

the KL loss layer is used for making the data distribution $P(x^{img}, x^{sem})$ and the variational distribution $Q(z \mid x^{img}, x^{sem})$ approximate each other, and then determining the variational lower bound through the loss function $L_{KL}$;

the decoder takes as input the concatenation of the learned latent vector $z$ of dimension 1024 and the semantic feature $x^{sem}$ of dimension 300, and generates the RGB image feature $\hat{x}^{img}$ corresponding to the sketch image. The mathematical expression of the decoding process is:

$$\hat{x}^{img} = D'(x^{sem}, noise)$$

wherein $noise$ represents random Gaussian noise $z \sim N(0,1)$ of dimension 1024, and $D'(\cdot)$ represents the decoder;

the image reconstruction loss layer is used for ensuring that the generated RGB image features are sufficiently discriminative; the decoder is trained with the reconstruction loss function $L_{recon\_img} = \|\hat{x}^{img} - x^{img}\|_2$, wherein $\hat{x}^{img}$ represents the generated RGB image feature corresponding to the sketch image, $x^{img}$ represents the original RGB image feature, and $\|\cdot\|_2$ represents the 2-norm;

the regressor takes the output $\hat{x}^{img}$ of the decoder as input and reconstructs the semantic feature $\hat{x}^{sem}$. The mathematical expression of the regression process is:

$$\hat{x}^{sem} = R(\hat{x}^{img}) = R(D'(x^{sem}, noise))$$

wherein $noise$ represents random Gaussian noise $z \sim N(0,1)$ of dimension 1024, and $R(\cdot)$ represents the regressor;

the semantic reconstruction loss layer is used for ensuring that the generated RGB image feature $\hat{x}^{img}$ preserves category-level semantic information; the loss function of this layer is $L_{recon\_sem} = \|\hat{x}^{sem} - x^{sem}\|_2$, wherein $\hat{x}^{sem}$ represents the reconstructed sketch semantic feature and $x^{sem}$ represents the semantic feature of the sketch;
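The decoder input and the reconstruction distance described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `decoder_input` and `l2_norm_diff` are hypothetical, the ordering "noise first, then semantic feature" is an assumption, and the plain (non-squared) 2-norm is used because the text only states that a 2-norm is applied.

```python
import math
import random

SEM_DIM = 300     # dimension of the sketch semantic feature x^sem (from the text)
NOISE_DIM = 1024  # dimension of the latent vector / Gaussian noise (from the text)


def decoder_input(x_sem, rng):
    """Concatenate N(0, 1) noise with x^sem to form the 1324-d decoder input.

    The concatenation order is an assumption made for illustration.
    """
    if len(x_sem) != SEM_DIM:
        raise ValueError("x_sem must have 300 entries")
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return noise + list(x_sem)


def l2_norm_diff(a, b):
    """2-norm of the difference between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


rng = random.Random(0)
dec_in = decoder_input([0.0] * SEM_DIM, rng)
print(len(dec_in))                           # 1324 = 1024 + 300
print(l2_norm_diff([3.0, 0.0], [0.0, 4.0]))  # 5.0
```

The same distance function would serve for both the image reconstruction loss (4096-dimensional vectors) and the semantic reconstruction loss (300-dimensional vectors).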

(5) training the semantic adversarial network and the generation network:

(5a) randomly initializing the semantic adversarial network and the generation network, the network parameters being drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 0.1, to obtain the initialized semantic adversarial network and generation network;

(5b) letting the loss function of the whole network be $L = L_{adv} + L_{tri} + L_{KL} + L_{recon\_img} + L_{recon\_sem}$;

(5c) taking the sketch images preprocessed in step (1) and their corresponding category information as input to the initialized semantic adversarial network, which outputs the semantic features of the sketches; taking these semantic features together with the RGB image features extracted by the pre-trained VGG16 network as input to the generation network; and training the semantic adversarial network and the generation network by minimizing the loss function L, to obtain the trained semantic adversarial network and generation network;

(6) carrying out zero sample sketch retrieval on the sketch image to be retrieved:

(6a) extracting a sketch image from a test sample set whose categories are disjoint from those of the training sample set, and cropping it to obtain the sketch image to be retrieved;

(6b) inputting the sketch image to be retrieved into the trained semantic feature extraction network, and outputting the semantic feature vector corresponding to the sketch image;

(6c) concatenating the semantic feature vector with random Gaussian noise, inputting the result into the trained generation network, and generating a plurality of RGB image features corresponding to the sketch through the encoder and the decoder;

(6d) taking the average of the plurality of generated RGB image features as the final RGB image feature, and retrieving from the image retrieval library, according to cosine distance, the top 200 images most similar to the final RGB image feature.

Compared with the prior art, the invention has the following advantages:

In the training stage, the invention exploits category-level semantic information and uses the semantic adversarial module of an end-to-end semantic adversarial network to learn the semantic features of the sketch, which reduces the intra-class variance of the sketch image features. A triplet loss is added to the generation network, which ensures the discriminability of the RGB image features generated within each class and addresses the difficulty of transferring visual knowledge from seen classes to unseen classes under the zero-sample setting.

Compared with the prior art, the method simplifies the training process and effectively improves the retrieval performance of zero-sample sketch retrieval.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a graph comparing the search results of the present invention with the conventional method.

Detailed description of the preferred embodiments

The invention is described in further detail below with reference to the following figures and specific implementations:

Referring to FIG. 1, the zero-sample sketch retrieval method based on a semantic adversarial network of the invention comprises the following implementation steps:

Step 1, obtaining a training sample set.

1.1) extracting 10,400 RGB images and the 10,400 corresponding binary sketch images from the Sketchy sketch retrieval database to form paired first training samples; extracting 138,839 RGB images and 138,839 binary sketch images of the corresponding categories from the TU-Berlin sketch retrieval database to form paired second training samples;

1.2) applying random horizontal flipping to all 298,478 extracted images to obtain 298,478 randomly flipped images;

1.3) resizing the 298,478 randomly flipped images to 224 × 224, and forming from them a training sample set $S_1$ containing the first training samples and a training sample set $S_2$ containing the second training samples:

$$S_1 = \{(x_i^{img}, x_i^{ske})\}_{i=1}^{10400}, \qquad S_2 = \{(x_j^{img}, x_j^{ske})\}_{j=1}^{138839}$$

wherein, in $S_1$, $x_i^{img}$ is the i-th RGB image in the Sketchy database and $x_i^{ske}$ is the binary sketch image of the category corresponding to $x_i^{img}$; in $S_2$, $x_j^{img}$ is the j-th RGB image in the TU-Berlin database and $x_j^{ske}$ is the binary sketch image of the category corresponding to $x_j^{img}$.
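The preprocessing of step 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the invention: images are represented as nested lists of pixel values, the flip probability of 0.5 and the nearest-neighbour resizing are assumptions (the text specifies neither), and all function names are hypothetical.

```python
import random

SKETCHY_PAIRS = 10_400     # RGB/sketch pairs taken from Sketchy (from the text)
TU_BERLIN_PAIRS = 138_839  # RGB/sketch pairs taken from TU-Berlin (from the text)


def total_images():
    """Each pair contributes one RGB image and one sketch image."""
    return 2 * (SKETCHY_PAIRS + TU_BERLIN_PAIRS)


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror every row with probability p (p = 0.5 is an assumed value)."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


def resize_nearest(image, out_h=224, out_w=224):
    """Nearest-neighbour resize; the interpolation mode is an assumption."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


rng = random.Random(0)
tiny = [[1, 2], [3, 4]]
processed = resize_nearest(random_horizontal_flip(tiny, rng))
print(total_images())                     # 298478
print(len(processed), len(processed[0]))  # 224 224
```

The count check confirms that the 298,478 images mentioned in steps 1.2) and 1.3) equal twice the number of pairs extracted in step 1.1).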

Step 2, constructing a semantic adversarial network.

the semantic feature extraction network is used for extracting semantic features of the binary sketch image; specifically, it is a VGG16 network pre-trained on ImageNet, which takes the fifth convolutional layer of VGG16 as the convolutional output and outputs a semantic feature vector of dimension 300 through a fully connected layer;

the word embedding network is used for extracting word vectors of the category information corresponding to the binary sketch image; it uses a word vector model pre-trained on Wikipedia to obtain a category-level word vector representation of dimension 300;

the semantic discriminator is used for performing adversarial learning between the semantic features extracted from the sketch image and the word vectors of the corresponding class labels, updating the parameters of the semantic feature extraction network through the adversarial loss function, and thereby improving the discriminability of the semantic features output for the sketch image, wherein the mathematical expression of the loss function $L_{adv}(\theta_S,\theta_D)$ is:

$$L_{adv}(\theta_S,\theta_D) = \mathbb{E}_{y}\big[\log D(W(y);\theta_D)\big] + \mathbb{E}_{x^{ske}}\big[\log\big(1 - D(S(x^{ske};\theta_S);\theta_D)\big)\big]$$

wherein $\mathbb{E}$ represents expectation, $y$ represents the class semantic information of the sketch, $W(\cdot)$ represents the word embedding network, $D(\cdot)$ represents the semantic discriminator, $\theta_D$ represents the parameters of the semantic discriminator, $S(\cdot)$ represents the semantic feature extraction network, $\theta_S$ represents the parameters of the semantic feature extraction network, and $x^{ske}$ represents a sketch image;

the outputs of the semantic feature extraction network and of the word embedding network in the semantic adversarial network are input into the semantic discriminator for adversarial learning.
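The adversarial objective of step 2 can be illustrated numerically. The sketch below assumes that the loss takes the standard GAN form, in which the discriminator scores word vectors as "real" and sketch semantic features as "fake"; the function name `adversarial_loss` and the small constant `eps` are hypothetical additions for illustration, and the inputs are assumed to be discriminator outputs already squashed into (0, 1).

```python
import math


def adversarial_loss(d_on_word_vectors, d_on_sketch_features, eps=1e-12):
    """Batch estimate of E[log D(W(y))] + E[log(1 - D(S(x^ske)))].

    d_on_word_vectors:    discriminator outputs for word vectors W(y)
    d_on_sketch_features: discriminator outputs for sketch features S(x^ske)
    eps guards against log(0); it is not part of the original method.
    """
    real_term = sum(math.log(p + eps) for p in d_on_word_vectors)
    real_term /= len(d_on_word_vectors)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_on_sketch_features)
    fake_term /= len(d_on_sketch_features)
    return real_term + fake_term


# A discriminator that cannot tell the two sources apart outputs 0.5 everywhere.
confused = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# A confident discriminator scores word vectors high and sketch features low.
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])
print(confused < confident)  # True
```

Under this convention the discriminator is trained to increase the value, while the semantic feature extraction network is trained to decrease it, so that its 300-dimensional outputs become hard to distinguish from the class word vectors.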

Step 3, extracting features from the RGB images in the training sample sets.

3.1) performing feature extraction on the RGB images in the first training sample set by using a VGG16 network pre-trained on the ImageNet data set, and selecting the output of a second full-connection layer in the network as the final RGB image feature of the first training sample set, wherein the dimension of the image feature is 4096;

3.2) performing feature extraction on the RGB images in the second training sample set by using a VGG16 network pre-trained on the ImageNet data set, and selecting the output of a second full connection layer in the network as the final RGB image feature of the second training sample set, wherein the dimension of the image feature is 4096.

Step 4, constructing a generation network.

the concatenation layer is used for concatenating, along the feature dimension, the sketch semantic feature vector $x^{sem}$ of dimension 300 output by the semantic feature extraction network and the RGB image feature vector $x^{img}$ of dimension 4096, and outputting a feature vector of dimension 4396;

the conditional encoder consists, in order, of a first fully connected layer with an input dimension of 4396 and an output dimension of 4096, a ReLU nonlinear activation layer, a one-dimensional batch normalization layer with a momentum parameter of 0.99 and eps of 1e-3, a Dropout layer with a drop rate of 0.3, a second fully connected layer with an output dimension of 2048, a ReLU nonlinear activation layer, and a one-dimensional batch normalization layer with a momentum parameter of 0.99 and eps of 1e-3; it takes the output of the concatenation layer as input and maps the data distribution $P(x^{img}, x^{sem})$ to a mean vector $\mu$ and a standard deviation vector $\sigma$, which form the prior distribution $P(z)$ of the latent variable $z$;

the triplet loss layer is used for keeping the generated features discriminative within each training category; it takes the mean vector $\mu$ output by the conditional encoder as input and trains the encoder with a triplet loss function $L_{tri}$, whose mathematical expression is:

$$L_{tri} = \max\big(0,\; d(E(x_a), E(x_p)) - d(E(x_a), E(x_n)) + \delta\big)$$

wherein $d(\cdot,\cdot)$ represents the $l_2$ distance function, $E(\cdot)$ represents the latent embedding function that yields the mean vector $\mu$, $x_a$ represents the anchor sample, $x_p$ represents a positive sample, $x_n$ represents a negative sample, and $\delta$ represents the margin;

the KL loss layer is used for making the data distribution $P(x^{img}, x^{sem})$ and the variational distribution $Q(z \mid x^{img}, x^{sem})$ approximate each other, and then determining the variational lower bound through the loss function $L_{KL}$, whose mathematical expression is:

$$L_{KL}(\theta_E,\theta_{D'}) = -\mathbb{E}_{Q(z \mid x^{img}, x^{sem})}\big[\log P(x^{img} \mid z, x^{sem})\big] + KL\big(Q(z \mid x^{img}, x^{sem}) \,\|\, P(z \mid x^{sem})\big)$$

wherein $\theta_E$ represents the parameters of the conditional encoder network, $\theta_{D'}$ represents the parameters of the decoder network, $\mathbb{E}$ represents expectation, $x^{img}$ and $x^{sem}$ represent the RGB image features and the semantic features respectively, $KL(\cdot \| \cdot)$ represents the KL divergence, $Q(z \mid x^{img}, x^{sem})$ represents the variational distribution output by the encoder network, $P(z \mid x^{sem})$ represents the posterior distribution given the semantic feature $x^{sem}$, and $P(x^{img} \mid z, x^{sem})$ represents the conditional distribution output by the decoder network;

the decoder consists, in order, of a first fully connected layer with an input dimension of 1324 and an output dimension of 4096, a ReLU nonlinear activation layer, a second fully connected layer with an output dimension of 4096, and a ReLU nonlinear activation layer; it takes as input the concatenation of the learned latent vector $z$ of dimension 1024 and the semantic feature $x^{sem}$ of dimension 300, and generates the RGB image feature $\hat{x}^{img}$ corresponding to the sketch image, the mathematical expression for generating $\hat{x}^{img}$ being:

$$\hat{x}^{img} = D'(x^{sem}, noise)$$

the image reconstruction loss layer is used for ensuring that the generated RGB image features are sufficiently discriminative; the decoder is trained with the reconstruction loss function $L_{recon\_img} = \|\hat{x}^{img} - x^{img}\|_2$, wherein $\hat{x}^{img}$ represents the generated RGB image feature corresponding to the sketch image, $x^{img}$ represents the original RGB image feature, and $\|\cdot\|_2$ represents the 2-norm;

the regressor consists, in order, of a first fully connected layer with an input dimension of 4096 and an output dimension of 2048, a ReLU nonlinear activation layer, a second fully connected layer with an output dimension of 300, and a Tanh nonlinear activation layer; it takes the output $\hat{x}^{img}$ of the decoder as input and reconstructs the semantic feature $\hat{x}^{sem}$, the mathematical expression for reconstructing $\hat{x}^{sem}$ being:

$$\hat{x}^{sem} = R(\hat{x}^{img}) = R(D'(x^{sem}, noise))$$

the semantic reconstruction loss layer is used for ensuring that the generated RGB image features preserve category-level semantic information; the loss function of this layer is $L_{recon\_sem} = \|\hat{x}^{sem} - x^{sem}\|_2$, wherein $\hat{x}^{sem}$ represents the reconstructed sketch semantic feature and $x^{sem}$ represents the semantic feature of the sketch.
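The triplet loss and the KL term of the generation network can be sketched in plain Python. The block below is a hedged illustration only: the triplet loss uses the common hinge form with an $l_2$ distance and a margin whose value the text does not state, and the KL term is the well-known closed form between a diagonal Gaussian $N(\mu, \sigma^2)$ and a standard normal, which assumes a standard normal prior (consistent with sampling noise from $N(0,1)$ at retrieval time). All function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def triplet_loss(mu_anchor, mu_pos, mu_neg, margin):
    """Hinge-style triplet loss on encoder mean vectors.

    Zero once the negative is farther from the anchor than the positive
    by at least `margin`; the margin value is left to the caller.
    """
    gap = l2_distance(mu_anchor, mu_pos) - l2_distance(mu_anchor, mu_neg)
    return max(0.0, gap + margin)


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).

    Assumes a standard normal prior; sigma entries must be positive.
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


# Same-class pair close together, other-class sample far away: no penalty.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0], margin=1.0))  # 0.0
# Positive farther away than the negative: penalised.
print(triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0], margin=1.0))  # 3.0
# An encoder output that already matches the prior has zero KL.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))                 # 0.0
```

The triplet term pulls the mean vectors of same-class samples together and pushes different-class samples apart, which is how the layer keeps generated features discriminative within each training category.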

Step 5, training the semantic adversarial network and the generation network.

5.1) randomly initializing the semantic adversarial network and the generation network, the network parameters being drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 0.1, to obtain the initialized semantic adversarial network and generation network;

5.2) setting the loss function of the whole network as: $L = L_{adv} + L_{tri} + L_{KL} + L_{recon\_img} + L_{recon\_sem}$;

5.3) taking the sketch images preprocessed in step 1 and their corresponding category information as input to the initialized semantic adversarial network, which outputs the semantic features of the sketches; taking these semantic features together with the RGB image features extracted by the pre-trained VGG16 network as input to the generation network; and training the semantic adversarial network and the generation network by minimizing the loss function L. The networks are trained with the Adam optimizer of the deep learning toolbox PyTorch, with an initial learning rate of 0.0001, $\beta_1 = 0.5$ and $\beta_2 = 0.99$. For training stability, the semantic adversarial network and the generation network are trained alternately during the first 2 epochs, and the whole network is trained end to end during the following 18 epochs, for 20 epochs in total, to obtain the trained semantic adversarial network and generation network.
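The training schedule of step 5 can be summarised in a small Python sketch. It only captures the bookkeeping (hyperparameters, which phase applies in which pass, and the unweighted sum of the five losses); it assumes that the 20 training passes are epochs, and the names `ADAM_CONFIG`, `training_mode` and `total_loss` are hypothetical.

```python
# Hyperparameters stated in the text for the Adam optimizer.
ADAM_CONFIG = {"lr": 1e-4, "betas": (0.5, 0.99)}

TOTAL_EPOCHS = 20        # total number of training passes (assumed to be epochs)
ALTERNATING_EPOCHS = 2   # first passes: the two networks are trained alternately


def training_mode(epoch):
    """Return the training phase for a zero-based epoch index."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch must be in [0, 20)")
    return "alternating" if epoch < ALTERNATING_EPOCHS else "end_to_end"


def total_loss(l_adv, l_tri, l_kl, l_recon_img, l_recon_sem):
    """Unweighted sum L = L_adv + L_tri + L_KL + L_recon_img + L_recon_sem."""
    return l_adv + l_tri + l_kl + l_recon_img + l_recon_sem


schedule = [training_mode(e) for e in range(TOTAL_EPOCHS)]
print(schedule.count("alternating"), schedule.count("end_to_end"))  # 2 18
print(total_loss(1.0, 2.0, 3.0, 4.0, 5.0))                          # 15.0
```

In a PyTorch training script, the values in `ADAM_CONFIG` would be passed as the `lr` and `betas` arguments of `torch.optim.Adam`.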

Step 6, performing zero-sample sketch retrieval on the sketch image to be retrieved.

6.1) extracting a sketch image from a test sample set whose categories are disjoint from those of the training sample set, and cropping it to obtain the sketch image to be retrieved;

6.2) inputting the sketch image to be retrieved into the trained semantic feature extraction network, and outputting the semantic feature vector corresponding to the sketch image;

6.3) concatenating the semantic feature vector with random Gaussian noise, inputting the result into the trained generation network, and generating a plurality of RGB image features corresponding to the sketch through the encoder and the decoder;

6.4) taking the average of the plurality of generated RGB image features as the final RGB image feature, retrieving from the image retrieval library, according to cosine distance, the top 200 images most similar to the final RGB image feature, and finally calculating the retrieval precision from the 200 retrieved images.
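The retrieval of steps 6.3) and 6.4) can be sketched in plain Python: several generated feature vectors are averaged, and gallery images are ranked by cosine similarity to the average. This is a minimal illustration on toy two-dimensional vectors; the gallery layout (a dictionary from image identifier to feature vector) and all function names are assumptions made for the example.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(generated_features, gallery, k=200):
    """Rank gallery images by similarity to the mean generated feature."""
    query = mean_vector(generated_features)
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:k]]


# Toy gallery with 2-d features; real features would be 4096-dimensional.
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
generated = [[1.0, 0.0], [0.8, 0.2]]  # several samples for one sketch
print(retrieve_top_k(generated, gallery, k=2))  # ['a', 'c']
```

Averaging several generated features reduces the effect of any single noise sample on the query, and ranking by highest cosine similarity is equivalent to ranking by smallest cosine distance.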

The technical effects of the present invention will be further explained below by combining with simulation experiments.

1. Simulation conditions are as follows:

The simulation experiments are carried out on a GPU of model NVIDIA GTX TITAN V, using the deep learning toolbox PyTorch.

2. Simulation content:

Simulation experiments are carried out on Sketchy and TU-Berlin, two public data sets dedicated to evaluating the performance of sketch retrieval methods, wherein:

the Sketchy data set contains 75,479 sketch images and 73,002 RGB images from 125 different classes; following the standard zero-sample learning experimental setting, 104 of the 125 classes are used as training (seen) classes and 21 as test (unseen) classes;

the TU-Berlin data set contains 20,000 sketch images and 204,070 RGB images from 250 different classes; following the standard zero-sample learning experimental setting, 194 of the 250 classes are used as training (seen) classes and 56 as test (unseen) classes.

Table 1 shows the results of simulation comparison experiments on the two public data sets Sketchy and TU-Berlin, comparing the method of the invention with existing sketch retrieval methods and zero-sample learning methods based on deep convolutional neural networks.

TABLE 1

Precision@200 and mAP@200 in Table 1 are, respectively, the precision and the mean average precision computed over the top 200 retrieved images.
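The two metrics can be illustrated with a short Python sketch. Definitions of average precision at a cutoff vary between papers; the version below normalises by the number of relevant images found within the top k, which is an assumption, as is dividing precision by k even when fewer than k results are available. The function names are hypothetical.

```python
def precision_at_k(relevant_flags, k=200):
    """Fraction of the top-k results that share the query's class.

    relevant_flags[i] is 1 (or True) if the result at rank i + 1 is relevant.
    """
    return sum(relevant_flags[:k]) / k


def average_precision_at_k(relevant_flags, k=200):
    """Mean of precision values taken at each relevant rank within the top k.

    Normalising by the number of hits in the top k is an assumed convention.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(per_query_flags, k=200):
    """Average of AP@k over all query sketches."""
    scores = [average_precision_at_k(flags, k) for flags in per_query_flags]
    return sum(scores) / len(scores)


# One query whose 1st and 3rd results are correct, one query with no hits.
queries = [[1, 0, 1, 0], [0, 0, 0, 0]]
print(precision_at_k(queries[0], k=4))               # 0.5
print(mean_average_precision_at_k(queries, k=4))     # (5/6 + 0) / 2
```

Precision@k ignores the order of results inside the top k, whereas mAP@k rewards placing relevant images near the top of the ranking.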

As can be seen from the simulation results in Table 1, the precision and mean average precision of the invention are higher than those of the prior art on both data sets.

The retrieval results of the invention and of CVAE, the best-performing prior-art method, are visualized on the Sketchy data set; FIG. 2 compares the top 10 of the 200 retrieved images.

As can be seen from FIG. 2, when sketches from 3 different test categories are used as queries, the top 10 images retrieved by the invention all belong to the same category as the query sketch, whereas the results of the CVAE method contain wrongly retrieved images.

Claims

1. A zero-sample sketch retrieval method based on a semantic adversarial network, characterized by comprising the following steps:

(1) obtaining a training sample set:

(1c) resizing the 298,478 randomly flipped images to 224 × 224, and forming from them a training sample set $S_1$ containing the first training samples and a training sample set $S_2$ containing the second training samples;

(2) constructing a semantic adversarial network:

setting up a semantic adversarial network consisting of a semantic feature extraction network, a word embedding network and a semantic discriminator, wherein,

(3) performing feature extraction on the RGB images in the training sample set:

(4) constructing a generating network:

a conditional encoder that takes the output of the concatenation layer as input and maps the data distribution $P(x^{img}, x^{sem})$ to a mean vector $\mu$ and a standard deviation vector $\sigma$, which form the prior distribution $P(z)$ of the latent variable $z$;

an image reconstruction loss layer for ensuring that the generated RGB image features are sufficiently discriminative, the decoder being trained with the reconstruction loss function $L_{recon\_img} = \|\hat{x}^{img} - x^{img}\|_2$, wherein $\hat{x}^{img}$ represents the generated RGB image feature corresponding to the sketch image, $x^{img}$ represents the original RGB image feature, and $\|\cdot\|_2$ represents the 2-norm;

(5) training the semantic adversarial network and the generation network:

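2. The method of claim 1, wherein the training sample set $S_1$ of the first training samples and the training sample set $S_2$ of the second training samples in (1c) are respectively expressed as follows:

$$S_1 = \{(x_i^{img}, x_i^{ske})\}_{i=1}^{10400}, \qquad S_2 = \{(x_j^{img}, x_j^{ske})\}_{j=1}^{138839}$$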

3. The method of claim 1, wherein the semantic feature extraction network in (2) adopts a VGG16 network pre-trained on ImageNet data set, and selects a fifth convolutional layer of the VGG16 network as convolutional output, and outputs a semantic feature vector with dimension of 300 through a full connection layer.

4. The method of claim 1, wherein the word embedding network in (2) employs a word vector model pre-trained on Wikipedia to obtain a category-level word vector representation of dimension 300.

5. The method according to claim 1, wherein the semantic discriminator in (2) consists, in order, of a first fully connected layer with an input dimension of 300 and an output dimension of 200, a sigmoid nonlinear activation layer, and a second fully connected layer with an output dimension of 1; the output of the semantic discriminator updates the parameters of the semantic feature extraction network through the adversarial loss $L_{adv}$, whose mathematical expression is:

$$L_{adv}(\theta_S,\theta_D) = \mathbb{E}_{y}\big[\log D(W(y);\theta_D)\big] + \mathbb{E}_{x^{ske}}\big[\log\big(1 - D(S(x^{ske};\theta_S);\theta_D)\big)\big]$$

wherein $\mathbb{E}$ represents expectation, $y$ represents the class semantic information of the sketch, $W(\cdot)$ represents the word embedding network, $D(\cdot)$ represents the semantic discriminator, $\theta_D$ represents the parameters of the semantic discriminator, $S(\cdot)$ represents the semantic feature extraction network, $\theta_S$ represents the parameters of the semantic feature extraction network, and $x^{ske}$ represents a sketch image.

6. The method according to claim 1, wherein the conditional encoder in (4) is composed of a first fully-connected layer with input dimension of 4396 and output dimension of 4096, a nonlinear active layer ReLU, a one-dimensional batch normalization layer with momentum parameters of 0.99 and eps of 1e-3, a Dropout layer with deactivation rate of 0.3, a second fully-connected layer with output dimension of 2048, a nonlinear active layer ReLU, and a one-dimensional batch normalization layer with momentum parameters of 0.99 and eps of 1e-3, in that order.

7. The method of claim 1, wherein the triplet loss layer in (4) takes the mean vector $\mu$ output by the conditional encoder as input and trains the encoder with a triplet loss function $L_{tri}$, whose mathematical expression is:

$$L_{tri} = \max\big(0,\; d(E(x_a), E(x_p)) - d(E(x_a), E(x_n)) + \delta\big)$$

wherein $d(\cdot,\cdot)$ represents the $l_2$ distance function, $E(\cdot)$ represents the latent embedding function that yields the mean vector $\mu$, $x_a$ represents the anchor sample, $x_p$ represents a positive sample, $x_n$ represents a negative sample, and $\delta$ represents the margin.

8. The method according to claim 1, wherein the KL loss layer in (4) determines the variational lower bound through a loss function $L_{KL}$, whose mathematical expression is:

$$L_{KL}(\theta_E,\theta_{D'}) = -\mathbb{E}_{Q(z \mid x^{img}, x^{sem})}\big[\log P(x^{img} \mid z, x^{sem})\big] + KL\big(Q(z \mid x^{img}, x^{sem}) \,\|\, P(z \mid x^{sem})\big)$$

wherein $\theta_E$ represents the parameters of the conditional encoder network, $\theta_{D'}$ represents the parameters of the decoder network, $\mathbb{E}$ represents expectation, $x^{img}$ and $x^{sem}$ represent the RGB image features and the semantic features respectively, $KL(\cdot \| \cdot)$ represents the KL divergence, $Q(z \mid x^{img}, x^{sem})$ represents the variational distribution output by the encoder network, $P(z \mid x^{sem})$ represents the posterior distribution given the semantic feature $x^{sem}$, and $P(x^{img} \mid z, x^{sem})$ represents the conditional distribution output by the decoder network.

9. The method of claim 1, wherein the decoder in (4) consists of a first fully-connected layer with an input dimension of 1324 and an output dimension of 4096, a non-linear active layer ReLU, a second fully-connected layer with an output dimension of 4096, and a non-linear active layer ReLU, in that order.

10. The method of claim 1, wherein the regressor in (4) consists of a first fully-connected layer with an input dimension of 4096 and an output dimension of 2048, a nonlinear active layer ReLU, a second fully-connected layer with an output dimension of 300, and a nonlinear active layer Tanh in that order.