AU2018101525A4

AU2018101525A4 - Category-partitioned Content Based Image Retrieval for fine-grained objects with feature extraction through Convolution Neural Network and feature reduction through principle component analysis

Info

Publication number: AU2018101525A4
Application number: AU2018101525A
Authority: AU
Inventors: Lei Chen; Yinan Guo; Hongyi Nie; Ziyuan Qin; Zhuoya Yang; Yexin ZHANG
Original assignee: Yang Zhuoya Miss
Current assignee: Yang Zhuoya Miss
Priority date: 2018-10-14
Filing date: 2018-10-14
Publication date: 2018-11-15
Anticipated expiration: 2026-10-14

Abstract

Our system aims to bridge the semantic gap by bringing the concept of classes into feature extraction and, accordingly, feature matching. Unlike a conventional CBIR system, our system is designed to return the most similar images from among those that belong to the same class as the query, where classes are defined by brand combined with type, for example Cushesandals and Ecco-slipons, so that retrieval reaches the semantic level. In general, our invention puts forward a new image retrieval system that uses ResNet-50 as the base model to achieve the best retrieval result.

Description

Title

Category-partitioned Content Based Image Retrieval for fine-grained objects with feature extraction through Convolution Neural Network and feature reduction through principle component analysis

FIELD OF THE INVENTION

With the rapid spread and development of multimedia technology, more and more people have started to obtain and query images through the Internet. Existing image retrieval techniques are mainly text-based or content-based. Text-based retrieval requires manual labeling; the workload is large, and different people may understand the same image differently, which makes labels ambiguous. Content-based image retrieval relies only on low-level visual features and often deviates from what the querying user has in mind, resulting in inaccurate search results. Therefore, semantic-based image retrieval came into being.

This invention can be applied to an e-commerce platform, where it lets users perform fine-grained product retrieval. The brand of a product can be identified, and products of the same brand can be recommended to the user. It is also possible to identify the style of a product and recommend products in the same style.

BACKGROUND OF THE INVENTION

In recent years, alongside the rapid development of technology, the e-commerce industry has grown remarkably. Millions of commodities can be searched and purchased online with just a few clicks. In the past, the most commonly applied retrieval system was Keyword Based Information Retrieval (KBIR). KBIR accepts a keyword as its input and returns all products labeled with that keyword. The shortcoming of this retrieval system is obvious. For a large dataset, it costs a great deal of time and effort to label each product manually. In addition, human labeling can be very subjective, and many features are beyond description. Subjective labels and indescribable features further increase the difficulty of constructing a large and accurate text retrieval database. Besides, there is a large intention gap, which makes things worse. When consumers are looking for a product of a specific style and cannot come up with words to describe that style, they do not get the expected search results and have a poor search experience. To avoid these problems, people have moved on to a new search method: content based image retrieval.

Content based image retrieval (CBIR) is the application of computer vision techniques to the image retrieval problem. Such a system works by first extracting features from an object's image, then comparing these features with those in a feature database, and finally returning the images with similar features to the user, ranked by their distance to the target image. Currently, one of the most frequently used feature extraction algorithms is SIFT, as it is invariant to rotation and robust to many factors such as affine distortion, added noise, and illumination changes. Nevertheless, SIFT still has its limitations. For example, in the ILSVRC competition, even at its best performance it had an error rate higher than 26 percent, which does not meet the baseline requirement of some image retrieval systems.

CBIR has other limitations as well, such as the semantic gap. Since the features extracted by a CBIR system are low-level visual features, the retrieval results may include images that look alike but are semantically irrelevant. Therefore, image retrieval at the semantic level is likely to remain a hot topic for a long time.

With the development of deep learning, researchers have started to use CNN-based models to implement image retrieval. Such models perform much better than traditional models. Among the variants of deep learning models, VGG is the first to use several small convolutional kernels instead of one large kernel. This change reduces the number of parameters and enhances the representational power of the network. The Inception model, on the other hand, increases computational efficiency by turning sparse matrices into dense sub-matrices. Inception V2, developed from the original Inception model, integrates a method called Batch Normalization, which improves training efficiency and classification accuracy. Recently, the ResNet model, with its idea of the deep residual network, has attracted wide attention. [1] Its residual structure allows many more layers to be stacked without degrading classification accuracy on the training set or leaving the model untrainable because of its extreme depth, while improving the model's ability to extract more abstract features through a deeper network. Hashing learning reduces storage cost while also increasing retrieval speed. The CNNH model combines deep learning with hashing: it compares the similarity of two images by measuring the Hamming distance between their hash codes, which are obtained through hashing learning.

Despite these algorithmic improvements, the models still have weaknesses. As the number of layers and filters in a deep convolutional network increases, the models require heavier computation and longer training. For users, this means waiting much longer for a search result to return. For researchers, this means that training a model may take hours, sometimes days or even weeks. In addition to the efficiency problem, there are many sources of variance to overcome. For example, when different people take pictures of the same product, the angle of the photo and the size of the product in the frame may differ. The photos may also be taken under different illumination and environmental conditions. With such variance among photos of the same product, it is much harder for a system to extract accurate and meaningful features. Therefore, how to design a CBIR system that overcomes these variances, extracts meaningful features, and achieves high search accuracy and efficiency is still an open question.

SUMMARY OF THE INVENTION

In order to build a system that is efficient, accurate, and capable of extracting semantic-level features, we adopt ResNet-50 as our base model and build the whole model to extract high-dimensional features carrying the class information of images, then greatly reduce their dimensionality by PCA with barely any loss of accuracy. The offline part ends when the feature database has been built and stored persistently. The online part begins by extracting the features of the target image through the same process, then matches the target feature with the features in the database by cosine similarity, and ends by returning similar images based on class partition.

The dataset we select is a fine-grained footwear dataset that consists of over 5000 pictures of shoes. These pictures are classified into 107 different categories based on their brands and types. To make this dataset more training-friendly, two main changes are made: dataset adjustment and dataset enlargement. For dataset adjustment, the idea is to remove all the small classes that contain only two or three images, which make no contribution to classification or to training the model. After this process, the dataset contains more than four thousand images classified into seventy-one different classes. Then we adjust each image. Through resizing, cropping, and other minor adjustments, images that vary remarkably in angle, size, illumination, etc. are transformed into something more consistent. For dataset enlargement, the goal is to counter potential over-fitting, which is achieved by turning a single image into multiple ones through displacement, rotation, mirroring, etc.

The function of our model is to extract features describing the category of an image's content, which reaches up to the semantic level. The key is to build a neural network that is deep enough to extract abstract-level features rather than low-level visual features. During training, however, the vanishing gradient problem and inefficiency must be avoided.

In our invention, we accomplish the goal of extracting abstract, category-partitioned features, which involve the semantic level, by building a ResNet-based model and training it through classification. We are convinced that the better the trained model performs on classification tasks, the more semantic the features it extracts from images are, as they involve the concept of "class", whereas the features that conventional CBIR models extract are low-level visual features, where the semantic gap sometimes arises. That is to say, the ultimate goal of our system and model is to bridge the semantic gap by partitioning the dataset based on semantic-level features, rather than on naturally partitioned low-level visual features such as color and shape.

Generally, it is believed that the higher a model's accuracy in image classification, the better it performs in an image retrieval system, since retrieval performance is evaluated by whether the returned images belong to the target image's category. After comparing ResNet-50 and VGG16, we decided to use ResNet-50 as our primary model. Since these models were originally designed for classification, their top classification layers are removed and new layers are added for feature extraction. As ResNet-50's classification performance on the fine-grained dataset was tested to be better than VGG's, our model is built on ResNet-50 excluding its top layers, called the base-model below. Two main changes are made in building the system's model to extract image features. The first change is to add a fully connected layer to the model. This layer accepts the feature map obtained from the base-model as its input and achieves dimensionality reduction: after this layer, the shape of the feature map is reduced to 1×1×1024. Then we introduce ReLU, an activation function, into our model. Its purpose is to introduce nonlinearity, so that the model is capable of more complicated classification. To improve computational efficiency, we apply the PCA algorithm to the output of this layer. This reduces the dimension of our features from 1024 to 200, thus reducing the time needed for calculation.

After obtaining the compressed features, that is, once the feature database has been built, we shift the team's focus to developing the user interface. Our interface is developed using a Python GUI. Whenever a user inputs an image into this interface, our system extracts the feature of the image through the whole model we built and PCA. We then use cosine similarity to compare the feature with those in the feature database. The top 10 images with the highest cosine similarity, that is, the shortest distance, are then displayed on the interface.

With the retrieval system completely built, it is important to evaluate how well it works compared with others. After considering different evaluation strategies such as Retrieval Rate (RR), Modified Retrieval Rank (MRR), and Mean Average Precision (MAP), we decided to set the threshold manually, using the test accuracy as the criterion. Any image whose similarity to the query exceeds 0.92, the model's top-1 classification accuracy on the test set, is predicted to be of the same class, that is, one of the images the system is expected to return. From there, retrieval performance can be evaluated by the Precision, Recall, and F-score of our model.

[1] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition [J]. 2015:770-778.

DESCRIPTION OF THE FIGURES

To make the implementation of this invention easier to understand, the related process figures are listed below. They are provided only for description and explanation, not for limitation, wherein:

Fig-1 is the general flow diagram of the invention. It lists the main steps, which form the framework.

Fig-2 illustrates the unique Identity Blocks of Resnet model.

Fig-3 illustrates the construction of our whole model.

Fig-4 shows the detailed process of principal component analysis.

Fig-5 displays search results in a real scenario in which a user downloads pictures of the products he wants to purchase online and searches by those images. Notably, the retrieval system performs consistently in this way on unseen images.

DESCRIPTION OF PREFERRED EMBODIMENT

To make the invention easier to understand, the embodiment of the invention is presented in detail, with the general process shown in Fig-1.

The whole model is shown in Fig-3. Unlike a conventional Convolutional Neural Network (CNN), the base of our model for extracting the general features of images, namely ResNet-50, is a deeper convolutional neural network consisting of the residual blocks displayed in Fig-2. ResNet-50 has some advantages due to its unique block structure. On one hand, thanks to the shortcut connections in its residual blocks, the model avoids the vanishing gradient problem even as the network grows deeper. On the other hand, compared with shallower networks, the depth of the residual network offers remarkably better performance in classification and recognition and makes the training process more efficient. For the image retrieval system as a whole, applying the base-model, with our modifications, to extract semantic-level features is the core of our invention. PCA and cross entropy are also applied in this retrieval system.

Step A: Image dataset collection

The dataset we use for our system is a footwear dataset that contains more than 5000 images. These images are initially classified into 107 different classes based on their brands and shoe types. The problem with this dataset is that the number of images per class is not balanced. A few classes have over 150 images and most classes have more than 30, but some classes contain only two or three images. Therefore, we filter out the small classes that make no contribution to image retrieval, since it is impossible to train the model to recognize similar pictures within those classes. This reduces the dataset to 4572 pictures in 71 classes.
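The class filtering described above can be sketched as follows. This is only an illustration: the function name, the toy file names, and the cut-off of four images per class are assumptions, since the text states only that classes with two or three images are removed.

```python
from collections import Counter

def filter_small_classes(samples, min_images=4):
    """Keep only samples whose class has at least `min_images` images.

    `samples` is a list of (image_path, class_label) pairs.
    `min_images=4` is an assumed cut-off, not a value from the text.
    """
    counts = Counter(label for _, label in samples)
    return [(path, label) for path, label in samples if counts[label] >= min_images]

# Toy data: class names and file names are made up for illustration.
samples = (
    [(f"ecco_slipons_{i}.jpg", "ecco-slipons") for i in range(5)]
    + [(f"cushe_sandals_{i}.jpg", "cushe-sandals") for i in range(4)]
    + [(f"rare_{i}.jpg", "rare-class") for i in range(2)]
)
kept = filter_small_classes(samples)
kept_classes = sorted({label for _, label in kept})
print(len(kept), kept_classes)
```

In this toy case the two-image class is dropped and the other two classes are kept unchanged.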

Step B: Image adjustment

After reducing the number of classes in the dataset, we adjust each individual image according to its bounding box, which improves feature extraction. After this process, the object in each image is at the center of the image with almost no blank space. The process can be separated into two parts: image relocation and image resizing. For image relocation, we crop and re-center all the shoes according to the coordinates we obtain. In addition, images are resized to 224px*224px, the input size of the model we build, with the aim of obtaining more distinctive features.
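A minimal sketch of the crop-and-resize step is given below, assuming the image is held as a list of pixel rows and the bounding box is given as (left, top, right, bottom). Nearest-neighbour sampling is used only to keep the example free of dependencies; the text does not state which interpolation is used, and the function name is hypothetical.

```python
def crop_and_resize(image, box, size=224):
    """Crop `image` to `box` and resize the crop to size x size.

    `image` is a list of pixel rows; `box` is (left, top, right, bottom)
    with right and bottom exclusive. Nearest-neighbour sampling is an
    assumption made to keep this sketch dependency-free.
    """
    left, top, right, bottom = box
    crop = [row[left:right] for row in image[top:bottom]]
    crop_h, crop_w = len(crop), len(crop[0])
    resized = []
    for y in range(size):
        src_y = min(crop_h - 1, y * crop_h // size)
        row = [crop[src_y][min(crop_w - 1, x * crop_w // size)] for x in range(size)]
        resized.append(row)
    return resized

# Toy 300x400 grayscale image: background 0, a "shoe" region of 255.
image = [[0] * 400 for _ in range(300)]
for y in range(100, 200):
    for x in range(50, 350):
        image[y][x] = 255

out = crop_and_resize(image, box=(50, 100, 350, 200))
print(len(out), len(out[0]))
```

Because the crop follows the bounding box exactly, the resized output contains only object pixels and no background.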

Step C: Dataset enlargement

As mentioned above, the imbalanced dataset and, to some extent, the small quantity of images (about 65 images per class on average) can easily lead to over-fitting when training the model, especially as we use 70 percent of the images as the training set. To solve this problem, we need to expand our dataset. By performing arbitrary translation, rotation, flipping, scaling, etc. on every image, we transform one image into twenty images. After the expansion, we have about 90,000 images in total in our dataset, which is large enough to train the model to extract features.
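The expansion can be sketched as below. This is a simplified illustration covering only random horizontal flips and translations on a list-of-rows image; rotation and scaling are omitted, and the function names, the `max_shift` parameter, and the seed are assumptions. The last line checks the arithmetic of a twenty-fold expansion of 4572 images.

```python
import random

def shift(image, dx, dy, fill=0):
    """Translate a list-of-rows image by (dx, dy), padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out

def augment(image, copies=20, max_shift=2, seed=0):
    """Return `copies` randomly flipped and translated variants of `image`."""
    rng = random.Random(seed)
    variants = []
    for _ in range(copies):
        if rng.random() < 0.5:
            variant = [row[::-1] for row in image]  # horizontal flip
        else:
            variant = [row[:] for row in image]     # unchanged copy
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        variants.append(shift(variant, dx, dy))
    return variants

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
variants = augment(image)
expanded_size = 4572 * len(variants)  # twenty variants per original image
print(len(variants), expanded_size)
```

Twenty variants for each of the 4572 images gives 91,440 images, in line with the figure of about 90,000 quoted in the text.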

Step D: Build a Resnet-Based Model

A ResNet-50 based model is built in which the top layer is replaced by two fully connected layers; the new model is shown in Fig-3. To extract image features of lower dimension, the first fully connected layer, which is part of the hidden layer we add, takes the output of the base-model, a feature vector of 2048 dimensions, as input and maps it to a feature vector of 1024 dimensions. This layer also makes training the network easier, as discussed in the following step. The second layer maps the features to the different classes.
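The shapes involved in the two added layers can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions: the weights are random placeholders rather than trained values, the base-model output is simulated with random numbers instead of a real ResNet-50, and the names `head_forward`, `W1`, and `W2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 71  # number of classes after dataset adjustment

# Placeholder weights; in practice these would be learned during training.
W1 = rng.normal(scale=0.01, size=(2048, 1024))
b1 = np.zeros(1024)
W2 = rng.normal(scale=0.01, size=(1024, NUM_CLASSES))
b2 = np.zeros(NUM_CLASSES)

def head_forward(base_features):
    """Map (batch, 2048) base-model features to hidden features and class probabilities."""
    hidden = np.maximum(base_features @ W1 + b1, 0.0)    # fully connected layer + ReLU
    logits = hidden @ W2 + b2                            # second fully connected layer
    logits = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)         # softmax over classes
    return hidden, probs

base_features = rng.normal(size=(4, 2048))  # stand-in for ResNet-50 output
hidden, probs = head_forward(base_features)
print(hidden.shape, probs.shape)
```

The 1024-dimensional `hidden` output corresponds to the features later used for retrieval, while `probs` is used only for training through classification.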

Step E: Start Training

Finding the best-fitting weights of the model we build means minimizing the cost, defined here as the cross-entropy function. The weights are updated as weight = weight - learning rate × gradient, and back propagation is used to calculate the partial derivatives with respect to all variables of this multi-layer composite function. The cross-entropy cost is calculated as $E = -\sum_{i=1}^{n} y_i \log \hat{y}_i$, where $y_i$ is the true label for class $i$, $\hat{y}_i$ is the predicted probability for class $i$, and $n$ is the number of classes.

The weights of the model are then updated repeatedly until the cost begins to converge, which indicates that the weights are well suited. The number of training epochs can be set manually accordingly.
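The cross-entropy cost and the update rule weight = weight - learning rate × gradient can be illustrated on a single softmax layer. This is a toy sketch, not the training code of the invention: the data are random, and the learning rate of 0.1, the 50 iterations, and the layer sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, onehot):
    """Mean over the batch of E = -sum_i y_i * log(y_hat_i)."""
    return float(-(onehot * np.log(probs + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 8))   # toy input features
labels = rng.integers(0, 3, size=60)  # toy labels for 3 classes
onehot = np.eye(3)[labels]

weights = np.zeros((8, 3))
learning_rate = 0.1
losses = []
for _ in range(50):
    probs = softmax(features @ weights)
    losses.append(cross_entropy(probs, onehot))
    # Gradient of the mean cross-entropy with respect to the weights.
    gradient = features.T @ (probs - onehot) / len(features)
    weights = weights - learning_rate * gradient

print(losses[0], losses[-1])
```

With all-zero initial weights the predictions are uniform, so the first cost equals log 3, and the cost falls as the updates proceed.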

As mentioned above, because we build our model on ResNet-50 and replace its top layer with two fully connected layers activated by ReLU and softmax, we do not have to train the model from scratch. Given the pre-trained ResNet-50 model, our model can first load the weights of ResNet-50, the base-model mentioned above, and obtain the 2048-dimensional feature vectors of images through the base-model without any training effort. We then take those feature vectors as input and set the number of output classes to the number of categories in the dataset to train the rest of the model. This is slightly different from fine-tuning, as the hidden layer we add transforms the 2048-dimensional features, which are best suited to ImageNet classification, into 1024-dimensional features, which are best suited to our fine-grained shoe classification. Notably, this approach lets people train a shallower network for classification and reach remarkable accuracy on their own datasets without too much effort.

When the model performs well on classification of the fine-grained classes, this shows that the features it extracts contribute greatly to distinguishing classes, since the training is guided at the level of classes and the dataset cannot be partitioned by visual features alone. Therefore, it is reasonable to say that the model is capable of extracting features that involve the "class" concept, partly excluding the low-level visual features, and would perform well on other fine-grained datasets.

Step F: Save and Test

Once the above step is done, the model trained on classification is saved persistently. To evaluate its classification performance, we feed the model the original dataset. This does not include the pictures generated in Step C, since the retrieval system should not return a series of transformed pictures. There is no over-fitting problem: the model not only behaves well on the training set, with accuracy over 99%, but also reaches top-1 accuracy over 92% on the test set, with top-5 accuracy over 98.6%.
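Top-1 and top-5 accuracy of the kind reported above can be computed from per-class scores as sketched below. The function name and the toy scores are illustrative assumptions; with only three toy classes, top-2 is shown in place of top-5.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy class scores for four images and three classes (made-up values).
scores = [
    [0.70, 0.20, 0.10],
    [0.35, 0.40, 0.25],
    [0.50, 0.10, 0.40],
    [0.20, 0.30, 0.50],
]
labels = [0, 0, 2, 2]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```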

Step G: Principal Component Analysis

Given the performance of the model saved in Step F, it is well fitted for classification, which means the features extracted through our model are well separated by category. However, each image has a 1024-dimensional feature, which is still costly to compute with and includes many dimensions that contribute nothing to expressing the uniqueness of a category or distinguishing one class from the others. Therefore, principal component analysis (PCA) is applied to solve this problem, with the process shown in Fig-4. PCA is a statistical procedure that uses a dimensionality reduction matrix to map features from a high-dimensional space to a low-dimensional space. In this step, we first obtain the feature matrix of the images, the output of the hidden layer we add, via the model saved in the last step. After normalizing it and calculating the covariance matrix of the normalized matrix, we compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors are arranged as columns of a new eigenmatrix in descending order of their eigenvalues, so that the directions of largest variance come first. By preserving the first 200 columns of the eigenmatrix and cutting off the rest, we obtain the dimensionality reduction matrix M. At the end of this step, the feature matrix after dimensionality reduction is obtained by multiplying the feature matrix by M.
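The PCA procedure can be sketched with NumPy as follows. Two assumptions are made here: "normalizing" is read as mean-centering (scaling each dimension by its standard deviation would be another reading), and random data stand in for the hidden-layer features. The function names are hypothetical. Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so they are reordered so that the largest come first.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return (mean, M), where the columns of M are the leading eigenvectors."""
    mean = features.mean(axis=0)
    centered = features - mean                               # normalization (centering)
    covariance = np.cov(centered, rowvar=False)              # (dim, dim) covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]                    # largest eigenvalue first
    M = eigenvectors[:, order[:n_components]]                # keep the first columns
    return mean, M

def apply_pca(features, mean, M):
    """Project features onto the reduced space."""
    return (features - mean) @ M

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 1024))  # stand-in for hidden-layer features
mean, M = fit_pca(features, n_components=200)
reduced = apply_pca(features, mean, M)
print(M.shape, reduced.shape)
```

The same `mean` and `M` would be reused at query time so that the target image is projected into the same 200-dimensional space as the database features.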

Step H: Feature Extraction and Feature Database Formation

After PCA, the reduced features are obtained. Since the reduced features still behave well on classification, with accuracy remaining at the same high level, we need not be concerned that they would drag down the retrieval performance of the system. Therefore, the reduced features can be stored persistently in a database, together with the URL and index of each image, and the feature database is formed.
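One possible way to store the reduced features together with each image's URL and index is sketched below using SQLite from the Python standard library. The storage backend, table layout, and JSON encoding are all assumptions for illustration; the text does not specify how the feature database is implemented, and an in-memory database is used here only to keep the example self-contained.

```python
import json
import sqlite3

def build_feature_db(records):
    """Store (index, url, feature) records in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (idx INTEGER PRIMARY KEY, url TEXT, feature TEXT)"
    )
    conn.executemany(
        "INSERT INTO features (idx, url, feature) VALUES (?, ?, ?)",
        [(idx, url, json.dumps(feature)) for idx, url, feature in records],
    )
    conn.commit()
    return conn

def load_features(conn):
    """Read all records back as (index, url, feature) tuples."""
    rows = conn.execute("SELECT idx, url, feature FROM features ORDER BY idx")
    return [(idx, url, json.loads(feature)) for idx, url, feature in rows]

# Toy records with made-up URLs and 3-dimensional features.
records = [
    (0, "images/shoe_0.jpg", [0.12, -0.40, 0.88]),
    (1, "images/shoe_1.jpg", [0.05, 0.31, -0.27]),
]
conn = build_feature_db(records)
loaded = load_features(conn)
print(loaded)
```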

Step I: Feature Matching

The distance between images is computed by cosine similarity, since the low-dimensional feature vector of each image is easy to obtain. For two feature vectors $A$ and $B$, it is calculated as

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

For each search, the system extracts the target image's feature, compares this feature vector with all features stored in the feature database, and finally displays the ten most similar images, ranked by cosine similarity.
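The matching step can be sketched as below, with cosine similarity computed as the dot product of two vectors divided by the product of their norms. The function names, the toy URLs, and the 3-dimensional vectors are illustrative assumptions; in the system described, the vectors would be the 200-dimensional reduced features and `k` would be 10.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query, database, k=10):
    """Return the k (url, similarity) pairs most similar to `query`."""
    scored = [(url, cosine_similarity(query, feature)) for url, feature in database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy feature database with made-up URLs.
database = [
    ("images/a.jpg", [1.0, 0.0, 0.0]),
    ("images/b.jpg", [0.9, 0.1, 0.0]),
    ("images/c.jpg", [0.0, 1.0, 0.0]),
    ("images/d.jpg", [-1.0, 0.0, 0.0]),
]
results = top_matches([1.0, 0.0, 0.0], database, k=2)
print(results)
```

A linear scan like this compares the query against every stored feature, which matches the description of comparing the target feature with all features in the database.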

Step J: Performance and Evaluation

Up to this step, the framework of our system has been introduced. In evaluating an image retrieval system, there are several factors to consider: accuracy, computational efficiency, and memory cost. Admittedly, there is always room to improve memory cost and efficiency, which is beyond the discussion here. With the focus on accuracy, the performance of this category-partition based image retrieval system is evaluated by precision, recall, and F1 score. It is well known that achieving high values for both precision and recall is, to some extent, unrealistic, and our system is no exception. Both metrics vary with the feature-similarity threshold above which we regard images as belonging to the same class and tag them as positive. Notably, although the judgment of retrieval results is subjective, our system stands the test of returning shoes of similar style, based on brand and category, when searching with unseen images against the original dataset, as shown in Fig-5.
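The evaluation can be sketched as below, treating every image whose similarity to the query exceeds a threshold as a predicted positive. The threshold of 0.92 is the value given in the summary; the function name and the toy similarities and ground-truth flags are made up for illustration.

```python
def precision_recall_f1(similarities, same_class, threshold=0.92):
    """Treat images with similarity above `threshold` as predicted positives."""
    predicted = [s > threshold for s in similarities]
    tp = sum(p and t for p, t in zip(predicted, same_class))
    fp = sum(p and not t for p, t in zip(predicted, same_class))
    fn = sum((not p) and t for p, t in zip(predicted, same_class))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy similarities to one query, and whether each image shares the query's class.
similarities = [0.99, 0.95, 0.93, 0.90, 0.85, 0.60]
same_class = [True, True, False, True, False, False]
precision, recall, f1 = precision_recall_f1(similarities, same_class)
print(precision, recall, f1)
```

Raising the threshold tends to increase precision at the cost of recall, and lowering it does the opposite, which is the trade-off noted above.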

Claims

CLAIM

1. A complete CBIR system based on a Convolutional Neural Network model named ResNet-50, wherein the system first performs dataset adjustment and dataset enlargement.
2. A system as claimed in claim 1, which is built on ResNet-50 and is capable of extracting abstract features about category information, rather than low-level visual features, to some extent at the semantic level, and is therefore capable of retrieval for fine-grained products.
3. A system as claimed in claim 1, in which the PCA algorithm is applied to reduce dimension, and the difference between images is calculated using cosine similarity.