WO2019230666A1 - Feature amount extraction device, method, and program - Google Patents

Feature amount extraction device, method, and program Download PDF

Info

Publication number
WO2019230666A1
WO2019230666A1 (PCT/JP2019/020948)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
weight
feature map
category
Prior art date
Application number
PCT/JP2019/020948
Other languages
French (fr)
Japanese (ja)
Inventor
之人 渡邉
周平 田良島
島村 潤
杵渕 哲也
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Publication of WO2019230666A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/907 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/908 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis

Definitions

  • The present invention relates to a feature quantity extraction apparatus, method, and program, and more particularly to a feature quantity extraction apparatus, method, and program for extracting feature quantities used to search for an object in an image.
  • In Non-Patent Document 1, a feature vector is extracted from an image by spatially pooling the output of a convolution layer for each region, using a CNN trained for image classification.
  • The inner product of the feature quantity vectors is then calculated for two different images; the larger the value, the more likely the two images are considered to show the same object.
  • Non-Patent Document 2 describes a method of training a CNN using teacher labels that indicate the object contained in each image. Triplets of images are created from a group of about 200,000 teacher images, and the CNN is trained so that the distance between image pairs with the same teacher label becomes smaller than the distance between image pairs with different teacher labels. As a result, highly accurate object retrieval is realized.
  • The method of Non-Patent Document 1 extracts a feature vector from a convolutional neural network (CNN) trained to distinguish the categories of a group of training images.
  • The method of Non-Patent Document 2 can obtain feature vectors that discriminate different objects with high accuracy by training the CNN with teacher images prepared for object search.
  • However, preparing a large number of teacher images for object search requires a great deal of cost.
  • As an example, preparing 10,000 teacher images of the target category "automobile" and preparing 10,000 teacher images of one specific car model within that category are not equally hard: the latter is more difficult and more costly.
  • The present invention has been made to solve the above problems, and its purpose is to provide a feature quantity extraction apparatus, method, and program capable of extracting feature quantities for accurately searching for an object in an image in a target category.
  • A feature quantity extraction device according to a first invention is a feature quantity extraction device that extracts a feature quantity from an arbitrary image. It includes a feature map calculation unit that inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, as a feature map representing the features of the image; a weight calculation unit that calculates a weight representing the influence of the feature map on the target category using a previously trained classifier that classifies the image category with the feature map as input; and a feature conversion unit that calculates a feature quantity vector based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  • In the feature quantity extraction device according to the first invention, the weight calculation unit may use the classifier to calculate a weight representing the influence of the feature map on the target category for each channel and for each output position.
  • The weight calculation unit may also calculate such per-channel, per-position weights and then integrate them: a weight for each output position obtained by integrating the weights of all channels at the same output position, or a weight for each channel obtained by integrating the weights of all output positions within the same channel.
  • The classifier may output a probability for each category as the classification result, and the weight calculation unit may calculate the weight using the derivative of the probability of the target category.
  • A matching unit may further be included that collates the feature quantity vector calculated for the image with feature quantity vectors extracted in advance from each reference image of the target category, and outputs the reference images corresponding to the image as search results.
  • A feature quantity extraction method according to a second invention is a feature quantity extraction method in a feature quantity extraction apparatus that extracts a feature quantity from an arbitrary image. It includes a step in which a feature map calculation unit inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, as a feature map representing the features of the image; a step in which a weight calculation unit calculates a weight representing the influence of the feature map on the target category using a previously trained classifier that classifies the image category with the feature map as input; and a step in which a feature conversion unit calculates a feature quantity vector based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  • A program according to a third invention is a program for causing a computer to function as each unit of the feature quantity extraction device according to the first invention.
  • According to the feature quantity extraction apparatus, method, and program of the present invention, an arbitrary image is input to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category; the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, is calculated as a feature map representing the features of the image; a previously trained classifier that classifies the image category with the feature map as input is used to calculate a weight representing the influence of the feature map on the target category; and a feature quantity vector is calculated based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category. This makes it possible to extract feature quantities for accurately searching for an object in an image in the target category.
  • The feature quantity extraction apparatus 1 illustrated in FIG. 1 extracts a feature quantity vector 6 with which a reference image containing the same object as a query image, in which an object belonging to a specific target category appears, can be searched with high accuracy.
  • In the following description, the image 4 corresponds to the query image, and the reference image set 5 corresponds to an image set consisting of one or more reference images. A reference image of the reference image set 5 may also be used as the image 4.
  • The feature quantity extraction device 1 can be configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing a feature quantity extraction processing routine described later and various data.
  • The feature quantity extraction device 1 includes a feature map calculation unit 11, a weight calculation unit 12, a feature conversion unit 13, and a matching unit 14.
  • The feature quantity extraction device 1 of this embodiment exchanges information with a database 2 via communication means (not shown).
  • The database 2 can be implemented, for example, by a file system mounted on a general-purpose computer.
  • The database 2 stores in advance the data of each reference image of the reference image set 5 for each category and the feature vector extracted from each reference image.
  • Each reference image is given an identifier that can uniquely identify it, such as a serial-number ID (identification) or a unique reference image file name.
  • For each reference image, the database 2 stores the identifier of the reference image and the image data of the reference image in association with each other.
  • Alternatively, the database 2 may be implemented with an RDBMS (Relational Database Management System) or the like.
  • The information stored in the database 2 may additionally include, as metadata, information expressing the content of a reference image (such as its title, a summary sentence, or keywords) and information about the format of a reference image (such as its data size or thumbnail size), but storing such information is not essential to the implementation of the present disclosure.
  • The database 2 may be provided either inside or outside the feature quantity extraction apparatus 1, and any known communication means can be used. In this embodiment, the database 2 is assumed to be provided outside the feature quantity extraction device 1 and to be communicably connected to it via the Internet and a network protocol such as TCP/IP (Transmission Control Protocol/Internet Protocol).
  • Each unit of the feature quantity extraction device 1 and the database 2 may be configured by a computer or server provided with an arithmetic processing device such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) and storage devices such as a RAM (Random Access Memory), a ROM (Read Only Memory), and an HDD (Hard Disk Drive), and the processing of each unit may be executed by a program.
  • This program may be stored in advance in the storage device of the feature quantity extraction device 1, or it may be stored on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided through a network.
  • Of course, none of the components need be realized by a single computer or server; they may be distributed over a plurality of computers connected by a network.
  • The feature map calculation unit 11 inputs the image 4 to a convolutional neural network (CNN) trained in advance to identify the category of an image from a plurality of categories including the target category, and calculates the output of a convolutional layer of the CNN, obtained for each image 4 for each channel and each output position, as a feature map representing the features of the image 4.
  • The feature map can be obtained, for example, by extracting the output of an arbitrary intermediate layer of the CNNs called VGG-16 and VGG-19 described in Non-Patent Document 3, or ResNet-50 and ResNet-101 described in Non-Patent Document 4. In the most preferred example, the feature map is obtained from the final convolutional layer, i.e. the layer immediately before the fully connected layers. In the following, for explanation, the feature map is assumed to be obtained from the final convolution layer of VGG-16 (for example, the third layer in the fifth block). A feature map can thus be extracted from the image 4.
  • Non-Patent Document 3 K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, In ICLR, 2015.
  • Non-Patent Document 4 K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, In CVPR, 2016.
  • The weight calculation unit 12 calculates a weight representing the influence of the extracted feature map on the target category, using a previously trained classifier that classifies the category of the image with the feature map as input.
  • The weight calculation unit 12 calculates weights that are large where the influence on the target category is large and small where the influence is small.
  • First, the feature map is input to the classifier, and a probability for each category is obtained as the category classification result of the image 4.
  • Any classifier may be used as long as it outputs a probability for each category as the classification result; any known classifier is acceptable, but a fully connected multilayer neural network whose final layer is a softmax layer is preferred.
  • In one example of this embodiment, the layers after the final convolutional layer of the VGG-16 used in the feature map extraction process are used as the classifier.
  • Next, for the target category, a weight representing the degree of influence on the target category in the feature map is calculated for each channel and each output position of the convolution layer.
  • In this case, the weight corresponds to the output positions (vertical and horizontal) and the channels of the feature map, and has the same size as the feature map (height, width, and number of channels).
  • The method for obtaining the degree of influence is not limited; as one example, the weight is calculated using the derivative of the probability of the target category with respect to the extracted feature map, as described in Non-Patent Document 5.
  • Alternatively, a weight for each channel can be calculated by integrating the weights for each output position within the same channel.
  • In this case, the derivative is averaged within each channel, and the value averaged for each channel can be used as the weight.
  • Such a weight corresponds to the channels of the feature map, has as many values as the feature map has channels, and takes a larger value for channels that are more strongly influenced by the category.
  • Likewise, a weight for each output position can be calculated by integrating the weights of all channels at the same output position.
  • In this case, the weight corresponds to the output positions of the feature map and has the same height and width as the feature map.
  • The integration can be performed by a known method such as summing, averaging, or taking the maximum of the values of all channels at each output position.
  • Finally, the weight calculation unit 12 normalizes the weight calculated by the above procedure and outputs it. Since the weight may contain negative values, it is preferable to apply a process such as subtracting the minimum value, subtracting the minimum value and then dividing by the maximum value, or replacing negative values with 0.
  • With the above processing, a weight is obtained that takes a larger value at locations (feature map channels, output positions, or both) where the influence of the category is larger.
  • The feature conversion unit 13 calculates the feature quantity vector 6 based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  • The feature conversion unit 13 first applies the weight to the feature map.
  • When the weight corresponds to both the output positions and the channels of the feature map, it has the same size as the feature map, so the simplest approach is to subtract the corresponding weight value from each pixel (output position) of the feature map.
  • In this way, the influence of the category in the feature map can be suppressed.
  • Alternatively, the weight values may be normalized to the range 0 to 1, subtracted from 1, and multiplied with the feature map. If the weight corresponds only to the output positions or only to the channels of the feature map, it can be applied by performing the same processing on each value of the feature map at the corresponding output position or channel.
  • The influence of the category can also be suppressed by setting the value of the feature map to 0 at locations where the weight is equal to or greater than a certain value.
  • The certain value may be defined in advance or may be, for example, the average of the weights.
  • The feature conversion unit 13 then obtains the feature quantity vector 6 from the feature map to which the weight has been applied.
  • A known method may be used to calculate the feature quantity vector 6 from the feature map; for example, the method described in Non-Patent Document 1 may be used. In this case, rectangles of various sizes are first defined, and a vector of size (number of rectangles × number of channels) is obtained by taking the maximum value within each rectangle for each channel. By normalizing this group of vectors, summing the values of the same channel, and normalizing again, the result can be expressed as a feature quantity vector 6 whose dimension equals the number of channels.
  • Any known normalization may be used, but L2 normalization is preferred.
  • The weight may also be applied after the feature quantity vector 6 is calculated; for example, a feature vector computed from the weights corresponding to the output positions and channels of the feature map may be subtracted from the feature quantity vector 6 computed from the feature map. In this case, it is preferable to normalize each feature vector before and after the subtraction.
  • The matching unit 14 collates the feature quantity vector 6 calculated for the image 4 with the feature quantity vectors extracted from the reference images of the target category stored in the database 2, and outputs the reference images corresponding to the image 4 as the search result 7.
  • The similarity may be obtained by any known measure such as the inner product or the cosine similarity.
  • Reference images whose semantic content is identical or close are output as the search result 7 in descending order of this similarity.
  • A known indexing method may also be used here; for example, the feature quantity vector 6 may be hashed using the method disclosed in Patent Document 1 to find approximately similar reference images.
  • The feature quantity extraction apparatus 1 executes the feature quantity extraction processing routine shown in FIG. 2.
  • In step S101, the feature map calculation unit 11 inputs the image 4 to the CNN trained in advance to identify the category of an image from a plurality of categories including the target category, and calculates the output of a convolutional layer of the CNN, obtained for each image 4 for each channel and each output position, as a feature map representing the features of the image 4.
  • In step S102, the weight calculation unit 12 calculates a weight representing the influence of the extracted feature map on the target category, using a previously trained classifier that classifies the category of the image with the feature map as input.
  • Specifically, the feature map is input to the classifier, the probability for each category is obtained as the category classification result of the image 4, and a weight representing the degree of influence on the target category in the feature map is calculated for each channel and each output position of the feature map, using the feature map and the probability of the target category.
  • In step S103, the feature conversion unit 13 calculates the feature quantity vector 6 based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  • In step S104, the feature quantity vector 6 calculated for the image 4 is collated with the feature quantity vectors extracted from the reference images of the target category stored in the database 2, and the reference images corresponding to the image 4 are output as the search result 7.
  • As described above, according to the feature quantity extraction apparatus of this embodiment, an arbitrary image is input to a CNN trained in advance to identify the category of an image from a plurality of categories including a target category; the output of a convolutional layer of the CNN, obtained for each image for each channel and each output position, is calculated as a feature map representing the features of the image; a previously trained classifier that classifies the image category with the feature map as input is used to calculate a weight representing the influence of the feature map on the target category; and a feature quantity vector is calculated based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category. In this way, feature quantities for accurately searching for an object in an image in the target category can be extracted.
  • Although the matching unit 14 is provided inside the feature quantity extraction device 1 in the embodiment described above, the present invention is not limited to this, and a matching device may be provided externally. In that case, the matching device is connected to the feature quantity extraction device and the database so that they can communicate with each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention makes it possible to extract a feature amount for accurately searching for an object within an image in a target category. An arbitrary image is input into a convolutional neural network that has been trained in advance to identify image categories from among a plurality of categories including a target category. The output obtained for each image, for each output position and each channel of a convolutional layer of the convolutional neural network, is calculated as a feature map representing the features of the image. A classifier that has been trained in advance and that classifies the category of an image with the feature map as input is used to calculate a weight representing the influence of the feature map on the target category. The weight is applied to the feature map so as to remove the influence on the target category, and a feature amount vector is calculated on the basis of the resulting feature map.

Description

Feature amount extraction apparatus, method, and program
The present invention relates to a feature quantity extraction apparatus, method, and program, and more particularly to a feature quantity extraction apparatus, method, and program for extracting feature quantities used to search for an object in an image.

With the spread of small imaging devices such as smartphones, there is a growing demand for technology that searches for objects appearing in images of arbitrary subjects taken in various places and environments.

Conventionally, various techniques for searching for objects using a convolutional neural network (CNN) have been devised and disclosed. A typical procedure is described here following the technique of Non-Patent Document 1. First, using a CNN trained for image classification, a feature vector is extracted from an image by spatially pooling the output of a convolution layer for each region. Next, the inner product of the feature vectors of two different images is calculated; the larger the value, the more likely the two images are considered to show the same object. By building a reference image database in advance from images containing the objects to be recognized (reference images) and searching for reference images that show the same object as a newly input image (query image), the object present in the query image can be identified.

Non-Patent Document 2 describes a method of training a CNN using teacher labels that indicate the object contained in each image. Triplets of images are created from a group of about 200,000 teacher images, and the CNN is trained so that the distance between image pairs with the same teacher label becomes smaller than the distance between image pairs with different teacher labels. As a result, highly accurate object retrieval is realized.

The method of Non-Patent Document 1 extracts a feature vector from a convolutional neural network (CNN) trained to distinguish the categories of a group of training images. When multiple objects of the same category are present, the distances between their feature vectors become unnecessarily small. As a result, there is a problem that the search accuracy for objects that look similar but are in fact different deteriorates.

The method of Non-Patent Document 2 can obtain feature vectors that discriminate different objects with high accuracy by training the CNN with teacher images prepared for object search. However, preparing a large number of teacher images for object search requires a great deal of cost. As an example, preparing 10,000 teacher images of the target category "automobile" is easier and cheaper than preparing 10,000 teacher images of one specific car model within that category.

As described above, no method has so far been devised that achieves high-accuracy object search with a CNN without using teacher images for object search.

The present invention has been made to solve the above problems, and its purpose is to provide a feature quantity extraction apparatus, method, and program capable of extracting feature quantities for accurately searching for an object in an image in a target category.
In order to achieve the above object, a feature quantity extraction device according to a first invention is a feature quantity extraction device that extracts a feature quantity from an arbitrary image, and includes: a feature map calculation unit that inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, as a feature map representing the features of the image; a weight calculation unit that calculates a weight representing the influence of the feature map on the target category using a previously trained classifier that classifies the image category with the feature map as input; and a feature conversion unit that calculates a feature quantity vector based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.

In the feature quantity extraction device according to the first invention, the weight calculation unit may use the classifier to calculate a weight representing the influence of the feature map on the target category for each channel and for each output position.

In the feature quantity extraction device according to the first invention, the weight calculation unit may use the classifier to calculate a weight representing the influence of the feature map on the target category for each channel and for each output position, and may then calculate a weight for each output position obtained by integrating the weights of all channels at the same output position, or a weight for each channel obtained by integrating the weights of all output positions within the same channel.

In the feature quantity extraction device according to the first invention, the classifier may output a probability for each category as the classification result, and the weight calculation unit may calculate the weight using the derivative of the probability of the target category.

The feature quantity extraction device according to the first invention may further include a matching unit that collates the feature quantity vector calculated for the image with feature quantity vectors extracted in advance from each reference image of the target category, and outputs the reference images corresponding to the image as search results.

A feature quantity extraction method according to a second invention is a feature quantity extraction method in a feature quantity extraction apparatus that extracts a feature quantity from an arbitrary image, and includes: a step in which a feature map calculation unit inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, as a feature map representing the features of the image; a step in which a weight calculation unit calculates a weight representing the influence of the feature map on the target category using a previously trained classifier that classifies the image category with the feature map as input; and a step in which a feature conversion unit calculates a feature quantity vector based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.

A program according to a third invention is a program for causing a computer to function as each unit of the feature quantity extraction device according to the first invention.

According to the feature quantity extraction apparatus, method, and program of the present invention, an arbitrary image is input to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category; the output of a convolutional layer of the convolutional neural network, obtained for each image for each channel and each output position, is calculated as a feature map representing the features of the image; a previously trained classifier that classifies the image category with the feature map as input is used to calculate a weight representing the influence of the feature map on the target category; and a feature quantity vector is calculated based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category. This makes it possible to extract feature quantities for accurately searching for an object in an image in the target category.
FIG. 1 is a block diagram showing the configuration of a feature quantity extraction apparatus according to an embodiment of the present invention. FIG. 2 is a flowchart showing a feature quantity extraction processing routine in the feature quantity extraction apparatus according to the embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Configuration of Feature Quantity Extraction Device According to Embodiment of the Present Invention>
Next, the configuration of the feature quantity extraction device according to the embodiment of the present invention will be described. The feature quantity extraction apparatus 1 shown in FIG. 1 extracts a feature quantity vector 6 with which a reference image containing the same object as a query image, in which an object belonging to a specific target category appears, can be searched with high accuracy. In the following description, the image 4 corresponds to the query image, and the reference image set 5 corresponds to an image set consisting of one or more reference images. A reference image of the reference image set 5 may also be used as the image 4.

The feature quantity extraction device 1 shown in FIG. 1 can be configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing a feature quantity extraction processing routine described later and various data. The feature quantity extraction device 1 includes a feature map calculation unit 11, a weight calculation unit 12, a feature conversion unit 13, and a matching unit 14.

The feature quantity extraction device 1 of this embodiment exchanges information with a database 2 via communication means (not shown).

The database 2 can be implemented, for example, by a file system mounted on a general-purpose computer. In this embodiment, as an example, the database 2 stores in advance the data of each reference image of the reference image set 5 for each category and the feature vector extracted from each reference image. Each reference image is given an identifier that can uniquely identify it, such as a serial-number ID (identification) or a unique reference image file name. For each reference image, the database 2 stores the identifier of the reference image and the image data of the reference image in association with each other. Alternatively, the database 2 may be implemented with an RDBMS (Relational Database Management System) or the like. The information stored in the database 2 may additionally include, as metadata, information expressing the content of a reference image (such as its title, a summary sentence, or keywords) and information about the format of a reference image (such as its data size or thumbnail size), but storing such information is not essential to the implementation of the present disclosure.

The database 2 may be provided either inside or outside the feature quantity extraction apparatus 1, and any known communication means can be used. In this embodiment, the database 2 is assumed to be provided outside the feature quantity extraction device 1 and to be communicably connected to it via the Internet and a network protocol such as TCP/IP (Transmission Control Protocol/Internet Protocol).

Each unit of the feature quantity extraction device 1 and the database 2 may be configured by a computer or server provided with an arithmetic processing device such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) and storage devices such as a RAM (Random Access Memory), a ROM (Read Only Memory), and an HDD (Hard Disk Drive), and the processing of each unit may be executed by a program. This program may be stored in advance in the storage device of the feature quantity extraction device 1, or it may be stored on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided through a network. Of course, none of the components need be realized by a single computer or server; they may be distributed over a plurality of computers connected by a network.
The feature map calculation unit 11 inputs the image 4 to a convolutional neural network (CNN) trained in advance to identify the category of an image from a plurality of categories including the target category, and calculates the output of a convolutional layer of the CNN, obtained for each image 4 for each channel and each output position, as a feature map representing the features of the image 4.

Any known network may be used as the CNN, provided that it has been trained in advance to distinguish the categories of the image 4. The feature map can be obtained, for example, by extracting the output of an arbitrary intermediate layer of the CNNs called VGG-16 and VGG-19 described in Non-Patent Document 3, or ResNet-50 and ResNet-101 described in Non-Patent Document 4. In the most preferred example, the feature map is obtained from the final convolutional layer, i.e. the layer immediately before the fully connected layers. In the following, for explanation, the feature map is assumed to be obtained from the final convolution layer of VGG-16 (for example, the third layer in the fifth block). A feature map can thus be extracted from the image 4.
[Non-Patent Document 3] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, In ICLR, 2015.

[Non-Patent Document 4] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, In CVPR, 2016.
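The feature map extraction just described can be sketched in a few lines of Python. This is only a minimal illustration, assuming PyTorch and the torchvision VGG-16 model (the patent does not prescribe a framework); the helper name compute_feature_map, the preprocessing values, and the use of ImageNet weights as a stand-in for a category-trained CNN are choices made for this example.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# VGG-16 trained for category classification (ImageNet weights as a stand-in).
vgg = models.vgg16(pretrained=True).eval()
# Truncate the convolutional stack at the ReLU after conv5_3, i.e. the final
# convolution layer (third layer of the fifth block), before the last max-pool.
conv_layers = vgg.features[:30]

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def compute_feature_map(image_path: str) -> torch.Tensor:
    """Return the final convolutional layer output of shape (channels, height, width)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = conv_layers(x)          # e.g. (1, 512, 14, 14) for a 224x224 input
    return fmap.squeeze(0)
```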
The weight calculation unit 12 calculates a weight representing the influence of the extracted feature map on the target category, using a previously trained classifier that classifies the category of the image with the feature map as input.
When the image 4 is classified into the target category, the weight calculation unit 12 calculates weights that are large where the influence on the target category is large and small where the influence is small.

First, the feature map is input to the classifier, and a probability for each category is obtained as the category classification result of the image 4. Any classifier may be used as long as it outputs a probability for each category as the classification result; any known classifier is acceptable, but a fully connected multilayer neural network whose final layer is a softmax layer is preferred. In one example of this embodiment, the layers after the final convolutional layer of the VGG-16 used in the feature map extraction process are used as the classifier.

Next, for the target category, a weight representing the degree of influence on the target category in the feature map is calculated for each channel and each output position of the convolution layer, using the feature map and the probability of the target category. In this case, the weight corresponds to the output positions (vertical and horizontal) and the channels of the feature map, and has the same size as the feature map (height, width, and number of channels). The method for obtaining the degree of influence is not limited; as one example, the weight is calculated using the derivative of the probability of the target category with respect to the extracted feature map, as described in Non-Patent Document 5.

[Non-Patent Document 5] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization, In ICCV, 2017.

A weight for each channel can also be calculated by integrating the weights for each output position within the same channel. In this case, the derivative is averaged within each channel, and the value averaged for each channel can be used as the weight. Such a weight corresponds to the channels of the feature map, has as many values as the feature map has channels, and takes a larger value for channels that are more strongly influenced by the category.

A weight for each output position can also be calculated by integrating the weights of all channels at the same output position. In this case, the weight corresponds to the output positions of the feature map and has the same height and width as the feature map. The integration can be performed by a known method such as summing, averaging, or taking the maximum of the values of all channels at each output position.

Finally, the weight calculation unit 12 normalizes the weight calculated by the above procedure and outputs it. Since the weight may contain negative values, it is preferable to apply a process such as subtracting the minimum value, subtracting the minimum value and then dividing by the maximum value, or replacing negative values with 0.

With the above processing, a weight is obtained that takes a larger value at locations (feature map channels, output positions, or both) where the influence of the category is larger.
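To make the weight computation concrete, the following sketch follows the Grad-CAM-style derivative described above (Non-Patent Document 5). It assumes the PyTorch VGG-16 and the compute_feature_map helper from the earlier sketch; the function name compute_weights and the use of the softmax probability as the quantity being differentiated are illustrative assumptions, not a definitive implementation of the patent.

```python
import torch
import torch.nn.functional as F

def compute_weights(vgg, fmap: torch.Tensor, target_category: int,
                    per_channel: bool = True) -> torch.Tensor:
    """Weight expressing how strongly the feature map influences the target category."""
    fmap = fmap.unsqueeze(0).requires_grad_(True)          # (1, C, H, W)
    # Use the layers of VGG-16 after the final convolution as the classifier:
    # remaining max-pool, average pool, fully connected layers, then softmax.
    x = vgg.features[30](fmap)
    x = vgg.avgpool(x).flatten(1)
    probs = F.softmax(vgg.classifier(x), dim=1)            # probability per category
    # Derivative of the target-category probability with respect to the feature map.
    grads, = torch.autograd.grad(probs[0, target_category], fmap)
    grads = grads.squeeze(0)                                # (C, H, W)
    if per_channel:
        w = grads.mean(dim=(1, 2))   # integrate over output positions: one weight per channel
    else:
        w = grads                    # one weight per channel and per output position
    # Normalise: shift so the minimum is 0, then divide by the maximum.
    w = w - w.min()
    return w / (w.max() + 1e-12)
```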
The feature conversion unit 13 calculates the feature quantity vector 6 based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.

The processing of the feature conversion unit 13 consists of a weight application process and a vectorization process.

The feature conversion unit 13 first applies the weight to the feature map. When the weight corresponds to both the output positions and the channels of the feature map, it has the same size as the feature map, so the simplest approach is to subtract the corresponding weight value from each pixel (output position) of the feature map; in this way, the influence of the category in the feature map can be suppressed. Alternatively, the weight values may be normalized to the range 0 to 1, subtracted from 1, and multiplied with the feature map. If the weight corresponds only to the output positions or only to the channels of the feature map, it can be applied by performing the same processing on each value of the feature map at the corresponding output position or channel.

The influence of the category can also be suppressed by setting the value of the feature map to 0 at locations where the weight is equal to or greater than a certain value. The certain value may be defined in advance or may be, for example, the average of the weights.

The feature conversion unit 13 then obtains the feature quantity vector 6 from the feature map to which the weight has been applied. A known method may be used to calculate the feature quantity vector 6 from the feature map; for example, the method described in Non-Patent Document 1 may be used. In this case, rectangles of various sizes are first defined, and a vector of size (number of rectangles × number of channels) is obtained by taking the maximum value within each rectangle for each channel. By normalizing this group of vectors, summing the values of the same channel, and normalizing again, the result can be expressed as a feature quantity vector 6 whose dimension equals the number of channels. Any known normalization may be used, but L2 normalization is preferred.

The weight may also be applied after the feature quantity vector 6 is calculated; for example, a feature vector computed from the weights corresponding to the output positions and channels of the feature map may be subtracted from the feature quantity vector 6 computed from the feature map. In this case, it is preferable to normalize each feature vector before and after the subtraction.
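As a concrete illustration of one of the variants above, the sketch below multiplies the feature map by (1 - weight) for the per-channel case and then reduces the weighted map to a vector by a single global max pooling followed by L2 normalization. This is a deliberate simplification of the multi-rectangle pooling of Non-Patent Document 1, and the helper name to_feature_vector is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def to_feature_vector(fmap: torch.Tensor, channel_weight: torch.Tensor) -> torch.Tensor:
    """Apply a per-channel category weight to the feature map and vectorize it."""
    # Weights lie in [0, 1]; (1 - w) suppresses the channels that most
    # strongly drive the target category.
    weighted = fmap * (1.0 - channel_weight).view(-1, 1, 1)   # (C, H, W)
    # Single global region instead of the rectangles of Non-Patent Document 1.
    v = weighted.amax(dim=(1, 2))                              # (C,)
    return F.normalize(v, p=2, dim=0)                          # L2 normalization
```

A fuller version would instead take the maximum within several rectangles per channel, normalize each resulting vector, sum over the rectangles, and L2-normalize again, as described above.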
The matching unit 14 collates the feature quantity vector 6 calculated for the image 4 with the feature quantity vectors extracted from the reference images of the target category stored in the database 2, and outputs the reference images corresponding to the image 4 as the search result 7.

The similarity may be obtained by any known measure such as the inner product or the cosine similarity. Reference images whose semantic content is identical or close are output as the search result 7 in descending order of this similarity. Alternatively, a known indexing method may be used; for example, the feature quantity vector 6 may be hashed using the method disclosed in Patent Document 1 to find approximately similar reference images.

[Patent Document 1] JP 2013-68884 A
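The matching step can likewise be sketched as a simple exhaustive comparison. The in-memory dictionary reference_vecs stands in for the feature vectors stored in the database 2; in a large-scale setting, hashing or another indexing scheme such as that of Patent Document 1 would replace the linear scan.

```python
import torch

def search(query_vec: torch.Tensor, reference_vecs: dict, top_k: int = 10):
    """Rank reference images by similarity to the query feature vector.

    reference_vecs maps an image identifier to its L2-normalised feature vector,
    so the inner product below equals the cosine similarity.
    """
    scores = {image_id: float(torch.dot(query_vec, vec))
              for image_id, vec in reference_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
```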
<Operation of Feature Quantity Extraction Device According to Embodiment of the Present Invention>
Next, the operation of the feature quantity extraction device 1 according to the embodiment of the present invention will be described. The feature quantity extraction apparatus 1 executes the feature quantity extraction processing routine shown in FIG. 2.

First, in step S101, the feature map calculation unit 11 inputs the image 4 to the CNN trained in advance to identify the category of an image from a plurality of categories including the target category, and calculates the output of a convolutional layer of the CNN, obtained for each image 4 for each channel and each output position, as a feature map representing the features of the image 4.

Next, in step S102, the weight calculation unit 12 calculates a weight representing the influence of the extracted feature map on the target category, using a previously trained classifier that classifies the category of the image with the feature map as input. Specifically, the feature map is input to the classifier, the probability for each category is obtained as the category classification result of the image 4, and a weight representing the degree of influence on the target category in the feature map is calculated for each channel and each output position of the feature map, using the feature map and the probability of the target category.

In step S103, the feature conversion unit 13 calculates the feature quantity vector 6 based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.

In step S104, the feature quantity vector 6 calculated for the image 4 is collated with the feature quantity vectors extracted from the reference images of the target category stored in the database 2, and the reference images corresponding to the image 4 are output as the search result 7.
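Putting the four steps together, a minimal end-to-end sketch of this routine, reusing the hypothetical helpers from the earlier sketches, might look as follows; the target-category index, the file name, and the precomputed reference_vecs dictionary are placeholders for this example.

```python
TARGET_CATEGORY = 0   # index of the target category in the classifier output (placeholder)
# reference_vecs: {image_id: vector} computed in advance from the reference images.

fmap = compute_feature_map("query.jpg")                  # S101: feature map
weight = compute_weights(vgg, fmap, TARGET_CATEGORY)     # S102: category weight
query_vec = to_feature_vector(fmap, weight)              # S103: feature quantity vector
results = search(query_vec, reference_vecs)              # S104: match against database
for image_id, score in results:
    print(image_id, score)
```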
As described above, according to the feature quantity extraction device of the embodiment of the present invention, an arbitrary image is input to a CNN trained in advance to identify the category of an image from a plurality of categories including a target category; the output of a convolutional layer of the CNN, obtained for each image for each channel and each output position, is calculated as a feature map representing the features of the image; a previously trained classifier that classifies the image category with the feature map as input is used to calculate a weight representing the influence of the feature map on the target category; and a feature quantity vector is calculated based on the feature map obtained by applying the weight to the feature map so as to remove the influence on the target category. In this way, feature quantities for accurately searching for an object in an image in the target category can be extracted.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the case where the matching unit 14 is provided in the feature quantity extraction device 1 has been described, the present invention is not limited to this, and a matching device may be provided externally. In that case, the matching device is connected to the feature quantity extraction device and the database so that they can communicate with each other.
DESCRIPTION OF SYMBOLS
1 Feature amount extraction device
2 Database
4 Image
5 Reference image set
6 Feature vector
7 Search result
11 Feature map calculation unit
12 Weight calculation unit
13 Feature conversion unit
14 Matching unit

Claims (7)

  1.  A feature amount extraction device that extracts a feature amount from an arbitrary image, the device comprising:
     a feature map calculation unit that inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates, as a feature map representing features of the image, the outputs of a convolutional layer of the convolutional neural network obtained for the image for each channel and each output position;
     a weight calculation unit that calculates a weight representing an influence of the feature map on the target category, using a classifier trained in advance to classify the category of the image with the feature map as input; and
     a feature conversion unit that calculates a feature vector based on a feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  2.  The feature amount extraction device according to claim 1, wherein the weight calculation unit uses the classifier to calculate, for each channel and each output position, a weight representing the influence of the feature map on the target category.
  3.  The feature amount extraction device according to claim 1, wherein the weight calculation unit uses the classifier to calculate, for each channel and each output position, a weight representing the influence of the feature map on the target category, and calculates either a weight for each output position, obtained by integrating the weights of the channels at the same output position, or a weight for each channel, obtained by integrating the weights of the output positions in the same channel.
  4.  The feature amount extraction device according to any one of claims 1 to 3, wherein the classifier outputs a probability for each category as a classification result, and the weight calculation unit calculates the weight using a derivative of the probability of the target category.
  5.  The feature amount extraction device according to any one of claims 1 to 4, further comprising a matching unit that matches the feature vector calculated for the image against feature vectors extracted in advance from each of reference images for the target category, and outputs each of the reference images corresponding to the image as a search result.
  6.  A feature amount extraction method in a feature amount extraction device that extracts a feature amount from an arbitrary image, the method comprising:
     a step in which a feature map calculation unit inputs the arbitrary image to a convolutional neural network trained in advance to identify the category of an image from a plurality of categories including a target category, and calculates, as a feature map representing features of the image, the outputs of a convolutional layer of the convolutional neural network obtained for the image for each channel and each output position;
     a step in which a weight calculation unit calculates a weight representing an influence of the feature map on the target category, using a classifier trained in advance to classify the category of the image with the feature map as input; and
     a step in which a feature conversion unit calculates a feature vector based on a feature map obtained by applying the weight to the feature map so as to remove the influence on the target category.
  7.  A program for causing a computer to function as each unit of the feature amount extraction device according to any one of claims 1 to 5.
PCT/JP2019/020948 2018-06-01 2019-05-27 Feature amount extraction device, method, and program WO2019230666A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-106237 2018-06-01
JP2018106237A JP2019211913A (en) 2018-06-01 2018-06-01 Feature quantity extraction device, method, and program

Publications (1)

Publication Number Publication Date
WO2019230666A1 true WO2019230666A1 (en) 2019-12-05

Family

ID=68698150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/020948 WO2019230666A1 (en) 2018-06-01 2019-05-27 Feature amount extraction device, method, and program

Country Status (2)

Country Link
JP (1) JP2019211913A (en)
WO (1) WO2019230666A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239965A (en) * 2021-04-12 2021-08-10 北京林业大学 Bird identification method based on deep neural network and electronic equipment
CN113821670A (en) * 2021-07-23 2021-12-21 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200143238A1 (en) * 2018-11-07 2020-05-07 Facebook, Inc. Detecting Augmented-Reality Targets
JP7357454B2 (en) * 2019-03-25 2023-10-06 三菱電機株式会社 Feature identification device, feature identification method, and feature identification program
WO2021130881A1 (en) * 2019-12-25 2021-07-01 三菱電機株式会社 Object detection device, monitoring device, and learning device
TWI790572B (en) * 2021-03-19 2023-01-21 宏碁智醫股份有限公司 Detecting method and detecting apparatus related to image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017129990A (en) * 2016-01-19 2017-07-27 国立大学法人豊橋技術科学大学 Device, method, and program for image recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017129990A (en) * 2016-01-19 2017-07-27 国立大学法人豊橋技術科学大学 Device, method, and program for image recognition

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239965A (en) * 2021-04-12 2021-08-10 北京林业大学 Bird identification method based on deep neural network and electronic equipment
CN113239965B (en) * 2021-04-12 2023-05-02 北京林业大学 Bird recognition method based on deep neural network and electronic equipment
CN113821670A (en) * 2021-07-23 2021-12-21 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium
CN113821670B (en) * 2021-07-23 2024-04-16 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
JP2019211913A (en) 2019-12-12

Similar Documents

Publication Publication Date Title
WO2019230666A1 (en) Feature amount extraction device, method, and program
US11416710B2 (en) Feature representation device, feature representation method, and program
CN105354307B (en) Image content identification method and device
Kim et al. An Efficient Color Space for Deep‐Learning Based Traffic Light Recognition
US8533162B2 (en) Method for detecting object
CN110073367B (en) Multi-view embedding with SOFT-MAX based compatibility function for zero sample learning
EP3029606A2 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
US11928790B2 (en) Object recognition device, object recognition learning device, method, and program
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
US20170262478A1 (en) Method and apparatus for image retrieval with feature learning
JP6892606B2 (en) Positioning device, position identification method and computer program
CN111079847A (en) Remote sensing image automatic labeling method based on deep learning
CN113033438B (en) Data feature learning method for modal imperfect alignment
Choi et al. Face video retrieval based on the deep CNN with RBF loss
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112784754A (en) Vehicle re-identification method, device, equipment and storage medium
CN111611395B (en) Entity relationship identification method and device
JP6373292B2 (en) Feature generation apparatus, method, and program
Kishan et al. Handwritten character recognition using CNN
Aljuaidi et al. Mini-batch vlad for visual place retrieval
Ouni et al. Deep learning for robust information retrieval system
Ayech et al. A content-based image retrieval using PCA and SOM
Holliday et al. Scale-invariant localization using quasi-semantic object landmarks
Bicego et al. Combining free energy score spaces with information theoretic kernels: Application to scene classification
Wang et al. Joint Face Detection and Initialization for Face Alignment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19811751

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19811751

Country of ref document: EP

Kind code of ref document: A1