EP2492826A1 - High-accuracy similarity search system - Google Patents

High-accuracy similarity search system

Info

Publication number
EP2492826A1
EP2492826A1 (Application EP12153718A)
Authority
EP
European Patent Office
Prior art keywords
pivot
data
score
training
enrolled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP12153718A
Other languages
German (de)
French (fr)
Inventor
Takao Murakami
Kenta Takahashi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Publication of EP2492826A1
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes

Definitions

  • The server terminal 200 is composed of: a pivot determination unit 201 that determines M pivots from N pieces of enrolled data; a feature extraction unit 202 that extracts features from raw data; a score calculation unit 203 that calculates a score as a distance (or a degree of similarity) between the features; an index vector generation unit 204 that generates an index vector by using a score for a non-pivot or a pivot of search data; a Δ score calculation unit 205 that calculates a distance (or degree of similarity) (hereinafter called Δ score) between the index vectors; a non-pivot-specific parameter training unit 206 that trains a non-pivot-specific parameter by using training data; a non-pivot selection order determination unit 207 that determines the order to select the non-pivots by using a Δ score between the inputted search data and the non-pivot; a search result output unit 208 that outputs search results based on a score between the search data and the enrolled data; a communication I/F; and a database 210.
  • the database 210 holds master data 220.
  • the master data 220 holds enrollment information 230 of each enrolled user and supplementary information 240.
  • the enrollment information 230 holds, for each piece of the enrolled data, an enrolled data ID 231, raw data 232, and a feature 233.
  • the supplementary information 240 holds: pivot information 241 that indicates which piece of the enrolled data is a pivot; an index 242; and a non-pivot-specific parameter 250.
  • the index 242 holds an index vector 243 for each non-pivot.
  • the non-pivot-specific parameter 250 holds, for each non-pivot, an index vector size 251 and a regression coefficient 252 that is used for logistic regression.
  • the client terminal 300 is composed of: a raw data acquisition unit 301 that acquires raw data; and a communication I/F 302.
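The holding structure just described can be summarized in code. The following is a minimal sketch in Python of the master data layout; the class and field names are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class EnrollmentInfo:               # enrollment information 230
    enrolled_data_id: int           # enrolled data ID 231
    raw_data: bytes                 # raw data 232 (e.g., the image bytes)
    feature: List[float]            # feature 233 (e.g., a color histogram)

@dataclass
class NonPivotParameter:            # non-pivot-specific parameter 250
    index_vector_size: int          # index vector size 251 (Z_i)
    regression_coeff: Tuple[float, float]  # regression coefficient 252 (w_i1, w_i0)

@dataclass
class SupplementaryInfo:            # supplementary information 240
    pivot_ids: List[int]                    # pivot information 241
    index: Dict[int, List[float]]           # index 242: index vector 243 per non-pivot
    params: Dict[int, NonPivotParameter]    # parameter 250 per non-pivot

@dataclass
class MasterData:                   # master data 220
    enrollments: Dict[int, EnrollmentInfo] = field(default_factory=dict)
    supplementary: Optional[SupplementaryInfo] = None
```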
  • FIG. 2 shows hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment.
  • These terminals can be composed of, as shown in the figure: a CPU 500, a memory 501, an HDD 502, an input device 503, an output device 504, and a communication device 505.
  • FIG. 3 shows processing procedures and a data flow of enrollment according to this embodiment.
  • the enrollment terminal 100 acquires raw enrolled data from the user (step S101).
  • the enrollment terminal 100 transmits the raw enrolled data to the server terminal 200 (step S102).
  • the server terminal 200 extracts features for enrollment from the raw enrolled data (step S103).
  • the server terminal 200 saves into the database 210 the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the raw data 232 for enrollment, and the feature 233 for enrollment (step S104).
  • FIG. 4 shows processing procedures and a data flow of supplementary information generation according to this embodiment.
  • This processing is performed between when enrollment processing is performed and when search processing is performed. For example, it is possible to perform this processing immediately after the enrollment or at night on a day when the enrollment is performed.
  • this processing involves two cases: the case where supplementary information is newly generated; and the case where the supplementary information for enrolled data added after the last supplementary information generation is updated.
  • The server terminal 200 acquires the enrollment information 230 of each enrolled user from the database 210 to newly generate supplementary information, or acquires the added enrollment information 230 from the database 210 to update the supplementary information (step S201).
  • To newly generate supplementary information, the server terminal 200 determines M pivots anew from among the raw data 232 of the N pieces of enrollment information 230 (step S202). To update the supplementary information, this step is omitted and the raw data 232 of the added enrollment information 230 is provided as a non-pivot.
  • Methods of determining a pivot include, for example: random selection; and selecting as a pivot, at every pivot selection, the piece whose sum of scores or Δ scores from the pivots determined so far is smallest (or largest). A sketch of both follows.
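A minimal sketch in Python of the two selection methods just mentioned; the greedy variant shown here keeps the candidate whose summed score to the already chosen pivots is largest (spreading the pivots apart), and min() would give the "smallest sum" variant. The function names and the score callback are assumptions for illustration.

```python
import random

def select_pivots(features, m, score, greedy=False):
    # features: list of feature vectors of the N pieces of enrolled data.
    # score:    callable computing the score (distance) between two features.
    ids = list(range(len(features)))
    if not greedy:
        return random.sample(ids, m)        # random selection
    pivots = [random.choice(ids)]           # seed the greedy selection
    while len(pivots) < m:
        rest = [i for i in ids if i not in pivots]
        # pick the candidate farthest (by summed score) from the current pivots
        pivots.append(max(rest, key=lambda i: sum(score(features[i], features[p])
                                                  for p in pivots)))
    return pivots
```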
  • To newly generate supplementary information, the server terminal 200 obtains a score between each pivot and each of the (N-M) non-pivots to generate the index vectors 243; to update the supplementary information, it obtains a score between each pivot and each added non-pivot to generate its index vector 243 (step S203).
  • The server terminal 200 uniquely determines (trains), by using prepared data (training data), the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 used for the logistic regression, for each of the N-M non-pivots to newly generate supplementary information and for each added non-pivot to update the supplementary information (step S204). Details of the method of training the non-pivot-specific parameter 250 will be described below.
  • To newly generate supplementary information, the server terminal 200 saves into the database 210, as the supplementary information 240: the pivot information 241 indicating which piece of the enrolled data is a pivot; the index 242 composed of the index vector 243 of each of the N-M non-pivots; and the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 of each trained non-pivot. To update the supplementary information, the server terminal 200 adds the generated index vector 243 to the index 242 of the database 210, and adds the index vector size 251 and the regression coefficient 252 of each trained non-pivot to the non-pivot-specific parameter 250. In either case, the index vector 243 is saved or added in correspondence with the index vector size 251 of the concerned non-pivot (step S205).
  • FIG. 5 shows processing procedures and a data flow of search according to this embodiment.
  • The server terminal 200 acquires the master data 220 from the database 210 (step S301).
  • the client terminal 300 acquires raw search data from the user (step S302).
  • the client terminal 300 transmits the raw search data to the server terminal 200 (step S303).
  • the server terminal 200 extracts a feature for search from the raw search data (step S304).
  • the server terminal 200 calculates a score between the search data and each pivot (step S305).
  • The server terminal 200, based on the score between the search data and each pivot, generates an index vector of the search data (step S306).
  • The server terminal 200, by using the index vector of the search data, the index 242 including the index vector of each non-pivot, and the index vector size 251 of each non-pivot, calculates a Δ score between the search data and each of the non-pivots (step S307).
  • The server terminal 200, based on the Δ scores ΔSq,M+1, ..., ΔSq,N and by using the regression coefficients wi,1 and wi,0 of the logistic regression of each non-pivot, obtains by Formula 7 the value ai related in a monotonically increasing manner to the posterior probability P(sq,i < r | ΔSq,i), and determines the order to select the non-pivots in descending order of ai (step S308).
  • The server terminal 200 initializes to 0 the number of times t of calculating the score between the search data and a non-pivot (step S309).
  • the server terminal 200 calculates a score between the search data and the non-pivot selected in accordance with the order to select the non-pivots determined at step S308 (step S310).
  • the server terminal 200 increases the number of times t of calculating the score between the search data and the non-pivot by an increment of 1 (step S311).
  • the server terminal 200 proceeds to step S310 if the number of times t of calculating the score between the search data and the non-pivot is equal to or smaller than an upper limit value T and proceeds to step S313 if it is larger than the upper limit value T (step S312).
  • The server terminal 200 transmits the raw data 232 of the enrolled data selected as the search results to the client terminal 300 (step S313).
  • As the search results, a method (k-Nearest Neighbor Search) of selecting k pieces of enrolled data in ascending order (or descending order) of score may be adopted, or a method (Range Search) of providing the enrolled data for which the score is smaller (or larger) than the threshold value r may be adopted.
  • the client terminal 300 displays the raw data 232 as the search results (step S314).
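Steps S309 to S313 amount to walking the determined selection order, computing at most T true scores, and returning the best hits. A minimal sketch under those assumptions (names are illustrative; a k-Nearest-Neighbor style output is shown):

```python
def search_top_k(query_feature, order, features, score, T, k):
    # order: non-pivot IDs in the selection order determined at step S308.
    results = []
    for t, i in enumerate(order):
        if t >= T:                                   # S312: stop after T scores
            break
        results.append((score(query_feature, features[i]), i))  # S310
    results.sort()                                   # smaller score = more similar
    return results[:k]                               # S313: k best as search results
```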
  • Next described is the method of training, in step S204, the parameter 250 composed of the index vector size 251 and the regression coefficient 252 for each non-pivot.
  • As the training data, the (N-1) non-pivots other than the concerned non-pivot for which the parameter is trained may be used, or data previously prepared separately from the enrolled data may be used. Hereinafter, the N' pieces of training data are denoted Q1, ..., QN'; the score between the training data Qj and the non-pivot Xi is denoted sj,i; the Δ score between them is denoted ΔSj,i; and the label Lj,i takes 1 when sj,i is smaller than the threshold value r and 0 otherwise.
  • When the score vector is used as the index vector, ΔSj,i can be expressed as De(Sqj, Si, Ti, Zi) (where Sqj is the score vector of the training data Qj) and can be calculated by Formula 3.
  • When the permutation vector is used as the index vector, ΔSj,i is Dρ(Tqj, Ti, Zi) (where Tqj is the permutation vector of the training data Qj) and can be calculated by Formula 4.
  • The aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels are used for training the regression coefficient wi, as sketched below.
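A minimal sketch of assembling the aggregations ΔSi and Li for one non-pivot Xi from the N' pieces of training data; delta_score and score stand for the Δ score function (Formula 3 or 4) and the true score function, and all names are illustrative assumptions.

```python
def training_pairs(xi_feature, xi_index, train_features, train_indexes,
                   delta_score, score, r):
    # ds:     the aggregation ΔS_i (one Δ score per training datum Q_j).
    # labels: the aggregation L_i, with L_ji = 1 iff the true score s_ji < r.
    ds = [delta_score(tq, xi_index) for tq in train_indexes]
    labels = [1 if score(f, xi_feature) < r else 0 for f in train_features]
    return ds, labels
```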
  • The regression coefficient wi can be trained through maximum a posteriori (MAP) probability estimation:

    w_i^{MAP} = \mathrm{argmax}_{w_i} P(w_i \mid \Delta S_i, L_i)
              = \mathrm{argmax}_{w_i} \frac{P(\Delta S_i, L_i \mid w_i)\, P(w_i)}{P(\Delta S_i, L_i)}
              = \mathrm{argmax}_{w_i} P(\Delta S_i, L_i \mid w_i)\, P(w_i)
              = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i \mid w_i)\, P(w_i)
              = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i)\, P(w_i)
              = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(w_i)

  • Here, Bayes' theorem is used for the transformations on the 2nd to 4th lines; the fact that ΔSi and wi are independent of each other (that is, P(ΔSi | wi) = P(ΔSi)) is used for the transformation from the 4th to the 5th line; and the fact that P(ΔSi) is a constant not depending on wi is used for the transformation from the 5th to the 6th line. argmax_x f(x) denotes the x that maximizes f(x).
  • Alternatively, the regression coefficient can be trained through maximum likelihood (ML) estimation:

    w_i^{ML} = \mathrm{argmax}_{w_i} P(\Delta S_i, L_i \mid w_i)
             = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i \mid w_i)
             = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i)
             = \mathrm{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)
  • The maximum a posteriori probability estimation differs from the maximum likelihood estimation in that the regression coefficient is trained in view of the prior probability P(wi) of the regression coefficient wi.
  • By considering this prior probability, the maximum a posteriori probability estimation can train the regression coefficient more robustly than the maximum likelihood estimation even when the number of pieces of training data is small.
  • For example, when the number of labels Lj,i taking 1 (that is, the number of pieces of the training data Qj similar to the non-pivot Xi) is small, the regression coefficient may not be appropriately trained through the maximum likelihood estimation. Even in such a case, the regression coefficient can be appropriately trained through the maximum a posteriori probability estimation.
  • The likelihood P(Li | ΔSi, wi) can be obtained by:

    P(L_i \mid \Delta S_i, w_i) = \prod_{j=1}^{N'} P(L_{j,i} \mid \Delta S_{j,i}, w_i)
                                = \prod_{j=1}^{N'} \sigma(a_{j,i})^{L_{j,i}} \left(1 - \sigma(a_{j,i})\right)^{1 - L_{j,i}}, \quad a_{j,i} = w_{i,1}\, \Delta S_{j,i} + w_{i,0}

  • Here, the label Lj,i takes 1 when the score sj,i between the training data Qj and the non-pivot Xi is smaller than the threshold value r and takes 0 in other cases; for the transformation from the 1st to the 2nd line, the fact that Lj,i depends on wi only through the Δ score ΔSj,i is used.
  • As the prior distribution P(wi), for example, a normal distribution with an average vector 0 and a covariance determined by a parameter σ0 can be used. For σ0, there are, for example, a method of presetting σ0 at an adequate value and a method of automatically determining it by using the empirical Bayes method based on the training data.
  • Note that an average vector other than 0 may be used, or, for example, an exponential distribution or a gamma distribution other than the normal distribution may be used as a distribution model.
  • The regression coefficient wiMAP or wiML obtained through the maximum a posteriori probability estimation or the maximum likelihood estimation can be calculated by using, for example, the Newton-Raphson method. This is a method of sequentially obtaining the value wiMAP of the maximum a posteriori probability estimation or the value wiML of the maximum likelihood estimation with the following procedure.
  • E(wi) is an error function defined as the negative logarithm of the posterior probability or of the likelihood:

    E(w_i) \equiv -\ln P(L_i \mid \Delta S_i, w_i)\, P(w_i) \quad \text{(maximum a posteriori probability estimation)}
    E(w_i) \equiv -\ln P(L_i \mid \Delta S_i, w_i) \quad \text{(maximum likelihood estimation)}

  • Starting from an initial value wi(0), the estimate is updated by

    w_i^{(\tau+1)} = w_i^{(\tau)} - \left( \nabla \nabla E(w_i^{(\tau)}) \right)^{-1} \nabla E(w_i^{(\tau)})

    until convergence. Here, ∇ is the differential operator vector, and ∇E(wi(τ)) and ∇∇E(wi(τ)) are the first-order differential (gradient) column vector and the second-order differential (Hessian) matrix, respectively.
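A minimal sketch in Python of the Newton-Raphson training of one non-pivot's regression coefficient wi = (wi,1, wi,0), with a zero-mean Gaussian prior of variance σ0² (set sigma0 to a large value, or drop the prior terms, to obtain the maximum likelihood variant). The numeric settings are illustrative assumptions.

```python
import numpy as np

def train_regression(ds, labels, sigma0=10.0, iters=100, tol=1e-8):
    # ds: ΔS_i (one Δ score per training datum); labels: L_i (0/1 labels).
    X = np.column_stack([np.asarray(ds, float), np.ones(len(ds))])  # rows [ΔS_ji, 1]
    y = np.asarray(labels, float)
    w = np.zeros(2)                                 # w = (w_i1, w_i0)
    prior_prec = np.eye(2) / sigma0**2              # from the Gaussian prior P(w_i)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # sigma(a_ji), cf. Formula 7
        grad = X.T @ (p - y) + prior_prec @ w       # gradient of E(w)
        hess = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec  # Hessian of E(w)
        step = np.linalg.solve(hess, grad)
        w -= step                                   # Newton-Raphson update
        if np.linalg.norm(step) < tol:
            break
    return tuple(w)
```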
  • The aforementioned operation is performed while the index vector size Zi is varied over various values (for example, 1 to M), and the wiMAP or wiML for which the error function is as small as possible, together with the Zi that achieves this, may be provided as the training results. This makes it possible to obtain the best parameter in terms of accuracy.
  • Alternatively, the non-pivot-specific parameters may be trained so that the sum of the error functions over the non-pivots becomes as small as possible while the index size is kept equal to or smaller than a fixed value.
  • That is, the wiMAP with which the sum of the error functions over the non-pivots becomes smallest while the Zi of each non-pivot is varied over various values within the range where the index size is equal to or smaller than the fixed value, together with the Zi that realizes this, may be provided as the training results (M+1≤i≤N). This makes it possible to realize, when a required value is set for the size of the supplementary information, the most excellent performance in terms of accuracy within a range that satisfies it.
  • Obtaining the labels Lj,i (1≤j≤N', M+1≤i≤N) requires calculation of a total of (N-M)×N' scores, which typically takes a great deal of time.
  • Therefore, a Δ score between each non-pivot and each of the N' pieces of training data may be obtained, ν' (<N') pieces of the training data may be selected in ascending order of Δ score (where ν' is a value predefined by a system manager or the like), and only they may be used for the training.
  • A piece of training data with a small Δ score is likely to be similar to the non-pivot, and this makes it possible to reduce the number of score calculations required for the training to (N-M)×ν' while suppressing, as much as possible, the reduction in the number of labels Lj,i that take 1 (that is, of pieces of training data similar to the non-pivot Xi). This consequently provides the effect of high-speed training.
  • When the non-pivots form clusters, a parameter such as the index vector size may take a similar or the same value within each cluster.
  • Therefore, clustering may be performed on the non-pivots, and the non-pivot-specific parameters may be trained so that some or all of the parameters obtained for each cluster are common.
  • As a clustering method, any of hierarchical methods such as the nearest neighbor method, the farthest neighbor method, the group average method, and the Ward method may be used. Training a common parameter for each cluster in this manner makes it possible to reduce the size of the parameters. This consequently provides the effect of realizing further system weight reduction.
    [Second Embodiment]
  • A similarity search system of this embodiment is a biological body identification system which, when a user who attempts authentication (hereinafter referred to as an authenticated user) inputs biological body information, searches a database in a server terminal for similar biological body information, thereby identifies to which user enrolled in the database (hereinafter referred to as an enrolled user) the authenticated user corresponds, and performs authentication based on the result of this identification.
  • FIG. 7 shows a configuration example of the biological body identification system of this embodiment. Here, only a point different from FIG. 1 will be described.
  • raw data is biological body information.
  • This system is composed of: an enrollment terminal 100 that transmits to a server terminal a feature of biological body information obtained from the user; a server terminal 200 that saves enrollment information, generates supplementary information from the enrollment information, and performs biological body identification on a feature for authentication by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 a group ID and the feature for the authentication inputted by the user; and a network 400.
  • a group ID 221 may be a value specific to a business place to which the user belongs, or may be set to be specific to each client terminal 300 or each base. In the former case, possible operation is to input the group ID at the time of authentication by the user. In the latter case, the user is not required to input the group ID at the time of authentication.
  • the enrollment terminal 100 further has: a group ID/user name acquisition unit 103 that acquires a group ID and a user name; and a feature extraction unit 104 that extracts a feature from raw data.
  • the server terminal 200 does not have a feature extraction unit 202 but has a group narrowing unit 209a and has master data 220 for each group ID.
  • the master data 220 has a group ID 221.
  • Enrollment information 230 does not have raw data 232 but has a user name 234 for each piece of the enrollment information.
  • Possible features of the biological body information are, for example, minutiae for fingerprints, an iris code for an iris, and cepstrum for a vocal print.
  • Possible scores between two pieces of biological body information are, for example, the number or ratio of corresponding minutiae for fingerprints, a Hamming distance for irises, and a Mahalanobis distance for vocal prints.
  • the client terminal 300 further has: a group ID acquisition unit 303 that acquires a group ID; and a feature extraction unit 304 that extracts a feature from raw data.
  • Hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment is the same as that of FIG. 2 .
  • FIG. 8 shows processing procedures and a data flow of enrollment processing according to this embodiment.
  • Step S101 of FIG. 8 is equal to step S101 of FIG. 3 .
  • the enrollment terminal 100 acquires a group ID and a user name from the user (step S101a).
  • the enrollment terminal 100 extracts a feature for enrollment from raw enrolled data (step S102a).
  • the enrollment terminal 100 transmits to the server terminal 200 the group ID, the user name, and the feature for enrollment (Step S103a).
  • The server terminal 200 adds to the master data 220 the enrollment information 230 including an enrolled data ID 231 specific to the enrolled data, the user name 234, and a feature 233 for enrollment. If there is no corresponding master data 220, master data 220 including the group ID 221 and this enrollment information 230 is newly created (step S104a).
  • Processing procedures and a data flow of supplementary information generation processing according to this embodiment is the same as that of FIG. 4 . Note that, however, this processing is performed for each group ID.
  • the number N of pieces of the enrollment information 230 and the number M of pivots may be different from one group ID to another.
  • FIG. 9 shows processing procedures and a data flow of search processing according to this embodiment. Steps S302, S305 to S312, and S314 of FIG. 9 are equal to steps S302, S305 to S312, and S314 of FIG. 5.
  • The server terminal 200 acquires the master data 220 for each group ID from the database 210 (step S301a).
  • the client terminal 300 acquires the group ID from the user (step S302a).
  • Note that when the group ID is a value specific to each client terminal 300 or each base, it need not be acquired from the user in this step.
  • the client terminal 300 extracts a feature for search from raw search data (step S303a).
  • The client terminal 300 transmits the group ID and the feature for search to the server terminal 200 (step S304a).
  • The server terminal 200 narrows the target of search down to the master data 220 corresponding to the acquired group ID (step S305a).
  • In this manner, narrowing of the enrolled data by using the group ID is performed. This makes it possible to dramatically reduce the number of pieces of enrolled data for which the score is calculated, and consequently provides the effect of further improving speed.
  • the server terminal 200 transmits, as a search result, the user name 234 corresponding to the enrolled data to the client terminal 300 (step S313a).
  • the client terminal 300 displays the user name 234 corresponding to the enrolled data as the search result (step S314a).
  • the present invention is applicable to any application that performs similarity search on unstructured data such as an image, a moving picture, music, a document, binary data, or biological body information.
  • the invention is applicable to a similar image search system, a similar moving picture search system, a similar music search system, a similar document search system, a similar file search system using fuzzy hash, an information access control system, an attendance management system, and an entrance and exit management system.

Abstract

A pivot is determined from enrolled data by a pivot determination unit, raw data is acquired, features are extracted from the raw data, a score is calculated as one of a distance and a degree of similarity between the features, an index vector is generated by using the score for the pivot, a Δ score is calculated as one of a distance and a degree of similarity between the index vectors, a parameter of each non-pivot including a regression coefficient is trained by using training data, the order to select the non-pivots is determined, by using the Δ score between search data and the non-pivot as well as the regression coefficient, in descending order of posterior probability through logistic regression, and a search result is outputted based on the score between the search data and the enrolled data.

Description

  • The present invention relates to a method and a system for searching for data similar to inputted unstructured data.
  • Searching for unstructured data similar to inputted unstructured data such as an image, a moving picture, a document, binary data, or biological body information is called similarity search. The similarity search is typically performed by extracting, from raw unstructured data (hereinafter called raw data), information called features used for distance calculation (or similarity calculation), and then regarding a smaller distance between the features, which indicates their degree of disagreement (or a greater degree of similarity, which indicates their degree of agreement), as a greater degree of similarity between the data. The distance (or degree of similarity) between the features is called a score.
  • Examples include: a method (k-Nearest Neighbor Search) of calculating a distance (or degree of similarity) between raw data inputted at time of search (hereinafter called search data) and raw data enrolled in a database (hereinafter called enrolled data), selecting K pieces of the enrolled data in ascending order of distance (or descending order of the degree of similarity), and outputting information related thereto as search results; and a method (Range Search) of outputting as search results information related to the enrolled data whose distance (or degree of similarity) is smaller (or larger) than a threshold value r.
  • At this point, calculating scores for all the pieces of enrolled data, where the total number of pieces of enrolled data is N, requires N score calculations. Typically, a score calculation requires a significant amount of time; therefore, an increase in the number N of pieces of enrolled data results in an almost proportional increase in search time. To address this, distance-based indexing has been suggested, by which scores between the pieces of enrolled data are calculated in advance, the order in which to select the pieces of enrolled data for score calculation is determined by using these scores, and the score calculation is stopped partway through, thereby reducing the number of score calculations.
  • For example, in E. CHAVEZ, K. FIGUEROA and G. NAVARRO, "Effective Proximity Retrieval by Ordering Permutations," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 30, No. 9, pp. 1647-1658 (2008) (hereinafter, Non Patent Literature 1), M (M<N) pieces of enrolled data (hereinafter called pivots) are selected, for example randomly, from N pieces of enrolled data; a distance between each piece of enrolled data and each pivot is calculated; a vector (hereinafter called a first index vector) used at the time of search is obtained for each piece of enrolled data by using these distances; a distance between the search data inputted at the time of search and each pivot is calculated to obtain a second index vector of the search data; and then the order to select the remaining pieces of enrolled data (hereinafter called non-pivots) is determined in ascending order of the distance between the first and second index vectors. The index vector obtained in this literature is a vector in which the IDs of the pivots are arranged in ascending order of distance.
  • In Non Patent Literature 1, the order to select the non-pivots is determined in the ascending order of the distance between the first and second index vectors. However, when the calculation of scores from the non-pivots is stopped partway through, this method leaves room for improvement in reducing the expected number of non-pivots whose score is not calculated (and which are thus not retrieved) even though their score from the search data is smaller than the threshold value r; that is, it leaves room for improvement in terms of search accuracy.
  • It is a preferred aim of the present invention to theoretically minimize an expected number of non-pivots for which the score is not calculated and thus not searched.
  • The present invention is characterized by having: a pivot determination unit that determines a pivot from enrolled data; a raw data acquisition unit that acquires raw data; a feature extraction unit that extracts features from the raw data; a score calculation unit that calculates a score as one of a distance and a degree of similarity between the features; an index vector generation unit that generates an index vector by using the score for the pivot; a Δ score calculation unit that calculates a Δ score as one of a distance and a degree of similarity between the index vectors; a non-pivot-specific parameter training unit that trains, by using training data, a parameter of each non-pivot including a regression coefficient; a non-pivot selection order determination unit that determines, by using the Δ score between inputted search data and the non-pivot as well as the regression coefficient, the order to select the non-pivots in descending order of posterior probability through logistic regression; a search result output unit that outputs a search result based on the score between the search data and the enrolled data; and a database that holds the feature of the enrolled data, pivot information indicating which piece of the enrolled data is the pivot, an index including the index vector of each non-pivot, and the parameter of each non-pivot.
  • With the present invention, by using a non-pivot-specific regression coefficient, the order to select the non-pivots is determined in descending order of posterior probability through logistic regression. This makes it possible to theoretically minimize the expected number of non-pivots whose score is not calculated, and which are thus not retrieved, even though their score from the search data is smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy.
  • In the drawings:
    • FIG. 1 is a block diagram showing functional configuration according to a first embodiment of the present invention;
    • FIG. 2 is a block diagram showing hardware configuration according to the first embodiment and a second embodiment of the invention;
    • FIG. 3 is a flow diagram showing enrollment processing according to the first embodiment of the invention;
    • FIG. 4 is a flow diagram showing supplementary information generation processing according to the first and second embodiments of the invention;
    • FIG. 5 is a flow diagram showing search processing according to the first embodiment of the invention;
    • FIG. 6 is a schematic diagram showing a feature space and indexes;
    • FIG. 7 is a block diagram showing functional configuration according to the second embodiment of the invention;
    • FIG. 8 is a flow diagram showing enrollment processing according to the second embodiment of the invention; and
    • FIG. 9 is a flow diagram showing search processing according to the second embodiment of the invention.
    [First Embodiment]
  • Hereinafter, the first embodiment will be described with reference to the accompanying drawings.
  • A similarity search system of this embodiment is a similar image search system that, when a user inputs an image, searches a database in a server terminal for similar images. Unstructured data such as a moving picture, music, a document, or binary data may be used instead of an image. The similarity search system of this embodiment uses a color histogram as the features of an image and uses a Euclidean distance as the score between features.
  • The similarity search system of this embodiment preselects M pivots from N pieces of enrolled data; the pivots may, for example, be selected randomly. Next, the similarity search system calculates a score between each piece of the remaining enrolled data (each non-pivot) and each of the pivots and, based on this, obtains a first index vector used at the time of search for each non-pivot. At the time of search, the similarity search system calculates a score between the inputted search data and each pivot and, based on this, obtains a second index vector of the search data. The index vector serves as a clue that indicates the positional relationship between each non-pivot and the search data without obtaining their score. Typically, calculating the score between the search data and each piece of the enrolled data takes a great deal of time, but the number of score calculations can be reduced (that is, high-speed search can be performed) by determining the order to select the non-pivots by using a distance (or a degree of similarity) between the index vectors (hereinafter called a Δ score), calculating the score from the non-pivots T (<N-M) times (where T is an upper limit value predefined by the system manager or the like), and stopping the score calculation at that point.
  • As the index vector, a vector formed of the scores from the pivots (hereinafter called a score vector) may be used, or a vector in which the IDs of the pivots are arranged in ascending order of distance (or descending order of degree of similarity) (hereinafter called a permutation vector) may be used; a sketch of both follows. A collection of the first index vectors of the different non-pivots is called an index.
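A minimal sketch in Python of building both index-vector types from pivot scores; the use of a squared Euclidean distance as the score and all function names are assumptions for illustration.

```python
import numpy as np

def score(a, b):
    # Score between two features; a squared Euclidean distance is assumed here.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum((a - b) ** 2))

def score_vector(x, pivots):
    # Score vector: the score from x to each of the M pivots, in pivot order.
    return [score(x, p) for p in pivots]

def permutation_vector(x, pivots):
    # Permutation vector: pivot IDs sorted by ascending score (closest first).
    s = score_vector(x, pivots)
    return sorted(range(len(pivots)), key=lambda pid: s[pid])
```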
  • FIG. 6 shows an example of search data Q and enrolled data X1, X2, ...XN in a feature space. Note that X1, X2, ...XM denote pivots and XM+1, XM+2, ...XN denote non-pivots. Here, two clusters are formed and they are greatly separated from each other. Moreover, the number of dimensions of the feature is very large and it takes time to calculate a score between the features.
  • FIGS. 6(a1) and (a2) show examples of the second index vector of the search data and of the indexes when a score vector and a permutation vector are respectively used as the index vector. Note that the Euclidean distance between the features is used as the score.
  • For example, in FIG. 6(a1), the score between XM+1 and X1 is 70, and the score vector SM+1 of XM+1 is SM+1=(70, 28, 1053, ..., 43)T. In FIG. 6(a2), the pivot realizing the smallest of the scores between XM+1 and the pivots is X2, and the permutation vector of XM+1 is TM+1=(X2, XM, X1, ..., X3)T.
  • For the Δ score (the distance or degree of similarity between the index vectors), when the score vector is used as the index vector, for example, the Manhattan distance or the Euclidean distance can be used; when the permutation vector is used, for example, Spearman Rho can be used. Alternatively, for example, the value obtained by subtracting such a distance from its maximum possible value may be used as a degree of similarity.
  • For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Euclidean distance De(Sq, Si) between the score vector Sq of the search data and the score vector Si of the enrolled data Xi is given by

    D_e(S_q, S_i) = \sum_{z=1}^{M} \left( S_q(z) - S_i(z) \right)^2 \qquad \text{(Formula 1)}

    Here, Si(z) denotes the z-th element of the score vector Si. Note that the square root is omitted; since it is monotone, omitting it does not affect the selection order. In the case of FIG. 6(a1), the calculation is De(Sq, SM+1)=(78-70)^2+(95-28)^2+...+(39-43)^2.
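Formula 1 in code, as a minimal sketch (the function name is an assumption):

```python
def delta_score_euclid(s_q, s_i):
    # Formula 1: sum of squared differences between two M-dimensional
    # score vectors; the square root is omitted as it preserves ordering.
    return sum((q - i) ** 2 for q, i in zip(s_q, s_i))
```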
  • When the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, Spearman Rho Dρ(Tq, Ti) between the permutation vector Tq of the search data and the permutation vector Ti of the enrolled data Xi is given by

    D_\rho(T_q, T_i) = \sum_{z=1}^{M} \left( z - T_q^{-1}(T_i(z)) \right)^2 \qquad \text{(Formula 2)}

    Here, Ti(z) denotes the suffix number of the z-th element of the permutation vector Ti. For example, where Ti=(X2, XM, X1, ..., X3)T, Ti(1)=2, Ti(2)=M, Ti(3)=1, ..., Ti(M)=3. Tq^-1(i) denotes at what place in the permutation vector Tq the element Xi is located. For example, where Tq=(XM, X1, X2, ..., X3)T, Tq^-1(1)=2, Tq^-1(2)=3, Tq^-1(3)=M, ..., Tq^-1(M)=1. In the case of FIG. 6(a2), the calculation is Dρ(Tq, TM+1)=(1-3)^2+(2-1)^2+...+(M-M)^2.
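Formula 2 in code, as a minimal sketch; permutation vectors are represented as lists of pivot IDs, and T_q^-1 is materialized as a dictionary of 1-based positions:

```python
def spearman_rho(t_q, t_i):
    # Formula 2: Spearman Rho between two permutation vectors of pivot IDs.
    t_q_inv = {pid: pos for pos, pid in enumerate(t_q, start=1)}  # T_q^-1
    return sum((z - t_q_inv[pid]) ** 2 for z, pid in enumerate(t_i, start=1))
```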
  • A first characteristic of the similarity search system of this embodiment is that an index vector size (the number of dimensions of the index vector) of each non-pivot is uniquely determined (trained) before search by using prepared data (training data). A method of training the index vector size will be described in detail below.
  • FIGS. 6(b1) and (b2) show examples of the indexes when the index vectors are held in correspondence with the trained index vector size of each non-pivot, for the cases where the score vector and the permutation vector are used as the index vector. When the score vector is used as the index vector, its elements are rearranged so that only the number of elements corresponding to the score vector size is kept, in ascending or descending order of score, and, in order to tell to which pivot each kept score corresponds, a permutation vector of the same length is also held.
  • For example, in FIG. 6(b1), the score vector size of XM+1 is trained as 3, and the score vector SM+1=(28, 43, 70)T is held together with the permutation vector TM+1=(X2, XM, X1)T. In FIG. 6(b2), the permutation vector size of XM+1 is trained as 2, and the permutation vector TM+1=(X2, XM)T is held. In FIGS. 6(b1) and (b2), the blacked-out sections are saved into the database.
  • As described above, in this embodiment, by using the training data, the non-pivot-specific index vector size is trained, and the non-pivot-specific index vector is saved in correspondence with the index vector size of this non-pivot. This makes it possible to reduce the index vector size for each non-pivot. This results in reduction in a size of indexes saved into the database, which can provide effect of realizing system weight reduction. Details of a method of training the index vector size will be described below.
  • In this case, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the Δ score De(Sq, Si, Ti, Zi) between the score vector Sq of the search data and the score vector Si of the enrolled data Xi (whose permutation vector is Ti and whose score vector size is Zi) is given by

    D_e(S_q, S_i, T_i, Z_i) = \sum_{z=1}^{Z_i} \left( S_q(T_i(z)) - S_i(T_i(z)) \right)^2 \qquad \text{(Formula 3)}

    In the case of FIG. 6(b1), the calculation is De(Sq, SM+1, TM+1, ZM+1)=(95-28)^2+(39-43)^2+(78-70)^2.
  • Moreover, when the permutation vector is used as the index vector and Spearman Rho is used as the Δ score, Spearman Rho Dρ(Tq, Ti, Zi) between the permutation vector Tq of the search data and the permutation vector Ti of the enrolled data Xi (whose permutation vector size is Zi) is given by

    D_\rho(T_q, T_i, Z_i) = \sum_{z=1}^{Z_i} \left( z - T_q^{-1}(T_i(z)) \right)^2 \qquad \text{(Formula 4)}

    In the case of FIG. 6(b2), the calculation is Dρ(Tq, TM+1, ZM+1)=(1-3)^2+(2-1)^2.
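Formulas 3 and 4 in code, as a minimal sketch. Only the Zi kept entries of the non-pivot's index vector enter the sums; s_q is the search data's full score vector indexed by pivot ID, and the kept permutation entries tell which of its elements to use (names are assumptions):

```python
def delta_score_euclid_truncated(s_q, s_i_kept, t_i_kept):
    # Formula 3: sum over the Z_i kept scores; t_i_kept[z] is the pivot ID
    # that s_i_kept[z] corresponds to, so s_q is indexed by that pivot ID.
    return sum((s_q[pid] - s) ** 2 for pid, s in zip(t_i_kept, s_i_kept))

def spearman_rho_truncated(t_q, t_i_kept):
    # Formula 4: Spearman Rho over only the first Z_i permutation entries.
    t_q_inv = {pid: pos for pos, pid in enumerate(t_q, start=1)}
    return sum((z - t_q_inv[pid]) ** 2 for z, pid in enumerate(t_i_kept, start=1))
```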
  • As described above, the Δ score between the search data and a non-pivot is calculated in correspondence with that non-pivot's index vector size (that is, as a distance between Zi-dimensional vectors). This requires less time than calculating a Δ score over the full number M of pivots (that is, a distance between M-dimensional vectors), and consequently provides the effect of improving speed.
  • A second characteristic of the similarity search system of this embodiment is that, after the Δ scores ΔSq,M+1, ..., ΔSq,N for the non-pivots are obtained in this manner, the order to select the non-pivots is determined, by using logistic regression, in descending order of the posterior probability P(sq,i < r | ΔSq,i) (M+1≤i≤N) that the score sq,i from the search data is smaller than a threshold value r. The posterior probability P(sq,i < r | ΔSq,i) can be transformed by use of Bayes' theorem as follows:

    P(s_{q,i} < r \mid \Delta S_{q,i}) = \frac{P(\Delta S_{q,i} \mid s_{q,i} < r)\, P(s_{q,i} < r)}{P(\Delta S_{q,i})} = \frac{1}{1 + \exp(-a_i)} = \sigma(a_i) \qquad \text{(Formula 5)}

    where σ( ) is the logistic sigmoid function, and ai is:

    a_i = \ln \frac{P(\Delta S_{q,i} \mid s_{q,i} < r)\, P(s_{q,i} < r)}{P(\Delta S_{q,i}) - P(\Delta S_{q,i} \mid s_{q,i} < r)\, P(s_{q,i} < r)} \qquad \text{(Formula 6)}
  • Since the logistic sigmoid function σ( ) is monotonically increasing, determining the order to select the non-pivots in descending order of ai is equivalent to determining it in descending order of the posterior probability P(sq,i < r | ΔSq,i). The ai can be obtained by using logistic regression, which approximates it as:

    a_i \approx w_{i,1}\, \Delta S_{q,i} + w_{i,0} \qquad \text{(Formula 7)}
  • Here, wi,1 and wi,0 are the non-pivot-specific regression coefficients of the logistic regression (M+1≤i≤N). It is possible to use a value common to all non-pivots as the regression coefficient, but since the regression coefficient generally differs in value from one non-pivot to another, ai can be obtained more properly by using a non-pivot-specific regression coefficient. Moreover, according to Formula 7, ai is obtained from the Δ score ΔSq,i with one multiplication and one addition, and thus calculating ai takes little time. The regression coefficients are uniquely determined (trained) before search by using prepared data (training data), as is the case with the index vector size; a sketch of the ordering computation follows. Details of the method of training the regression coefficient will be described below.
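A minimal sketch of how the selection order could be computed from Formulas 5 to 7; all names are illustrative assumptions:

```python
import math

def posterior(delta_s, w1, w0):
    # Formulas 5-7: P(s_{q,i} < r | ΔS_{q,i}) = sigma(a_i), a_i ≈ w1*ΔS + w0.
    return 1.0 / (1.0 + math.exp(-(w1 * delta_s + w0)))

def selection_order(delta_scores, coeffs):
    # delta_scores: {non_pivot_id: ΔS_{q,i}}; coeffs: {non_pivot_id: (w1, w0)}.
    # Since the sigmoid is monotonically increasing, sorting by a_i descending
    # yields the same order as sorting by the posterior descending.
    a = {i: coeffs[i][0] * ds + coeffs[i][1] for i, ds in delta_scores.items()}
    return sorted(a, key=a.get, reverse=True)
```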
  • Assume here that the aggregation of the Δ scores ΔSq,M+1, ..., ΔSq,N for the non-pivots is ΔSq and that the non-pivot selected in the e-th place (1≤e≤N-M) is Xm(e) (M+1≤m(e)≤N). Then the expected number of non-pivots whose score is not calculated (that is, which are not retrieved), even though their score from the search data is smaller than the threshold value r, after the score from the non-pivots has been calculated T (<N-M) times, can be denoted as:

    \sum_{e=T+1}^{N-M} \left( 1 \times P(s_{q,m(e)} < r \mid \Delta S_q) + 0 \times P(s_{q,m(e)} \geq r \mid \Delta S_q) \right) = \sum_{e=T+1}^{N-M} P(s_{q,m(e)} < r \mid \Delta S_q) \approx \sum_{e=T+1}^{N-M} P(s_{q,m(e)} < r \mid \Delta S_{q,m(e)}) \qquad \text{(Formula 8)}
  • Note that the approximation from the second to the third expression uses the fact that, among the Δ scores, the one with the greatest influence on the posterior probability of the non-pivot X_{m(e)} is its own Δ score ΔS_{q,m(e)}. Formula 8 thus approximates the expected number of misses by the sum of the posterior probabilities P(s_{q,m(e)} < r | ΔS_{q,m(e)}) of the non-pivots for which score calculation has not yet been performed, and this sum is minimized when the score from the non-pivot is calculated T times in descending order of the posterior probability P(s_{q,m(e)} < r | ΔS_{q,m(e)}).
  • Therefore, in this embodiment, the order to select the non-pivots is determined in descending order of the posterior probability through logistic regression with the non-pivot-specific regression coefficient, which theoretically minimizes the expected number of non-pivots that are not scored, and thus not found, despite their score from the search data being smaller than the threshold value r. This consequently provides the effect of dramatically improving accuracy. Details of the method of training the regression coefficient will be described below.
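A minimal Python sketch of this ordering step, applying Formula 7 with per-non-pivot coefficients and sorting by a_i; the array names and sample values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def selection_order(delta_scores, w1, w0):
    """Order non-pivots by a_i = w_{i,1} * dS_{q,i} + w_{i,0} (Formula 7).
    Since sigma() is monotonically increasing, sorting by a_i descending
    equals sorting by the posterior P(s_{q,i} < r | dS_{q,i}) descending."""
    a = w1 * delta_scores + w0
    return np.argsort(-a)  # non-pivot indices, most promising first

# Illustrative values: three non-pivots with trained coefficients.
ds = np.array([3.0, 10.0, 1.0])     # Δ scores to the search data
w1 = np.array([-0.8, -0.5, -1.2])   # non-pivot-specific slopes (usually negative:
w0 = np.array([ 2.0,  1.0,  3.0])   # a small Δ score means a high posterior)
print(selection_order(ds, w1, w0))  # -> [2 0 1]
```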
  • FIG. 1 shows a configuration example of the similarity search system of this embodiment. In this embodiment, raw data is an image.
  • This system is composed of: an enrollment terminal 100 that transmits to a server terminal enrollment information acquired from a user; a server terminal 200 that saves the enrollment information, generates supplementary information from the enrollment information, and performs similarity search on raw search data by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 the raw search data inputted by the user; and a network 400.
  • The number of each of the enrollment terminal 100, the server terminal 200, and the client terminal 300 may be one or more. The enrollment terminal 100 may be the same terminal as the server terminal 200 or as the client terminal 300. Moreover, the enrollment terminal 100 is not necessarily provided. The server terminal 200 may be the same terminal as the client terminal 300. The network 400 may be a network such as a WAN or LAN, communication between devices using USB, IEEE 1394, or the like, or wireless communication such as a portable phone network or Bluetooth.
  • For example, assumed configuration is that the enrollment terminal 100 includes a plurality of PCs in a firm, the server terminal 200 is one server in a data center operated by the firm, the client terminal 300 includes a plurality of users' individual PCs, and the network 400 is the Internet, and assumed operation is that an employee in the firm performs image enrollment. In this case, the enrollment terminal 100 may be a server in the data center, so that a server manager can perform image enrollment. Alternatively, the enrollment terminal 100 may be provided in the user's individual PC, so that the user can perform image enrollment. Alternatively, without providing the enrollment terminal 100, the server terminal 200 may perform automatic collection from the Internet. Alternatively, the enrollment terminal 100, the server terminal 200, and the client terminal 300 may be provided in the user's individual PC, so that image enrollment, supplementary information generation, and search can be performed on the individual PC.
  • The enrollment terminal 100 is composed of: a raw data acquisition unit 101 that acquires raw data; and a communication I/F 102.
  • The server terminal 200 is composed of: a pivot determination unit 201 that determines M pivots from N pieces of enrolled data; a feature extraction unit 202 that extracts features from raw data; a score calculation unit 203 that calculates a score as a distance (or a degree of similarity) between the features; an index vector generation unit 204 that generates an index vector by using a score for a non-pivot or a pivot of search data; a Δ score calculation unit 205 that calculates a distance (or degree of similarity) (hereinafter called Δ score) between the index vectors; a non-pivot-specific parameter training unit 206 that trains a non-pivot-specific parameter by using training data; a non-pivot selection order determination unit 207 that determines the order to select the non-pivots by using a Δ score between the inputted search data and the non-pivot; a search result output unit 208 that outputs search results based on a score between the search data and the enrolled data; a communication I/F 209, and a database 210.
  • The database 210 holds master data 220. The master data 220 holds enrollment information 230 of each enrolled user and supplementary information 240. The enrollment information 230 holds, for each piece of the enrolled data, an enrolled data ID 231, raw data 232, and a feature 233. The supplementary information 240 holds: pivot information 241 that indicates which piece of the enrolled data is a pivot; an index 242; and a non-pivot-specific parameter 250. The index 242 holds an index vector 243 for each non-pivot. The non-pivot-specific parameter 250 holds, for each non-pivot, an index vector size 251 and a regression coefficient 252 that is used for logistic regression.
  • The client terminal 300 is composed of: a raw data acquisition unit 301 that acquires raw data; and a communication I/F 302.
  • FIG. 2 shows the hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment. As shown in the figure, these terminals can each be composed of a CPU 500, a memory 501, an HDD 502, an input device 503, an output device 504, and a communication device 505.
  • FIG. 3 shows processing procedures and a data flow of enrollment according to this embodiment.
  • The enrollment terminal 100 acquires raw enrolled data from the user (step S101).
  • The enrollment terminal 100 transmits the raw enrolled data to the server terminal 200 (step S102).
  • The server terminal 200 extracts features for enrollment from the raw enrolled data (step S103).
  • The server terminal 200 saves into the database 210 the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the raw data 232 for enrollment, and the feature 233 for enrollment (step S104).
  • FIG. 4 shows processing procedures and a data flow of supplementary information generation according to this embodiment. This processing is performed between when enrollment processing is performed and when search processing is performed. For example, it is possible to perform this processing immediately after the enrollment or at night on a day when the enrollment is performed. Moreover, this processing involves two cases: the case where supplementary information is newly generated; and the case where the supplementary information for enrolled data added after the last supplementary information generation is updated.
  • The server terminal 200 acquires the enrollment information 230 of each enrolled user from the database 210 when newly generating supplementary information, and acquires the added enrollment information 230 from the database 210 when updating the supplementary information (step S201).
  • When newly generating supplementary information, the server terminal 200 newly determines M pivots from among the raw data 232 of the N pieces of enrollment information 230 (step S202). When updating the supplementary information, this step is omitted and the raw data 232 of the added enrollment information 230 is treated as a non-pivot. Methods of determining a pivot include, for example, random selection, and greedily selecting, at each pivot selection, the piece of data whose sum of scores or Δ scores to the pivots determined so far is smallest (or largest); a sketch of the latter follows.
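The following is a minimal Python sketch of that greedy variant, under the assumption that all pairwise scores are precomputed in a matrix; the names and the choice of argmax (spreading pivots apart) are illustrative.

```python
import numpy as np

def choose_pivots_greedy(score_matrix, m):
    """Greedy pivot determination: after seeding with an arbitrary first
    pivot, repeatedly pick the candidate whose sum of scores to the pivots
    chosen so far is largest, spreading the pivots apart. score_matrix[i][j]
    is the precomputed score between enrolled data i and j (an assumption)."""
    n = len(score_matrix)
    pivots = [0]                   # arbitrary seed; could also be random
    while len(pivots) < m:
        rest = [i for i in range(n) if i not in pivots]
        sums = [sum(score_matrix[i][p] for p in pivots) for i in rest]
        pivots.append(rest[int(np.argmax(sums))])
    return pivots

# Illustrative usage with N = 4 enrolled data and M = 2 pivots:
d = np.array([[0.0, 2.0, 5.0, 1.0],
              [2.0, 0.0, 3.0, 4.0],
              [5.0, 3.0, 0.0, 6.0],
              [1.0, 4.0, 6.0, 0.0]])
print(choose_pivots_greedy(d, 2))  # -> [0, 2]
```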
  • The server terminal 200 generates the index vector 243 by obtaining a score between each pivot and each of the (N-M) non-pivots when newly generating supplementary information, and by obtaining a score between each pivot and each added non-pivot when updating the supplementary information (step S203).
  • The server terminal 200 uniquely determines (trains), by using prepared data (training data), the non-pivot-specific parameter 250 composed of the index vector size 251 and the regression coefficient 252 used for the logistic regression, for each of the (N-M) non-pivots when newly generating supplementary information and for each added non-pivot when updating the supplementary information (step S204). Details of the method of training the non-pivot-specific parameter 250 will be described below.
  • When newly generating supplementary information, the server terminal 200 saves into the database 210, as the supplementary information 240, the pivot information 241 indicating which pieces of the enrolled data are pivots, the index 242 composed of the index vector 243 of each of the (N-M) non-pivots, and the non-pivot-specific parameter 250 composed of the trained index vector size 251 and regression coefficient 252 of each non-pivot. When updating the supplementary information, the server terminal 200 adds the generated index vector 243 to the index 242 of the database 210, and adds the trained index vector size 251 and regression coefficient 252 of each added non-pivot to the non-pivot-specific parameter 250. At this point, the index vector 243 of each non-pivot is saved or added in a length matching the index vector size 251 of the concerned non-pivot (step S205).
  • FIG. 5 shows processing procedures and a data flow of search according to this embodiment.
  • The server terminal 200 acquires the master data 220 from the database 210 (step S301).
  • The client terminal 300 acquires raw search data from the user (step S302).
  • The client terminal 300 transmits the raw search data to the server terminal 200 (step S303).
  • The server terminal 200 extracts a feature for search from the raw search data (step S304).
  • The server terminal 200 calculates a score between the search data and each pivot (step S305).
  • The server terminal 200, based on the score between the search data and each pivot, generates an index vector of the search data (step S306).
  • The server terminal 200, by using the index vector of the search data, the index 242 including the index vector of each non-pivot, and the index vector size 251 of each non-pivot, calculates a Δ score between the search data and each of the non-pivots (step S307).
  • The server terminal 200, based on the Δ scores ΔS_{q,M+1}, ..., ΔS_{q,N} and by using the regression coefficients w_{i,1} and w_{i,0} of the logistic regression of each non-pivot, obtains by Formula 7 the value a_i that is related in a monotonically increasing manner to the posterior probability P(s_{q,i} < r | ΔS_{q,i}) (M+1 ≤ i ≤ N) that the score s_{q,i} from the search data is smaller than the threshold value r, and determines the order to select the non-pivots in descending order of a_i (step S308).
  • The server terminal 200 initializes to 0 the number of times t of calculating the score between the search data and the non-pivot (step S309).
  • The server terminal 200 calculates a score between the search data and the non-pivot selected in accordance with the order to select the non-pivots determined at step S308 (step S310).
  • The server terminal 200 increases the number of times t of calculating the score between the search data and the non-pivot by an increment of 1 (step S311).
  • The server terminal 200 proceeds to step S310 if the number of times t of calculating the score between the search data and the non-pivot is equal to or smaller than an upper limit value T and proceeds to step S313 if it is larger than the upper limit value T (step S312).
  • The server terminal 200 transmits the raw data 232 as search results to the client terminal 300 (step S313). At this point, a method of selecting k pieces of enrolled data in ascending order (or descending order) of score and providing them as search results (k-Nearest Neighbor Search) may be adopted, or a method of providing as search results the enrolled data for which the score is smaller (or larger) than the threshold value r (Range Search) may be adopted.
  • The client terminal 300 displays the raw data 232 as the search results (step S314).
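The search steps S305 through S313 above can be condensed into a short Python sketch. The callables `score`, `delta_score`, and `make_index_vec`, like all the parameter names, are assumptions standing in for the units described in FIG. 1; the Range Search variant is shown.

```python
import numpy as np

def range_search(q_feat, pivots, non_pivots, index_vecs, w1, w0,
                 score, delta_score, make_index_vec, r, T):
    """Condensed sketch of steps S305-S313 (Range Search variant)."""
    # S305-S306: score the query against every pivot, build its index vector.
    q_index = make_index_vec([score(q_feat, p) for p in pivots])
    # S307: Δ score between the query and each non-pivot's stored index vector.
    ds = np.array([delta_score(q_index, iv) for iv in index_vecs])
    # S308: order the non-pivots by a_i = w1*ΔS + w0, descending (Formula 7).
    order = np.argsort(-(w1 * ds + w0))
    # S309-S312: calculate true scores for at most T best-ranked non-pivots.
    hits = []
    for i in order[:T]:
        s = score(q_feat, non_pivots[i])
        if s < r:                  # S313: report enrolled data with score < r
            hits.append((int(i), s))
    return hits
```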
  • Hereinafter, details of the method of training, by using the training data, the parameter 250 composed of the index vector size 251 and the regression coefficient 252 of each non-pivot in step S204 will be described. As the training data, the (N-1) pieces of enrolled data other than the concerned non-pivot for which the parameter is trained may be used, or data prepared in advance separately from the enrolled data may be used.
  • First, the method of training the regression coefficients w_{i,1} and w_{i,0} when the index vector size Z_i is fixed at a certain value will be described. Assume that the training data are Q_1, Q_2, ..., Q_{N'} (where N' is the number of pieces of training data). Moreover, the Δ score between the training data Q_j (1 ≤ j ≤ N') and the non-pivot X_i (M+1 ≤ i ≤ N) is ΔS_{j,i}, and the aggregation of Δ scores for the non-pivot X_i over the training data is expressed by:

    $$\Delta S_i = \{ \Delta S_{j,i} \mid 1 \le j \le N' \} \quad \text{(Formula 9)}$$
  • For example, when the score vector is used as the index vector and the Euclidean distance is used as the Δ score, the ΔS_{j,i} is D_e(S_{qj}, S_i, Z_i) (where S_{qj} is the score vector of the training data Q_j) and can be calculated by Formula 3. When the permutation vector is used as the index vector and the Spearman Rho is used as the Δ score, the ΔS_{j,i} is D_ρ(T_{qj}, T_i, Z_i) (where T_{qj} is the permutation vector of the training data Q_j) and can be calculated by Formula 4.
  • Further, define a label L_{j,i} that takes 1 when the score s_{j,i} between the training data Q_j (1 ≤ j ≤ N') and the non-pivot X_i (M+1 ≤ i ≤ N) is smaller than the threshold value r and takes 0 otherwise, and express the aggregation of labels for the non-pivot X_i over the training data by:

    $$L_i = \{ L_{j,i} \mid 1 \le j \le N' \} \quad \text{(Formula 10)}$$
  • Furthermore, the regression coefficients w_{i,1} and w_{i,0} of the non-pivot X_i can be arranged in a vector form:

    $$w_i = (w_{i,1}, w_{i,0})^T \quad \text{(Formula 11)}$$
  • In this embodiment, the aggregation ΔSi of the Δ scores for the non-pivot Xi and the aggregation Li of the labels are used for training the regression coefficient wi.
  • As the method of training the regression coefficient, there are a method using maximum a posteriori probability estimation and a method using maximum likelihood estimation. To train the regression coefficient w_i through the maximum a posteriori probability estimation by using the aggregation ΔS_i of the Δ scores for the non-pivot X_i and the aggregation L_i of the labels, a parameter w_i^MAP is obtained through:

    $$\begin{aligned} w_i^{\mathrm{MAP}} &= \operatorname*{argmax}_{w_i} P(w_i \mid \Delta S_i, L_i) \\ &= \operatorname*{argmax}_{w_i} P(\Delta S_i, L_i \mid w_i)\, P(w_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i \mid w_i)\, P(w_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i)\, P(w_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(w_i) \end{aligned} \quad \text{(Formula 12)}$$

    and it is provided as the training result. Here, Bayes' theorem is used for the transformation from the first to the second expression, the independence of ΔS_i and w_i (that is, P(ΔS_i | w_i) = P(ΔS_i)) for the transformation from the third to the fourth expression, and the fact that P(ΔS_i) is constant with respect to w_i for the transformation from the fourth to the fifth expression. Moreover, argmax_x f(x) denotes the x that maximizes f(x). To train the regression coefficient w_i through the maximum likelihood estimation, a parameter w_i^ML is obtained through:

    $$\begin{aligned} w_i^{\mathrm{ML}} &= \operatorname*{argmax}_{w_i} P(\Delta S_i, L_i \mid w_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i \mid w_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i)\, P(\Delta S_i) \\ &= \operatorname*{argmax}_{w_i} P(L_i \mid \Delta S_i, w_i) \end{aligned} \quad \text{(Formula 13)}$$

    and it is provided as the training result.
  • As shown by Formulae 12 and 13, the maximum a posteriori probability estimation differs from the maximum likelihood estimation in that the regression coefficient is trained in view of the prior probability P(w_i) of the regression coefficient w_i. By considering this prior, the maximum a posteriori probability estimation can train the regression coefficient more robustly than the maximum likelihood estimation even when the number of pieces of training data is small. In particular, in this embodiment, the number of labels L_{j,i} taking 1 (that is, the number of pieces of the training data Q_j similar to the non-pivot X_i) is typically very small, so the regression coefficient may not be trained appropriately through the maximum likelihood estimation. Even in such a case, the regression coefficient can be trained appropriately through the maximum a posteriori probability estimation.
  • P(L_i | ΔS_i, w_i) can be obtained by:

    $$\begin{aligned} P(L_i \mid \Delta S_i, w_i) &= \prod_{j=1}^{N'} P(s_{j,i} < r \mid \Delta S_{j,i}, w_i)^{L_{j,i}}\, P(s_{j,i} \ge r \mid \Delta S_{j,i}, w_i)^{1-L_{j,i}} \\ &= \prod_{j=1}^{N'} P(s_{j,i} < r \mid \Delta S_{j,i}, w_i)^{L_{j,i}} \left( 1 - P(s_{j,i} < r \mid \Delta S_{j,i}, w_i) \right)^{1-L_{j,i}} \\ &= \prod_{j=1}^{N'} \sigma(a_{j,i})^{L_{j,i}} \left( 1 - \sigma(a_{j,i}) \right)^{1-L_{j,i}} \end{aligned} \quad \text{(Formula 14)}$$
  • Note, however, that the first line uses the fact that the label L_{j,i} takes 1 when the score s_{j,i} between the training data Q_j and the non-pivot X_i is smaller than the threshold value r and takes 0 otherwise, and that its probability depends on the Δ score ΔS_{j,i}. Moreover, a_{j,i} is:

    $$a_{j,i} = \ln \frac{P(\Delta S_{j,i} \mid s_{j,i} < r)\, P(s_{j,i} < r)}{P(\Delta S_{j,i}) - P(\Delta S_{j,i} \mid s_{j,i} < r)\, P(s_{j,i} < r)} \quad \text{(Formula 15)}$$
  • By using the logistic regression described above, this can be approximated as:

    $$a_{j,i} \approx w_{i,1}\, \Delta S_{j,i} + w_{i,0} \quad \text{(Formula 16)}$$
  • For P(w_i), there is a method of assuming, for example, a normal distribution with mean vector 0 and variance-covariance matrix Σ_0:

    $$P(w_i) = \mathcal{N}(0, \Sigma_0) \quad \text{(Formula 17)}$$
  • For Σ_0, there are, for example, a method of presetting it at an adequate value and a method of determining it automatically from the training data by the empirical Bayes method. Moreover, a mean vector other than 0 may be used, and a distribution model other than the normal distribution, for example an exponential distribution or a gamma distribution, may be used.
  • At this point, the regression coefficient w_i^MAP or w_i^ML obtained through the maximum a posteriori probability estimation or the maximum likelihood estimation (that is, the one that maximizes Formula 12 or 13) can be calculated by using, for example, the Newton-Raphson method. This is a method of iteratively obtaining w_i^MAP or w_i^ML with the following procedure (a code sketch follows the procedure):
    1. An initial value w_i^{(0)} of w_i is set appropriately; for example, w_i^{(0)} = 0, and τ ← 0.
    2. w_i^{(τ+1)} is obtained as follows, where τ is the iteration count:

    $$w_i^{(\tau+1)} = w_i^{(\tau)} - \left[ \nabla\nabla E(w_i^{(\tau)}) \right]^{-1} \nabla E(w_i^{(\tau)}) \quad \text{(Formula 18)}$$
  • Note that E(w_i^{(τ)}) is the negative logarithm of the posterior probability or of the likelihood, and is called the error function; the symbol ∇ is the differential operator vector. In the case of the maximum a posteriori probability estimation,

    $$E(w_i^{(\tau)}) = -\ln P(L_i \mid \Delta S_i, w_i^{(\tau)})\, P(w_i^{(\tau)}) \quad \text{(Formula 19)}$$

    and in the case of the maximum likelihood estimation,

    $$E(w_i^{(\tau)}) = -\ln P(L_i \mid \Delta S_i, w_i^{(\tau)}) \quad \text{(Formula 20)}$$
  • Moreover, ∇E(w_i^{(τ)}) and ∇∇E(w_i^{(τ)}) are the first-order derivative (gradient) column vector and the second-order derivative (Hessian) matrix, respectively. For example, in the case of the maximum a posteriori probability estimation, when Formulae 14, 16, and 17 are employed,

    $$\nabla E(w_i^{(\tau)}) = \Sigma_0^{-1} w_i^{(\tau)} + \sum_{j=1}^{N'} \left( \sigma(a_{j,i}^{(\tau)}) - L_{j,i} \right) x_j \quad \text{(Formula 21)}$$

    $$\nabla\nabla E(w_i^{(\tau)}) = \Sigma_0^{-1} + \sum_{j=1}^{N'} \sigma(a_{j,i}^{(\tau)}) \left( 1 - \sigma(a_{j,i}^{(\tau)}) \right) x_j x_j^T \quad \text{(Formula 22)}$$

    can be obtained, where

    $$a_{j,i}^{(\tau)} \approx w_{i,1}^{(\tau)}\, \Delta S_{j,i} + w_{i,0}^{(\tau)} \quad \text{(Formula 23)}$$

    $$x_j = (\Delta S_{j,i}, 1)^T \quad \text{(Formula 24)}$$
    3. When the difference between w_i^{(τ+1)} and w_i^{(τ)} is sufficiently small, or when τ exceeds a fixed value, the procedure ends with w_i^{(τ+1)} as w_i^MAP or w_i^ML. Otherwise, set τ ← τ+1 and return to step 2.
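As a concrete illustration, below is a minimal Python sketch of the MAP variant of this Newton-Raphson procedure, transcribing Formulas 14 and 18 and 21 through 24 with NumPy; the function and variable names are our own assumptions, and the maximum-likelihood variant would simply drop the Σ_0^{-1} terms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_map(delta_scores, labels, sigma0, iters=100, tol=1e-8):
    """MAP training of w_i = (w_i1, w_i0)^T by Newton-Raphson.
    delta_scores: ΔS_j,i for each training datum; labels: L_j,i in {0,1};
    sigma0: 2x2 prior covariance Σ0. A sketch, not the patented implementation."""
    X = np.column_stack([delta_scores, np.ones(len(delta_scores))])  # rows x_j^T (Formula 24)
    L = np.asarray(labels, dtype=float)
    S0inv = np.linalg.inv(sigma0)
    w = np.zeros(2)                                       # w^(0) = 0
    for _ in range(iters):
        a = X @ w                                         # a_j,i (Formula 23)
        p = sigmoid(a)
        grad = S0inv @ w + X.T @ (p - L)                  # gradient (Formula 21)
        hess = S0inv + X.T @ (X * (p * (1 - p))[:, None]) # Hessian (Formula 22)
        w_new = w - np.linalg.solve(hess, grad)           # Newton step (Formula 18)
        if np.linalg.norm(w_new - w) < tol:               # step 3: convergence test
            return w_new
        w = w_new
    return w

# Illustrative call: four training data, the two with small Δ scores labeled 1.
w = train_map(np.array([1.0, 2.0, 6.0, 8.0]), [1, 1, 0, 0], np.eye(2) * 10.0)
```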
  • Next, the method of training the index vector size Z_i will be described. To this end, the above operation is performed while the index vector size Z_i is varied over various values (for example, 1 to M), and the w_i^MAP or w_i^ML for which the error function is as small as possible, together with the Z_i that achieves it, may be provided as the training result. This makes it possible to obtain the best parameter in terms of accuracy.
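A sketch of this per-non-pivot sweep, assuming a caller-supplied training routine (such as the `train_map` sketch above) and an error-function evaluator; both callables are assumptions.

```python
def train_index_size(train_fn, error_fn, m):
    """Sweep the index vector size Z_i over 1..M and keep the pair
    (w, Z_i) whose error function is smallest. `train_fn(z)` trains the
    regression coefficient for size z; `error_fn(w, z)` evaluates E(w)."""
    best_e, best_w, best_z = float("inf"), None, None
    for z in range(1, m + 1):
        w = train_fn(z)
        e = error_fn(w, z)
        if e < best_e:
            best_e, best_w, best_z = e, w, z
    return best_w, best_z
```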
  • Alternatively, the non-pivot-specific parameter may be trained so that the sum of the error functions over the non-pivots becomes as small as possible while the index size is equal to or smaller than a fixed value. To this end, the w_i^MAP with which the sum of error functions over the non-pivots becomes smallest while the Z_i of each non-pivot is varied over various values within the range where the index size is equal to or smaller than the fixed value, together with the Z_i that realizes this, may be provided as the training results (M+1 ≤ i ≤ N). This makes it possible, when a required value is set for the size of the supplementary information, to realize the best accuracy within the range that satisfies it.
  • Moreover, in this embodiment, obtaining the labels L_{j,i} (1 ≤ j ≤ N', M+1 ≤ i ≤ N) requires calculation of a total of (N-M)×N' scores, which typically takes a great deal of time. Thus, the Δ score between each non-pivot and each of the N' pieces of training data may be obtained, ν' (< N') pieces of the training data may be selected in ascending order of Δ score (where ν' is a value predefined by a system manager or the like), and only these may be used for training. A piece of training data with a small Δ score is highly likely to be similar to the non-pivot, so this reduces the number of score calculations required for the training to (N-M)×ν' while suppressing as much as possible the reduction in the number of labels L_{j,i} that take 1 (that is, of training data similar to the non-pivot X_i). This consequently provides the effect of high-speed training.
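A one-function sketch of this pre-selection, assuming the Δ scores to the non-pivot have already been computed:

```python
import numpy as np

def select_training_subset(delta_scores_to_xi, nu):
    """Return indices of the ν' training data with the smallest Δ score to
    the non-pivot X_i; only these are then scored and labeled, cutting the
    score calculations from (N-M) x N' to (N-M) x ν'."""
    return np.argsort(np.asarray(delta_scores_to_xi))[:nu]
```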
  • Moreover, for example, in a case where the pieces of enrolled data form several clusters in the feature space, a parameter such as the index vector size may take a similar or the same value within each cluster.
  • Therefore, in this embodiment, clustering may be performed on the non-pivots, and the non-pivot-specific parameter may be trained so that some or all of the parameters are common within each obtained cluster. As the clustering method, any of the hierarchical methods such as the nearest neighbor method, the farthest neighbor method, the group average method, and Ward's method may be used. Training a common parameter for each cluster in this manner makes it possible to reduce the size of the parameters, which consequently makes the system still more lightweight.
  • Moreover, in a case where the enrolled data is used as the training data, when enrolled data has been added, it is possible that the parameter training is not performed successfully because little training data is available for the added data. However, training the common parameter for each cluster as described above makes it possible to perform the parameter training easily by using the common parameter of the cluster to which the concerned enrolled data belongs.
  • [Second Embodiment]
  • Hereinafter, the second embodiment will be described with reference to the accompanying drawings. The similarity search system of this embodiment is a biological body identification system which, when a user who attempts authentication (hereinafter referred to as the authenticated user) inputs biological body information at a client terminal, searches a database for similar biological body information, thereby identifies to which user enrolled in the database (hereinafter referred to as an enrolled user) the authenticated user corresponds, and performs authentication based on the result of this identification.
  • FIG. 7 shows a configuration example of the biological body identification system of this embodiment. Here, only the points different from FIG. 1 will be described. In this embodiment, raw data is biological body information.
  • This system is composed of: an enrollment terminal 100 that transmits to a server terminal a feature of biological body information obtained from the user; a server terminal 200 that saves enrollment information, generates supplementary information from the enrollment information, and performs biological body identification on a feature for authentication by using the enrollment information and the supplementary information; a client terminal 300 that transmits to the server terminal 200 a group ID and the feature for the authentication inputted by the user; and a network 400.
  • For example, for an information access control system or an attendance management system of a firm, it is possible to form the enrollment terminal 100 with a plurality of PCs in the firm, the server terminal 200 with one server in a data center operated by the firm, the client terminal 300 with a plurality of employees' PCs, and the network 400 with the Internet. Moreover, for an entrance and exit management system in the firm, it is possible to form the enrollment terminal 100, the server terminal 200, and the client terminal 300 in the same entrance and exit management device. The group ID 221 may be a value specific to the business place to which the user belongs, or may be set to be specific to each client terminal 300 or each base. In the former case, the user inputs the group ID at the time of authentication; in the latter case, the user is not required to input it.
  • The enrollment terminal 100 further has: a group ID/user name acquisition unit 103 that acquires a group ID and a user name; and a feature extraction unit 104 that extracts a feature from raw data.
  • The server terminal 200 does not have the feature extraction unit 202 but has a group narrowing unit 209a, and holds the master data 220 for each group ID. The master data 220 has a group ID 221. The enrollment information 230 does not have the raw data 232 but has a user name 234 for each piece of enrollment information.
  • Possible features of the biological body information are, for example, minutiae for a fingerprint, an iris code for an iris, and cepstrum for a voiceprint. Possible scores between two pieces of biological body information are the number or ratio of matching minutiae for the fingerprint, a Hamming distance for the iris, and a Mahalanobis distance for the voiceprint. A minimal sketch of the iris case follows.
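As one concrete example, the iris-code score could be computed as a normalized Hamming distance; this sketch is an assumption for illustration (real systems also apply occlusion masks, which are omitted here):

```python
import numpy as np

def iris_hamming_score(code_a, code_b):
    """Fraction of disagreeing bits between two equal-length 0/1 iris codes;
    a smaller score means a closer match."""
    a, b = np.asarray(code_a), np.asarray(code_b)
    return np.count_nonzero(a != b) / a.size

print(iris_hamming_score([0, 1, 1, 0, 1, 0, 0, 1],
                         [0, 1, 0, 0, 1, 1, 0, 1]))  # -> 0.25
```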
  • The client terminal 300 further has: a group ID acquisition unit 303 that acquires a group ID; and a feature extraction unit 304 that extracts a feature from raw data.
  • Hardware configuration of the enrollment terminal 100, the server terminal 200, and the client terminal 300 according to this embodiment is the same as that of FIG. 2.
  • FIG. 8 shows processing procedures and a data flow of enrollment processing according to this embodiment. Step S101 of FIG. 8 is equal to step S101 of FIG. 3.
  • The enrollment terminal 100 acquires a group ID and a user name from the user (step S101a).
  • The enrollment terminal 100 extracts a feature for enrollment from raw enrolled data (step S102a).
  • The enrollment terminal 100 transmits to the server terminal 200 the group ID, the user name, and the feature for enrollment (Step S103a).
  • If the master data 220 corresponding to the group ID is in the database 210, the server terminal 200 adds to that master data 220 the enrollment information 230 including an enrolled data ID 231 specific to the enrolled data, the user name 234, and a feature 233 for enrollment. If there is no such master data 220, new master data 220 including the group ID 221 is created, holding the enrollment information 230 including the enrolled data ID 231 specific to the enrolled data, the user name 234, and the feature 233 for enrollment (step S104a).
  • Processing procedures and a data flow of supplementary information generation processing according to this embodiment are the same as those of FIG. 4. Note, however, that this processing is performed for each group ID. The number N of pieces of the enrollment information 230 and the number M of pivots may differ from one group ID to another.
  • FIG. 9 shows processing procedures and a data flow of search processing according to this embodiment. Steps S302, S305 to S312, and S314 of FIG. 9 are the same as steps S302, S305 to S312, and S314 of FIG. 5.
  • The server terminal 200 acquires the master data 220 for each group ID from the database 210 (step S301a).
  • The client terminal 300 acquires the group ID from the user (step S302a). The group ID may be a value specific to each client terminal 300 or each base, or may not be acquired from the user in this case.
  • The client terminal 300 extracts a feature for search from raw search data (step S303a).
  • The client terminal 300 transmits the group ID and the feature for search to the server terminal 200 (step S304a) .
  • The server terminal 200 restricts the target of the search to the master data corresponding to the acquired group ID (step S305a).
  • As described above, in this embodiment, the enrolled data is narrowed by using the group ID. This makes it possible to dramatically reduce the number of pieces of enrolled data for which the score is calculated, and consequently provides the effect of further improving speed.
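The narrowing itself can be as simple as a dictionary lookup keyed by the group ID; a sketch with assumed structures and hypothetical records:

```python
def narrow_by_group(master_data, group_id):
    """Step S305a as a sketch: master_data maps each group ID 221 to the
    list of enrollment records searched for that group; an unknown group
    yields an empty search target."""
    return master_data.get(group_id, [])

# Illustrative usage:
master = {"tokyo-hq": ["user-a", "user-b"], "osaka": ["user-c"]}
print(narrow_by_group(master, "osaka"))  # -> ['user-c']
```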
  • The server terminal 200 transmits, as a search result, the user name 234 corresponding to the enrolled data to the client terminal 300 (step S313a).
  • The client terminal 300 displays the user name 234 corresponding to the enrolled data as the search result (step S314a).
  • The present invention is applicable to any application that performs similarity search on unstructured data such as an image, a moving picture, music, a document, binary data, or biological body information. For example, the invention is applicable to a similar image search system, a similar moving picture search system, a similar music search system, a similar document search system, a similar file search system using fuzzy hash, an information access control system, an attendance management system, and an entrance and exit management system.

Claims (15)

  1. A similarity search system comprising:
    a pivot determination unit that determines a pivot from enrolled data;
    a raw data acquisition unit that acquires raw data;
    a feature extraction unit that extracts features from the raw data;
    a score calculation unit that calculates a score as one of a distance and a degree of similarity between the features;
    an index vector generation unit that generates an index vector by using the score for the pivot;
    a Δ score calculation unit that calculates a Δ score as one of a distance and a degree of similarity between the index vectors;
    a non-pivot-specific parameter training unit that trains, by using training data, a parameter of each non-pivot including a regression coefficient;
    a non-pivot selection order determination unit that determines, by using the Δ score between search data and the non-pivot as well as the regression coefficient, an order to select the non-pivots in descending order of posterior probability through logistic regression;
    a search result output unit that outputs a search result based on the score between the search data and the enrolled data; and
    a database that holds the feature of the enrolled data, pivot information indicating which piece of the enrolled data is the pivot, an index including the index vector of each non-pivot, and the parameter of each non-pivot.
  2. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit trains the parameter of each non-pivot including an index vector size.
  3. The similarity search system according to claim 2,
    wherein the non-pivot-specific parameter training unit trains the parameter of each non-pivot including the index vector size so as to provide the smallest possible error function.
  4. The similarity search system according to claim 2,
    wherein the non-pivot-specific parameter training unit trains the parameter of each non-pivot including the index vector size so that a sum of error functions for the non-pivots becomes as small as possible while a size of the index is equal to or smaller than a fixed value.
  5. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit trains the parameter of each non-pivot through maximum a posteriori probability estimation.
  6. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit trains the parameter of each non-pivot through maximum likelihood estimation.
  7. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit, for each non-pivot, calculates a Δ score from the training data and selects the training data to be used for training by using the Δ score.
  8. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit uses the enrolled data as the training data.
  9. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit uses, as the training data, data previously prepared separately from the enrolled data.
  10. The similarity search system according to claim 1,
    wherein the non-pivot-specific parameter training unit performs clustering on the non-pivots and trains the parameter of each non-pivot so that some or all of the parameters are common for each obtained cluster.
  11. The similarity search system according to claim 1,
    wherein the index vector generation unit generates a permutation vector as the index vector.
  12. The similarity search system according to claim 1,
    wherein the index vector generation unit generates a score vector as the index vector.
  13. The similarity search system according to claim 1, having a group narrowing unit that narrows the enrolled data by using a group ID,
    wherein the database holds the group ID.
  14. A high-precision similarity search method in a server terminal performing similarity search on raw data transmitted from a client terminal by an enrollment terminal, the high-precision similarity search method comprising the steps of:
    generating enrolled data composed of features extracted from the raw data;
    selecting a pivot from the enrolled data;
    calculating a score defined as one of a distance and a degree of similarity between the features;
    generating an index vector by using the score for the pivot;
    calculating a Δ score defined as one of a distance and a degree of similarity between the index vectors;
    training, by using prepared training data, a parameter including a regression coefficient of each non-pivot not selected as the pivot from the enrolled data;
    determining, by using the Δ score between inputted search data and the non-pivot as well as the regression coefficient, an order to select the non-pivots in descending order of posterior probability through logistic regression;
    outputting a search result based on the score between the search data and the enrolled data; and
    holding in a database the features of the enrolled data, pivot information indicating which piece of the enrolled data is the pivot, an index including the index vector of each non-pivot, and a parameter of each non-pivot.
  15. The high-precision similarity search method according to claim 14, further comprising the steps of:
    in the determination of the selection order, training, by using the training data, the parameter of each non-pivot including the regression coefficient; and determining, by using the Δ score between the search data and the non-pivot as well as the regression coefficient, the order to select the non-pivots in descending order of posterior probability through the logistic regression.




