CN104376120A - Information retrieval method and system - Google Patents

Information retrieval method and system

Info

Publication number
CN104376120A
CN104376120A
Authority
CN
China
Prior art keywords
model
data
samples
module
distance function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410733635.2A
Other languages
Chinese (zh)
Other versions
CN104376120B (en)
Inventor
皮特
李玺
张仲非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201410733635.2A priority Critical patent/CN104376120B/en
Publication of CN104376120A publication Critical patent/CN104376120A/en
Application granted granted Critical
Publication of CN104376120B publication Critical patent/CN104376120B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information retrieval method comprising the following steps: S10, data for ranking learning are input and feature extraction is performed on the data; S20, the obtained data sample features are input together with similarity information between samples, and a training data set composed of query sample-ranking list pairs is obtained; S30, a mathematical model is built; S40, update formulas for all parameters are derived, and the parameters of the Bregman distance function and the slack variables of the model are updated iteratively until all parameters converge; S50, a newly input query sample is retrieved on the data set: all samples in the data set are ranked in ascending order of their distance to the query sample, and the ranking is output as the retrieval result. The method combines the advantages of the structural support vector machine and the Bregman distance function, overcomes the limitations of traditional distance functions, and achieves high retrieval accuracy.

Description

Information retrieval method and system
Technical Field
The invention relates to the technical field of information retrieval, in particular to an information retrieval method and an information retrieval system.
Background
In the information age, data in various forms has grown explosively, and information retrieval techniques for finding the information a user needs within large amounts of data are essential. In particular, ranking learning is an active research topic in information retrieval and data mining. The goal of ranking learning is to learn a ranking function that accurately characterizes the correlation between data samples: given an input query sample, the ranking function outputs a ranking list in which samples related to the query are ranked as far forward as possible and unrelated samples as far backward as possible. Since the degree of correlation between data samples is usually determined by a similarity or distance metric, the essence of ranking learning is to learn a similarity or distance function that accurately characterizes this correlation, so that similar or related samples are close in distance and dissimilar or unrelated samples are far apart.
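At retrieval time, the ranking function described above reduces to sorting the data set by a learned distance to the query. A minimal sketch in plain Python, with an ordinary Euclidean distance standing in for the learned metric (the helper name `rank_by_distance` and the toy samples are illustrative, not from the patent):

```python
import math

def rank_by_distance(query, samples, dist):
    """Return indices of samples sorted by ascending distance to the query:
    related samples (small distance) come first, unrelated ones last."""
    return sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

samples = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
ranking = rank_by_distance((0.4, 0.4), samples, euclidean)
print(ranking)  # nearest sample first: [0, 2, 1]
```

Learning a better `dist` is exactly what makes related samples land at the front of this ranking.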
How to learn an effective distance function that captures the intrinsic patterns of the data features and the correlations between data samples is a fundamental problem in data mining. Conventional distance metric learning methods have two limitations. First, existing methods usually assume a single metric fixed over the whole feature space; they therefore lack flexibility and generalization capability, and it is difficult for them to mine local patterns of the data. Second, for high-dimensional data, traditional metric learning is computationally expensive and may even be intractable. Take the most common Mahalanobis distance as an example:
d_M(x_a, x_b) = (x_a - x_b)^T M (x_a - x_b)
where M is a symmetric positive semidefinite matrix. The metric matrix M is fixed over the whole input space, offering no flexibility, and the number of variables in M to be solved grows with the square of the data dimension, making high-dimensional data difficult to handle. Furthermore, the Mahalanobis distance is equivalent to the squared Euclidean distance after linearly mapping the data from the original feature space into an implicit subspace:
d_M(x_a, x_b) = ||R(x_a - x_b)||^2
where R^T R = M. The Mahalanobis distance can therefore only mine linear correlation patterns in the data features and cannot capture the complex nonlinear patterns implicit in them. In summary, a new distance function learning method is needed to overcome the above limitations of traditional distance functions.
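The equivalence d_M(x_a, x_b) = ||R(x_a - x_b)||^2 with R^T R = M can be checked numerically. A small sketch in plain Python (the 2-D matrices and vectors are toy values chosen for illustration):

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mahalanobis_sq(xa, xb, M):
    """(xa - xb)^T M (xa - xb) for a symmetric positive semidefinite M."""
    d = [a - b for a, b in zip(xa, xb)]
    return sum(di * mi for di, mi in zip(d, mat_vec(M, d)))

# Pick any R and set M = R^T R; the quadratic form then equals ||R d||^2.
R = [[2.0, 0.0], [1.0, 1.0]]
M = [[sum(R[k][i] * R[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

xa, xb = [3.0, 1.0], [1.0, 2.0]
d = [a - b for a, b in zip(xa, xb)]
lhs = mahalanobis_sq(xa, xb, M)           # quadratic form with M
rhs = sum(v * v for v in mat_vec(R, d))   # squared norm after mapping by R
print(lhs, rhs)  # both 17.0
```

The two numbers agree because M = R^T R, which is precisely why the Mahalanobis metric can only express linear mappings of the feature space.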
Disclosure of Invention
To solve the above problems, an object of the present invention is to provide an information retrieval method that can capture complex nonlinear patterns hidden in the data and efficiently process high-dimensional data, so that similar or related samples are drawn closer together and dissimilar or unrelated samples are pushed farther apart, thereby improving the efficiency and accuracy of retrieval.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an information retrieval method, comprising the steps of:
S10: inputting data for ranking learning, performing feature extraction on the data, and converting the original data into data sample features for machine learning;
S20: inputting the obtained data sample features together with similarity information between samples, and obtaining a training data set composed of query sample-ranking list pairs;
S30: establishing a mathematical model for the obtained training data set, based on the structural support vector machine and the Bregman distance function;
S40: deriving an update formula for each parameter from the established mathematical model, and iteratively updating the parameters of the Bregman distance function and the slack variables of the model until all parameters converge;
S50: retrieving a newly input query sample on the data set according to the obtained Bregman distance function, ranking the samples in the data set in ascending order of their distance to the query sample, and outputting the ranking as the retrieval result.
Further, in step S30, a structure learning model is established with the structural support vector machine as the framework, the overall ranking-structure cost based on the Bregman distance function is optimized, and a regularization term is added for adjustment;
the established mathematical model covers both a parametric model and a nonparametric model, i.e., the Bregman distance function in the model may take a parametric or a nonparametric form.
Further, step S40 includes:
S401: approximating the established mathematical model with the single-slack cutting-plane method so that the model parameters become solvable, and deriving the update formulas of the model parameters;
S402: iteratively updating the parameters of the model according to the derived update formulas until the parameters converge.
Another technical solution of the invention is as follows:
an information retrieval system comprising a data preprocessing module, a model input processing module, a modeling module, a parameter updating module, and a retrieval module. The data preprocessing module takes in data for ranking learning, performs feature extraction on the data, and outputs data sample features for machine learning; the model input processing module takes in the data sample features produced by the data preprocessing module together with similarity information between samples and outputs a training data set composed of query sample-ranking list pairs; the modeling module establishes a mathematical model, based on the structural support vector machine and the Bregman distance function, from the training data set output by the model input processing module; the parameter updating module derives an update formula for each parameter from the mathematical model output by the modeling module and iteratively updates the parameters of the Bregman distance function and the slack variables of the model until convergence; the retrieval module retrieves newly input query samples according to the Bregman distance function obtained by the parameter updating module, ranks the samples in the data set in ascending order of their distance to the query sample, and outputs the ranking as the retrieval result.
Further, the modeling module is also used for establishing a structure learning model with the structural support vector machine as the framework, optimizing the overall ranking-structure cost based on the Bregman distance function, and adding a regularization term for adjustment.
Further, the modeling module is also used for both parametric and nonparametric modeling.
Furthermore, the parameter updating module approximates the mathematical model output by the modeling module with the single-slack cutting-plane method so that the model parameters become solvable, derives the update formulas of the model parameters, and iteratively updates the parameters of the model according to the derived formulas until they converge.
The information retrieval method of the invention combines the advantages of the structural support vector machine and the Bregman distance function and overcomes the limitations of traditional distance functions. Compared with existing ranking learning and distance metric learning methods, the method of the invention achieves higher retrieval accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a flow chart of modeling parameter update of the present invention.
FIG. 3 is a block diagram of the module structure of the information retrieval system of the present invention.
Detailed Description
The embodiment of the invention provides an information retrieval method.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one skilled in the art from the embodiments given herein are intended to be within the scope of the invention.
The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the invention in its embodiments for distinguishing between objects of the same nature. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The following are detailed below.
Referring to fig. 1, the information retrieval method of the present invention learns a Bregman distance function with the structural support vector machine as the framework and, on that basis, outputs a ranked retrieval result for a new query sample. The method specifically comprises the following steps:
S10: inputting data for ranking learning, performing feature extraction on the data, and converting the original data into data sample features for machine learning;
S20: inputting the obtained data sample features together with similarity information between samples, and obtaining a training data set composed of query sample-ranking list pairs;
S30: establishing a mathematical model for the obtained training data set, based on the structural support vector machine and the Bregman distance function;
further, step S30 includes:
S301: establishing a structure learning model with the structural support vector machine as the framework, optimizing the overall ranking-structure cost based on the Bregman distance function, and adding a regularization term for adjustment;
the established mathematical model covers both a parametric model and a nonparametric model, i.e., the Bregman distance function in the model may take a parametric or a nonparametric form.
S40: deriving an update formula for each parameter from the established mathematical model, and iteratively updating the parameters of the Bregman distance function and the slack variables of the model until all parameters converge;
further, step S40 includes:
S401: approximating the established mathematical model with the single-slack cutting-plane method so that the model parameters become solvable, and deriving the update formulas of the model parameters;
S402: iteratively updating the parameters of the model according to the derived update formulas until the parameters converge.
S50: retrieving the newly input query sample on the data set according to the obtained Bregman distance function, ranking the samples in the data set in ascending order of their distance to the query sample, and outputting the ranking as the retrieval result.
The information retrieval method of the invention is illustrated below using an image data set with category label information, SIFT features, and the nonparametric modeling approach as an example. The method comprises the following steps:
S100: inputting image data for ranking learning, extracting SIFT (scale-invariant feature transform) features from the image data, and converting the images into numerical features for machine learning, recorded as X = [x_1, ..., x_n], where n is the number of image samples, x_i ∈ R^m (i = 1, ..., n) is the extracted feature of the i-th image, and m is the feature dimension;
S200: inputting the image features X obtained in step S100 together with the category labels of the image samples; two image samples with the same label are regarded as related and two with different labels as unrelated. Following the principle that samples related to a query sample are ranked before unrelated samples, a training data set composed of query sample-ranking list pairs is built and recorded as {(x_i, y_i^*)}_{i=1}^n, where x_i denotes the feature of the i-th image sample and y_i^* denotes the true ranking list corresponding to the i-th image sample;
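The construction of query sample-ranking list pairs from class labels in step S200 can be sketched as follows (plain Python; the toy labels and the helper name `true_ranking` are illustrative, not from the patent):

```python
def true_ranking(query_idx, labels):
    """True ranking list for one query: same-label (related) samples first,
    different-label (unrelated) samples after; the query itself is excluded."""
    related = [i for i, l in enumerate(labels)
               if i != query_idx and l == labels[query_idx]]
    unrelated = [i for i, l in enumerate(labels)
                 if i != query_idx and l != labels[query_idx]]
    return related + unrelated

# Toy labels standing in for image category labels.
labels = ['cat', 'dog', 'cat', 'dog']
training_set = [(i, true_ranking(i, labels)) for i in range(len(labels))]
print(training_set[0])  # (0, [2, 1, 3]): sample 2 shares label 'cat' with query 0
```

Each pair couples a query sample with the ranking list the learned distance function should reproduce.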
S300: based on the structural support vector machine and the Bregman distance function, a mathematical model is established for the SIFT image features obtained in step S100 and the training data set of query sample-ranking list pairs obtained in step S200. Specifically, the model is established as follows:
first, a symmetric brewagen distance function is used as a distance measure reflecting the correlation between samples:
wherein x isa,xb∈RmIs a strictly convex function of the shape of the circle,representing a functionOf the gradient of (c).
The convex function φ is formalized in a nonparametric way: suppose φ belongs to the reproducing kernel Hilbert space defined by a positive definite kernel k(x_a, x_b), i.e., φ has the form
φ(x) = Σ_{i=1}^n α_i k(x_i, x)
where k(x_a, ·) should be a convex function for arbitrary x_a, and α_i ≥ 0 (i = 1, ..., n) ensures that φ is convex. Examples of such kernels k are k(x_a, x_b) = (x_a^T x_b + 1)^2, exp(x_a^T x_b), and the like.
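Under this nonparametric form, both φ and its gradient are explicit, so the symmetric Bregman distance can be evaluated directly. A sketch with the polynomial kernel k(x_a, x_b) = (x_a^T x_b + 1)^2, using the fact that d/dx (x_i^T x + 1)^2 = 2(x_i^T x + 1) x_i (the anchor points and α values below are toy choices; with α_i ≥ 0 each term of φ is convex):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def grad_phi(x, anchors, alpha):
    """Gradient of phi(x) = sum_i alpha_i * (x_i^T x + 1)^2."""
    g = [0.0] * len(x)
    for xi, ai in zip(anchors, alpha):
        c = 2.0 * ai * (dot(xi, x) + 1.0)   # scalar coefficient of x_i
        g = [gj + c * xij for gj, xij in zip(g, xi)]
    return g

def bregman_sym(xa, xb, anchors, alpha):
    """Symmetric Bregman distance (xa - xb)^T (grad phi(xa) - grad phi(xb))."""
    ga, gb = grad_phi(xa, anchors, alpha), grad_phi(xb, anchors, alpha)
    return dot([a - b for a, b in zip(xa, xb)],
               [p - q for p, q in zip(ga, gb)])

anchors = [[1.0, 0.0], [0.0, 1.0]]  # the x_i defining phi
alpha = [0.5, 0.5]                  # alpha_i >= 0 keeps phi convex
d = bregman_sym([1.0, 1.0], [0.0, 0.0], anchors, alpha)
print(d)  # 2.0
```

Because φ is convex, the distance is always nonnegative, and learning the α_i reshapes the metric nonlinearly without ever touching an m×m matrix.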
The learning model is established as:
(α, ξ) = argmin_{α,ξ} (1/2) α^T K α + (C/n) Σ_{i=1}^n ξ_i
subject to, for each training pair (x_i, y_i^*) and every ranking y ∈ Y,
F(x_i, y_i^*) - F(x_i, y) ≥ Δ(y, y_i^*) - ξ_i,  ξ_i ≥ 0,
where D^+ and D^- are the sets of samples relevant and irrelevant, respectively, to the query sample x; Δ(y, y^*) is the ranking-structure loss function, quantifying the penalty for predicting the ranking y when the true ranking is y^*, and satisfying Δ(y, y) = 0 and Δ(y', y) > 0 for any y' ≠ y, for example Δ(y, y^*) = 1 - AUC(y, y^*) or 1 - MAP(y, y^*); K = [k(x_i, x_j)]_{n×n} is the kernel matrix; and Y is the value space of ranking lists.
In the above model, F(x, y) denotes the correspondence score between the query sample x and the ranking list y. The purpose of the constraints is to impose penalties on candidate rankings in proportion to their closeness to the true ranking: for a ranking y that differs more from the true ranking y^* (i.e., Δ(y, y^*) is larger), the scoring margin should be as large as possible, while for a ranking y closer to y^* (Δ(y, y^*) smaller), the scoring margin may be smaller.
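One concrete choice of the ranking-structure loss named above, Δ(y, y^*) = 1 - AUC(y, y^*), is the fraction of (relevant, irrelevant) pairs ordered incorrectly. A sketch (plain Python; representing relevance as a set of relevant sample ids is an illustrative simplification):

```python
def auc_of_ranking(ranking, relevant):
    """Fraction of (relevant, irrelevant) pairs where the relevant sample
    is ranked before the irrelevant one."""
    rel = set(relevant)
    pos = {s: r for r, s in enumerate(ranking)}
    pairs = correct = 0
    for a in ranking:
        if a in rel:
            for b in ranking:
                if b not in rel:
                    pairs += 1
                    if pos[a] < pos[b]:
                        correct += 1
    return correct / pairs if pairs else 1.0

def delta_auc(y, relevant):
    return 1.0 - auc_of_ranking(y, relevant)

print(delta_auc([2, 5, 1, 3], [2, 5]))  # 0.0: perfect ranking, Delta(y, y) = 0
print(delta_auc([1, 3, 2, 5], [2, 5]))  # 1.0: every pair misordered
```

As required of Δ, the loss is zero only for the true ranking and grows with the number of misordered pairs.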
S400: the mathematical model established in step S300 is approximated using the single-slack cutting-plane method. The flow of the algorithm is shown in fig. 2 and consists of the following three steps:
(1) initialize the active constraint set ACS as the empty set: ACS = ∅;
(2) learn the model parameters (α, ξ) under the current ACS:
(α, ξ) = argmin_{α,ξ} (1/2) α^T K α + C ξ
(3) update the active constraint set ACS with the currently most violated rankings:
ACS = ACS ∪ {ŷ_1, ..., ŷ_n}
Steps (2) and (3) are iterated until the convergence condition is met, namely that the increment of the penalty term is no larger than a given threshold ε.
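The three steps above follow the generic single-slack cutting-plane pattern: grow an active constraint set, re-solve the restricted problem, and stop when the objective stops increasing. A structural sketch in plain Python (`solve_qp`, `most_violated`, and `penalty` are placeholders for the QP solver, the constraint generation, and the penalty evaluation; the toy callables below only exercise the control flow):

```python
def cutting_plane(solve_qp, most_violated, penalty, eps=1e-3, max_iter=50):
    acs = []                               # (1) active constraint set, empty
    params, prev = None, float('-inf')
    for _ in range(max_iter):
        params = solve_qp(acs)             # (2) learn (alpha, xi) under ACS
        acs = acs + most_violated(params)  # (3) add most violated rankings
        cur = penalty(params, acs)
        if cur - prev <= eps:              # converged: increment <= epsilon
            break
        prev = cur
    return params

# Toy callables: the "penalty" saturates at 3, so the loop terminates.
result = cutting_plane(solve_qp=lambda acs: len(acs),
                       most_violated=lambda p: [p],
                       penalty=lambda p, acs: min(len(acs), 3))
print(result)  # 3
```

Keeping a single slack variable ξ shared across all constraints is what makes each restricted QP small enough to re-solve cheaply at every iteration.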
S500: according to the Bregman distance function d_φ defined by the model parameter α output in step S400, the newly input query sample is retrieved on the data set: the samples in the data set are ranked in ascending order of their distance to the query sample, and the ranking is output as the retrieval result.
Another embodiment of the present invention is an information retrieval system comprising a data preprocessing module, a model input processing module, a modeling module, a parameter updating module, and a retrieval module. The data preprocessing module takes in data for ranking learning, performs feature extraction on the data, and outputs data sample features for machine learning; the model input processing module takes in the data sample features produced by the data preprocessing module together with similarity information between samples and outputs a training data set composed of query sample-ranking list pairs; the modeling module establishes a mathematical model, based on the structural support vector machine and the Bregman distance function, from the training data set output by the model input processing module; the parameter updating module derives an update formula for each parameter from the mathematical model output by the modeling module and iteratively updates the parameters of the Bregman distance function and the slack variables of the model until convergence; the retrieval module retrieves newly input query samples according to the Bregman distance function obtained by the parameter updating module, ranks the samples in the data set in ascending order of their distance to the query sample, and outputs the ranking as the retrieval result.
Further, the modeling module establishes a structure learning model with the structural support vector machine as the framework, optimizes the overall ranking-structure cost based on the Bregman distance function, and adds a regularization term for adjustment; the modeling module also supports both parametric and nonparametric modeling, i.e., the Bregman distance function in the model may take a parametric or a nonparametric form.
Furthermore, the parameter updating module approximates the mathematical model output by the modeling module with the single-slack cutting-plane method so that the model parameters become solvable, derives the update formulas of the model parameters, and iteratively updates the parameters of the model according to the derived formulas until they converge.
The invention takes the structural support vector machine as the framework and the symmetric Bregman distance function as the distance measure for mining the correlation between samples:
d_φ(x_a, x_b) = (x_a - x_b)^T (∇φ(x_a) - ∇φ(x_b))
where x_a, x_b ∈ R^m are data samples (m-dimensional real vectors), φ: R^m → R is a strictly convex function, and ∇φ denotes its gradient. The Bregman distance function is a general form encompassing many distance measures, including the Euclidean and Mahalanobis distances, and has better generalization capability and flexibility; by designing and learning the convex function φ, complex local and nonlinear patterns in the data features can be mined while high-dimensional data are processed efficiently. The invention thus combines the advantages of the structural support vector machine and the Bregman distance function and overcomes the limitations of traditional distance functions. Compared with existing ranking learning and distance metric learning methods, the method of the invention achieves higher retrieval accuracy.
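The claim that the Bregman family subsumes the common metrics can be checked for the Euclidean case: with φ(x) = ||x||^2 we have ∇φ(x) = 2x, and the symmetric Bregman distance reduces to twice the squared Euclidean distance. A minimal sketch (toy vectors chosen for illustration):

```python
def sym_bregman(xa, xb, grad_phi):
    """(xa - xb)^T (grad_phi(xa) - grad_phi(xb))."""
    return sum((a - b) * (ga - gb)
               for a, b, ga, gb in zip(xa, xb, grad_phi(xa), grad_phi(xb)))

grad_sq = lambda x: [2.0 * v for v in x]  # phi(x) = ||x||^2  =>  grad = 2x

xa, xb = [3.0, 0.0], [0.0, 4.0]
d = sym_bregman(xa, xb, grad_sq)
print(d)  # 2 * ||xa - xb||^2 = 2 * 25 = 50.0
```

Swapping in a different convex φ (e.g. the kernel expansion of the embodiment) changes only `grad_phi`, which is the flexibility the invention exploits.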
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented with software plus the necessary general-purpose hardware, or with dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, functions performed by a computer program can readily be implemented with corresponding hardware, and the specific hardware structures implementing the same function may vary: analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is the preferred embodiment. On this understanding, the technical solutions of the invention may be embodied as a software product stored on a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, the product including instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. An information retrieval method, comprising the steps of:
S10: inputting data for ranking learning, performing feature extraction on the data, and converting the original data into data sample features for machine learning;
S20: inputting the obtained data sample features together with similarity information between samples, and obtaining a training data set composed of query sample-ranking list pairs;
S30: establishing a mathematical model for the obtained training data set, based on the structural support vector machine and the Bregman distance function;
S40: deriving an update formula for each parameter from the established mathematical model, and iteratively updating the parameters of the Bregman distance function and the slack variables of the model until all parameters converge;
S50: retrieving a newly input query sample on the data set according to the obtained Bregman distance function, ranking the samples in the data set in ascending order of their distance to the query sample, and outputting the ranking as the retrieval result.
2. The information retrieval method according to claim 1, characterized in that: in step S30, a structure learning model is established with the structural support vector machine as the framework, the overall ranking-structure cost based on the Bregman distance function is optimized, and a regularization term is added for adjustment;
the established mathematical model covers both a parametric model and a nonparametric model, and the Bregman distance function in the model takes a parametric or a nonparametric form.
3. The information retrieval method according to claim 2, wherein step S40 comprises:
S401: approximating the established mathematical model with the single-slack cutting-plane method so that the model parameters become solvable, and deriving the update formulas of the model parameters;
S402: iteratively updating the parameters of the model according to the derived update formulas until the parameters converge.
4. An information retrieval system, characterized in that: the system comprises a data preprocessing module, a model input processing module, a modeling module, a parameter updating module, and a retrieval module; the data preprocessing module takes in data for ranking learning, performs feature extraction on the data, and outputs data sample features for machine learning; the model input processing module takes in the data sample features produced by the data preprocessing module together with similarity information between samples and outputs a training data set composed of query sample-ranking list pairs; the modeling module establishes a mathematical model, based on the structural support vector machine and the Bregman distance function, from the training data set output by the model input processing module; the parameter updating module derives an update formula for each parameter from the mathematical model output by the modeling module and iteratively updates the parameters of the Bregman distance function and the slack variables of the model until convergence; the retrieval module retrieves newly input query samples according to the Bregman distance function obtained by the parameter updating module, ranks the samples in the data set in ascending order of their distance to the query sample, and outputs the ranking as the retrieval result.
5. The information retrieval system according to claim 4, characterized in that: the modeling module is further used for establishing a structure learning model with the structural support vector machine as the framework, optimizing the overall ranking-structure cost based on the Bregman distance function, and adding a regularization term for adjustment.
6. The information retrieval system according to claim 5, characterized in that: the modeling module is further used for both parametric and nonparametric modeling.
7. The information retrieval system according to claim 6, characterized in that: the parameter updating module approximates the mathematical model output by the modeling module with the single-slack cutting-plane method so that the model parameters become solvable, derives the update formulas of the model parameters, and iteratively updates the parameters of the model according to the derived formulas until they converge.
CN201410733635.2A 2014-12-04 2014-12-04 A kind of information retrieval method and system Expired - Fee Related CN104376120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410733635.2A CN104376120B (en) 2014-12-04 2014-12-04 A kind of information retrieval method and system

Publications (2)

Publication Number Publication Date
CN104376120A true CN104376120A (en) 2015-02-25
CN104376120B CN104376120B (en) 2018-01-23

Family

ID=52555027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410733635.2A Expired - Fee Related CN104376120B (en) 2014-12-04 2014-12-04 A kind of information retrieval method and system

Country Status (1)

Country Link
CN (1) CN104376120B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043784A (en) * 2009-10-17 2011-05-04 青岛理工大学 Semi-supervised clustering method for fusing pairwise constraint and attribute sequencing information
CN102982344A (en) * 2012-11-12 2013-03-20 浙江大学 Support vector machine sorting method based on simultaneously blending multi-view features and multi-label information
CN103971129A (en) * 2014-05-27 2014-08-06 浙江大学 Classification method and device based on learning image content recognition in cross-data field subspace

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEI WU et al.: "Learning Bregman Distance Functions for Semi-Supervised Clustering", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING *
TE PI et al.: "Structural Bregman Distance Functions Learning to Rank with Self-Reinforcement", 2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334590A (en) * 2018-01-30 2018-07-27 吴雨潞 A kind of information retrieval system
CN110232087A (en) * 2019-05-30 2019-09-13 湖南大学 Big data increment iterative method, apparatus, computer equipment and storage medium
CN110232087B (en) * 2019-05-30 2021-08-17 湖南大学 Big data increment iteration method and device, computer equipment and storage medium
CN110874638A (en) * 2020-01-19 2020-03-10 同盾控股有限公司 Behavior analysis-oriented meta-knowledge federation method, device, electronic equipment and system
CN113449878A (en) * 2021-06-24 2021-09-28 西安交通大学 Data distributed incremental learning method, system, equipment and storage medium
CN113449878B (en) * 2021-06-24 2024-04-02 西安交通大学 Data distributed incremental learning method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN104376120B (en) 2018-01-23

Similar Documents

Publication Publication Date Title
US11341424B2 (en) Method, apparatus and system for estimating causality among observed variables
Cheng et al. Evolutionary computation for solving search-based data analytics problems
Wu et al. Online feature selection with streaming features
Martínez et al. Scalable learning of Bayesian network classifiers
US10482146B2 (en) Systems and methods for automatic customization of content filtering
Jiang et al. Predicting protein function by multi-label correlated semi-supervised learning
CN114119058B (en) User portrait model construction method, device and storage medium
Sarwar et al. A survey of big data analytics in healthcare
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN112395487A (en) Information recommendation method and device, computer-readable storage medium and electronic equipment
Sugiharti et al. Predictive evaluation of performance of computer science students of UNNES using data mining based on Naïve Bayes classifier (NBC) algorithm
Moitra et al. Cluster-based data reduction for persistent homology
CN104376120A (en) Information retrieval method and system
CN114547307A (en) Text vector model training method, text matching method, device and equipment
CN114219562A (en) Model training method, enterprise credit evaluation method and device, equipment and medium
Leon-Alcaide et al. An evolutionary approach for efficient prototyping of large time series datasets
CN115080587A (en) Electronic component replacing method, device and medium based on knowledge graph
Bouzebda Limit theorems in the nonparametric conditional single-index U-processes for locally stationary functional random fields under stochastic sampling design
CN116843970A (en) Fine granularity small sample classification method based on task specific channel reconstruction network
Chandorkar et al. Fixed-size least squares support vector machines: Scala implementation for large scale classification
Abdullahi et al. Student Performance Prediction Using A Cascaded Bi-level Feature Selection Approach
Diallo et al. Concept-enhanced multi-view clustering of document data
Graner Scalable Algorithms in Nonparametric Computational Statistics
Nugroho et al. The improvement of C4.5 algorithm accuracy in predicting forest fires using discretization and AdaBoost
CN111612572A (en) Adaptive local low-rank matrix approximate modeling method based on recommendation system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180123

Termination date: 20211204