CN100517331C - Literature retrieval method based on semantic small-world model

Info

Publication number: CN100517331C
Application number: CNB2007100516072A
Authority: CN
Inventors: 金海; 宁小敏; 袁平鹏; 武浩; 余一娇
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2007-03-02
Filing date: 2007-03-02
Publication date: 2009-07-22
Anticipated expiration: 2027-03-02
Also published as: CN101017504A

Abstract

The invention discloses a document retrieval method based on a semantic small-world model, comprising the following steps: first, latent semantic indexing is used to extract document feature vectors, which preserves document features while reducing dimensionality and the amount of information stored; next, a support vector machine classifies all shared documents to form classification information that indicates the interest proportion of each category; finally, following the small-world phenomenon of social networks, a small number of nodes with a high interest proportion in a given document category are used as link points, forming a network topology with small-world characteristics.

Description

Document retrieval method based on a semantic small-world model

Technical field

The invention belongs to the field of distributed computing and information retrieval in computer science, and specifically relates to a document retrieval method based on a semantic small-world model. The method mainly uses the semantic small-world model to solve the problem of efficient information storage and retrieval in peer-to-peer networks for sharing document information.

Background technology

Peer-to-peer network systems are attracting growing attention in large-scale information retrieval because of their scalability, fault tolerance, autonomy and self-organization. However, in peer-to-peer networks for sharing document information, storing and retrieving information effectively remains a very challenging problem.

The small-world phenomenon is widespread in social networks: any two people in the world can be connected through a very short chain of social relationships, generally no longer than six links, which is known as "six degrees of separation". The phenomenon arises because, in a social network, people usually have some friends whose interests are similar to their own, and also a few friends whose interests are not necessarily similar but who have numerous social ties; people can therefore reach one another through a very short "friend of a friend" chain.

Latent semantic indexing is an extension of the vector space model used in traditional information retrieval. It can eliminate the effects of the synonymy and polysemy that are widespread in information retrieval and that degrade recall and precision, and it reduces dimensionality on the basis of the semantic concept space of the documents, thereby reducing the amount of document information stored.

The support vector machine is a machine learning method widely used in fields such as pattern recognition and data classification; it can classify large document collections efficiently and accurately.

At present, information storage and retrieval in peer-to-peer networks is mainly based on the following methods: centralized indexing (e.g. Napster, BitTorrent), query flooding (Gnutella) or random walks. All of these methods require exact metadata matching (e.g. of file names or keywords) to satisfy a search request. Because the semantic information of other nodes in the network cannot be obtained, a large number of nodes must be searched blindly to guarantee retrieval recall, which places a heavy load on the network. Query performance can be improved by guiding query messages with enhanced neighbor-node index information (e.g. local indices), but updating the index information incurs very large overhead. Structured peer-to-peer networks based on distributed hash tables (e.g. CAN, Chord) provide scalability and efficient lookup, but they support only key/value lookup, which is unsuitable for full-text search in information retrieval, and the overhead of maintaining the structured peer-to-peer topology is very large.

Summary of the invention

The object of the invention is to provide a document retrieval method based on a semantic small-world model that improves retrieval recall and query speed.

The document retrieval method based on a semantic small-world model according to the invention comprises the following steps:

(1) establishing an overall network topology with semantic small-world characteristics, comprising the steps of:

(1.1) using latent semantic indexing to extract document feature vectors, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted;

(1.2) on the basis of the above document feature vectors, the support vector machine performs supervised learning on the training documents to obtain a support vector model;

(1.3) every machine in the peer-to-peer network that takes part in document sharing is called a node; after obtaining the above support vector model, each node classifies all of its shared documents to form classification information, which indicates the node's interest proportion for each document category;

(1.4) each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds a predetermined similarity threshold;

(1.5) if the interest proportion of some nodes in the peer-to-peer network for a certain document category is very high and exceeds a predetermined threshold, such a node is designated a super semantic node, which other nodes may link to directly;

(1.6) every node links directly, with probability p, to super semantic nodes beyond its two-hop range, where 0 < p ≤ 0.001;

(2) performing document information retrieval on the established network topology with semantic small-world characteristics, comprising the steps of:

(2.1) a node issues a query request, where each query comprises the query keyword and the document category of the query;

(2.2) if the document category of the query is among the document categories of the node issuing the query and the node's interest proportion in that category is greater than 50%, go to step (2.3); otherwise, go to step (2.5);

(2.3) the node performs a local search and returns the query result;

(2.4) the query is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category in which that node has the highest interest proportion matches the document category of the query, the query is forwarded to that long-link node; in either case, go to step (2.6);

(2.5) the query is forwarded to each neighbor node that is physically directly connected to this node; in addition, for each long-link node, if the document category in which that node has the highest interest proportion matches the document category of the query, the query is forwarded to that long-link node; go to step (2.6);

(2.6) the query ends.

To address the storage and retrieval efficiency problems of peer-to-peer networks for sharing document information, the invention combines latent semantic indexing and support vector machines with the small-world phenomenon of social networks to provide a storage and retrieval method suited to such networks. The method organizes document information semantically and exploits the small-world phenomenon of social networks (people in a social network can reach one another through very short paths) to improve retrieval recall and query speed while reducing message transmission and network load. With the method, a query can be routed to the nodes most likely to answer it, rather than being routed blindly as in traditional approaches, which improves search efficiency. At the same time, full use is made of the long links of the small world, so that a query can also be routed quickly to other parts of the network instead of being confined to a small search range, which improves recall, an important measure of information retrieval. Specifically, the invention has the following features:

(1) using latent semantic indexing to extract document feature vectors reduces the amount of information stored while preserving the features of the document information as far as possible;

(2) using a support vector machine to classify the document information of a node gives high accuracy; more importantly, the document classification information of a node expresses the semantics of that node and provides effective support for subsequent searches;

(3) exploiting the small-world phenomenon allows queries to be routed quickly to relevant nodes, which improves recall and reduces network overhead.

Description of drawings

Fig. 1 is a flowchart of establishing the network topology with semantic small-world characteristics.

Fig. 2 is a flowchart of document information retrieval based on the semantic topology.

Embodiment

The invention is further described below with reference to the drawings and specific embodiments.

The invention comprises two main steps: first, a network topology with semantic small-world characteristics is established; second, document information retrieval is performed on the established topology. The two steps are described in turn below.

(1.1) Latent semantic indexing is used to extract document feature vectors, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted.

(1.3) After obtaining the above support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which indicates the node's interest proportion for each document category. The classification scheme is determined by the specific application; for sharing computer science literature, for example, the ACM computing classification system can be used, with categories such as Computer Systems Organization, Mathematics of Computing and Information Systems;

(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds a predetermined similarity threshold, whose value lies in [0.5, 1]. This satisfies the short-link requirement of the small-world phenomenon;

(1.5) If the interest proportion of some nodes in the peer-to-peer network for a certain document category is very high and exceeds a predetermined threshold, whose value lies in [0.8, 1], such a node is designated a super semantic node, which other nodes may link to directly;

(1.6) Every node links directly, with probability p, to super semantic nodes beyond its two-hop range; that is, the likelihood that a node connects to a super semantic node is p, where 0 < p ≤ 0.001. This satisfies the long-link requirement of the small-world phenomenon;

After steps (1.1)-(1.6) are completed, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, together with a few long links to nodes whose interests are not necessarily similar to its own but whose interest proportion in some document category is certainly very high, thereby forming a network topology with semantic small-world characteristics.
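The link-selection logic of steps (1.4) and (1.6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `nodes_within_hops` and `build_links`, the adjacency-dictionary representation, and the caller-supplied `similarity` function are all hypothetical.

```python
import random
from collections import deque


def nodes_within_hops(adjacency, start, max_hops=2):
    """Return the nodes reachable from `start` in 1..max_hops hops (BFS)."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return {n for n, h in hops.items() if h > 0}


def build_links(adjacency, similarity, super_nodes,
                sim_threshold=0.5, p=0.001, rng=None):
    """Choose short links (step 1.4) and long links (step 1.6) per node.

    `similarity(u, v)` is a hypothetical callable returning the
    interest-proportion similarity of two nodes.
    """
    rng = rng or random.Random()
    short_links, long_links = {}, {}
    for node in adjacency:
        nearby = nodes_within_hops(adjacency, node, 2)
        # Short links: similar nodes within two hops.
        short_links[node] = {n for n in nearby
                             if similarity(node, n) > sim_threshold}
        # Long links: super semantic nodes beyond two hops, with probability p.
        long_links[node] = {s for s in super_nodes
                            if s != node and s not in nearby
                            and rng.random() < p}
    return short_links, long_links
```

Because long links are drawn at random, repeated runs give different topologies unless a seeded `random.Random` is passed as `rng`.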

(2.2) If the document category of the query falls within the higher-interest categories of the node issuing the query (those with an interest proportion greater than 50%), go to step (2.3); otherwise, go to step (2.5);

(2.3) The node performs a local search and returns the query result;

(2.6) The query ends.

Example:

(1) The specific implementation of establishing the network topology with semantic small-world characteristics comprises the following steps:

(1.1) Latent semantic indexing is used to extract the document feature vectors, as follows:

Latent semantic indexing is an extension of the vector space model used in traditional information retrieval. In the vector space model, documents and queries are represented by the weights of all words in the document collection, and the similarity between a query and a document is given by the cosine of the angle between them in the vector space. If a collection of $d$ documents contains $t$ distinct words, the collection is represented by the word-document matrix $A = (a_{ij}) \in \mathbb{R}^{t \times d}$. Each column vector $a_j$ corresponds to document $j$, and $a_{ij}$ is the weight of word $i$ in document $j$. By singular value decomposition, $A$ is decomposed into three matrices $U$, $\Sigma$ and $V$, where $\Sigma$ is a diagonal matrix with $t$ rows and $d$ columns whose singular values are $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(t,d)}$. Keeping the $k$ largest singular values in $\Sigma$, $A$ can be approximated by $A_k = U_k \Sigma_k V_k'$;
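The rank-k approximation described above can be illustrated with a small sketch. It assumes NumPy is available; the 5 x 4 matrix `A`, the choice k = 2 and the helper names `truncated_svd` and `project_document` are illustrative assumptions, not values or names taken from the invention.

```python
import numpy as np


def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k' for matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def project_document(U_k, doc_vector):
    """Map a t-dimensional term-weight vector p' to U_k' p' (k dimensions)."""
    return U_k.T @ doc_vector


# Hypothetical word-document matrix: t = 5 words (rows), d = 4 documents (columns).
A = np.array([[2, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 0, 2, 0]], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k      # A_k = U_k Sigma_k V_k'
doc_coords = np.diag(s_k) @ Vt_k     # columns are the d vectors of Sigma_k V_k'
```

Projecting an original column of `A` with `project_document` yields the corresponding column of `doc_coords`, which is why documents that arrive later can be placed in the same reduced space without recomputing the decomposition.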

(1.2) The support vector machine performs supervised learning on the training documents to obtain the support vector model, which is represented by the $d$ column vectors of the matrix $\Sigma_k V_k'$;

(1.3) All shared documents of the node are classified to form classification information, as follows:

Each document on the node is represented as a document vector $p'$, and the vector $U_k' p'$ is classified with the support vector model. The document semantics of the node are expressed as $S = \{N, Pr\}$, where $N$ is the total number of documents on the node and $Pr = \{Pr_1, Pr_2, \dots, Pr_m\}$ gives the interest proportion of each document category;

(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds the predetermined similarity threshold 0.5, as follows:

For node $P_1$ and node $P_2$ with document semantics $S_1 = \{N_1, Pr^{1}\}$ and $S_2 = \{N_2, Pr^{2}\}$ respectively, the similarity between $P_1$ and $P_2$ is $\mathrm{Sim}(P_1, P_2) = \frac{1 + \log\min(N_1, N_2)}{1 + \log\min(N_1, N_2)} \times \left(\lVert Pr^{1}\rVert\,\lVert Pr^{2}\rVert\right)$, where $\lVert Pr^{1}\rVert\,\lVert Pr^{2}\rVert$ denotes vector multiplication.
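A possible reading of the similarity measure is sketched below. Two points are assumptions rather than facts from the text: as printed, the numerator and denominator of the size factor are identical (so the factor would always be 1), and the sketch assumes the denominator was meant to use max(N1, N2), by analogy with the measure U(P) in step (1.5); it also interprets "vector multiplication" as the inner product of the two interest-proportion vectors. The natural logarithm is used and both nodes are assumed to share at least one document.

```python
import math


def similarity(n1, pr1, n2, pr2):
    """Similarity of two nodes with semantics {n1, pr1} and {n2, pr2}.

    ASSUMPTIONS: the denominator uses max(n1, n2), and the vector product
    is taken to be the inner product of pr1 and pr2.
    """
    size_factor = (1 + math.log(min(n1, n2))) / (1 + math.log(max(n1, n2)))
    inner_product = sum(a * b for a, b in zip(pr1, pr2))
    return size_factor * inner_product
```

Under these assumptions, two nodes of equal size that hold documents of a single shared category reach the maximum similarity of 1, while nodes with disjoint categories score 0.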

(1.5) If the interest proportion of some nodes in the peer-to-peer network for a certain document category is very high and exceeds a predetermined threshold, such a node is designated a super semantic node, which other nodes may link to directly. Every node links directly, with probability 0.001, to super semantic nodes beyond its two-hop range. The measure of a super semantic node $P$ is $U(P) = \frac{1 + \log N}{1 + \log(\max N_i)} \times \max Pr^{i}$, where $i$ ranges over all nodes in the peer-to-peer network. The predetermined threshold is 0.8: if $U(P) > 0.8$, the node is defined as a super semantic node. The probability that another node links directly to this super semantic node is $d(u, v)^{-r}$, where $d(u, v)$ is the smallest number of hops between node $u$ and node $v$, and $r$ is 1/2 of the average degree of the peer-to-peer network.
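The super semantic node test and the distance-dependent link probability can be sketched as follows. The function names are hypothetical, and the sketch assumes that max N_i is the largest document count over all nodes, that max Pr is the node's own highest category proportion, and that the logarithm is natural.

```python
import math


def super_node_measure(n, pr, max_n):
    """U(P) for a node with n documents and interest proportions pr.

    ASSUMPTION: max_n is the largest document count of any node in the
    network, and max(pr) is this node's highest category proportion.
    """
    return (1 + math.log(n)) / (1 + math.log(max_n)) * max(pr)


def is_super_semantic_node(n, pr, max_n, threshold=0.8):
    """A node is a super semantic node when U(P) exceeds the threshold."""
    return super_node_measure(n, pr, max_n) > threshold


def long_link_probability(hops, average_degree):
    """d(u, v)^(-r), with r equal to half the network's average degree."""
    r = average_degree / 2
    return hops ** (-r)
```

For instance, with an average degree of 4 a node four hops away is linked with probability 4^(-2) = 0.0625, so the chance of a long link falls off quickly with distance.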

As a result of the above process, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, together with a few long links to nodes whose interests are not necessarily similar to its own but whose interest proportion in some document category is certainly very high, thereby forming a network topology with semantic small-world characteristics.

(2) On the established network topology with semantic small-world characteristics, document information retrieval can be performed as follows:

(2.1) A node issues a query request, where each query comprises the query keyword and the document category of the query; the query is Q={K, c}, where K is the query keyword and c is the document category of the query;

(2.2) If the document category of the query falls within the higher-interest categories of the node issuing the query (interest proportion greater than 50%), the node performs a local search and returns the query result. At the same time, the query Q is forwarded to each short-link node of this node; and for each long-link node, if the document category in which that long-link node has the highest interest proportion is c, the query Q is forwarded to that node for processing;

(2.3) If the document category of the query does not fall within the higher-interest categories of the node issuing the query, the query Q is forwarded to the neighbor nodes that are physically directly connected to this node; and for each long-link node, if the document category in which that long-link node has the highest interest proportion is c, the query Q is forwarded to that node for processing;

(2.4) The query ends.
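The forwarding decision of steps (2.1)-(2.4) can be sketched for a single node as follows. The `Node` class, its field names and the keyword-in-title local search are hypothetical stand-ins chosen for illustration; the text does not specify how the local search is carried out.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so mutually linked nodes are safe
class Node:
    name: str
    interest: dict                                   # category -> proportion
    documents: dict = field(default_factory=dict)    # title -> category
    short_links: list = field(default_factory=list)
    long_links: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)    # physically adjacent nodes

    def top_category(self):
        """Category in which this node has the highest interest proportion."""
        return max(self.interest, key=self.interest.get)


def route_query(node, keyword, category, threshold=0.5):
    """Handle query Q = {K, c} at `node`; return (local results, next hops)."""
    results = []
    if node.interest.get(category, 0.0) > threshold:
        # Step (2.2): search locally, then forward along short links.
        results = [title for title, cat in node.documents.items()
                   if cat == category and keyword.lower() in title.lower()]
        targets = list(node.short_links)
    else:
        # Step (2.3): forward to physically adjacent neighbors instead.
        targets = list(node.neighbors)
    # In both cases, also forward along long links whose top category is c.
    targets += [n for n in node.long_links
                if n.top_category() == category and n not in targets]
    return results, targets
```

A full search would apply `route_query` again at every returned node, with a hop limit or a set of visited nodes to stop the query from circulating; that bookkeeping is omitted here.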

With the above method, a query can be routed to the nodes most likely to answer it, rather than being routed blindly as in traditional approaches, which improves search efficiency. At the same time, full use is made of the long links of the small world, so that a query can also be routed quickly to other parts of the network instead of being confined to a small part of it, which improves the recall of information retrieval.

This scheme is applicable not only to peer-to-peer networks for sharing document information; equivalent changes or substitutions may also be made in accordance with the technical scheme of the invention, for example for peer-to-peer networks that share image information, and all such changes or substitutions shall fall within the scope of protection of the claims of the invention.

Claims

1. A document retrieval method based on a semantic small-world model, comprising the following steps:

(1.1) using latent semantic indexing to extract document feature vectors, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted;

(1.3) after obtaining the above support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which indicates the node's interest proportion for each document category;

(1.5) if the interest proportion of some nodes in the peer-to-peer network for a certain document category is very high and exceeds a predetermined threshold, such a node is designated a super semantic node, which other nodes can link to directly;

(2.3) the node performs a local search and returns the query result;

(2.6) the query ends.