CN100517331C - Literature retrieval method based on semantic small-word model - Google Patents

Info

Publication number
CN100517331C
Authority
CN
China
Prior art keywords
node
document
semantic
query statement
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007100516072A
Other languages
Chinese (zh)
Other versions
CN101017504A (en)
Inventor
金海
宁小敏
袁平鹏
武浩
余一娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

This invention discloses a document retrieval method based on a semantic small-world model, comprising the following steps: first, latent semantic indexing is used to extract document feature vectors, which preserves document features while lowering their dimensionality and reducing the amount of information stored; next, a support vector machine classifies all shared documents to form classification information that indicates the interest proportion of each document category; finally, following the small-world phenomenon of social networks, each node keeps a small number of links to nodes with a very high interest proportion in some document category, forming a network topology with small-world properties.

Description

Document retrieval method based on a semantic small-world model
Technical field
The invention belongs to the fields of distributed computing and information retrieval in computer science, and specifically relates to a document retrieval method based on a semantic small-world model. The method mainly uses the semantic small-world model to solve the problem of efficient information storage and retrieval in peer-to-peer networks for sharing document information.
Background art
Peer-to-peer network systems are receiving increasing attention in large-scale information retrieval because of their scalability, fault tolerance, autonomy, and self-organization. However, how to store and retrieve information effectively in a peer-to-peer network for sharing document information remains a very challenging problem.
The small-world phenomenon is widespread in social networks: any two people in the world can be connected through a very short chain of social relationships, generally no longer than six links, which is known as "six degrees of separation". The phenomenon arises because, in a social network, people usually have some friends whose interests are similar to their own, and also a few friends whose interests are not necessarily similar but who have many social connections, so people can reach one another through very short "friend of a friend" chains.
Latent semantic indexing extends the vector space model of traditional information retrieval. It mitigates the synonymy and polysemy problems that are widespread in information retrieval and that hurt recall and precision, and it reduces dimensionality in the semantic concept space of the documents, thereby reducing the amount of document information that must be stored.
A support vector machine is a machine learning method widely used in fields such as pattern recognition and data classification; it can classify large document collections efficiently and accurately.
At present, information storage and retrieval in peer-to-peer networks mainly relies on the following methods: centralized indexing (e.g. Napster, BitTorrent), query flooding (Gnutella), or random walks. All of these methods require exact metadata matching (e.g. file names or keywords) to satisfy a search. Because the semantic information of other nodes in the network is unavailable, a large number of nodes must be searched blindly to guarantee recall, which causes a heavy network load. Query performance can be improved by guiding query messages with neighbor-node index information (e.g. local indices), but updating the index information incurs very large overhead. Structured peer-to-peer networks based on distributed hash tables (e.g. CAN, Chord) offer scalability and efficient search, but they support only key/value lookup, which is unsuitable for full-text search in information retrieval, and maintaining the structured overlay is expensive.
Summary of the invention
The object of the invention is to provide a document retrieval method based on a semantic small-world model that improves the recall and query speed of retrieval.
The document retrieval method based on a semantic small-world model according to the invention comprises the following steps:
(1) Establish an overall network topology with semantic small-world characteristics, comprising:
(1.1) Extract document feature vectors using latent semantic indexing; a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted;
(1.2) On the basis of these document feature vectors, perform supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) Each machine that participates in document sharing in the peer-to-peer network is called a node. After obtaining the support vector model, each node classifies all of its shared documents to form classification information, which indicates the node's interest proportion in each document category;
(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds a predetermined similarity threshold;
(1.5) If a node in the peer-to-peer network has a very high interest proportion in some document category, exceeding a predetermined threshold, the node is designated a super semantic node, which other nodes may link to directly;
(1.6) Every node links directly, with probability p, to super semantic nodes beyond the two-hop range, where 0 < p ≤ 0.001;
(2) On the basis of the established network topology with semantic small-world characteristics, perform document information retrieval, comprising:
(2.1) A node issues a query request; each query statement comprises the query keywords and the document category of the query;
(2.2) If the document category of the query is among the document categories of the node issuing the query statement and the node's interest proportion in that category is greater than 50%, go to step (2.3); otherwise, go to step (2.5);
(2.3) The node performs a local search and returns the query results;
(2.4) The query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) The query statement is forwarded to each neighbor node physically directly connected to this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) The query ends.
To address the storage and retrieval efficiency problems in peer-to-peer networks for sharing document information, the invention combines latent semantic indexing and support vector machines with the small-world phenomenon of social networks to provide a storage and retrieval method suited to such networks. The method organizes document information semantically and exploits the small-world phenomenon (in a social network, people can reach one another through very short paths) to improve recall and query speed while reducing message traffic and network load. With the method, a query statement is routed to the nodes most likely to answer it rather than routed blindly as in traditional approaches, which improves search efficiency; at the same time, the long links of the small world let a query statement reach other parts of the network quickly instead of being trapped in a small search region, which improves recall, an important measure of information retrieval. Specifically, the invention has the following features:
(1) Extracting document feature vectors with latent semantic indexing reduces the amount of stored information while preserving document features as far as possible;
(2) Classifying a node's document information with a support vector machine is highly accurate; more importantly, the node's document classification information expresses the semantics of the node and provides effective support for subsequent searches;
(3) Exploiting the small-world phenomenon lets query messages be routed quickly to relevant nodes, improving recall while reducing network overhead.
Description of drawings
Fig. 1 is a flowchart of establishing the network topology with semantic small-world characteristics.
Fig. 2 is a flowchart of document information retrieval based on the semantic topology.
Embodiment
The invention is further described below with reference to the drawings and specific embodiments.
The invention comprises two main steps: first, a network topology with semantic small-world characteristics is established; second, document information retrieval is performed on the established topology. The two steps are described in turn below.
(1) Establish an overall network topology with semantic small-world characteristics, comprising:
(1.1) Extract document feature vectors using latent semantic indexing; a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted.
(1.2) On the basis of these document feature vectors, perform supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) After obtaining the support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which indicates the node's interest proportion in each document category. The classification scheme is determined by the specific application; for sharing computer science literature, for example, the ACM Computing Classification System can be used, with categories such as Computer Systems Organization, Mathematics of Computing, and Information Systems;
(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds a predetermined similarity threshold, whose value lies in the range [0.5, 1]. This satisfies the short-link requirement of the small-world phenomenon;
(1.5) If a node in the peer-to-peer network has a very high interest proportion in some document category, exceeding a predetermined threshold whose value lies in the range [0.8, 1], the node is designated a super semantic node, which other nodes may link to directly;
(1.6) Every node links directly, with probability p, to super semantic nodes beyond the two-hop range; that is, the likelihood that a node connects to a super semantic node is p, where 0 < p ≤ 0.001. This satisfies the long-link requirement of the small-world phenomenon;
After steps (1.1)-(1.6) are completed, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, and a few long links to nodes whose interests are not necessarily similar but whose interest proportion in some document category is very high, thereby forming a network topology with semantic small-world characteristics.
(2) On the basis of the established network topology with semantic small-world characteristics, perform document information retrieval, comprising:
(2.1) A node issues a query request; each query statement comprises the query keywords and the document category of the query;
(2.2) If the document category of the query falls in the high-interest part of the node issuing the query statement (an interest proportion greater than 50% is chosen), go to step (2.3); otherwise, go to step (2.5);
(2.3) The node performs a local search and returns the query results;
(2.4) The query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) The query statement is forwarded to each neighbor node physically directly connected to this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) The query ends.
Example:
(1) A concrete implementation of establishing the network topology with semantic small-world characteristics comprises the following steps:
(1.1) Extract document feature vectors using latent semantic indexing, as follows:
Latent semantic indexing extends the vector space model of traditional information retrieval. In the vector space model, documents and queries are represented by the weights of all words in the document collection, and the similarity between a query statement and a document is given by the cosine of the angle between the two vectors. If a collection of $d$ documents contains $t$ distinct words, the collection is represented by the word-document matrix $A = (a_{ij}) \in \mathbb{R}^{t \times d}$. Each column vector $a_j$ corresponds to document $j$, and $a_{ij}$ is the weight of word $i$ in document $j$. By singular value decomposition, $A$ is factored into three matrices $U$, $\Sigma$ and $V$, where $\Sigma$ is a $t \times d$ diagonal matrix whose singular values satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(t,d)}$. Keeping the $k$ largest singular values in $\Sigma$, $A$ is approximated by $A_k = U_k \Sigma_k V_k'$;
(1.2) Perform supervised learning on the training documents with a support vector machine to obtain a support vector model; the support vector model is represented by the $d$ column vectors of the matrix $\Sigma_k V_k'$;
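The rank-k approximation in step (1.1) and the document coordinates $\Sigma_k V_k'$ named in step (1.2) can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function names `lsi_truncate` and `fold_in`, the toy matrix, and the choice k = 2 are hypothetical and do not come from the patent.

```python
import numpy as np

def lsi_truncate(A, k):
    """Truncated SVD of a word-document matrix A (t x d).

    Returns U_k (t x k), Sigma_k (k x k) and V_k' (k x d) such that
    A_k = U_k @ Sigma_k @ V_k' is the rank-k approximation of A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

def fold_in(U_k, p):
    """Project a raw document vector p (length t) into the k-dim space: U_k' p."""
    return U_k.T @ p

# Toy word-document matrix: 5 words (rows) x 4 documents (columns).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

U_k, S_k, Vt_k = lsi_truncate(A, k=2)
A_k = U_k @ S_k @ Vt_k        # rank-2 approximation of A
doc_coords = S_k @ Vt_k       # one k-dimensional column per document
```

Because the columns of $U$ are orthonormal, `fold_in(U_k, A[:, j])` reproduces column `j` of `doc_coords`, so a document held by a node can be projected into the same reduced space as the training documents.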
(1.3) Classify all shared documents of the node to form classification information, as follows:
Each document on the node is represented as a document vector $p'$, and the support vector model is used to classify the vector $U_k' p'$. The document semantics of the node are represented as $S = \{N, Pr\}$, where $N$ is the total number of documents on the node and $Pr = \{Pr_1, Pr_2, \dots, Pr_m\}$ gives the interest proportion of each document category;
(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds the predetermined similarity threshold of 0.5, as follows:
For nodes $P_1$ and $P_2$ with document semantics $S_1 = \{N_1, Pr_1\}$ and $S_2 = \{N_2, Pr_2\}$, the similarity between $P_1$ and $P_2$ is $Sim(P_1, P_2) = \frac{1 + \log\min(N_1, N_2)}{1 + \log\min(N_1, N_2)} \times (\|Pr_1\| \|Pr_2\|)$, where $\|Pr_1\| \|Pr_2\|$ is a vector multiplication.
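The short-link selection in step (1.4) could be sketched as below. Note the assumptions, which are not confirmed by the text: the size factor is taken as $(1 + \log\min(N_1, N_2)) / (1 + \log\max(N_1, N_2))$, by analogy with the super-node measure (the formula as printed uses $\min$ in both places), the "vector multiplication" of $Pr_1$ and $Pr_2$ is interpreted as cosine similarity, and the logarithm is natural. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two interest-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity(n1, pr1, n2, pr2):
    """ASSUMED form: size factor (min over max) times cosine(Pr1, Pr2)."""
    if min(n1, n2) < 1:
        return 0.0  # a node with no documents has no meaningful interests
    size = (1 + math.log(min(n1, n2))) / (1 + math.log(max(n1, n2)))
    return size * cosine(pr1, pr2)

def short_links(me, candidates, threshold=0.5):
    """Names of two-hop candidates whose similarity to `me` exceeds the threshold."""
    n, pr = me
    return [name for name, (cn, cpr) in candidates.items()
            if similarity(n, pr, cn, cpr) > threshold]

me = (100, [0.8, 0.2])
two_hop = {"a": (100, [0.7, 0.3]), "b": (100, [0.0, 1.0])}
links = short_links(me, two_hop)
```

With these toy values only node "a" shares enough interest with `me` to become a short link; how the two-hop candidate set is discovered is outside the scope of the sketch.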
(1.5) If a node in the peer-to-peer network has a very high interest proportion in some document category, exceeding a predetermined threshold, the node is designated a super semantic node, which other nodes may link to directly. Every node links directly, with probability 0.001, to super semantic nodes beyond the two-hop range. The measure of a super semantic node $P$ is $U(P) = \frac{1 + \log N}{1 + \log(\max N_i)} \times \max Pr_i$, where $i$ ranges over all nodes in the peer-to-peer network and the predetermined threshold is 0.8: if $U(P) > 0.8$, the node is defined as a super semantic node. The probability that another node links directly to this super semantic node is $d(u, v)^{-r}$, where $d(u, v)$ is the shortest hop count between node $u$ and node $v$, and $r$ is half the average degree of the peer-to-peer network.
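The measure $U(P)$ and the distance-dependent link probability $d(u, v)^{-r}$ from step (1.5) can be computed as in the following sketch. It assumes natural logarithms, reads $\max N_i$ as the largest document count held by any node in the network, and reads $\max Pr_i$ as the node's own largest interest proportion; the function names are hypothetical.

```python
import math

def super_node_score(n_docs, pr, max_docs_in_network):
    """U(P) = (1 + log N) / (1 + log(max N_i)) * max Pr_i.

    ASSUMED reading: max N_i is taken over all nodes, max Pr_i over the
    categories of this node. Requires n_docs >= 1 and max_docs >= 1.
    """
    size = (1 + math.log(n_docs)) / (1 + math.log(max_docs_in_network))
    return size * max(pr)

def is_super_node(n_docs, pr, max_docs_in_network, threshold=0.8):
    """A node is a super semantic node when U(P) exceeds the threshold."""
    return super_node_score(n_docs, pr, max_docs_in_network) > threshold

def long_link_probability(hops, avg_degree):
    """d(u, v)^(-r), with r equal to half the average degree of the network."""
    r = avg_degree / 2
    return hops ** (-r)

big = is_super_node(1000, [0.9, 0.1], max_docs_in_network=1000)
small = is_super_node(10, [0.9, 0.1], max_docs_in_network=1000)
p_link = long_link_probability(hops=3, avg_degree=4)
```

The size factor penalizes nodes with few documents, so a small node is not promoted to super semantic node even when nearly all of its documents fall in one category.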
Following this process, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, and a few long links to nodes whose interests are not necessarily similar but whose interest proportion in some document category is very high, thereby forming a network topology with semantic small-world characteristics.
(2) On the basis of the established network topology with semantic small-world characteristics, document information retrieval can be performed as follows:
(2.1) A node issues a query request; each query statement comprises the query keywords and the document category of the query. The query statement is $Q = \{K, c\}$, where $K$ denotes the query keywords and $c$ denotes the document category of the query;
(2.2) If the document category of the query falls in the high-interest part of the node issuing the query statement (interest proportion greater than 50%), the node performs a local search and returns the query results. At the same time, query statement $Q$ is forwarded to each short-link node of this node; for each long-link node, if the document category in which that node has its highest interest proportion is $c$, query statement $Q$ is forwarded to that node for processing;
(2.3) If the document category of the query does not fall in the high-interest part of the node issuing the query statement, query statement $Q$ is forwarded to the neighbor nodes physically directly connected to this node; for each long-link node, if the document category in which that node has its highest interest proportion is $c$, query statement $Q$ is forwarded to that node for processing;
(2.4) The query ends.
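The forwarding decision in steps (2.2) and (2.3) can be sketched as a single routing function. This is an illustrative model only: the `Node` class, its field names, and the `route` function are hypothetical, and keyword matching, message transport, and termination handling are omitted.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    name: str
    interest: dict                                  # category -> interest proportion
    short_links: list = field(default_factory=list)
    long_links: list = field(default_factory=list)
    physical_neighbors: list = field(default_factory=list)

    def top_category(self):
        """Category in which this node has its highest interest proportion."""
        return max(self.interest, key=self.interest.get)

def route(node, category, threshold=0.5):
    """Return (search_locally, names of nodes to forward the query to)."""
    # Step (2.2): a high-interest category triggers a local search.
    local = node.interest.get(category, 0.0) > threshold
    # Short links if the category is of high interest, else physical neighbors.
    targets = list(node.short_links if local else node.physical_neighbors)
    # In both cases, add long links whose top category matches the query.
    targets += [n for n in node.long_links if n.top_category() == category]
    return local, [n.name for n in targets]

a = Node("a", {"IS": 0.7, "MC": 0.3})
s = Node("s", {"IS": 0.6, "MC": 0.4})
l1 = Node("l1", {"IS": 0.9, "MC": 0.1})
l2 = Node("l2", {"MC": 0.95, "IS": 0.05})
p = Node("p", {"MC": 1.0})
a.short_links, a.long_links, a.physical_neighbors = [s], [l1, l2], [p]

hit = route(a, "IS")    # category of high interest to node a
miss = route(a, "MC")   # category outside node a's high-interest part
```

For the "IS" query the node searches locally and forwards to its short link and the matching long link; for the "MC" query it skips the local search and falls back to its physical neighbor plus the long link whose top category is "MC".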
With this method, a query statement is routed to the nodes most likely to answer it rather than routed blindly as in traditional approaches, which improves search efficiency; at the same time, the long links of the small world let a query statement reach other parts of the network quickly instead of being confined to a small region, which improves the recall of information retrieval.
This scheme is applicable not only to peer-to-peer networks for sharing document information; equivalent changes or substitutions can be made according to the technical solution of the invention, for example for peer-to-peer networks that share image information, and all such changes or substitutions fall within the protection scope of the claims of the invention.

Claims (1)

1. A document retrieval method based on a semantic small-world model, comprising the following steps:
(1) Establish an overall network topology with semantic small-world characteristics, comprising:
(1.1) Extract document feature vectors using latent semantic indexing; a document feature vector comprises the frequency with which each word occurs in the document and the number of times the word occurs in the collection of all documents whose feature vectors are to be extracted;
(1.2) On the basis of these document feature vectors, perform supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) After obtaining the support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which indicates the node's interest proportion in each document category;
(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own; the selection criterion is that the similarity exceeds a predetermined similarity threshold;
(1.5) If a node in the peer-to-peer network has a very high interest proportion in some document category, exceeding a predetermined threshold, the node is designated a super semantic node, which other nodes can link to directly;
(1.6) Every node links directly, with probability p, to super semantic nodes beyond the two-hop range, where 0 < p ≤ 0.001;
(2) On the basis of the established network topology with semantic small-world characteristics, perform document information retrieval, comprising:
(2.1) A node issues a query request; each query statement comprises the query keywords and the document category of the query;
(2.2) If the document category of the query is among the document categories of the node issuing the query statement and the node's interest proportion in that category is greater than 50%, go to step (2.3); otherwise, go to step (2.5);
(2.3) The node performs a local search and returns the query results;
(2.4) The query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) The query statement is forwarded to each neighbor node physically directly connected to this node; in addition, for each long-link node, if the document category in which that node has its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) The query ends.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100516072A CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100516072A CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Publications (2)

Publication Number Publication Date
CN101017504A CN101017504A (en) 2007-08-15
CN100517331C true CN100517331C (en) 2009-07-22

Family

ID=38726511

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100516072A Expired - Fee Related CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Country Status (1)

Country Link
CN (1) CN100517331C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101534309B (en) 2009-04-14 2013-03-13 华为技术有限公司 A node registration method, a routing update method, a communication system and the relevant equipment
CN101877711B (en) * 2009-04-28 2013-08-28 华为技术有限公司 Social network establishment method and device, and community discovery method and device
CN102136007B (en) * 2011-03-31 2013-07-10 石家庄铁道大学 Small world property-based engineering information organization method
CN102663001A (en) * 2012-03-15 2012-09-12 华南理工大学 Automatic blog writer interest and character identifying method based on support vector machine
CN107038155A (en) * 2017-04-23 2017-08-11 四川用联信息技术有限公司 The extracting method of text feature is realized based on improved small-world network model


Also Published As

Publication number Publication date
CN101017504A (en) 2007-08-15

Similar Documents

Publication Publication Date Title
Batsakis et al. Improving the performance of focused web crawlers
CN100517331C (en) Literature retrieval method based on semantic small-word model
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
Hammouda et al. Collaborative document clustering
CN101272399A (en) Method for implementing full text retrieval system based on P2P network
CN107153687B (en) Indexing method for social network text data
Vishwakarma et al. A comparative study of K-means and K-medoid clustering for social media text mining
Chen et al. A unified framework for web link analysis
Gunaratna et al. Alignment and dataset identification of linked data in semantic web
Yang et al. On characterizing and computing the diversity of hyperlinks for anti-spamming page ranking
Yuan et al. A distributed link prediction algorithm based on clustering in dynamic social networks
Ke et al. Scalability of findability: effective and efficient IR operations in large information networks
Wu et al. Graph embedding based real-time social event matching for EBSNs recommendation
Selvakumara Samy et al. Intelligent web-history based on a hybrid clustering algorithm for future-internet systems
Su et al. Bibliometric assessments of network formations by keyword-based vector space model
Du et al. A novel page ranking algorithm based on triadic closure and hyperlink-induced topic search
Ban et al. CICPV: A new academic expert search model
Miao et al. Link scientific publications using linked data
Pan et al. Ranked web service matching for service description using owl-s
Liu et al. Discovery of web services based on collaborated semantic link network
Qiu et al. Web service discovery based on semantic matchmaking with UDDI
Ke et al. Studying the clustering paradox and scalability of search in highly distributed environments
CN112749246B (en) Evaluation method and device of search phrase, server and storage medium
Gadamshetti et al. RDRLLJ: Integrating Deep Learning Approach with Latent Semantic Analysis for Document Retrieval
Shoji et al. Diversity-Based HITS: web page ranking by referrer and referral diversity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090722

Termination date: 20120302