CN1851705A - Ontology-based topic-focused web crawler system construction method - Google Patents

Ontology-based topic-focused web crawler system construction method

Info

Publication number
CN1851705A
Authority
CN
China
Prior art keywords
url
ontology
page
class
ontology library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200610040742.2A
Other languages
Chinese (zh)
Other versions
CN100392658C (en)
Inventor
高阳
苏畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CNB2006100407422A
Publication of CN1851705A
Application granted
Publication of CN100392658C
Expired - Fee Related
Anticipated expiration

Abstract

The method of the invention comprises: (1) parsing the current web page; (2) preprocessing the text of the current page to obtain word-level information; (3) converting the word-level information into ontology information; (4) computing the topic relevance of the page; (5) if the topic relevance is greater than a set value, extracting in order the URLs pointed to by all outgoing links of the current page, otherwise going to step 7; (6) if the URL a link points to has already been visited, moving on to the next link, otherwise inserting the URL into the priority queue of URLs to visit according to the topic relevance of the page containing the link; (7) taking the first URL from the priority queue, i.e. the one with the highest priority, and visiting it; (8) repeating steps 1 to 7 until no new URL satisfying the condition appears or a set limit is reached. The method yields highly accurate results with low computational and storage overhead.

Description

Ontology-based topic-focused web crawler system construction method
I. Technical field
The present invention relates to a method for constructing a crawler system, and in particular to a method for constructing a topic-focused web crawler system.
II. Background
The web crawler is one of the core components of a search engine, and making web crawler systems work more efficiently has attracted the attention of a growing number of researchers. Web crawler systems aimed at a particular topic have become a focus of current research. The goal of a topic-focused web crawler is to avoid, as far as possible, visiting web pages unrelated to the topic and to concentrate on visiting pages that are related to it. Such crawler systems are mainly used in domain-specific search engines and web information retrieval systems.
Current topic-focused crawler systems mainly estimate the topic relevance of a web page from keyword statistics of its text. Web page content varies enormously, however, and the corresponding keyword dictionary is usually very large, so the computational cost is high and a large amount of high-dimensional data has to be maintained. In addition, because natural language itself exhibits polysemy (one word with several meanings) and synonymy (several words with one meaning), it is often difficult to characterize a topic or the content of a page by keywords alone, which biases the topic relevance estimate. The present invention addresses this problem by introducing an ontology.
III. Summary of the invention
1. Object of the invention: to address the shortcomings of the prior art by providing an efficient and accurate ontology-based method for constructing a topic-focused web crawler system.
2. Technical solution: the invention is built on an ontology library management system whose core concept is the topic concept. It converts word-level text information in a web page into ontology class information at the concept level, computes the topic relevance of the page using the ontology structure graph, and uses that relevance to guide the operation of the crawler system. The method comprises the following steps:
(1) parse the current web page;
(2) preprocess the text of the current page to obtain word-level information;
(3) convert the word-level information into ontology information through the ontology management system;
(4) compute the topic relevance of the page from the resulting ontology information together with the ontology library;
(5) if the topic relevance of the current page is greater than a set value, extract in order the URLs pointed to by all outgoing links of the current page; otherwise go to step (7);
(6) if the URL a link points to has already been visited, move on to the next link; if the URL has not been visited, insert it into the priority queue of URLs to visit according to the topic relevance of the page containing the link;
(7) take the first URL from the priority queue of URLs to visit, i.e. the one with the highest priority, and visit it;
(8) repeat steps (1) to (7) until no new URL satisfying the condition appears or a set limit is reached.
The ontology library used in step (4) above is obtained by collecting existing public ontology libraries and processing them as follows:
(1) extract the classes in the existing ontology libraries;
(2) extract the hierarchical relations and functional relations among the classes in the existing ontology libraries;
(3) take the classes as nodes and the hierarchical and functional relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;
(4) for each class in each ontology library, build the keyword set corresponding to that class, forming the lexical layer of the ontology library.
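The two-layer library described in the four steps above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the names `OntologyLibrary`, `add_class`, `add_relation` and `build_music_library` are hypothetical, as are the relation label `has_subclass`, the direction of the "to play" edge, and the keywords given for "jazz" and "person". Only the "music" keyword set and the class names come from the description of Fig. 3.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyLibrary:
    """Hypothetical two-layer store: ontology layer (graph) plus lexical layer."""

    # Ontology layer: class name -> list of (relation label, target class).
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)
    # Lexical layer: class name -> keyword set built for that class.
    keywords: dict[str, set[str]] = field(default_factory=dict)

    def add_class(self, name: str, words: set[str]) -> None:
        """Register a class node and its keyword set."""
        self.edges.setdefault(name, [])
        self.keywords.setdefault(name, set()).update(words)

    def add_relation(self, source: str, label: str, target: str) -> None:
        """Add a directed edge (hierarchical or functional) between two classes."""
        if source not in self.edges or target not in self.edges:
            raise KeyError("both classes must be added before relating them")
        self.edges[source].append((label, target))


def build_music_library() -> OntologyLibrary:
    """Build a toy library around the core concept "music"."""
    lib = OntologyLibrary()
    lib.add_class("music", {"song", "melody", "music"})  # from the description
    lib.add_class("jazz", {"jazz", "swing"})             # keywords assumed
    lib.add_class("person", {"person", "musician"})      # keywords assumed
    lib.add_relation("music", "has_subclass", "jazz")    # hierarchical edge
    lib.add_relation("person", "to play", "music")       # functional edge
    return lib
```

Keeping the graph and the keyword sets in separate mappings mirrors the split between the ontology layer and the lexical layer; a real system would presumably load both from existing ontology files rather than hard-code them.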
3. Beneficial effects: compared with existing crawler systems, a web crawler system constructed by the method of the invention has the notable advantage that, building on semantic understanding, it gives the crawler greater intelligence and improves the accuracy and efficiency of the system.
IV. Description of the drawings
Fig. 1 is a diagram of the system structure of the invention;
Fig. 2 is a workflow diagram of the invention;
Fig. 3 is a schematic diagram of the ontology library structure.
V. Embodiment
As shown in Fig. 1, the web crawler system constructed by the method of the invention comprises a basic crawler module, a topic relevance evaluation module and an ontology management system module. The topic relevance evaluation module in turn contains a preprocessing submodule and a relevance computation submodule.
The flow of the method is shown in Fig. 2 and is described in detail below:
Step (1): parse the HTML file of the current web page and extract the body text from it.
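Step (1) can be illustrated with Python's standard `html.parser` module. This is a sketch only: the patent does not say how parsing is done, and the names `BodyTextExtractor` and `parse_page` are hypothetical. The sketch also collects `href` values, since the outgoing links are needed later in step (5), and it skips `script` and `style` content on the assumption that these are not body text.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text and outgoing link targets from an HTML page."""

    SKIPPED = {"script", "style"}  # assumed not to count as body text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def parse_page(html: str):
    """Return (body text, list of href values) for one HTML document."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.links
```

Relative links are returned as written; resolving them against the page URL (for instance with `urllib.parse.urljoin`) would be a natural next step but is left out here.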
Step (2): preprocess the extracted text. Here, the number of occurrences $N(w_i)$ of each keyword $w_i$ in the current document is usually counted against the system's preset keyword list.
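A minimal sketch of the counting in step (2), assuming single-word keywords and simple lowercase alphanumeric tokenization; the patent does not specify either, and Chinese text would need a word segmenter instead. The function name `count_keywords` is hypothetical.

```python
import re
from collections import Counter


def count_keywords(text: str, keyword_list: list[str]) -> dict[str, int]:
    """Return N(w) for every keyword w in the preset list."""
    # Assumed tokenization: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # A Counter yields 0 for keywords that never occur.
    return {w: counts[w.lower()] for w in keyword_list}
```

For example, `count_keywords("Music, music and a Song.", ["music", "song", "melody"])` gives a count of 2 for "music", 1 for "song" and 0 for "melody".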
Step (3): using the keyword set corresponding to each ontology class in the ontology library, compute the class frequency of each ontology class in the current document.
Fig. 3 is a schematic of the ontology structure; it shows part of the structure graph of an ontology library whose core concept is music. The graph contains a series of conceptual abstractions of real-world things, such as "music" and "person"; these concepts form the classes of the ontology management system. The graph also contains logical relations connecting classes, such as "to play", and hierarchical relations, such as the one between "music" and "jazz"; these logical and hierarchical relations form the relation set of the ontology management system. Besides the classes and relations that appear in the graph, the ontology management system also manages a lexical layer that sits beneath the ontology layer. Each class or relation in the ontology layer corresponds to a text vocabulary in the lexical layer; for the class "music", for example, the corresponding vocabulary contains "song, melody, music".
Using the keyword set corresponding to each ontology class in the ontology library, the class frequency $f_{c_k}^{D}$ of an ontology class in the current document can be computed by the following formula:
$$f_{c_k}^{D} = N_{w_1^k}^{D} + N_{w_2^k}^{D} + \cdots + N_{w_i^k}^{D}$$
where $f_{c_k}^{D}$ is the frequency with which class $c_k$ occurs in document $D$; class $c_k$ corresponds to the text vocabulary $w_1^k, w_2^k, \ldots, w_i^k$; and the word $w_i^k$ occurs $N_{w_i^k}^{D}$ times in document $D$.
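The class frequency is simply the sum of the keyword counts over a class's vocabulary. A minimal sketch, assuming the counts from step (2) are available as a dictionary and the lexical layer as a mapping from class name to keyword set (the name `class_frequencies` and both parameter names are hypothetical):

```python
def class_frequencies(keyword_counts: dict[str, int],
                      class_keywords: dict[str, set[str]]) -> dict[str, int]:
    """Sum N(w) over the keyword set of each class to get its class frequency."""
    return {
        cls: sum(keyword_counts.get(word, 0) for word in words)
        for cls, words in class_keywords.items()
    }
```

With counts of 2 for "music", 1 for "song" and 0 for "melody", the class "music" with vocabulary {"song", "melody", "music"} has a class frequency of 3.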
Step (4): from the class frequencies, compute the topic relevance $r_D$ of the current page using the topic ontology library.
Once the number of occurrences of each ontology class in the document has been computed, a given web document can be mapped to its topic relevance. This mapping uses the structural relations among the elements of the ontology graph, together with the class frequencies computed earlier, to score each class's contribution to the topic relevance of the text; the per-class scores are then combined into the relevance of the whole page to the topic.
The topic relevance $r_D$ of page $D$ is computed as follows:
$$r_D = \sum_{c_k \in D} \left( f_{c_k}^{D} \times w_{c_k} \right)$$
where $f_{c_k}^{D}$ is the class frequency of class $c_k$ in page $D$ and $w_{c_k}$ is the weight of class $c_k$. Before the web crawler runs, each class weight is normally assigned an initial value $W_{c_k}^{0}$:
$$W_{c_k}^{0} = 1.00 \times n^{d(c_k, T)}$$
where $n$ is a discount factor and $d(c_k, T)$ is the distance from class $c_k$ to the core class of the ontology library, i.e. the topic class $T$ of this topic-focused web crawler system.
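The weight and relevance formulas can be sketched as follows. The patent does not define how the distance $d(c_k, T)$ is measured; this sketch assumes it is the number of edges on the shortest path in the ontology graph, with edges treated as undirected, and finds it by breadth-first search. The function names and the adjacency-list format are hypothetical, and classes unreachable from the topic class are given weight 0 as a further assumption.

```python
from collections import deque


def class_distances(neighbors: dict[str, list[str]], topic: str) -> dict[str, int]:
    """Breadth-first search: edge count from the topic class T to each class."""
    dist = {topic: 0}
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def initial_weights(neighbors: dict[str, list[str]], topic: str,
                    discount: float) -> dict[str, float]:
    """W0 = 1.00 * n ** d(c_k, T) for every class reachable from T."""
    return {cls: 1.00 * discount ** d
            for cls, d in class_distances(neighbors, topic).items()}


def relevance(class_freq: dict[str, int], weights: dict[str, float]) -> float:
    """r_D = sum over classes of class frequency times class weight."""
    # Assumption: a class with no known weight contributes nothing.
    return sum(f * weights.get(cls, 0.0) for cls, f in class_freq.items())
```

With a discount factor of 0.5, the topic class gets weight 1.0, its direct neighbours 0.5, and classes two edges away 0.25, so occurrences of concepts far from the topic contribute progressively less to $r_D$.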
Step (5): compare the topic relevance of the current page with a reference value. If it is greater than the reference value, the current page satisfies the topic relevance requirement and the outgoing links of the page need to be extracted; otherwise the page does not satisfy the requirement and the system goes to step (7).
Step (6): process the link information in the current page.
The links in the current page are processed in order. If the URL a link points to has already been visited, the next link is taken; if the URL has not been visited, it is inserted into the priority queue of URLs to visit according to the topic relevance of the page containing the link.
Step (7): take the URL at the head of the priority queue of URLs to visit, i.e. the one with the highest priority, and visit it.
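Steps (6) and (7) amount to a priority queue keyed by the relevance of the page on which each link was found. The sketch below uses Python's `heapq` (a min-heap, hence the negated priority). The class name `Frontier` and its methods are hypothetical, and two details are assumptions rather than statements of the patent: URLs already waiting in the queue are treated like visited ones to avoid duplicates, and ties are broken in insertion order.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs to visit, highest parent-page relevance first."""

    def __init__(self):
        self._heap = []
        self._seen = set()               # visited or already queued URLs
        self._order = itertools.count()  # tie-breaker: insertion order

    def mark_visited(self, url: str) -> None:
        self._seen.add(url)

    def add_links(self, links, page_relevance: float) -> int:
        """Step (6): queue each unseen link; return how many were added."""
        added = 0
        for url in links:
            if url in self._seen:
                continue  # already handled, move on to the next link
            self._seen.add(url)
            # heapq is a min-heap, so negate relevance to pop the largest first.
            heapq.heappush(self._heap, (-page_relevance, next(self._order), url))
            added += 1
        return added

    def pop(self):
        """Step (7): return the highest-priority URL, or None if empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url
```

Because every link inherits the relevance of its source page, links found on highly relevant pages are explored before links found on marginal ones.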
Step (8): repeat steps (1) to (7) until no new URL satisfying the condition appears or a set limit is reached.
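Putting the eight steps together, the control flow can be sketched as one loop. This is an illustration under assumptions, not the patented implementation: `focused_crawl` and its parameters are hypothetical, page fetching, scoring and link extraction are passed in as callables so the sketch stays self-contained, and the "set limit" of step (8) is modelled as a maximum page count.

```python
import heapq


def focused_crawl(seed, fetch, score, extract_links, threshold, max_pages):
    """Visit pages in order of parent relevance; return the visited URLs."""
    queue = [(0.0, 0, seed)]  # entries: (-relevance, insertion order, url)
    seen = {seed}             # visited or queued URLs (assumed equivalent)
    visited = []
    counter = 1
    # Step (8): stop when the queue is empty or the page limit is reached.
    while queue and len(visited) < max_pages:
        _, _, url = heapq.heappop(queue)   # step (7): highest priority first
        page = fetch(url)                  # step (1): fetch and parse
        visited.append(url)
        if page is None:
            continue                       # fetch failed, nothing to score
        r = score(page)                    # steps (2)-(4): topic relevance
        if r <= threshold:
            continue                       # step (5): do not follow links
        for link in extract_links(page):   # step (6): queue unseen links
            if link in seen:
                continue
            seen.add(link)
            heapq.heappush(queue, (-r, counter, link))
            counter += 1
    return visited
```

In a small test graph where an off-topic page links to further pages, those further pages are never queued, which is the behaviour the relevance threshold is meant to produce.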

Claims (2)

1. An ontology-based method for constructing a topic-focused web crawler system, comprising the following step: (1) parsing the current web page; characterized in that the method further comprises the following steps:
(2) preprocessing the text of the current page to obtain word-level information;
(3) converting the word-level information into ontology information through the ontology management system;
(4) computing the topic relevance of the page from the resulting ontology information together with the ontology library;
(5) if the topic relevance of the current page is greater than a set value, extracting in order the URLs pointed to by all outgoing links of the current page; otherwise going to step (7);
(6) if the URL a link points to has already been visited, taking the next link; if the URL has not been visited, inserting it into the priority queue of URLs to visit according to the topic relevance of the page containing the link;
(7) taking the first URL from the priority queue of URLs to visit, i.e. the one with the highest priority, and visiting it;
(8) repeating steps (1) to (7) until no new URL satisfying the condition appears or a set limit is reached.
2. The ontology-based method for constructing a topic-focused web crawler system according to claim 1, characterized in that the ontology library in step (4) is obtained by collecting existing public ontology libraries and processing them, the steps comprising:
(1) extracting the classes in the existing ontology libraries;
(2) extracting the hierarchical relations and functional relations among the classes in the existing ontology libraries;
(3) taking the classes as nodes and the hierarchical and functional relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;
(4) for each class in each ontology library, building the keyword set corresponding to that class, forming the lexical layer of the ontology library.
CNB2006100407422A 2006-05-30 2006-05-30 Ontology-based topic-focused web crawler system construction method Expired - Fee Related CN100392658C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100407422A CN100392658C (en) 2006-05-30 2006-05-30 Ontology-based topic-focused web crawler system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100407422A CN100392658C (en) 2006-05-30 2006-05-30 Ontology-based topic-focused web crawler system construction method

Publications (2)

Publication Number Publication Date
CN1851705A true CN1851705A (en) 2006-10-25
CN100392658C CN100392658C (en) 2008-06-04

Family

ID=37133184

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100407422A Expired - Fee Related CN100392658C (en) 2006-05-30 2006-05-30 Ontology-based topic-focused web crawler system construction method

Country Status (1)

Country Link
CN (1) CN100392658C (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008067749A1 (en) * 2006-12-06 2008-06-12 Huawei Technologies Co., Ltd. Media content managing system and method
CN100452054C (en) * 2007-05-09 2009-01-14 崔志明 Integrated data source finding method for deep layer net page data source
CN102129472A (en) * 2011-04-14 2011-07-20 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN101561814B (en) * 2009-05-08 2012-05-09 华中科技大学 Topic crawler system based on social labels
CN101355587B (en) * 2008-09-17 2012-05-23 杭州华三通信技术有限公司 Method and apparatus for obtaining URL information as well as method and system for implementing searching engine
CN103034732A (en) * 2012-12-26 2013-04-10 福建师范大学 Network robot algorithm for precisely grabbing links
CN103714140A (en) * 2013-12-23 2014-04-09 北京锐安科技有限公司 Searching method and device based on topic-focused web crawler
CN106339378A (en) * 2015-07-07 2017-01-18 中国科学院信息工程研究所 Data collecting method based on keyword oriented topic web crawlers

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101917413B (en) * 2010-07-29 2013-07-17 清华大学 Service assembly system and method based on service quality optimization and semantic information integration

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5838964A (en) * 1995-06-26 1998-11-17 Gubser; David R. Dynamic numeric compression methods
US6006232A (en) * 1997-10-21 1999-12-21 At&T Corp. System and method for multirecord compression in a relational database
JP2001282820A (en) * 2000-01-25 2001-10-12 Sony Corp Data compression method, retrieval method and device, data packet signal and recording medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008067749A1 (en) * 2006-12-06 2008-06-12 Huawei Technologies Co., Ltd. Media content managing system and method
CN100449547C (en) * 2006-12-06 2009-01-07 华为技术有限公司 Medium contents management system and method
US8200597B2 (en) 2006-12-06 2012-06-12 Huawei Technologies Co., Ltd. System and method for classifiying text and managing media contents using subtitles, start times, end times, and an ontology library
CN100452054C (en) * 2007-05-09 2009-01-14 崔志明 Integrated data source finding method for deep layer net page data source
CN101355587B (en) * 2008-09-17 2012-05-23 杭州华三通信技术有限公司 Method and apparatus for obtaining URL information as well as method and system for implementing searching engine
CN101561814B (en) * 2009-05-08 2012-05-09 华中科技大学 Topic crawler system based on social labels
CN102129472A (en) * 2011-04-14 2011-07-20 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN102129472B (en) * 2011-04-14 2012-12-19 上海红神信息技术有限公司 Construction method for high-efficiency hybrid storage structure of semantic-orient search engine
CN103034732A (en) * 2012-12-26 2013-04-10 福建师范大学 Network robot algorithm for precisely grabbing links
CN103714140A (en) * 2013-12-23 2014-04-09 北京锐安科技有限公司 Searching method and device based on topic-focused web crawler
CN106339378A (en) * 2015-07-07 2017-01-18 中国科学院信息工程研究所 Data collecting method based on keyword oriented topic web crawlers

Also Published As

Publication number Publication date
CN100392658C (en) 2008-06-04

Similar Documents

Publication Publication Date Title
CN1851705A (en) Ontology-based topic-focused web crawler system construction method
Lin et al. Phrase clustering for discriminative learning
Creutz et al. Unsupervised discovery of morphemes
KR101176079B1 (en) Phrase-based generation of document descriptions
KR101223172B1 (en) Phrase-based searching in an information retrieval system
CN101630314B (en) Semantic query expansion method based on domain knowledge
KR101223173B1 (en) Phrase-based indexing in an information retrieval system
Wang et al. A fast KNN algorithm for text categorization
CN102207945B (en) Knowledge network-based text indexing system and method
US8161036B2 (en) Index optimization for ranking using a linear model
US8280721B2 (en) Efficiently representing word sense probabilities
KR20060048779A (en) Phrase identification in an information retrieval system
CN100401301C (en) Body learning based intelligent subject-type network reptile system configuration method
CN1755682A (en) System and method for ranking search results using link distance
CN1904886A (en) Method and apparatus for establishing link structure between multiple documents
CN101079024A (en) Special word list dynamic generation system and method
CN101256573B (en) Reaction type search method and contents correlation technique based on contents relativity
CN101916294A (en) Method for realizing exact search by utilizing semantic analysis
CN101251847A (en) Electronic dictionary thesaurus structure suitable for mobile equipment
Barrio et al. Sampling strategies for information extraction over the deep web
Hamdi et al. Machine learning vs deterministic rule-based system for document stream segmentation
CN1766871A (en) The processing method of the semi-structured data extraction of semantics of based on the context
CN101488127A (en) Bit mark character string retrieval technique
Momin et al. Web document clustering using document index graph
Del-Castillo-Escobedo et al. QA on the web: A preliminary study for Spanish language

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080604

Termination date: 20100530