CN1851705A

CN1851705A - Ontology-based topic-focused web crawler system construction method

Info

Publication number: CN1851705A
Application number: CN200610040742.2A
Authority: CN
Inventors: 高阳; 苏畅
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2006-10-25
Anticipated expiration: 2026-05-30
Also published as: CN100392658C

Abstract

The method comprises: (1) parsing the current Web page; (2) preprocessing the text of the current page to obtain word-level information; (3) converting the word-level information into ontology information; (4) calculating the topic relevance of the page; (5) if the topic relevance is greater than a set value, extracting the URLs pointed to by all outgoing links of the current page, otherwise going to step 7; (6) for each link, if the URL it points to has already been visited, extracting the next link, otherwise inserting the URL into the priority queue of URLs to be visited according to the topic relevance of the page containing the link; (7) selecting the first URL in the priority queue, i.e. the one with the highest priority, and visiting it; (8) repeating steps 1 to 7 until no new URL satisfies the conditions or a set limit is reached. The invention offers high accuracy with low computation and storage overhead.

Description

Ontology-based topic-focused web crawler system construction method

I. Technical field

The present invention relates to a method for constructing a crawler system, and in particular to a method for constructing a topic-focused web crawler system.

II. Background art

The Web crawler is one of the core components of a search engine, and making web crawler systems work more efficiently has attracted the attention of a growing number of researchers. In particular, web crawler systems aimed at a specific topic have become a focus of current research. The goal of a topic-focused Web crawler is to avoid visiting Web pages unrelated to the topic as far as possible and to concentrate on visiting Web pages related to the topic. Such crawler systems are mainly used in domain-specific search engines and Web information retrieval systems.

Existing topic-focused crawler systems mainly estimate the topic relevance of a Web page from statistics on the keywords in its text. However, Web page content varies enormously and the corresponding keyword dictionaries are usually very large, so the computational cost is high and a large amount of high-dimensional data must be maintained. In addition, because natural language itself exhibits polysemy (one word with several meanings) and synonymy (several words with one meaning), it is often quite difficult to characterize a topic or the content of a page by keywords alone, which introduces deviations into the topic relevance evaluation. The present invention addresses this problem by introducing ontologies.

III. Summary of the invention

1. Object of the invention: in view of the deficiencies of the prior art, the object of the invention is to provide an efficient and accurate ontology-based method for constructing a topic-focused web crawler system.

2. Technical solution: the invention is based on an ontology library management system whose core concept is the topic concept. By converting the word-level text information in a Web page into concept-level ontology class information, and combining it with the ontology structure graph, the topic relevance of the page is calculated and then used to guide the operation of the crawler system. The method comprises the following steps:

(1) parsing the current Web page;

(2) preprocessing the text of the current page to obtain word-level information;

(3) converting the word-level information into ontology information by means of the ontology management system;

(4) calculating the topic relevance of the page from the ontology information obtained, in combination with the ontology library;

(5) if the topic relevance of the current page is greater than a set value, extracting in order the URLs pointed to by all outgoing links in the current Web page; otherwise going to step (7);

(6) if the URL pointed to by a link has already been visited, extracting the next link; if the URL has not been visited, inserting it into the priority queue of URLs to be visited according to the topic relevance of the page containing the link;

(7) selecting the first URL in the priority queue of URLs to be visited, i.e. the one with the highest priority, and visiting it;

(8) repeating steps (1) to (7) until no new URL satisfies the conditions or a set limit is reached.

The ontology library in step (4) above is obtained by collecting existing general-purpose ontology libraries and processing them, in the following steps:

(1) extracting the classes in the existing ontology libraries;

(2) extracting the hierarchical relations and functional relations between the classes in the existing ontology libraries;

(3) taking the classes as nodes and the hierarchical and functional relations as directed edges connecting the nodes, to form the basic structure of the ontology layer of the ontology library;

(4) for each class in each ontology library, building the keyword set corresponding to that class, to form the lexical layer of the ontology library.
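
The four construction steps above can be sketched as a small data structure. This is only an illustrative sketch, not the patent's implementation: the names `OntologyLibrary` and `build_music_ontology` are hypothetical, the keywords for "music" are the ones given in the description of Fig. 3, and the keywords for the other classes as well as the direction of the hierarchical edge are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyLibrary:
    """Ontology layer (classes + directed relations) plus a lexical layer."""
    classes: set = field(default_factory=set)
    # Directed edges stored as (source class, relation label, target class).
    relations: list = field(default_factory=list)
    # Lexical layer: class name -> set of keywords for that class.
    keywords: dict = field(default_factory=dict)

    def add_class(self, name, keywords=()):
        self.classes.add(name)
        self.keywords.setdefault(name, set()).update(keywords)

    def add_relation(self, source, label, target):
        # Both endpoints must already be nodes of the graph.
        if source not in self.classes or target not in self.classes:
            raise KeyError("unknown class in relation")
        self.relations.append((source, label, target))

    def neighbors(self, name):
        """Classes reachable from `name` over one directed edge."""
        return [t for s, _, t in self.relations if s == name]


def build_music_ontology():
    onto = OntologyLibrary()
    onto.add_class("music", {"song", "melody", "music"})
    onto.add_class("jazz", {"jazz", "swing"})          # hypothetical keywords
    onto.add_class("person", {"musician", "singer"})   # hypothetical keywords
    onto.add_relation("music", "subclass", "jazz")     # hierarchical relation
    onto.add_relation("person", "to play", "music")    # functional relation
    return onto
```

Storing relations as labelled triples keeps hierarchical and functional relations in one edge list, which is enough for the distance computation used later when class weights are assigned.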

3. Beneficial effects: compared with existing crawler systems, a web crawler system constructed according to the method of the invention has the notable advantage that, because it works on the basis of semantic understanding, the crawler is more intelligent, and both the accuracy and the efficiency of the system are improved.

IV. Description of the drawings

Fig. 1 is a structural diagram of the system of the invention;

Fig. 2 is a workflow diagram of the invention;

Fig. 3 is a schematic diagram of the ontology library structure.

V. Detailed description of embodiments

As shown in Fig. 1, the web crawler system constructed by the method of the invention comprises a basic crawler module, a topic relevance evaluation module and an ontology management system module. The topic relevance evaluation module in turn comprises a preprocessing submodule and a relevance calculation submodule.

The flow of the method is shown in Fig. 2 and is described in detail below:

Step (1): The HTML file of the current Web page is parsed to isolate the body text it contains.
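
As an illustration of step (1), the sketch below uses Python's standard `html.parser` module to pull the visible text out of an HTML string. The class name `BodyTextExtractor` and the function `extract_text` are hypothetical, and skipping only `<script>` and `<style>` content is a simplifying assumption; the patent does not specify how the parsing is done.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```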

Step (2): The extracted text is preprocessed. Here, the number of times N(w_{i}) that each keyword occurs in the current document is usually counted according to a keyword list preset in the system.
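
The counting in step (2) might look like the following sketch. It assumes single-word, lowercase English keywords and a simple alphanumeric tokenizer; both are assumptions made for illustration, and `count_keywords` is a hypothetical name.

```python
import re
from collections import Counter


def count_keywords(text, keyword_list):
    """Return N(w_i): the number of occurrences of each preset keyword."""
    # Assumption: tokens are runs of ASCII letters and digits, case-folded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    wanted = set(keyword_list)
    counts = Counter(t for t in tokens if t in wanted)
    # Report every preset keyword, including those that never occur.
    return {w: counts[w] for w in keyword_list}
```

Only keywords from the preset list are counted, so the size of the result is bounded by the lexical layer rather than by the vocabulary of the page.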

Step (3): The class frequency of each ontology class in the current document is calculated from the keyword set corresponding to each ontology class in the ontology library.

The ontology structure is illustrated in Fig. 3, which shows part of the structure graph of an ontology library whose core concept is music. The graph contains a series of conceptual abstractions of real-world things, such as "music" and "person"; these concepts constitute the classes of the ontology management system. The graph also contains logical relations connecting classes, such as "to play", and hierarchical relations, such as that between "music" and "jazz"; together these constitute the relation set of the ontology management system. Besides the classes and relations shown in the graph, the ontology management system also manages a lexical layer below the ontology layer. Each class or relation in the ontology layer corresponds to a text vocabulary in the lexical layer; for the class "music", for example, the corresponding vocabulary contains "song, melody, music".

From the keyword set corresponding to each ontology class in the ontology library, the class frequency f_{c_{k}}^{D} of an ontology class in the current document can be calculated by the following formula:

f_{c_{k}}^{D} = N_{w_{1}^{k}}^{D} + N_{w_{2}^{k}}^{D} + \cdots + N_{w_{i}^{k}}^{D}

where f_{c_{k}}^{D} denotes the frequency with which class c_{k} occurs in document D; class c_{k} corresponds to the text vocabulary w_{1}^{k}, w_{2}^{k}, \ldots, w_{i}^{k}; and the word w_{i}^{k} occurs N_{w_{i}^{k}}^{D} times in document D.
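
The class frequency formula is a plain sum of keyword counts over each class's vocabulary, as the sketch below shows. The function name `class_frequencies` and the sample data in the usage note are hypothetical.

```python
def class_frequencies(keyword_counts, class_keywords):
    """Class frequency per class: sum of counts over the class's keyword set.

    keyword_counts: keyword -> number of occurrences in document D
    class_keywords: class name -> iterable of keywords for that class
    """
    return {
        cls: sum(keyword_counts.get(word, 0) for word in words)
        for cls, words in class_keywords.items()
    }
```

For instance, with counts of 2 for "song", 0 for "melody" and 1 for "music", the class "music" with vocabulary "song, melody, music" gets a class frequency of 3.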

Step (4): From the ontology class frequencies, the topic relevance r_{D} of the current page is calculated in combination with the topic ontology library.

Once the number of occurrences in the document of each class in the ontology library has been calculated, a mapping from a given Web document to its topic relevance can be obtained. Realizing this mapping requires using the structural relations between the elements of the ontology graph: each class in the ontology is scored for its topic relevance to the text using the class frequencies calculated above, and the class scores are then combined to give the relevance of the whole page to the topic.

The topic relevance r_{D} of page D is calculated as follows:

r_{D} = \sum_{c_{k} \in D} \left( f_{c_{k}}^{D} \times w_{c_{k}} \right)

where f_{c_{k}}^{D} is the class frequency of class c_{k} in page D and w_{c_{k}} is the weight of class c_{k}. Before the web crawler runs, the weight of each class usually needs to be assigned an initial value w_{c_{k}}^{0}:

w_{c_{k}}^{0} = 1.00 \times n^{d(c_{k}, T)}

where n denotes the discount factor and d(c_{k}, T) denotes the distance from class c_{k} to the core class of the ontology library, i.e. the topic class T of this topic-focused web crawler system.
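
A minimal sketch of the weight assignment and the relevance sum follows. Several points are assumptions rather than statements from the patent: the distance is taken as the number of edges on the shortest path found by breadth-first search, edge direction is ignored when measuring it, classes unreachable from the topic class contribute nothing, and the discount factor of 0.5 in the usage note is an arbitrary illustrative value. The names `class_weights` and `relevance` are hypothetical.

```python
from collections import deque


def class_weights(edges, topic, discount):
    """Initial weight per class: 1.00 * discount ** d(class, topic)."""
    # Assumption: distance ignores edge direction.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search from the topic class gives shortest distances.
    distance = {topic: 0}
    queue = deque([topic])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return {cls: 1.00 * discount ** d for cls, d in distance.items()}


def relevance(frequencies, weights):
    """Topic relevance: sum over classes of class frequency times weight."""
    # Assumption: a class with no weight (unreachable) counts as 0.
    return sum(f * weights.get(cls, 0.0) for cls, f in frequencies.items())
```

With edges music-jazz, person-music and jazz-bebop, topic class "music" and a discount factor of 0.5, the weights are 1.0 for "music", 0.5 for "jazz" and "person", and 0.25 for "bebop", so classes further from the topic contribute less to the page score.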

Step (5): The topic relevance of the current page is compared with a reference value. If it is greater than the reference value, the current page satisfies the topic relevance requirement and its outgoing links are extracted; otherwise, the page does not satisfy the requirement and the system goes to step (7).

Step (6): The link information in the current page is processed.

All links in the current page are processed in order. If the URL pointed to by a link has already been visited, the next link is extracted; if the URL has not been visited, it is inserted into the priority queue of URLs to be visited according to the topic relevance of the page containing the link.
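
The link handling of step (6) can be sketched with a binary heap from Python's standard `heapq` module. The class name `Frontier` is hypothetical. Two choices here are assumptions: URLs that are already queued are treated like visited ones so that no URL is queued twice, and ties in relevance are broken by insertion order.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs to visit, highest page relevance first."""

    def __init__(self):
        self._heap = []
        self._seen = set()               # URLs visited or already queued
        self._order = itertools.count()  # tie-breaker: insertion order

    def add_links(self, links, page_relevance):
        for url in links:
            if url in self._seen:
                continue                 # skip and move on to the next link
            self._seen.add(url)
            # heapq is a min-heap, so the relevance is negated.
            heapq.heappush(self._heap, (-page_relevance, next(self._order), url))

    def pop(self):
        """Remove and return the URL with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Every link inherits the relevance of the page it was found on, so links from a highly relevant page are visited before links from a marginal one.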

Step (7): The URL at the head of the priority queue of URLs to be visited, i.e. the one with the highest priority, is selected and visited.

Step (8): Steps (1) to (7) are repeated until no new URL satisfies the conditions or a set limit is reached.
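
Putting steps (1) to (8) together, the overall control flow might be sketched as below. This is a simplified sketch under stated assumptions, not the patented system: `crawl` is a hypothetical name, `fetch` and `score` are caller-supplied stand-ins for page parsing and for the relevance calculation of steps (2) to (4), `max_pages` plays the role of the "set limit", and no real network access is performed.

```python
import heapq


def crawl(seed, fetch, score, threshold, max_pages):
    """Return the URLs whose topic relevance exceeded `threshold`.

    fetch(url) -> (text, links); score(text) -> float topic relevance.
    """
    frontier = [(0.0, 0, seed)]   # (negated priority, insertion order, URL)
    seen = {seed}                 # URLs visited or already queued
    order = 1
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:          # step (8): stop conditions
        _, _, url = heapq.heappop(frontier)          # step (7): highest priority
        text, links = fetch(url)                     # step (1): parse the page
        fetched += 1
        r = score(text)                              # steps (2)-(4): relevance
        if r <= threshold:                           # step (5): below threshold,
            continue                                 # links are not extracted
        relevant.append(url)
        for link in links:                           # step (6): queue new links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-r, order, link))
                order += 1
    return relevant
```

Because links are only extracted from pages that pass the threshold, an off-topic page acts as a dead end and the pages reachable only through it are never fetched.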

Claims

1. An ontology-based method for constructing a topic-focused web crawler system, comprising the following step: (1) parsing the current Web page; characterized in that the method further comprises the following steps:

(6) if the URL pointed to by the link has already been visited, extracting the next link; if the URL has not been visited, inserting it into the priority queue of URLs to be visited according to the topic relevance of the page containing the link;

2. The ontology-based method for constructing a topic-focused web crawler system according to claim 1, characterized in that the ontology library in step (4) is obtained by collecting existing general-purpose ontology libraries and processing them, in steps comprising:

(1) extracting the classes in the existing ontology libraries;