CN100401301C - Method for constructing an intelligent topic-focused web crawler system based on ontology learning

Info

Publication number: CN100401301C
Application number: CNB2006100407437A
Authority: CN
Inventors: 高阳; 苏畅
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2008-07-09
Anticipated expiration: 2026-05-30
Also published as: CN1851706A

Abstract

The present invention discloses a method for constructing an intelligent topic-focused web crawler system based on ontology learning, comprising the following steps: (1) the current Web page is parsed; (2) the word-level information obtained is processed; (3) the word-level information is converted into ontology information; (4) the topic relevance of the page is calculated; (5) if the topic relevance of the page is higher than the upper threshold, the page content information set is updated; otherwise step (9) is executed; (6) the interest ratio of each ontology class is updated; (7) the ontology is learned and the weights of the ontology classes are modified; (8) the ontology learning algorithm is optimized; (9) if the topic relevance of the page exceeds the lower threshold, the URLs in the current Web page are extracted in order; otherwise step (11) is executed; (10) if the URL pointed to by a link has already been visited, the next link is extracted; otherwise the URL is inserted into the priority queue of URLs to be visited according to the topic relevance of the page containing the link; (11) the first URL in the queue is selected; (12) steps (1) to (11) are executed repeatedly. The invention yields highly accurate results at lower computation and storage cost, and the resulting system is more robust.

Description

Method for constructing an intelligent topic-focused web crawler system based on ontology learning

I. Technical Field

The present invention relates to a method for constructing a web crawler, and in particular to a method for constructing an intelligent topic-focused web crawler.

II. Background Art

The Web crawler is one of the core components of a search engine, and making crawler systems work more efficiently has attracted the attention of a growing number of researchers. In particular, crawler systems aimed at a specific topic have become a focus of current research. The goal of a topic-focused Web crawler is to avoid visiting Web pages unrelated to the topic as far as possible and to concentrate on the pages related to it. Such crawler systems are mainly used in domain-specific search engines and Web information retrieval systems.

Existing topic-focused crawler systems mainly estimate the topic relevance of a Web page from statistics on its text keywords. However, the content of Web pages varies widely and the corresponding keyword dictionary is usually very large, so the computational cost of the system is high and a large amount of high-dimensional data must be maintained. In addition, because natural language itself exhibits polysemy (one word with several meanings) and synonymy (several words with one meaning), it is often quite difficult to characterize a topic or the content of a page by keywords alone, which introduces bias into the topic relevance evaluation. The present invention addresses this problem by introducing an ontology.

Although using an ontology library as a knowledge base to guide a topic-focused crawler system can indeed improve its efficiency, its performance still depends to a large extent on the quality of the ontology library. If the ontology library serving as the background knowledge base is incomplete or itself contains errors, a topic-focused crawler system built on it will inevitably be inefficient. A further challenge is that ontology construction techniques cannot yet provide, for any arbitrary topic, a domain ontology library that characterizes the topic comprehensively. The present invention therefore proposes a method for constructing an intelligent topic-focused web crawler system based on ontology learning, in which an ontology learning module is introduced to overcome these limitations.

III. Summary of the Invention

1. Object of the invention: the object of the present invention is to provide an efficient and accurate method for constructing a topic-focused intelligent web crawler based on ontology learning techniques.

2. Technical solution: to achieve the above object, the method of the present invention comprises the following steps:

(1) the current Web page is parsed;

(2) the text of the current page is preprocessed to obtain word-level information;

(3) the word-level information is converted into ontology information by the ontology management system;

(4) the topic relevance of the page is calculated from the ontology information obtained, in combination with the ontology library;

(5) if the topic relevance of the current page is higher than the upper threshold, the page content information set is updated; if the topic relevance of the current page is lower than the upper threshold, the page content information set is updated and execution then jumps to step (9);

(6) the interest ratio of each ontology class in the ontology library is updated from the updated page content information set;

(7) the ontology is learned through the evolution of the interest ratios, and the weights of the ontology classes are modified;

(8) the ontology learning algorithm is optimized using propagation algorithms;

(9) if the topic relevance of the current page is greater than the lower threshold, the URLs pointed to by all outgoing links in the current Web page are extracted in order; otherwise step (11) is executed;

(10) if the URL pointed to by a link has already been visited, the next link is extracted; if the URL has not been visited, it is inserted into the priority queue of URLs to be visited according to the topic relevance of the page containing the link;

(11) the first URL in the priority queue of URLs to be visited, that is, the one with the highest priority, is selected and visited;

(12) steps (1) to (11) are repeated until no new URL satisfying the conditions appears.

The ontology library in step (4) above is built by collecting existing general-purpose ontology libraries and processing them into an ontology library that suits this method, as follows:

(4.1) the classes in the existing ontology libraries are extracted;

(4.2) the hierarchical relations and logical relations between the classes in the existing ontology libraries are extracted;

(4.3) the classes are taken as nodes and the hierarchical and logical relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;

(4.4) for each class in each ontology library, a keyword set corresponding to that class is built, forming the lexical layer of the ontology library.
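The two-layer structure produced by steps (4.1) to (4.4) can be sketched as a small data structure. This is only an illustrative sketch: the `Ontology` class, its method names, the relation label `is_a`, and the keyword sets for "jazz" and "person" are assumptions made for the example; only the classes "music", "jazz", "person", the relation "to play", and the keyword set "song, melody, music" are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container: an ontology layer (graph) plus a lexical layer."""
    classes: set = field(default_factory=set)
    edges: list = field(default_factory=list)    # (source, relation, target)
    lexicon: dict = field(default_factory=dict)  # class name -> keyword set

    def add_class(self, name, keywords):
        # Steps (4.1) and (4.4): register a class and its keyword set.
        self.classes.add(name)
        self.lexicon[name] = set(keywords)

    def add_relation(self, source, relation, target):
        # Steps (4.2) and (4.3): store a relation as a directed edge.
        if source not in self.classes or target not in self.classes:
            raise KeyError("both endpoints must be known classes")
        self.edges.append((source, relation, target))

    def neighbours(self, name):
        # Classes directly connected to `name`, ignoring edge direction.
        found = set()
        for source, _, target in self.edges:
            if source == name:
                found.add(target)
            elif target == name:
                found.add(source)
        return found


ontology = Ontology()
ontology.add_class("music", ["song", "melody", "music"])
ontology.add_class("jazz", ["jazz"])        # assumed keyword set
ontology.add_class("person", ["person"])    # assumed keyword set
ontology.add_relation("jazz", "is_a", "music")       # hierarchical relation
ontology.add_relation("person", "to play", "music")  # logical relation
```

Keeping the lexical layer as a separate mapping means the keyword sets can be edited without touching the graph of classes and relations.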

The complete topic-focused intelligent web crawler system based on ontology learning constructed by this method can be divided into two loop subsystems. The first is the ontology-based web crawler system, made up of the crawler and queue maintenance module, the preprocessing and topic relevance evaluation module, and the ontology management system module; this loop implements the basic functions of an ontology-based web crawler system. The second is the ontology evolution system, made up of the preprocessing and topic relevance evaluation module, the ontology management system module, and the ontology learning module; this loop implements the evolution and modification of the ontology library.

3. Beneficial effects: compared with the prior art, the notable advantage of the method of the present invention is that, on the basis of semantic understanding, it gives the crawler system the ability to learn, which improves its accuracy and efficiency and makes the system highly robust.

IV. Description of the Drawings

Fig. 1 is a structural diagram of the system of the present invention;

Fig. 2 is a workflow diagram of the present invention;

Fig. 3 is a schematic diagram of the ontology library structure.

V. Detailed Description of the Embodiments

As shown in Fig. 1, the intelligent topic-focused web crawler system based on ontology learning built with the construction method of this embodiment comprises four modules: the crawler and queue maintenance module, the preprocessing and topic relevance evaluation module, the ontology management system module, and the ontology learning module. The preprocessing and topic relevance evaluation module in turn comprises a preprocessing submodule and a relevance computation submodule.

The flow of the method of the present invention is shown in Fig. 2 and described in detail below.

The method comprises the following steps:

Step (1): the HTML file of the current Web page is parsed and the body text it contains is separated out;
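As an illustration of this parsing step, the following minimal sketch uses the `HTMLParser` class from the Python standard library to separate the visible text of a page and, for later use in steps (9) and (10), its outgoing links. The names `BodyTextExtractor` and `parse_page`, and the decision to skip `script` and `style` content, are assumptions of this sketch and are not specified by the description.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects visible text and outgoing link targets from an HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # assumed to carry no body text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Return (body_text, outgoing_links) for one HTML document."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


sample = (
    "<html><head><style>p{}</style></head><body><p>Jazz music</p>"
    "<a href='http://example.org/band'>band</a>"
    "<script>var x=1;</script></body></html>"
)
body_text, outgoing_links = parse_page(sample)
```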

Step (2): the separated text is preprocessed; usually, the number of occurrences N(w_i) of each keyword w_i in the current document is counted according to the keyword list preset in the system;

Step (3): the class frequency of each ontology class in the current document is calculated from the keyword set corresponding to each ontology class in the ontology library;

A schematic diagram of the ontology structure is given in Fig. 3, which shows part of the structure of an ontology library whose core concept is music. The ontology structure diagram contains a series of conceptual abstractions of real-world things, such as "music" and "person"; these concepts constitute the classes in the ontology management system. The diagram also contains logical relations connecting classes, such as "to play", and hierarchical relations, such as the one between "music" and "jazz"; these logical and hierarchical relations constitute the relation set in the ontology management system. Besides the classes and relations actually shown in the diagram, the ontology management system also manages a lexical layer below the ontology layer. Each class or relation in the ontology layer corresponds to a text vocabulary set in the lexical layer; for the class "music", for example, the corresponding text vocabulary set contains "song, melody, music".

Given the keyword set corresponding to each ontology class in the ontology library, the class frequency f_{c_{k}}^{D} of an ontology class in the current document can be calculated with the following formula:

f_{c_{k}}^{D} = N_{w_{1}^{k}}^{D} + N_{w_{2}^{k}}^{D} + \cdots + N_{w_{i}^{k}}^{D}

where f_{c_{k}}^{D} denotes the frequency with which class c_k occurs in document D; class c_k corresponds to the text vocabulary set w_1^k, w_2^k, ..., w_i^k; and the word w_i^k occurs N_{w_{i}^{k}}^{D} times in document D.
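The class frequency is simply the sum of the occurrence counts of the keywords that belong to a class. A minimal sketch follows, assuming single-word, lower-case English keywords and a plain regular-expression tokenizer; the function name `class_frequencies` and the keyword set for "jazz" are illustrative and do not come from the description.

```python
import re
from collections import Counter


def class_frequencies(text, lexicon):
    """Map each ontology class to the summed counts of its keywords in `text`.

    `lexicon` maps a class name to its set of lower-case, single-word keywords.
    """
    # Step (2): count every word of the document once.
    word_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Step (3): add up the counts of the keywords of each class.
    return {
        cls: sum(word_counts[word] for word in keywords)
        for cls, keywords in lexicon.items()
    }


lexicon = {
    "music": {"song", "melody", "music"},  # keyword set from the description
    "jazz": {"jazz"},                      # assumed keyword set
}
frequencies = class_frequencies("Jazz music: a song, a melody, more music.", lexicon)
```

For this sample text the class "music" is matched by "music" twice, "song" once and "melody" once, so its class frequency is 4, while "jazz" has class frequency 1.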

Step (4): the topic relevance r_D of the current page is calculated from the class frequencies in combination with the topic ontology library;

Once the number of occurrences in the document of each class in the ontology library has been calculated, a mapping from a given Web document to its topic relevance can be obtained. To realize this mapping, the structural relations between the elements of the ontology graph are used together with the class frequencies calculated above to score each ontology class for its contribution to the topic relevance of the text; the scores of all classes are then combined to obtain the relevance of the whole page to the topic.

The topic relevance r_D of page D is calculated as follows:

r_{D} = \sum_{c_{k} \in D} (f_{c_{k}}^{D} \times W_{c_{k}})

where f_{c_{k}}^{D} is the class frequency of class c_k in page D and W_{c_{k}} is the weight of class c_k. Usually, before the web crawler is run, the weight of each class needs to be initialized:

W_{c_{k}}^{0} = 1.00 \times n^{d(c_{k}, T)}

where n denotes the discount factor and d(c_k, T) denotes the distance from class c_k to the core class of the ontology library, that is, the topic class T of the topic-focused web crawler system.
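The weight initialization and the relevance sum can be sketched as follows. The description does not say how the distance d(c_k, T) is measured; this sketch assumes it is the number of edges on the shortest path in the ontology graph, treated as undirected. The function names, the small example graph, and the discount factor 0.5 are likewise assumptions chosen for illustration.

```python
from collections import deque


def distances_from_topic(adjacency, topic):
    """Breadth-first hop counts from the topic class to every reachable class."""
    distance = {topic: 0}
    queue = deque([topic])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def initial_weights(adjacency, topic, discount):
    """Initial weight of each class: 1.00 * discount ** d(class, topic)."""
    hops = distances_from_topic(adjacency, topic)
    return {cls: 1.00 * discount ** d for cls, d in hops.items()}


def relevance(class_freq, weights):
    """Topic relevance: sum over classes of class frequency times class weight."""
    return sum(freq * weights.get(cls, 0.0) for cls, freq in class_freq.items())


# Hypothetical fragment of an ontology graph around the topic class "music".
adjacency = {
    "music": ["jazz", "band"],
    "jazz": ["music"],
    "band": ["music", "rock band"],
    "rock band": ["band"],
}
weights = initial_weights(adjacency, "music", discount=0.5)
page_relevance = relevance({"music": 4, "jazz": 1, "rock band": 2}, weights)
```

With a discount factor below 1, classes further from the topic class start with smaller weights, so their occurrences contribute less to the relevance of a page.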

Step (5): if the topic relevance of the current page is higher than the upper threshold, the page is considered to be centered on the topic, and the page content information set K is updated accordingly; if the topic relevance of the current page is lower than the upper threshold, the content of the page may be related to the topic but is not centered on it, so the system updates the page content information set and then jumps to step (9);

The page content information set K comprises the following variables:

N_c: the total number of pages the web crawler has processed;

N_t: the number of pages belonging to the target topic among the pages the web crawler has processed;

n_i: the number of pages in which class i of the ontology library occurs among the pages the web crawler has processed;

n_i^t: the number of pages that belong to the target topic and in which class i of the ontology library occurs among the pages the web crawler has processed;

Here c_i denotes the event that ontology class i occurs in a page, and t denotes the event that the page satisfies the requirements of the specific topic.

Once the above information has been collected, several important parameters for ontology learning can be calculated with the following formulas:

P(t) = N_{t} / N_{c}

P(t | c_{i}) = P(t \cap c_{i}) / P(c_{i}) = n_{i}^{t} / n_{i}
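The counters in the page content information set K and the two probabilities derived from them can be sketched as below. The sketch reads the topic probability as the share of on-topic pages among all processed pages, N_t / N_c, which is an assumption based on the variable definitions; the class name `PageStats` and its method names are also hypothetical. Both probability methods assume that at least one relevant page has been recorded, otherwise the division fails.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageStats:
    """Hypothetical holder for the counters of the page content information set K."""
    n_crawled: int = 0                                        # N_c
    n_topic: int = 0                                          # N_t
    n_class: Counter = field(default_factory=Counter)         # n_i
    n_class_topic: Counter = field(default_factory=Counter)   # n_i^t

    def update(self, classes_on_page, on_topic):
        """Record one processed page and the ontology classes found on it."""
        present = set(classes_on_page)
        self.n_crawled += 1
        self.n_class.update(present)
        if on_topic:
            self.n_topic += 1
            self.n_class_topic.update(present)

    def p_topic(self):
        # Assumed reading: share of on-topic pages among all processed pages.
        return self.n_topic / self.n_crawled

    def p_topic_given_class(self, cls):
        # n_i^t / n_i: share of on-topic pages among pages containing the class.
        return self.n_class_topic[cls] / self.n_class[cls]


stats = PageStats()
stats.update({"music", "jazz"}, on_topic=True)
stats.update({"music"}, on_topic=False)
stats.update({"jazz", "band"}, on_topic=True)
stats.update({"person"}, on_topic=False)
```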

Step (6): the interest ratio of each ontology class in the ontology library is updated from the updated page content information set:

Once the above variables have been obtained, the interest ratio of each ontology class with respect to the topic t can be calculated. Let the interest ratio of ontology class c_i be Quo(t, c_i); then:

Quo(t, c_{i}) = P(t \cap c_{i}) / (P(t) \cdot P(c_{i}))

Step (7): the ontology is learned through the evolution of the interest ratios, and the weights of the ontology classes are modified:

W_{c_{i}}^{k} = Quo(t, c_{i}) \cdot W_{c_{i}}^{k-1}

When ontology class c_i is closely related to the topic, its interest ratio Quo(t, c_i) is greater than 1, so the weight of ontology class c_i keeps increasing while the web crawler system runs; conversely, when ontology class c_i is unrelated to the topic, its interest ratio Quo(t, c_i) is less than 1, so the weight of ontology class c_i keeps decreasing while the web crawler system runs.
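The interest ratio compares how often a class and the topic occur together with how often they would co-occur if they were independent, and the weight update multiplies the previous weight by that ratio. The sketch below estimates every probability as a page count divided by the total number of processed pages, which is an assumption of this example; the function names and the sample counts are illustrative only.

```python
def interest_ratio(n_crawled, n_topic, n_class, n_class_topic):
    """Joint probability of topic and class divided by the product of the
    two marginal probabilities, each estimated from page counts."""
    p_topic = n_topic / n_crawled
    p_class = n_class / n_crawled
    p_joint = n_class_topic / n_crawled
    return p_joint / (p_topic * p_class)


def update_weight(previous_weight, ratio):
    """One learning step: new weight = interest ratio * previous weight."""
    return ratio * previous_weight


# Hypothetical counts: 64 pages processed, 16 of them on topic.
related = interest_ratio(n_crawled=64, n_topic=16, n_class=8, n_class_topic=4)
unrelated = interest_ratio(n_crawled=64, n_topic=16, n_class=8, n_class_topic=1)
raised_weight = update_weight(0.5, related)
lowered_weight = update_weight(0.5, unrelated)
```

In this example a class that appears on topic in 4 of its 8 pages gets a ratio above 1 and its weight grows, while a class that appears on topic in only 1 of its 8 pages gets a ratio below 1 and its weight shrinks.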

Step (8): the ontology learning algorithm is optimized using propagation algorithms:

In the model of the topic-focused intelligent web crawler system based on ontology learning built according to the method of the present invention, two different propagation algorithms (Propagation Methods) are used to improve the efficiency of ontology learning. The first is the "bellman" (bell ringer) algorithm, whose design is based on the close semantic relation between adjacent ontology classes in the ontology structure diagram. In the ontology structure diagram shown in Fig. 3, if the weight of the ontology class "band" has changed in the most recent learning cycle, then "band" should act like a bellman and ring the bell to notify the ontology classes adjacent to it, for example the classes "rock band" and "group", so that these classes make a change similar to its own; the magnitude of the change should be inversely proportional to the distance between the notified class and the class acting as bellman. The second algorithm used to improve the efficiency of ontology learning is called the "patroller" algorithm. Its design is based on the close logical relation between the ontology classes that lie, in the ontology structure diagram, on the path from a topic-related ontology class to the topic class. Taking the ontology in Fig. 3 as an example again, after the ontology class "band" has changed, the ontology classes "group" and "person", which lie on the path from "band" to the topic class "music", should make a similar change, and the magnitude of their change should be inversely proportional to their distance from the class acting as patroller.
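The description gives only the design idea of the two propagation algorithms, so the following is one possible reading rather than the method itself. It assumes that the change is additive, that the distance is the hop count in the ontology graph, that a hypothetical `strength` factor scales the propagated change, and that the graph around "band" has the shape shown below; the actual structure of Fig. 3 may differ.

```python
from collections import deque


def bfs_parents(adjacency, start):
    """Breadth-first search returning {class: parent} for reachable classes."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent


def bellman_propagate(weights, adjacency, source, delta, strength=0.5):
    """Pass a weight change of `source` on to its directly adjacent classes.

    Adjacent classes are at distance 1, so each receives strength * delta / 1.
    """
    updated = dict(weights)
    for neighbour in adjacency.get(source, ()):
        updated[neighbour] = updated.get(neighbour, 0.0) + strength * delta / 1
    return updated


def patroller_propagate(weights, adjacency, source, topic, delta, strength=0.5):
    """Pass a weight change along the shortest path from `source` to `topic`.

    Only the classes strictly between the two ends are changed, each by
    strength * delta / distance_from_source.
    """
    updated = dict(weights)
    parent = bfs_parents(adjacency, source)
    if topic not in parent:
        return updated  # no path, nothing to propagate
    path = []
    node = topic
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()  # source ... topic
    for distance, cls in enumerate(path):
        if 0 < distance < len(path) - 1:
            updated[cls] = updated.get(cls, 0.0) + strength * delta / distance
    return updated


# Assumed graph shape: band - group - person - music, plus band - rock band.
adjacency = {
    "band": ["rock band", "group"],
    "rock band": ["band"],
    "group": ["band", "person"],
    "person": ["group", "music"],
    "music": ["person"],
}
weights = {name: 1.0 for name in adjacency}
after_bellman = bellman_propagate(weights, adjacency, "band", delta=0.4)
after_patroller = patroller_propagate(weights, adjacency, "band", "music", delta=0.4)
```

Under these assumptions the bellman variant spreads a change sideways to semantically adjacent classes, while the patroller variant pushes it towards the topic class along a single path.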

Step (9): the topic relevance of the current page is compared with the lower threshold. If it is greater than the lower threshold, the current page satisfies the topic relevance requirement and the outgoing links in the page need to be extracted; if it is less than the lower threshold, the page does not satisfy the topic relevance requirement and the system jumps to step (11).
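The two thresholds used in steps (5) and (9) split pages into three groups, which the following sketch makes explicit. The names `Routing` and `route_page` are hypothetical, and the handling of a relevance exactly equal to a threshold is an assumption, since the description does not cover that case.

```python
from typing import NamedTuple


class Routing(NamedTuple):
    run_learning: bool   # steps (6) to (8) are executed for this page
    extract_links: bool  # steps (9) and (10) process the outgoing links


def route_page(relevance, upper_limit, lower_limit):
    """Decide which later steps apply to a page with the given relevance.

    Strict comparisons are assumed; equality is treated as "not above".
    """
    if lower_limit > upper_limit:
        raise ValueError("lower_limit must not exceed upper_limit")
    return Routing(
        run_learning=relevance > upper_limit,
        extract_links=relevance > lower_limit,
    )


on_topic_page = route_page(5.0, upper_limit=3.0, lower_limit=1.0)
related_page = route_page(2.0, upper_limit=3.0, lower_limit=1.0)
off_topic_page = route_page(0.5, upper_limit=3.0, lower_limit=1.0)
```

A page above the upper threshold feeds the learning steps and has its links followed, a page between the thresholds only has its links followed, and a page below the lower threshold contributes neither.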

Step (10): the link information in the current page is processed:

All links in the current page are processed in order. If the URL pointed to by a link has already been visited, the next link is extracted; if the URL has not been visited, it is inserted into the priority queue of URLs to be visited according to the topic relevance of the page containing the link.

Step (11): the URL at the head of the priority queue of URLs to be visited, that is, the one with the highest priority, is selected and visited.

Step (12): steps (1) to (11) are repeated until no new URL satisfying the conditions appears or a certain limit is reached.
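The queue handling of steps (9) to (12) can be sketched with a binary heap. To keep the example self-contained, page retrieval and relevance scoring are passed in as the hypothetical callables `fetch` and `score`, the ontology learning of steps (5) to (8) is left out, and the `max_pages` limit stands in for the unspecified "certain limit". Tracking URLs that are already queued, in addition to those already visited, is an extension assumed here to avoid duplicate queue entries.

```python
import heapq
from itertools import count


def crawl(seed_url, fetch, score, lower_limit, max_pages=100):
    """Best-first crawl: always visit the queued URL whose source page
    had the highest topic relevance.

    fetch(url) -> (text, links) and score(text) -> float are supplied by
    the caller. Returns the visited (url, relevance) pairs in visiting order.
    """
    tie_breaker = count(1)             # keeps insertion order among equal priorities
    frontier = [(0.0, 0, seed_url)]    # heap of (-priority, order, url)
    seen = {seed_url}                  # URLs visited or already queued
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)       # step (11)
        text, links = fetch(url)                  # steps (1) to (3)
        relevance = score(text)                   # step (4)
        visited.append((url, relevance))
        if relevance <= lower_limit:              # step (9): skip the links
            continue
        for link in links:                        # step (10)
            if link in seen:
                continue
            seen.add(link)
            heapq.heappush(frontier, (-relevance, next(tie_breaker), link))
    return visited


# A tiny in-memory "web" used instead of real network access.
pages = {
    "a": ("music", ["b", "c"]),
    "b": ("music music music", ["d"]),
    "c": ("nothing", ["e"]),
    "d": ("music", []),
    "e": ("music", []),
}
result = crawl(
    "a",
    fetch=lambda url: pages[url],
    score=lambda text: float(text.split().count("music")),
    lower_limit=0.0,
)
visit_order = [url for url, _ in result]
```

In this example page "d" is visited before page "c" because it was linked from the more relevant page "b", and page "e" is never visited because the page linking to it falls below the lower threshold.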

Claims

1. A method for constructing an intelligent topic-focused web crawler system based on ontology learning, comprising the following step:

(1) the current Web page is parsed; characterized in that the method comprises the following steps:

(5) if the topic relevance of the current page is higher than the upper threshold, the page content information set is updated; if the topic relevance of the current page is lower than the upper threshold, the page content information set is updated and step (9) is then executed;

(8) the ontology learning algorithm is optimized using propagation algorithms;

(9) if the topic relevance of the current page is greater than the lower threshold, the URLs pointed to by all outgoing links in the current Web page are extracted in order; otherwise step (11) is executed;

(10) if the URL pointed to by a link has already been visited, the next link is extracted; if the URL has not been visited, it is inserted into the priority queue of URLs to be visited according to the topic relevance of the page containing the link;

(12) steps (1) to (11) are repeated until no new URL satisfying the conditions appears.

2. The method for constructing an intelligent topic-focused web crawler system based on ontology learning according to claim 1, characterized in that the ontology library in step (4) is built by collecting existing general-purpose ontology libraries and processing them into an ontology library that suits this method, as follows:

(4.1) the classes in the existing ontology libraries are extracted;

(4.3) the classes are taken as nodes and the hierarchical and logical relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;

(4.4) for each class in each ontology library, a keyword set corresponding to that class is built, forming the lexical layer of the ontology library.