CN100401301C - Body learning based intelligent subject-type network reptile system configuration method - Google Patents

Body learning based intelligent subject-type network reptile system configuration method

Info

Publication number
CN100401301C
Authority
CN
China
Prior art keywords
page
class
subject
ontology
ontology library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100407437A
Other languages
Chinese (zh)
Other versions
CN1851706A (en)
Inventor
高阳
苏畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2006-05-30
Filing date
2006-05-30
Publication date
2008-07-09
Application filed by Nanjing University
Priority to CNB2006100407437A
Publication of CN1851706A
Application granted
Publication of CN100401301C
Expired - Fee Related
Anticipated expiration

Abstract

The present invention discloses a method for constructing an intelligent topic-focused web crawler system based on ontology learning, comprising the following steps: (1) the current Web page is parsed; (2) the text is preprocessed to obtain word-level information; (3) the word-level information is converted into ontology information; (4) the topic relevance of the page is calculated; (5) if the topic relevance of the page is higher than the upper threshold, the page content information set is updated, otherwise step (9) is executed; (6) the interest ratio of each ontology class is updated; (7) the ontology is learned and the weights of the ontology classes are modified; (8) the ontology learning algorithm is optimized; (9) the URLs in the current Web page are extracted in order, or else step (11) is executed; (10) if the URL a link points to has already been visited, the next link is extracted, otherwise the URL is inserted into the priority queue of URLs to visit according to the topic relevance of the page containing the link; (11) the first URL in the queue is selected; (12) steps (1) to (11) are repeated. The invention yields highly accurate results with lower computation and storage overhead, and the resulting system is more robust.

Description

Method for constructing an intelligent topic-focused web crawler system based on ontology learning
I. Technical field
The present invention relates to a method for constructing a web crawler, and in particular to a method for constructing an intelligent topic-focused web crawler.
II. Background
The Web crawler is one of the core components of a search engine, and making crawler systems work more efficiently has attracted the attention of more and more researchers; crawlers aimed at a particular topic have become a focus of current research. The goal of a topic-focused Web crawler is to avoid visiting pages unrelated to the topic as far as possible and to concentrate on visiting pages related to it. Such crawlers are mainly used in domain-specific search engines and Web information retrieval systems.
Current topic-focused crawlers mainly estimate topic relevance from keyword statistics of the page text. But Web content varies widely and the corresponding keyword dictionaries are usually very large, so the computational cost is high and a large amount of high-dimensional data must be maintained. In addition, because natural language exhibits polysemy (one word, many meanings) and synonymy (many words, one meaning), it is often quite difficult to characterize a topic or a page by keywords alone, which biases the relevance evaluation. The present invention addresses this problem by introducing an ontology.
Although using an ontology library as a knowledge base to guide a topic-focused crawler can indeed improve its efficiency, performance still depends heavily on the quality of the ontology library. If the ontology library serving as the background knowledge base is not comprehensive enough, or itself contains errors, the crawler built on it will inevitably be inefficient. A further challenge is that current ontology construction techniques cannot yet provide, for any given topic, a domain ontology library that fully characterizes it. The present invention therefore proposes a method for constructing an intelligent topic-focused web crawler system based on ontology learning; the system achieves this goal by introducing an ontology learning module.
III. Summary of the invention
1. Objective: The purpose of the invention is to provide an efficient and accurate method for constructing a topic-focused intelligent web crawler based on ontology learning.
2. Technical solution: To achieve the above objective, the method of the invention comprises the following steps:
(1) parse the current Web page;
(2) preprocess the text of the current page to obtain word-level information;
(3) convert the word-level information into ontology information through the ontology management system;
(4) calculate the topic relevance of the page from the ontology information obtained, in combination with the ontology library;
(5) if the topic relevance of the current page is higher than the upper threshold, update the page content information set; if it is lower than the upper threshold, update the page content information set and then go to step (9);
(6) update the interest ratio of each ontology class in the ontology library from the updated page content information set;
(7) learn the ontology through the evolution of the interest ratios, modifying the weights of the ontology classes;
(8) optimize the ontology learning algorithm with propagation algorithms;
(9) if the topic relevance of the current page is greater than the lower threshold, extract in order the URLs pointed to by all outgoing links of the current Web page; otherwise go to step (11);
(10) if the URL a link points to has already been visited, extract the next link; if the URL has not been visited, insert it into the priority queue of URLs to visit according to the topic relevance of the page containing the link;
(11) take the first URL, i.e. the one with the highest priority, from the priority queue of URLs to visit and visit it;
(12) repeat steps (1) to (11) until no new URL satisfying the conditions appears.
The ontology library in step (4) is built by collecting existing general-purpose ontology libraries and processing them to establish an ontology library suited to this method, as follows:
(4.1) extract the classes from the existing ontology libraries;
(4.2) extract the hierarchical and logical relations among the classes in the existing ontology libraries;
(4.3) take the classes as nodes and the hierarchical and logical relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;
(4.4) for each class in each ontology library, build the keyword set corresponding to that class, forming the lexical layer of the ontology library.
The complete ontology-learning-based topic-focused intelligent crawler system constructed by this method can be divided into two cyclic subsystems. The first is the ontology-based web crawler, formed by the crawler and queue maintenance module, the preprocessing and topic relevance evaluation module, and the ontology management system module; this loop implements the basic functions of an ontology-based crawler. The second is the ontology evolution system, formed by the preprocessing and topic relevance evaluation module, the ontology management system module, and the ontology learning module; this loop implements the evolution and modification of the ontology library.
3. Beneficial effects: Compared with the prior art, the notable advantage of the method is that it gives the crawler system the ability to learn on the basis of semantic understanding, which improves its accuracy and efficiency and makes the system highly robust.
IV. Description of the drawings
Fig. 1 is a structural diagram of the system of the invention;
Fig. 2 is a workflow diagram of the invention;
Fig. 3 is a schematic diagram of the ontology library structure.
V. Embodiments
As shown in Fig. 1, the ontology-learning-based intelligent topic-focused crawler system built with the construction method of this embodiment comprises four modules: the crawler and queue maintenance module, the preprocessing and topic relevance evaluation module, the ontology management system module, and the ontology learning module. The preprocessing and topic relevance evaluation module in turn contains a preprocessing submodule and a relevance calculation submodule.
The flow of the method is shown in Fig. 2 and is described in detail below.
The method comprises the following steps:
Step (1): Parse the HTML file of the current Web page and isolate its body text.
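A minimal sketch of step (1), assuming Python's standard-library `html.parser` is an acceptable stand-in for whatever parser the system actually uses. The names `PageParser` and `parse_page` are hypothetical and do not come from the patent. Besides the body text, the sketch also records outgoing links, since step (9) needs them later.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and outgoing link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # visible text fragments, in document order
        self.links = []        # href values of <a> tags, in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Return (body_text, outgoing_links) for an HTML string."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

Relative links are returned unresolved here; a real crawler would resolve them against the page URL before queueing them.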
Step (2): Preprocess the extracted text; typically, count the number of times N(w_i) that each keyword w_i in the system's preset keyword list occurs in the current document.
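The counting in step (2) can be sketched as follows. This is an illustration under stated assumptions, not the patent's preprocessing: tokens are lower-cased runs of ASCII letters and digits, keywords are single words, and the function name `count_keywords` is hypothetical.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Return {keyword: N(w_i)} for every keyword in the preset list."""
    # Assumed tokenization: lower-cased alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for absent keys, so missing keywords map to 0.
    return {word: counts[word.lower()] for word in keywords}
```

Multi-word keywords or non-English text would need a different tokenizer, such as a word segmenter for Chinese.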
Step (3): Using the keyword set corresponding to each ontology class in the ontology library, calculate the class frequency of each ontology class in the current document.
The structure of the ontology is illustrated in Fig. 3, which shows part of the structure of an ontology library whose core concept is music. The ontology graph contains a series of conceptual abstractions of real-world things, such as "music" and "person"; these concepts constitute the classes of the ontology management system. The graph also contains logical relations between classes, such as "to play", and hierarchical relations, such as that between "music" and "jazz"; together these constitute the relation set of the ontology management system. Besides the classes and relations shown in the figure, the ontology management system also manages a lexical layer below the ontology layer. Each class or relation in the ontology layer corresponds to a set of text words in the lexical layer; for the class "music", for example, the word set contains "song", "melody" and "music".
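One possible in-memory representation of the two layers described above is sketched below. The `Ontology` class and its method names are hypothetical. The example fragment follows the text's description of Fig. 3, but the endpoints chosen for the "to play" edge and the label "is-a" are assumptions, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Ontology layer (classes and relations) with a lexical layer underneath."""
    classes: set = field(default_factory=set)
    relations: list = field(default_factory=list)   # (source, label, target) edges
    lexicon: dict = field(default_factory=dict)     # class name -> set of words

    def add_class(self, name, words=()):
        self.classes.add(name)
        self.lexicon.setdefault(name, set()).update(words)

    def add_relation(self, source, label, target):
        for name in (source, target):
            if name not in self.classes:
                raise KeyError(f"unknown class: {name}")
        self.relations.append((source, label, target))

    def neighbors(self, name):
        """Classes one edge away from `name`, ignoring edge direction."""
        out = set()
        for source, _, target in self.relations:
            if source == name:
                out.add(target)
            elif target == name:
                out.add(source)
        return out


# Fragment loosely modelled on the Fig. 3 description; edge endpoints are assumed.
onto = Ontology()
onto.add_class("music", {"song", "melody", "music"})
onto.add_class("jazz", {"jazz"})
onto.add_class("person", {"person"})
onto.add_relation("jazz", "is-a", "music")        # hierarchical relation
onto.add_relation("person", "to play", "music")   # logical relation
```

Storing edges as labelled triples keeps hierarchical and logical relations in one list, which matches step (4.3), where both kinds of relation become edges between class nodes.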
From the keyword set corresponding to each ontology class in the ontology library, the class frequency $f_{c_k}^{D}$ of an ontology class in the current document can be calculated with the following formula:

$$f_{c_k}^{D} = N_{w_1^k}^{D} + N_{w_2^k}^{D} + \cdots + N_{w_i^k}^{D}$$

where $f_{c_k}^{D}$ is the frequency with which class $c_k$ occurs in document $D$; class $c_k$ corresponds to the text word set $w_1^k, w_2^k, \ldots, w_i^k$; and word $w_i^k$ occurs $N_{w_i^k}^{D}$ times in document $D$.
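The class-frequency formula is a plain sum over the words mapped to a class, which can be sketched as below. The function names are hypothetical, and `keyword_counts` is assumed to be a dictionary of per-word occurrence counts such as step (2) produces.

```python
def class_frequency(keyword_counts, class_words):
    """Class frequency: sum of the counts of every word mapped to the class."""
    return sum(keyword_counts.get(word, 0) for word in class_words)


def all_class_frequencies(keyword_counts, lexicon):
    """Class frequency for every class in a {class: words} lexicon."""
    return {c: class_frequency(keyword_counts, words)
            for c, words in lexicon.items()}
```

For example, with counts of 2 for "song", 1 for "melody" and 3 for "music", the class "music" with the word set given in the text has a class frequency of 6.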
Step (4): From the ontology class frequencies, calculate the topic relevance $r_D$ of the current page in combination with the topic ontology library.
Once the number of occurrences in the document of each class in the ontology library has been calculated, a mapping from a given Web document to its topic relevance can be obtained. Realizing this mapping uses the structural relations among the elements of the ontology graph: each class in the ontology is scored for its relevance to the topic using the class frequencies computed earlier, and the per-class scores are then combined to give the relevance of the whole page to the topic.
The topic relevance $r_D$ of page $D$ is calculated as follows:

$$r_D = \sum_{c_k \in D} \left( f_{c_k}^{D} \times w_{c_k} \right)$$

where $f_{c_k}^{D}$ is the class frequency of class $c_k$ in page $D$ and $w_{c_k}$ is the weight of class $c_k$. Before the web crawler runs, the weight of each class usually needs to be given an initial value $W_{c_k}^{0}$:

$$W_{c_k}^{0} = 1.00 \times n^{d(c_k, T)}$$

where $n$ is the discount factor and $d(c_k, T)$ is the distance from the class to the core class of the ontology library, i.e. the topic class $T$ of this topic-focused crawler system.
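The initial weighting and the relevance sum can be sketched as follows. Several points are assumptions rather than statements from the patent: the distance d(c_k, T) is taken as the number of edges on the shortest path in the ontology graph, the discount factor is read as a base raised to that distance, and the graph is passed as a hypothetical adjacency dictionary. Classes unreachable from the topic class receive no weight and so contribute nothing.

```python
from collections import deque


def distances_from_topic(adjacency, topic):
    """Breadth-first search: edge-count distance from the topic class."""
    dist = {topic: 0}
    queue = deque([topic])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def initial_weights(adjacency, topic, discount):
    """Initial weight per class: 1.00 * discount ** distance to the topic."""
    return {c: 1.00 * discount ** d
            for c, d in distances_from_topic(adjacency, topic).items()}


def page_relevance(class_freqs, weights):
    """Topic relevance: sum over classes of class frequency times weight."""
    return sum(f * weights.get(c, 0.0) for c, f in class_freqs.items())
```

With a discount factor below 1, classes far from the topic class start with small weights, so their occurrences add little to a page's relevance until learning raises them.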
Step (5): If the topic relevance of the current page is higher than the upper threshold, the page is considered to be centered on the topic, and the page content information set K is updated accordingly. If the topic relevance is lower than the upper threshold, the page content may be related to the topic but is not centered on it; the system updates the page content information set and then goes to step (9).
The page content information set K contains the following variables:
$N_c$: the total number of pages the crawler has processed;
$N_t$: the number of processed pages that belong to the target topic;
$n_i$: the number of processed pages in which class i of the ontology library occurs;
$n_i^t$: the number of processed pages that belong to the target topic and in which class i of the ontology library occurs.
Here $c_i$ denotes the event that ontology class i occurs in a page, and $t$ denotes the event that the page satisfies the topic requirement.
Once the above information has been collected, several parameters important to ontology learning can be calculated with the following formulas:

$$P(t) = N_t / N_c$$

$$P(t \mid c_i) = P(t \cap c_i) / P(c_i) = n_i^t / n_i$$
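The page content information set K amounts to four counters, sketched below with hypothetical names. The sketch assumes that "belongs to the target topic" means the page's relevance exceeded the upper threshold of step (5), and that a class is counted at most once per page. Returning 0.0 when a denominator is zero is a choice made here, not something the patent specifies.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ContentInfo:
    """Running counts that make up the page content information set K."""
    n_pages: int = 0                                             # N_c
    n_topic_pages: int = 0                                       # N_t
    class_pages: Counter = field(default_factory=Counter)        # n_i
    class_topic_pages: Counter = field(default_factory=Counter)  # n_i^t

    def update(self, classes_on_page, on_topic):
        """Record one processed page and the ontology classes found on it."""
        self.n_pages += 1
        if on_topic:
            self.n_topic_pages += 1
        for c in set(classes_on_page):   # count each class once per page
            self.class_pages[c] += 1
            if on_topic:
                self.class_topic_pages[c] += 1

    def p_topic(self):
        """P(t) = N_t / N_c, or 0.0 before any page is recorded."""
        return self.n_topic_pages / self.n_pages if self.n_pages else 0.0

    def p_topic_given_class(self, c):
        """P(t | c_i) = n_i^t / n_i, or 0.0 if the class was never seen."""
        seen = self.class_pages[c]
        return self.class_topic_pages[c] / seen if seen else 0.0
```

Because both branches of step (5) update K, `update` is called for every processed page, with `on_topic` recording which branch was taken.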
Step (6): Update the interest ratio of each ontology class in the ontology library from the updated page content information set.
With the above variables, the interest ratio of each ontology class with respect to topic $t$ can be calculated. Let the interest ratio of ontology class $c_i$ be $Quo(t, c_i)$; then:

$$Quo(t, c_i) = P(t \cap c_i) / \left( P(t) \cdot P(c_i) \right)$$
Step (7): Learn the ontology through the evolution of the interest ratios, modifying the weights of the ontology classes:

$$W_{c_i}^{k} = Quo(t, c_i) \cdot W_{c_i}^{k-1}$$

When ontology class $c_i$ is closely related to the topic, its interest ratio $Quo(t, c_i)$ is greater than 1, so the weight of $c_i$ keeps increasing while the crawler runs; conversely, when $c_i$ is unrelated to the topic, its interest ratio is less than 1, so its weight keeps decreasing.
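Steps (6) and (7) can be sketched together. The probabilities are estimated from the counts in K, taking P(c_i) = n_i / N_c and P(t ∩ c_i) = n_i^t / N_c; these estimates are implied by the definitions but not written out in the text, so they are assumptions. The function names are hypothetical, and returning a neutral ratio of 1.0 when there is no evidence is a choice made here.

```python
def interest_ratio(n_pages, n_topic_pages, n_class_pages, n_class_topic_pages):
    """Interest ratio: P(t and c_i) / (P(t) * P(c_i)), estimated from page counts."""
    if n_pages == 0 or n_topic_pages == 0 or n_class_pages == 0:
        return 1.0  # assumption: no evidence leaves the weight unchanged
    p_t = n_topic_pages / n_pages
    p_c = n_class_pages / n_pages
    p_tc = n_class_topic_pages / n_pages
    return p_tc / (p_t * p_c)


def update_weights(weights, ratios):
    """One learning cycle: multiply each class weight by its interest ratio."""
    return {c: ratios.get(c, 1.0) * w for c, w in weights.items()}
```

As a check on the direction of the update: if 50 of 100 pages are on topic and all 20 pages containing a class are on topic, the ratio is 2.0 and the weight doubles; if only 5 of those 20 are on topic, the ratio is 0.5 and the weight halves.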
Step (8): Optimize the ontology learning algorithm with propagation algorithms.
In the ontology-learning-based topic-focused intelligent crawler system model built according to the method of the invention, two different propagation methods are used to improve the efficiency of ontology learning. The first is the "bellman" (bell-ringer) algorithm, whose design is based on the close semantic relation between adjacent ontology classes in the ontology graph. In the ontology graph of Fig. 3, if the weight of the class "band" changes in the latest learning cycle, then "band" should act like a bell-ringer and sound the bell to notify its adjacent classes, for example "rock band" and "group", so that they make a similar change; the size of that change should be inversely proportional to the distance between the class and the "bellman" class. The second algorithm for improving ontology learning efficiency is called the "mark patroller" algorithm. Its design is based on the close logical relation among the ontology classes that lie on the path from a topic-related class to the topic class in the ontology graph. Taking the ontology in Fig. 3 again as an example, after the class "band" changes, the classes "group" and "person", which lie on the path from "band" to the topic class "music", should change similarly, and the magnitude of their change should be inversely proportional to their distance from the "mark patroller" class.
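The text gives only the principle of the two propagation methods, so the following is one possible reading rather than the patented algorithm. It assumes the weight change is additive, that "inversely proportional to the distance" means dividing the change by the edge-count distance, that the first method reaches classes within `max_hops` edges, and that the second method updates only the classes strictly between the changed class and the topic class. All names are hypothetical.

```python
from collections import deque


def bfs_parents(adjacency, start):
    """Breadth-first search; returns (distance, parent) maps from start."""
    dist, parent = {start: 0}, {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parent[nxt] = node
                queue.append(nxt)
    return dist, parent


def propagate_to_neighbors(weights, adjacency, source, delta, max_hops=1):
    """First method: nearby classes receive delta scaled by 1 / distance."""
    dist, _ = bfs_parents(adjacency, source)
    updated = dict(weights)
    for c, d in dist.items():
        if 1 <= d <= max_hops:
            updated[c] = updated.get(c, 0.0) + delta / d
    return updated


def propagate_along_path(weights, adjacency, source, topic, delta):
    """Second method: classes on the path to the topic receive delta / distance."""
    dist, parent = bfs_parents(adjacency, source)
    updated = dict(weights)
    if topic not in dist:
        return updated          # no path from source to topic: nothing to update
    node = parent[topic]        # walk back from the topic towards the source
    while node is not None and node != source:
        updated[node] = updated.get(node, 0.0) + delta / dist[node]
        node = parent[node]
    return updated
```

Both functions return a new dictionary and leave the input weights untouched, so the two methods can be tried independently on the same starting weights.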
Step (9): Compare the topic relevance of the current page with the lower threshold. If it is greater than the lower threshold, the page satisfies the topic relevance requirement and its outgoing links need to be extracted; if it is less than the lower threshold, the page does not satisfy the requirement and the system goes to step (11).
Step (10): Process the link information in the current page.
Process all the links in the current page in order. If the URL a link points to has already been visited, extract the next link; if it has not been visited, insert it into the priority queue of URLs to visit according to the topic relevance of the page containing the link.
Step (11): Take the URL at the head of the priority queue of URLs to visit, i.e. the one with the highest priority, and visit it.
Step (12): Repeat steps (1) to (11) until no new URL satisfying the conditions appears or a given limit is reached.
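The crawl loop formed by steps (1), (4) and (9) to (12) can be sketched with a heap-based priority queue. The learning steps (5) to (8) are left out to keep the sketch short. `fetch` and `score` are hypothetical callables supplied by the caller, `max_pages` stands in for the "given limit", and the `seen` set also skips URLs that are already queued, which is slightly stricter than the "already visited" test in step (10).

```python
import heapq
from itertools import count


def crawl(seed_urls, fetch, score, low, max_pages=100):
    """Best-first crawl.

    fetch(url) -> (text, links); score(text) -> topic relevance.
    Links inherit the relevance of the page that contains them.
    """
    frontier = []                 # heap of (-priority, tie_breaker, url)
    tie = count()                 # keeps insertion order among equal priorities
    seen = set(seed_urls)         # URLs already queued or visited
    for url in seed_urls:
        heapq.heappush(frontier, (0.0, next(tie), url))
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)       # step (11): highest priority
        text, links = fetch(url)                  # step (1)
        visited.append(url)
        relevance = score(text)                   # step (4)
        if relevance > low:                       # step (9)
            for link in links:                    # step (10)
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-relevance, next(tie), link))
    return visited
```

`heapq` is a min-heap, so priorities are stored negated to pop the most relevant entry first; the counter avoids comparing URLs when two entries have equal priority.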

Claims (2)

1. A method for constructing an intelligent topic-focused web crawler system based on ontology learning, comprising the step of
(1) parsing the current Web page, characterized in that the method further comprises the following steps:
(2) preprocessing the text of the current page to obtain word-level information;
(3) converting the word-level information into ontology information through the ontology management system;
(4) calculating the topic relevance of the page from the ontology information obtained, in combination with the ontology library;
(5) if the topic relevance of the current page is higher than the upper threshold, updating the page content information set; if the topic relevance of the current page is lower than the upper threshold, updating the page content information set and then executing step (9);
(6) updating the interest ratio of each ontology class in the ontology library from the updated page content information set;
(7) learning the ontology through the evolution of the interest ratios, modifying the weights of the ontology classes;
(8) optimizing the ontology learning algorithm with propagation algorithms;
(9) if the topic relevance of the current page is greater than the lower threshold, extracting in order the URLs pointed to by all outgoing links of the current Web page; otherwise executing step (11);
(10) if the URL a link points to has already been visited, extracting the next link; if the URL has not been visited, inserting it into the priority queue of URLs to visit according to the topic relevance of the page containing the link;
(11) taking the first URL, i.e. the one with the highest priority, from the priority queue of URLs to visit and visiting it;
(12) repeating steps (1) to (11) until no new URL satisfying the conditions appears.
2. The method for constructing an intelligent topic-focused web crawler system based on ontology learning as claimed in claim 1, characterized in that the ontology library in step (4) is built by collecting existing general-purpose ontology libraries and processing them to establish an ontology library suited to this method, as follows:
(4.1) extracting the classes from the existing ontology libraries;
(4.2) extracting the hierarchical and logical relations among the classes in the existing ontology libraries;
(4.3) taking the classes as nodes and the hierarchical and logical relations as directed edges connecting the nodes, forming the basic structure of the ontology layer of the ontology library;
(4.4) for each class in each ontology library, building the keyword set corresponding to that class, forming the lexical layer of the ontology library.
CNB2006100407437A 2006-05-30 2006-05-30 Body learning based intelligent subject-type network reptile system configuration method Expired - Fee Related CN100401301C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100407437A CN100401301C (en) 2006-05-30 2006-05-30 Body learning based intelligent subject-type network reptile system configuration method

Publications (2)

Publication Number Publication Date
CN1851706A (en) 2006-10-25
CN100401301C (en) 2008-07-09

Family

ID=37133185

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100407437A Expired - Fee Related CN100401301C (en) 2006-05-30 2006-05-30 Body learning based intelligent subject-type network reptile system configuration method

Country Status (1)

Country Link
CN (1) CN100401301C (en)

Also Published As

Publication number Publication date
CN1851706A (en) 2006-10-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080709

Termination date: 20100530