CN103310013A

CN103310013A - Subject-oriented web page collection system

Info

Publication number: CN103310013A
Application number: CN2013102751157A
Authority: CN
Inventors: 王宝会; 于雷; 王丽华; 王新河; 尹科
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2013-09-18

Abstract

The invention relates to a subject-oriented web page collection system, which belongs to the field of network communication and is used for collecting subject-oriented network information. The system comprises a sample training module, a strategy search module and a collecting module. The sample training module analyzes a manually built web page sample library to obtain the subject feature vectors and values and a page similarity threshold. The strategy search module controls the set of URL (Uniform Resource Locator) addresses to be retrieved and restricts the search range to candidate seed websites. The collecting module receives the URL address set sent by the strategy search module and carries out page purification, feature extraction, analysis, collection and storage; during feature analysis, the subject feature vectors and values and the similarity threshold used to judge a subject web page are filled in manually with reference to the results of the sample training module. The system has higher efficiency and stronger page adaptability and effectively solves the problems in the prior art.

Description

A subject-oriented web page collection system

Technical field

The present invention relates to a subject-oriented web page collection system. It belongs to the field of network communication and is used for collecting topic-specific network information.

Background technology

With the rapid growth of Web information resources, traditional information search systems cannot keep their information up to date, and because the subject area they collect is too broad, they cannot satisfy the growing demand for customized information retrieval services. In recent years researchers have proposed several directions for next-generation search engines, among which topic search is particularly prominent. Compared with a general search engine, a topic search engine has a smaller search range, so precision and recall are easier to guarantee. During search it does not need to traverse the whole Web; it only needs to visit pages relevant to the topic, which largely avoids the index explosion that affects conventional information collection systems.

Existing topic crawlers have the following problems: (1) when collecting topic pages, it is difficult to judge accurately whether a page on the target website belongs to the topic, so a large number of off-topic pages are easily collected; (2) the advantage of a topic crawler is that it does not need to traverse all pages and only visits pages relevant to the topic, but in the selection process it is very difficult to define which pages are relevant to the topic.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a subject-oriented web page collection system that has higher efficiency and stronger page adaptability and effectively solves the problems of the prior art.

Technical solution of the invention: a subject-oriented web page collection system, characterized in that it comprises a sample training module, a decision search module and an acquisition module;

The sample training module analyzes a manually built web page sample library to obtain the topic feature vector and values and the page similarity threshold;

The decision search module controls the set of URL addresses (a URL is a web page address) to be retrieved and restricts the search range to the candidate seed websites;

The acquisition module receives the URL address set sent by the decision search module and carries out page purification, feature extraction, analysis, collection and storage; during feature analysis, the topic feature vector and values and the similarity threshold used to judge a topic web page are filled in manually with reference to the results of the sample training module;

The decision search module is implemented as follows:

(11) First construct two buffer pools, the topic buffer pool positivePool and the non-topic buffer pool negtivePool, to hold URL entities, i.e. the URL addresses in the URL address set; both pools are initialized to the empty set. The topic buffer pool holds URL addresses relevant to the collection topic, and the non-topic buffer pool holds URL addresses irrelevant to the collection topic. The buffer pools organize the URL addresses so that they can be used quickly when pages are collected, and the split into topic and non-topic classes serves to form the topic URL address set. Each buffer pool is an encapsulated queue, a common technique in the computer field;

(12) Manually select the seed websites Seeds, which form the initial set (a URL address set) of the search program Spider;

(13) Run the Spider search on the initial set of manually chosen seed websites, applying the fixed-point strategy and the buffer pool strategy, and record the corresponding page addresses; all page addresses recorded by this module are finally handed to the acquisition module for page collection. The fixed-point strategy means that the search is performed only within the manually selected websites. The buffer pool strategy means that collected page addresses are put into the buffer pools, which speeds up duplicate checking during collection;

The acquisition module is implemented as follows: for each page address Raw URL_i, i ∈ N, i ≤ n, do the following:

(21) Preprocess the page to obtain page P_i;

(22) If positivePool is empty, skip the search; otherwise, if P_i matches an entry in it, the page is considered a topic page, processing of P_i ends, and the procedure returns to (21) to process P_{i+1};

(23) If negtivePool is empty, skip the search; otherwise, if P_i matches an entry in it, the page is considered not to be a topic page, processing of P_i ends, and the procedure returns to (21) to process P_{i+1};

(24) Build the HVSM feature vector V of page P_i according to formula (6);

$$V = (k_t\omega_1, \ldots, k_t\omega_i, k_s\omega_1', \ldots, k_s\omega_j') = (\omega_1', \omega_2', \ldots, \omega_n') \qquad (6)$$

In the formula, n = i + j; ω_i and ω_j' are the weights of the feature words t_i and s_j contained in V₁ and V₂ respectively, and n is the total dimension of the feature space. k_t and k_s are the weights the system assigns to the content feature vector V₁ and the structural feature vector V₂ (they can be set manually); s_j denotes a structural characteristic parameter and t_i denotes a term.

(25) Compare P_i with the training sample pages according to formula (7) to obtain the similarity L;

$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega_{ik}' \times \omega_{jk}')}{\sqrt{\left(\sum_{k=1}^{n} \omega_{ik}'^{2}\right) \cdot \left(\sum_{k=1}^{n} \omega_{jk}'^{2}\right)}} \qquad (7)$$

In the formula, sim denotes the similarity function (the cosine of the angle between the two vectors); V_i and V_j are the HVSM feature vectors being compared, each built from a content feature vector V₁ and a structural feature vector V₂; ω_{ik}' and ω_{jk}' are their k-th components, i.e. the weights of the feature words t_i and s_j after scaling by k_t and k_s in formula (6); s_j denotes a structural characteristic parameter and t_i denotes a term.

(26) If the similarity L is greater than the set threshold ε, the page is considered a topic page: store it in the web page database and put Raw URL_i into positivePool; otherwise put Raw URL_i into negtivePool.
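Steps (21) to (26) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. The text only says that a page "matches" a pool entry, so the matching rule used here (same host and same directory, via the hypothetical helper `url_key`) is an assumption, as is the caller-supplied `similarity` function that stands in for preprocessing, feature extraction and formula (7).

```python
from collections import deque
from urllib.parse import urlsplit


def url_key(url):
    """Hypothetical match key: host plus the directory part of the path."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rsplit("/", 1)[0]


def classify_urls(urls, similarity, epsilon):
    """Label each URL as on-topic (True) or off-topic (False).

    `similarity` is a caller-supplied function url -> L that stands in for
    steps (21), (24) and (25); it is only called when neither pool matches.
    """
    positive_pool = deque()  # positivePool in the text
    negative_pool = deque()  # negtivePool in the text
    labels = {}
    for url in urls:
        key = url_key(url)
        if any(url_key(u) == key for u in positive_pool):    # step (22)
            labels[url] = True
        elif any(url_key(u) == key for u in negative_pool):  # step (23)
            labels[url] = False
        elif similarity(url) > epsilon:                      # step (26)
            positive_pool.append(url)
            labels[url] = True
        else:
            negative_pool.append(url)
            labels[url] = False
    return labels
```

Under this assumed matching rule, a page whose URL shares a directory with an already classified page is labeled without computing its feature vector, which is the shortcut the two pools are meant to provide.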

Compared with the prior art, the invention has the following advantage: the topic vector and values and the similarity threshold are obtained by training on target samples and are then used in the calculations during collection. This gives the topic search algorithm and the search engine design of the system higher efficiency and stronger page adaptability and effectively solves the problems mentioned above; the results obtained in testing were satisfactory.

Description of drawings

Fig. 1 is the system architecture diagram of the present invention.

Embodiment

As shown in Fig. 1, the system of the invention consists of three modules: the sample training module, the decision search module and the acquisition module. The sample training module analyzes a manually built web page sample library to obtain the topic feature vector and values and computes the page similarity threshold. The decision search module controls the URL address set retrieved by the system and restricts the search range to the candidate seed websites. The acquisition module receives the URL addresses sent by the decision search module and carries out page purification, feature extraction, analysis, collection and storage.

The specific functions and interactions of the main modules are described below.

1. Decision search module

The decision search module is designed around the fact that information search on the Internet is a hyperlink-based technology. Hyperlinks generally fall into three categories; if every link encountered is followed, the Web spider covers the network quickly and widely, but such unbounded search is very likely to cause topic drift. Research shows that for a given topic (such as an industry or field) the authoritative websites are fairly well defined, and pages on similar topics often have highly similar URLs. Based on this, the system adopts two search strategies: (1) the fixed-point strategy, i.e. authoritative industry websites are chosen as the initial set, and only the internal links of these candidate seed websites (Seed URLs) are processed; (2) the buffer pool strategy, i.e. pages are classified and organized according to the similarity of their URLs.

The decision search module works as follows:

(1) Manually specify the set of seed nodes to form the target website library;

(2) Process the internal links of the seed websites (i.e. extract the URL addresses within each site);

(3) Set up the address buffer pools;

(4) Classify and organize the pages according to the similarity of their URLs;

(5) Push the URL address set to the acquisition module.
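The fixed-point strategy in step (2) can be illustrated with a short sketch that keeps only links staying inside the seed websites. It uses Python's standard `html.parser` and `urllib.parse`; the names `LinkParser`, `internal_links` and `seed_hosts` are hypothetical, and comparing host names exactly is an assumption about what counts as an "internal" link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html, seed_hosts):
    """Return the absolute, de-duplicated links that stay on a seed host."""
    parser = LinkParser()
    parser.feed(html)
    seen, result = set(), []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        on_seed = parts.netloc.lower() in seed_hosts
        if parts.scheme in ("http", "https") and on_seed and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Links that leave the seed hosts are dropped before they reach the buffer pools, which is how the strategy limits topic drift.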

2. Acquisition module

The acquisition module receives the URL addresses sent by the decision search module and carries out page purification, feature extraction, analysis, collection and storage.

The acquisition module works as follows:

(1) Receive the URL address set pushed by the decision search module;

(2) Preprocess the pages to be collected;

Web pages often contain various content unrelated to the topic. This content is noise that increases page complexity, so pages must first be purified. For the noise problem, heuristic rules from the literature define a noise region as a template: for a set of pages that share the same template, content that appears repeatedly is treated as noise. However, because HTML syntax is not strictly enforced, the source of many pages has missing or mismatched tags. The target page therefore first needs to be parsed, and an HTML parser is used to normalize the source file (HTML, the Hypertext Markup Language, is a markup language for describing web documents). In addition, to simplify the page's DOM tree (DOM, the Document Object Model, is the standard programming interface recommended by the W3C for processing extensible markup language documents), nodes such as script (an HTML tag), comment (HTML comments) and style (HTML styles) are ignored when the DOM tree is built. The system uses Tidy, a packaged library provided by the W3C that can convert HTML to XML, to repair the source file.
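As a rough illustration of ignoring script, comment and style nodes, the sketch below extracts only the visible text of a page. It substitutes Python's tolerant standard-library `html.parser` for Tidy, so it is not the tool chain the text describes, and the names `TextExtractor` and `purify` are hypothetical. Template-based noise removal is not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; skip <script> and <style> content and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped implicitly.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def purify(html):
    """Return the page text with script, style and comment nodes removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)
```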

(3) Select the features of the topic page and extract the feature vector;

Page features are represented with the vector space model (VSM). In the VSM, a document d is mapped to a feature vector V(d):

$$V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d)) \qquad (1)$$

In the formula, t_i (i = 1, 2, ..., n) are mutually distinct terms (a term, or entry, is a word or character combination segmented from the page according to its structure; the word is in general use in the web collection field); ω_i(d) is the weight of t_i in d, generally defined as a function of the frequency with which t_i occurs in d.
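A minimal sketch of formula (1) follows. The text leaves the weighting function open, so two choices here are assumptions: terms are obtained by splitting on whitespace, and ω_i(d) is taken to be the relative frequency of t_i in d. The function name `vsm_vector` is hypothetical.

```python
from collections import Counter


def vsm_vector(text, terms):
    """Return [(t_1, w_1(d)), ..., (t_n, w_n(d))] for the given distinct terms.

    Assumption: w_i(d) is the relative frequency of t_i among the
    whitespace-separated tokens of d.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [(t, counts[t] / total) for t in terms]
```

For example, `vsm_vector("steel price steel market", ["steel", "price", "coal"])` gives the weights 0.5, 0.25 and 0.0 for the three terms.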

The HTML tags of a page reflect the semantics and structure of the content they enclose and can guide the extraction of topic feature words. Research results indicate that the 12 index sources of a page, such as the page title, article title and chapter title, differ in how strongly they express the page's topic and can be ranked accordingly. In addition, the term frequency of a text is correlated with the size of the page. Considering that a Web page is a semi-structured document, the system improves on the traditional VSM and uses a hybrid vector space model (HVSM). The topic feature vector of a page consists of two parts, the content feature vector V₁ and the structural feature vector V₂, which are given different weights.

1) Content feature vector V₁

Terms that occur at different positions of the page are given different weights. The page is divided into four parts: title (<title>, <head>) (B1), keywords (<font>, <strong>, <b>, <big>, <i>, <u>) (B2), link anchor text (<a>) (B3), and other parts (B4). The first three parts can be distinguished by HTML tags. If term t occurs N_i times in part B_i, its weight ω(d) is:

$$\omega(d) = N_1 W_{B1} + \frac{N_2 W_{B2} + N_3 W_{B3} + N_4 W_{B4}}{S} \qquad (2)$$

In the formula, W_{Bi} denotes the importance of the position B_i in which the term occurs, and S denotes the page size. The values of W_{Bi} can be adjusted as needed. The content feature vector is then expressed as:

$$V_1 = (t_1, \omega_1; \ldots; t_i, \omega_i; \ldots; t_n, \omega_n) \qquad (3)$$

In the formula, t_i (i = 1, ..., n) are the terms in the page and ω_i (i = 1, ..., n) are the weights of the corresponding terms.
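Formula (2) can be evaluated directly, as in the sketch below. The function name `content_weight` and the numeric position weights in the example are illustrative assumptions; the text does not give concrete values for W_B1 to W_B4 or say how the page size S is measured.

```python
def content_weight(counts, position_weights, page_size):
    """Weight of one term per formula (2).

    counts           -- (N1, N2, N3, N4): occurrences in parts B1..B4
    position_weights -- (W_B1, W_B2, W_B3, W_B4)
    page_size        -- S, the page size (unit is an assumption of the caller)
    """
    n1, n2, n3, n4 = counts
    w1, w2, w3, w4 = position_weights
    return n1 * w1 + (n2 * w2 + n3 * w3 + n4 * w4) / page_size
```

With counts (1, 2, 0, 5), assumed position weights (4.0, 2.0, 1.5, 1.0) and S = 100, the weight is 4.0 + (4.0 + 0 + 5.0) / 100 = 4.09. As the formula is written, only the B2 to B4 terms are normalized by page size, so a title occurrence dominates.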

2) Structural feature vector V₂

The meaning of t_i is extended to form the structural characteristic parameters of the page. These comprise two kinds of elements: (1) structural elements such as table, tr, td (or th); (2) domain knowledge keywords. The structural feature vector V₂ is expressed as:

$$V_2 = (s_1, \omega(s_1); \ldots; s_i, \omega(s_i); \ldots; s_n, \omega(s_n)) \qquad (4)$$

In the formula, s_i (i = 1, ..., n) are the structural characteristic parameters of the page and ω(s_i) is the corresponding weight function in page d. The definition of ω depends on the specific element type in the page.

3) HVSM feature vector V

The HVSM (hybrid vector space model) describes a page in terms of both its content and its structure and so reflects content features and structural features together. Describing page features with the HVSM retains as much of the page's content and structure information as possible. Suppose the topic feature vector of a page consists of two parts, the content block V₁ and the structure block V₂. V₁ contains the set of text feature words {t₁, ..., t_i} that reflect the page content, and V₂ contains the set of feature words {s₁, ..., s_j} that reflect the page structure; the corresponding topic feature word set is T = V₁ ∪ V₂. The HVSM feature vector V of the page can then be expressed as:

$$V = (\omega_1, \ldots, \omega_i, \omega_1', \ldots, \omega_j') = (\omega_1, \omega_2, \ldots, \omega_n) \qquad (5)$$

In the formula, n = i + j; ω_i and ω_j' are the weights of the feature words t_i and s_j contained in V₁ and V₂ respectively, and n is the total dimension of the feature space. The system can give the content feature vector V₁ and the structural feature vector V₂ different weights k_t and k_s to express how strongly the page text and the page structure influence the page features. Applying these weights to formula (5) gives the feature vector V of the page:

$$V = (k_t\omega_1, \ldots, k_t\omega_i, k_s\omega_1', \ldots, k_s\omega_j') = (\omega_1', \omega_2', \ldots, \omega_n') \qquad (6)$$

Here n = i + j. For different research topics, k_t and k_s can be adjusted as needed. ω_i(d) is the weight of t_i in d, generally defined as a function of the frequency with which t_i occurs in d; t_i (i = 1, 2, ..., n) are mutually distinct terms.
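Formula (6) amounts to scaling the two blocks and concatenating them. The sketch below assumes that the content and structure weights arrive as dictionaries and that fixed orderings `terms` and `params` define the feature space, so that vectors of different pages line up component by component; these names and the dictionary layout are hypothetical.

```python
def hvsm_vector(content, structure, terms, params, k_t, k_s):
    """Build the HVSM vector of formula (6).

    content   -- dict: term t_i -> weight from formula (2)
    structure -- dict: structural parameter s_j -> weight
    terms, params -- fixed orderings shared by all pages (assumption)
    k_t, k_s  -- weights of the content and structure blocks
    """
    content_part = [k_t * content.get(t, 0.0) for t in terms]
    structure_part = [k_s * structure.get(s, 0.0) for s in params]
    return content_part + structure_part  # dimension n = i + j
```

Missing terms or parameters contribute a zero component, which keeps every page vector at the same dimension n.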

(4) After the page features and feature vector have been extracted, judge page similarity;

Once the HVSM feature vector of a page has been determined by formula (6), the cosine similarity measure is used to judge how similar pages are in topic. For any two pages P_i and P_j, compute the cosine of the angle between their HVSM feature vectors V_i and V_j (sim denotes the similarity function); the larger the cosine, the higher the topic similarity of the two pages. The similarity L of P_i and P_j is computed as:

$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega_{ik}' \times \omega_{jk}')}{\sqrt{\left(\sum_{k=1}^{n} \omega_{ik}'^{2}\right) \cdot \left(\sum_{k=1}^{n} \omega_{jk}'^{2}\right)}} \qquad (7)$$

In the formula, ω_{ik}' and ω_{jk}' are the k-th components of V_i and V_j; as before, ω_i(d) is the weight of t_i in d, generally defined as a function of the frequency with which t_i occurs in d, and t_i (i = 1, 2, ..., n) are mutually distinct terms. If the set of HVSM feature vectors of the topic training samples is {V₁, ..., V_n}, the arithmetic mean V_c of this set is used to represent the topic feature vector, namely

$$V_c = \frac{1}{n} \sum_{k=1}^{n} V_k \qquad (8)$$

For an unknown page, once its HVSM feature vector has been determined, compute the similarity between this vector and the topic feature vector V_c using the similarity formula. Set a threshold ε ∈ (0, 1): if the similarity is greater than ε, the page is considered a topic page and is stored in the web page database; otherwise the page is discarded. The threshold ε is an empirical value and needs to be revised according to the learning results on the training samples to achieve the best effect.
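Formulas (7) and (8) and the threshold test can be sketched as below. The function names are hypothetical, vectors are plain lists of equal length, and returning 0.0 for a zero vector is an assumption, since the text does not cover that case.

```python
import math


def cosine_similarity(v_i, v_j):
    """Formula (7): cosine of the angle between two HVSM vectors."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i) * sum(b * b for b in v_j))
    return dot / norm if norm else 0.0  # zero-vector handling is an assumption


def centroid(vectors):
    """Formula (8): arithmetic mean V_c of the training sample vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_topic_page(v, v_c, epsilon):
    """A page is on-topic when its similarity to V_c exceeds epsilon."""
    return cosine_similarity(v, v_c) > epsilon
```

For two training vectors [1, 0] and [0, 1], V_c is [0.5, 0.5], and a page with vector [1, 0] has similarity 0.5 / sqrt(0.5), roughly 0.707, so it passes a threshold of 0.7 but not one of 0.75.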

3. Sample training module

The sample training module works from a manually defined sample library: it extracts the features and feature values of the topic pages in the library and then computes the sample threshold. By extracting features and computing thresholds over a large amount of sample data, it derives the optimal feature vector weight ratio (i.e. the ratio between the content feature vector and the structural feature vector) and the threshold.

(1) Manually select the target sample library (ensure that all data in the library belong to the same topic so that they can be used for training);

(2) Purify the sample library pages;

Web pages often contain various content unrelated to the topic. This content is noise that increases page complexity, so pages must first be purified. For the noise problem, heuristic rules from the literature define a noise region as a template: for a set of pages that share the same template, content that appears repeatedly is treated as noise. However, because HTML syntax is not strictly enforced, the source of many pages has missing or mismatched tags. The target page therefore first needs to be parsed, and an HTML parser is used to normalize the source file. In addition, to simplify the page's DOM tree, nodes such as script, comment and style are ignored when the DOM tree is built. The system uses the Tidy library provided by the W3C to repair the source file.

(3) Extract the features of the sample library pages;

$$V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d)) \qquad (1)$$

In the formula, t_i (i = 1, 2, ..., n) are mutually distinct terms; ω_i(d) is the weight of t_i in d, generally defined as a function of the frequency with which t_i occurs in d.

1) Content feature vector V₁

Terms that occur at different positions of the page are given different weights. The page is divided into four parts: title (<title>, <head>) (B1), keywords (<font>, <strong>, <b>, <big>, <i>, <u>) (B2), link anchor text (<a>) (B3), and other parts (B4). The first three parts can be distinguished by HTML tags. If term t occurs N_i times in part B_i, its weight ω(d) is:

$$\omega(d) = N_1 W_{B1} + \frac{N_2 W_{B2} + N_3 W_{B3} + N_4 W_{B4}}{S} \qquad (2)$$

$$V_1 = (t_1, \omega_1; \ldots; t_i, \omega_i; \ldots; t_n, \omega_n) \qquad (3)$$

In the formula, t_i (i = 1, ..., n) are the text terms in the page and ω_i (i = 1, ..., n) are the weights of the corresponding terms.

2) Structural feature vector V₂

$$V_2 = (s_1, \omega(s_1); \ldots; s_i, \omega(s_i); \ldots; s_n, \omega(s_n)) \qquad (4)$$

In the formula, s_i (i = 1, ..., n) are the structural characteristic parameters of the page and ω(s_i) is the corresponding weight function in page d. The definition of ω depends on the specific element type in the page.

(4) Build the feature vector library;

The extracted page features are stored, i.e. the computed feature vector values are saved to form the feature vector library.

(5) Analyze the feature vector library, obtain the topic feature vector and values by weighted summation and averaging, and obtain the similarity threshold of the sample library;

The feature values extracted from pages of the same topic are analyzed by assigning different weights to the content feature vector and the structural feature vector and comparing the results. The analysis yields the optimal weight allocation, i.e. the allocation of weights that gives the most reasonable distribution of feature values for topic pages. The threshold for comparing similar pages is also computed, to serve as a reference when parameters are set for subsequent topic collection.

Parts of the invention not described in detail belong to techniques well known in the art.

The above is only a part of the embodiments of the invention, but the scope of protection of the invention is not limited to it; any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims

1. A subject-oriented web page collection system, characterized in that it comprises: a sample training module, a decision search module and an acquisition module;

The decision search module controls the URL address set to be retrieved and restricts the search range to the candidate seed websites;

The decision search module is implemented as follows:

(11) First construct two buffer pools, the topic buffer pool positivePool and the non-topic buffer pool negtivePool, to hold URL entities, i.e. the URL addresses in the URL address set; both pools are initialized to the empty set. The topic buffer pool holds URL addresses relevant to the collection topic, and the non-topic buffer pool holds URL addresses irrelevant to the collection topic. The buffer pools organize the URL addresses so that they can be used quickly when pages are collected, and the split into topic and non-topic classes serves to form the topic URL address set. Each buffer pool is an encapsulated queue;

(13) Run the Spider search on the initial set of manually chosen seed websites, applying the fixed-point strategy and the buffer pool strategy, and record the corresponding page addresses; all page addresses recorded by this module are finally handed to the acquisition module for page collection. The fixed-point strategy means that the search is performed only within the manually selected websites. The buffer pool strategy means that collected page addresses are put into the buffer pools.

2. The subject-oriented web page collection system according to claim 1, characterized in that the acquisition module is implemented as follows: for each page address Raw URL_i, i ∈ N, i ≤ n, do the following:

(21) Preprocess the page to obtain page P_i;

(24) Build the HVSM feature vector V of page P_i according to formula (6);

In the formula, n = i + j; ω_i and ω_j' are the weights of the feature words t_i and s_j contained in V₁ and V₂ respectively, n is the total dimension of the feature space, k_t and k_s are the weights assigned to the content feature vector V₁ and the structural feature vector V₂, s_j denotes a structural characteristic parameter, and t_i denotes a term;

$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega_{ik}' \times \omega_{jk}')}{\sqrt{\left(\sum_{k=1}^{n} \omega_{ik}'^{2}\right) \cdot \left(\sum_{k=1}^{n} \omega_{jk}'^{2}\right)}} \qquad (7)$$

In the formula, sim denotes the similarity function (the cosine of the angle between the two vectors); V_i and V_j are the HVSM feature vectors being compared, each built from a content feature vector V₁ and a structural feature vector V₂; ω_{ik}' and ω_{jk}' are their k-th components, i.e. the weights of the feature words t_i and s_j after scaling by k_t and k_s; s_j denotes a structural characteristic parameter and t_i denotes a term;

(26) If the similarity L is greater than the set threshold ε, the page is considered a topic page: store the topic page in the web page database and put Raw URL_i into positivePool; otherwise put Raw URL_i into negtivePool.