CN103310013A - Subject-oriented web page collection system - Google Patents


Info

Publication number
CN103310013A
CN103310013A CN2013102751157A CN201310275115A CN103310013A CN 103310013 A CN103310013 A CN 103310013A CN 2013102751157 A CN2013102751157 A CN 2013102751157A CN 201310275115 A CN201310275115 A CN 201310275115A CN 103310013 A CN103310013 A CN 103310013A
Authority
CN
China
Prior art keywords
page
theme
subject
module
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102751157A
Other languages
Chinese (zh)
Inventor
王宝会
于雷
王丽华
王新河
尹科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN2013102751157A priority Critical patent/CN103310013A/en
Publication of CN103310013A publication Critical patent/CN103310013A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a subject-oriented web page collection system in the field of network communication, used for collecting network information on a given subject. The system comprises a sample training module, a strategy search module and a collecting module. The sample training module analyzes a manually configured web page sample library and computes the subject feature vector and its values, together with a page similarity threshold. The strategy search module controls the set of URL (Uniform Resource Locator) addresses to be retrieved and restricts the search range to candidate seed websites. The collecting module receives the URL address set sent by the strategy search module and performs page purification, feature extraction, analysis, collection and storage; during feature analysis, the subject feature vector values and the similarity threshold used to judge whether a page is a subject page are filled in manually with reference to the results of the sample training module. The system has higher efficiency and stronger page adaptability, and effectively solves the problems in the prior art.

Description

A subject-oriented web page collection system
Technical field
The present invention relates to a subject-oriented web page collection system. It belongs to the field of network communication and is used for collecting network information on a given subject.
Background art
With the rapid growth of Web information resources, traditional information search systems can no longer guarantee that information is updated in time, and because the subject range of the collected information is too broad, they cannot satisfy the growing demand for personalized information retrieval services. In recent years researchers have proposed several development directions for a new generation of search engines, among which subject search is particularly prominent. Compared with a general-purpose search engine, a subject search engine has a smaller retrieval range, so precision and recall are easier to guarantee. During a search it does not need to traverse the whole Web; it only needs to visit pages related to the subject, which largely avoids the index-expansion problem of conventional information collection systems.
Existing subject-focused web crawlers have the following problems: (1) when collecting subject web pages, it is difficult to judge accurately whether a page on the target website belongs to the subject, so large numbers of off-subject pages are easily collected; (2) the advantage of a subject-focused crawler is precisely that it does not need to traverse every page and only visits pages related to the subject, but during selection it is very difficult to define which pages are related to the subject.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a subject-oriented web page collection system that has higher efficiency and stronger page adaptability and effectively solves the problems in the prior art.
Technical solution of the invention: a subject-oriented web page collection system, characterized by comprising a sample training module, a strategy search module and a collecting module;
The sample training module analyzes a manually configured web page sample library and computes the subject feature vector and its values, together with the page similarity threshold;
The strategy search module controls the set of URL addresses to be retrieved (a URL is a web page address) and restricts the search range to the candidate seed websites;
The collecting module receives the URL address set sent by the strategy search module and performs purification, feature extraction, analysis, collection and storage of the pages; during feature analysis, the subject feature vector values and the similarity threshold used to judge subject web pages are filled in manually with reference to the results of the sample training module;
The strategy search module is implemented as follows:
(11) First construct two buffer pools, a subject-class buffer pool positivePool and a non-subject-class buffer pool negtivePool, which hold URL entities, i.e. the URL addresses in the URL address set. Both pools are initialized to the empty set. The subject-class pool holds URL addresses related to the collection subject, and the non-subject-class pool holds URL addresses unrelated to it. The purpose of the pools is to organize the URL addresses so that they can be used quickly during page collection; they are divided into a subject class and a non-subject class in order to form the subject-class URL address set. Each buffer pool is an encapsulated queue, which is common technology in the computer field;
(12) Manually select the seed websites Seeds, which form the initial set of the search program Spider, i.e. the URL address set;
(13) Run the Spider search on the initial set of manually selected seed websites and, applying the fixed-point strategy and the buffer pool strategy, record the corresponding page addresses. All page addresses recorded by this module are finally passed to the collecting module for page collection. The fixed-point strategy means that searching is performed only within the manually selected websites. The buffer pool strategy means that collected page addresses are put into the buffer pools, which speeds up duplicate checking during collection;
The collecting module is implemented as follows: for each page address Raw URL_i, $i \in N$, $i \le n$, perform the following processing:
(21) Pre-process the page to obtain page $P_i$;
(22) If positivePool is not empty, search it; if $P_i$ matches an entry in it, the page is considered to belong to the subject, processing of $P_i$ ends, and the procedure returns to (21) to process $P_{i+1}$;
(23) If negtivePool is not empty, search it; if $P_i$ matches an entry in it, the page is considered not to belong to the subject, processing of $P_i$ ends, and the procedure returns to (21) to process $P_{i+1}$;
(24) Construct the HVSM feature vector $V$ of page $P_i$ according to formula (6):
$$V = (k_t\omega_1, \ldots, k_t\omega_i, k_s\omega'_1, \ldots, k_s\omega'_j) = (\omega'_1, \omega'_2, \ldots, \omega'_n) \qquad (6)$$
where $n = i + j$; $\omega_i$ and $\omega'_j$ are the weights of the feature words $t_i$ and $s_j$ contained in $V_1$ and $V_2$ respectively; $n$ is the total dimension of the feature space; $k_t$ and $k_s$ are the weights that the system gives to the content feature vector $V_1$ and the structural feature vector $V_2$ (these weights can be set manually); $s_j$ denotes a structural characteristic parameter and $t_i$ denotes a term.
(25) Compare $P_i$ with the training sample pages according to formula (7) to obtain the similarity $L$:
$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega'_{ik} \times \omega'_{jk})}{\sqrt{\left(\sum_{k=1}^{n} \omega'^{\,2}_{ik}\right) \cdot \left(\sum_{k=1}^{n} \omega'^{\,2}_{jk}\right)}} \qquad (7)$$
where sim denotes the similarity function (the cosine of the angle between the two vectors), $V_i$ and $V_j$ are the HVSM feature vectors being compared, and $\omega'_{ik}$, $\omega'_{jk}$ are their $k$-th weights; as in formula (6), $s_j$ denotes a structural characteristic parameter and $t_i$ denotes a term.
(26) If the similarity $L$ is greater than the set threshold $\varepsilon$, the page is considered to belong to the subject: store it in the web page database and put Raw URL_i into positivePool; otherwise put Raw URL_i into negtivePool.
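Steps (21) to (26) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the text does not say how a page is "matched" against a pool entry, so the rule `same_directory` (same host and same parent directory) is a hypothetical stand-in, and `vector_of` and `similarity_to_topic` are placeholder callables for steps (24) and (25).

```python
import posixpath
from urllib.parse import urlparse

def same_directory(url, pool):
    """Assumed matching rule: same host and same parent directory as a pooled URL."""
    def key(u):
        parts = urlparse(u)
        return (parts.hostname, posixpath.dirname(parts.path))
    return any(key(url) == key(entry) for entry in pool)

def collect(raw_urls, vector_of, similarity_to_topic, epsilon, match=same_directory):
    positive_pool, negative_pool = [], []  # step (11): both pools start empty
    web_database = []                      # pages accepted in step (26)
    labels = {}                            # url -> True if judged on-subject
    for url in raw_urls:
        if positive_pool and match(url, positive_pool):   # step (22)
            labels[url] = True
            continue
        if negative_pool and match(url, negative_pool):   # step (23)
            labels[url] = False
            continue
        sim = similarity_to_topic(vector_of(url))         # steps (24)-(25)
        if sim > epsilon:                                 # step (26)
            web_database.append(url)
            positive_pool.append(url)
            labels[url] = True
        else:
            negative_pool.append(url)
            labels[url] = False
    return labels, web_database, positive_pool, negative_pool

# Toy run: only two URLs need a similarity score; the other two hit a pool.
scores = {
    "http://example.org/tech/a.html": 0.9,
    "http://example.org/ads/x.html": 0.1,
}
urls = [
    "http://example.org/tech/a.html",
    "http://example.org/tech/b.html",
    "http://example.org/ads/x.html",
    "http://example.org/ads/y.html",
]
labels, web_database, positive_pool, negative_pool = collect(
    urls, vector_of=lambda u: u, similarity_to_topic=lambda v: scores[v], epsilon=0.5
)
```

In this sketch the pools act as a shortcut: a URL that matches a pooled entry is labelled without building its feature vector, which is the speed-up the buffer pool strategy is meant to provide.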
Compared with the prior art, the advantage of the invention is as follows: the subject vector, its values and the similarity threshold are obtained by training on target samples and are then used in the calculations performed during collection. This gives the subject search algorithm and the search engine design of the system higher efficiency and stronger page adaptability, effectively solves the problems mentioned above, and produced fairly satisfactory results in testing.
Description of drawings
Fig. 1 is a system architecture diagram of the invention.
Embodiment
As shown in Fig. 1, the system consists of three modules: the sample training module, the strategy search module and the collecting module. The sample training module analyzes a manually configured web page sample library, computes the subject feature vector and its values, and calculates the page similarity threshold. The strategy search module controls the URL address set retrieved by the system and restricts the search range to the candidate seed websites. The collecting module receives the URL addresses sent by the strategy search module and performs page purification, feature extraction, analysis, collection and storage.
The specific functions and interactions of the main modules are described below.
1. Strategy search module
Information search on the Internet is a hyperlink-based technology, and the strategy search module is designed on that basis. Hyperlinks generally fall into three categories, and if the Web Spider followed every link it encountered it would cover the network quickly and widely. However, such unbounded searching is very likely to cause topic drift. Research shows that for a given subject (such as an industry or field) the authoritative websites are fairly well defined, and the URLs of pages on similar subjects tend to be highly similar. Based on this understanding, the system adopts two search strategies: (1) the fixed-point strategy: authoritative industry websites are chosen as the initial set, and only the internal links of these candidate seed websites (Seed URLs) are processed; (2) the buffer pool strategy: pages are classified and organized according to the similarity of their URLs.
The strategy search module works as follows:
(1) manually set the collection of seed nodes to form the target website library;
(2) process the internal links of the seed websites (i.e. extract the URL addresses within each website);
(3) set up the address buffer pools;
(4) classify and organize the pages according to the similarity of their URLs;
(5) push the URL address set to the collecting module.
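The fixed-point strategy of step (2), keeping only links that stay inside the seed websites, can be sketched as below. The function name `internal_links`, the host `example.org` and the sample links are illustrative assumptions; the text does not prescribe how links are resolved or normalized.

```python
from urllib.parse import urljoin, urlparse

def internal_links(page_url, hrefs, seed_hosts):
    """Resolve hrefs against page_url and keep only links hosted on a seed site."""
    kept, seen = [], set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar non-page links
        if parts.hostname not in seed_hosts:
            continue  # fixed-point strategy: never leave the seed websites
        # Drop the fragment so that in-page anchors do not create duplicates.
        normalized = parts._replace(fragment="").geturl()
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept

seeds = {"example.org"}
links = internal_links(
    "http://example.org/news/index.html",
    ["a.html", "/b.html#top", "http://other.example.com/c.html",
     "mailto:x@example.org", "a.html"],
    seeds,
)
```

Here the external link, the mail link and the repeated `a.html` are all dropped, so only two addresses would be pushed on to the next stage.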
2. Collecting module
The collecting module receives the URL addresses sent by the strategy search module and performs page purification, feature extraction, analysis, collection and storage.
The collecting module works as follows:
(1) receive the URL address set pushed by the strategy search module;
(2) pre-process the pages to be collected;
Web pages often contain various content unrelated to the subject. This content constitutes the noise of the page and increases its complexity, so the page must first be purified. For the page noise problem, heuristic rules from the literature define a noise region as a template: for a set of pages that use the same template, content that occurs repeatedly is regarded as noise. However, because HTML syntax is not strictly enforced, the source of many current pages has missing or improperly mixed tags. The target page therefore first needs to be parsed, using an HTML parser to normalize the source file (HTML, the Hypertext Markup Language, is a markup language for describing web documents). In addition, to simplify the page's DOM tree (the Document Object Model, DOM for short, is the standard programming interface recommended by the W3C for processing extensible markup language), nodes such as script (HTML script tags), comment (HTML comments) and style (HTML styles) are ignored when the DOM tree is built. The system uses Tidy provided by the W3C (an encapsulated class library that can convert HTML to XML) to repair the source file.
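As a rough illustration of this purification step, the sketch below drops script, style and comment nodes and keeps the visible text. It uses Python's standard `html.parser` module instead of Tidy, which is a substitution made here for the sake of a self-contained example; the class name `PurifyingParser` and the sample page are hypothetical, and template-based noise removal is not covered.

```python
from html.parser import HTMLParser

class PurifyingParser(HTMLParser):
    """Collects visible text while ignoring script, style and comment nodes."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)

    # handle_comment keeps its default no-op behaviour, so comments are dropped.

def purify(html):
    parser = PurifyingParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><!-- ad --><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
clean_text = purify(page)
```

The standard parser tolerates unclosed tags, which is why it can serve as a lightweight stand-in here; a production system would still want a repair tool such as the Tidy library named in the text.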
(3) select the features of the subject page and extract the feature vector;
Page features are represented with the vector space model (Vector Space Model, VSM). In the VSM, a document $d$ is mapped to a feature vector $V(d)$:
$$V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d)) \qquad (1)$$
where $t_i$ ($i = 1, 2, \ldots, n$) are mutually distinct terms (a term is a word or phrase segmented from the page according to the page structure; it is a general term in the web collection field), and $\omega_i(d)$ is the weight of $t_i$ in $d$, generally defined as a function of the frequency with which $t_i$ occurs in $d$.
The HTML tags of a page reflect the semantics and structure of the content they enclose and can guide the extraction of subject feature words. Research has shown that the 12 index sources of a page, such as the page title and article headings, differ in their ability to express the subject of the page content and can be ranked accordingly. In addition, the term frequency of the text is correlated with the size of the page. Considering that a Web page is a semi-structured document, the system improves on the traditional VSM and adopts a hybrid vector space model (HVSM). The subject feature vector of a page consists of two parts, the content feature vector $V_1$ and the structural feature vector $V_2$, which are given different weights.
1) Content feature vector $V_1$
Terms are given different weights depending on where they occur in the page. The page is divided into four parts: title (<title>, <head>) (B1); keywords (<font>, <strong>, <b>, <big>, <i>, <u>) (B2); link anchor text (<a>) (B3); and the remainder (B4). The first three parts can be identified by their HTML tags. If a term $t$ occurs $N_i$ times in part B$i$, its weight $\omega(d)$ is:
$$\omega(d) = N_1 W_{B1} + (N_2 W_{B2} + N_3 W_{B3} + N_4 W_{B4}) / S \qquad (2)$$
where $W_{Bi}$ expresses the importance of the position B$i$ in which the term occurs and $S$ is the page size. The values of $W_{Bi}$ can be adjusted as required. The content feature vector is then expressed as:
$$V_1 = (t_1, \omega_1; \ldots; t_i, \omega_i; \ldots; t_n, \omega_n) \qquad (3)$$
where $t_i$ ($i = 1, \ldots, n$) are the terms in the page and $\omega_i$ ($i = 1, \ldots, n$) are the corresponding weights.
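Formula (2) can be evaluated directly, as in the following sketch. The position weights and the reading of the page size S as a term count are assumptions chosen for the example; the text leaves both to be tuned.

```python
def content_weight(counts, position_weights, page_size):
    """Formula (2): counts and position_weights are ordered (B1, B2, B3, B4)."""
    n1, n2, n3, n4 = counts
    w1, w2, w3, w4 = position_weights
    return n1 * w1 + (n2 * w2 + n3 * w3 + n4 * w4) / page_size

# Hypothetical values: title hits count most, body hits are scaled by page size.
weights = (4.0, 2.0, 1.5, 1.0)
omega = content_weight(counts=(1, 2, 0, 6), position_weights=weights, page_size=100)
```

With these numbers the term scores 1 x 4.0 for its title occurrence plus (2 x 2.0 + 0 x 1.5 + 6 x 1.0) / 100 for the rest, so a single title hit outweighs many occurrences in the body of a long page.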
2) Structural feature vector $V_2$
The meaning of $t_i$ is extended to form the structural characteristic parameters of the page. These parameters comprise two kinds of element: (i) structural elements such as table, tr and td (or th); (ii) domain-knowledge keywords. The structural feature vector $V_2$ is expressed as:
$$V_2 = (s_1, \omega(s_1); \ldots; s_i, \omega(s_i); \ldots; s_n, \omega(s_n)) \qquad (4)$$
where $s_i$ ($i = 1, \ldots, n$) are the structural characteristic parameters of the page and $\omega(s_i)$ ($i = 1, \ldots, n$) is the corresponding weight function in page $d$. The definition of $\omega$ depends on the specific element types in the page.
3) HVSM feature vector $V$
The HVSM (hybrid vector space model) describes a page in terms of both its content and its structure, and is a combined representation of content features and structural features. Describing page features with the HVSM retains as much of the page's content and structural information as possible. Suppose the subject feature vector of a page consists of two parts, the content element block $V_1$ and the structural element block $V_2$. $V_1$ contains the set of feature words $\{t_1, \ldots, t_i\}$ that reflect the page content, $V_2$ contains the set of feature words $\{s_1, \ldots, s_j\}$ that reflect the page structure, and the corresponding subject feature word set is $T = V_1 \cup V_2$. The HVSM feature vector $V$ of the page can then be expressed as:
$$V = (\omega_1, \ldots, \omega_i, \omega'_1, \ldots, \omega'_j) = (\omega_1, \omega_2, \ldots, \omega_n) \qquad (5)$$
where $n = i + j$, $\omega_i$ and $\omega'_j$ are the weights of the feature words $t_i$ and $s_j$ contained in $V_1$ and $V_2$ respectively, and $n$ is the total dimension of the feature space. The system can give the content feature vector $V_1$ and the structural feature vector $V_2$ different weights $k_t$ and $k_s$ to express the influence of page text and page structure on the page features. Applying these weights to formula (5) gives the feature vector $V$ of the page:
$$V = (k_t\omega_1, \ldots, k_t\omega_i, k_s\omega'_1, \ldots, k_s\omega'_j) = (\omega'_1, \omega'_2, \ldots, \omega'_n) \qquad (6)$$
where $n = i + j$. For different research subjects, $k_t$ and $k_s$ can be adjusted as required and given corresponding values. As before, $\omega_i(d)$ is the weight of $t_i$ in $d$, generally defined as a function of the frequency with which $t_i$ occurs in $d$, and $t_i$ ($i = 1, 2, \ldots, n$) are mutually distinct terms.
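A minimal sketch of formula (6) follows. It assumes that every page is projected onto the same ordered lists of content terms and structural parameters, so that vectors from different pages have the same dimension n = i + j; the term lists, weights and the values of k_t and k_s are invented for illustration.

```python
def hvsm_vector(content, structure, content_terms, structure_terms, k_t, k_s):
    """Formula (6): scale content weights by k_t, structural weights by k_s, concatenate."""
    v1 = [k_t * content.get(term, 0.0) for term in content_terms]
    v2 = [k_s * structure.get(param, 0.0) for param in structure_terms]
    return v1 + v2

content_terms = ["crawler", "topic", "index"]   # t_1 .. t_i (hypothetical)
structure_terms = ["table", "td"]               # s_1 .. s_j (hypothetical)
vector = hvsm_vector(
    content={"crawler": 2.0, "topic": 1.0},
    structure={"table": 3.0},
    content_terms=content_terms,
    structure_terms=structure_terms,
    k_t=0.5,
    k_s=2.0,
)
```

Features that do not occur in a page get weight 0, which keeps the vector length fixed at n and makes pages directly comparable.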
(4) after the page features and feature vector have been extracted, judge page similarity;
Once the HVSM feature vector of a page has been determined by formula (6), the cosine similarity measure is used to judge the subject similarity of pages. For any two pages $P_i$ and $P_j$, compute the cosine of the angle between the corresponding HVSM feature vectors $V_i$ and $V_j$; the larger the cosine, the higher the subject similarity of the two pages. The similarity $L$ of pages $P_i$ and $P_j$ is calculated as follows:
$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega'_{ik} \times \omega'_{jk})}{\sqrt{\left(\sum_{k=1}^{n} \omega'^{\,2}_{ik}\right) \cdot \left(\sum_{k=1}^{n} \omega'^{\,2}_{jk}\right)}} \qquad (7)$$
where $\omega'_{ik}$ and $\omega'_{jk}$ are the $k$-th weights of $V_i$ and $V_j$; as before, they derive from $\omega_i(d)$, the weight of term $t_i$ in $d$, generally defined as a function of the frequency with which $t_i$ occurs in $d$, with $t_i$ ($i = 1, 2, \ldots, n$) mutually distinct terms. Let the set of HVSM feature vectors of the subject training samples be $\{V_1, \ldots, V_n\}$; the arithmetic mean $V_c$ of this set is used to represent the subject feature vector, i.e.
$$V_c = \frac{1}{n} \sum_{k=1}^{n} V_k \qquad (8)$$
For an unknown page, after its HVSM feature vector has been determined, the similarity between this vector and the subject feature vector $V_c$ is calculated with the similarity formula. Set a threshold $\varepsilon \in (0, 1)$: if the similarity is greater than $\varepsilon$, the page is considered to belong to the subject and is stored in the web page database; otherwise the page is discarded. The threshold $\varepsilon$ is an empirical value and needs to be revised according to the learning results of the training samples to achieve the best effect.
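Formulas (7) and (8) and the threshold test can be written compactly as below. This is a sketch with plain Python lists; the function names and the threshold value used in the example are assumptions, and returning 0 for a zero-length vector is a convention added here to avoid division by zero.

```python
import math

def cosine_similarity(u, v):
    """Formula (7): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Formula (8): component-wise arithmetic mean of the sample vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def is_subject_page(page_vector, topic_vector, epsilon):
    """A page is kept only if its similarity to V_c exceeds the threshold."""
    return cosine_similarity(page_vector, topic_vector) > epsilon

training_vectors = [[1.0, 0.0], [0.0, 1.0]]
v_c = centroid(training_vectors)
similarity = cosine_similarity([1.0, 0.0], v_c)
```

With the two training vectors above, V_c is (0.5, 0.5) and a page with vector (1, 0) has similarity of about 0.707 to it, so it would be kept for a threshold of 0.6 and discarded for a threshold of 0.8.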
3. Sample training module
The sample training module works on a manually defined sample library: it extracts the features and feature values of the subject pages in the library and then calculates the threshold for the samples. By performing feature extraction and threshold calculation on a large amount of sample data, it derives the optimal feature vector weight ratio (i.e. the ratio between the content feature vector and the structural feature vector) and the threshold.
(1) manually select the target sample library (ensuring that all data in the library belong to the same subject, for training purposes);
(2) purify the sample library pages;
Web pages often contain various content unrelated to the subject. This content constitutes the noise of the page and increases its complexity, so the page must first be purified. For the page noise problem, heuristic rules from the literature define a noise region as a template: for a set of pages that use the same template, content that occurs repeatedly is regarded as noise. However, because HTML syntax is not strictly enforced, the source of many current pages has missing or improperly mixed tags. The target page therefore first needs to be parsed, using an HTML parser to normalize the source file. In addition, to simplify the page's DOM tree, nodes such as script, comment and style are ignored when the DOM tree is built. The system uses Tidy provided by the W3C to repair the source file.
(3) extract the features of the sample library pages;
Page features are represented with the vector space model (Vector Space Model, VSM). In the VSM, a document $d$ is mapped to a feature vector $V(d)$:
$$V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d)) \qquad (1)$$
where $t_i$ ($i = 1, 2, \ldots, n$) are mutually distinct terms and $\omega_i(d)$ is the weight of $t_i$ in $d$, generally defined as a function of the frequency with which $t_i$ occurs in $d$.
The HTML tags of a page reflect the semantics and structure of the content they enclose and can guide the extraction of subject feature words. Research has shown that the 12 index sources of a page, such as the page title and article headings, differ in their ability to express the subject of the page content and can be ranked accordingly. In addition, the term frequency of the text is correlated with the size of the page. Considering that a Web page is a semi-structured document, the system improves on the traditional VSM and adopts a hybrid vector space model (HVSM). The subject feature vector of a page consists of two parts, the content feature vector $V_1$ and the structural feature vector $V_2$, which are given different weights.
1) Content feature vector $V_1$
Terms are given different weights depending on where they occur in the page. The page is divided into four parts: title (<title>, <head>) (B1); keywords (<font>, <strong>, <b>, <big>, <i>, <u>) (B2); link anchor text (<a>) (B3); and the remainder (B4). The first three parts can be identified by their HTML tags. If a term $t$ occurs $N_i$ times in part B$i$, its weight $\omega(d)$ is:
$$\omega(d) = N_1 W_{B1} + (N_2 W_{B2} + N_3 W_{B3} + N_4 W_{B4}) / S \qquad (2)$$
where $W_{Bi}$ expresses the importance of the position B$i$ in which the term occurs and $S$ is the page size. The values of $W_{Bi}$ can be adjusted as required. The content feature vector is then expressed as:
$$V_1 = (t_1, \omega_1; \ldots; t_i, \omega_i; \ldots; t_n, \omega_n) \qquad (3)$$
where $t_i$ ($i = 1, \ldots, n$) are the text terms in the page and $\omega_i$ ($i = 1, \ldots, n$) are the corresponding weights.
2) Structural feature vector $V_2$
The meaning of $t_i$ is extended to form the structural characteristic parameters of the page. These parameters comprise two kinds of element: (i) structural elements such as table, tr and td (or th); (ii) domain-knowledge keywords. The structural feature vector $V_2$ is expressed as:
$$V_2 = (s_1, \omega(s_1); \ldots; s_i, \omega(s_i); \ldots; s_n, \omega(s_n)) \qquad (4)$$
where $s_i$ ($i = 1, \ldots, n$) are the structural characteristic parameters of the page and $\omega(s_i)$ ($i = 1, \ldots, n$) is the corresponding weight function in page $d$. The definition of $\omega$ depends on the specific element types in the page.
(4) form the feature vector library;
The extracted page features are stored, i.e. the computed feature vector values are stored to form the feature vector library.
(5) analyze the feature vector library, obtain the subject feature vector and its values by weighted summation and averaging, and obtain the similarity threshold of the sample library;
The feature values extracted from pages of the same subject are analyzed by giving the content feature vector and the structural feature vector different weights, in order to find the optimal weight assignment, i.e. the assignment under which the distribution of subject page feature values is most reasonable. The threshold for comparing similar pages is also calculated, to serve as a reference when parameters are set for subsequent subject collection.
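The text does not state how the threshold is derived from the feature vector library. One simple possibility, shown here purely as an assumption, is to average the sample vectors into a subject vector and take the lowest similarity of any training sample to that vector as a starting threshold, to be adjusted manually afterwards. The function names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train(sample_vectors):
    """Return (subject vector, starting threshold) from same-subject sample vectors."""
    n = len(sample_vectors)
    topic_vector = [sum(column) / n for column in zip(*sample_vectors)]
    similarities = [cosine(v, topic_vector) for v in sample_vectors]
    # Assumed rule: the least similar training sample sets the initial threshold.
    return topic_vector, min(similarities)

samples = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [2.0, 2.0, 1.0]]
topic_vector, epsilon = train(samples)
```

A threshold chosen this way is only a first estimate: because the acceptance test is a strict "greater than", the least similar sample sits exactly on the boundary, and in practice the value would be lowered or raised after checking results on pages outside the sample library.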
The parts of the invention not described in detail belong to techniques well known in the art.
The above is only a partial embodiment of the invention, but the scope of protection of the invention is not limited to it; any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims (2)

1. A subject-oriented web page collection system, characterized by comprising a sample training module, a strategy search module and a collecting module;
the sample training module analyzes a manually configured web page sample library and computes the subject feature vector and its values, together with the page similarity threshold;
the strategy search module controls the URL address set to be retrieved and restricts the search range to the candidate seed websites;
the collecting module receives the URL address set sent by the strategy search module and performs purification, feature extraction, analysis, collection and storage of the pages; during feature analysis, the subject feature vector values and the similarity threshold used to judge subject web pages are filled in manually with reference to the results of the sample training module;
the strategy search module is implemented as follows:
(11) first construct two buffer pools, a subject-class buffer pool positivePool and a non-subject-class buffer pool negtivePool, which hold URL entities, i.e. the URL addresses in the URL address set; both pools are initialized to the empty set; the subject-class pool holds URL addresses related to the collection subject, and the non-subject-class pool holds URL addresses unrelated to it; the purpose of the pools is to organize the URL addresses so that they can be used quickly during page collection, and they are divided into a subject class and a non-subject class in order to form the subject-class URL address set; each buffer pool is an encapsulated queue;
(12) manually select the seed websites Seeds, which form the initial set of the search program Spider, i.e. the URL address set;
(13) run the Spider search on the initial set of manually selected seed websites and, applying the fixed-point strategy and the buffer pool strategy, record the corresponding page addresses; all page addresses recorded by this module are finally passed to the collecting module for page collection; the fixed-point strategy means that searching is performed only within the manually selected websites; the buffer pool strategy means that collected page addresses are put into the buffer pools.
2. The subject-oriented web page collection system according to claim 1, characterized in that the collecting module is implemented as follows: for each page address Raw URL_i, $i \in N$, $i \le n$, perform the following processing:
(21) pre-process the page to obtain page $P_i$;
(22) if positivePool is not empty, search it; if $P_i$ matches an entry in it, the page is considered to belong to the subject, processing of $P_i$ ends, and the procedure returns to (21) to process $P_{i+1}$;
(23) if negtivePool is not empty, search it; if $P_i$ matches an entry in it, the page is considered not to belong to the subject, processing of $P_i$ ends, and the procedure returns to (21) to process $P_{i+1}$;
(24) construct the HVSM feature vector $V$ of page $P_i$ according to formula (6):
$$V = (k_t\omega_1, \ldots, k_t\omega_i, k_s\omega'_1, \ldots, k_s\omega'_j) = (\omega'_1, \omega'_2, \ldots, \omega'_n) \qquad (6)$$
where $n = i + j$; $\omega_i$ and $\omega'_j$ are the weights of the feature words $t_i$ and $s_j$ contained in $V_1$ and $V_2$ respectively; $n$ is the total dimension of the feature space; $k_t$ and $k_s$ are the weights given to the content feature vector $V_1$ and the structural feature vector $V_2$; $s_j$ denotes a structural characteristic parameter and $t_i$ denotes a term;
(25) compare $P_i$ with the training sample pages according to formula (7) to obtain the similarity $L$:
$$L = \mathrm{sim}(V_i, V_j) = \frac{\sum_{k=1}^{n} (\omega'_{ik} \times \omega'_{jk})}{\sqrt{\left(\sum_{k=1}^{n} \omega'^{\,2}_{ik}\right) \cdot \left(\sum_{k=1}^{n} \omega'^{\,2}_{jk}\right)}} \qquad (7)$$
where sim denotes the similarity function (the cosine of the angle between the two vectors), $V_i$ and $V_j$ are the HVSM feature vectors being compared, and $\omega'_{ik}$, $\omega'_{jk}$ are their $k$-th weights; as in formula (6), $s_j$ denotes a structural characteristic parameter and $t_i$ denotes a term;
(26) if the similarity $L$ is greater than the set threshold $\varepsilon$, the page is considered to belong to the subject: store the subject page in the web page database and put Raw URL_i into positivePool; otherwise put Raw URL_i into negtivePool.
CN2013102751157A 2013-07-02 2013-07-02 Subject-oriented web page collection system Pending CN103310013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102751157A CN103310013A (en) 2013-07-02 2013-07-02 Subject-oriented web page collection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102751157A CN103310013A (en) 2013-07-02 2013-07-02 Subject-oriented web page collection system

Publications (1)

Publication Number Publication Date
CN103310013A true CN103310013A (en) 2013-09-18

Family

ID=49135231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102751157A Pending CN103310013A (en) 2013-07-02 2013-07-02 Subject-oriented web page collection system

Country Status (1)

Country Link
CN (1) CN103310013A (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
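王煜, 张浩斌: "Design and Research of a Subject-Oriented Web Page Collection System" (面向主题的网页采集系统的设计与研究), Computer & Digital Engineering (《计算机与数字工程》) *
邬耀宗: "Trigonometric Functions" (《三角函数》), 31 July 1979, Zhejiang People's Publishing House (浙江人民出版社) *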

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126642A (en) * 2016-06-23 2016-11-16 北京工业大学 A kind of financial warehouse receipt wind control information crawler calculated based on streaming and screening technique
CN106126642B (en) * 2016-06-23 2020-01-17 北京工业大学 Financial warehouse receipt wind control information crawling and screening method based on stream-oriented computing
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method
CN106709052B (en) * 2017-01-06 2020-09-04 电子科技大学 Topic web crawler design method based on keywords
CN107273499A (en) * 2017-06-16 2017-10-20 成都布林特信息技术有限公司 Data grab method based on vertical search engine
CN109635182A (en) * 2018-12-21 2019-04-16 全通教育集团(广东)股份有限公司 Parallelization data tracking method based on educational information theme
CN109670099A (en) * 2018-12-21 2019-04-23 全通教育集团(广东)股份有限公司 Based on education network message subject acquisition method
WO2020114528A3 (en) * 2020-03-09 2021-01-14 中国民用航空总局第二研究所 Method, device, and system for tracking persons potentially infected at public places during epidemic
CN112100500A (en) * 2020-09-23 2020-12-18 高小翎 Example learning-driven content-associated website discovery method
CN112214558A (en) * 2020-11-18 2021-01-12 国家计算机网络与信息安全管理中心 Theme correlation degree judging method and device
CN112214558B (en) * 2020-11-18 2023-08-15 国家计算机网络与信息安全管理中心 Theme relevance discriminating method and device

Similar Documents

Publication Publication Date Title
CN103310013A (en) Subject-oriented web page collection system
CN106874378B (en) Method for constructing knowledge graph based on entity extraction and relation mining of rule model
CN103823824B (en) A kind of method and system that text classification corpus is built automatically by the Internet
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN104182412B (en) A kind of web page crawl method and system
CN102902806B (en) A kind of method and system utilizing search engine to carry out query expansion
CN102262661B (en) Web page access forecasting method based on k-order hybrid Markov model
CN103853738B (en) A kind of recognition methods of info web correlation region
CN104408148B (en) A kind of field encyclopaedia constructing system based on general encyclopaedia website
CN103310026B (en) A kind of lightweight common webpage topic crawler method based on search engine
CN104199833B (en) The clustering method and clustering apparatus of a kind of network search words
CN102799677B (en) Water conservation domain information retrieval system and method based on semanteme
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN106777043A (en) A kind of academic resources acquisition methods based on LDA
CN102831121A (en) Method and system for extracting webpage information
CN104899273A (en) Personalized webpage recommendation method based on topic and relative entropy
CN102306204A (en) Subject area identifying method based on weight of text structure
CN105574047A (en) Website main page feature analysis based Chinese website sorting method and system
CN103294781A (en) Method and equipment used for processing page data
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN105975639B (en) Search result ordering method and device
Prajapati A survey paper on hyperlink-induced topic search (HITS) algorithms for web mining
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN103246732A (en) Online Web news content extracting method and system
CN114090861A (en) Education field search engine construction method based on knowledge graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130918
