CN106126705A - A kind of large scale network data crawl system in real time - Google Patents

A large-scale network data real-time crawling system

Info

Publication number
CN106126705A
Authority
CN
China
Prior art keywords
degree
information
page
module
large scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610507120.XA
Other languages
Chinese (zh)
Inventor
刘丽君
李成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TIPDM INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
WUHAN TIPDM INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TIPDM INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN201610507120.XA
Publication of CN106126705A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

A large-scale network data real-time crawling system, comprising: an initial seed optimization module for entering the seed links of a website and periodically adding qualifying web page links to the seed set, which serves as the initial seed set; an integration module for obtaining HTML web documents and annotating the information in their text; a web page hyperlink information integration module for storing the descriptive data of web page hyperlinks; a web page relevance calculation module for analyzing topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value; and a hyperlink importance calculation module that uses the calculated numerical value as a basis for judging relevance, again by quantitative analysis. The number of links contained in the current page indicates how many links the page has; if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement.

Description

A large-scale network data real-time crawling system
Technical field
The present invention relates to the field of big data and cloud computing technology, and in particular to a large-scale network data real-time crawling system.
Background art
With the rapid development and growing popularity of the Internet, the content offered by network information platforms has become ever richer. As a result, users find it harder to search for the information they need, and sifting through the results consumes a great deal of time and effort. The emergence of search engines solved the problem of retrieving information at massive scale. A search engine collects resources by means of a crawler. A web crawler crawls and collects web documents over network connections: starting from previously given URLs, it uses the HTTP protocol to fetch the required HTML documents, analyzes the hyperlinks contained in those documents, and then fetches the links that have not yet been visited together with the resources they contain. This is repeated until there are no new URLs.
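The basic crawl loop described above can be sketched as follows. This is only a minimal illustration of the general technique, not the system of the invention: the `fetch` callable and the in-memory `FAKE_WEB` dictionary are hypothetical stand-ins for HTTP retrieval, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Breadth-first crawl: fetch a page, extract its links, queue unvisited ones.

    `fetch` is a hypothetical callable mapping a URL to HTML text (or None).
    Returns the URLs in the order in which they were visited.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue:  # the loop ends when there are no new URLs left
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl(["http://example.com/"], FAKE_WEB.get)
```

A real crawler would replace `FAKE_WEB.get` with an HTTP client and add politeness delays and error handling, which are outside the scope of this sketch.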
However, with the rapid development of the mobile Internet, new web content is growing explosively, and traditional crawling systems cannot meet the demand for crawling network data at large scale.
Summary of the invention
It is therefore necessary to provide a large-scale network data real-time crawling system that can crawl large-scale network data in real time.
A large-scale network data real-time crawling system comprises the following modules:
An initial seed optimization module, for entering the seed links of a website; returning the best results to the user by means of meta search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;
An integration module, for obtaining HTML web documents and annotating the information in their text;
A web page hyperlink information integration module, for storing the descriptive data of web page hyperlinks. Given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A; if, when a user queries for information, hyperlinks point to both A and B at the same time, the information quality of A and B is assumed by default to be the same;
A web page relevance calculation module, for analyzing topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value;
A hyperlink importance calculation module, for using the calculated numerical information as one basis for judging relevance, again performing quantitative analysis with concrete numerical values. The number of links contained in the current page indicates how many links the page has; if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement.
In the large-scale network data real-time crawling system of the present invention, the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the loop condition being that the corresponding marker is found, the marker being defined as delimiters; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; extracts the text between positions 1 and 2 and outputs it as a character string, defined as dest; and, when the traversal ends, outputs the character string;
A repetition unit, which repeats the single traversal unit until the information mining points have been mined, and extracts the feature-vector keywords of the mined points by a plain-text classification and clustering algorithm. The algorithm that obtains the topic relevance from the mined information is expressed with a vector space model.
In the large-scale network data real-time crawling system of the present invention,
The vector space model is expressed as follows:
First, the text information of the web page is analyzed, and the vector $\alpha = (w_1, w_2, \ldots, w_n)$, $i = 1, 2, \ldots, n$, is defined.
The number of occurrences of each keyword is counted, and the keywords that occur most frequently are taken as the positioning criterion; the frequency is defined as $x_i$. The components $x_i w_i$ are built, and the page topic vector is defined as $\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$. The cosine of the two vectors then reflects how frequently the keywords occur. The relevance formula is as follows:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} x_i w_i^{2}}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic;
A threshold is set for the relevance between the current web page and the topic. A value above the threshold means the page is relevant to the topic; otherwise it is not. Web pages relevant to the topic are classified and saved, and submitted to the database for indexing.
In the large-scale network data real-time crawling system of the present invention, setting the threshold for the relevance between the current web page and the topic comprises:
Performing random sampling periodically to obtain a predetermined number of original web page documents, analyzing the relevance of the web pages manually, and calculating the accuracy;
Measuring the accuracy over multiple rounds. If the fluctuation of the accuracy over the predetermined number of rounds is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy;
Repeating the multi-round accuracy measurement until the desired threshold is obtained.
In the large-scale network data real-time crawling system of the present invention, the hyperlink importance calculation module comprises:
The page importance is calculated with the following formula:
$p_u = w_1 \cdot \cos\langle\alpha,\beta\rangle + w_2 \cdot \mathrm{Hub}(u)$, where $\mathrm{Hub}(u)$ denotes the link importance of the web page, $CL(u)$ denotes the number of links found, and its maximum is denoted $C_{max}$. The weight of page relevance is denoted m1 and the weight of page link degree is denoted m2 (written $w_1$ and $w_2$ in the formula); m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
In the large-scale network data real-time crawling system of the present invention, the integration module further comprises:
Obtaining the HTML web document and analyzing whether it is video HTML. If it is, it is then determined whether the URL belongs to a domain name that is permitted to be crawled. If it does not, processing ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. It is then determined whether the video parsing class is empty: if it is empty, processing ends; if it is not, it is further determined whether the URL is a playback address of the video HTML. If it is not a playback address, processing ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when that list is not empty it is returned and processing ends. If the document is not video HTML, the information in the text is annotated and passed to the web page hyperlink information integration module.
Compared with the prior art, the large-scale network data real-time crawling system provided by the present invention has the following beneficial effects. The web page relevance calculation module analyzes topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value. The hyperlink importance calculation module uses the calculated numerical information as one basis for judging relevance, again by quantitative analysis: the number of links contained in the current page indicates how many links the page has, and if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement. The desired network data can therefore be obtained from massive big data. In addition, because the integration module obtains HTML web documents and analyzes whether they are video HTML, ordinary web pages can be distinguished from video web pages, which makes crawling more efficient.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the large-scale network data real-time crawling system according to an embodiment of the present invention.
Fig. 2 is a structural block diagram of the web page relevance calculation module in Fig. 1.
Detailed description
As shown in Figs. 1 and 2, a large-scale network data real-time crawling system comprises the following modules:
An initial seed optimization module, for entering the seed links of a website; returning the best results to the user by means of meta search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set.
Optionally, a max-priority queue is provided in the initial seed optimization module. The max-priority queue maintains a set, denoted set, in which each element corresponds to a priority key. The max-priority queue supports the following operations:
Insert, Insert(set, e, key): inserts the element e with priority key into the set;
Maximum, Max(set): returns the element with the highest priority in the set;
Extract, Ext(set): returns the element with the highest priority in the set and deletes it from the set;
Increase, (set, e, key): sets the priority of the element e in the set to key.
In this embodiment the queue can be implemented with a heap, which is highly efficient.
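The four queue operations listed above can be sketched with Python's standard `heapq` module. This is a minimal sketch under stated assumptions, not the embodiment's actual code: the class name `MaxPriorityQueue`, the method names, and the lazy-deletion strategy for priority updates are all choices made for this illustration. Because `heapq` provides a min-heap, keys are stored negated.

```python
import heapq
import itertools


class MaxPriorityQueue:
    """Max-priority queue on a binary heap, with lazy deletion of stale entries."""

    def __init__(self):
        self._heap = []                    # entries: (-key, tie_breaker, element)
        self._key = {}                     # element -> current priority
        self._counter = itertools.count()  # keeps heap entries totally ordered

    def insert(self, e, key):
        """Insert element e with priority key (Insert in the text)."""
        self._key[e] = key
        heapq.heappush(self._heap, (-key, next(self._counter), e))

    def increase(self, e, key):
        """Set the priority of an existing element e to key."""
        if e not in self._key:
            raise KeyError(e)
        self.insert(e, key)  # the older heap entry for e becomes stale

    def _discard_stale(self):
        # Pop entries whose stored key no longer matches the element's current key.
        while self._heap:
            neg_key, _, e = self._heap[0]
            if self._key.get(e) == -neg_key:
                return
            heapq.heappop(self._heap)

    def max(self):
        """Return the highest-priority element without removing it (Max)."""
        self._discard_stale()
        if not self._heap:
            raise IndexError("queue is empty")
        return self._heap[0][2]

    def extract_max(self):
        """Return the highest-priority element and remove it (Ext)."""
        e = self.max()
        heapq.heappop(self._heap)
        del self._key[e]
        return e


# Hypothetical seed URLs with made-up relevance scores as priorities.
frontier = MaxPriorityQueue()
frontier.insert("http://example.com/a", 0.2)
frontier.insert("http://example.com/b", 0.9)
frontier.insert("http://example.com/c", 0.5)
first = frontier.max()
frontier.increase("http://example.com/a", 1.0)
drained = [frontier.extract_max() for _ in range(3)]
```

Insertion and extraction both run in logarithmic time in the size of the heap, which is consistent with the efficiency claim in the text.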
An integration module, for obtaining HTML web documents and annotating the information in their text.
A web page hyperlink information integration module, for storing the descriptive data of web page hyperlinks. Given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A; if, when a user queries for information, hyperlinks point to both A and B at the same time, the information quality of A and B is assumed by default to be the same.
A web page relevance calculation module, for analyzing topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value.
A hyperlink importance calculation module, for using the calculated numerical information as one basis for judging relevance, again performing quantitative analysis with concrete numerical values. The number of links contained in the current page indicates how many links the page has; if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement.
In the large-scale network data real-time crawling system of the present invention, the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the loop condition being that the corresponding marker is found, the marker being defined as delimiters; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; extracts the text between positions 1 and 2 and outputs it as a character string, defined as dest; and, when the traversal ends, outputs the character string.
A repetition unit, which repeats the single traversal unit until the information mining points have been mined, and extracts the feature-vector keywords of the mined points by a plain-text classification and clustering algorithm. The algorithm that obtains the topic relevance from the mined information is expressed with a vector space model.
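The single traversal and its repetition can be sketched as follows. This is an illustrative reading of the steps above, with assumptions: Python's `str.find` plays the role of Find(), the marker is taken to be a start/end delimiter pair, and `m_scontent` stands in for m-scontent because hyphens are not valid in Python identifiers. The function names are hypothetical.

```python
def extract_between(content, start_delim, end_delim, pos=0):
    """One traversal: return the text between the next pair of delimiters.

    Returns (extracted_text, next_search_position), or (None, -1) when no
    complete delimiter pair remains after `pos`.
    """
    begin = content.find(start_delim, pos)   # interception position 1
    if begin == -1:
        return None, -1
    begin += len(start_delim)
    end = content.find(end_delim, begin)     # interception position 2
    if end == -1:
        return None, -1
    return content[begin:end], end + len(end_delim)


def extract_all(content, start_delim, end_delim):
    """Repeat the single traversal until no further delimiter pair is found."""
    results, pos = [], 0
    while True:
        dest, pos = extract_between(content, start_delim, end_delim, pos)
        if dest is None:
            return results
        results.append(dest)


# Made-up page text used only to exercise the functions.
m_scontent = "<p>focused crawler</p><div>ad</div><p>topic relevance</p>"
paragraphs = extract_all(m_scontent, "<p>", "</p>")
```

String searching of this kind is fragile on real HTML; an HTML parser would be a more robust choice, but the sketch stays close to the find-and-intercept steps the text describes.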
In the large-scale network data real-time crawling system of the present invention,
The vector space model is expressed as follows:
First, the text information of the web page is analyzed, and the vector $\alpha = (w_1, w_2, \ldots, w_n)$, $i = 1, 2, \ldots, n$, is defined.
The number of occurrences of each keyword is counted, and the keywords that occur most frequently are taken as the positioning criterion; the frequency is defined as $x_i$. The components $x_i w_i$ are built, and the page topic vector is defined as $\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$. The cosine of the two vectors then reflects how frequently the keywords occur. The relevance formula is as follows:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} x_i w_i^{2}}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic.
A threshold is set for the relevance between the current web page and the topic. A value above the threshold means the page is relevant to the topic; otherwise it is not. Web pages relevant to the topic are classified and saved, and submitted to the database for indexing.
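The relevance computation and threshold test can be illustrated with a short sketch. It assumes that the weights w_i are supplied per keyword in a dictionary and that x_i is the raw occurrence count of keyword i in the page. The function names, the example weights, and the threshold value 0.5 are hypothetical and not taken from the source.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_relevance(page_words, topic_weights):
    """Relevance of a page to a topic under the vector space model.

    alpha = (w_1, ..., w_n) holds the keyword weights; beta = (x_1 w_1, ..., x_n w_n)
    scales each weight by the keyword's occurrence count x_i in the page.
    """
    counts = Counter(page_words)
    keywords = sorted(topic_weights)
    alpha = [topic_weights[k] for k in keywords]
    beta = [counts[k] * topic_weights[k] for k in keywords]
    return cosine(alpha, beta)


def is_relevant(page_words, topic_weights, threshold=0.5):
    """A page counts as topic-relevant when its relevance exceeds the threshold."""
    return topic_relevance(page_words, topic_weights) > threshold


# Hypothetical topic description and pages.
weights = {"crawler": 1.0, "search": 1.0}
balanced = topic_relevance(["crawler", "search", "crawler", "search"], weights)
one_sided = topic_relevance(["crawler", "crawler"], weights)
off_topic = topic_relevance(["weather", "news"], weights)
```

In this sketch a page that mentions every topic keyword in proportion to its weight scores close to 1, a page that mentions only one of two equally weighted keywords scores about 0.71, and a page with no topic keywords scores 0.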
In the large-scale network data real-time crawling system of the present invention, setting the threshold for the relevance between the current web page and the topic comprises:
Performing random sampling periodically to obtain a predetermined number of original web page documents, analyzing the relevance of the web pages manually, and calculating the accuracy.
Measuring the accuracy over multiple rounds. If the fluctuation of the accuracy over the predetermined number of rounds is less than a preset error value, the threshold is lowered to improve crawler coverage; if the fluctuation is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy.
Repeating the multi-round accuracy measurement until the desired threshold is obtained.
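One adjustment step of this procedure might look like the sketch below. The source does not say how the fluctuation is measured or by how much the threshold moves, so two assumptions are made and labeled in the code: fluctuation is taken as the difference between the highest and lowest accuracy in the window, and the threshold moves by a fixed `step`. All names are hypothetical.

```python
def sample_accuracy(manual_labels):
    """Accuracy of one sampling round.

    `manual_labels` holds one boolean per sampled page: True when the human
    reviewer confirms that the page kept by the crawler is topic-relevant.
    """
    return sum(manual_labels) / len(manual_labels)


def adjust_threshold(threshold, accuracies, preset_error, step=0.05):
    """Return the new relevance threshold after a window of accuracy rounds.

    Assumption: fluctuation = max - min of the accuracies in the window.
    Assumption: the threshold moves by a fixed `step` and stays within [0, 1].
    """
    fluctuation = max(accuracies) - min(accuracies)
    if fluctuation < preset_error:
        new_threshold = threshold - step   # stable accuracy: widen coverage
    else:
        new_threshold = threshold + step   # unstable accuracy: tighten filter
    return min(1.0, max(0.0, new_threshold))


# Made-up review results for illustration.
round_accuracy = sample_accuracy([True, True, False, True])
lowered = adjust_threshold(0.5, [0.90, 0.91, 0.92], preset_error=0.05)
raised = adjust_threshold(0.5, [0.70, 0.90, 0.80], preset_error=0.05)
```

Calling `adjust_threshold` once per window and feeding the result into the next window reproduces the repeat-until-satisfied loop in the text; the stopping criterion itself is left to the operator, as in the source.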
In the large-scale network data real-time crawling system of the present invention, the hyperlink importance calculation module comprises:
The page importance is calculated with the following formula:
$p_u = w_1 \cdot \cos\langle\alpha,\beta\rangle + w_2 \cdot \mathrm{Hub}(u)$, where $\mathrm{Hub}(u)$ denotes the link importance of the web page, $CL(u)$ denotes the number of links found, and its maximum is denoted $C_{max}$. The weight of page relevance is denoted m1 and the weight of page link degree is denoted m2 (written $w_1$ and $w_2$ in the formula); m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
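The weighted combination can be illustrated as follows. The expression for Hub(u) is not legible in this text, so the sketch assumes the normalization Hub(u) = CL(u) / C_max, which is only one plausible reading of the surrounding definitions. The function names and the example numbers are hypothetical.

```python
def hub_score(link_count, max_link_count):
    """Link importance of a page.

    Assumption (not confirmed by the source): Hub(u) = CL(u) / C_max, the page's
    link count normalized by the largest link count observed.
    """
    if max_link_count <= 0:
        return 0.0
    return link_count / max_link_count


def page_importance(relevance, link_count, max_link_count, w1=0.6):
    """p_u = w1 * cos<alpha, beta> + w2 * Hub(u), with 0 < w1, w2 < 1 and w1 + w2 = 1."""
    if not 0.0 < w1 < 1.0:
        raise ValueError("w1 must lie strictly between 0 and 1")
    w2 = 1.0 - w1  # enforces w1 + w2 = 1
    return w1 * relevance + w2 * hub_score(link_count, max_link_count)


# Made-up inputs: relevance 0.8, 50 links on the page, 100 links at most.
importance = page_importance(0.8, link_count=50, max_link_count=100, w1=0.5)
```

Deriving the second weight from the first keeps the constraint m1 + m2 = 1 from the text satisfied by construction.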
In the large-scale network data real-time crawling system of the present invention, the integration module further comprises:
Obtaining the HTML web document and analyzing whether it is video HTML. If it is, it is then determined whether the URL belongs to a domain name that is permitted to be crawled. If it does not, processing ends directly; if it does, the domain name of the URL is obtained, together with the video parsing class corresponding to that domain name. It is then determined whether the video parsing class is empty: if it is empty, processing ends; if it is not, it is further determined whether the URL is a playback address of the video HTML. If it is not a playback address, processing ends; if it is, the list of real video download addresses is obtained from the URL and the content, and when that list is not empty it is returned and processing ends. If the document is not video HTML, the information in the text is annotated and passed to the web page hyperlink information integration module.
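The decision flow for a page already classified as video HTML can be sketched as below. Everything concrete here is an assumption made for illustration: the `ExampleVideoParser` class, its `is_play_url` and `download_urls` methods, the `/play/` URL convention, and the `src=` line format are hypothetical and do not come from the source.

```python
from urllib.parse import urlparse


class ExampleVideoParser:
    """Hypothetical per-domain video parsing class used only for this sketch."""

    def is_play_url(self, url):
        # Toy rule: playback addresses contain "/play/".
        return "/play/" in url

    def download_urls(self, url, html):
        # Toy rule: every content line starting with "src=" is a download address.
        return [line[4:] for line in html.splitlines() if line.startswith("src=")]


def handle_video_page(url, html, allowed_domains, parsers):
    """Follow the decision flow for a page already classified as video HTML.

    Returns the list of real download addresses, or None when the flow ends early.
    """
    domain = urlparse(url).hostname
    if domain not in allowed_domains:   # domain is not permitted to be crawled
        return None
    parser = parsers.get(domain)        # video parsing class for this domain
    if parser is None:                  # no parsing class registered
        return None
    if not parser.is_play_url(url):     # not a playback address
        return None
    urls = parser.download_urls(url, html)
    return urls or None                 # an empty list also ends the flow


# Hypothetical configuration and page content.
allowed = {"video.example.com"}
registry = {"video.example.com": ExampleVideoParser()}
page = "title=demo\nsrc=http://cdn.example.com/v1.mp4\n"
found = handle_video_page("http://video.example.com/play/1", page, allowed, registry)
```

Keeping one parsing class per domain in a dictionary mirrors the lookup by domain name described in the text and lets new video sites be supported by registering another class.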
Compared with the prior art, the large-scale network data real-time crawling system provided by the present invention has the following beneficial effects. The web page relevance calculation module analyzes topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value. The hyperlink importance calculation module uses the calculated numerical information as one basis for judging relevance, again by quantitative analysis: the number of links contained in the current page indicates how many links the page has, and if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement. The desired network data can therefore be obtained from massive big data. In addition, because the integration module obtains HTML web documents and analyzes whether they are video HTML, ordinary web pages can be distinguished from video web pages, which makes crawling more efficient.
It should be understood that a person of ordinary skill in the art can make various other corresponding changes and modifications according to the technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims (6)

1. A large-scale network data real-time crawling system, characterized in that it comprises the following modules:
An initial seed optimization module, for entering the seed links of a website; returning the best results to the user by means of meta search; mining the links that rank highest in topic relevance; and periodically adding qualifying web page links to the seed set, which serves as the initial seed set;
An integration module, for obtaining HTML web documents and annotating the information in their text;
A web page hyperlink information integration module, for storing the descriptive data of web page hyperlinks, wherein, given two pages A and B, if a hyperlink in A points to B, the information content of B is assumed by default to be of higher quality than that of A, and if, when a user queries for information, hyperlinks point to both A and B at the same time, the information quality of A and B is assumed by default to be the same;
A web page relevance calculation module, for analyzing topic relevance numerically, so that the degree of relevance is expressed as a concrete numerical value;
A hyperlink importance calculation module, for using the calculated numerical information as one basis for judging relevance, again performing quantitative analysis with concrete numerical values, wherein the number of links contained in the current page indicates how many links the page has, and if that number reaches a preset value, the topic resources the page contains are deemed to meet the preset requirement.
2. The large-scale network data real-time crawling system according to claim 1, characterized in that the web page relevance calculation module comprises:
A single traversal unit, which takes as input the character string of the web page text, defined as m-scontent; searches in a loop, the loop condition being that the corresponding marker is found, the marker being defined as delimiters; locates interception position 1 with a search function, defined as Find(); locates interception position 2 with the same search function; extracts the text between positions 1 and 2 and outputs it as a character string, defined as dest; and, when the traversal ends, outputs the character string;
A repetition unit, which repeats the single traversal unit until the information mining points have been mined, and extracts the feature-vector keywords of the mined points by a plain-text classification and clustering algorithm, the algorithm that obtains the topic relevance from the mined information being expressed with a vector space model.
3. The large-scale network data real-time crawling system according to claim 2, characterized in that
The vector space model is expressed as follows:
First, the text information of the web page is analyzed, and the vector $\alpha = (w_1, w_2, \ldots, w_n)$, $i = 1, 2, \ldots, n$, is defined; the number of occurrences of each keyword is counted, and the keywords that occur most frequently are taken as the positioning criterion, the frequency being defined as $x_i$; the components $x_i w_i$ are built, and the page topic vector is defined as $\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$; the cosine of the two vectors then reflects how frequently the keywords occur, and the relevance formula is as follows:
$$\cos\langle\alpha,\beta\rangle = \frac{\alpha\cdot\beta}{|\alpha|\,|\beta|} = \frac{\sum_{i=1}^{n} x_i w_i^{2}}{\sqrt{\sum_{i=1}^{n} w_i^{2}}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^{2}}}$$
The larger the angle between the two vectors, the lower the frequency and the lower the relevance to the topic; the smaller the angle, the higher the frequency of occurrence and the higher the relevance to the topic;
A threshold is set for the relevance between the current web page and the topic; a value above the threshold means the page is relevant to the topic, otherwise it is not; web pages relevant to the topic are classified and saved, and submitted to the database for indexing.
4. The large-scale network data real-time crawling system according to claim 3, characterized in that setting the threshold for the relevance between the current web page and the topic comprises:
Performing random sampling periodically to obtain a predetermined number of original web page documents, analyzing the relevance of the web pages manually, and calculating the accuracy;
Measuring the accuracy over multiple rounds, wherein if the fluctuation of the accuracy over the predetermined number of rounds is less than a preset error value, the threshold is lowered to improve crawler coverage, and if the fluctuation is greater than or equal to the preset error value, the threshold is raised to improve crawler accuracy;
Repeating the multi-round accuracy measurement until the desired threshold is obtained.
5. The large-scale network data real-time crawling system according to claim 4, characterized in that the hyperlink importance calculation module comprises:
The page importance is calculated with the following formula:
$p_u = w_1 \cdot \cos\langle\alpha,\beta\rangle + w_2 \cdot \mathrm{Hub}(u)$, where $\mathrm{Hub}(u)$ denotes the link importance of the web page, $CL(u)$ denotes the number of links found, and its maximum is denoted $C_{max}$; the weight of page relevance is denoted m1 and the weight of page link degree is denoted m2 (written $w_1$ and $w_2$ in the formula); m1 and m2 satisfy 0 < m1, m2 < 1 and m1 + m2 = 1.
6. The large-scale network data real-time crawling system according to claim 5, wherein the integration module further comprises:
Obtaining the HTML web document and analyzing whether it is video HTML; if it is, determining whether the URL belongs to a domain name that is permitted to be crawled; if it does not, ending directly; if it does, obtaining the domain name of the URL and the video parsing class corresponding to that domain name; determining whether the video parsing class is empty; if it is empty, ending; if it is not, further determining whether the URL is a playback address of the video HTML; if it is not a playback address, ending; if it is, obtaining the list of real video download addresses from the URL and the content, and, when that list is not empty, returning it and ending; and, if the document is not video HTML, annotating the information in the text and passing it to the web page hyperlink information integration module.
CN201610507120.XA 2016-07-01 2016-07-01 A kind of large scale network data crawl system in real time Pending CN106126705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610507120.XA CN106126705A (en) 2016-07-01 2016-07-01 A kind of large scale network data crawl system in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610507120.XA CN106126705A (en) 2016-07-01 2016-07-01 A kind of large scale network data crawl system in real time

Publications (1)

Publication Number Publication Date
CN106126705A true CN106126705A (en) 2016-11-16

Family

ID=57467666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610507120.XA Pending CN106126705A (en) 2016-07-01 2016-07-01 A kind of large scale network data crawl system in real time

Country Status (1)

Country Link
CN (1) CN106126705A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186676A (en) * 2013-04-08 2013-07-03 湖南农业大学 Method for searching thematic knowledge self growth form focused crawlers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王艳阁: "主题微博爬虫的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
邱荷花: "基于Hadhoop的视频爬虫系统的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480275A (en) * 2017-08-21 2017-12-15 成都西维数码科技有限公司 A kind of harmful information monitoring method and system based on big data
CN108052517A (en) * 2017-10-19 2018-05-18 福建中金在线信息科技有限公司 Data search method and system
CN108710672A (en) * 2018-05-17 2018-10-26 南京大学 A kind of Theme Crawler of Content method based on increment bayesian algorithm
CN108710672B (en) * 2018-05-17 2020-04-14 南京大学 Theme crawler method based on incremental Bayesian algorithm
CN108985068A (en) * 2018-06-26 2018-12-11 广东电网有限责任公司信息中心 Loophole quick sensing, positioning and the method and system of verifying
CN109446396A (en) * 2018-10-17 2019-03-08 珠海市智图数研信息技术有限公司 A kind of intelligent crawler frame system of line business information
CN109739848A (en) * 2018-12-28 2019-05-10 杭州铭智云教育科技有限公司 A kind of data extraction method
CN109739848B (en) * 2018-12-28 2021-11-09 深圳市科联汇通科技有限公司 Data extraction method
CN109948019A (en) * 2019-01-10 2019-06-28 中央财经大学 A kind of deep layer Network Data Capture method
CN109948019B (en) * 2019-01-10 2021-10-08 中央财经大学 Deep network data acquisition method
CN111767482A (en) * 2020-05-21 2020-10-13 中国地质大学(武汉) Self-adaptive crawling method for focused web crawler
CN111767482B (en) * 2020-05-21 2023-06-06 中国地质大学(武汉) Self-adaptive crawling method for focused web crawlers

Similar Documents

Publication Publication Date Title
CN106126705A (en) A kind of large scale network data crawl system in real time
CN101894170B (en) Semantic relationship network-based cross-mode information retrieval method
US7640488B2 (en) System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages
CN101256596B (en) Method and system for instation guidance
Abebe et al. Generic metadata representation framework for social-based event detection, description, and linkage
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN106484764A (en) User's similarity calculating method based on crowd portrayal technology
CN104008203A (en) User interest discovering method with ontology situation blended in
CN105045901A (en) Search keyword push method and device
CN103678412A (en) Document retrieval method and device
CN102880723A (en) Searching method and system for identifying user retrieval intention
CN105159930A (en) Search keyword pushing method and apparatus
CN102402566A (en) Web user behavior analysis method based on Chinese webpage automatic classification technology
CN104598607A (en) Method and system for recommending search phrase
CN109271514A (en) Generation method, classification method, device and the storage medium of short text disaggregated model
CN106484797A (en) Accident summary abstracting method based on sparse study
CN105939359A (en) Method and device for detecting privacy leakage of mobile terminal
CN107239512A (en) The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination
Ye et al. A web services classification method based on GCN
CN104281619A (en) System and method for ordering search results
CN104346382A (en) Text analysis system and method employing language query
KR100917458B1 (en) Method and system of providing recommended words
Geng et al. Research on improved focused crawler and its application in food safety public opinion analysis
CN104462241A (en) Population property classification method and device based on anchor texts and peripheral texts in URLs
Yang et al. A topic-specific web crawler with web page hierarchy based on HTML Dom-Tree

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161116