CN106484895A - The accurate crawling method of internet information based on multiple analysis - Google Patents

Publication number: CN106484895A
Authority: CN (China)
Prior art keywords: content, information, page, crawling, crawls
Application number: CN201610915910.1A
Other languages: Chinese (zh)
Inventors: 陈文康, 李江伟, 赵光俊, 李欣荣, 王汝英, 柳长俊, 宋洋, 刘圣通, 彭晓武
Original Assignee: 天津市普迅电力信息技术有限公司, 国网信息通信产业集团有限公司
Priority date: 2016-10-21 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2016-10-21
Publication date: 2017-03-08
Application filed by 天津市普迅电力信息技术有限公司 and 国网信息通信产业集团有限公司
Priority to CN201610915910.1A
Publication of CN106484895A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The present invention relates to a method for accurately crawling Internet information based on multiple analysis, comprising the following steps. In the first step, page information is crawled: the page information is first divided into static page information and dynamic page information, and each kind is then crawled separately. In the second step, the crawled information is parsed: it is first classified into structured information that conforms to the DOM model and unstructured information, and the parsing rules for each class are then analyzed separately. In the third step, once the parsing rules have been determined, the crawl tasks are processed as multithreaded tasks, and the cycle frequency of each multithreaded task is configured. The invention offers high crawling accuracy, strong flexibility and strong mobility; it is simple to operate, its task visualization is intuitive, and it facilitates later analysis of the crawled data.

Description

The accurate crawling method of internet information based on multiple analysis

Technical field

The invention belongs to the technical field of obtaining accurate information data in a complex and changing Internet environment, and in particular relates to a method for accurately crawling Internet information based on multiple analysis.

Background art

The current web world is technologically advanced and also extremely complex. Websites of every kind adopt some content-update mechanism to keep their content current and attract more users. To counter the various hacker attacks on the Internet, they also design complete sets of anti-hacking measures, which include the page-content anti-crawling techniques that every major website deploys.

The present era is unquestionably one in which information is king: whoever obtains first-hand information on demand and in time holds a first-hand basis for judgment in the commercial market. To meet this specific demand, a complete solution needs to be provided for market users. The demand is also increasingly shared by individuals, including media staff, commentators and enterprise marketing personnel.

Summary of the invention

The object of the present invention is to overcome the deficiencies and incompleteness of the prior art by providing a method for accurately crawling Internet information based on multiple analysis.

The present invention solves its technical problem by adopting the following technical scheme:

A method for accurately crawling Internet information based on multiple analysis, comprising the following steps:

In the first step, page information is crawled: the page information is first divided into static page information and dynamic page information, each kind is then crawled separately, and a topic is assigned to each page to facilitate later data analysis;

In the second step, the crawled information is parsed: it is first classified into structured information that conforms to the DOM model and unstructured information, and each class is then processed separately, with the following specific steps:

(1) determine whether the crawled information conforms to HTML format;

(2) load the source code of the crawled content into a Document object;

(3) analyze the tag rules of the crawled content and find a tag-definition logic that is unique;

(4) determine the parsing expression using the parsing principles of the DOM model;

In the third step, once the parsing rules have been determined, the crawl tasks are processed as multithreaded tasks, and the cycle frequency of each multithreaded task is configured.

Moreover, the specific method of crawling static page information and dynamic page information separately in the first step is:

(1) Crawling static page information

1. select the web page whose content is to be crawled and determine its specific web address;

2. confirm that the content to be crawled on this page is static content, that is, page content that does not change in real time when the page is updated;

3. read the data of the remote static resource through an input stream; to avoid being blocked by anti-crawling techniques, disguise the request by setting the Referer and User-Agent headers;

4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.

(2) Crawling dynamic page information

1. select the web page whose content is to be crawled and determine its specific web address;

2. confirm that the content to be crawled on this page is dynamic content, that is, page content that changes in real time when the page is updated;

3. obtain the dynamic content using a loading technique based on an interpreted scripting language, specifically the open-source PhantomJS plug-in or HtmlUtil, to drill down and read the page content from the system server side; to avoid being blocked by anti-crawling techniques, disguise the request by setting the Referer and User-Agent headers;

4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.

Moreover, the tag-definition logic in step (3) of the second step specifically includes the iterative relationship between parent nodes and child nodes and the distinctive attribute parameters of nodes at the same level.

Moreover, both the parsing of structured information that conforms to the DOM model and the parsing of unstructured information in the second step require the following further processing:

1. if there are many elements at the same level, so that the content to be crawled cannot be pinpointed, analyze the ordering among the elements at that level and determine the sequence number of the content to be crawled;

2. if the crawled content contains superfluous content, delete the superfluous content so that the parsed content matches the content to be crawled.

Moreover, in the third step the crawl tasks are processed as multithreaded tasks once the parsing rules have been determined, and the cycle frequency of each multithreaded task is configured, specifically as follows:

(1) The following kinds of cycle frequency can be selected:

1. a specific time on a given day, accurate to the millisecond;

2. certain days of a given week;

3. certain days of a given month;

4. certain days of a given year;

5. repeat a specified number of times;

6. repeat until a specified time/date;

7. repeat indefinitely;

8. repeat at a specified interval.

(2) Based on the parsing rules and the configured crawl-time cycle frequency, each parsing rule is treated as one task run by one thread, and multiple tasks with multiple threads can be configured for the same content. The implementation steps are:

1. create a multithreaded thread pool of variable capacity, which expands automatically according to the number of threads;

2. start the thread and configure thread persistence;

3. after the crawl task finishes, the thread is reclaimed automatically and its resources are released;

4. the thread automatically starts its next run according to the configured cycle frequency.

The advantages and positive effects of the present invention are:

1. The invention achieves high crawling accuracy.

2. The invention is highly flexible and mobile.

3. The invention is simple to operate, and its task visualization is intuitive.

Brief description of the drawings

Fig. 1 is a schematic diagram of the architecture of the hardware used by the method of the invention.

Detailed description of the embodiments

The embodiments of the invention are further described below. The following embodiments are descriptive, not restrictive, and do not limit the scope of protection of the invention.

A method for accurately crawling Internet information based on multiple analysis. As shown in Fig. 1, the hardware system used by the method comprises multiple external websites, information extranet equipment connected to the external websites, and information intranet equipment connected to the information extranet equipment through an isolation device. The information extranet equipment comprises a load balancer and a corresponding number of information extranet servers; the information intranet equipment comprises a corresponding number of information intranet servers, a structured database, a system server and a back-end management terminal. The method comprises the following steps:

In the first step, page information is crawled: the page information is first divided into static page information and dynamic page information, each kind is then crawled separately, and a topic is assigned to each page to facilitate later data analysis;

(1) Crawling static page information comprises the following steps:

1. select the web page whose content is to be crawled and determine its specific web address;

2. confirm that the content to be crawled on this page is static content, that is, page content that does not change in real time when the page is updated;

3. read the data of the remote static resource through an input stream; to avoid being blocked by anti-crawling techniques, some disguise is needed, for example setting the Referer and User-Agent headers;

4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.
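As an illustration of step 3 above, the following is a minimal sketch, not the patented implementation, of reading a remote static page as a stream while sending Referer and User-Agent headers. It uses Python's standard `urllib.request`; the function names, the default User-Agent string and the example URLs are assumptions made for illustration only.

```python
import urllib.request

# Hypothetical header value; a real crawl would pick one suited to the target site.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_request(url, referer, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request that carries Referer and User-Agent headers."""
    return urllib.request.Request(
        url,
        headers={"Referer": referer, "User-Agent": user_agent},
    )


def fetch_static_page(url, referer, timeout=10.0):
    """Read the remote static resource as a byte stream and decode it to text."""
    request = build_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as stream:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = stream.headers.get_content_charset() or "utf-8"
        return stream.read().decode(charset, errors="replace")
```

Building the request separately from sending it keeps the header logic checkable without network access; `fetch_static_page("https://example.com/news.html", "https://example.com/")` would perform the actual read.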

(2) Crawling dynamic page information comprises the following steps:

1. select the web page whose content is to be crawled and determine its specific web address;

2. confirm that the content to be crawled on this page is dynamic content, that is, page content that changes in real time when the page is updated;

3. obtain the dynamic content using a loading technique based on an interpreted scripting language, for example the open-source PhantomJS plug-in or HtmlUtil, to drill down and read the page content from the system server side; to avoid being blocked by anti-crawling techniques, some disguise is needed, for example setting the Referer and User-Agent headers;

4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.
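The completeness check in step 4, which is the same for static and dynamic pages, could be sketched as follows. This is a heuristic under stated assumptions rather than the patent's own test: "standard HTML format" is approximated by the presence of `<html>` and `<body>` tags, and the content to be crawled is identified by a caller-supplied marker string. All names are hypothetical.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect the names of all start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        self.tags.add(tag)


def looks_like_html(source):
    """Heuristic: treat the text as HTML if it has <html> and <body> tags."""
    collector = _TagCollector()
    collector.feed(source)
    collector.close()
    return {"html", "body"} <= collector.tags


def crawl_succeeded(source, expected_marker):
    """The page must look like HTML and contain the content we want."""
    return looks_like_html(source) and expected_marker in source
```

A response such as a JSON error body returned by an anti-crawling layer fails the first condition, and a page served without the target fragment fails the second.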

In the second step, the crawled information is parsed: it is first classified into structured information that conforms to the DOM model and unstructured information, and each class is then processed separately, with the following specific steps:

(1) determine whether the crawled information conforms to HTML format;

(2) load the source code of the crawled content into a Document object;

(3) analyze the tag rules of the crawled content and find a tag-definition logic that is unique (including the iterative relationship between parent nodes and child nodes and the distinctive attribute parameters of nodes at the same level);

(4) determine the parsing expression using the parsing principles of the DOM model;

(5) if there are many elements at the same level, so that the content to be crawled cannot be pinpointed, analyze the ordering among the elements at that level and determine the sequence number of the content to be crawled;

(6) if the crawled content contains superfluous content, delete the superfluous content so that the parsed content matches the content to be crawled.
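Steps (2) to (6) can be sketched on a small example. The sketch below swaps in Python's `xml.etree.ElementTree` as the Document object, which only accepts well-formed markup; real-world HTML would normally need a more tolerant parser. The sample page, the path expressions and the `extract` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """
<html><body>
  <div id="nav"><ul><li>Home</li></ul></div>
  <div id="headlines">
    <ul>
      <li>First story</li>
      <li>Second story <span class="ad">sponsored</span></li>
      <li>Third story</li>
    </ul>
  </div>
</body></html>
"""


def extract(source, unique_path, index, strip_path=None):
    """Return the text of one same-level element, minus unwanted children.

    unique_path -- path built from a uniquely identifying parent (steps 3, 4)
    index       -- sequence number among the same-level matches (step 5)
    strip_path  -- direct children to delete as superfluous content (step 6)
    """
    # Step (2): load the source into a document tree; raises ParseError
    # if the markup is not well formed.
    document = ET.fromstring(source.strip())
    siblings = document.findall(unique_path)
    target = siblings[index]
    if strip_path is not None:
        # Removing a child also drops any text that trails it (its tail).
        for extra in target.findall(strip_path):
            target.remove(extra)
    return "".join(target.itertext()).strip()
```

Here the `id="headlines"` attribute plays the role of the unique tag definition: the same `ul/li` structure also occurs under `id="nav"`, so the parent's attribute is what makes the expression unambiguous, and the index then selects one item among the list entries.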

In the third step, once the parsing rules have been determined, the crawl tasks are processed as multithreaded tasks, and the cycle frequency of each multithreaded task is configured, specifically as follows:

(1) The following kinds of cycle frequency can be configured:

1. a specific time on a given day, accurate to the millisecond;

2. certain days of a given week;

3. certain days of a given month;

4. certain days of a given year;

5. repeat a specified number of times;

6. repeat until a specified time/date;

7. repeat indefinitely;

8. repeat at a specified interval;
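The repetition options 5 to 8 can be modeled with a small generator of run times. This is a sketch under assumptions: the parameter names are hypothetical, the calendar-based options 1 to 4 are not covered, and a production system would more likely rely on an existing scheduling library.

```python
from datetime import datetime, timedelta


def run_times(start, interval, count=None, until=None):
    """Yield run times beginning at `start`, one every `interval` (option 8).

    count -- stop after this many runs (option 5); None means no limit
    until -- stop once a run time would fall after this moment (option 6)
    With neither set, the schedule repeats indefinitely (option 7).
    """
    produced = 0
    current = start
    while count is None or produced < count:
        if until is not None and current > until:
            return
        yield current
        produced += 1
        current += interval
```

Because the function is a generator, the indefinite case costs nothing until the caller asks for the next run time, for example with `next(schedule)`.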

(2) Based on the parsing rules determined above and the crawl-time cycle frequency, each parsing rule is treated as one task run by one thread, and multiple tasks with multiple threads can be configured for the same content. The implementation steps are:

1. create a multithreaded thread pool of variable capacity, which expands automatically according to the number of threads;

2. start the thread and configure thread persistence;

3. after the crawl task finishes, the thread is reclaimed automatically and its resources are released;

4. the thread automatically starts its next run according to the configured cycle frequency.
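The thread handling above might look like the following sketch, which uses Python's `concurrent.futures.ThreadPoolExecutor`. Note the substitution: this pool creates worker threads on demand only up to `max_workers`, so it approximates rather than reproduces a pool of variable capacity (in Java, `Executors.newCachedThreadPool()` would be a closer match). The shape of `rules` and all names are assumptions, and thread persistence is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_rules(rules, rounds=1, interval_seconds=0.0, max_workers=32):
    """Run every parsing rule as its own task, `rounds` times in total.

    rules -- mapping of rule name to a zero-argument callable (hypothetical)
    Returns one {rule name: result} dict per round.
    """
    history = []
    # Worker threads are created on demand, up to max_workers, and are all
    # released when the `with` block exits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for round_number in range(rounds):
            futures = {name: pool.submit(task) for name, task in rules.items()}
            history.append({name: f.result() for name, f in futures.items()})
            # Wait for the configured interval before the next round.
            if round_number + 1 < rounds:
                time.sleep(interval_seconds)
    return history
```

Each rule runs independently within a round, so a slow rule does not delay the start of the others; it only delays the start of the next round, because results are collected before sleeping.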

The whole crawling process, from initially locating the web page whose content is to be crawled, through crawling its dynamic and static content and analyzing the parsing rules of the crawled source code, to configuring precise automatic crawl tasks, forms a tightly linked chain that realizes the method for accurately crawling Internet information after multiple analysis.

Claims (5)

1. A method for accurately crawling Internet information based on multiple analysis, characterized in that the method comprises the following steps:
In the first step, page information is crawled: the page information is first divided into static page information and dynamic page information, each kind is then crawled separately, and a topic is assigned to each page to facilitate later data analysis;
In the second step, the crawled information is parsed: it is first classified into structured information that conforms to the DOM model and unstructured information, and each class is then processed separately, with the following specific steps:
(1) determine whether the crawled information conforms to HTML format;
(2) load the source code of the crawled content into a Document object;
(3) analyze the tag rules of the crawled content and find a tag-definition logic that is unique;
(4) determine the parsing expression using the parsing principles of the DOM model;
In the third step, once the parsing rules have been determined, the crawl tasks are processed as multithreaded tasks, and the cycle frequency of each multithreaded task is configured.
2. The method for accurately crawling Internet information based on multiple analysis according to claim 1, characterized in that the specific method of crawling static page information and dynamic page information separately in the first step is:
(1) Crawling static page information
1. select the web page whose content is to be crawled and determine its specific web address;
2. confirm that the content to be crawled on this page is static content, that is, page content that does not change in real time when the page is updated;
3. read the data of the remote static resource through an input stream; to avoid being blocked by anti-crawling techniques, disguise the request by setting the Referer and User-Agent headers;
4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.
(2) Crawling dynamic page information
1. select the web page whose content is to be crawled and determine its specific web address;
2. confirm that the content to be crawled on this page is dynamic content, that is, page content that changes in real time when the page is updated;
3. obtain the dynamic content using a loading technique based on an interpreted scripting language, specifically the open-source PhantomJS plug-in or HtmlUtil, to drill down and read the page content from the system server side; to avoid being blocked by anti-crawling techniques, disguise the request by setting the Referer and User-Agent headers;
4. check the completeness of the content read: if the content is in standard HTML format and the crawled source code contains the content to be crawled, the page content has been crawled successfully.
3. The method for accurately crawling Internet information based on multiple analysis according to claim 1, characterized in that the tag-definition logic in step (3) of the second step specifically includes the iterative relationship between parent nodes and child nodes and the distinctive attribute parameters of nodes at the same level.
4. The method for accurately crawling Internet information based on multiple analysis according to claim 1, characterized in that both the parsing of structured information that conforms to the DOM model and the parsing of unstructured information in the second step require the following further processing:
1. if there are many elements at the same level, so that the content to be crawled cannot be pinpointed, analyze the ordering among the elements at that level and determine the sequence number of the content to be crawled;
2. if the crawled content contains superfluous content, delete the superfluous content so that the parsed content matches the content to be crawled.
5. The method for accurately crawling Internet information based on multiple analysis according to claim 1, characterized in that in the third step the crawl tasks are processed as multithreaded tasks once the parsing rules have been determined, and the cycle frequency of each multithreaded task is configured, specifically as follows:
(1) The following kinds of cycle frequency can be selected:
1. a specific time on a given day, accurate to the millisecond;
2. certain days of a given week;
3. certain days of a given month;
4. certain days of a given year;
5. repeat a specified number of times;
6. repeat until a specified time/date;
7. repeat indefinitely;
8. repeat at a specified interval;
(2) Based on the parsing rules and the configured crawl-time cycle frequency, each parsing rule is treated as one task run by one thread, and multiple tasks with multiple threads can be configured for the same content. The implementation steps are:
1. create a multithreaded thread pool of variable capacity, which expands automatically according to the number of threads;
2. start the thread and configure thread persistence;
3. after the crawl task finishes, the thread is reclaimed automatically and its resources are released;
4. the thread automatically starts its next run according to the configured cycle frequency.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610915910.1A CN106484895A (en) 2016-10-21 2016-10-21 The accurate crawling method of internet information based on multiple analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610915910.1A CN106484895A (en) 2016-10-21 2016-10-21 The accurate crawling method of internet information based on multiple analysis

Publications (1)

Publication Number Publication Date
CN106484895A (en) 2017-03-08

Family

ID=58270307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610915910.1A CN106484895A (en) 2016-10-21 2016-10-21 The accurate crawling method of internet information based on multiple analysis

Country Status (1)

Country Link
CN (1) CN106484895A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488675A (en) * 2013-07-11 2014-01-01 哈尔滨工程大学 Automatic precise extraction device for multi-webpage news comment contents
CN104915387A (en) * 2015-05-25 2015-09-16 成都视达科信息技术有限公司 Internet website static state page processing system and method
CN104965901A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Method and apparatus for grabbing content of target page
CN105183886A (en) * 2015-09-25 2015-12-23 中国民生银行股份有限公司 Webpage content extraction method and device
US20160055243A1 (en) * 2014-08-22 2016-02-25 Ut Battelle, Llc Web crawler for acquiring content

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092826A (en) * 2017-03-24 2017-08-25 北京国舜科技股份有限公司 Web page contents real-time safety monitoring method
CN107092826B (en) * 2017-03-24 2020-02-21 北京国舜科技股份有限公司 Webpage content safety real-time monitoring method
CN107943588A (en) * 2017-11-22 2018-04-20 用友金融信息技术股份有限公司 Data processing method, system, computer equipment and readable storage medium storing program for executing

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination