CN103927370A

CN103927370A - Network information batch acquisition method of combined text and picture information

Info

Publication number: CN103927370A
Application number: CN201410166752.5A
Authority: CN
Inventors: 唐宇波; 夏平嵩
Original assignee: Focus Technology Co Ltd
Current assignee: Focus Technology Co Ltd
Priority date: 2014-04-23
Filing date: 2014-04-23
Publication date: 2014-07-16
Anticipated expiration: 2034-04-23
Also published as: CN103927370B

Abstract

The invention provides a network information batch acquisition method of combined text and picture information. Through a series of configurations, the method acquires information from target websites, removes duplicates, stores the information in a database, and sends it to a client-designated destination in a client-designated format. The method comprises the following steps: determining the websites from which information is to be acquired, together with the specific URLs and the number of pages of their information list pages; finding the common part of the list page URLs and storing it in list configuration information; during acquisition, reading the URL common part from the list configuration information, deriving the serial numbers of all list pages from the total number of list pages, and combining them into the URLs of all list pages to be acquired from the target website; capturing detail page content according to the detail page link addresses stored in a link library to be captured; processing the pictures in the detail page content; and, after the detail page content is captured, exporting the content data to a designated interface.

Description

A network information batch acquisition method of combined text and picture information

Technical field

The present invention applies to the field of Internet technology and relates to a network information batch acquisition method of combined text and picture information.

Technical background

With the rapid development of the Internet, a large amount of varied information has accumulated online, such as domestic news, potential customer information, competitor pricing, real-time financial information, statistical reports, industry analysis reports, and supply-and-demand information. For an enterprise, analyzing this information together with its internal business data greatly assists management decisions. In addition, after organizing and digesting this information, an enterprise can publish it on its own website, which enriches the website's content and improves the visitor experience.

Many tools can already acquire web page content, but they mainly target text and do not acquire the picture information of web pages effectively. A reliable and effective method is also lacking for acquiring network information in batches and for judging whether acquired content is a duplicate.

The patent application "Regular automatic capturing method of cyber journalism information" (application number CN201210402435.X) implements timed acquisition of news information and can, through configuration, save the content of a target website to a file server. However, that method does not handle abnormal conditions during acquisition, cannot correctly identify redundant code in information pages, and does not process the pictures in web pages.

Therefore, acquiring the picture information and text information of web pages together, reliably and in batches, is a problem that urgently needs to be solved.

Summary of the invention

To address the problems in the prior art, this patent provides a network information batch acquisition method of combined text and picture information. Through a series of configurations, it acquires information from target websites, removes duplicates, stores the information in a database, and sends it to a client-designated destination in a client-designated format.

A network information batch acquisition method of combined text and picture information comprises:

1. Determine the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of these list pages.

Multiple websites can be selected for batch acquisition. The acquisition time, acquisition mode, and acquisition content of the websites are scheduled according to the time period. During peak network hours, serial acquisition is used: acquisition for the next website starts only after acquisition for the current website is complete. During off-peak hours, parallel acquisition is used: acquisition runs on multiple websites simultaneously, which keeps acquisition efficient and makes efficient use of resources.

2. From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information. The serial number information of these list pages is also saved in the list configuration information.

3. For the first acquisition, the system reads the URL common part from the list configuration information and derives the serial numbers of all list pages from the total number of list pages, then combines them into the URLs of all list pages to be acquired from the target website.

For subsequent acquisitions, the system reads the URL common part from the list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.

Using these URLs, the system captures the source code of the list pages of the target website and parses it to obtain the detail page link addresses contained in the list pages.

There are 2 ways to obtain detail page link addresses, both of which are saved in the detail page configuration information:

(1) Label mode. First set the start position marker and end position marker of the detail page link addresses contained in the list page source code. These position markers are fixed code in the web page and are saved in the detail page configuration information. Search the list page source code for these position markers, extract the detail page link address between the start position marker and the end position marker, and save it to the link library to be captured.

(2) Specific link mode. First analyze the detail page link addresses contained in the list page source code and, according to the content to be acquired, extract the feature code of the desired detail page link addresses. Then construct the acquisition condition as a regular expression and save it in the detail page configuration information. Obtain all detail page link addresses from the list page source code and match them against the feature code; each address that matches is saved to the link library to be captured.

Each detail page link address obtained is compared with the detail page link addresses saved in the captured link library. If it is not found there, it is saved to the link library to be captured; otherwise it is discarded. This prevents a link from being captured repeatedly.
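The duplicate check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `enqueue_new_links` and the in-memory set and list standing in for the two link libraries are hypothetical, and skipping links that are already queued (not just already captured) is an added assumption that the text does not state.

```python
def enqueue_new_links(found, captured, pending):
    """Queue links from `found` unless they were already captured.

    `captured` stands in for the captured link library (a set of URLs) and
    `pending` for the link library to be captured (a list of URLs).
    Assumption: links that are already queued are skipped as well.
    """
    queued = set(pending)
    added = []
    for url in found:
        if url in captured or url in queued:
            continue  # discard the link so it is not captured twice
        pending.append(url)
        queued.add(url)
        added.append(url)
    return added
```

In a real deployment the two libraries would more likely be database tables, in which case the membership tests become lookups on an indexed URL column.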

Label mode is used to acquire all detail page link addresses, and specific link mode is used to acquire the detail page link addresses that meet content conditions. The two modes are chosen according to the actual situation and can also be used in combination.

Handling of abnormal conditions:

(1) Timeout. Whether a website is reachable cannot be predicted, and access may fail. A timeout is set for the capture process, so that when a website does not respond for a long time the process exits on its own and avoids occupying system resources for a long time.

(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results when accessed domestically than when accessed from abroad. When a capture result is missing information and investigation shows that an IP address restriction is the cause, the system sets up a corresponding proxy server and accesses the website through a server with a different IP to obtain the complete web page content.

(3) Too-frequent access. The abnormality arises because the website is accessed too frequently, which violates the access rules of the target website and causes it to restrict access. In this case a capture frequency is set to slow down capture: the system waits a certain time before each page access to the target website. This avoids the restriction so that data acquisition can proceed normally.
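The three measures above (a timeout, an optional proxy, and a wait before each page access) could be combined in one page fetcher, sketched here with Python's standard `urllib.request`. The name `make_fetcher`, its parameters, and the default values are assumptions for illustration; the patent does not specify concrete values or an implementation.

```python
import time
import urllib.error
import urllib.request


def make_fetcher(timeout_s=10.0, delay_s=1.0, proxy=None, sleep=time.sleep):
    """Build a fetch(url) function with a timeout, a per-page delay,
    and an optional proxy. All default values are illustrative assumptions."""
    handlers = []
    if proxy:
        # Route requests through another server when the target site
        # returns incomplete content to the local IP address.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)

    def fetch(url):
        sleep(delay_s)  # wait before every page access to respect site limits
        try:
            with opener.open(url, timeout=timeout_s) as resp:
                return resp.read()
        except OSError:
            # URLError and socket timeouts are OSError subclasses:
            # give up on this page instead of blocking indefinitely.
            return None

    return fetch
```

Passing `sleep` as a parameter keeps the delay testable; in production the default `time.sleep` applies.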

4. Capture the detail page content according to the detail page link addresses saved in the link library to be captured.

Access the detail page link addresses saved in the link library to be captured one by one and obtain the source code of each detail page.

Detail page content generally includes a title, information content, author, source, and time. Besides this information, the detail page source code contains various HTML code, and each item of information is preceded and followed by its corresponding HTML code. Therefore, to ensure that the acquired detail page content displays normally, the other HTML code is cleaned and filtered out.

For the title, author, source, and time, it is sufficient to remove the HTML code surrounding each item and keep the item itself.

For the information content, the line-break code and picture link addresses are retained and all other HTML code is removed. Any JavaScript code it contains is blocked to keep the acquired content safe.
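One way to realize this cleaning step is with Python's standard `html.parser`. The sketch below keeps text, line breaks, and picture tags, drops everything else, and discards the body of `<script>` elements. The class and function names are hypothetical, and treating `<br>` as the only line-break code is an assumption.

```python
import html
from html.parser import HTMLParser


class ContentCleaner(HTMLParser):
    """Keep text, <br>, and <img src>; drop other tags and all script bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # cleaned output fragments
        self.image_urls = []   # picture link addresses found in the content
        self._in_script = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1
            return
        if self._in_script:
            return
        if tag == "br":
            self.parts.append("<br>")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)
                self.parts.append('<img src="%s">' % html.escape(src, quote=True))

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script -= 1

    def handle_data(self, data):
        if not self._in_script:
            self.parts.append(data)


def clean_content(source):
    """Return (cleaned_html, picture_urls) for a fragment of detail page source."""
    cleaner = ContentCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts), cleaner.image_urls
```

The returned list of picture URLs is what a later picture-processing step would consume.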

After the detail page content is captured successfully, the detail page link address is saved to the captured link library for use in later duplicate checks.

5. Process the pictures in the detail page content.

Pictures are an important part of the information content. When the information content is acquired, only the link addresses of the pictures are obtained; further processing is needed to download the pictures.

Picture capture starts, based on the picture link addresses obtained, only after the text of all detail pages has been captured, which improves the efficiency of the whole process.

First, the picture link addresses are identified and saved to the picture acquisition library for the subsequent picture capture operation.

Picture link addresses are normalized when saved. A picture link address usually carries parameters after the picture file name; when the address is saved, the characters after "?" are removed.
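The normalization of picture link addresses amounts to cutting the query string off the URL. A minimal sketch follows, assuming the "?" itself is dropped along with the parameters; the function name and the sample URL are hypothetical.

```python
def normalize_picture_url(url: str) -> str:
    """Drop the query string so the same picture is stored under one address.

    Assumption: the "?" is removed together with the parameters after it.
    """
    return url.split("?", 1)[0]
```

For example, `normalize_picture_url("http://img.example.com/a.jpg?w=100&h=50")` yields `"http://img.example.com/a.jpg"`, and an address without parameters is returned unchanged.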

Then, according to the picture link addresses saved in the picture acquisition library, the pictures are downloaded locally:

(1) Call the picture processing script on the file server with a picture link address from the picture acquisition library as the parameter. The script contains commands for creating directories, acquiring the picture, saving it to the corresponding directory, renaming the file, and so on.

(2) Execute the picture processing script.

(3) After the script finishes, the corresponding picture has been downloaded and saved to the designated directory.

Because there may be many pictures to save and picture files vary in size, the total capture time cannot be determined in advance. Picture capture is therefore designed to run in parallel, capturing multiple pictures at the same time, which significantly improves efficiency.

Handling of abnormal conditions:

A picture link address may be unavailable, in which case the picture cannot be accessed. In normal processing a timeout is set for picture capture; if the picture file has not been obtained when the timeout elapses, no further attempt is made, which saves system resources and ensures they are used effectively.

6. After the detail page content is captured, export the content data to the designated interface.

Before export, an automatic review is performed, during which the captured content is previewed in web page form.

The content data is exported to the designated interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server; export to files in formats such as TXT and Excel and sending by e-mail are also supported.
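As an illustration of the file export path only, the following sketch writes captured records to a tab-separated TXT file with Python's standard `csv` module. The field names (`title`, `time`, `content`) and the tab-separated layout are assumptions; the patent does not define the export format, and database or e-mail export would need their own adapters.

```python
import csv

# Hypothetical field names; the patent does not fix an export schema.
EXPORT_FIELDS = ("title", "time", "content")


def export_txt(records, fp, fields=EXPORT_FIELDS):
    """Write captured records to an open text file as tab-separated lines."""
    writer = csv.DictWriter(
        fp,
        fieldnames=fields,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # skip keys that are not exported
    )
    writer.writeheader()
    writer.writerows(records)
```

When writing to disk, the file would be opened with `newline=""`, as the `csv` module documentation recommends.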

The beneficial effects of the present invention are as follows:

1. The text and picture information of the information content are combined effectively, so that an item of information is acquired locally in complete form and can be conveniently redisplayed;

2. Several methods are added to handle abnormal conditions during acquisition, ensuring that data acquisition is stable and reliable;

3. Acquisition resources are optimized, improving resource utilization and the efficiency of information acquisition;

4. Multiple acquisition demands are scheduled and processed in a unified way, avoiding duplicated development and improving development efficiency.

Description of the drawings

Fig. 1 is a processing flow chart of the network information batch acquisition method of combined text and picture information of the present embodiment.

Embodiment

The present invention is described in further detail below with reference to the drawing and a specific embodiment.

As shown in Fig. 1, the processing flow of the network information batch acquisition method of combined text and picture information of the present embodiment comprises:

Step 11: Determine the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of these list pages.

Multiple websites can be selected for batch acquisition. During peak network hours, serial acquisition is used: acquisition for the next website starts only after acquisition for the current website is complete. During off-peak hours, parallel acquisition is used: acquisition runs on multiple websites simultaneously, which keeps acquisition efficient and makes efficient use of resources.

For example, serial acquisition is used from 8:00 to 24:00 every day;

parallel acquisition is used from 0:00 to 8:00 every day.
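The time-based choice between serial and parallel acquisition can be sketched as follows, using the 8:00 boundary from the example. The function names and the use of a thread pool for the parallel mode are assumptions; `collect_one` stands for whatever routine acquires a single website.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import time

PEAK_START = time(8, 0)  # from the example: 8:00-24:00 is the peak window


def acquisition_mode(now: time) -> str:
    """Return 'serial' during peak hours (8:00-24:00), otherwise 'parallel'."""
    return "serial" if now >= PEAK_START else "parallel"


def collect_sites(sites, collect_one, now: time):
    """Run collect_one(site) for every site, one by one or concurrently."""
    if acquisition_mode(now) == "serial":
        # Peak hours: finish one website before starting the next.
        return [collect_one(site) for site in sites]
    # Off-peak hours: acquire from all websites at the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        return list(pool.map(collect_one, sites))
```

Both branches return results in the order of `sites`, so callers do not need to know which mode was used.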

Step 12: From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information. The total number of these list pages is also saved in the list configuration information.

For example, consider the Sina real estate market news pages, whose news list has 3 pages. Analysis of the list page URLs gives "http://search.house.sina.com.cn/bj/news/scdt/page(*)/" as the common part of the URLs, and "01 – 03" as the information representing the number of list pages.

Step 13: For the first acquisition, the system reads the URL common part from the list configuration information and derives the serial numbers of all list pages from the total number of list pages, then combines them into the URLs of all list pages to be acquired from the target website.

For subsequent acquisitions, the system reads the URL common part from the list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.

Continuing the example above, the saved common part "http://search.house.sina.com.cn/bj/news/scdt/page(*)/" and the list page serial number information "01 – 03" are combined to form all the list page URLs:

http://search.house.sina.com.cn/bj/news/scdt/page01

http://search.house.sina.com.cn/bj/news/scdt/page02

http://search.house.sina.com.cn/bj/news/scdt/page03
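The URL assembly for the first and later acquisitions can be sketched as below. Several points are assumptions: the function name, the "(*)" placeholder being replaced by a two-digit serial number, the template being written without a trailing slash so that the output matches the three URLs listed above, and the newest pages being the ones with the lowest serial numbers.

```python
def build_list_urls(common_part: str, total_pages: int,
                    first_run: bool, latest: int = 2) -> list[str]:
    """Combine the URL common part with page serial numbers.

    First run: all pages 01..total_pages.
    Later runs: only the `latest` newest pages (assumed to be 01, 02, ...).
    """
    count = total_pages if first_run else min(latest, total_pages)
    return [common_part.replace("(*)", f"{n:02d}") for n in range(1, count + 1)]


TEMPLATE = "http://search.house.sina.com.cn/bj/news/scdt/page(*)"
all_pages = build_list_urls(TEMPLATE, 3, first_run=True)      # page01..page03
recent_pages = build_list_urls(TEMPLATE, 3, first_run=False)  # page01, page02
```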

Using these URLs, the system captures the source code of the list pages and parses it to obtain the detail page link addresses contained in the list pages.

The information in the page source code is as follows; the link in it is the link to be acquired.

The configuration information is: start at "<h3><a href=""; end at "">".

The detail page link address finally obtained is:

http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml
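Extraction between a start marker and an end marker needs nothing more than string search. In the sketch below the function name is hypothetical, the markers are assumed to be `<h3><a href="` and `">`, and the HTML line in the usage note is invented for illustration because the original page source is not reproduced in this text.

```python
def extract_between(source: str, start: str, end: str) -> list[str]:
    """Return every substring of `source` found between `start` and `end`."""
    results = []
    pos = 0
    while True:
        begin = source.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        finish = source.find(end, begin)
        if finish == -1:
            break  # unterminated marker: stop rather than guess
        results.append(source[begin:finish])
        pos = finish + len(end)
    return results
```

Applied to a hypothetical list page line such as `<h3><a href="http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml">...</a></h3>`, the call `extract_between(line, '<h3><a href="', '">')` returns the single detail page link address shown above.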

(2) Specific link mode. First analyze the detail page link addresses contained in the list page source code and, according to the content to be acquired, extract the feature code of the desired detail page link addresses. Then construct the acquisition condition as a regular expression and save it in the detail page configuration information. Obtain all detail page link addresses from the list page source code and match them against the feature code; each address that matches is saved to the link library to be captured.

In the preceding example, the configuration information is: must contain http://bj.house.sina.com.cn/news/, so that all links matching this form are acquired.

Exclusion can also be configured, that is, the content that should not be acquired is specified, for example: must not contain /scdt/|/zhuanti/|page. In this way, several different filter conditions can be used at the same time.
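The include and exclude conditions map directly onto regular expressions. The sketch below collects every `href` value and keeps those that contain the required prefix and match none of the excluded patterns. The function name and the simple `href="..."` pattern are assumptions; real pages may need a proper HTML parser.

```python
import re

HREF_RE = re.compile(r'href="([^"]+)"')  # assumed link syntax in the list page


def filter_links(source: str, must_contain: str, must_not_match: str) -> list[str]:
    """Keep links that contain `must_contain` and do not match `must_not_match`.

    `must_contain` is treated as literal text; `must_not_match` is a regular
    expression such as "/scdt/|/zhuanti/|page".
    """
    include = re.compile(re.escape(must_contain))
    exclude = re.compile(must_not_match)
    return [
        url
        for url in HREF_RE.findall(source)
        if include.search(url) and not exclude.search(url)
    ]
```

With the example configuration, `filter_links(page, "http://bj.house.sina.com.cn/news/", "/scdt/|/zhuanti/|page")` keeps ordinary news links and drops list pages and special-topic pages.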

Handling of abnormal conditions:

(1) Timeout. Whether a website is reachable cannot be predicted, and access may fail. A timeout is set for the capture process, so that when a website does not respond for a long time the process exits on its own and avoids occupying system resources for a long time.

(3) Too-frequent access. The abnormality arises because the website is accessed too frequently, which violates the access rules of the target website and causes it to restrict access. In this case a capture frequency is set to slow down capture: the system waits a certain time before each page access to the target website. This avoids the restriction so that data acquisition can proceed normally.

Step 14: Capture the detail page content according to the detail page link addresses saved in the link library to be captured.

Take the news item http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml as an example.

An excerpt of its source code is as follows:

The configuration information for the title is: start position marker "<title>", end position marker "</title>", with which the title of this item, "[Cover story] Seven major predictions on the fate of agencies", can be acquired;

Likewise, the configuration information for the publication time is: start position marker "<div class="tc zwdate mb15"><span>2", end position marker "</span>", with which the publication time of this item, "2014-04-15 17:35:11", can be acquired;

For the information content, the line-break code and picture tag code are retained and all other HTML code is removed. Any JavaScript code it contains is blocked to keep the acquired content safe.

--text end-->". In addition, the picture link addresses in the information content are changed to local picture link addresses, the line-break tags are retained, and all other HTML tag code is removed. The information content after processing for this example is as follows.

After the detail page content is captured successfully, the detail page link address is saved to the captured link library for duplicate checks in later detail page captures.

Step 15: Process the pictures in the detail page content.

Pictures are an important part of the information content. When the information content is acquired, only the link addresses of the pictures are obtained; further processing is needed to download the pictures.

Picture capture starts, based on the picture link addresses obtained, only after the text of all detail pages has been captured, which improves reliability and the efficiency of the whole process.

For example, if 15 detail pages need to be acquired from the Sina website above and 5 picture link addresses are obtained from them, picture acquisition based on those 5 picture link addresses starts after the text of all 15 detail pages has been acquired.

First, the picture links are identified and saved to the picture acquisition library for the subsequent picture capture operation.

The picture information obtained in this example is:

http://src.house.sina.com.cn/imp/imp/deal/f1/28/b/fd28900e78b71af7280d18693f3_p1_mk1.jpg

Picture link addresses are normalized when saved. A picture link address usually carries parameters after the picture file name; when the address is saved, all parameters after "?" are removed.

Then, according to the picture link addresses saved in the picture acquisition library, the pictures are downloaded locally.

(2) Execute the picture processing script.

(3) The corresponding picture is downloaded and saved to the designated directory.

Because there may be many pictures to save and picture files vary in size, with some pictures several megabytes in size, the total capture time cannot be determined in advance. Picture capture therefore runs in parallel, capturing multiple pictures at the same time, which significantly improves efficiency.

Handling of abnormal conditions:

A picture link address may be unavailable, in which case the picture cannot be accessed. In normal processing a timeout is set for picture capture; if the file has not been obtained when the timeout elapses, no further attempt is made, which saves system resources and ensures they are used effectively.

In the present embodiment this timeout is set to 1 to 1.5 minutes.
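Parallel picture capture with a timeout might look like the following sketch, which uses a thread pool in place of the picture processing script on the file server. The function names, the worker count, and the file-naming rule are assumptions; the 60-second default comes from the lower end of the range given above. Note that the `timeout` argument of `urllib.request.urlopen` limits each blocking network operation rather than the total download time.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download_picture(url, target_dir, timeout_s=60.0):
    """Download one picture into target_dir; return its path, or None on failure."""
    os.makedirs(target_dir, exist_ok=True)
    name = url.split("?", 1)[0].rsplit("/", 1)[-1] or "picture"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            data = resp.read()
    except OSError:
        # Unavailable address or timeout: give up without retrying.
        return None
    path = os.path.join(target_dir, name)
    with open(path, "wb") as out:
        out.write(data)
    return path


def download_all(urls, target_dir, fetch=download_picture, workers=4):
    """Fetch several pictures concurrently; map each URL to its local path or None."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        paths = pool.map(lambda u: fetch(u, target_dir), urls)
        return dict(zip(urls, paths))
```

The `fetch` parameter lets a caller substitute another downloader, for example one that invokes a script on a file server as the method describes.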

Step 16: After the detail page content is captured, export the content data to the designated interface.

Before export, an automatic review is also performed, during which the captured content is previewed in web page form.

The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims

1. A network information batch acquisition method of combined text and picture information, characterized by comprising:

Step 1: determining the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of these list pages;

Step 2: from the URLs of the list pages, finding the common part of these URLs and saving it in list configuration information, and also saving the serial number information of these list pages in the list configuration information;

Step 3: for the first acquisition, the system reads the URL common part from the list configuration information, derives the serial numbers of all list pages from the total number of list pages, and combines them into the URLs of all list pages to be acquired from the target website;

for subsequent acquisitions, the system reads the URL common part from the list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website;

using these URLs, the system captures the source code of the list pages of the target website and parses it to obtain the detail page link addresses contained in the list pages;

Step 4: capturing the detail page content according to the detail page link addresses saved in the link library to be captured;

accessing the detail page link addresses saved in the link library to be captured one by one and obtaining the source code of each detail page;

after the detail page content is captured successfully, saving the detail page link address to the captured link library for use in later duplicate checks;

Step 5: processing the pictures in the detail page content;

pictures are an important part of the information content; when the information content is acquired, only the link addresses of the pictures are obtained, and further processing is needed to download the pictures;

picture capture starts, based on the picture link addresses obtained, only after the text of all detail pages has been captured, which improves the efficiency of the whole process;

Step 6: after the detail page content is captured, exporting the content data to a designated interface.

2. The network information batch acquisition method according to claim 1, characterized in that, in step 1:

multiple websites can be selected for batch acquisition, and the acquisition time, acquisition mode, and acquisition content of the websites are scheduled according to the time period; during peak network hours, serial acquisition is used, in which acquisition for the next website starts only after acquisition for the current website is complete; during off-peak hours, parallel acquisition is used, in which acquisition runs on multiple websites simultaneously, keeping acquisition efficient and making efficient use of resources.

3. The network information batch acquisition method according to claim 1, characterized in that, in step 3:

there are 2 ways to obtain detail page link addresses, both of which are saved in detail page configuration information:

(1) label mode: first setting the start position marker and end position marker of the detail page link addresses contained in the list page source code, these position markers being fixed code in the web page and being saved in the detail page configuration information; searching the list page source code for these position markers, extracting the detail page link address between the start position marker and the end position marker, and saving it to the link library to be captured;

(2) specific link mode: first analyzing the detail page link addresses contained in the list page source code and, according to the content to be acquired, extracting the feature code of the desired detail page link addresses, then constructing the acquisition condition as a regular expression and saving it in the detail page configuration information; obtaining all detail page link addresses from the list page source code and matching them against the feature code, each address that matches being saved to the link library to be captured.

4. The network information batch acquisition method according to claim 3, characterized in that:

each detail page link address obtained is compared with the detail page link addresses saved in the captured link library; if it is not found there, it is saved to the link library to be captured; otherwise it is discarded, which prevents a link from being captured repeatedly;

wherein label mode is used to acquire all detail page link addresses, and specific link mode is used to acquire the detail page link addresses that meet content conditions.

5. The network information batch acquisition method according to claim 4, characterized in that abnormal conditions are handled as follows:

(1) timeout: whether a website is reachable cannot be predicted, and access may fail; a timeout is set for the capture process, so that when a website does not respond for a long time the process exits on its own and avoids occupying system resources for a long time;

(2) missing information: some websites are deployed on multiple IPs and return different results to different IPs; when a capture result is missing information and investigation shows that an IP address restriction is the cause, the system sets up a corresponding proxy server and accesses the website through a server with a different IP to obtain the complete web page content;

(3) too-frequent access: the abnormality arises because the website is accessed too frequently, which violates the access rules of the target website and causes it to restrict access; in this case a capture frequency is set to slow down capture, the system waiting a certain time before each page access to the target website, which avoids the restriction so that data acquisition can proceed normally.

6. The network information batch acquisition method according to claim 1, characterized in that, in step 4:

detail page content generally includes a title, information content, author, source, and time; besides this information, the detail page source code contains various HTML code, and each item of information is preceded and followed by its corresponding HTML code; therefore, to ensure that the acquired detail page content displays normally, the other HTML code is cleaned and filtered out;

for the title, author, source, and time, it is sufficient to remove the HTML code surrounding each item and keep the item itself;

7. The network information batch acquisition method according to claim 1, characterized in that, in step 5:

first, the picture link addresses are identified and saved to the picture acquisition library for the subsequent picture capture operation;

picture link addresses are normalized when saved; a picture link address usually carries parameters after the picture file name, and when the address is saved, the characters after "?" are removed;

(1) calling the picture processing script on the file server with a picture link address from the picture acquisition library as the parameter, the script containing commands for creating directories, acquiring the picture, saving it to the corresponding directory, and renaming the file;

(2) executing the picture processing script;

8. The network information batch acquisition method according to claim 7, characterized in that abnormal conditions are handled as follows:

a picture link address may be unavailable, in which case the picture cannot be accessed; in normal processing a timeout is set for picture capture, and if the picture file has not been obtained when the timeout elapses, no further attempt is made, which saves system resources and ensures they are used effectively.

9. The network information batch acquisition method according to claim 1, characterized in that, in step 6: an automatic review is performed before export, during which the captured content is previewed in web page form.