CN103927370B

CN103927370B - Network information batch acquisition method of combined text and picture information

Info

Publication number: CN103927370B
Application number: CN201410166752.5A
Authority: CN
Inventors: 唐宇波; 夏平嵩
Original assignee: Focus Technology Co Ltd
Current assignee: Focus Technology Co Ltd
Priority date: 2014-04-23
Filing date: 2014-04-23
Publication date: 2015-02-18
Anticipated expiration: 2034-04-23
Also published as: CN103927370A

Abstract

The invention provides a method for batch acquisition of network information that combines text and picture information. Through a series of configurations, the method acquires target network information, removes duplicates, stores the information in a database, and sends it to a client-designated location in a client-designated format. The method includes the following steps: the websites from which information is to be acquired are determined, together with the specific URLs and the number of pages of their information list pages; the common part of the list-page URLs is identified and stored in list configuration information; when information is acquired, the system reads the common URL part from the list configuration information, derives the serial numbers of all list pages from the total number of list pages, and combines them into the URLs of all list pages of the target website to be acquired; detail page content is captured according to the detail page link addresses stored in a to-be-captured link library; pictures in the detail page content are processed; and after the detail page content has been captured, the content data are exported to a designated interface.

Description

A method for batch collection of network information combining text and picture information

Technical field

The present invention belongs to the field of Internet technology and relates to a method for batch collection of network information combining text and picture information.

Technical background

With the rapid development of the Internet, a large amount of varied information has accumulated online, such as domestic news, potential customer information, pricing information for competing products, real-time financial information, statistical reports, industry analysis reports, and supply and demand information. For an enterprise, analyzing this information together with its internal business data is a considerable aid to management decision-making. In addition, after an enterprise has organized and digested this information, publishing it on its own website enriches the site's content and improves the visitor experience.

Many tools can already collect web page content, but they are mainly based on collecting text; the picture information of web pages is not collected effectively. A reliable and effective method is also lacking for collecting network information in batches and for determining whether collected content is a duplicate.

The patent "Method for regular automatic capture of network news information" (patent application number: CN201210402435.X) can collect news information on a schedule and, through configuration, save the content of a target website to a file server. However, that method does not handle abnormal conditions during collection, cannot correctly identify redundant code in information pages, and does not process the pictures in web pages.

Therefore, collecting the picture information and text information of web pages together, reliably and in batches, is a problem that urgently needs to be solved.

Summary of the invention

To address the problems in the prior art, the present invention provides a method for batch collection of network information combining text and picture information. Through a series of configurations it collects and deduplicates information from target websites, stores it in a database, and sends it to a client-specified location in a client-specified format.

A method for batch collection of network information combining text and picture information comprises:

1. Determine the websites from which information is to be collected, the specific URLs of the information list pages to be collected on each website, and the number of those list pages.

Multiple websites can be selected for batch collection. The collection time, collection mode, and collection content of the websites are scheduled according to the time period. During peak Internet hours, serial collection mode is used: collection from the next website starts only after collection from the current website is complete. During off-peak hours, parallel collection mode is used: information is collected from multiple websites simultaneously, which keeps collection efficient and makes efficient use of resources.

2. From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information; also save the serial number information of these list pages in the list configuration information.

3. On the first collection run, the system reads the common URL part from the list configuration information and derives the serial numbers of all list pages from the total number of list pages, thereby assembling the URLs of all list pages to be collected on the target website.

On later collection runs, the system reads the common URL part from the list configuration information together with the serial numbers of the 2 most recent list pages, and assembles the URLs of the 2 most recent list pages to be collected on the target website.

Using these URLs, the system captures the source code of the target website's list pages and parses it to obtain the detail page link addresses contained in the list pages.

There are 2 modes for obtaining detail page link addresses, and both are saved in the detail page configuration information:

(1) Marker mode. First set the start position marker and the end position marker of the detail page link addresses contained in the list page source code. These position markers are fixed code in the web page and are saved in the detail page configuration information. Search the list page source code for these markers, extract the detail page link address between each start position marker and end position marker, and save it in the to-be-captured link library.

(2) Specific-link mode. First analyze the detail page link addresses contained in the list page source code and, according to the content to be collected, extract a feature code for the detail page link addresses. Then build the collection condition as a regular expression and save it in the detail page configuration information. Obtain all detail page link addresses in the list page source code and match them against the feature code; those that match are saved in the to-be-captured link library.
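
The two link-extraction modes described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the `href` regular expression, and the sample page fragment with its `example.com` URLs are all assumptions introduced for the example.

```python
import re

def extract_by_markers(source, start_marker, end_marker):
    """Marker mode: return every substring found between start_marker and end_marker."""
    links = []
    pos = 0
    while True:
        start = source.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = source.find(end_marker, start)
        if end == -1:
            break
        links.append(source[start:end])
        pos = end + len(end_marker)
    return links

# Assumed simplification: links are taken from double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')

def extract_by_pattern(source, feature):
    """Specific-link mode: collect all href values, keep those matching the feature regex."""
    pattern = re.compile(feature)
    return [url for url in HREF_RE.findall(source) if pattern.search(url)]

# Hypothetical list-page fragment, for illustration only.
page = (
    '<h3><a href="http://example.com/news/1.shtml">A</a></h3>'
    '<h3><a href="http://example.com/topic/2.shtml">B</a></h3>'
)
marker_links = extract_by_markers(page, '<h3><a href="', '">')
pattern_links = extract_by_pattern(page, r"^http://example\.com/news/")
```

In this sketch the marker mode returns both links, while the feature pattern keeps only the one under `/news/`, which mirrors the distinction between collecting all links and collecting only those that meet a content condition.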

Each detail page link address obtained is compared with the detail page link addresses stored in the captured link library. If it is not found there, it is saved in the to-be-captured link library; otherwise it is discarded. This prevents a link from being captured repeatedly.
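
A minimal sketch of this duplicate check is shown below. The names `enqueue_new_links`, `crawled`, and `pending` are hypothetical stand-ins for the two link libraries, and the extra check against the pending queue itself is an assumption added here so that the same link is not queued twice within one run.

```python
def enqueue_new_links(found, crawled, pending):
    """Queue links that have not been captured before; return the links added.

    crawled: set of link addresses already captured (the captured link library).
    pending: list of link addresses waiting to be captured.
    """
    added = []
    for link in found:
        # Discard links that were already captured or are already queued (assumption).
        if link in crawled or link in pending:
            continue
        pending.append(link)
        added.append(link)
    return added
```

For example, with `crawled = {"a"}` and an empty `pending` list, passing `["a", "b", "b", "c"]` leaves `pending` as `["b", "c"]`.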

Marker mode is used to collect all detail page link addresses, and specific-link mode is used to collect only the detail page link addresses that meet the content conditions. The two modes are chosen according to the actual situation and can also be used in combination.

Exception handling:

(1) Timeout. Whether a website is reachable cannot be predicted, so an access may fail. An expiration time is set for the capture process; when a website does not respond for a long time, the process exits on its own initiative so that system resources are not occupied for long.

(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results to visitors from inside and outside the country. When captured results are missing information and investigation shows that an IP address restriction is the cause, the system is configured with a suitable proxy server and accesses the site through a server with a different IP to obtain the complete web page content.

(3) Frequent access. This exception arises when the website is accessed too frequently, which violates the target website's access rules, so the target website restricts access. In this case the capture frequency is configured to slow the capture speed: the system waits for a certain time before accessing each page of the target website. This avoids the restriction and lets data collection proceed normally.
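
The three exception cases above could be combined in a single capture loop along the following lines. This is only a sketch under stated assumptions: `fetch`, `proxy_fetch`, and `is_complete` are hypothetical callables supplied by the caller (a real `fetch` might, for instance, wrap `urllib.request.urlopen(url, timeout=...)`), and the text does not specify how a timeout is signalled, so `TimeoutError` is assumed here.

```python
import time

def polite_fetch(urls, fetch, proxy_fetch=None, is_complete=bool,
                 delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn and return a dict of url -> page source (or None).

    fetch(url) returns page source or raises TimeoutError (assumed signal).
    proxy_fetch(url) stands in for the same request routed through a proxy server.
    is_complete(page) decides whether the page is missing information.
    """
    results = {}
    for url in urls:
        sleep(delay_seconds)  # frequent-access case: wait before every page
        try:
            page = fetch(url)
        except TimeoutError:
            results[url] = None  # timeout case: give up on this URL and move on
            continue
        if proxy_fetch is not None and not is_complete(page):
            page = proxy_fetch(url)  # missing-information case: retry through a proxy
        results[url] = page
    return results
```

Passing the fetch functions and `sleep` in as parameters keeps the loop independent of any particular HTTP library and makes the waiting behaviour easy to check without real network access.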

4. Capture detail page content according to the detail page link addresses stored in the to-be-captured link library.

Access the detail page link addresses stored in the to-be-captured link library in turn and obtain the source code of each detail page.

Detail page content generally includes information such as the title, body, author, source, and time. Besides this information, the detail page source code also contains various HTML code, and each item of information is preceded and followed by its corresponding HTML code. Therefore, to ensure that the collected detail page content displays properly, the other HTML code is cleaned and filtered out.

For the title, author, source, time, and similar information, it is enough to remove the HTML code corresponding to each item and keep the information itself.

For the body, the line-break code and image link addresses are retained and all other HTML code is removed. Any JavaScript code it contains is shielded to keep the collected content safe.
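
One way to picture this cleaning step is the regex-based sketch below. It is an illustration only: the function name `clean_body`, the `[img:URL]` placeholder format, and the choice to treat `<br>` and `</p>` as line breaks are assumptions, and regular expressions are a simplification that will not cope with every real-world page.

```python
import html
import re

SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
IMG_RE = re.compile(r"<img\b[^>]*?\bsrc\s*=\s*[\"']([^\"']+)[\"'][^>]*>", re.IGNORECASE)
BREAK_RE = re.compile(r"<br\s*/?>|</p\s*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def clean_body(source):
    """Return (text, image_urls); text keeps line breaks and [img:URL] placeholders."""
    text = SCRIPT_RE.sub("", source)   # shield JavaScript (and CSS) blocks
    image_urls = IMG_RE.findall(text)  # remember image link addresses
    text = IMG_RE.sub(lambda m: "\n[img:%s]\n" % m.group(1), text)
    text = BREAK_RE.sub("\n", text)    # keep line breaks
    text = TAG_RE.sub("", text)        # drop every remaining tag
    text = html.unescape(text)         # decode entities such as &amp;
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line), image_urls

# Hypothetical detail-page fragment, for illustration only.
sample = (
    '<div><p>Hello <b>world</b></p><script>alert(1)</script>'
    '<img class="x" src="http://example.com/a.jpg?w=1"><br/>Bye &amp; thanks</div>'
)
body_text, body_images = clean_body(sample)
```

The returned image list is what a later picture-processing step would consume, while the text keeps only line breaks and image placeholders.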

After a detail page has been captured successfully, its link address is saved in the captured link library for later duplicate checks.

5. Process the pictures in the detail page content.

Pictures are an important part of the information content. When the content is collected, only the link addresses of the pictures are obtained; further processing is needed to download the pictures.

Picture capture starts from the obtained image link addresses only after the text of all detail pages has been captured, which helps improve the efficiency of the whole process.

The image link addresses are first identified and saved in the picture collection library for the subsequent picture capture operation.

Image link addresses are normalized when they are saved. An obtained image link address often carries parameters after the picture file name; when the address is saved, the "?" and the characters after it are stripped.
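
The normalization rule just described is simple enough to show directly. In this sketch the function names and the list used as the picture collection library are hypothetical, and skipping addresses that are already stored is an added assumption.

```python
def normalize_image_url(url):
    """Drop the '?' and everything after it from an image link address."""
    return url.split("?", 1)[0]

def add_to_image_library(library, url):
    """Store the normalized address in the library unless it is already there."""
    normalized = normalize_image_url(url)
    if normalized not in library:  # assumption: identical pictures are stored once
        library.append(normalized)
    return normalized
```

With this rule, `http://example.com/a.jpg?w=100&h=50` and `http://example.com/a.jpg` are stored as the same address.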

The pictures are then downloaded to local storage according to the image link addresses saved in the picture collection library:

(1) Call the picture processing script on the file server with an image link address from the picture collection library as its parameter. The script contains commands for creating directories, collecting the picture, saving it to the corresponding directory, renaming the file, and so on.

(2) Execute the picture processing script.

(3) After the script has run, the corresponding picture has been downloaded and saved in the specified directory.

Because many pictures may need to be saved and picture files vary in size, the total capture time cannot be determined in advance. Pictures are therefore captured in parallel, which significantly improves efficiency when many pictures are captured.
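
Parallel picture capture could be organized with a thread pool, as in the sketch below. The text does not say how parallelism is implemented, so the use of `concurrent.futures` is an assumption, and `download_one` is a hypothetical callable standing in for the picture processing script (create the directory, fetch the file, save and rename it).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(image_urls, download_one, max_workers=4):
    """Run download_one(url) for every URL in parallel; return (saved, failed).

    saved maps each URL to whatever download_one returned (e.g. a local path);
    failed maps each URL to the exception that was raised for it.
    """
    saved, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, url): url for url in image_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                saved[url] = future.result()
            except Exception as exc:  # one unavailable link must not stop the batch
                failed[url] = exc
    return saved, failed
```

Collecting failures separately rather than raising keeps one unavailable picture from interrupting the rest of the batch; any per-picture expiration time would be enforced inside `download_one`.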

Exception handling:

An image link address may be unavailable, in which case the picture cannot be accessed. In normal processing an expiration time is set for picture capture; if the picture file has not been obtained when this time has elapsed, no further attempt is made. This saves system resources and keeps resource use effective.

6. After the detail page content has been captured, export the content data to the specified interface.

The content is reviewed automatically before export; during the review, the captured content is automatically previewed in the form of a web page.

Content data is exported to the specified interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server; export to files in formats such as TXT and Excel is also supported, as is sending by e-mail.
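
As a rough illustration of the export step, the sketch below writes captured records to a database table and to a tab-separated text stream. Everything concrete here is an assumption: SQLite from the Python standard library stands in for the databases named above, and the field names and the `articles` table are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical field layout for one captured detail page.
FIELDS = ("url", "title", "author", "source", "published", "body")

def export_to_database(records, conn):
    """Insert records into an 'articles' table; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "source TEXT, published TEXT, body TEXT)"
    )
    rows = [tuple(record.get(field, "") for field in FIELDS) for record in records]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)

def export_to_text(records, stream):
    """Write records as tab-separated text with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter="\t",
                            extrasaction="ignore", restval="")
    writer.writeheader()
    writer.writerows(records)

records = [{"url": "http://example.com/1", "title": "T1", "body": "B1"}]
conn = sqlite3.connect(":memory:")
written = export_to_database(records, conn)
buffer = io.StringIO()
export_to_text(records, buffer)
```

Using the URL as the primary key means that exporting the same page twice replaces the earlier row instead of duplicating it.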

The beneficial effects of the present invention are as follows:

1. The text information and picture information of the content are combined effectively, so that an item of information is collected locally in complete form and can be conveniently reproduced and displayed;

2. Several handling methods are added for abnormal conditions during collection, ensuring that data collection is stable and reliable;

3. Collection resources are optimized, which improves resource utilization and the efficiency of information collection;

4. Multiple collection requirements are processed and scheduled in a unified way, which avoids duplicated development and improves development efficiency.

Brief description of the drawings

Fig. 1 is a processing flow chart of the method of this embodiment for batch collection of network information combining text and picture information.

Embodiment

The present invention is described in further detail below with reference to the drawing and specific embodiments.

As shown in Fig. 1, the processing flow of the method of this embodiment for batch collection of network information combining text and picture information comprises:

Step 11. Determine the websites from which information is to be collected, the specific URLs of the information list pages to be collected on each website, and the number of those list pages.

Multiple websites can be selected for batch collection. During peak Internet hours, serial collection mode is used: collection from the next website starts only after collection from the current website is complete. During off-peak hours, parallel collection mode is used: information is collected from multiple websites simultaneously, which keeps collection efficient and makes efficient use of resources.

For example, serial collection mode is used from 8:00 to 24:00 each day;

parallel collection mode is used from 0:00 to 8:00 each day.
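
The time-based choice between serial and parallel collection in this example can be sketched as follows. The function names are hypothetical, `collect_one` stands in for whatever routine collects a single website, and treating exactly 8:00 as the start of the serial period is an assumption, since the text leaves the boundary open.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import time

def acquisition_mode(now, serial_start=time(8, 0)):
    """Return 'serial' from serial_start until midnight, otherwise 'parallel'."""
    return "serial" if now >= serial_start else "parallel"

def collect_sites(sites, collect_one, now):
    """Collect every site, one after another or simultaneously, depending on the hour."""
    if acquisition_mode(now) == "serial":
        # Peak hours: finish one website before starting the next.
        return [collect_one(site) for site in sites]
    # Off-peak hours: collect from all websites at the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        return list(pool.map(collect_one, sites))
```

Both branches return results in the order of `sites`, so callers do not need to know which mode was used.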

Step 12. From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information; also save the total number of these list pages in the list configuration information.

For example, consider the market news pages of Sina Real Estate, whose news list has 3 pages. Analyzing the URLs of these list pages, "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" can be taken as the common part of their URLs, and "01 – 03" as the information representing the number of list pages.

Step 13. On the first collection run, the system reads the common URL part from the list configuration information and derives the serial numbers of all list pages from the total number of list pages, thereby assembling the URLs of all list pages to be collected on the target website.

On later collection runs, the system reads the common URL part from the list configuration information together with the serial numbers of the 2 most recent list pages, and assembles the URLs of the 2 most recent list pages to be collected on the target website.

Continuing the example, the saved common part "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" and the list page serial number information "01 – 03" are combined to form the URLs of all list pages:

http://search.house.sina.com.cn/bj/news/scdt/page01

http://search.house.sina.com.cn/bj/news/scdt/page02

http://search.house.sina.com.cn/bj/news/scdt/page03
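
Assembling these list page URLs from the saved configuration might look like the sketch below. Several details are assumptions made for illustration: the placeholder is written as `(*)` with no surrounding space or trailing slash so that the output matches the three URLs listed above, page numbers are zero-padded to two digits, and the most recent pages are taken to be the lowest-numbered ones.

```python
def build_list_urls(template, total_pages, first_run, width=2, latest=2):
    """Fill the '(*)' placeholder in template with zero-padded page numbers.

    On the first run every page from 1 to total_pages is generated; on later
    runs only the first `latest` pages (assumed to be the most recent ones).
    """
    last = total_pages if first_run else min(latest, total_pages)
    return [template.replace("(*)", str(n).zfill(width)) for n in range(1, last + 1)]

TEMPLATE = "http://search.house.sina.com.cn/bj/news/scdt/page(*)"
first_run_urls = build_list_urls(TEMPLATE, 3, first_run=True)
later_run_urls = build_list_urls(TEMPLATE, 3, first_run=False)
```

Here `first_run_urls` holds the three URLs of the example, and `later_run_urls` holds only the URLs ending in `page01` and `page02`.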

Using these URLs, the system captures the source code of these list pages and parses it to obtain the detail page link addresses contained in the list pages.

In the page source code, the information appears as follows; the link here is the link to be collected.

The configuration information is then: start at " <h3><a href=" "; end at "/" > ".

The detail page link address finally obtained is:

http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml

Continuing the previous example, the configuration information is: must contain http://bj.house.sina.com.cn/news/, so that every link of this form is collected.

Configuration by exclusion is also possible, that is, configuring the content that does not need to be collected, for example: must not contain /scdt/|/zhuanti/|page. In this way, several different filter conditions can be used at the same time.
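
The inclusion and exclusion conditions from this example can be combined into one filter, as in the sketch below. Reading the exclusion string as a regular expression with `|` as alternation is an assumption, and `make_link_filter` is a hypothetical helper name.

```python
import re

def make_link_filter(must_contain, must_not_contain=None):
    """Build a predicate that accepts a URL only if it meets both conditions."""
    include = re.compile(re.escape(must_contain))  # literal "must contain" text
    exclude = re.compile(must_not_contain) if must_not_contain else None

    def accept(url):
        if not include.search(url):
            return False
        return exclude is None or not exclude.search(url)

    return accept

accept = make_link_filter(
    "http://bj.house.sina.com.cn/news/",
    r"/scdt/|/zhuanti/|page",
)
```

With these settings the detail page address from the example is accepted, while list page addresses and anything under `/zhuanti/` are rejected.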

Exception handling:

(1) Timeout. Whether a website is reachable cannot be predicted, so an access may fail. An expiration time is set for the capture process; when a website does not respond for a long time, the process exits on its own initiative so that system resources are not occupied for long.

(3) Frequent access. This exception arises when the website is accessed too frequently, which violates the target website's access rules, so the target website restricts access. In this case the capture frequency is configured to slow the capture speed: the system waits for a certain time before accessing each page of the target website. This avoids the restriction and lets data collection proceed normally.

Step 14. Capture detail page content according to the detail page link addresses stored in the to-be-captured link library.

Take, for example, the news item http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml

An excerpt of its source code is as follows:

The configuration information for the title is: the start position marker is "<title>" and the end position marker is "</title>", so that the title of this item, "[cover story] agency destiny 7 is foretold greatly", can be collected;

Likewise, the configuration information for the publication time is: the start position marker is "<div class=" tc zwdatemb15 " ><span>2 " and the end position marker is "</span>", so that the publication time of this item, "2014-04-15 17:35:11", can be collected;

For the body, the line-break code and picture tag code are retained and all other HTML code is removed. Any JavaScript code it contains is shielded to keep the collected content safe.

--text end-->". Because this is the information body, the image link addresses need to be changed to local image link addresses and the line-break tags retained, while all other HTML tag code is removed. The body of this example after processing is as follows.

After a detail page has been captured successfully, its link address is saved in the captured link library for duplicate checks during later detail page captures.

Step 15. Process the pictures in the detail page content.

Pictures are an important part of the information content. When the content is collected, only the link addresses of the pictures are obtained; further processing is needed to download the pictures.

Picture capture starts from the obtained image link addresses only after the text of all detail pages has been captured, which improves reliability and the efficiency of the whole process.

For example, if 15 detail pages need to be collected from the Sina website above and 5 image link addresses are obtained from them, picture collection from the 5 image link addresses starts after the text of these 15 detail pages has been collected.

The image links are first identified and saved in the picture collection library for the subsequent picture capture operation.

The picture link obtained in this example is:

http://src.house.sina.com.cn/imp/imp/deal/f1/28/b/fd28900e78b71af7280d18693f3_p1_mk1.jpg

Image link addresses are normalized when they are saved. An obtained image link address often carries parameters after the picture file name; when the address is saved, the "?" and all parameters after it are removed.

The pictures are then downloaded to local storage according to the image link addresses saved in the picture collection library.

(2) Execute the picture processing script.

(3) The corresponding picture is downloaded and saved in the specified directory.

Because many pictures may need to be saved and picture files vary in size, some of them possibly several megabytes, the total capture time is uncertain. Pictures are therefore captured in parallel, which significantly improves efficiency when many pictures are captured.

Exception handling:

An image link address may be unavailable, in which case the picture cannot be accessed. In normal processing an expiration time is set for picture capture; if the file information has not been obtained when this time has elapsed, no further attempt is made. This saves system resources and keeps resource use effective.

In this embodiment, this expiration time is set to between 1 minute and 1.5 minutes.

Step 16. After the detail page content has been captured, export the content data to the specified interface.

The content must also be reviewed automatically before export; during the review, the captured content is automatically previewed in the form of a web page.

The present invention may have many other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and variations according to the present invention, and all such changes and variations shall fall within the protection scope of the claims appended to the present invention.

Claims

1. A method for batch collection of network information combining text and picture information, characterized in that it comprises:

Step 1: determining the websites from which information is to be collected, the specific URLs of the information list pages to be collected on the website, and the number of said list pages;

Step 2: from the URLs of the list pages, finding the common part of the URLs and saving it in list configuration information, and also saving the serial number information of said list pages in the list configuration information;

Step 3: on the first collection run, the system reads the common URL part from said list configuration information and derives the serial numbers of all list pages from the total number of list pages, assembling the URLs of all list pages to be collected on the target website;

on later collection runs, the system reads the common URL part from said list configuration information together with the serial numbers of the 2 most recent list pages, and assembles the URLs of the 2 most recent list pages to be collected on the target website;

using the URLs of said 2 most recent list pages, the system captures the source code of these list pages of the target website and parses the source code to obtain the detail page link addresses contained in said 2 most recent list pages;

Step 4: capturing detail page content according to the detail page link addresses stored in the to-be-captured link library;

accessing in turn the detail page link addresses stored in the to-be-captured link library and obtaining the source code of each detail page;

after a detail page has been captured successfully, saving its link address in the captured link library for later duplicate checks;

Step 5: processing the pictures in the detail page content;

collecting the information content and obtaining the link addresses of the pictures, and, after the text of all detail pages has been captured, starting picture capture according to the obtained image link addresses;

Step 6: after said detail page content has been captured, exporting the content data to a specified interface.

2. The method for batch collection of network information according to claim 1, characterized in that, in step 1:

multiple websites are selected for batch collection of information, and the collection time, collection mode, and collection content of the websites are scheduled according to the time period; during peak Internet hours, serial collection mode is used, that is, collection from the next website starts only after collection from the current website is complete; during off-peak hours, parallel collection mode is used, that is, information is collected from multiple websites simultaneously.

3. The method for batch collection of network information according to claim 1, characterized in that, in step 3:

there are 2 modes for obtaining detail page link addresses, and both are saved in detail page configuration information:

(1) marker mode: setting the start position marker and the end position marker of the detail page link addresses contained in the list page source code, said start position marker and end position marker being fixed code in the web page and being saved in the detail page configuration information; searching the list page source code for said start position marker and end position marker, extracting the detail page link address between the start position marker and the end position marker, and saving it in the to-be-captured link library;

(2) specific-link mode: analyzing the detail page link addresses contained in the list page source code, extracting from them a feature code for the detail page link addresses according to the content to be collected, building the collection condition as a regular expression, and saving it in the detail page configuration information; obtaining all detail page link addresses in the list page source code and matching them against the feature code, those that match being saved in the to-be-captured link library.

4. The method for batch collection of network information according to claim 3, characterized in that:

each detail page link address obtained is compared with the detail page link addresses stored in the captured link library; if it is not found there, the obtained detail page link address is saved in the to-be-captured link library; otherwise, the obtained detail page link address is discarded, preventing detail page link addresses stored in the captured link library from being captured repeatedly;

wherein marker mode is used to collect all detail page link addresses, and specific-link mode is used to collect the detail page link addresses that meet the content conditions.

5. The method for batch collection of network information according to claim 4, characterized in that exception handling comprises:

(1) timeout: when website access is abnormal, an expiration time is set for the capture process, and if said website has not responded when said expiration time is exceeded, the system exits;

(2) missing information: when a website is deployed on multiple IPs and returns different results to different IPs, and an IP address restriction causes the captured results to be missing information, the system is configured with a corresponding proxy server and accesses the website through a server with a different IP to obtain the complete web page content;

(3) frequent access: when the website is accessed too frequently, in violation of the target website's access rules, the capture frequency for the target website is configured to slow the capture speed, and the system waits for a certain time before accessing each page of the target website, thereby avoiding the restriction and carrying out data collection normally.

6. The method for batch collection of network information according to claim 1, characterized in that, in step 4:

for the title, author, source, and time information, the HTML code corresponding to each item is removed and the information itself is retained;

for the body, the line-break code and image link addresses are retained, all other HTML code is removed, and the JavaScript code it contains is shielded.

7. The method for batch collection of network information according to claim 1, characterized in that, in step 5:

the image link addresses are identified and then saved in the picture collection library for the subsequent picture capture operation;

image link addresses are normalized when they are saved: where an obtained image link address carries parameters after the picture file name, the "?" and the characters after it are stripped when the address is saved;

the pictures are downloaded to local storage according to the image link addresses saved in the picture collection library:

(1) calling the picture processing script on the file server with an image link address from the picture collection library as its parameter, wherein the script contains commands for creating directories, collecting the picture, saving it to the corresponding directory, and renaming the file;

(2) executing the picture processing script;

8. The method for batch collection of network information according to claim 7, characterized in that exception handling comprises:

when an image link address is unavailable and the picture cannot be accessed, an expiration time is set, and after said expiration time is exceeded no further attempt is made to obtain said picture.

9. The method for batch collection of network information according to claim 1, characterized in that, in step 6:

the content is reviewed automatically before export, and during the review the captured detail page content is automatically previewed in the form of a web page.