CN103927370A - Network information batch acquisition method of combined text and picture information - Google Patents


Publication number
CN103927370A
Authority
CN
China
Prior art keywords
information
picture
page
detail page
chained address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410166752.5A
Other languages
Chinese (zh)
Other versions
CN103927370B (en)
Inventor
唐宇波
夏平嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN201410166752.5A
Publication of CN103927370A
Application granted
Publication of CN103927370B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a method for batch acquisition of network information that combines text and picture information. Through a series of configurations, the method acquires information from target websites, deduplicates it, stores it in a database, and sends it to a client-designated destination in a client-designated format. The method comprises: determining the websites from which information is to be acquired, the specific URLs of their information list pages, and the number of list pages; finding the common part of the list page URLs and storing it in the list configuration information; during acquisition, reading the common URL part from the list configuration information, deriving the sequence numbers of all list pages from the total number of list pages, and combining them into the URLs of all list pages to be acquired from the target website; crawling detail page content according to the detail page link addresses stored in the to-be-crawled link library; processing the pictures in the detail page content; and, after the detail page content has been crawled, exporting the content data to a designated interface.

Description

A method for batch acquisition of network information combining text and picture information
Technical field
The present invention applies to the field of Internet technology and relates to a method for batch acquisition of network information combining text and picture information.
Technical background
With the rapid development of the Internet, a large amount of varied information has accumulated online, such as domestic news, potential customer information, competitor pricing, real-time financial information, statistical reports, industry analysis reports, and supply and demand information. For an enterprise, analyzing this information together with its internal business data greatly assists management decision-making. In addition, after organizing and digesting this information, an enterprise can publish it on its own website, which enriches the site's content and improves the visitor experience.
Many tools can already acquire web page content, but most focus on text, and the pictures in web pages are not acquired effectively. There is also no reliable and effective method for acquiring network information in batches while judging whether acquired content is a duplicate.
The patent "Regular automatic capture method for network news information" (patent application number: CN201210402435.X) can acquire news information on a schedule and, through configuration, save the content of a target website to a file server. However, that method does not handle abnormal conditions during acquisition, cannot correctly identify redundant code in information pages, and does not process the pictures in web pages.
Therefore, acquiring the picture information and text information of web pages together, reliably and in batches, is a problem that urgently needs to be solved.
Summary of the invention
To address the problems in the prior art, this patent provides a method for batch acquisition of network information combining text and picture information. Through a series of configurations, it acquires information from target websites, deduplicates it, stores it in a database, and sends it to a client-designated destination in a client-designated format.
A method for batch acquisition of network information combining text and picture information comprises:
1. Determine the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of those list pages.
Multiple websites can be selected for batch information acquisition. The acquisition time, acquisition mode, and acquisition content of the websites are scheduled according to the time period. During peak network hours, serial acquisition is used: acquisition for the next website starts only after acquisition for the current website has finished. During off-peak network hours, parallel acquisition is used: information is acquired from multiple websites at the same time, which keeps acquisition efficient and makes efficient use of resources.
2. From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information. In addition, save the sequence number information of these list pages in the list configuration information.
3. For the first information acquisition, the system reads the common URL part from the list configuration information and derives the sequence numbers of all list pages from the total number of list pages, thereby combining them into the URLs of all list pages to be acquired from the target website.
For subsequent information acquisitions, the system reads the common URL part from the list configuration information together with the sequence numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.
Using these URLs, the system crawls the source code of the list pages of the target website and parses it to obtain the detail page link addresses contained in the list pages.
There are two modes for obtaining detail page link addresses; both are saved in the detail page configuration information:
(1) Tag mode. First set the start position marker and the end position marker of the detail page link addresses contained in the list page source code. These position markers are fixed code in the web page and are saved in the detail page configuration information. Search the list page source code for these position markers, extract the detail page link address between the start position marker and the end position marker, and save it in the to-be-crawled link library.
(2) Specific link mode. First analyze the detail page link addresses contained in the list page source code and, according to the content to be acquired, extract the feature code of the desired detail page link addresses; then construct the acquisition condition as a regular expression and save it in the detail page configuration information. Obtain all detail page link addresses in the list page source code and match them against the feature code; any address that matches is saved in the to-be-crawled link library.
Each detail page link address obtained is compared with the detail page link addresses saved in the crawled link library. If no identical address is found, the newly obtained address is saved in the to-be-crawled link library; otherwise it is discarded. This prevents a link from being crawled repeatedly.
Tag mode is used to acquire all detail page link addresses; specific link mode is used to acquire only the detail page link addresses that meet content conditions. The two modes are chosen according to the actual situation and can also be combined.
Handling of abnormal conditions:
(1) Excessive duration. Whether a website is accessible cannot be predicted, and access may fail. A timeout is therefore set for the crawling process: when a website does not respond for a long time, the process exits on its own initiative to avoid occupying system resources for a long time.
(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results when accessed domestically and from abroad. When crawl results are missing information and investigation shows that the cause is an IP address restriction, the system sets a corresponding proxy server and accesses the site through a server with a different IP to obtain the complete web page content.
(3) Frequent access. The abnormality arises because the website is accessed too frequently, which violates the target website's access rules, so the target website restricts access. In this case a crawl frequency is set to slow down crawling: the system waits a certain time before each page access to the target website. This avoids the restriction and lets data acquisition proceed normally.
4. Crawl detail page content according to the detail page link addresses saved in the to-be-crawled link library.
Access the detail page link addresses saved in the to-be-crawled link library one by one and obtain the source code of each detail page.
Detail page content generally includes the title, information content, author, source, time, and similar information. Besides this information, the detail page source code contains various HTML code, and each piece of information is surrounded by its corresponding HTML code. Therefore, to ensure that the acquired detail page content displays normally, the other HTML code is cleaned and filtered out.
For the title, author, source, time, and similar information, it is sufficient to remove the HTML code corresponding to each item and keep the item itself.
For the information content, the line-break code and the picture link addresses must be retained and all other HTML code removed. Any JavaScript code it contains must be blocked to keep the acquired content safe.
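The cleaning rule for the information content can be illustrated with a minimal sketch. The function name `clean_content` and the regular expressions are assumptions made for illustration and are not taken from the patent; a production system would more likely use a real HTML parser and would also inspect the attributes of the retained picture tags.

```python
import re

# Hypothetical patterns (not from the patent): script blocks are dropped
# entirely, <br> and <img> tags are kept, every other tag is stripped.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
KEEP_RE = re.compile(r"<(br\s*/?|img\b[^>]*)>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean_content(html: str) -> str:
    """Keep line-break and picture tags, remove scripts and other HTML tags."""
    html = SCRIPT_RE.sub("", html)  # block JavaScript for safety

    kept = []

    def stash(match):
        # Replace each tag we want to keep with a NUL-delimited placeholder.
        kept.append(match.group(0))
        return f"\x00{len(kept) - 1}\x00"

    html = KEEP_RE.sub(stash, html)
    html = TAG_RE.sub("", html)  # strip all remaining tags
    # Put the retained tags back in place of their placeholders.
    return re.sub(r"\x00(\d+)\x00", lambda m: kept[int(m.group(1))], html)
```

The placeholder step is one simple way to protect the retained tags while everything else is stripped; it assumes the page source contains no NUL characters.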
After a detail page's content has been crawled successfully, its link address is saved in the crawled link library for later duplicate checks.
5. Process the pictures in the detail page content.
Pictures are an important part of information content. When information content is acquired, only the picture link addresses are obtained; further processing is needed to download the pictures.
After the text information of all detail pages has been crawled, picture crawling starts according to the picture link addresses obtained, which helps improve the efficiency of the whole process.
The picture link addresses are first identified and saved in the picture acquisition library for the subsequent picture crawling operation.
Picture link addresses must be normalized when saved. The picture link addresses obtained often carry parameters after the picture file name; when a picture link address is saved, the "?" and the characters after it are removed.
Then, according to the picture link addresses saved in the picture acquisition library, the pictures are downloaded locally:
(1) Call the picture processing script on the file server with the picture link address from the picture acquisition library as a parameter. The script contains commands for creating directories, fetching the picture, saving it to the corresponding directory, renaming the file, and so on.
(2) Execute the picture processing script.
(3) After the script has executed, the corresponding picture has been downloaded and saved in the designated directory.
Because many pictures may need to be saved and picture files vary in size, the total crawl time cannot be determined in advance. Picture crawling is therefore designed to run in parallel, so that multiple pictures are crawled at the same time and efficiency improves significantly.
Handling of abnormal conditions:
A picture link address may be unavailable, in which case the picture cannot be accessed. In normal processing, a timeout is set for picture crawling; if the picture file has not been obtained when the timeout elapses, no further attempt is made. This saves system resources and ensures they are used effectively.
6. After the detail page content has been crawled, export the content data to the designated interface.
Before export, an automatic review is performed, during which the crawled content is previewed in web page form.
The content data is exported to the designated interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server; export to files in formats such as TXT and Excel is supported, as is sending by e-mail.
The beneficial effects of the present invention are as follows:
1. The text and picture information of information content are effectively combined, so that an item of information is acquired locally in its entirety and can be conveniently reproduced and displayed;
2. During acquisition, several methods are provided for handling abnormal conditions, ensuring that data acquisition is stable and reliable;
3. Acquisition resources are optimized, improving both resource utilization and the efficiency of information acquisition;
4. Multiple acquisition requirements are processed and scheduled in a unified way, avoiding duplicated development and improving development efficiency.
Brief description of the drawings
Fig. 1 is a processing flowchart of the method for batch acquisition of network information combining text and picture information according to the present embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawing and a specific embodiment.
As shown in Fig. 1, the processing flow of the method for batch acquisition of network information combining text and picture information according to the present embodiment comprises:
Step 11: Determine the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of those list pages.
Multiple websites can be selected for batch information acquisition. During peak network hours, serial acquisition is used: acquisition for the next website starts only after acquisition for the current website has finished. During off-peak network hours, parallel acquisition is used: information is acquired from multiple websites at the same time, which keeps acquisition efficient and makes efficient use of resources.
For example, serial acquisition is used from 8:00 to 24:00 each day;
parallel acquisition is used from 0:00 to 8:00 each day.
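The time-based choice between the two modes can be sketched as follows. The function name `acquisition_mode` and the constant `PEAK_START` are hypothetical; only the example hours come from the embodiment.

```python
from datetime import time

# Hypothetical constant: 8:00 marks the start of the peak period in the
# example above; the peak period runs until midnight.
PEAK_START = time(8, 0)


def acquisition_mode(now: time) -> str:
    """Return "serial" for 8:00-24:00 and "parallel" for 0:00-8:00."""
    return "serial" if now >= PEAK_START else "parallel"
```

A scheduler could call `acquisition_mode(datetime.now().time())` before each batch and either iterate over the configured websites one by one or hand them to a worker pool.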
Step 12: From the URLs of the list pages, find the common part of these URLs and save it in the list configuration information. In addition, save the total number of these list pages in the list configuration information.
For example, consider the Market Dynamics news pages of Sina Real Estate, whose news list has 3 pages. Analyzing the URLs of these news list pages, the part "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" can be taken as the common part of their URLs, and "01 – 03" as the information representing the number of list pages.
Step 13: For the first information acquisition, the system reads the common URL part from the list configuration information and derives the sequence numbers of all list pages from the total number of list pages, thereby combining them into the URLs of all list pages to be acquired from the target website.
For subsequent information acquisitions, the system reads the common URL part from the list configuration information together with the sequence numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.
Following the example above, the saved common part "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" is combined with the list page sequence number information "01 – 03" to form all list page URLs:
http://search.house.sina.com.cn/bj/news/scdt/page01
http://search.house.sina.com.cn/bj/news/scdt/page02
http://search.house.sina.com.cn/bj/news/scdt/page03
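The combination of the common part and the sequence numbers can be sketched in a few lines. The helper `build_list_urls` is hypothetical, and treating pages 01 and 02 as the 2 most recent list pages is an assumption; the patent does not say which sequence numbers are the newest.

```python
def build_list_urls(common_part: str, first: int, last: int, width: int = 2) -> list[str]:
    """Append zero-padded sequence numbers first..last to the common URL part."""
    return [common_part + str(n).zfill(width) for n in range(first, last + 1)]


COMMON = "http://search.house.sina.com.cn/bj/news/scdt/page"

# First acquisition: all 3 list pages.
all_pages = build_list_urls(COMMON, 1, 3)

# Later acquisitions: only the 2 most recent pages (assumed to be 01 and 02).
latest_pages = build_list_urls(COMMON, 1, 2)
```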
Using these URLs, the system crawls the source code of the list pages and parses it to obtain the detail page link addresses contained in the list pages.
There are two modes for obtaining detail page link addresses; both are saved in the detail page configuration information:
(1) Tag mode. First set the start position marker and the end position marker of the detail page link addresses contained in the list page source code. These position markers are fixed code in the web page and are saved in the detail page configuration information. Search the list page source code for these position markers, extract the detail page link address between the start position marker and the end position marker, and save it in the to-be-crawled link library.
The list page source code contains information such as the following; the link here is the link to be acquired.
<h3><a href="http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml">
The configuration information is: start at '<h3><a href="'; end at '">'.
The detail page link address finally obtained is:
http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml
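Tag mode amounts to a plain substring search between two configured markers. The sketch below illustrates this; the function name `extract_between` is hypothetical, and the markers are the ones from the example configuration.

```python
def extract_between(source: str, start_marker: str, end_marker: str) -> list[str]:
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = source.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = source.find(end_marker, start)
        if end == -1:
            break
        results.append(source[start:end])
        pos = end + len(end_marker)
    return results
```

The same helper could also serve for detail page fields such as the title, since those are likewise configured as a start position marker and an end position marker.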
(2) Specific link mode. First analyze the detail page link addresses contained in the list page source code and, according to the content to be acquired, extract the feature code of the desired detail page link addresses; then construct the acquisition condition as a regular expression and save it in the detail page configuration information. Obtain all detail page link addresses in the list page source code and match them against the feature code; any address that matches is saved in the to-be-crawled link library.
For the previous example, the configuration information is: must contain "http://bj.house.sina.com.cn/news/". All links that match this form are then acquired.
Configuration by exclusion is also possible, that is, configuring the content that does not need to be acquired, for example: must not contain "/scdt/|/zhuanti/|page". In this way, several different filter conditions can be used at the same time.
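The include and exclude conditions can be sketched with two regular expressions. The names `HREF_RE`, `MUST_CONTAIN`, `MUST_NOT_CONTAIN`, and `select_links` are hypothetical, and the way links are located (a simple `href="..."` pattern) is an assumption; only the two condition strings come from the example.

```python
import re

# Assumed way of locating links in the list page source.
HREF_RE = re.compile(r'href="([^"]+)"')

# Conditions from the example configuration.
MUST_CONTAIN = re.compile(r"http://bj\.house\.sina\.com\.cn/news/")
MUST_NOT_CONTAIN = re.compile(r"/scdt/|/zhuanti/|page")


def select_links(list_page_source: str) -> list[str]:
    """Keep links that match the include pattern and not the exclude pattern."""
    links = HREF_RE.findall(list_page_source)
    return [
        url for url in links
        if MUST_CONTAIN.search(url) and not MUST_NOT_CONTAIN.search(url)
    ]
```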
Each detail page link address obtained is compared with the detail page link addresses saved in the crawled link library. If no identical address is found, the newly obtained address is saved in the to-be-crawled link library; otherwise it is discarded. This prevents a link from being crawled repeatedly.
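The duplicate check can be sketched with a set for the crawled link library and a list for the to-be-crawled link library. The function name and data structures are assumptions; the sketch also skips links that are already pending, which goes slightly beyond the comparison the text describes.

```python
def enqueue_new_links(found: list[str], crawled: set[str], pending: list[str]) -> int:
    """Add links that are neither crawled nor pending; return how many were added."""
    added = 0
    for url in found:
        # The text compares against the crawled library; also checking the
        # pending list is an extra assumption to avoid queueing a link twice.
        if url in crawled or url in pending:
            continue
        pending.append(url)
        added += 1
    return added
```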
Tag mode is used to acquire all detail page link addresses; specific link mode is used to acquire only the detail page link addresses that meet content conditions. The two modes are chosen according to the actual situation and can also be combined.
Handling of abnormal conditions:
(1) Excessive duration. Whether a website is accessible cannot be predicted, and access may fail. A timeout is therefore set for the crawling process: when a website does not respond for a long time, the process exits on its own initiative to avoid occupying system resources for a long time.
(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results when accessed domestically and from abroad. When crawl results are missing information and investigation shows that the cause is an IP address restriction, the system sets a corresponding proxy server and accesses the site through a server with a different IP to obtain the complete web page content.
(3) Frequent access. The abnormality arises because the website is accessed too frequently, which violates the target website's access rules, so the target website restricts access. In this case a crawl frequency is set to slow down crawling: the system waits a certain time before each page access to the target website. This avoids the restriction and lets data acquisition proceed normally.
Step 14: Crawl detail page content according to the detail page link addresses saved in the to-be-crawled link library.
Access the detail page link addresses saved in the to-be-crawled link library one by one and obtain the source code of each detail page.
Detail page content generally includes the title, information content, author, source, time, and similar information. Besides this information, the detail page source code contains various HTML code, and each piece of information is surrounded by its corresponding HTML code. Therefore, to ensure that the acquired detail page content displays normally, the other HTML code is cleaned and filtered out.
For the title, author, source, time, and similar information, it is sufficient to remove the HTML code corresponding to each item and keep the item itself.
Take the news item http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml as an example.
An excerpt of its source code is as follows:
The configuration information for the title to be extracted is: start position marker '<title>', end position marker '</title>'. With it, the title of this item, "[cover story] agency destiny 7 is foretold greatly", can be acquired;
Likewise, the configuration information for the publication time is: start position marker '<div class="tc zwdate mb15"><span>', end position marker '</span>'. With it, the publication time of this item, "2014-04-15 17:35:11", can be acquired;
For the information content, the line-break code and the picture tag code must be retained and all other HTML code removed. Any JavaScript code it contains must be blocked to keep the acquired content safe.
--text end-->", and because the picture link addresses in the corresponding information content must be changed to local picture link addresses, the line-break tags are retained and all other HTML tag code is removed. The information content after processing for this example is as follows.
After a detail page's content has been crawled successfully, its link address is saved in the crawled link library for duplicate checks in later detail page crawls.
Step 15: Process the pictures in the detail page content.
Pictures are an important part of information content. When information content is acquired, only the picture link addresses are obtained; further processing is needed to download the pictures.
After the text information of all detail pages has been crawled, picture crawling starts according to the picture link addresses obtained, which improves reliability and also the efficiency of the whole process.
For example, suppose 15 detail pages need to be acquired from the Sina website above and 5 picture link addresses are obtained from them. Picture acquisition according to the 5 picture link addresses then starts after the text information of all 15 detail pages has been acquired.
The picture links are first identified and saved in the picture acquisition library for the subsequent picture crawling operation.
The picture information obtained in this example is:
http://src.house.sina.com.cn/imp/imp/deal/f1/28/b/fd28900e78b71af7280d18693f3_p1_mk1.jpg
Picture link addresses must be normalized when saved. The picture link addresses obtained often carry parameters after the picture file name; when a picture link address is saved, the "?" and all parameters after it are removed.
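The normalization is a one-line string operation. The function name `normalize_picture_url` and the sample URL with parameters are hypothetical.

```python
def normalize_picture_url(url: str) -> str:
    """Drop the "?" and everything after it from a picture link address."""
    return url.split("?", 1)[0]
```

For instance, a hypothetical address such as `http://example.com/a.jpg?w=100` would be stored as `http://example.com/a.jpg`, while an address without parameters is stored unchanged.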
Then, according to the picture link addresses saved in the picture acquisition library, the pictures are downloaded locally.
(1) Call the picture processing script on the file server with the picture link address from the picture acquisition library as a parameter. The script contains commands for creating directories, fetching the picture, saving it to the corresponding directory, renaming the file, and so on.
(2) Execute the picture processing script.
(3) The corresponding picture is downloaded and saved in the designated directory.
Because many pictures may need to be saved and picture files vary in size (some may be several megabytes), the total crawl time is uncertain. Pictures are therefore crawled in parallel, so that multiple pictures are crawled at the same time and efficiency improves significantly.
Handling of abnormal conditions:
A picture link address may be unavailable, in which case the picture cannot be accessed. In normal processing, a timeout is set for picture crawling; if the file has not been obtained when the timeout elapses, no further attempt is made. This saves system resources and ensures they are used effectively.
In the present embodiment, this timeout is set to 1 to 1.5 minutes.
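Parallel picture crawling with a per-picture timeout can be sketched with a thread pool. The function `download_all`, the worker count, and the `download(url, timeout_seconds)` callback are assumptions; the callback stands in for whatever actually fetches and stores a picture, such as the picture processing script described above.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download, max_workers: int = 4, timeout_seconds: float = 60.0):
    """Download pictures in parallel; a failed or timed-out picture maps to None."""

    def attempt(url):
        try:
            # The callback is expected to enforce the timeout itself,
            # e.g. by passing it to the underlying network call.
            return url, download(url, timeout_seconds)
        except Exception:
            return url, None  # do not retry, as described in the text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(attempt, urls))
```

The default of 60 seconds matches the lower end of the 1 to 1.5 minute range given in the embodiment.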
Step 16: After the detail page content has been crawled, export the content data to the designated interface.
Before export, an automatic review is also required, during which the crawled content is previewed in web page form.
The content data is exported to the designated interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server; export to files in formats such as TXT and Excel is supported, as is sending by e-mail.
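Export to a database and to a text file can be sketched as follows. SQLite is used here purely as a stand-in so the example is self-contained; the patent names Oracle, MySQL, and SQL Server, and the table layout, field names, and function names below are all assumptions.

```python
import sqlite3


def export_to_database(conn: sqlite3.Connection, items: list[dict]) -> None:
    """Store crawled items; the url column doubles as a duplicate guard."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, published TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO articles VALUES (:url, :title, :published, :content)",
        items,
    )
    conn.commit()


def export_to_text(stream, items: list[dict]) -> None:
    """Write one tab-separated line per item to an open text stream."""
    for item in items:
        stream.write("\t".join([item["url"], item["title"], item["published"]]) + "\n")
```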
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims (9)

1. A method for batch acquisition of network information combining text and picture information, characterized in that it comprises:
Step 1: determining the websites from which information is to be acquired, the specific URLs of the information list pages to be acquired on each website, and the number of those list pages;
Step 2: from the URLs of the list pages, finding the common part of these URLs and saving it in the list configuration information, and in addition saving the sequence number information of these list pages in the list configuration information;
Step 3: for the first information acquisition, the system reads the common URL part from the list configuration information and derives the sequence numbers of all list pages from the total number of list pages, thereby combining them into the URLs of all list pages to be acquired from the target website;
for subsequent information acquisitions, the system reads the common URL part from the list configuration information together with the sequence numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website;
using these URLs, the system crawls the source code of the list pages of the target website and parses it to obtain the detail page link addresses contained in the list pages;
Step 4: crawling detail page content according to the detail page link addresses saved in the to-be-crawled link library;
accessing the detail page link addresses saved in the to-be-crawled link library one by one and obtaining the source code of each detail page;
after a detail page's content has been crawled successfully, saving its link address in the crawled link library for later duplicate checks;
Step 5: processing the pictures in the detail page content;
pictures are an important part of information content; when information content is acquired, only the picture link addresses are obtained, and further processing is needed to download the pictures;
after the text information of all detail pages has been crawled, picture crawling starts according to the picture link addresses obtained, which helps improve the efficiency of the whole process;
Step 6: after the detail page content has been crawled, exporting the content data to the designated interface.
2. network information batch capture method according to claim 1, is characterized in that, in step 1:
Can select a plurality of websites to carry out the batch capture of information, according to the different time periods, the acquisition time of a plurality of websites, acquisition mode, collection content are dispatched, in the online crest time, be set to serial acquisition mode, after the information collection of a website being completed, then start the information collection of next website; In the online trough time, be set to parallel acquisition mode, information collection is carried out in a plurality of websites simultaneously, guarantee the efficient of collection, and the utilization of resources is efficient.
3. The network information batch acquisition method according to claim 1, characterized in that, in step 3:
There are two modes for obtaining detail page link addresses, and both are stored in the detail page configuration information:
(1) Tag mode: first set the start position marker and the end position marker of the detail page link addresses contained in the list page source code; these position markers appear as fixed code in the web page and are stored in the detail page configuration information. The list page source code is searched for these markers, and the detail page link addresses between the start position marker and the end position marker are extracted and saved to the to-be-crawled link library;
(2) Specific link mode: first analyze the detail page link addresses contained in the list page source code and, according to the content to be collected, extract the feature code of the desired detail page link addresses; then construct the collection condition as a regular expression and store it in the detail page configuration information. All detail page link addresses are obtained from the list page source code and matched against the feature code; those that match are saved to the to-be-crawled link library.
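The two extraction modes of claim 3, the marker-based mode (1) and the regular-expression mode (2), might look roughly like this in practice. The function names, the `href` pattern, and the example markers are assumptions chosen for illustration; a real list page may need a proper HTML parser.

```python
import re

# Simplified pattern for double-quoted href attributes (an assumption).
HREF_RE = re.compile(r'href="([^"]+)"')


def links_by_marker(source, start_marker, end_marker):
    """Mode (1): take every link between the start and end position markers."""
    start = source.find(start_marker)
    if start == -1:
        return []
    start += len(start_marker)
    end = source.find(end_marker, start)
    if end == -1:
        return []
    return HREF_RE.findall(source[start:end])


def links_by_feature(source, feature_pattern):
    """Mode (2): collect all links, then keep those matching the feature regex."""
    feature = re.compile(feature_pattern)
    return [url for url in HREF_RE.findall(source) if feature.search(url)]
```

For a list page whose entries sit between two fixed comments, `links_by_marker(page, "<!--list-start-->", "<!--list-end-->")` would return every link in that region, while `links_by_feature(page, r"^/news/\d+\.html$")` would keep only links of that shape from the whole page.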
4. The network information batch acquisition method according to claim 3, characterized in that:
Each detail page link address obtained is compared with the detail page link addresses stored in the crawled link library. If it is not found there, it is saved to the to-be-crawled link library; otherwise it is discarded, which prevents a link from being crawled repeatedly;
Wherein tag mode is used to collect all detail page link addresses, and specific link mode is used to collect only the detail page link addresses that meet the content conditions.
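The duplicate check in claim 4 can be sketched with a set for the already-crawled links and a list for the pending ones. The names are hypothetical, and also skipping links that are already pending is an added assumption; the claim itself only compares against the already-crawled links.

```python
def enqueue_new_links(found, crawled, pending):
    """Queue links that have not been crawled; return the links that were dropped.

    `crawled` is a set of link addresses, `pending` is a list that is
    extended in place. Skipping links already in `pending` is an
    assumption beyond what the claim states.
    """
    dropped = []
    for url in found:
        if url in crawled or url in pending:
            dropped.append(url)
        else:
            pending.append(url)
    return dropped
```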
5. The network information batch acquisition method according to claim 4, characterized in that abnormal conditions are handled as follows:
(1) Timeout: because it cannot be known in advance whether a website is accessible, an exception may occur during access. An expiry time is set for the crawling process, so that when a website does not respond for a long time the process exits on its own initiative and avoids occupying system resources for a long time;
(2) Missing information: some websites are deployed on multiple IPs and return different results for different IPs. When crawl results are found to be missing information and investigation shows that the cause is an IP address restriction, the system configures a corresponding proxy server and accesses the site through a server with another IP to obtain the complete web page content;
(3) Frequent access: the exception arises because the website is accessed too frequently, which violates the access rules of the target website, so the target website restricts access. In this case a crawl frequency is set to slow down crawling: the system waits for a certain time before each page access to the target website, thereby avoiding the restriction so that data collection proceeds normally.
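The three measures in claim 5 (an expiry time, an optional proxy server, and a wait before each page access) could be combined along the following lines using only the Python standard library. The function names, the 10-second timeout, and the 2-second delay are illustrative assumptions, not values from the patent.

```python
import time
import urllib.request


def make_opener(proxy_url=None):
    """Build a urllib opener, optionally routed through a proxy server."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        return urllib.request.build_opener(handler)
    return urllib.request.build_opener()


def fetch(url, opener, timeout=10.0):
    """Fetch one page; an unresponsive site raises instead of hanging."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def crawl_politely(urls, fetch_page, delay=2.0, sleep=time.sleep):
    """Wait `delay` seconds before each page access; skip pages that fail."""
    pages = {}
    for url in urls:
        sleep(delay)
        try:
            pages[url] = fetch_page(url)
        except OSError:
            # URLError and timeout errors are OSError subclasses.
            continue
    return pages
```

A caller might combine them as `crawl_politely(urls, lambda u: fetch(u, make_opener(proxy)))`, where `proxy` is whatever proxy address has been configured for the site, or `None`.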
6. The network information batch acquisition method according to claim 1, characterized in that, in step 4:
Detail page content generally includes a title, information content, author, source, and time information. Besides this information, the detail page source code also contains various HTML code, and each item of information is preceded and followed by its corresponding HTML code. Therefore, to ensure that the collected detail page content can be displayed normally, the other HTML code is cleaned and filtered out;
For the title, author, source, and time information, it is sufficient to remove the HTML code corresponding to each item and retain the information itself;
For the information content, the line-break code and the picture link addresses are retained and all other HTML code is removed; any JavaScript code it contains is blocked to ensure the safety of the collected content.
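The cleaning rule for the information content in claim 6 (keep line breaks and picture links, drop other markup, block JavaScript) can be approximated with the standard-library `html.parser` module. The class and function names are hypothetical, and dropping `<style>` content alongside `<script>` is an added assumption.

```python
from html import escape
from html.parser import HTMLParser


class ContentCleaner(HTMLParser):
    """Keep text, line breaks and picture links; drop all other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.picture_urls = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("<br>")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.picture_urls.append(src)
                self.parts.append('<img src="%s">' % escape(src, quote=True))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            # Re-escape text so decoded entities cannot reintroduce markup.
            self.parts.append(escape(data, quote=False))


def clean_content(source):
    """Return (cleaned content, list of picture link addresses)."""
    cleaner = ContentCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts), cleaner.picture_urls
```

The returned list of picture link addresses is what a later picture-processing step would consume.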
7. The network information batch acquisition method according to claim 1, characterized in that, in step 5:
First, the picture link addresses are identified and saved to the picture collection library for the subsequent picture crawling operation;
Picture link addresses are normalized when they are saved: a picture link address as obtained generally carries parameters after the picture file name, so the characters after the "?" are removed when the address is saved;
Then the pictures are downloaded to local storage according to the picture link addresses stored in the picture collection library:
(1) With a picture link address from the picture collection library as the parameter, call the picture processing script on the file server; the script contains commands for creating a directory, fetching the picture, saving it to the corresponding directory, and renaming the file;
(2) Execute the picture processing script;
(3) After the script has finished, the corresponding picture has been downloaded and saved to the specified directory.
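The normalization and download steps of claim 7 can be sketched as below. The patent does not disclose the picture processing script itself, so everything here is an assumption: the function names, the choice of a SHA-256 hash of the address as the new file name, and the injected `fetch_bytes` callable. The sketch also drops the "?" itself together with the parameters.

```python
import hashlib
import os
from urllib.parse import urlsplit, urlunsplit


def normalize_picture_url(url):
    """Strip the query string (and any fragment) that follows the file name."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def local_picture_path(url, root):
    """Rename the file to a hash of its address, keeping the original extension."""
    clean = normalize_picture_url(url)
    extension = os.path.splitext(urlsplit(clean).path)[1]
    name = hashlib.sha256(clean.encode("utf-8")).hexdigest() + extension
    return os.path.join(root, name)


def save_picture(url, root, fetch_bytes):
    """Create the directory, fetch the picture, and save it under its new name."""
    os.makedirs(root, exist_ok=True)
    path = local_picture_path(url, root)
    with open(path, "wb") as handle:
        handle.write(fetch_bytes(normalize_picture_url(url)))
    return path
```

Hashing the normalized address means the same picture requested with different parameters maps to one local file.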
8. The network information batch acquisition method according to claim 7, characterized in that abnormal conditions are handled as follows:
A picture link address may be unavailable, in which case the picture cannot be accessed. During normal processing, an expiry time is set for crawling a picture; if the picture file has not been obtained when this time has elapsed, no further attempt is made, which saves system resources and ensures that resources are used effectively.
9. The network information batch acquisition method according to claim 1, characterized in that, in step 6: an automatic review is performed before export, during which the crawled content is previewed in the form of a web page.
CN201410166752.5A 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information Active CN103927370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410166752.5A CN103927370B (en) 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410166752.5A CN103927370B (en) 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information

Publications (2)

Publication Number Publication Date
CN103927370A true CN103927370A (en) 2014-07-16
CN103927370B CN103927370B (en) 2015-02-18

Family

ID=51145591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410166752.5A Active CN103927370B (en) 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information

Country Status (1)

Country Link
CN (1) CN103927370B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010231508A (en) * 2009-03-27 2010-10-14 Kddi Corp Device, method and program for determining significance
CN103309954A (en) * 2013-05-27 2013-09-18 复旦大学 Html webpage based data extracting system
CN103631806A (en) * 2012-08-24 2014-03-12 华为技术有限公司 Network information fetching method and device

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715016B (en) * 2015-02-04 2018-02-16 北京中搜搜悦网络技术有限公司 One kind searches happy acquisition method
CN104715016A (en) * 2015-02-04 2015-06-17 北京中搜网络技术股份有限公司 Search engine collection method
CN104731928A (en) * 2015-03-27 2015-06-24 李冬 Data collecting and processing equipment
CN106293686A (en) * 2015-06-25 2017-01-04 阿里巴巴集团控股有限公司 The method and device of exhibiting pictures annotation in code text
CN106293686B (en) * 2015-06-25 2019-08-02 阿里巴巴集团控股有限公司 The method and device that exhibiting pictures annotate in code text
CN107644028B (en) * 2016-07-20 2020-09-04 平安科技(深圳)有限公司 Method and system for collecting webpage data
CN107644028A (en) * 2016-07-20 2018-01-30 平安科技(深圳)有限公司 The collection method and system of web data
CN106302797A (en) * 2016-08-31 2017-01-04 北京锐安科技有限公司 A kind of cookie accesses De-weight method and device
CN108470296B (en) * 2017-02-23 2022-02-25 阿里巴巴集团控股有限公司 Business object information processing method and device
CN108470296A (en) * 2017-02-23 2018-08-31 阿里巴巴集团控股有限公司 A kind of business object information processing method and processing device
CN107273497A (en) * 2017-06-16 2017-10-20 郑州云海信息技术有限公司 A kind of vulnerability information acquisition method and device
CN107784113A (en) * 2017-11-08 2018-03-09 深圳市科盾科技有限公司 Html web page collecting method, device and computer-readable recording medium
CN108133010A (en) * 2017-12-22 2018-06-08 新奥(中国)燃气投资有限公司 A kind of information grasping means and device
CN108052648B (en) * 2017-12-26 2020-08-21 福建中金在线信息科技有限公司 Website picture deleting method and device and electronic equipment
CN108052648A (en) * 2017-12-26 2018-05-18 福建中金在线信息科技有限公司 A kind of website image deletion method, device and electronic equipment
CN111159518A (en) * 2019-12-26 2020-05-15 深圳前海环融联易信息科技服务有限公司 News data acquisition method and device, computer equipment and storage medium
CN111159518B (en) * 2019-12-26 2023-10-24 深圳前海环融联易信息科技服务有限公司 News data acquisition method and device, computer equipment and storage medium
CN111460255A (en) * 2020-03-26 2020-07-28 第一曲库(北京)科技有限公司 Music work information data acquisition and storage method
CN111953766A (en) * 2020-08-07 2020-11-17 福建省天奕网络科技有限公司 Method and system for collecting network data
CN111931113A (en) * 2020-09-16 2020-11-13 深圳壹账通智能科技有限公司 Data cleaning method and related equipment
CN112187949A (en) * 2020-10-09 2021-01-05 珠海格力电器股份有限公司 Picture batch downloading method and device, storage medium and electronic device
CN114417200A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment

Also Published As

Publication number Publication date
CN103927370B (en) 2015-02-18

Similar Documents

Publication Publication Date Title
CN103927370B (en) Network information batch acquisition method of combined text and picture information
CN107895009B (en) Distributed internet data acquisition method and system
CN102930059B (en) Method for designing focused crawler
CN107943838B (en) Method and system for automatically acquiring xpath generated crawler script
CN104077402B (en) Data processing method and data handling system
CN105224691B (en) A kind of information processing method and device
CN102932207B (en) The method of monitoring website access information and server
CN102833233B (en) Method and device for recognizing web pages
CN103455600B (en) A kind of video URL grasping means, device and server apparatus
CN110417873B (en) Network information extraction system for realizing recording webpage interactive operation
CN102724184B (en) A kind of web page storage sharing method and server
CN111444408B (en) Network search processing method and device and electronic equipment
CN108334641B (en) Method, system, electronic equipment and storage medium for collecting user behavior data
CN102932206A (en) Method and system for monitoring website access information
CN103248707B (en) File access method, system and equipment
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN103530336A (en) Equipment and method for identifying invalid parameters in URLs
CN102710795A (en) Hotspot collecting method and device
CN108710670A (en) A kind of log analysis method, device, electronic equipment and readable storage medium storing program for executing
CN104252532A (en) Website information statistic method and device
CN107800686A (en) A kind of fishing website recognition methods and device
CN103067387A (en) Monitoring system and monitoring method for anti phishing
CN104281629A (en) Method and device for extracting picture from webpage and client equipment
CN103530337A (en) Device and method for recognizing invalid parameters in URL
CN103605742B (en) Recognize the method and device of Internet resources entity catalogue page

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant