CN103927370B - Network information batch acquisition method of combined text and picture information - Google Patents

Info

Publication number
CN103927370B
Authority
CN
China
Prior art keywords
information
page
detail page
list
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410166752.5A
Other languages
Chinese (zh)
Other versions
CN103927370A (en)
Inventor
唐宇波
夏平嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd filed Critical Focus Technology Co Ltd
Priority to CN201410166752.5A priority Critical patent/CN103927370B/en
Publication of CN103927370A publication Critical patent/CN103927370A/en
Application granted granted Critical
Publication of CN103927370B publication Critical patent/CN103927370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a method for batch acquisition of network information that combines text and picture information. Through a series of configurations, the method acquires information from target websites, removes duplicates, stores the information in a database, and sends it to a destination specified by the client in a format specified by the client. The method comprises the following steps: determining the websites from which information is to be acquired, the specific URLs of the information list pages, and the number of those pages; finding the common part of the list page URLs and storing it in the list configuration information; during acquisition, reading the common URL part from the list configuration information, deriving the serial numbers of all list pages from the total number of list pages, and combining them into the URLs of all list pages to be acquired from the target website; capturing detail page content according to the detail page links stored in the pending-link library; processing the pictures in the detail page content; and, after the detail page content has been captured, exporting the content data to a specified interface.

Description

A method for batch acquisition of network information combining text and picture information
Technical field
The present invention belongs to the field of Internet technology and relates to a method for batch acquisition of network information combining text and picture information.
Technical background
With the rapid development of the Internet, a large amount of varied information has accumulated online, such as domestic news, potential customer information, prices of competing products, real-time financial information, statistical reports, industry analysis reports, and supply and demand information. Analyzing this information together with an enterprise's internal business data is of great help to management decision-making. In addition, once an enterprise has organized and digested this information, publishing it on its own website enriches the site's content and improves the visitor experience.
Many tools can already collect web page content, but they are mainly based on text acquisition and do not effectively collect the pictures on a web page. A reliable and effective method is also lacking for acquiring network information in batches and determining whether collected content is a duplicate.
The patent "the regular automatic capturing method of cyber journalism information" (patent application number: CN201210402435.X) enables scheduled acquisition of news information and, through configuration, saves the content of a target website to a file server. However, that method does not handle abnormal conditions during acquisition, cannot correctly identify redundant code in an information page, and does not process the pictures in a web page.
Therefore, reliably acquiring the picture information and text information of web pages together, in batches, is a problem that urgently needs to be solved.
Summary of the invention
To address the problems in the prior art, the invention provides a method for batch acquisition of network information combining text and picture information. Through a series of configurations, it acquires information from target websites, removes duplicates, stores the information in a database, and sends it to a destination specified by the client in a format specified by the client.
A method for batch acquisition of network information combining text and picture information, comprising:
1. Determine the website from which information is to be acquired, the specific URLs of the information list pages to be acquired on that website, and the number of those list pages.
Multiple websites can be selected for batch acquisition. The acquisition time, acquisition mode, and acquisition content of the websites are scheduled according to the time period. During peak Internet usage hours, the serial acquisition mode is used: acquisition for the next website starts only after acquisition for the current website has finished. During off-peak hours, the parallel acquisition mode is used: information is acquired from multiple websites simultaneously. This keeps acquisition efficient and makes efficient use of resources.
2. From the URLs of the list pages, find their common part and save it in the list configuration information. The serial number information of the list pages is also saved in the list configuration information.
3. For the first acquisition, the system reads the common URL part from the list configuration information, derives the serial numbers of all list pages from the total number of list pages, and combines them into the URLs of all list pages to be acquired from the target website.
For subsequent acquisitions, the system reads the common URL part from the list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.
Using these URLs, the system captures the source code of the list pages of the target website and parses it to obtain the detail page links contained in the list pages.
Detail page links can be obtained in 2 ways, and both are saved in the detail page configuration information:
(1) Marker mode. First set a start position marker and an end position marker for the detail page links contained in the list page source code. These position markers are fixed pieces of code in the web page and are saved in the detail page configuration information. The list page source code is searched for the markers, the detail page link is extracted from between the start position marker and the end position marker, and the link is saved to the pending-link library.
(2) Specific link mode. First analyze the detail page links contained in the list page source code and, according to the content to be acquired, extract the feature code of the detail page links. Then build the acquisition condition as a regular expression and save it in the detail page configuration information. All detail page links are obtained from the list page source code and matched against the feature code; a link that matches is saved to the pending-link library.
Each detail page link obtained is compared with the detail page links saved in the captured-link library. If it is not among them, it is saved to the pending-link library; otherwise it is discarded. This prevents a link from being captured repeatedly.
The marker mode is used to acquire all detail page links, and the specific link mode is used to acquire the detail page links that meet content conditions. The two modes are chosen according to the actual situation and can also be used in combination.
Handling of abnormal conditions:
(1) Excessive time. Whether a website is reachable cannot be predicted, and an access may fail. A timeout is set for the capture process so that when a website does not respond for a long time, the process exits on its own initiative and does not occupy system resources for a long time.
(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results when accessed from abroad than when accessed domestically. When the capture result is missing information and investigation shows that an IP address restriction is the cause, the system sets up a corresponding proxy server and accesses the site through a server with a different IP to obtain the complete web page content.
(3) Frequent access. This abnormality arises because the website is accessed too frequently, which violates the access rules of the target website, so the target website restricts the access. In this case a capture frequency is set to slow down capture, and the system waits for a certain time before each page access to the target website. This avoids the restriction and allows data acquisition to proceed normally.
4. Capture detail page content according to the detail page links saved in the pending-link library.
The detail page links saved in the pending-link library are accessed in turn to obtain the source code of each detail page.
Detail page content generally includes information such as the title, body content, author, source, and time. Besides this information, the detail page source code also contains various HTML code, and each piece of information is preceded and followed by its corresponding HTML code. Therefore, to ensure that the acquired detail page content can be displayed normally, the other HTML code is cleaned and filtered out.
For the title, author, source, time, and similar information, it is enough to remove the HTML code surrounding each item and keep the item itself.
For the body content, line-break code and image link addresses are retained and all other HTML code is removed. Any JavaScript code it contains must be blocked to keep the acquired content safe.
After a detail page has been captured successfully, its link is saved to the captured-link library for later duplicate checks.
5. Process the pictures in the detail page content.
Pictures are an important part of the content. When the content is acquired, only the link addresses of the pictures are obtained; further processing is needed to download the pictures themselves.
Picture capture starts from the obtained image link addresses only after the text of all detail pages has been captured, which helps improve the efficiency of the whole process.
The image link addresses are first identified and saved to the picture collection library for the subsequent picture capture operation.
Image link addresses must be normalized when they are saved. An obtained image link address often carries parameters after the picture file name; when the address is saved, everything from the "?" onward is removed.
Then the pictures are downloaded locally according to the image link addresses saved in the picture collection library:
(1) The picture processing script on the file server is called with an image link address from the picture collection library as its parameter. The script contains commands for creating a directory, fetching the picture, saving it to the corresponding directory, renaming the file, and so on.
(2) The picture processing script is executed.
(3) When the script finishes, the corresponding picture has been downloaded and saved to the specified directory.
Because there may be many pictures to save and their file sizes vary, the total capture time cannot be determined in advance. Picture capture is therefore designed to run in parallel, which greatly improves efficiency when many pictures are captured.
Handling of abnormal conditions:
An image link address may be unavailable, in which case the picture cannot be accessed. In normal processing a timeout is set for picture capture; if the picture file has not been obtained when the timeout elapses, no further attempt is made. This saves system resources and ensures they are used effectively.
6. After the detail page content has been captured, export the content data to a specified interface.
An automatic review is performed before export, during which the captured content is automatically previewed as a web page.
The content data is exported to the specified interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server. Export to files in formats such as TXT text and Excel is also supported, as is sending by e-mail.
The beneficial effects of the present invention are as follows:
1. The text and picture information of the content are effectively combined, so that a complete piece of information is collected locally and can conveniently be displayed again;
2. Several methods are added for handling abnormal conditions during acquisition, ensuring stable and reliable data acquisition;
3. Acquisition resources are optimized, improving both resource utilization and acquisition efficiency;
4. Multiple acquisition requirements are processed and scheduled in a unified way, avoiding duplicated development and improving development efficiency.
Brief description of the drawings
Fig. 1 is a processing flow chart of the method of the present embodiment for batch acquisition of network information combining text and picture information.
Embodiment
The present invention is described in further detail below with reference to the drawing and a specific embodiment.
As shown in Fig. 1, the processing flow of the method of the present embodiment for batch acquisition of network information combining text and picture information comprises:
Step 11. Determine the website from which information is to be acquired, the specific URLs of the information list pages to be acquired on that website, and the number of those list pages.
Multiple websites can be selected for batch acquisition. During peak Internet usage hours, the serial acquisition mode is used: acquisition for the next website starts only after acquisition for the current website has finished. During off-peak hours, the parallel acquisition mode is used: information is acquired from multiple websites simultaneously. This keeps acquisition efficient and makes efficient use of resources.
For example, from 8:00 to 24:00 every day the serial acquisition mode is used;
from 0:00 to 8:00 every day the parallel acquisition mode is used.
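The time-based choice between serial and parallel acquisition can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `acquisition_mode` and `plan_batches` are hypothetical, and the only values taken from the text are the 8:00 and 24:00 boundaries.

```python
from datetime import time

# Boundary taken from the example above: 08:00-24:00 is the peak (serial) window,
# 00:00-08:00 is the off-peak (parallel) window.
PEAK_START = time(8, 0)


def acquisition_mode(now: time) -> str:
    """Return 'serial' during the peak window and 'parallel' otherwise."""
    return "serial" if now >= PEAK_START else "parallel"


def plan_batches(sites: list[str], now: time) -> list[list[str]]:
    """Group sites into batches; the sites inside one batch are acquired together.

    Hypothetical helper: serial mode yields one site per batch, parallel mode
    yields a single batch holding every site.
    """
    if acquisition_mode(now) == "serial":
        return [[site] for site in sites]
    return [list(sites)]
```

A scheduler would call `plan_batches` with the current wall-clock time and work through the batches in order, starting the sites of one batch concurrently.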
Step 12. From the URLs of the list pages, find their common part and save it in the list configuration information. The total number of list pages is also saved in the list configuration information.
For example, consider the market news pages of Sina real estate, whose news list has 3 pages. Analysis of the list page URLs shows that "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" is the common part of their URLs, and "01 – 03" is the information representing the number of list pages.
Step 13. For the first acquisition, the system reads the common URL part from the list configuration information, derives the serial numbers of all list pages from the total number of list pages, and combines them into the URLs of all list pages to be acquired from the target website.
For subsequent acquisitions, the system reads the common URL part from the list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website.
In the example above, the saved common part "http://search.house.sina.com.cn/bj/news/scdt/page (*)/" is combined with the list page serial number information "01 – 03" to form the URLs of all list pages:
http://search.house.sina.com.cn/bj/news/scdt/page01
http://search.house.sina.com.cn/bj/news/scdt/page02
http://search.house.sina.com.cn/bj/news/scdt/page03
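The combination of the common URL part with page serial numbers can be sketched as below. The function name `list_page_urls` and its parameters are hypothetical. Two assumptions are made that the text does not state: the serial number is appended directly to the common part without a trailing slash (matching the three URLs listed above), and the most recent pages carry the lowest serial numbers.

```python
def list_page_urls(common_part: str, total_pages: int, first_run: bool,
                   recent: int = 2, width: int = 2) -> list[str]:
    """Combine the stored common URL part with zero-padded page serial numbers.

    On the first run every list page is generated; on later runs only the
    `recent` newest pages are generated (assumed here to be pages 01, 02, ...).
    """
    last = total_pages if first_run else min(recent, total_pages)
    return [f"{common_part}{n:0{width}d}" for n in range(1, last + 1)]


# Common part from the example above, with the "(*)" placeholder dropped.
COMMON = "http://search.house.sina.com.cn/bj/news/scdt/page"

all_pages = list_page_urls(COMMON, total_pages=3, first_run=True)
latest_pages = list_page_urls(COMMON, total_pages=3, first_run=False)
```

If a site lists its newest items on the highest-numbered page instead, the range would need to be taken from the other end.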
Using these URLs, the system captures the source code of the list pages and parses it to obtain the detail page links contained in the list pages.
Detail page links can be obtained in 2 ways, and both are saved in the detail page configuration information:
(1) Marker mode. First set a start position marker and an end position marker for the detail page links contained in the list page source code. These position markers are fixed pieces of code in the web page and are saved in the detail page configuration information. The list page source code is searched for the markers, the detail page link is extracted from between the start position marker and the end position marker, and the link is saved to the pending-link library.
For example, the list page source code contains the following; the link in it is the link to be acquired.
<h3><a href="http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml">
The configuration is then as follows. Start position marker: <h3><a href="   End position marker: ">
The detail page link finally obtained is:
http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml
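The marker mode amounts to a plain substring search between two fixed strings. The sketch below illustrates that idea with a hypothetical helper, `extract_between`; the sample anchor text "Headline" and the closing tags are invented for the demonstration, while the markers and the link come from the example above.

```python
def extract_between(source: str, start_marker: str, end_marker: str) -> list[str]:
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = source.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)          # skip past the start marker itself
        end = source.find(end_marker, start)
        if end == -1:
            break
        results.append(source[start:end])
        pos = end + len(end_marker)         # continue after the end marker
    return results


# Sample list-page fragment; only the opening tag is taken from the text.
sample = ('<h3><a href="http://bj.house.sina.com.cn/news/2014-04-15/'
          '17352689405.shtml">Headline</a></h3>')
links = extract_between(sample, '<h3><a href="', '">')
```

Because the search is purely textual, it keeps working on malformed HTML, but it also breaks as soon as the site changes the surrounding markup.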
(2) Specific link mode. First analyze the detail page links contained in the list page source code and, according to the content to be acquired, extract the feature code of the detail page links. Then build the acquisition condition as a regular expression and save it in the detail page configuration information. All detail page links are obtained from the list page source code and matched against the feature code; a link that matches is saved to the pending-link library.
In the example above, the configuration is: must contain http://bj.house.sina.com.cn/news/. All links of this form are then acquired.
Configuration by exclusion is also possible, that is, specifying the content that does not need to be acquired, for example: must not contain /scdt/|/zhuanti/|page. In this way, several different filter conditions can be used at the same time.
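The include and exclude conditions can be expressed as two regular expressions applied to every link found in the list page. This is a sketch under assumptions: the `href="..."` pattern used to find links and the name `select_links` are illustrative choices, and the sample markup is invented; only the two filter strings come from the text.

```python
import re

# Assumed way of finding links: any double-quoted href attribute.
HREF = re.compile(r'href="([^"]+)"')
# "Must contain" and "must not contain" conditions from the example above.
MUST_CONTAIN = re.compile(r"http://bj\.house\.sina\.com\.cn/news/")
MUST_NOT_CONTAIN = re.compile(r"/scdt/|/zhuanti/|page")


def select_links(source: str) -> list[str]:
    """Keep links that match the include pattern and none of the exclude patterns."""
    return [
        url for url in HREF.findall(source)
        if MUST_CONTAIN.search(url) and not MUST_NOT_CONTAIN.search(url)
    ]


sample = (
    '<a href="http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml">a</a>'
    '<a href="http://bj.house.sina.com.cn/news/zhuanti/x.shtml">b</a>'
    '<a href="http://example.com/other">c</a>'
)
selected = select_links(sample)
```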
Each detail page link obtained is compared with the detail page links saved in the captured-link library. If it is not among them, it is saved to the pending-link library; otherwise it is discarded. This prevents a link from being captured repeatedly.
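The duplicate check can be sketched with a set lookup. The name `enqueue_new_links` and the in-memory list and set standing in for the two link libraries are assumptions. The sketch also skips links that are already waiting in the pending library, which goes slightly beyond the comparison described above.

```python
def enqueue_new_links(found: list[str], pending: list[str], captured: set[str]) -> list[str]:
    """Append to `pending` the links that were neither captured nor queued yet.

    Returns the links that were actually added.
    """
    seen = set(pending) | captured
    added = []
    for link in found:
        if link not in seen:
            pending.append(link)
            seen.add(link)
            added.append(link)
    return added


pending_links = ["u1"]
captured_links = {"u2"}
newly_added = enqueue_new_links(["u1", "u2", "u3", "u3"], pending_links, captured_links)
```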
The marker mode is used to acquire all detail page links, and the specific link mode is used to acquire the detail page links that meet content conditions. The two modes are chosen according to the actual situation and can also be used in combination.
Handling of abnormal conditions:
(1) Excessive time. Whether a website is reachable cannot be predicted, and an access may fail. A timeout is set for the capture process so that when a website does not respond for a long time, the process exits on its own initiative and does not occupy system resources for a long time.
(2) Missing information. Some websites are deployed on multiple IPs and return different results to different IPs. For example, some websites show different results when accessed from abroad than when accessed domestically. When the capture result is missing information and investigation shows that an IP address restriction is the cause, the system sets up a corresponding proxy server and accesses the site through a server with a different IP to obtain the complete web page content.
(3) Frequent access. This abnormality arises because the website is accessed too frequently, which violates the access rules of the target website, so the target website restricts the access. In this case a capture frequency is set to slow down capture, and the system waits for a certain time before each page access to the target website. This avoids the restriction and allows data acquisition to proceed normally.
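The three safeguards above (a timeout, an optional proxy, and a pause before each request) can be combined in a small fetch loop. This is a sketch using Python's standard `urllib`; the function names, the 30-second timeout, and the 2-second delay are assumptions, since the text gives no concrete values for list and detail pages.

```python
import time
import urllib.request
from typing import Optional


def fetch(url: str, timeout_s: float = 30.0, proxy: Optional[str] = None) -> bytes:
    """Fetch one page with a socket timeout, optionally through an HTTP proxy."""
    handlers = []
    if proxy:
        # Route the request through another IP when the direct result is incomplete.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=timeout_s) as response:
        return response.read()


def crawl(urls, fetch_page=fetch, delay_s: float = 2.0, sleep=time.sleep) -> dict:
    """Fetch pages one by one, waiting before each request and skipping failures."""
    pages = {}
    for url in urls:
        sleep(delay_s)                  # throttle to respect the site's access rules
        try:
            pages[url] = fetch_page(url)
        except OSError:                 # URLError and socket timeouts derive from OSError
            continue                    # give up on this page instead of blocking
    return pages
```

Passing `fetch_page` and `sleep` as parameters keeps the loop testable without network access; a production crawler would more likely log the failures and retry through a proxy.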
Step 14. Capture detail page content according to the detail page links saved in the pending-link library.
The detail page links saved in the pending-link library are accessed in turn to obtain the source code of each detail page.
Detail page content generally includes information such as the title, body content, author, source, and time. Besides this information, the detail page source code also contains various HTML code, and each piece of information is preceded and followed by its corresponding HTML code. Therefore, to ensure that the acquired detail page content can be displayed normally, the other HTML code is cleaned and filtered out.
For the title, author, source, time, and similar information, it is enough to remove the HTML code surrounding each item and keep the item itself.
Take the news item http://bj.house.sina.com.cn/news/2014-04-15/17352689405.shtml as an example.
An excerpt of its source code is as follows:
The title configuration to be set is: the start position marker is "<title>" and the end position marker is "</title>". The title of this item, "[cover story] agency destiny 7 is foretold greatly", can then be acquired;
Likewise, the publication time configuration is: the start position marker is "<div class=" tc zwdatemb15 " ><span>2" and the end position marker is "</span>". The publication time of this item, "2014-04-15 17:35:11", can then be acquired;
For the body content, line-break code and picture tag code are retained and all other HTML code is removed. Any JavaScript code it contains must be blocked to keep the acquired content safe.
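The body-content cleaning rule (keep line breaks and picture tags, drop other tags, block JavaScript) can be sketched with the standard `html.parser` module. The class and function names are hypothetical, and re-emitting pictures as a bare `<img src="...">` tag is an assumption about what "retaining picture tag code" should look like.

```python
import html
from html.parser import HTMLParser


class ContentCleaner(HTMLParser):
    """Keep text, <br> and <img src=...>; drop all other tags and script bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._in_script = 0             # >0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1
        elif tag == "br":
            self.parts.append("<br>")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.parts.append(f'<img src="{html.escape(src, quote=True)}">')

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script -= 1

    def handle_data(self, data):
        if not self._in_script:
            # Re-escape text so that it cannot reintroduce markup.
            self.parts.append(html.escape(data, quote=False))


def clean_content(fragment: str) -> str:
    """Return the fragment with only text, line breaks and pictures kept."""
    cleaner = ContentCleaner()
    cleaner.feed(fragment)
    cleaner.close()
    return "".join(cleaner.parts)
```

For instance, `clean_content('<p>Hello<br/>world</p><script>alert(1)</script>')` keeps the text and the line break and drops the paragraph tag and the script.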
--text end--> ". For the corresponding body content, image link addresses must be changed to local image link addresses, line-break tags are retained, and all other HTML tag code is removed. The body content of this example after processing is as follows.
After a detail page has been captured successfully, its link is saved to the captured-link library for duplicate checks in later detail page captures.
Step 15. Process the pictures in the detail page content.
Pictures are an important part of the content. When the content is acquired, only the link addresses of the pictures are obtained; further processing is needed to download the pictures themselves.
Picture capture starts from the obtained image link addresses only after the text of all detail pages has been captured, which improves reliability and the efficiency of the whole process.
For example, if 15 detail pages are to be acquired from the Sina website above and 5 image link addresses are obtained from them, picture acquisition from those 5 image link addresses starts after the text of all 15 detail pages has been acquired.
The image links are first identified and saved to the picture collection library for the subsequent picture capture operation.
The picture information obtained in this example is:
http://src.house.sina.com.cn/imp/imp/deal/f1/28/b/fd28900e78b71af7280d18693f3_p1_mk1.jpg
Image link addresses must be normalized when they are saved. An obtained image link address often carries parameters after the picture file name; when the address is saved, the "?" and all parameters after it are removed.
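The normalization step is a one-line string operation. In this sketch the function name `normalize_image_url` and the `?w=200` parameter in the usage below are hypothetical; the base address is the one from the example above.

```python
def normalize_image_url(url: str) -> str:
    """Drop the '?' and everything after it, leaving the plain picture address."""
    return url.split("?", 1)[0]


BASE = ("http://src.house.sina.com.cn/imp/imp/deal/f1/28/b/"
        "fd28900e78b71af7280d18693f3_p1_mk1.jpg")
normalized = normalize_image_url(BASE + "?w=200")
```

Normalizing before saving also means that two links that differ only in their parameters collapse to one entry in the picture collection library.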
Then the pictures are downloaded locally according to the image link addresses saved in the picture collection library.
(1) The picture processing script on the file server is called with an image link address from the picture collection library as its parameter. The script contains commands for creating a directory, fetching the picture, saving it to the corresponding directory, renaming the file, and so on.
(2) The picture processing script is executed.
(3) The corresponding picture is downloaded and saved to the specified directory.
Because there may be many pictures to save and their file sizes vary (some pictures may be several megabytes), the total capture time is uncertain. Picture capture therefore runs in parallel, which greatly improves efficiency when many pictures are captured.
Handling of abnormal conditions:
An image link address may be unavailable, in which case the picture cannot be accessed. In normal processing a timeout is set for picture capture; if the file has not been obtained when the timeout elapses, no further attempt is made. This saves system resources and ensures they are used effectively.
In the present embodiment this timeout is set to between 1 minute and 1.5 minutes.
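Parallel picture capture with a timeout can be sketched with a thread pool. This does not reproduce the file-server script described above: the names `download_image` and `download_all`, the worker count, and the use of `urllib` are all assumptions. Note also that the `timeout` passed to `urlopen` limits each blocking socket operation rather than the whole download, so it only approximates the 1 to 1.5 minute limit in the text.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download_image(url: str, target_dir: str, timeout_s: float = 60.0) -> str:
    """Download one picture into target_dir and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, os.path.basename(url))
    with urllib.request.urlopen(url, timeout=timeout_s) as response, \
            open(path, "wb") as out:
        out.write(response.read())
    return path


def download_all(urls, target_dir, download=download_image, workers: int = 4):
    """Download pictures in parallel; return ({url: local path}, [failed urls])."""
    saved, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(download, url, target_dir) for url in urls}
        for url, future in futures.items():
            try:
                saved[url] = future.result()
            except OSError:             # unavailable address or timeout: do not retry
                failed.append(url)
    return saved, failed
```

The `download` parameter lets the loop be exercised with a stand-in function, which is also where a call to a server-side script could be plugged in.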
Step 16. After the detail page content has been captured, export the content data to a specified interface.
An automatic review is also required before export, during which the captured content is automatically previewed as a web page.
The content data is exported to the specified interface. The data sources supported by the system include mainstream databases such as Oracle, MySQL, and SQL Server. Export to files in formats such as TXT text and Excel is also supported, as is sending by e-mail.
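As an illustration of the TXT export path only, the sketch below writes captured records as tab-separated text. The field list, the function name `export_txt`, and the sample record are assumptions; database and e-mail export are not shown, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Assumed record layout, based on the fields named in the description.
FIELDS = ["title", "author", "source", "time", "content"]


def export_txt(records, file_obj) -> int:
    """Write one tab-separated line per record, preceded by a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS, delimiter="\t",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)       # missing fields are written as empty strings
    return len(records)


buffer = io.StringIO()
count = export_txt(
    [{"title": "T", "time": "2014-04-15 17:35:11", "content": "Body"}],
    buffer,
)
```

To write a real file, the buffer can be replaced by `open(path, "w", encoding="utf-8", newline="")`.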
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims (9)

1. A method for batch acquisition of network information combining text and picture information, characterized in that it comprises:
Step 1: determining the website from which information is to be acquired, the specific URLs of the information list pages to be acquired on that website, and the number of said list pages;
Step 2: from the URLs of the list pages, finding the common part of the URLs and saving it in the list configuration information, while also saving the serial number information of said list pages in the list configuration information;
Step 3: for the first acquisition, the system reads the common URL part from said list configuration information, derives the serial numbers of all list pages from the total number of list pages, and combines them into the URLs of all list pages to be acquired from the target website;
for subsequent acquisitions, the system reads the common URL part from said list configuration information together with the serial numbers of the 2 most recent list pages, and combines them into the URLs of the 2 most recent list pages to be acquired from the target website;
using the URLs of said 2 most recent list pages, the system captures the source code of these list pages of the target website and parses it to obtain the detail page links contained in said 2 most recent list pages;
Step 4: capturing detail page content according to the detail page links saved in the pending-link library;
accessing in turn the detail page links saved in the pending-link library to obtain the source code of each detail page;
after a detail page has been captured successfully, saving its link to the captured-link library for later duplicate checks;
Step 5: processing the pictures in the detail page content;
acquiring the content and obtaining the link addresses of the pictures, and, after the text of all detail pages has been captured, starting picture capture from the obtained image link addresses;
Step 6: after said detail page content has been captured, exporting the content data to a specified interface.
2. The method for batch acquisition of network information according to claim 1, characterized in that, in Step 1:
multiple websites are selected for batch acquisition, and the acquisition time, acquisition mode, and acquisition content of the websites are scheduled according to the time period; during peak Internet usage hours, the serial acquisition mode is used, that is, acquisition for the next website starts only after acquisition for the current website has finished; during off-peak hours, the parallel acquisition mode is used, that is, information is acquired from multiple websites simultaneously.
3. The network information batch capture method according to claim 1, characterized in that, in step 3:
Detail page link addresses are obtained in one of two modes, and both modes are stored in the detail page configuration information:
(1) Marker mode: set a start position marker and an end position marker for the detail page link addresses contained in the list page source code; the start and end position markers are fixed pieces of code in the web page and are stored in the detail page configuration information; search the list page source code for the start position marker and the end position marker, extract the detail page link addresses located between them, and save them to the to-be-captured link library;
(2) Specific link mode: analyze the detail page link addresses contained in the list page source code and extract from them the feature code of the detail page link addresses; build the acquisition condition as a regular expression according to the content to be collected, and store it in the detail page configuration information; obtain all detail page link addresses in the list page source code and match them against the feature code; if an address matches, save it to the to-be-captured link library.
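The two extraction modes can be contrasted with a small Python sketch. It assumes that the markers bracket one region of the list page and that links appear as ordinary `href` attributes; `extract_by_markers`, `extract_by_feature`, and the sample markers are hypothetical, not names from the patent.

```python
import re

# Assumed link syntax: href="..." or href='...'.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_by_markers(source, start_mark, end_mark):
    """Marker mode: return every link found between the two fixed markers."""
    start = source.find(start_mark)
    if start == -1:
        return []
    start += len(start_mark)
    end = source.find(end_mark, start)
    if end == -1:
        return []
    return HREF_RE.findall(source[start:end])


def extract_by_feature(source, feature_pattern):
    """Specific link mode: keep only links matching the feature regex."""
    feature = re.compile(feature_pattern)
    return [url for url in HREF_RE.findall(source) if feature.search(url)]


LIST_PAGE = (
    '<div id="nav"><a href="/about.html">About</a></div>'
    '<!-- list start -->'
    '<a href="/news/101.html">A</a><a href="/sports/7.html">B</a>'
    '<!-- list end -->'
)
```

With this sample page, marker mode between `<!-- list start -->` and `<!-- list end -->` returns both list links and skips the navigation link, while specific link mode with the pattern `^/news/\d+\.html$` keeps only the news link.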
4. The network information batch capture method according to claim 3, characterized in that:
Each obtained detail page link address is compared with the detail page link addresses stored in the captured link library; if it is not among them, it is saved to the to-be-captured link library; otherwise it is discarded, which prevents detail page link addresses already stored in the captured link library from being captured again;
Wherein marker mode is used to collect all detail page link addresses, and specific link mode is used to collect only the detail page link addresses that meet a content condition.
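The duplicate check reduces to a membership test before queuing. In this minimal sketch the captured links are held in a Python set and the waiting links in a list; both containers and the name `enqueue_new_links` are assumptions. The extra check against the waiting list itself is an addition of this sketch and is not stated in the claim.

```python
def enqueue_new_links(found, captured, pending):
    """Queue links that have not been captured yet; return the ones added.

    found:    links just extracted from a list page
    captured: set of links whose detail pages were already captured
    pending:  list of links waiting to be captured (modified in place)
    """
    added = []
    for url in found:
        if url in captured:
            continue  # already captured earlier, so discard it
        if url in pending:
            continue  # assumption of this sketch: also skip queued links
        pending.append(url)
        added.append(url)
    return added
```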
5. The network information batch capture method according to claim 4, characterized by the following exception handling:
(1) Timeout: when access to a website is abnormal, a timeout is set for the capture process; if the website does not respond within the timeout, the capture exits;
(2) Missing information: when a website is deployed across multiple IPs and returns different results to different IPs, so that an IP address restriction causes the capture result to miss information, the system configures a corresponding proxy server and accesses the website through a server with a different IP to obtain the complete page content;
(3) Overly frequent access: when the website is accessed so frequently that the target website's access rules are violated, a capture frequency is set for the target website to slow the capture speed: the system waits for a certain time before each page access to the target website, thereby avoiding the restriction and allowing data acquisition to proceed normally.
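The three exception cases map onto a timeout, an optional proxy, and a delay between requests. The sketch below uses only the Python standard library (`urllib.request`); the `Throttle` class, the `fetch` function, and the default values are assumptions made for illustration.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum interval between consecutive page requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, timeout=10.0, proxy=None):
    """Fetch one page; return its bytes, or None on timeout or error."""
    handlers = []
    if proxy:
        # Case (2): route the request through a server with another IP.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    throttle.wait()  # Case (3): pause before each page access.
    try:
        # Case (1): give up if the site does not answer within the timeout.
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None
```

The clock and sleep functions are injectable so that the throttle can be exercised without real waiting.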
6. The network information batch capture method according to claim 1, characterized in that, in step 4:
For the title, author, source, and time information, the HTML code surrounding each item is removed and the corresponding information is retained;
For the information content, line-break code and picture link addresses are retained, all other kinds of HTML code are removed, and any JavaScript code is blocked.
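One possible way to apply these cleaning rules is with Python's standard `html.parser` module. This is a sketch under assumptions: line breaks are represented as `\n`, picture links are kept as `[img:...]` placeholders (a made-up notation), and `<style>` blocks are dropped along with scripts even though the claim mentions only JavaScript.

```python
from html.parser import HTMLParser


class ContentCleaner(HTMLParser):
    """Strip tags but keep text, line breaks, and picture link addresses."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.image_urls = []
        self._skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)
                self.parts.append(f"[img:{src}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_content(html_source):
    """Return (cleaned text, list of picture link addresses)."""
    cleaner = ContentCleaner()
    cleaner.feed(html_source)
    cleaner.close()
    return "".join(cleaner.parts).strip(), cleaner.image_urls
```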
7. The network information batch capture method according to claim 1, characterized in that, in step 5:
After the picture link addresses have been identified, they are saved to the picture collection library for the subsequent picture capture operation;
Picture link addresses are normalized when saved: if an obtained picture link address carries parameters after the picture file name, the characters following the "?" are stripped out before the address is saved;
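The normalization of picture link addresses amounts to cutting off the query string. A minimal sketch, assuming the "?" itself is dropped together with the parameters and that the picture collection library is a plain list; both function names are hypothetical, and skipping already stored addresses is an addition of this sketch.

```python
def normalize_picture_link(link):
    """Drop the '?' and everything after it from a picture link address."""
    return link.split("?", 1)[0]


def add_to_picture_library(library, link):
    """Store the normalized link once; return the normalized form."""
    normalized = normalize_picture_link(link)
    if normalized not in library:  # assumption: avoid duplicate entries
        library.append(normalized)
    return normalized
```

Under this scheme `http://example.com/a.jpg?w=100` and `http://example.com/a.jpg?w=200` are stored as a single entry.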
According to the picture link addresses stored in the picture collection library, the pictures are downloaded locally:
(1) Call the picture processing script on the file server with a picture link address from the picture collection library as its parameter; the script contains commands for creating a directory, collecting the picture, saving it to the corresponding directory, and renaming the file;
(2) Execute the picture processing script;
(3) After the script has executed, the corresponding picture has been downloaded and saved to the specified directory.
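The four commands of the picture processing script (create a directory, collect the picture, save it, rename the file) could look like the following Python sketch. The naming scheme based on a SHA-256 digest of the link, the two-character subdirectory, and the injected `fetch` callable are all assumptions; the patent does not describe how the script names files or lays out directories.

```python
import hashlib
import os
from urllib.parse import urlsplit


def save_picture(link, root_dir, fetch):
    """Download one picture and store it under a renamed file.

    fetch is a hypothetical callable returning the picture bytes, or None
    when the picture cannot be obtained. Returns the saved path or None.
    """
    data = fetch(link)                           # collect the picture
    if data is None:
        return None
    extension = os.path.splitext(urlsplit(link).path)[1] or ".bin"
    digest = hashlib.sha256(link.encode("utf-8")).hexdigest()
    target_dir = os.path.join(root_dir, digest[:2])
    os.makedirs(target_dir, exist_ok=True)       # create the directory
    target = os.path.join(target_dir, digest + extension)  # renamed file
    with open(target, "wb") as handle:           # save to the directory
        handle.write(data)
    return target
```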
8. The network information batch capture method according to claim 7, characterized by the following exception handling:
When a picture link address is unavailable and the picture cannot be accessed, a timeout is set; once the timeout is exceeded, no further attempt is made to obtain the picture.
9. The network information batch capture method according to claim 1, characterized in that, in step 6:
An automatic review is performed before export; during the review, the captured detail page content is automatically previewed in the form of a web page.
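A web page preview for the pre-export review might be generated as sketched below. The record layout (`title` and `content` keys), the function names, and the simple "required fields are non-empty" check are assumptions; the claim does not say what the automatic review actually verifies.

```python
import html


def render_preview(record):
    """Render a captured record as a minimal HTML page for review."""
    title = html.escape(record["title"])
    paragraphs = "".join(
        f"<p>{html.escape(line)}</p>"
        for line in record["content"].splitlines()
        if line.strip()
    )
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1>{paragraphs}</body></html>"
    )


def auto_review(record, required=("title", "content")):
    """Return the names of required fields that are missing or blank."""
    return [name for name in required if not record.get(name, "").strip()]
```

A record would be exported only when `auto_review` returns an empty list; otherwise the preview page can be inspected to see what was captured.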
CN201410166752.5A 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information Active CN103927370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410166752.5A CN103927370B (en) 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information

Publications (2)

Publication Number Publication Date
CN103927370A CN103927370A (en) 2014-07-16
CN103927370B true CN103927370B (en) 2015-02-18

Family

ID=51145591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410166752.5A Active CN103927370B (en) 2014-04-23 2014-04-23 Network information batch acquisition method of combined text and picture information

Country Status (1)

Country Link
CN (1) CN103927370B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715016B (en) * 2015-02-04 2018-02-16 北京中搜搜悦网络技术有限公司 One kind searches happy acquisition method
CN104731928A (en) * 2015-03-27 2015-06-24 李冬 Data collecting and processing equipment
CN106293686B (en) * 2015-06-25 2019-08-02 阿里巴巴集团控股有限公司 The method and device that exhibiting pictures annotate in code text
CN107644028B (en) * 2016-07-20 2020-09-04 平安科技(深圳)有限公司 Method and system for collecting webpage data
CN106302797B (en) * 2016-08-31 2019-08-13 北京锐安科技有限公司 A kind of cookie access De-weight method and device
CN108470296B (en) * 2017-02-23 2022-02-25 阿里巴巴集团控股有限公司 Business object information processing method and device
CN107273497A (en) * 2017-06-16 2017-10-20 郑州云海信息技术有限公司 A kind of vulnerability information acquisition method and device
CN107784113A (en) * 2017-11-08 2018-03-09 深圳市科盾科技有限公司 Html web page collecting method, device and computer-readable recording medium
CN108133010A (en) * 2017-12-22 2018-06-08 新奥(中国)燃气投资有限公司 A kind of information grasping means and device
CN108052648B (en) * 2017-12-26 2020-08-21 福建中金在线信息科技有限公司 Website picture deleting method and device and electronic equipment
CN111159518B (en) * 2019-12-26 2023-10-24 深圳前海环融联易信息科技服务有限公司 News data acquisition method and device, computer equipment and storage medium
CN111460255A (en) * 2020-03-26 2020-07-28 第一曲库(北京)科技有限公司 Music work information data acquisition and storage method
CN111953766A (en) * 2020-08-07 2020-11-17 福建省天奕网络科技有限公司 Method and system for collecting network data
CN111931113B (en) * 2020-09-16 2021-01-05 深圳壹账通智能科技有限公司 Data cleaning method and related equipment
CN112187949B (en) * 2020-10-09 2021-08-20 珠海格力电器股份有限公司 Picture batch downloading method and device, storage medium and electronic device
CN114417200B (en) * 2022-01-04 2023-04-14 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010231508A (en) * 2009-03-27 2010-10-14 Kddi Corp Device, method and program for determining significance
CN103309954A (en) * 2013-05-27 2013-09-18 复旦大学 Html webpage based data extracting system
CN103631806A (en) * 2012-08-24 2014-03-12 华为技术有限公司 Network information fetching method and device

Also Published As

Publication number Publication date
CN103927370A (en) 2014-07-16

Similar Documents

Publication Publication Date Title
CN103927370B (en) Network information batch acquisition method of combined text and picture information
CN102930059B (en) Method for designing focused crawler
CN103023710B (en) A kind of safety test system and method
CN102710795B (en) Hotspot collecting method and device
US9218482B2 (en) Method and device for detecting phishing web page
US6910071B2 (en) Surveillance monitoring and automated reporting method for detecting data changes
CN107943838B (en) Method and system for automatically acquiring xpath generated crawler script
CN107895009A (en) One kind is based on distributed internet data acquisition method and system
CN102752154B (en) Detecting method of dead link of Web site
CN104063454A (en) Search push method and device for mining user demands
CN102833233B (en) Method and device for recognizing web pages
CN103455600B (en) A kind of video URL grasping means, device and server apparatus
CN105224691B (en) A kind of information processing method and device
CN103970788A (en) Webpage-crawling-based crawler technology
CN102663052B (en) Method and device for providing search results of search engine
CN103279567A (en) Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language)
CN107145556B (en) Universal distributed acquisition system
CN111444408B (en) Network search processing method and device and electronic equipment
CN102065147A (en) Method and device for obtaining user login information based on enterprise application system
CN103248707B (en) File access method, system and equipment
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN104252532A (en) Website information statistic method and device
CN103067387A (en) Monitoring system and monitoring method for anti phishing
CN104580230A (en) Website attack verification method and device
CN107800686A (en) A kind of fishing website recognition methods and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant