CN104462140A - Webpage data collecting method and device - Google Patents
- Publication number
- CN104462140A
- Authority
- CN
- China
- Prior art keywords
- collected
- webpage
- data
- described webpage
- message
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention provides a webpage data collecting method and device. The method comprises: determining and loading a webpage to be collected; detecting whether loading of the webpage to be collected is finished and, if so, obtaining the current data corresponding to the webpage to be collected; and collecting, according to pre-configured collection rules, the data matching those rules from the data corresponding to the webpage to be collected. The webpage data collecting method and device can collect webpage data effectively.
Description
Technical field
The present invention relates to the field of computers, and in particular to a webpage data collecting method and device.
Background
With the rapid development of computer networks, the Internet has become a huge information resource repository whose resources are most commonly presented as webpages. Obtaining information from this repository can save users a great deal of manpower and money in information gathering and resource consolidation. In this vast ocean of resources, however, accurately searching for, classifying and storing, and processing and using the information resources we need remains a difficult problem.
Information can be found easily with a traditional search engine, and when the data volume is small it can be stored and processed manually. When the data volume is large, however, storing the data manually is inefficient.
Summary of the invention
The invention provides a webpage data collecting method and device to solve the technical problem that the prior art cannot collect webpage data effectively.
A first aspect of the present invention provides a webpage data collecting method, comprising:
determining and loading a webpage to be collected;
detecting whether loading of the webpage to be collected is finished and, if so, obtaining the current data corresponding to the webpage to be collected;
collecting, according to a pre-configured collection rule, the data matching the collection rule from the data corresponding to the webpage to be collected.
Another aspect of the present invention provides a webpage data collecting device, comprising:
a first processing module, configured to determine a webpage to be collected;
a second processing module, configured to load the webpage to be collected;
a detection module, configured to detect whether loading of the webpage to be collected is finished and, if so, obtain the current data corresponding to the webpage to be collected;
an acquisition module, configured to collect, according to a pre-configured collection rule, the data matching the collection rule from the data corresponding to the webpage to be collected.
With the webpage data collecting method and device provided by the invention, once the webpage to be collected has finished loading, the data matching the preset collection rule is collected from the data corresponding to that webpage, so that webpage data is collected effectively.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a webpage data collecting method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a webpage data collecting device provided by Embodiment 2 of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a webpage data collecting method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises:
101. Determine and load the webpage to be collected.
Optionally, step 101 may be performed in response to a received data collection request, or periodically according to a preset period; this embodiment does not limit the conditions under which it is performed. Correspondingly, determining the webpage to be collected in step 101 may specifically comprise:
according to a received data collection request that contains the address of a webpage to be collected, taking the webpage corresponding to that address as the webpage to be collected; or,
according to a received data collection request, taking the current webpage as the webpage to be collected; or,
periodically, according to a preset period, taking the current webpage as the webpage to be collected.
The address of the webpage to be collected may specifically take the form of a Uniform Resource Locator (URL). With this implementation, users can choose the collection trigger mode that suits their own needs, so that data collection is carried out more effectively.
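The three ways of determining the webpage to be collected can be illustrated with a minimal Python sketch. The names `CollectionRequest` and `determine_target` are hypothetical and do not come from the patent, and the sketch assumes that a timer-driven run is signalled by passing no request at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectionRequest:
    """Hypothetical data collection request; `url` is None when no address is given."""
    url: Optional[str] = None


def determine_target(request: Optional[CollectionRequest], current_url: str) -> str:
    """Return the address of the webpage to be collected."""
    # Mode 1: the request carries the address of the webpage to be collected.
    if request is not None and request.url:
        return request.url
    # Mode 2 (request without an address) and mode 3 (periodic run, modelled
    # here as request=None) both take the current webpage.
    return current_url
```

For example, `determine_target(CollectionRequest("http://example.com/a"), "http://example.com/current")` selects the requested address, while a request without an address, or a periodic run, selects the current page.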
Specifically, in practice, loading a webpage often requires first passing that webpage's login verification; the webpage can be loaded only after login succeeds. Correspondingly, to make webpage data collection more convenient, step 101 may further comprise, before loading the webpage to be collected:
querying the pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password;
sending a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected.
Loading the webpage to be collected may then specifically comprise:
if a verification-passed message returned by the website server is received, loading the webpage to be collected.
In this embodiment, the login information corresponding to each webpage is stored in advance, so that when a webpage requires login before it can be loaded, the login is performed automatically and the webpage is loaded after login succeeds, which makes data collection more convenient and effective.
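The login-then-load sequence might be sketched as follows. This is only an illustration under stated assumptions: `load_with_login`, `verify`, and `load` are hypothetical names standing in for the verification request to the website server and for the actual page loader, and returning `None` when no login information is stored or verification fails is a choice made here, since the patent does not specify that case.

```python
from typing import Callable, Dict, Optional, Tuple

# (login account, login password), as described in the text
Credentials = Tuple[str, str]


def load_with_login(
    url: str,
    credential_store: Dict[str, Credentials],
    verify: Callable[[str, str, str], bool],
    load: Callable[[str], str],
) -> Optional[str]:
    """Look up stored login information, verify it, and load the page on success."""
    # Query the pre-stored login information for this webpage.
    creds = credential_store.get(url)
    if creds is None:
        return None  # assumption: no stored login information means no load
    account, password = creds
    # Send the verification request; `verify` returns True when the server
    # answers with a verification-passed message.
    if not verify(url, account, password):
        return None  # assumption: a failed verification means no load
    # Load the webpage only after verification has passed.
    return load(url)
```

Passing `verify` and `load` as callables keeps the sketch independent of any particular HTTP or browser library.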
102. Detect whether loading of the webpage to be collected is finished and, if so, obtain the current data corresponding to the webpage to be collected.
Specifically, detecting whether loading of the webpage to be collected is finished in step 102 may comprise:
obtaining, through the MSHTML component, the current ready state of the webpage to be collected;
if the current ready state of the webpage to be collected is the complete state, determining that loading of the webpage to be collected is finished.
In practice, all elements of a specified webpage can be accessed through the MSHTML component and its standard interfaces. More specifically, in this embodiment, the document interfaces HTMLDocument, HTMLDocument2, and HTMLDocument3 provided by the MSHTML COM component may be used. A request is issued to the webpage to be collected through the createDocumentFromUrl interface to obtain an IHTMLDocument2 object htmldoc2; when the ready state of htmldoc2 is "complete", htmldoc2 is converted into an instance htmldoc3 of the IHTMLDocument3 interface; the data corresponding to the webpage to be collected is then obtained through htmldoc3.documentElement.innerHTML.
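The load-completion check can be sketched in Python against any object that exposes a `readyState` attribute and a `documentElement.innerHTML` attribute, as the MSHTML document objects named above do. The sketch does not call MSHTML itself; the function name `fetch_html_when_ready`, the polling interval, and the timeout are all assumptions added here for illustration, since the patent does not say how long to wait.

```python
import time
from typing import Any, Callable, Optional


def fetch_html_when_ready(
    doc: Any,
    timeout: float = 10.0,
    poll_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll doc.readyState until it is "complete", then return the page HTML.

    Returns None if the state has not become "complete" once the timeout
    (an assumption of this sketch) has elapsed.
    """
    deadline = clock() + timeout
    while True:
        # Loading is considered finished only in the "complete" state.
        if doc.readyState == "complete":
            return doc.documentElement.innerHTML
        if clock() >= deadline:
            return None
        sleep(poll_interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in ordinary use the defaults apply.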
103. According to the pre-configured collection rule, collect the data matching the collection rule from the data corresponding to the webpage to be collected.
Specifically, the corresponding data may be obtained from the data corresponding to the webpage to be collected by regular expression matching.
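As a minimal sketch of regular expression matching against the page data, the following Python function treats each collection rule as a named pattern with one capture group. The rule names, the patterns, and the sample HTML are illustrative assumptions; the patent does not specify how collection rules are written.

```python
import re
from typing import Dict, List


def collect(html: str, rules: Dict[str, str]) -> Dict[str, List[str]]:
    """Apply each named regular expression to the page data and gather the matches."""
    results: Dict[str, List[str]] = {}
    for name, pattern in rules.items():
        # With a single capture group, findall returns the captured strings.
        # DOTALL lets "." span line breaks inside the page source.
        results[name] = re.findall(pattern, html, flags=re.DOTALL)
    return results


# Hypothetical collection rules and page data.
rules = {
    "title": r"<h1>(.*?)</h1>",
    "price": r'<span class="price">([\d.]+)</span>',
}
html = '<h1>Title</h1><span class="price">12.50</span><span class="price">8</span>'
matches = collect(html, rules)
```

Regular expressions work well for simple, regular page fragments; pages with deeply nested markup may call for a different extraction technique, which is outside what the patent describes.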
With the webpage data collecting method provided by this embodiment, once the webpage to be collected has finished loading, the corresponding data is collected from the data corresponding to that webpage, so that webpage data is collected effectively. Moreover, this scheme avoids the inaccurate and incomplete data that results from collecting before the webpage has finished loading, which further improves the accuracy and reliability of webpage data collection.
Fig. 2 is a schematic structural diagram of a webpage data collecting device provided by Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises a first processing module 21, a second processing module 22, a detection module 23, and an acquisition module 24, wherein:
the first processing module 21 is configured to determine a webpage to be collected;
the second processing module 22 is configured to load the webpage to be collected;
the detection module 23 is configured to detect whether loading of the webpage to be collected is finished and, if so, obtain the current data corresponding to the webpage to be collected;
the acquisition module 24 is configured to collect, according to a pre-configured collection rule, the data matching the collection rule from the data corresponding to the webpage to be collected.
Specifically, after the first processing module 21 determines the webpage to be collected, the second processing module 22 loads it. The detection module 23 then detects whether loading of the webpage is finished and, once it detects that loading is finished, obtains the data corresponding to the webpage, so that the acquisition module 24 can collect the data matching the collection rule from the data obtained by the detection module 23, according to the pre-configured collection rule.
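How the four modules hand data to one another can be sketched as a small Python class. This is an illustrative arrangement only: the class name `CollectingDevice`, the callables passed in, and the dictionary used to represent a loaded page are assumptions of the sketch, not structures defined by the patent.

```python
import re
from typing import Any, Callable, List, Optional


class CollectingDevice:
    """Wires together stand-ins for modules 21, 22, 23 and 24."""

    def __init__(
        self,
        determine: Callable[[Any], str],        # first processing module 21
        load: Callable[[str], Any],             # second processing module 22
        detect: Callable[[Any], Optional[str]], # detection module 23
        pattern: str,                           # collection rule used by module 24
    ) -> None:
        self.determine = determine
        self.load = load
        self.detect = detect
        self.pattern = pattern

    def run(self, request: Any) -> Optional[List[str]]:
        url = self.determine(request)  # determine the webpage to be collected
        page = self.load(url)          # load it
        data = self.detect(page)       # page data, or None if loading is not finished
        if data is None:
            return None
        # Acquisition module 24: collect the data matching the rule.
        return re.findall(self.pattern, data)
```

Here `detect` returns the page data only when loading is finished, mirroring the way the detection module both checks completion and obtains the data.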
Optionally, in one implementation of this embodiment, the first processing module 21 may specifically be configured to: determine the webpage to be collected according to a received data collection request that contains the address of a webpage to be collected, the address of the webpage to be collected being that address; or, according to a received data collection request, take the current webpage as the webpage to be collected; or, periodically, according to a preset period, take the current webpage as the webpage to be collected.
With this implementation, users can choose the collection trigger mode that suits their own needs, so that data collection is carried out more effectively.
Specifically, in practice, the login information corresponding to each webpage may be stored in advance, so that when a webpage requires login before it can be loaded, the login is performed automatically and the webpage is loaded after login succeeds, which makes data collection more convenient and effective. Correspondingly, the device may further comprise:
a query module, configured to query the pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password;
a sending module, configured to send a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected.
The first processing module 21 may specifically be configured to load the webpage to be collected if a verification-passed message returned by the website server is received.
More specifically, the detection module 23 may comprise: an acquiring unit, configured to obtain, through the MSHTML component, the current ready state of the webpage to be collected; and a determining unit, configured to determine that loading of the webpage to be collected is finished if the current ready state of the webpage to be collected is the complete state.
With the webpage data collecting device provided by this embodiment, once the webpage to be collected has finished loading, the corresponding data is collected from the data corresponding to that webpage, so that webpage data is collected effectively. Moreover, this scheme avoids the inaccurate and incomplete data that results from collecting before the webpage has finished loading, which further improves the accuracy and reliability of webpage data collection.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above may be understood with reference to the corresponding process in the foregoing method embodiment, and is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A webpage data collecting method, characterized by comprising:
determining and loading a webpage to be collected;
detecting whether loading of the webpage to be collected is finished and, if so, obtaining the current data corresponding to the webpage to be collected;
collecting, according to a pre-configured collection rule, the data matching the collection rule from the data corresponding to the webpage to be collected.
2. The method according to claim 1, characterized in that detecting whether loading of the webpage to be collected is finished specifically comprises:
obtaining, through the MSHTML component, the current ready state of the webpage to be collected;
if the current ready state of the webpage to be collected is the complete state, determining that loading of the webpage to be collected is finished.
3. The method according to claim 1, characterized in that, before loading the webpage to be collected, the method further comprises:
querying the pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password;
sending a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected;
and loading the webpage to be collected specifically comprises:
if a verification-passed message returned by the website server is received, loading the webpage to be collected.
4. The method according to any one of claims 1-3, characterized in that determining the webpage to be collected specifically comprises:
according to a received data collection request that contains the address of a webpage to be collected, taking the webpage corresponding to that address as the webpage to be collected; or,
according to a received data collection request, taking the current webpage as the webpage to be collected; or,
periodically, according to a preset period, taking the current webpage as the webpage to be collected.
5. A webpage data collecting device, characterized by comprising:
a first processing module, configured to determine a webpage to be collected;
a second processing module, configured to load the webpage to be collected;
a detection module, configured to detect whether loading of the webpage to be collected is finished and, if so, obtain the current data corresponding to the webpage to be collected;
an acquisition module, configured to collect, according to a pre-configured collection rule, the data matching the collection rule from the data corresponding to the webpage to be collected.
6. The device according to claim 5, characterized in that the detection module specifically comprises:
an acquiring unit, configured to obtain, through the MSHTML component, the current ready state of the webpage to be collected;
a determining unit, configured to determine that loading of the webpage to be collected is finished if the current ready state of the webpage to be collected is the complete state.
7. The device according to claim 5, characterized in that the device further comprises:
a query module, configured to query the pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password;
a sending module, configured to send a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected;
the first processing module being specifically configured to load the webpage to be collected if a verification-passed message returned by the website server is received.
8. The device according to any one of claims 5-7, characterized in that the first processing module is specifically configured to: determine the webpage to be collected according to a received data collection request that contains the address of a webpage to be collected, the address of the webpage to be collected being that address; or, according to a received data collection request, take the current webpage as the webpage to be collected; or, periodically, according to a preset period, take the current webpage as the webpage to be collected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310439192.1A CN104462140A (en) | 2013-09-24 | 2013-09-24 | Webpage data collecting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310439192.1A CN104462140A (en) | 2013-09-24 | 2013-09-24 | Webpage data collecting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104462140A (en) | 2015-03-25 |
Family
ID=52908196
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310439192.1A Pending CN104462140A (en) | 2013-09-24 | 2013-09-24 | Webpage data collecting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104462140A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502802A (en) * | 2016-10-12 | 2017-03-15 | 山东浪潮云服务信息科技有限公司 | A kind of concurrent acquisition method in distributed high in the clouds transmitted based on Avro RPC |
CN108090071A (en) * | 2016-11-22 | 2018-05-29 | 北大方正集团有限公司 | Collection of resources method and apparatus in resources bank |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070198924A1 (en) * | 1999-03-02 | 2007-08-23 | Hiroshi Koike | Dynamic web page generation method and system |
CN101441629A (en) * | 2007-11-19 | 2009-05-27 | 上海新纳广告传媒有限公司 | Automatic acquiring method of non-structured web page information |
CN101561802A (en) * | 2008-04-18 | 2009-10-21 | 上海复旦光华信息科技股份有限公司 | Web page structural data extraction method and system |
CN103092817A (en) * | 2013-01-18 | 2013-05-08 | 五八同城信息技术有限公司 | Data collection method and data collection device based on script engine |
-
2013
- 2013-09-24 CN CN201310439192.1A patent/CN104462140A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102693271B (en) | A kind of network information recommending method and system | |
CN102082792A (en) | Phishing webpage detection method and device | |
CN102819713B (en) | A kind of method and system detecting bullet window safe | |
CN108829838B (en) | Batch processing method of account information and server | |
CN102663062A (en) | Method and device for processing invalid links in search result | |
CN102752288A (en) | Method and device for identifying network access action | |
CN111159514B (en) | Method, device and equipment for detecting task effectiveness of web crawler and storage medium | |
US11263062B2 (en) | API mashup exploration and recommendation | |
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
CN105022694A (en) | Test case generation method and system for mobile terminal test | |
CN102710646A (en) | Method and system for collecting phishing websites | |
CN102663052A (en) | Method and device for providing search results of search engine | |
KR20180074774A (en) | How to identify malicious websites, devices and computer storage media | |
WO2016086784A1 (en) | Method, apparatus and system for collecting webpage data | |
CN104462140A (en) | Webpage data collecting method and device | |
US20160154886A1 (en) | Accounting for authorship in a web log search engine | |
CN113641742A (en) | Data extraction method, device, equipment and storage medium | |
CN102306181B (en) | Method and system for providing network resources | |
CN103248513A (en) | Network information data collection method and system based on Office suite | |
CN111221711A (en) | User behavior data processing method, server and storage medium | |
CN110704721A (en) | Client data processing method and device, terminal equipment and readable storage medium | |
CN113515455B (en) | Automatic test method and system | |
CN104794397A (en) | Virus detection method and device | |
KR102247067B1 (en) | Method, apparatus and computer program for processing URL collected in web site | |
CN106095946B (en) | Page processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150325 |