CN104462140A - Webpage data collecting method and device - Google Patents

Publication number
CN104462140A
Authority
CN
China
Prior art keywords
collected
webpage
data
described webpage
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310439192.1A
Other languages
Chinese (zh)
Inventor
任艳方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201310439192.1A
Publication of CN104462140A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a webpage data collecting method and device. The method comprises: determining and loading a webpage to be collected; detecting whether the webpage to be collected has finished loading and, if so, obtaining the data currently corresponding to the webpage to be collected; and collecting, according to pre-configured collection rules, the data that match the collection rules from the data corresponding to the webpage to be collected. The webpage data collecting method and device can collect webpage data effectively.

Description

Webpage data collection method and device
Technical Field
The present invention relates to the field of computers, and in particular to a webpage data collection method and device.
Background
With the rapid development of computer networks, the Internet has become a huge information resource repository whose most commonly used resources are usually presented as webpages. By drawing on the information in this repository, users can save a great deal of manpower and money in information gathering and resource consolidation. In such a vast ocean of resources, however, accurately finding, classifying, storing, processing, and using the information we need remains a difficult problem.
Traditional search engines make it easy to find information, and when the volume of data is small it can be stored and processed manually. When the volume of data is large, however, storing data manually is inefficient.
Summary of the Invention
The present invention provides a webpage data collection method and device to solve the technical problem that the prior art cannot collect webpage data effectively.
A first aspect of the present invention provides a webpage data collection method, comprising:
determining and loading a webpage to be collected;
detecting whether the webpage to be collected has finished loading and, if so, obtaining the data currently corresponding to the webpage to be collected; and
collecting, according to a pre-configured collection rule, the data that match the collection rule from the data corresponding to the webpage to be collected.
Another aspect of the present invention provides a webpage data collection device, comprising:
a first processing module, configured to determine a webpage to be collected;
a second processing module, configured to load the webpage to be collected;
a detection module, configured to detect whether the webpage to be collected has finished loading and, if so, obtain the data currently corresponding to the webpage to be collected; and
an acquisition module, configured to collect, according to a pre-configured collection rule, the data that match the collection rule from the data corresponding to the webpage to be collected.
With the webpage data collection method and device provided by the present invention, once the webpage to be collected has finished loading, the data that match the preset collection rule are collected from the data corresponding to that webpage, so that webpage data are collected effectively.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a webpage data collection method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a webpage data collection device provided by Embodiment 2 of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a webpage data collection method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises:
101. Determine and load a webpage to be collected.
Optionally, step 101 may be performed in response to a received data collection request, or periodically according to a preset cycle; this embodiment does not limit the conditions under which it is performed. Accordingly, determining the webpage to be collected in step 101 may specifically comprise:
according to a received data collection request that contains the address of a webpage to be collected, taking the webpage corresponding to that address as the webpage to be collected; or
according to a received data collection request, taking the current webpage as the webpage to be collected; or
according to a preset cycle, periodically taking the current webpage as the webpage to be collected.
The address of the webpage to be collected may specifically take the form of a Uniform Resource Locator (URL). With this implementation, users can choose the collection trigger that suits their needs, which makes data collection more effective.
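The three ways of determining the webpage to be collected can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation; `CollectionRequest`, `determine_target_page`, and the example URLs are hypothetical names introduced here, and a timer-triggered run is modeled simply as a call with no request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectionRequest:
    """Hypothetical request object; `url` is None when no address is supplied."""
    url: Optional[str] = None


def determine_target_page(request: Optional[CollectionRequest], current_page: str) -> str:
    """Return the address of the webpage to be collected.

    - request carrying a URL -> collect the page at that URL
    - request without a URL  -> collect the current page
    - no request (timer run) -> collect the current page
    """
    if request is not None and request.url:
        return request.url
    return current_page
```

A scheduler that fires on the preset cycle would call `determine_target_page(None, current_page)`, while a request handler would pass the request it received.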
Specifically, in practice, loading a webpage usually requires first passing the webpage's login verification; the webpage can be loaded only after login succeeds. Accordingly, to make webpage data collection more convenient, step 101 may further comprise, before loading the webpage to be collected:
querying pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password; and
sending a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected.
Loading the webpage to be collected may then specifically comprise:
if a verification-passed message returned by the website server is received, loading the webpage to be collected.
In this embodiment, the login information for each webpage is stored in advance, so that when a webpage that requires login needs to be loaded, the login can be performed automatically and the webpage loaded once login succeeds, which makes data collection more convenient and effective.
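The login-then-load flow can be sketched as below. The sketch assumes a simple in-memory credential table and takes the server verification and page loading steps as injected functions; `STORED_LOGINS`, `load_with_login`, `fake_verify`, and `fake_load` are all hypothetical stand-ins, since the text does not specify how credentials are stored or how the verification request is transported.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical pre-stored login information: page address -> (account, password).
STORED_LOGINS: Dict[str, Tuple[str, str]] = {
    "http://example.com/report": ("alice", "s3cret"),
}


def load_with_login(
    url: str,
    verify: Callable[[str, str, str], bool],
    load: Callable[[str], str],
) -> Optional[str]:
    """Look up stored credentials, ask the server to verify them, then load.

    Returns the loaded page data, or None when no credentials are stored
    for the page or the server does not report that verification passed.
    """
    credentials = STORED_LOGINS.get(url)
    if credentials is None:
        return None
    account, password = credentials
    if not verify(url, account, password):
        return None
    return load(url)


# Stand-ins for the website server and the page loader, for illustration only.
def fake_verify(url: str, account: str, password: str) -> bool:
    return (account, password) == ("alice", "s3cret")


def fake_load(url: str) -> str:
    return "<html><body>report</body></html>"
```

Passing the verifier and loader as parameters keeps the sketch independent of any particular HTTP client or browser component.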
102. Detect whether the webpage to be collected has finished loading and, if so, obtain the data currently corresponding to the webpage to be collected.
Specifically, detecting whether the webpage to be collected has finished loading in step 102 may comprise:
obtaining, through the MSHTML component, the current reading state of the webpage to be collected; and
if the current reading state of the webpage to be collected is the complete state, determining that the webpage to be collected has finished loading.
In practice, all elements of a given webpage can be accessed through the MSHTML component and its standard interfaces. More specifically, in this embodiment, the document interfaces HTMLDocument, HTMLDocument2, and HTMLDocument3 provided by the MSHTML COM component may be used: a request to the webpage to be collected is issued through the createDocumentFromUrl interface to obtain an IHTMLDocument2 object, htmldoc2; when the reading state of htmldoc2 is "complete", htmldoc2 is converted to an instance htmldoc3 of the IHTMLDocument3 interface; and the data corresponding to the webpage to be collected are then obtained through htmldoc3.documentElement.innerHTML.
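The wait-until-complete logic can be illustrated without the MSHTML COM component itself. The sketch below polls a stand-in document object until its state reads "complete" and only then returns the page markup; `FakeDocument`, `ready_state`, `inner_html`, and `wait_and_get_data` are hypothetical names that merely mirror the reading state and innerHTML mentioned above, and the timeout is an added assumption that the text does not describe.

```python
import time
from typing import Optional


class FakeDocument:
    """Stand-in for a document object; this is NOT the MSHTML API.

    It reports "loading" for the first few reads and "complete" afterwards.
    """

    def __init__(self, html: str, reads_until_complete: int = 3) -> None:
        self._html = html
        self._remaining = reads_until_complete

    @property
    def ready_state(self) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "loading"
        return "complete"

    @property
    def inner_html(self) -> str:
        return self._html


def wait_and_get_data(doc, timeout: float = 5.0, interval: float = 0.01) -> Optional[str]:
    """Poll the document until it is complete, then return its markup.

    Returns None if the document does not reach "complete" before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if doc.ready_state == "complete":
            return doc.inner_html
        time.sleep(interval)
    return None
```

Reading the markup only after the state is "complete" is what prevents collection from a partially loaded page.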
103. According to a pre-configured collection rule, collect the data that match the collection rule from the data corresponding to the webpage to be collected.
Specifically, the matching data can be obtained from the data corresponding to the webpage to be collected by regular-expression matching.
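As an illustration of regular-expression matching against page data, the following sketch treats each collection rule as a named pattern with one capture group. The rule names, patterns, and sample markup are assumptions made for the example; the text does not specify how collection rules are configured.

```python
import re
from typing import Dict, List

# Hypothetical collection rules: field name -> regular expression with one group.
COLLECTION_RULES: Dict[str, str] = {
    "title": r"<title>(.*?)</title>",
    "price": r'<span class="price">(\d+(?:\.\d+)?)</span>',
}


def collect(page_data: str, rules: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, for each rule, every captured match found in the page data."""
    return {
        name: re.findall(pattern, page_data, flags=re.DOTALL | re.IGNORECASE)
        for name, pattern in rules.items()
    }


sample = (
    '<html><title>Shop</title>'
    '<span class="price">9.50</span><span class="price">12</span></html>'
)
result = collect(sample, COLLECTION_RULES)
```

Here `result["title"]` holds the single page title and `result["price"]` holds both prices in document order.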
The webpage data collection method provided by this embodiment collects the corresponding data from the data of the webpage to be collected once that webpage has finished loading, and thereby collects webpage data effectively. Furthermore, the solution avoids the inaccurate and incomplete data that result when collection starts before the webpage has finished loading, which further improves the accuracy and reliability of webpage data collection.
Fig. 2 is a schematic structural diagram of a webpage data collection device provided by Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises a first processing module 21, a second processing module 22, a detection module 23, and an acquisition module 24, wherein:
the first processing module 21 is configured to determine a webpage to be collected;
the second processing module 22 is configured to load the webpage to be collected;
the detection module 23 is configured to detect whether the webpage to be collected has finished loading and, if so, obtain the data currently corresponding to the webpage to be collected; and
the acquisition module 24 is configured to collect, according to a pre-configured collection rule, the data that match the collection rule from the data corresponding to the webpage to be collected.
Specifically, after the first processing module 21 determines the webpage to be collected, the second processing module 22 loads it. The detection module 23 then detects whether the webpage has finished loading and, when it has, obtains the data corresponding to the webpage. The acquisition module 24 then collects the data that match the pre-configured collection rule from the data obtained by the detection module 23.
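The cooperation of the four modules can be sketched as a small pipeline. This is only one possible wiring under stated assumptions: the `Collector` class, its callable parameters, and the dictionary used as a page handle are hypothetical, and a regular expression stands in for the pre-configured collection rule.

```python
import re
from typing import Any, Callable, List, Optional


class Collector:
    """Hypothetical wiring of the four modules described above."""

    def __init__(
        self,
        determine: Callable[[], str],
        load: Callable[[str], Any],
        detect: Callable[[Any], Optional[str]],
        rule: str,
    ) -> None:
        self.determine = determine  # role of the first processing module
        self.load = load            # role of the second processing module
        self.detect = detect        # role of the detection module
        self.rule = rule            # pre-configured collection rule (a regex here)

    def run(self) -> List[str]:
        url = self.determine()
        page = self.load(url)
        data = self.detect(page)    # None when the page has not finished loading
        if data is None:
            return []
        return re.findall(self.rule, data)  # role of the acquisition module


collector = Collector(
    determine=lambda: "http://example.com/news",
    load=lambda url: {"url": url, "state": "complete", "html": "<h1>A</h1><h1>B</h1>"},
    detect=lambda page: page["html"] if page["state"] == "complete" else None,
    rule=r"<h1>(.*?)</h1>",
)
```

Calling `collector.run()` walks the modules in the order given in the text; if detection reports that loading has not completed, nothing is collected.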
Optionally, in one implementation of this embodiment, the first processing module 21 may specifically be configured to: determine the webpage to be collected according to a received data collection request that contains the address of a webpage to be collected, the address of the webpage to be collected being the address contained in the request; or, according to a received data collection request, take the current webpage as the webpage to be collected; or, according to a preset cycle, periodically take the current webpage as the webpage to be collected.
With this implementation, users can choose the collection trigger that suits their needs, which makes data collection more effective.
Specifically, in practice, the login information for each webpage may be stored in advance, so that when a webpage that requires login needs to be loaded, the login is performed automatically and the webpage is loaded once login succeeds, which makes data collection more convenient and effective. Accordingly, the device may further comprise:
a query module, configured to query pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password; and
a sending module, configured to send a verification request to the website server, the verification request comprising the login information corresponding to the webpage to be collected.
The first processing module 21 may specifically be configured to load the webpage to be collected if a verification-passed message returned by the website server is received.
More specifically, the detection module 23 may comprise: an obtaining unit, configured to obtain, through the MSHTML component, the current reading state of the webpage to be collected; and a determination unit, configured to determine that the webpage to be collected has finished loading if the current reading state of the webpage to be collected is the complete state.
The webpage data collection device provided by this embodiment collects the corresponding data from the data of the webpage to be collected once that webpage has finished loading, and thereby collects webpage data effectively. Furthermore, the solution avoids the inaccurate and incomplete data that result when collection starts before the webpage has finished loading, which further improves the accuracy and reliability of webpage data collection.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above can be found in the corresponding process of the foregoing method embodiment and is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A webpage data collection method, characterized by comprising:
determining and loading a webpage to be collected;
detecting whether the webpage to be collected has finished loading and, if so, obtaining the data currently corresponding to the webpage to be collected; and
collecting, according to a pre-configured collection rule, the data that match the collection rule from the data corresponding to the webpage to be collected.
2. The method according to claim 1, characterized in that detecting whether the webpage to be collected has finished loading specifically comprises:
obtaining, through an MSHTML component, the current reading state of the webpage to be collected; and
if the current reading state of the webpage to be collected is the complete state, determining that the webpage to be collected has finished loading.
3. The method according to claim 1, characterized in that, before loading the webpage to be collected, the method further comprises:
querying pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password; and
sending a verification request to a website server, the verification request comprising the login information corresponding to the webpage to be collected;
and loading the webpage to be collected specifically comprises:
if a verification-passed message returned by the website server is received, loading the webpage to be collected.
4. The method according to any one of claims 1-3, characterized in that determining the webpage to be collected specifically comprises:
according to a received data collection request that contains the address of a webpage to be collected, taking the webpage corresponding to that address as the webpage to be collected; or
according to a received data collection request, taking the current webpage as the webpage to be collected; or
according to a preset cycle, periodically taking the current webpage as the webpage to be collected.
5. A webpage data collection device, characterized by comprising:
a first processing module, configured to determine a webpage to be collected;
a second processing module, configured to load the webpage to be collected;
a detection module, configured to detect whether the webpage to be collected has finished loading and, if so, obtain the data currently corresponding to the webpage to be collected; and
an acquisition module, configured to collect, according to a pre-configured collection rule, the data that match the collection rule from the data corresponding to the webpage to be collected.
6. The device according to claim 5, characterized in that the detection module specifically comprises:
an obtaining unit, configured to obtain, through an MSHTML component, the current reading state of the webpage to be collected; and
a determination unit, configured to determine that the webpage to be collected has finished loading if the current reading state of the webpage to be collected is the complete state.
7. The device according to claim 5, characterized in that the device further comprises:
a query module, configured to query pre-stored login information corresponding to each webpage to obtain the login information corresponding to the webpage to be collected, the login information comprising a login account and a login password; and
a sending module, configured to send a verification request to a website server, the verification request comprising the login information corresponding to the webpage to be collected;
wherein the first processing module is specifically configured to load the webpage to be collected if a verification-passed message returned by the website server is received.
8. The device according to any one of claims 5-7, characterized in that the first processing module is specifically configured to: determine the webpage to be collected according to a received data collection request that contains the address of a webpage to be collected, the address of the webpage to be collected being the address contained in the request; or, according to a received data collection request, take the current webpage as the webpage to be collected; or, according to a preset cycle, periodically take the current webpage as the webpage to be collected.
CN201310439192.1A 2013-09-24 2013-09-24 Webpage data collecting method and device Pending CN104462140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310439192.1A CN104462140A (en) 2013-09-24 2013-09-24 Webpage data collecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310439192.1A CN104462140A (en) 2013-09-24 2013-09-24 Webpage data collecting method and device

Publications (1)

Publication Number Publication Date
CN104462140A true CN104462140A (en) 2015-03-25

Family

ID=52908196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310439192.1A Pending CN104462140A (en) 2013-09-24 2013-09-24 Webpage data collecting method and device

Country Status (1)

Country Link
CN (1) CN104462140A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502802A (en) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 A kind of concurrent acquisition method in distributed high in the clouds transmitted based on Avro RPC
CN108090071A (en) * 2016-11-22 2018-05-29 北大方正集团有限公司 Collection of resources method and apparatus in resources bank

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070198924A1 (en) * 1999-03-02 2007-08-23 Hiroshi Koike Dynamic web page generation method and system
CN101441629A (en) * 2007-11-19 2009-05-27 上海新纳广告传媒有限公司 Automatic acquiring method of non-structured web page information
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system
CN103092817A (en) * 2013-01-18 2013-05-08 五八同城信息技术有限公司 Data collection method and data collection device based on script engine

Similar Documents

Publication Publication Date Title
CN102693271B (en) A kind of network information recommending method and system
CN102082792A (en) Phishing webpage detection method and device
CN102819713B (en) A kind of method and system detecting bullet window safe
CN108829838B (en) Batch processing method of account information and server
CN102663062A (en) Method and device for processing invalid links in search result
CN102752288A (en) Method and device for identifying network access action
CN111159514B (en) Method, device and equipment for detecting task effectiveness of web crawler and storage medium
US11263062B2 (en) API mashup exploration and recommendation
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN105022694A (en) Test case generation method and system for mobile terminal test
CN102710646A (en) Method and system for collecting phishing websites
CN102663052A (en) Method and device for providing search results of search engine
KR20180074774A (en) How to identify malicious websites, devices and computer storage media
WO2016086784A1 (en) Method, apparatus and system for collecting webpage data
CN104462140A (en) Webpage data collecting method and device
US20160154886A1 (en) Accounting for authorship in a web log search engine
CN113641742A (en) Data extraction method, device, equipment and storage medium
CN102306181B (en) Method and system for providing network resources
CN103248513A (en) Network information data collection method and system based on Office suite
CN111221711A (en) User behavior data processing method, server and storage medium
CN110704721A (en) Client data processing method and device, terminal equipment and readable storage medium
CN113515455B (en) Automatic test method and system
CN104794397A (en) Virus detection method and device
KR102247067B1 (en) Method, apparatus and computer program for processing URL collected in web site
CN106095946B (en) Page processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150325
