CN107644028A - The collection method and system of web data - Google Patents

Publication number
CN107644028A
Authority
CN
China
Prior art keywords
webpage
source code
web data
url addresses
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610578428.3A
Other languages
Chinese (zh)
Other versions
CN107644028B (en)
Inventor
徐介夫
朱杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201610578428.3A
Publication of CN107644028A
Application granted
Publication of CN107644028B
Legal status: Active

Abstract

The present invention is applicable to the software field and provides a method and device for collecting web data. The method includes: receiving a write instruction for a Uniform Resource Locator (URL) address and writing the corresponding URL address; displaying the webpage corresponding to the URL address and the source code corresponding to the webpage; and capturing the corresponding source code according to the displayed webpage, so as to collect the web data. The method can improve the accuracy of the captured source code.

Description

The collection method and system of web data
Technical field
The embodiments of the present invention belong to the software field, and in particular relate to a method and system for collecting web data.
Background art
At present, users often need to collect and analyze the data of various webpages, and then judge the validity of the web data according to the analysis result or perform other operations according to it.
Existing web data collection methods typically capture the data at a specified position in a webpage directly and then analyze the captured data. However, errors may occur during capture, that is, the captured data may not match the specified position in the page. Because the user has only the captured data to go on, such a mismatch is difficult to detect, and the subsequent data analysis results are therefore wrong.
Summary of the invention
The embodiments of the present invention provide a method and system for collecting web data, aiming to solve the problem that existing methods may capture data that does not match the specified position in the page, which makes the accuracy of the captured data too low.
An embodiment of the present invention is realized as a method for collecting web data, the method including:
receiving a write instruction for a Uniform Resource Locator (URL) address, and writing the corresponding URL address;
displaying the webpage corresponding to the URL address and the source code corresponding to the webpage;
capturing the corresponding source code according to the displayed webpage, so as to collect the web data.
Another object of the embodiments of the present invention is to provide a system for collecting web data, the system including:
a URL address write instruction receiving unit, configured to receive a write instruction for a Uniform Resource Locator (URL) address and write the corresponding URL address;
a webpage display unit, configured to display the webpage corresponding to the URL address and the source code corresponding to the webpage;
a web data collection unit, configured to capture the corresponding source code according to the displayed webpage, so as to collect the web data.
In the embodiments of the present invention, because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of the subsequent data analysis results.
Brief description of the drawings
Fig. 1 is a flowchart of a method for collecting web data provided by the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the position at which a URL address is written, provided by the first embodiment of the present invention;
Fig. 3 is a schematic diagram of the configurable browser parameters provided by the first embodiment of the present invention;
Fig. 4 is a schematic diagram of the "source code" button provided by the first embodiment of the present invention;
Fig. 5 is a structural diagram of a device for collecting web data provided by the second embodiment of the present invention.
Detailed description
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In the embodiments of the present invention, a write instruction for a URL address is received and the corresponding URL address is written; the webpage corresponding to the URL address and the source code corresponding to the webpage are displayed; and the corresponding source code is captured according to the displayed webpage, so as to collect the web data.
The technical solution of the present invention is illustrated below through specific embodiments.
Embodiment one:
Fig. 1 shows a flowchart of a method for collecting web data provided by the first embodiment of the present invention, described in detail as follows:
Step S11, receive a write instruction for a Uniform Resource Locator (URL) address, and write the corresponding URL address.
The write instruction for the URL address may be issued by the user through a copy-and-paste operation, or by the user typing the address directly. As shown in Fig. 2, the URL address corresponding to "Shaoguan Court" is written into the "Entry URL" field of the interface presented by the system.
Because some webpages are developed for a specific browser, a browser matching the URL address may be selected after step S11 so that the webpage can subsequently be displayed correctly and completely. Alternatively, after step S11, a browser selection instruction sent by the user is received, and the browser matching the URL address is selected according to that instruction, for example Chrome, Firefox or IE. As shown in Fig. 3, to further speed up the collection of web data, selecting the browser that matches the URL address further includes: receiving a configuration instruction for browser parameters, and configuring the browser parameters according to that instruction. The browser parameters include: the HTTP send timeout, whether to enable script execution, whether to enable Cascading Style Sheets (CSS), whether to enable redirection, ActiveXNative, and so on. For example, e-commerce websites usually need script execution enabled, whereas ordinary websites do not; leaving script execution disabled reduces traffic and speeds up the collection of web data. Further, to make the collected web data easier to analyze later, the name of the configuration item, the item description and the related field information are also configured when the configuration instruction is received and the browser parameters are configured.
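The configurable browser parameters described above can be pictured as a small settings record. The following is only a minimal sketch under stated assumptions: the class name `BrowserConfig`, the helper `config_for`, the field names and every default value are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class BrowserConfig:
    """Hypothetical record of the configurable browser parameters."""
    browser: str = "chrome"        # e.g. "chrome", "firefox", "ie"
    http_timeout_s: float = 30.0   # HTTP send timeout (default is an assumption)
    enable_scripts: bool = False   # script execution stays off unless needed
    enable_css: bool = False       # whether Cascading Style Sheets are loaded
    enable_redirects: bool = True  # whether redirection is followed
    name: str = ""                 # configuration item name
    description: str = ""          # item description


def config_for(site_kind: str) -> BrowserConfig:
    """Enable script execution only for site kinds that usually need it."""
    needs_scripts = site_kind == "e-commerce"
    return BrowserConfig(enable_scripts=needs_scripts, name=site_kind)
```

Keeping script execution off by default mirrors the reasoning in the text: ordinary sites load with less traffic, and only sites such as e-commerce pages pay the extra cost.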
Loading files with the extension js (files written in the JavaScript scripting language) also consumes a certain amount of traffic and time. Therefore, to further speed up the collection of web data, js files that do not need to be executed may be filtered out, that is, the filtered-out js files are not loaded.
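One way to picture this filtering is a check on each resource URL before it is loaded. This is a sketch only: the function names `is_js_resource` and `resources_to_load` and the `required_js` allow-list are assumptions made for illustration, and the patent does not say how the system decides which js files are needed.

```python
from urllib.parse import urlparse


def is_js_resource(url: str) -> bool:
    """Return True when the URL path ends with the .js extension."""
    return urlparse(url).path.lower().endswith(".js")


def resources_to_load(urls, required_js=frozenset()):
    """Drop js files unless they are explicitly listed as required."""
    return [u for u in urls if not is_js_resource(u) or u in required_js]
```

Parsing the URL first means a query string such as `?v=1` does not hide the extension from the check.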
Step S12, display the webpage corresponding to the URL address and the source code corresponding to the webpage.
Note that the webpage and the source code are displayed on the same interface of the system, so that the user can compare them.
In this step, to meet different user needs flexibly, displaying the webpage corresponding to the URL address is preceded by receiving a reload-page instruction sent by the user. As shown in Fig. 2, when the user clicks the "reload page" button, a reload-page instruction is issued, and the webpage corresponding to the URL address is displayed according to that instruction. Displaying the source code corresponding to the webpage is preceded by receiving a source code display instruction sent by the user. As shown in Fig. 4, when the user clicks the "source code" button, a source code display instruction is issued, and the source code corresponding to the webpage is displayed according to that instruction.
Optionally, because some websites display the corresponding webpage only after the user enters login information, the following steps are performed before step S12 to reduce the user's operating steps and to display the webpage automatically and normally:
A1, judge whether the webpage corresponding to the URL address requires login information. Specifically, the URL addresses that require login information are stored in advance. When the written URL address is identical to one of the stored URL addresses that require login information, the webpage corresponding to the written URL address is judged to require login information; otherwise, it is judged not to require login information.
A2, when the webpage corresponding to the URL address requires login information, write the login information obtained in advance into the corresponding position of the webpage, so as to log in to the webpage corresponding to the URL address.
Specifically, the login information for the webpage corresponding to the URL address is obtained in advance. After a URL address that requires login information is written, the login information obtained in advance is written into the corresponding position of the webpage, so that the system can display the webpage corresponding to the URL address once the webpage has verified the login information successfully.
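Steps A1 and A2 amount to a lookup in a pre-stored table followed by filling in the stored credentials. The sketch below assumes a hypothetical in-memory table `LOGIN_REQUIRED` with placeholder values; how the real system stores URL addresses and login information is not specified in the text.

```python
# Hypothetical pre-stored table: URL address -> login information.
LOGIN_REQUIRED = {
    "https://example.com/account": {"username": "demo", "password": "secret"},
}


def needs_login(url, login_table=LOGIN_REQUIRED):
    """A1: the page needs login info only if its URL matches a stored one."""
    return url in login_table


def login_fields_for(url, login_table=LOGIN_REQUIRED):
    """A2: return the fields to write into the page, or None if not needed."""
    if not needs_login(url, login_table):
        return None
    return dict(login_table[url])  # copy so callers cannot mutate the table
```

The match is an exact string comparison, as the text describes; normalising URLs (trailing slashes, case) would be a separate design decision.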
Step S13, capture the corresponding source code according to the displayed webpage, so as to collect the web data.
Specifically, when a webpage is displayed, the source code corresponding to the webpage currently shown on the display screen is captured, so that more web data is obtained in a single capture.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, step S13 specifically includes:
B1, detect the dwell time of the mouse at its current position in the webpage. Specifically, when the mouse rests at a position in the displayed webpage, the time at which the rest began is recorded, and the difference between that start time and the current time (that is, the dwell time) is computed at a fixed interval.
B2, when the dwell time of the mouse at its position in the webpage exceeds a preset duration, capture the source code corresponding to that position, so as to collect the web data. Optionally, because the mouse pointer covers only a small part of the webpage, more source code can be obtained by capturing the source code of the layout block that contains the mouse position rather than of the position alone. For example, suppose the displayed webpage is divided into several layout blocks: block 1, block 2, block 3 and block 4. When the mouse is at a position in the webpage that belongs to block 1, the source code corresponding to block 1 is captured.
In B1 and B2 above, the source code corresponding to the mouse position is captured automatically once the dwell time exceeds the preset duration, so no user operation is required, which makes capturing web data more convenient.
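The dwell-time check in B1 and B2 can be sketched as a small tracker that is fed periodic mouse samples. This is an illustrative sketch: the class `DwellTracker`, its `update` method and the idea of passing the timestamp in explicitly are assumptions, chosen so the logic can be exercised without a real mouse or clock.

```python
class DwellTracker:
    """Tracks how long the mouse has rested at one position."""

    def __init__(self, threshold_s: float):
        self.threshold_s = threshold_s  # the preset duration
        self._position = None
        self._start = None              # time at which the current rest began

    def update(self, position, now: float) -> bool:
        """Record one sample; return True once the dwell exceeds the threshold."""
        if position != self._position:
            # The mouse moved: restart the timer at the new position.
            self._position = position
            self._start = now
            return False
        return (now - self._start) > self.threshold_s
```

In this sketch `update` keeps returning True while the mouse stays put, so a caller that wants a single capture per rest would need to remember that it has already captured.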
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, step S13 specifically includes:
B1', detect the current position of the mouse in the webpage.
B2', receive a source code capture instruction, and capture the source code corresponding to the current mouse position according to that instruction. The source code capture instruction may be issued by pressing a mouse button (left and/or right).
In B1' and B2' above, the dwell time of the mouse at its position is not considered; the source code corresponding to the current mouse position is captured as soon as a source code capture instruction is received. Optionally, because the mouse pointer covers only a small part of the webpage, more source code can be obtained by capturing the source code of the layout block that contains the mouse position.
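Mapping a mouse position to the layout block that contains it is a rectangle hit test. The sketch below assumes each block is an axis-aligned rectangle with its source code already known; the `Block` type, its fields and `source_at` are hypothetical names, and a real implementation would obtain the geometry from the rendering engine.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: a rectangle plus its source code."""
    name: str
    x: int
    y: int
    width: int
    height: int
    source: str

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def source_at(blocks: Sequence[Block], px: int, py: int) -> Optional[str]:
    """Return the source of the first block containing the point, else None."""
    for block in blocks:
        if block.contains(px, py):
            return block.source
    return None
```

The same lookup serves both variants: it can be called when the dwell time is exceeded or when a mouse button press issues the capture instruction.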
Further, to capture more precise web data, the web data that the user selects in the displayed webpage is detected, and the source code corresponding to the selected web data is then captured. Because only the web data selected by the user is captured, the captured source code better matches the user's needs.
Optionally, to capture the web data of multiple webpages, the following steps are performed after step S13:
C1, judge whether the website to which the displayed webpage belongs has multiple webpages.
C2, when the website to which the displayed webpage belongs has multiple webpages, issue a page-turning instruction so as to display the webpage reached after the page turn. The page-turning instruction may be issued by the user clicking the "next page" button, or by the system clicking the "next page" button automatically each time a configured automatic-click interval elapses. So that automatic clicks resemble clicks made by a user, the automatic-click interval must not be too short, for example it should be greater than 3 seconds; it must also not be too long, so that capturing the web data does not take excessive time, for example it should be less than 8 minutes.
C3, capture the corresponding source code according to the webpage displayed after the page turn, so as to collect the web data.
In C1 to C3 above, issuing page-turning instructions allows the web data of multiple webpages to be captured, so the captured web data is more comprehensive.
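Steps C1 to C3 can be sketched as a loop that captures a page, asks for the next one, and waits between page turns. Everything named here is an assumption for illustration: `collect_pages`, the `next_page` and `capture` callables, the injected `wait` function and the `max_pages` safety cap are not part of the patent; only the 3-second and 8-minute bounds come from the text.

```python
import time

MIN_INTERVAL_S = 3.0       # the interval should be greater than 3 seconds
MAX_INTERVAL_S = 8 * 60.0  # and less than 8 minutes


def valid_interval(seconds: float) -> bool:
    """Check the automatic-click interval against the example bounds."""
    return MIN_INTERVAL_S < seconds < MAX_INTERVAL_S


def collect_pages(first_page, next_page, capture,
                  interval_s=5.0, wait=time.sleep, max_pages=100):
    """Capture each page in turn; next_page returns None on the last page."""
    if not valid_interval(interval_s):
        raise ValueError("interval must be between 3 seconds and 8 minutes")
    collected = []
    page = first_page
    while page is not None and len(collected) < max_pages:
        collected.append(capture(page))   # C3: capture the displayed page
        page = next_page(page)            # C1/C2: is there another page?
        if page is not None:
            wait(interval_s)              # pause before the automatic click
    return collected
```

Passing `wait` as a parameter lets the pacing be replaced in tests, while the default `time.sleep` gives the real delay between automatic clicks.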
Further, so that the captured web data can be reviewed later, the collected web data is stored after step S13. Specifically, it may be stored in a database, a file or an Excel spreadsheet. Supporting several storage forms makes the collected web data more convenient to analyze later.
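Two of the storage options can be sketched with the Python standard library. This is a minimal sketch under assumptions: the table name `web_data`, the column names and both function names are hypothetical, SQLite stands in for "a database", and CSV is used as a spreadsheet-readable substitute for a native Excel file.

```python
import csv
import io
import sqlite3


def store_in_database(conn, url, source):
    """Store one captured page in a SQLite table (created on first use)."""
    conn.execute("CREATE TABLE IF NOT EXISTS web_data (url TEXT, source TEXT)")
    conn.execute("INSERT INTO web_data (url, source) VALUES (?, ?)", (url, source))
    conn.commit()


def store_as_csv(stream, rows):
    """Write (url, source) rows as CSV, which spreadsheet tools can open."""
    writer = csv.writer(stream)
    writer.writerow(["url", "source"])
    writer.writerows(rows)
```

A caller might pass `sqlite3.connect(":memory:")` while experimenting and an open file (with `newline=""`) for the CSV variant.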
In the first embodiment of the present invention, a write instruction for a URL address is received and the corresponding URL address is written; the webpage corresponding to the URL address and the source code corresponding to the webpage are displayed; and the corresponding source code is captured according to the displayed webpage, so as to collect the web data. Because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of the subsequent data analysis results.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Embodiment two:
Fig. 5 shows a structural diagram of a system for collecting web data provided by the second embodiment of the present invention. The system may include user equipment that communicates with one or more core networks via a radio access network (RAN). The user equipment may be a mobile phone (also called a "cellular" phone), a computer with a mobile terminal, and so on; for example, it may be a portable, pocket-sized, hand-held, computer-embedded or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network. As further examples, the mobile device may be a smartphone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal or a vehicle-mounted computer. For ease of description, only the parts related to the embodiments of the present invention are shown.
The system for collecting web data includes a URL address write instruction receiving unit 51, a webpage display unit 52 and a web data collection unit 53:
The URL address write instruction receiving unit 51 is configured to receive a write instruction for a Uniform Resource Locator (URL) address and write the corresponding URL address.
The write instruction for the URL address may be issued by the user through a copy-and-paste operation, or by the user typing the address directly.
Because some webpages are developed for a specific browser, the system for collecting web data includes, so that the webpage can subsequently be displayed correctly and completely, a browser selection unit configured to select a browser matching the URL address, or to receive a browser selection instruction sent by the user and select the browser matching the URL address according to that instruction, for example Chrome, Firefox or IE. As shown in Fig. 3, to further speed up the collection of web data when the browser matching the URL address is selected, the system also includes a configuration instruction receiving unit, configured to receive a configuration instruction for browser parameters and configure the browser parameters according to that instruction. The browser parameters include: the HTTP send timeout, whether to enable script execution, whether to enable CSS, whether to enable redirection, ActiveXNative, and so on. For example, e-commerce websites usually need script execution enabled, whereas ordinary websites do not; leaving script execution disabled reduces traffic and speeds up the collection of web data. Further, to make the collected web data easier to analyze later, the configuration instruction receiving unit is also configured to set the name of the configuration item, the item description and the related field information.
Loading files with the extension js (files written in the JavaScript scripting language) also consumes a certain amount of traffic and time. Therefore, to further speed up the collection of web data, the system includes a file filtering unit, configured to filter out js files that do not need to be executed, that is, the filtered-out js files are not loaded.
The webpage display unit 52 is configured to display the webpage corresponding to the URL address and the source code corresponding to the webpage.
Note that the webpage and the source code are displayed on the same interface of the system, so that the user can compare them.
To meet different user needs flexibly, the system for collecting web data includes a reload-page instruction receiving unit, configured to receive a reload-page instruction sent by the user and display the webpage corresponding to the URL address according to that instruction. In addition or alternatively, the system receives a source code display instruction sent by the user and displays the source code corresponding to the webpage according to that instruction.
Optionally, because some websites display the corresponding webpage only after the user enters login information, the system for collecting web data includes the following units to reduce the user's operating steps and to display the webpage automatically and normally:
A login information judging unit, configured to judge whether the webpage corresponding to the URL address requires login information. Specifically, the URL addresses that require login information are stored in advance. When the written URL address is identical to one of the stored URL addresses that require login information, the webpage corresponding to the written URL address is judged to require login information; otherwise, it is judged not to require login information.
A login information writing unit, configured to write, when the webpage corresponding to the URL address requires login information, the login information obtained in advance into the corresponding position of the webpage, so as to log in to the webpage corresponding to the URL address.
Specifically, the login information for the webpage corresponding to the URL address is obtained in advance. After a URL address that requires login information is written, the login information obtained in advance is written into the corresponding position of the webpage, so that the system can display the webpage corresponding to the URL address once the webpage has verified the login information successfully.
The web data collection unit 53 is configured to capture the corresponding source code according to the displayed webpage, so as to collect the web data.
Specifically, when a webpage is displayed, the source code corresponding to the webpage currently shown on the display screen is captured, so that more web data is obtained in a single capture.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, the web data collection unit 53 includes:
A dwell time detection module, configured to detect the dwell time of the mouse at its current position in the webpage. Specifically, when the mouse rests at a position in the displayed webpage, the time at which the rest began is recorded, and the difference between that start time and the current time (that is, the dwell time) is computed at a fixed interval.
A source code capture module, configured to capture, when the dwell time of the mouse at its position in the webpage exceeds a preset duration, the source code corresponding to that position, so as to collect the web data. Optionally, because the mouse pointer covers only a small part of the webpage, more source code can be obtained by capturing the source code of the layout block that contains the mouse position.
With the dwell time detection module and the source code capture module above, the source code corresponding to the mouse position is captured automatically once the dwell time exceeds the preset duration, so no user operation is required, which makes capturing web data more convenient.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, the web data collection unit 53 includes:
A mouse position detection module, configured to detect the current position of the mouse in the webpage.
A source code capture instruction receiving module, configured to receive a source code capture instruction and capture the source code corresponding to the current mouse position according to that instruction. The source code capture instruction may be issued by pressing a mouse button (left and/or right).
With the mouse position detection module and the source code capture instruction receiving module above, the dwell time of the mouse at its position is not considered; the source code corresponding to the current mouse position is captured as soon as a source code capture instruction is received. Optionally, because the mouse pointer covers only a small part of the webpage, more source code can be obtained by capturing the source code of the layout block that contains the mouse position.
Further, to capture more precise web data, the system for collecting web data includes: a selected web data detection unit, configured to detect the web data that the user selects in the displayed webpage; and a selected web data capture unit, configured to capture the source code corresponding to the web data selected by the user. Because only the web data selected by the user is captured, the captured source code better matches the user's needs.
Optionally, to capture the web data of multiple webpages, the system for collecting web data includes:
A multiple-webpage judging unit, configured to judge whether the website to which the displayed webpage belongs has multiple webpages.
A page-turning instruction issuing unit, configured to issue a page-turning instruction when the website to which the displayed webpage belongs has multiple webpages, so as to display the webpage reached after the page turn.
A post-page-turn web data capture unit, configured to capture the corresponding source code according to the webpage displayed after the page turn, so as to collect the web data. The page-turning instruction may be issued by the user clicking the "next page" button, or by the system clicking the "next page" button automatically each time a configured automatic-click interval elapses. So that automatic clicks resemble clicks made by a user, the automatic-click interval must not be too short, for example it should be greater than 3 seconds; it must also not be too long, so that capturing the web data does not take excessive time, for example it should be less than 8 minutes.
With the multiple-webpage judging unit, the page-turning instruction issuing unit and the post-page-turn web data capture unit above, issuing page-turning instructions allows the web data of multiple webpages to be captured, so the captured web data is more comprehensive.
Further, so that the captured web data can be reviewed later, the system for collecting web data includes a web data storage unit, configured to store the collected web data. Specifically, the data may be stored in a database, a file or an Excel spreadsheet. Supporting several storage forms makes the collected web data more convenient to analyze later.
In the second embodiment of the present invention, because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of the subsequent data analysis results.
Those of ordinary skill in the art are it is to be appreciated that the list of each example described with reference to the embodiments described herein Member and algorithm steps, it can be realized with the combination of electronic hardware or computer software and electronic hardware.These functions are actually Performed with hardware or software mode, application-specific and design constraint depending on technical scheme.Professional and technical personnel Described function can be realized using distinct methods to each specific application, but this realization is it is not considered that exceed The scope of the present invention.
It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description, The specific work process of device and unit, the corresponding process in preceding method embodiment is may be referred to, will not be repeated here.
In several embodiments provided herein, it should be understood that disclosed systems, devices and methods, can be with Realize by another way.For example, device embodiment described above is only schematical, for example, the unit Division, only a kind of division of logic function, can there is other dividing mode, such as multiple units or component when actually realizing Another system can be combined or be desirably integrated into, or some features can be ignored, or do not perform.It is another, it is shown or The mutual coupling discussed or direct-coupling or communication connection can be the indirect couplings by some interfaces, device or unit Close or communicate to connect, can be electrical, mechanical or other forms.
The unit illustrated as separating component can be or may not be physically separate, show as unit The part shown can be or may not be physical location, you can with positioned at a place, or can also be distributed to multiple On NE.Some or all of unit therein can be selected to realize the mesh of this embodiment scheme according to the actual needs 's.
In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, can also That unit is individually physically present, can also two or more units it is integrated in a unit.
If the function is realized in the form of SFU software functional unit and is used as independent production marketing or in use, can be with It is stored in a computer read/write memory medium.Based on such understanding, technical scheme is substantially in other words The part to be contributed to prior art or the part of the technical scheme can be embodied in the form of software product, the meter Calculation machine software product is stored in a storage medium, including some instructions are causing a computer equipment (can be People's computer, server, or network equipment etc.) perform all or part of step of each embodiment methods described of the present invention. And foregoing storage medium includes:USB flash disk, mobile hard disk, read-only storage (ROM, Read-Only Memory), arbitrary access are deposited Reservoir (RAM, Random Access Memory), magnetic disc or CD etc. are various can be with the medium of store program codes.
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.

Claims (10)

1. A method for collecting web page data, characterized in that the method comprises:
receiving a write instruction for a uniform resource locator (URL) address, and writing the corresponding URL address;
displaying the web page corresponding to the URL address and the source code corresponding to the web page;
capturing the corresponding source code according to the displayed web page, so as to collect the web page data.
2. The method according to claim 1, characterized in that, before displaying the web page corresponding to the URL address and the source code corresponding to the web page, the method further comprises:
determining whether the web page corresponding to the URL address requires login information;
when the web page corresponding to the URL address requires login information, writing previously obtained login information into the corresponding position of the web page corresponding to the URL address, so as to log in to that web page.
3. The method according to claim 1 or 2, characterized in that capturing the corresponding source code according to the displayed web page, so as to collect the web page data, specifically comprises:
detecting the dwell time of the mouse pointer at its current position on the web page;
when the dwell time of the mouse pointer at its current position on the web page exceeds a preset duration, capturing the source code corresponding to that position on the web page, so as to collect the web page data.
4. The method according to claim 1 or 2, characterized in that capturing the corresponding source code according to the displayed web page, so as to collect the web page data, specifically comprises:
detecting the current position of the mouse pointer on the web page;
receiving a source code capture instruction, and capturing, according to the source code capture instruction, the source code corresponding to the current position of the mouse pointer on the web page.
5. The method according to claim 1 or 2, characterized in that, after capturing the corresponding source code according to the displayed web page, so as to collect the web page data, the method further comprises:
determining whether the website corresponding to the displayed web page has multiple web pages;
when the website corresponding to the displayed web page has multiple web pages, issuing a page-turning instruction, so as to display the corresponding web page after the page turn;
capturing the corresponding source code according to the web page displayed after the page turn, so as to collect the web page data.
6. A system for collecting web page data, characterized in that the system comprises:
a URL address write instruction receiving unit, configured to receive a write instruction for a uniform resource locator (URL) address, and to write the corresponding URL address;
a web page display unit, configured to display the web page corresponding to the URL address and the source code corresponding to the web page;
a web page data collection unit, configured to capture the corresponding source code according to the displayed web page, so as to collect the web page data.
7. The system according to claim 6, characterized in that the system further comprises:
a login information determining unit, configured to determine whether the web page corresponding to the URL address requires login information;
a login information writing unit, configured to, when the web page corresponding to the URL address requires login information, write previously obtained login information into the corresponding position of the web page corresponding to the URL address, so as to log in to that web page.
8. The system according to claim 6 or 7, characterized in that the web page data collection unit comprises:
a dwell time detection module, configured to detect the dwell time of the mouse pointer at its current position on the web page;
a source code capture module, configured to, when the dwell time of the mouse pointer at its current position on the web page exceeds a preset duration, capture the source code corresponding to that position on the web page, so as to collect the web page data.
9. The system according to claim 6 or 7, characterized in that the web page data collection unit comprises:
a mouse position detection module, configured to detect the current position of the mouse pointer on the web page;
a source code capture instruction receiving module, configured to receive a source code capture instruction, and to capture, according to the source code capture instruction, the source code corresponding to the current position of the mouse pointer on the web page.
10. The system according to claim 6 or 7, characterized in that the system further comprises:
a multiple web page determining unit, configured to determine whether the website corresponding to the displayed web page has multiple web pages;
a page-turning instruction issuing unit, configured to, when the website corresponding to the displayed web page has multiple web pages, issue a page-turning instruction, so as to display the corresponding web page after the page turn;
a post-page-turn web page data capturing unit, configured to capture the corresponding source code according to the web page displayed after the page turn, so as to collect the web page data.
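
As an illustration only, the dwell-time capture of claims 3 and 8 and the page-turning collection of claims 5 and 10 can be sketched in Python as follows. This is not the patented implementation. The names `MouseSample`, `dwell_captures`, and `collect_all_pages`, the `element_id` field, the `max_pages` guard, and the representation of mouse activity as timestamped samples are all assumptions made for this example; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, TypeVar

Page = TypeVar("Page")


@dataclass(frozen=True)
class MouseSample:
    """One observation of the pointer (hypothetical representation)."""
    t: float          # timestamp in seconds
    element_id: str   # assumed identifier of the page element under the pointer


def dwell_captures(samples: Iterable[MouseSample], threshold: float) -> List[str]:
    """Return ids of elements on which the pointer stayed longer than `threshold`.

    Each uninterrupted stay on one element produces at most one capture.
    """
    captured: List[str] = []
    current: Optional[str] = None  # element the pointer is currently over
    since = 0.0                    # time at which the pointer arrived there
    fired = False                  # whether this stay already produced a capture
    for s in samples:
        if s.element_id != current:
            # Pointer moved to a different element: restart the dwell timer.
            current, since, fired = s.element_id, s.t, False
        elif not fired and s.t - since > threshold:
            # Dwell time exceeded the preset duration: capture once.
            captured.append(s.element_id)
            fired = True
    return captured


def collect_all_pages(
    first_page: Page,
    capture: Callable[[Page], str],
    next_page: Callable[[Page], Optional[Page]],
    max_pages: int = 100,  # assumed safety limit, not part of the claims
) -> List[str]:
    """Capture source code from a page, then keep turning pages while more exist.

    `next_page` stands in for the page-turning instruction and returns None
    when the website has no further page.
    """
    results: List[str] = []
    page: Optional[Page] = first_page
    for _ in range(max_pages):
        if page is None:
            break
        results.append(capture(page))
        page = next_page(page)
    return results
```

With a preset duration of 1.0 second, samples that rest on element `a` from t=0 to t=1.2 and on element `c` from t=2.0 to t=3.5 yield captures for `a` and `c`, while a 0.3-second pass over `b` yields none. In a real collector, `capture` and `next_page` would be backed by whatever browser component displays the page; that binding is left open here.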
CN201610578428.3A 2016-07-20 2016-07-20 Method and system for collecting webpage data Active CN107644028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610578428.3A CN107644028B (en) 2016-07-20 2016-07-20 Method and system for collecting webpage data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610578428.3A CN107644028B (en) 2016-07-20 2016-07-20 Method and system for collecting webpage data

Publications (2)

Publication Number Publication Date
CN107644028A true CN107644028A (en) 2018-01-30
CN107644028B CN107644028B (en) 2020-09-04

Family

ID=61109212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610578428.3A Active CN107644028B (en) 2016-07-20 2016-07-20 Method and system for collecting webpage data

Country Status (1)

Country Link
CN (1) CN107644028B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670100A (en) * 2018-12-21 2019-04-23 第四范式(北京)技术有限公司 A kind of page data grasping means and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320387A (en) * 2008-07-11 2008-12-10 浙江大学 Web page text and image ranking method based on user caring time
CN102469111A (en) * 2010-10-29 2012-05-23 国际商业机器公司 Method and system for analyzing website access
CN103186670A (en) * 2013-03-27 2013-07-03 中金数据系统有限公司 Method and system for integrally acquiring webpage information
CN103593344A (en) * 2012-08-13 2014-02-19 北大方正集团有限公司 Information acquisition method and device
CN103927370A (en) * 2014-04-23 2014-07-16 焦点科技股份有限公司 Network information batch acquisition method of combined text and picture information
US8832055B1 (en) * 2005-06-16 2014-09-09 Gere Dev. Applications, LLC Auto-refinement of search results based on monitored search activities of users
CN104199874A (en) * 2014-08-20 2014-12-10 哈尔滨工程大学 Webpage recommendation method based on user browsing behaviors
CN105183453A (en) * 2015-08-07 2015-12-23 安一恒通(北京)科技有限公司 Webpage-based information acquiring method and apparatus
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832055B1 (en) * 2005-06-16 2014-09-09 Gere Dev. Applications, LLC Auto-refinement of search results based on monitored search activities of users
CN101320387A (en) * 2008-07-11 2008-12-10 浙江大学 Web page text and image ranking method based on user caring time
CN102469111A (en) * 2010-10-29 2012-05-23 国际商业机器公司 Method and system for analyzing website access
CN103593344A (en) * 2012-08-13 2014-02-19 北大方正集团有限公司 Information acquisition method and device
CN103186670A (en) * 2013-03-27 2013-07-03 中金数据系统有限公司 Method and system for integrally acquiring webpage information
CN103927370A (en) * 2014-04-23 2014-07-16 焦点科技股份有限公司 Network information batch acquisition method of combined text and picture information
CN104199874A (en) * 2014-08-20 2014-12-10 哈尔滨工程大学 Webpage recommendation method based on user browsing behaviors
CN105183453A (en) * 2015-08-07 2015-12-23 安一恒通(北京)科技有限公司 Webpage-based information acquiring method and apparatus
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670100A (en) * 2018-12-21 2019-04-23 第四范式(北京)技术有限公司 A kind of page data grasping means and device
CN109670100B (en) * 2018-12-21 2020-06-26 第四范式(北京)技术有限公司 Page data capturing method and device

Also Published As

Publication number Publication date
CN107644028B (en) 2020-09-04

Similar Documents

Publication Publication Date Title
JP5134684B2 (en) How to understand website information through web page structure analysis
KR101748196B1 (en) Determining message data to present
CN107800591A (en) A kind of analysis method of unified daily record data
CN107590654A (en) A kind of method of on-line payment, terminal and computer-readable medium
CN104050266B (en) User behavior recording method, device and web browser
CN111552633A (en) Interface abnormal call testing method and device, computer equipment and storage medium
CN107085549B (en) Method and device for generating fault information
CN112486708B (en) Page operation data processing method and processing system
CN110515830A (en) Operation trace method for visualizing, device, equipment and storage medium
CN104462397A (en) Promotion information processing method and promotion information processing device
CN104486495A (en) Method and device for displaying prompt message of new message at terminal
CN105159475B (en) A kind of characters input method and device
CN105022694A (en) Test case generation method and system for mobile terminal test
CN110647321A (en) Method, device and equipment for playing back operation flow and storage medium
CN109684571A (en) A kind of collecting method and device, storage medium
CN107515950A (en) A kind of image processing method, device, terminal and computer-readable recording medium
CN103929339B (en) A kind of web data acquisition method and system
JP6447726B2 (en) Card addition method, apparatus, device, and computer storage medium
TWI528186B (en) System and method for posting messages by audio signals
CN114064144A (en) Communication plug-in unit for cross-application data acquisition and communication method
CN105550179A (en) Webpage collection method and browser plug-in
CN106980658A (en) Video labeling method and device
CN107644028A (en) The collection method and system of web data
CN109583448B (en) Accounting method, device, electronic equipment and medium
WO2014183494A1 (en) Method, apparatus, and system of opening a web page

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant