CN107644028A - The collection method and system of web data - Google Patents
- Publication number
- CN107644028A CN201610578428.3A
- Authority
- CN
- China
- Prior art keywords
- webpage
- source code
- web data
- url addresses
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present invention is applicable to the software field and provides a method and device for collecting web data. The method includes: receiving a write instruction for a Uniform Resource Locator (URL) address, and writing the corresponding URL address; displaying the webpage corresponding to the URL address and the source code corresponding to the webpage; and capturing the corresponding source code according to the displayed webpage, so as to collect the web data. The method can improve the accuracy of the captured source code.
Description
Technical field
The embodiments of the present invention belong to the software field, and in particular relate to a method and system for collecting web data.
Background art
At present, users often need to collect and analyze the data of various webpages, and then judge the validity of the web data according to the analysis results, or perform other operations according to the analysis results.
Existing web data collection methods usually capture the data at a specified position in a webpage directly and then analyze the captured data. However, errors may occur during capture, that is, the captured data may not match the data at the specified position in the page, and it is difficult for the user to discover such a mismatch from the captured data alone. As a result, subsequent data analysis results may be wrong.
Summary of the invention
The embodiments of the present invention provide a method and system for collecting web data, intended to solve the problem that existing methods may capture data that does not match the specified position in the page, which makes the accuracy of the captured data too low.
The embodiments of the present invention are implemented as a method for collecting web data, the method including:
Receiving a write instruction for a Uniform Resource Locator (URL) address, and writing the corresponding URL address;
Displaying the webpage corresponding to the URL address and the source code corresponding to the webpage;
Capturing the corresponding source code according to the displayed webpage, so as to collect the web data.
Another object of the embodiments of the present invention is to provide a system for collecting web data, the system including:
A URL address write instruction receiving unit, configured to receive a write instruction for a Uniform Resource Locator (URL) address, and to write the corresponding URL address;
A webpage display unit, configured to display the webpage corresponding to the URL address and the source code corresponding to the webpage;
A web data collection unit, configured to capture the corresponding source code according to the displayed webpage, so as to collect the web data.
In the embodiments of the present invention, because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently being captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of subsequent data analysis results.
Brief description of the drawings
Fig. 1 is a flow chart of a method for collecting web data provided by the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the position where the URL address is written, provided by the first embodiment of the present invention;
Fig. 3 is a schematic diagram of the configurable browser parameters provided by the first embodiment of the present invention;
Fig. 4 is a schematic diagram of the "source code" button provided by the first embodiment of the present invention;
Fig. 5 is a structure diagram of a device for collecting web data provided by the second embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In the embodiments of the present invention, a write instruction for a URL address is received and the corresponding URL address is written; the webpage corresponding to the URL address and the source code corresponding to the webpage are displayed; and the corresponding source code is captured according to the displayed webpage, so as to collect the web data.
The technical solution of the present invention is illustrated below through specific embodiments.
Embodiment one:
Fig. 1 shows a flow chart of a method for collecting web data provided by the first embodiment of the present invention, detailed as follows:
Step S11: receive a write instruction for a Uniform Resource Locator (URL) address, and write the corresponding URL address.
The write instruction for the URL address can be issued by the user through a "copy" and then "paste" operation, or by the user typing the address directly. As shown in Fig. 2, the URL address corresponding to "Shaoguan court" is written at the "entry URL" field of the interface presented by the system.
Because some webpages are developed for specific browsers, a browser matching the URL address can be selected after step S11 is performed, so that the webpage can subsequently be displayed correctly and completely. Alternatively, after step S11 is performed, a browser selection instruction sent by the user is received, and the browser matching the URL address is selected according to that instruction, for example a browser such as Chrome, Firefox, or IE. In addition, as shown in Fig. 3, to further speed up the collection of web data, selecting the browser matching the URL address also includes: receiving a configuration instruction for browser parameters, and configuring the browser parameters according to that instruction. The browser parameters include: the HTTP send timeout, whether to enable script execution, whether to enable Cascading Style Sheets (CSS), whether to enable redirection, ActiveXNative, and so on. For example, e-commerce websites usually need script execution enabled, whereas general websites do not; when script execution does not need to be enabled, traffic usage is reduced and web data is collected faster. Further, to make the collected web data easier to analyze later, when the configuration instruction is received and the browser parameters are configured accordingly, the name of the configuration item, the item description, and the related field information are also configured.
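The parameters listed above could be held in a small configuration object. The following is only a minimal sketch under stated assumptions: the class `BrowserParams`, the function `apply_config`, the field names, and the default values are all hypothetical and do not come from the patent, which names the parameters but does not specify a data structure.

```python
from dataclasses import dataclass, field, fields, replace


@dataclass(frozen=True)
class BrowserParams:
    """Hypothetical container for the configurable browser parameters."""
    http_timeout_s: float = 30.0   # HTTP send timeout (default value is an assumption)
    enable_scripts: bool = False   # whether to enable script execution
    enable_css: bool = False       # whether to enable Cascading Style Sheets
    enable_redirects: bool = True  # whether to enable redirection
    # Metadata of the configuration item, kept to ease later analysis
    name: str = ""
    description: str = ""
    related_fields: dict = field(default_factory=dict)


def apply_config(params: BrowserParams, instruction: dict) -> BrowserParams:
    """Return a copy of `params` updated from a configuration instruction."""
    known = {f.name for f in fields(params)}
    unknown = set(instruction) - known
    if unknown:
        raise KeyError(f"unknown browser parameter(s): {sorted(unknown)}")
    return replace(params, **instruction)
```

For an e-commerce site, for instance, a caller might pass `{"enable_scripts": True}` as the instruction, while leaving scripts disabled for general websites.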
Loading files with the extension js (files written in the JavaScript scripting language) also consumes a certain amount of traffic and time. Therefore, to further increase the speed of collecting web data, the js files that do not need to be executed can be filtered out, that is, the filtered-out js files are not loaded.
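The js filtering described above can be sketched as a simple check on each resource URL before it is loaded. This is an illustrative assumption, not the patent's implementation: the function names `is_js_resource` and `filter_resources` and the `needed_js` allow-list are hypothetical.

```python
from urllib.parse import urlparse


def is_js_resource(url: str) -> bool:
    """Return True if the URL path ends with the .js extension."""
    return urlparse(url).path.lower().endswith(".js")


def filter_resources(urls, needed_js=frozenset()):
    """Drop js files that do not need to be executed; keep everything else.

    `needed_js` is a hypothetical allow-list of script URLs that must still load.
    """
    return [u for u in urls if not is_js_resource(u) or u in needed_js]
```

Checking the parsed path rather than the raw string means a query string such as `?v=1` does not hide the extension.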
Step S12: display the webpage corresponding to the URL address and the source code corresponding to the webpage.
It should be noted that the webpage and the source code are displayed on the same interface of the system, so that the user can compare them.
In this step, to flexibly meet different user needs, before the webpage corresponding to the URL address is displayed, the method includes: receiving a page reload instruction sent by the user. As shown in Fig. 2, when the user clicks the "reload page" button, a page reload instruction is sent, and the webpage corresponding to the URL address is displayed according to that instruction. Before the source code corresponding to the webpage is displayed, the method includes: receiving a source code display instruction sent by the user. As shown in Fig. 4, when the user clicks the "source code" button, a source code display instruction is sent, and the source code corresponding to the webpage is displayed according to that instruction.
Optionally, because some websites display the corresponding webpage only after the user enters login information, in order to reduce the user's operating steps and to display the webpage automatically and normally, the method includes, before step S12:
A1: judge whether the webpage corresponding to the URL address needs login information. Specifically, the URL addresses that need login information are stored in advance. When the written URL address is identical to one of the pre-stored URL addresses that need login information, it is determined that the webpage corresponding to the written URL address needs login information; otherwise, it is determined that the webpage corresponding to the written URL address does not need login information.
A2: when the webpage corresponding to the URL address needs login information, write the login information obtained in advance into the corresponding position of the webpage corresponding to the URL address, so as to log in to that webpage.
Specifically, the login information for logging in to the webpage corresponding to the URL address is obtained in advance. After a URL address that needs login information is written, the login information obtained in advance is written into the corresponding position of the webpage, so that once the webpage has successfully verified the login information, the system can display the webpage corresponding to the URL address.
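Steps A1 and A2 amount to an exact-match lookup against the pre-stored URL addresses, followed by filling in the stored login information. A minimal sketch follows, assuming the stored data is a dictionary keyed by URL; the names `needs_login` and `prepare_login`, the example URL, and the credential field names are hypothetical.

```python
def needs_login(url: str, stored_logins: dict) -> bool:
    """A1: the page needs login only if the URL is identical to a pre-stored one."""
    return url in stored_logins


def prepare_login(url: str, stored_logins: dict):
    """A2: return the form values to write into the page, or None if no login is needed."""
    if not needs_login(url, stored_logins):
        return None
    # Copy so that callers cannot mutate the stored login information
    return dict(stored_logins[url])


# Hypothetical pre-stored data; a real system would protect these credentials
STORED_LOGINS = {
    "https://example.com/account": {"username": "demo", "password": "secret"},
}
```

How the values are then written into the page's form fields depends on the browser component in use, which the text does not specify.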
Step S13: capture the corresponding source code according to the displayed webpage, so as to collect the web data.
Specifically, when a webpage is displayed, the source code corresponding to the webpage currently shown on the display screen is captured, so that more web data is obtained in a single capture.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, step S13 specifically includes:
B1: detect the dwell time of the current mouse at a position in the webpage. Specifically, when the mouse rests at some position in the displayed webpage, the start time of the rest is recorded, and at fixed intervals the difference between the start time and the current time (that is, the dwell time) is computed.
B2: when the dwell time of the current mouse at the position in the webpage exceeds a preset duration, capture the source code corresponding to the position of the current mouse in the webpage, so as to collect the web data. Optionally, because the area of the webpage occupied by the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse means capturing the source code of the layout section that corresponds to that position. For example, assume the displayed webpage is divided into several layout sections: section 1, section 2, section 3, and section 4. When the current mouse is at a position in the webpage that corresponds to section 1, the source code corresponding to section 1 is captured.
In B1 and B2 above, the source code corresponding to the position of the current mouse is captured automatically once the dwell time at that position exceeds the preset duration. No user operation is needed, which makes capturing web data more convenient.
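The dwell-time check in B1 and B2 can be sketched as a small tracker that is polled at a fixed interval. This is an assumption-laden illustration: the class `DwellTracker`, its `update` method, and the injectable `clock` parameter are hypothetical names, and the polling loop that calls `update` is left to the host application.

```python
import time


class DwellTracker:
    """Track how long the mouse has rested at one position (hypothetical sketch)."""

    def __init__(self, threshold_s: float, clock=time.monotonic):
        self.threshold_s = threshold_s  # preset duration from step B2
        self._clock = clock             # injectable so the logic can be tested
        self._position = None
        self._start = None

    def update(self, position) -> bool:
        """Record the current position; return True once the dwell time exceeds the threshold."""
        now = self._clock()
        if position != self._position:
            # The mouse moved: record the start time of the new rest
            self._position = position
            self._start = now
            return False
        # Dwell time = current time minus the recorded start time
        return (now - self._start) > self.threshold_s
```

Note that `update` keeps returning True while the mouse stays put, so a caller that should capture only once per rest would need to remember that it has already done so.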
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, step S13 specifically includes:
B1': detect the position of the current mouse in the webpage.
B2': receive a source code capture instruction, and capture the source code corresponding to the position of the current mouse in the webpage according to that instruction. The source code capture instruction can be issued by pressing a mouse button (the left button and/or the right button).
In B1' and B2' above, the dwell time of the current mouse at the position in the webpage does not matter: the source code corresponding to the position of the current mouse is captured as soon as a source code capture instruction is received. Optionally, because the area of the webpage occupied by the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse means capturing the source code of the layout section that corresponds to that position.
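Mapping a mouse position to the layout section that contains it can be sketched as a rectangle lookup. The patent does not say how sections are represented, so the `Section` class, its pixel-rectangle fields, and the function `source_at` below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Section:
    """Hypothetical layout section: a rectangle on the page plus its source code."""
    name: str
    left: int
    top: int
    width: int
    height: int
    source: str

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def source_at(sections: Sequence[Section], x: int, y: int) -> Optional[str]:
    """Return the source code of the section under the mouse, or None if there is none."""
    for section in sections:
        if section.contains(x, y):
            return section.source
    return None
```

The same lookup serves both variants: it can be called when the dwell time is exceeded (B2) or when a mouse button press is received (B2').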
Further, to capture more precise web data, the web data that the user selects in the displayed webpage is detected, and the corresponding source code is then captured according to the web data selected by the user. Because only the web data selected by the user is captured, the captured source code better meets the user's needs.
Optionally, to capture the web data corresponding to multiple webpages, the method includes, after step S13:
C1: judge whether the website corresponding to the displayed webpage has multiple webpages.
C2: when the website corresponding to the displayed webpage has multiple webpages, send a page-turning instruction so that the corresponding webpage is displayed after the page is turned. The page-turning instruction can be issued by the user clicking the "next page" button, or it can be issued by automatically clicking the "next page" button each time a set auto-click interval elapses. To make the page-turning instruction issued by automatic clicking resemble one issued by a user's click, the auto-click interval must not be too short; for example, it should be greater than 3 seconds. It must not be too long either, so that capturing the web data does not take too much time; for example, it should be less than 8 minutes.
C3: capture the corresponding source code according to the webpage displayed after the page is turned, so as to collect the web data.
In C1 to C3 above, because the web data of multiple webpages can be captured by sending page-turning instructions, the captured web data is more comprehensive.
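Steps C1 to C3 can be sketched as a loop that captures a page, checks for a next page, waits for the auto-click interval, and repeats. The bounds of 3 seconds and 8 minutes come from the examples in the text; everything else, including the function names and the callback parameters `get_source` and `next_page`, is a hypothetical stand-in for the browser component.

```python
import time

MIN_INTERVAL_S = 3       # example from the text: interval should be greater than 3 seconds
MAX_INTERVAL_S = 8 * 60  # example from the text: interval should be less than 8 minutes


def validate_interval(seconds: float) -> float:
    """Reject auto-click intervals outside the example bounds."""
    if not (MIN_INTERVAL_S < seconds < MAX_INTERVAL_S):
        raise ValueError("auto-click interval must be between 3 seconds and 8 minutes")
    return seconds


def collect_all_pages(first_page, get_source, next_page, interval_s, sleep=time.sleep):
    """Capture the source of every page, turning pages at the given interval.

    `get_source(page)` returns a page's source code; `next_page(page)` returns
    the following page, or None when there are no more pages (step C1).
    """
    validate_interval(interval_s)
    sources = []
    page = first_page
    while page is not None:
        sources.append(get_source(page))  # C3: capture the displayed page
        page = next_page(page)            # C2: turn the page if one exists
        if page is not None:
            sleep(interval_s)             # wait before the next automatic click
    return sources
```

Passing `sleep` as a parameter is a design choice made here so the loop can be exercised without real delays.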
Further, to make it easier to review the captured web data later, the collected web data is stored after step S13. Specifically, it can be stored in a database, a file, or an Excel spreadsheet. Storing the collected web data in a variety of ways makes subsequent analysis of the collected web data more convenient.
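Two of the storage options mentioned above, a database and a file, can be sketched with the Python standard library. The table name `web_data`, the two-column layout, and the function names are assumptions made for illustration; a CSV file is used here as a stand-in that spreadsheet programs such as Excel can open.

```python
import csv
import sqlite3


def store_in_database(conn: sqlite3.Connection, records) -> None:
    """Store (url, source) pairs in a hypothetical `web_data` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS web_data (url TEXT, source TEXT)")
    conn.executemany("INSERT INTO web_data (url, source) VALUES (?, ?)", records)
    conn.commit()


def store_as_csv(stream, records) -> None:
    """Write (url, source) pairs to a text stream in CSV form."""
    writer = csv.writer(stream)
    writer.writerow(["url", "source"])
    writer.writerows(records)
```

When writing to a real file, the `csv` module documentation recommends opening it with `newline=""`.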
In the first embodiment of the present invention, a write instruction for a URL address is received and the corresponding URL address is written; the webpage corresponding to the URL address and the source code corresponding to the webpage are displayed; and the corresponding source code is captured according to the displayed webpage, so as to collect the web data. Because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently being captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of subsequent data analysis results.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the numbering should not limit the implementation of the embodiments of the present invention in any way.
Embodiment two:
Fig. 5 shows a structure diagram of a system for collecting web data provided by the second embodiment of the present invention. The system may include user equipment that communicates with one or more core networks via a radio access network (RAN). The user equipment may be a mobile phone (also called a "cellular" phone), a computer with a mobile device, and so on; for example, the user equipment may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network. As another example, the mobile device may include a smartphone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and so on. For ease of description, only the parts related to the embodiments of the present invention are shown.
The system for collecting web data includes: a URL address write instruction receiving unit 51, a webpage display unit 52, and a web data collection unit 53.
The URL address write instruction receiving unit 51 is configured to receive a write instruction for a Uniform Resource Locator (URL) address, and to write the corresponding URL address.
The write instruction for the URL address can be issued by the user through a "copy" and then "paste" operation, or by the user typing the address directly.
Because some webpages are developed for specific browsers, in order to display the webpage correctly and completely later, the system for collecting web data includes: a browser selection unit, configured to select a browser matching the URL address, or to receive a browser selection instruction sent by the user and select the browser matching the URL address according to that instruction, for example a browser such as Chrome, Firefox, or IE. In addition, as shown in Fig. 3, to further speed up the collection of web data, when the browser matching the URL address is selected, the system for collecting web data also includes: a configuration instruction receiving unit, configured to receive a configuration instruction for browser parameters and configure the browser parameters according to that instruction. The browser parameters include: the HTTP send timeout, whether to enable script execution, whether to enable CSS, whether to enable redirection, ActiveXNative, and so on. For example, e-commerce websites usually need script execution enabled, whereas general websites do not; when script execution does not need to be enabled, traffic usage is reduced and web data is collected faster. Further, to make the collected web data easier to analyze later, the configuration instruction receiving unit is also configured to configure the name of the configuration item, the item description, and the related field information.
Loading files with the extension js (files written in the JavaScript scripting language) also consumes a certain amount of traffic and time. Therefore, to further increase the speed of collecting web data, the system for collecting web data includes: a file filtering unit, configured to filter out the js files that do not need to be executed, that is, the filtered-out js files are not loaded.
The webpage display unit 52 is configured to display the webpage corresponding to the URL address and the source code corresponding to the webpage.
It should be noted that the webpage and the source code are displayed on the same interface of the system, so that the user can compare them.
To flexibly meet different user needs, the system for collecting web data includes: a page reload instruction receiving unit, configured to receive a page reload instruction sent by the user and display the webpage corresponding to the URL address according to that instruction. And/or the system is configured to: receive a source code display instruction sent by the user, and display the source code corresponding to the webpage according to that instruction.
Optionally, because some websites display the corresponding webpage only after the user enters login information, in order to reduce the user's operating steps and to display the webpage automatically and normally, the system for collecting web data includes:
A login information judging unit, configured to judge whether the webpage corresponding to the URL address needs login information. Specifically, the URL addresses that need login information are stored in advance. When the written URL address is identical to one of the pre-stored URL addresses that need login information, it is determined that the webpage corresponding to the written URL address needs login information; otherwise, it is determined that the webpage corresponding to the written URL address does not need login information.
A login information writing unit, configured to write the login information obtained in advance into the corresponding position of the webpage corresponding to the URL address when that webpage needs login information, so as to log in to that webpage.
Specifically, the login information for logging in to the webpage corresponding to the URL address is obtained in advance. After a URL address that needs login information is written, the login information obtained in advance is written into the corresponding position of the webpage, so that once the webpage has successfully verified the login information, the system can display the webpage corresponding to the URL address.
The web data collection unit 53 is configured to capture the corresponding source code according to the displayed webpage, so as to collect the web data.
Specifically, when a webpage is displayed, the source code corresponding to the webpage currently shown on the display screen is captured, so that more web data is obtained in a single capture.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, the web data collection unit 53 includes:
A dwell time detection module, configured to detect the dwell time of the current mouse at a position in the webpage. Specifically, when the mouse rests at some position in the displayed webpage, the start time of the rest is recorded, and at fixed intervals the difference between the start time and the current time (that is, the dwell time) is computed.
A source code capture module, configured to capture the source code corresponding to the position of the current mouse in the webpage when the dwell time of the current mouse at that position exceeds a preset duration, so as to collect the web data. Optionally, because the area of the webpage occupied by the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse means capturing the source code of the layout section that corresponds to that position.
With the dwell time detection module and the source code capture module above, the source code corresponding to the position of the current mouse is captured automatically once the dwell time at that position exceeds the preset duration. No user operation is needed, which makes capturing web data more convenient.
Optionally, when only the source code corresponding to part of the webpage currently shown on the display screen is captured, the web data collection unit 53 includes:
A mouse position detection module, configured to detect the position of the current mouse in the webpage.
A source code capture instruction receiving module, configured to receive a source code capture instruction and capture the source code corresponding to the position of the current mouse in the webpage according to that instruction. The source code capture instruction can be issued by pressing a mouse button (the left button and/or the right button).
With the mouse position detection module and the source code capture instruction receiving module above, the dwell time of the current mouse at the position in the webpage does not matter: the source code corresponding to the position of the current mouse is captured as soon as a source code capture instruction is received. Optionally, because the area of the webpage occupied by the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse means capturing the source code of the layout section that corresponds to that position.
Further, to capture more precise web data, the system for collecting web data includes: a selected web data detection unit, configured to detect the web data that the user selects in the displayed webpage; and a selected web data capture unit, configured to capture the corresponding source code according to the web data selected by the user. Because only the web data selected by the user is captured, the captured source code better meets the user's needs.
Optionally, to capture the web data corresponding to multiple webpages, the system for collecting web data includes:
A multiple-webpage judging unit, configured to judge whether the website corresponding to the displayed webpage has multiple webpages.
A page-turning instruction issuing unit, configured to send a page-turning instruction when the website corresponding to the displayed webpage has multiple webpages, so that the corresponding webpage is displayed after the page is turned.
A post-page-turning web data capture unit, configured to capture the corresponding source code according to the webpage displayed after the page is turned, so as to collect the web data. The page-turning instruction can be issued by the user clicking the "next page" button, or it can be issued by automatically clicking the "next page" button each time a set auto-click interval elapses. To make the page-turning instruction issued by automatic clicking resemble one issued by a user's click, the auto-click interval must not be too short; for example, it should be greater than 3 seconds. It must not be too long either, so that capturing the web data does not take too much time; for example, it should be less than 8 minutes.
With the multiple-webpage judging unit, the page-turning instruction issuing unit, and the post-page-turning web data capture unit above, the web data of multiple webpages can be captured by sending page-turning instructions, so the captured web data is more comprehensive.
Further, to make it easier to review the captured web data later, the system for collecting web data includes: a web data storage unit, configured to store the collected web data. Specifically, the data can be stored in a database, a file, or an Excel spreadsheet. Storing the collected web data in a variety of ways makes subsequent analysis of the collected web data more convenient.
In the second embodiment of the present invention, because the source code is captured according to the displayed webpage, the user can easily judge whether the source code currently being captured is the source code that needs to be captured. This improves the accuracy of the captured source code and, in turn, the accuracy of subsequent data analysis results.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for collecting web page data, characterized in that the method comprises:
receiving a write instruction for a uniform resource locator (URL) address, and writing the corresponding URL address;
displaying the web page corresponding to the URL address and the source code corresponding to the web page;
capturing the corresponding source code according to the displayed web page, so as to collect the web page data.
2. The method according to claim 1, characterized in that, before displaying the web page corresponding to the URL address and the source code corresponding to the web page, the method comprises:
determining whether the web page corresponding to the URL address requires login information;
when the web page corresponding to the URL address requires login information, writing login information obtained in advance into the corresponding position of the web page corresponding to the URL address, so as to log in to the web page corresponding to the URL address.
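By way of non-limiting illustration only, the determination and writing steps of claim 2 might be sketched as follows. The sketch assumes that a page requires login information when its HTML contains a password input, and that the text input immediately preceding the password field holds the user name. The names `LoginFormFinder` and `fill_login`, and both heuristics, are hypothetical; the claim does not specify how the determination is made.

```python
from html.parser import HTMLParser


class LoginFormFinder(HTMLParser):
    """Collects the names of <input> fields and notes which one is a password field."""

    def __init__(self):
        super().__init__()
        self.field_names = []
        self.password_field = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is None:
            return
        self.field_names.append(name)
        # A valueless attribute is reported as None, so fall back to "".
        if (attributes.get("type") or "").lower() == "password":
            self.password_field = name


def fill_login(page_html, username, password):
    """Return form data for logging in, or None if the page shows no login form.

    Assumption: the input that precedes the password field holds the user name.
    """
    finder = LoginFormFinder()
    finder.feed(page_html)
    if finder.password_field is None:
        return None  # the page does not appear to require login information
    form_data = {finder.password_field: password}
    index = finder.field_names.index(finder.password_field)
    if index > 0:
        form_data[finder.field_names[index - 1]] = username
    return form_data
```

In such a sketch, a return value of `None` would mean the page can be displayed directly, while a dictionary would be submitted to the page before the display step.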
3. The method according to claim 1 or 2, characterized in that capturing the corresponding source code according to the displayed web page, so as to collect the web page data, specifically comprises:
detecting the dwell time of the current mouse pointer at its position on the web page;
when the dwell time of the current mouse pointer at its position on the web page exceeds a preset duration, capturing the source code corresponding to the position of the current mouse pointer on the web page, so as to collect the web page data.
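By way of non-limiting illustration only, the dwell-time detection of claim 3 might be sketched as follows. The sketch assumes pointer samples arrive as `(position, timestamp)` pairs with timestamps in seconds; the class name `DwellDetector` and the choice to fire only once per dwell are hypothetical and are not taken from the claim.

```python
class DwellDetector:
    """Signals a capture once the pointer has stayed at one position long enough."""

    def __init__(self, threshold_seconds):
        self.threshold_seconds = threshold_seconds
        self.position = None
        self.since = None
        self.fired = False

    def update(self, position, timestamp):
        """Feed one pointer sample; return the position to capture, or None."""
        if position != self.position:
            # The pointer moved: restart the dwell timer at the new position.
            self.position = position
            self.since = timestamp
            self.fired = False
            return None
        if not self.fired and timestamp - self.since > self.threshold_seconds:
            self.fired = True  # capture only once per dwell
            return position
        return None
```

A caller would pass each returned position to whatever routine captures the source code at that position.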
4. The method according to claim 1 or 2, characterized in that capturing the corresponding source code according to the displayed web page, so as to collect the web page data, specifically comprises:
detecting the position of the current mouse pointer on the web page;
receiving a source code capture instruction, and capturing, according to the source code capture instruction, the source code corresponding to the position of the current mouse pointer on the web page.
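By way of non-limiting illustration only, mapping a pointer position to its source code, as in claim 4, might be sketched as follows. The sketch assumes the displayed page is available as a list of elements with bounding boxes, and it resolves overlaps by choosing the smallest element under the pointer. The names `RenderedElement` and `capture_source_at`, and the smallest-area rule, are hypothetical; calling the function stands in for receiving the capture instruction.

```python
from dataclasses import dataclass


@dataclass
class RenderedElement:
    """A displayed element: its bounding box on the page and its source code."""
    left: int
    top: int
    width: int
    height: int
    source: str

    def contains(self, x, y):
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def capture_source_at(elements, x, y):
    """Return the source of the smallest element under (x, y), or None."""
    hits = [e for e in elements if e.contains(x, y)]
    if not hits:
        return None
    return min(hits, key=lambda e: e.width * e.height).source
```

Choosing the smallest enclosing element is one way to return the most specific source fragment rather than the source of the whole page.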
5. The method according to claim 1 or 2, characterized in that, after capturing the corresponding source code according to the displayed web page so as to collect the web page data, the method comprises:
determining whether the website corresponding to the displayed web page has multiple web pages;
when the website corresponding to the displayed web page has multiple web pages, issuing a page-turning instruction so as to display the corresponding web page after the page turn;
capturing the corresponding source code according to the web page displayed after the page turn, so as to collect the web page data.
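By way of non-limiting illustration only, the page-turning loop of claim 5 might be sketched as follows. The function `collect_all_pages`, its callable parameters `capture` and `next_page`, and the `max_pages` safety limit are hypothetical; in particular, `next_page` is assumed to return `None` when the website has no further page.

```python
def collect_all_pages(first_page, capture, next_page, max_pages=100):
    """Capture the first page, then keep turning pages until none remain."""
    collected = []
    page = first_page
    while page is not None and len(collected) < max_pages:
        collected.append(capture(page))
        page = next_page(page)  # returns None when there is no further page
    return collected


# Usage with an in-memory stand-in for a three-page website.
site = {1: "<p>a</p>", 2: "<p>b</p>", 3: "<p>c</p>"}
sources = collect_all_pages(
    1,
    capture=lambda n: site[n],
    next_page=lambda n: n + 1 if n + 1 in site else None,
)
```

The `max_pages` limit is included only so that a site whose page-turning control never runs out cannot keep the loop running indefinitely.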
6. A system for collecting web page data, characterized in that the system comprises:
a URL address write instruction receiving unit, configured to receive a write instruction for a uniform resource locator (URL) address and to write the corresponding URL address;
a web page display unit, configured to display the web page corresponding to the URL address and the source code corresponding to the web page;
a web page data collection unit, configured to capture the corresponding source code according to the displayed web page, so as to collect the web page data.
7. The system according to claim 6, characterized in that the system comprises:
a login information determining unit, configured to determine whether the web page corresponding to the URL address requires login information;
a login information writing unit, configured to, when the web page corresponding to the URL address requires login information, write login information obtained in advance into the corresponding position of the web page corresponding to the URL address, so as to log in to the web page corresponding to the URL address.
8. The system according to claim 6 or 7, characterized in that the web page data collection unit comprises:
a dwell time detection module, configured to detect the dwell time of the current mouse pointer at its position on the web page;
a source code capture module, configured to, when the dwell time of the current mouse pointer at its position on the web page exceeds a preset duration, capture the source code corresponding to the position of the current mouse pointer on the web page, so as to collect the web page data.
9. The system according to claim 6 or 7, characterized in that the web page data collection unit comprises:
a mouse position detection module, configured to detect the position of the current mouse pointer on the web page;
a source code capture instruction receiving module, configured to receive a source code capture instruction and to capture, according to the source code capture instruction, the source code corresponding to the position of the current mouse pointer on the web page.
10. The system according to claim 6 or 7, characterized in that the system comprises:
a multiple web page determining unit, configured to determine whether the website corresponding to the displayed web page has multiple web pages;
a page-turning instruction issuing unit, configured to, when the website corresponding to the displayed web page has multiple web pages, issue a page-turning instruction so as to display the corresponding web page after the page turn;
a post-page-turn web page data capture unit, configured to capture the corresponding source code according to the web page displayed after the page turn, so as to collect the web page data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610578428.3A CN107644028B (en) | 2016-07-20 | 2016-07-20 | Method and system for collecting webpage data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610578428.3A CN107644028B (en) | 2016-07-20 | 2016-07-20 | Method and system for collecting webpage data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107644028A (en) | 2018-01-30 |
CN107644028B (en) | 2020-09-04 |
Family
ID=61109212
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610578428.3A Active CN107644028B (en) | 2016-07-20 | 2016-07-20 | Method and system for collecting webpage data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107644028B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832055B1 (en) * | 2005-06-16 | 2014-09-09 | Gere Dev. Applications, LLC | Auto-refinement of search results based on monitored search activities of users |
CN101320387A (en) * | 2008-07-11 | 2008-12-10 | 浙江大学 | Web page text and image ranking method based on user caring time |
CN102469111A (en) * | 2010-10-29 | 2012-05-23 | 国际商业机器公司 | Method and system for analyzing website access |
CN103593344A (en) * | 2012-08-13 | 2014-02-19 | 北大方正集团有限公司 | Information acquisition method and device |
CN103186670A (en) * | 2013-03-27 | 2013-07-03 | 中金数据系统有限公司 | Method and system for integrally acquiring webpage information |
CN103927370A (en) * | 2014-04-23 | 2014-07-16 | 焦点科技股份有限公司 | Network information batch acquisition method of combined text and picture information |
CN104199874A (en) * | 2014-08-20 | 2014-12-10 | 哈尔滨工程大学 | Webpage recommendation method based on user browsing behaviors |
CN105183453A (en) * | 2015-08-07 | 2015-12-23 | 安一恒通(北京)科技有限公司 | Webpage-based information acquiring method and apparatus |
CN105512193A (en) * | 2015-11-26 | 2016-04-20 | 上海携程商务有限公司 | Data acquisition system and method based on browser expansion |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670100A (en) * | 2018-12-21 | 2019-04-23 | 第四范式(北京)技术有限公司 | A kind of page data grasping means and device |
CN109670100B (en) * | 2018-12-21 | 2020-06-26 | 第四范式(北京)技术有限公司 | Page data capturing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN107644028B (en) | 2020-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5134684B2 (en) | How to understand website information through web page structure analysis | |
KR101748196B1 (en) | Determining message data to present | |
CN107800591A (en) | A kind of analysis method of unified daily record data | |
CN107590654A (en) | A kind of method of on-line payment, terminal and computer-readable medium | |
CN104050266B (en) | User behavior recording method, device and web browser | |
CN111552633A (en) | Interface abnormal call testing method and device, computer equipment and storage medium | |
CN107085549B (en) | Method and device for generating fault information | |
CN112486708B (en) | Page operation data processing method and processing system | |
CN110515830A (en) | Operation trace method for visualizing, device, equipment and storage medium | |
CN104462397A (en) | Promotion information processing method and promotion information processing device | |
CN104486495A (en) | Method and device for displaying prompt message of new message at terminal | |
CN105159475B (en) | A kind of characters input method and device | |
CN105022694A (en) | Test case generation method and system for mobile terminal test | |
CN110647321A (en) | Method, device and equipment for playing back operation flow and storage medium | |
CN109684571A (en) | A kind of collecting method and device, storage medium | |
CN107515950A (en) | A kind of image processing method, device, terminal and computer-readable recording medium | |
CN103929339B (en) | A kind of web data acquisition method and system | |
JP6447726B2 (en) | Card addition method, apparatus, device, and computer storage medium | |
TWI528186B (en) | System and method for posting messages by audio signals | |
CN114064144A (en) | Communication plug-in unit for cross-application data acquisition and communication method | |
CN105550179A (en) | Webpage collection method and browser plug-in | |
CN106980658A (en) | Video labeling method and device | |
CN107644028A (en) | The collection method and system of web data | |
CN109583448B (en) | Accounting method, device, electronic equipment and medium | |
WO2014183494A1 (en) | Method, apparatus, and system of opening a web page |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||