CN107644028B - Method and system for collecting webpage data - Google Patents

Publication number
CN107644028B
Authority
CN
China
Prior art keywords
webpage
source code
instruction
url address
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610578428.3A
Other languages
Chinese (zh)
Other versions
CN107644028A (en
Inventor
徐介夫
朱杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201610578428.3A
Publication of CN107644028A
Application granted
Publication of CN107644028B

Abstract

The invention is applicable to the field of software and provides a method and a device for collecting webpage data. The method comprises the following steps: receiving a write-in instruction of a Uniform Resource Locator (URL) address, and writing in a corresponding URL address; displaying a webpage corresponding to the URL address and a source code corresponding to the webpage; and capturing corresponding source codes according to the displayed webpage so as to realize the collection of webpage data. The method can improve the accuracy of the captured source code.

Description

Method and system for collecting webpage data
Technical Field
The embodiment of the invention belongs to the field of software, and particularly relates to a method and a system for collecting webpage data.
Background
At present, users often need to collect and analyze data of each webpage, and then judge validity of the webpage data according to an analysis result, or execute other operations according to the analysis result, and the like.
Existing web page data collection methods usually capture data directly from a specified position in a web page and then analyze the captured data. Errors may occur during capture, however: data that does not correspond to the specified position in the web page may be captured, and it is difficult for a user to tell from the captured data alone that it came from the wrong position, which leads to errors in subsequent data analysis results.
Disclosure of Invention
The embodiment of the invention provides a method and a system for collecting webpage data, and aims to solve the problem that existing methods may capture data that does not correspond to the specified position of a page, so that the accuracy of the captured data is too low.
The embodiment of the invention is realized in such a way that a method for collecting webpage data comprises the following steps:
receiving a write-in instruction of a Uniform Resource Locator (URL) address, and writing in a corresponding URL address;
displaying a webpage corresponding to the URL address and a source code corresponding to the webpage;
and capturing corresponding source codes according to the displayed webpage so as to realize the collection of webpage data.
Another object of an embodiment of the present invention is to provide a system for collecting webpage data, where the system includes:
the write-in instruction receiving unit of the URL address is used for receiving a write-in instruction of a Uniform Resource Locator (URL) address and writing in the corresponding URL address;
the webpage display unit is used for displaying the webpage corresponding to the URL address and the source code corresponding to the webpage;
and the webpage data collecting unit is used for capturing the corresponding source code according to the displayed webpage so as to realize the collection of the webpage data.
In the embodiment of the invention, the corresponding source code is grabbed according to the displayed webpage, so that a user can conveniently judge whether the currently grabbed source code is the source code needing to be grabbed, the accuracy of the grabbed source code is improved, and the accuracy of the subsequent data analysis result is improved.
Drawings
FIG. 1 is a flowchart of a method for collecting web page data according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating a location of a write URL address according to a first embodiment of the present invention;
FIG. 3 is a diagram of configurable browser parameters provided by a first embodiment of the present invention;
FIG. 4 is a schematic diagram of a "source code" key provided by the first embodiment of the present invention;
FIG. 5 is a block diagram of a web page data collecting apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiment of the invention, a writing instruction of the URL address is received, the corresponding URL address is written, the webpage corresponding to the URL address and the source code corresponding to the webpage are displayed, and the corresponding source code is captured according to the displayed webpage, so that the collection of webpage data is realized.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
The first embodiment is as follows:
FIG. 1 is a flowchart illustrating a method for collecting web page data according to a first embodiment of the present invention, which is detailed as follows:
Step S11, receiving a write instruction of the URL address, and writing in the corresponding URL address.
The write command of the Uniform Resource Locator (URL) address may be issued by the user through the operation of "copy" and "paste", or may be issued by the user through direct input. As shown in fig. 2, a URL address corresponding to "shaoguan court" is written at an "entry URL" of an interface presented by the system.
Since some web pages are developed for a specific browser, a browser matching the URL address may be selected after step S11 is executed so that the web page is subsequently displayed correctly and completely. Alternatively, a browser selection instruction issued by the user is received after step S11 is executed, and the browser matching the URL address is selected according to the browser selection instruction. For example, a Chrome, Firefox or IE browser is selected. As shown in FIG. 3, in order to further increase the speed of collecting the web page data, selecting the browser matching the URL address further includes: receiving a configuration instruction of the browser parameters, and configuring the browser parameters according to the configuration instruction. The browser parameters include: the HTTP send timeout, whether script execution is enabled, whether Cascading Style Sheets (CSS) are enabled, whether redirection is enabled, ActiveX, etc. For example, an e-commerce website generally needs script execution enabled, whereas a general website does not; leaving script execution disabled where it is not needed reduces traffic and increases the speed of collecting webpage data. Further, in order to make subsequent analysis of the collected webpage data more convenient, the name of the project, the project description and the related field information are also configured when the browser parameters are configured according to the configuration instruction.
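The browser parameters described above can be pictured as a small configuration record that a configuration instruction overrides. The following is a minimal sketch only; the class name `BrowserConfig`, the field names, the default values, and the function `apply_config_instruction` are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class BrowserConfig:
    """Hypothetical container for the browser parameters listed in the text."""
    browser: str = "chrome"          # e.g. "chrome", "firefox", "ie"
    http_timeout_s: float = 30.0     # HTTP send timeout (unit assumed: seconds)
    enable_scripts: bool = False     # off by default to reduce traffic
    enable_css: bool = True
    enable_redirects: bool = True
    enable_activex: bool = False


def apply_config_instruction(config: BrowserConfig, instruction: dict) -> BrowserConfig:
    """Return a new config with the fields named in the instruction overridden."""
    unknown = set(instruction) - set(vars(config))
    if unknown:
        raise ValueError(f"unknown browser parameters: {sorted(unknown)}")
    return BrowserConfig(**{**vars(config), **instruction})
```

Under these assumptions, an e-commerce site would be collected with `apply_config_instruction(BrowserConfig(), {"enable_scripts": True})`, while a general site keeps the defaults.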
Since loading a js file (a file written in JavaScript) also consumes a certain amount of traffic and time, js files that do not need to be executed can be filtered out, that is, not loaded, in order to further increase the speed of collecting web page data.
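One way to picture this filtering step is as a pass over the list of resources a page would load. The sketch below assumes that scripts are recognised by a `.js` path suffix and that the caller supplies the set of scripts that must still run; both the function names and that allowlist approach are illustrative assumptions, not the patent's stated mechanism.

```python
from urllib.parse import urlparse


def is_script_url(url: str) -> bool:
    """True when the URL path ends in .js (any query string is ignored)."""
    return urlparse(url).path.lower().endswith(".js")


def filter_resources(resource_urls, required_scripts):
    """Keep non-script resources, plus only the scripts named in required_scripts."""
    required = set(required_scripts)
    return [u for u in resource_urls if not is_script_url(u) or u in required]
```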
Step S12, displaying the webpage corresponding to the URL address and the source code corresponding to the webpage.
It should be noted that the web page and the source code are both displayed on the same interface of the system, so that the user can compare them side by side.
In this step, in order to flexibly meet different requirements of the user, before displaying the web page corresponding to the URL address, the method includes: and receiving a page reloading instruction sent by a user. As shown in fig. 2, when the user clicks the "reload page" button, a reload page command is issued, and the web page corresponding to the URL address is displayed according to the reload page command. Before displaying the source code corresponding to the webpage, the method comprises the following steps: and receiving a source code display instruction sent by a user. As shown in fig. 4, when the user clicks the "source code" button, a source code display instruction is issued, and the source code corresponding to the web page is displayed according to the source code display instruction.
Optionally, since some websites require the user to input login information before displaying the corresponding web page, in order to reduce the operation steps of the user and to automatically and normally display the web page, before the step S12, the method includes:
A1, judging whether the webpage corresponding to the URL address needs login information. Specifically, the URL addresses that require login information are stored in advance. When the written URL address is the same as one of the pre-stored URL addresses that require login information, it is determined that the web page corresponding to the written URL address requires login information; otherwise, it is determined that the web page corresponding to the written URL address does not require login information.
A2, when the webpage corresponding to the URL address needs login information, writing the login information acquired in advance into the corresponding position of the webpage corresponding to the URL address, so as to log in to the webpage corresponding to the URL address.
Specifically, the login information of the webpage corresponding to the URL address is obtained in advance. After a URL address that requires login information is written, the obtained login information is written into the corresponding position of the webpage, so that once the webpage has verified the login information successfully, the system can display the webpage corresponding to the URL address.
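Steps A1 and A2 amount to a lookup against pre-stored URL addresses followed by filling in the stored login information. A minimal sketch follows, assuming an in-memory dictionary keyed by URL; the dictionary `LOGIN_INFO`, its sample entry, and the function names are hypothetical.

```python
from typing import Optional

# Hypothetical pre-stored data: URL address -> login information obtained in advance.
LOGIN_INFO = {
    "http://example.com/member": {"username": "demo", "password": "secret"},
}


def needs_login(url: str, stored=LOGIN_INFO) -> bool:
    """A1: the page needs login information if its URL equals a pre-stored URL."""
    return url in stored


def login_fields_for(url: str, stored=LOGIN_INFO) -> Optional[dict]:
    """A2: return a copy of the pre-acquired login information, or None."""
    return dict(stored[url]) if needs_login(url, stored) else None
```

The exact-match comparison mirrors the text ("the same as" a stored URL address); a real system might need to normalise URLs first, which this sketch does not attempt.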
Step S13, capturing the corresponding source code according to the displayed webpage so as to collect webpage data.
Specifically, when a web page is displayed, the source code corresponding to the currently displayed web page on the display screen is crawled, so that more web page data can be crawled in one crawl.
Optionally, when only the source code corresponding to the partial webpage currently displayed on the display screen is crawled, the step S13 specifically includes:
B1, detecting the stay time of the current mouse at the position of the webpage. Specifically, when the mouse stays at a certain position in the displayed web page, the starting time of the mouse stay is recorded, and the difference between the starting time and the current time (i.e., the stay duration) is computed at fixed intervals.
B2, when the stay time of the current mouse at the position of the webpage exceeds the preset time, capturing the source code corresponding to the position of the current mouse on the webpage to realize the collection of the webpage data. Optionally, since the area of the web page under the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse on the web page means capturing the source code of the layout corresponding to that position. For example, assume that the displayed web page is divided into a plurality of layouts: layout 1, layout 2, layout 3 and layout 4. When the current mouse is at a position of the webpage that corresponds to layout 1, the source code corresponding to layout 1 is captured.
In the above B1 and B2, when the staying time of the current mouse at the position of the web page exceeds the preset time, the source code corresponding to the position of the current mouse at the web page is automatically captured, so that the user operation is not required, and the convenience of capturing the web page data is improved.
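The dwell check in B1 and B2 can be sketched as a small state machine that is polled at fixed intervals. This is an illustration under assumptions: the class name `DwellDetector`, the injected `clock` parameter, and the choice to signal only once per stay are not from the patent.

```python
import time


class DwellDetector:
    """Signals when the mouse has stayed at one position longer than a threshold."""

    def __init__(self, threshold_s: float, clock=time.monotonic):
        self.threshold_s = threshold_s
        self.clock = clock          # injectable so the logic can be tested
        self.position = None
        self.start = None
        self.fired = False

    def update(self, position) -> bool:
        """Call at fixed intervals with the current mouse position.

        Returns True once per stay, when the stay duration exceeds the threshold.
        """
        now = self.clock()
        if position != self.position:
            # The mouse moved: record the starting time of the new stay (B1).
            self.position = position
            self.start = now
            self.fired = False
            return False
        if not self.fired and now - self.start > self.threshold_s:
            self.fired = True       # B2: the caller should capture source code now
            return True
        return False
```

A caller would capture the source code for the current position whenever `update` returns True.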
Optionally, when only the source code corresponding to the partial webpage currently displayed on the display screen is crawled, the step S13 specifically includes:
B1', detecting the position of the current mouse on the webpage.
B2', receiving a source code grabbing instruction, and grabbing a source code corresponding to the position of the current mouse on the webpage according to the source code grabbing instruction. Wherein, the source code grabbing instruction can be sent out by pressing a mouse key (a left key and/or a right key).
In the above B1' and B2', there is no need to consider how long the current mouse has stayed at the position of the web page: the source code corresponding to the position of the current mouse on the web page is captured as soon as the source code grabbing instruction is received. Optionally, since the area of the web page under the mouse is small, in order to capture more source code, capturing the source code corresponding to the position of the current mouse on the web page means capturing the source code of the layout corresponding to that position.
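Mapping a mouse position to the layout that contains it is a simple hit test. The sketch below assumes each layout is an axis-aligned rectangle with its source code already known; the `Layout` class, its fields, and `source_at` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layout:
    """Hypothetical rectangular page section with its associated source code."""
    name: str
    x: int
    y: int
    width: int
    height: int
    source: str

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height


def source_at(layouts, px: int, py: int) -> Optional[str]:
    """Return the source code of the first layout containing the point, else None."""
    for layout in layouts:
        if layout.contains(px, py):
            return layout.source
    return None
```

On receiving a source code grabbing instruction (for example a mouse key press), a caller would pass the current mouse coordinates to `source_at`.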
Further, in order to capture more accurate webpage data, webpage data selected by a user on a displayed webpage are detected; and capturing the corresponding source code according to the webpage data selected by the user. Because only the webpage data selected by the user is captured, the captured source code is more in line with the requirements of the user.
Optionally, in order to capture web page data corresponding to a plurality of web pages, after the step S13, the method includes:
C1, determining whether there are multiple web pages in the website corresponding to the displayed web page.
C2, when a plurality of web pages exist in the website corresponding to the displayed web page, sending a page turning instruction to display the web page after page turning. The page turning instruction may be issued by the user clicking a "next page" key, or by automatically clicking the "next page" key each time the automatic click interval elapses. So that automatically issued page turning instructions resemble those issued by a user, the automatic click interval should not be too short, for example, it should be more than 3 seconds. Nor should it be too long, which would make capturing the webpage data take too long, for example, it should be less than 8 minutes.
And C3, capturing the corresponding source code according to the corresponding webpage after page turning to realize the collection of the webpage data.
In the above-mentioned C1 to C3, since the web page data of a plurality of web pages can be fetched by issuing a page turn instruction, the fetched web page data is more comprehensive.
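The loop formed by C1 to C3, together with the interval bounds mentioned for automatic clicking, can be sketched as follows. The callables `get_source` and `next_page`, the injected `sleep`, and the constant names are assumptions made for illustration; only the 3-second and 8-minute figures come from the text, where they are given as examples.

```python
import time

MIN_INTERVAL_S = 3        # example lower bound from the text: more than 3 seconds
MAX_INTERVAL_S = 8 * 60   # example upper bound from the text: less than 8 minutes


def validate_interval(interval_s: float) -> float:
    """Reject automatic click intervals outside the example bounds."""
    if not (MIN_INTERVAL_S < interval_s < MAX_INTERVAL_S):
        raise ValueError("interval must be more than 3 s and less than 8 min")
    return interval_s


def collect_all_pages(first_page, get_source, next_page, interval_s, sleep=time.sleep):
    """Capture source code page by page, waiting interval_s before each page turn."""
    validate_interval(interval_s)
    sources = []
    page = first_page
    while page is not None:
        sources.append(get_source(page))   # capture the current page (S13 / C3)
        following = next_page(page)        # C1: is there another page?
        if following is not None:
            sleep(interval_s)              # C2: wait, then "click" next page
        page = following
    return sources
```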
Further, in order to facilitate subsequent viewing of the crawled web page data, after step S13, the collected web page data is stored. In particular, it may be stored in the form of a database, a file, or excel. The collected webpage data are stored in various modes, and convenience of subsequent analysis of the collected webpage data is improved.
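As an illustration of storing the collected data in more than one form, the sketch below writes the same records to an SQLite table and to CSV text. The table name `pages`, the two-column record shape, and the use of CSV as a spreadsheet-readable stand-in for "excel" are assumptions, not details given in the patent.

```python
import csv
import sqlite3


def save_to_database(conn: sqlite3.Connection, records) -> None:
    """Store (url, source) pairs in a hypothetical 'pages' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, source TEXT)")
    conn.executemany("INSERT INTO pages (url, source) VALUES (?, ?)", records)
    conn.commit()


def save_to_csv(stream, records) -> None:
    """Write (url, source) pairs as CSV, which spreadsheet tools can open."""
    writer = csv.writer(stream)
    writer.writerow(["url", "source"])
    writer.writerows(records)
```

`save_to_csv` accepts any text stream, so it can write to a file opened with `newline=""` or to an in-memory buffer.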
In the first embodiment of the present invention, a write instruction of a URL address is received, a corresponding URL address is written, a web page corresponding to the URL address and a source code corresponding to the web page are displayed, and the corresponding source code is captured according to the displayed web page, so as to collect web page data. The corresponding source code is grabbed according to the displayed webpage, so that a user can conveniently judge whether the currently grabbed source code is the source code needing to be grabbed, the accuracy of the grabbed source code is improved, and the accuracy of a subsequent data analysis result is improved.
It should be understood that, in the embodiment of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present invention.
The second embodiment is as follows:
FIG. 5 shows a block diagram of a system for collecting web page data according to a second embodiment of the present invention. The system may include user equipment communicating with one or more core networks via a radio access network (RAN). The user equipment may be a mobile telephone (or "cellular" telephone), a computer with mobile equipment, etc.; it may also be, for example, a portable, pocket, hand-held, computer-included, or vehicle-mounted mobile device, which exchanges voice and/or data with the radio access network. As further examples, the mobile device may include a smartphone, a tablet computer, a Personal Digital Assistant (PDA), a point-of-sale (POS) terminal or a vehicle-mounted computer, etc. For convenience of explanation, only portions related to the embodiments of the present invention are shown.
The system for collecting web page data comprises: URL address write instruction receiving unit 51, web page display unit 52, web page data collecting unit 53:
The write instruction receiving unit 51 of the URL address is configured to receive a write instruction of the URL address and write in the corresponding URL address.
The write instruction of the URL address may be issued by the user through "copy" and then "paste", or by the user's direct input.
Since some web pages are developed for a specific browser, in order to display the web pages correctly and completely, the system for collecting web page data includes: a browser selecting unit, used for selecting the browser matching the URL address, or for receiving a browser selection instruction issued by a user and selecting the browser matching the URL address according to the browser selection instruction. For example, a Chrome, Firefox or IE browser is selected. As shown in FIG. 3, in order to further increase the speed of collecting the web page data, when the browser matching the URL address is selected, the system for collecting web page data further includes: a configuration instruction receiving unit, used for receiving a configuration instruction of the browser parameters and configuring the browser parameters according to the configuration instruction. The browser parameters include: the HTTP send timeout, whether script execution is enabled, whether CSS is enabled, whether redirection is enabled, ActiveX, etc. For example, an e-commerce website generally needs script execution enabled, whereas a general website does not; leaving script execution disabled where it is not needed reduces traffic and increases the speed of collecting webpage data. Further, in order to make subsequent analysis of the collected web page data more convenient, the configuration instruction receiving unit is also used for configuring the name of the project, the project description and the related field information.
Because loading a file with a js extension (a file written in JavaScript) also consumes a certain amount of traffic and time, in order to further increase the speed of collecting web page data, the system for collecting web page data includes: a file filtering unit, used for filtering out the js files that do not need to be executed, that is, the js files that do not need to be executed are not loaded.
And a web page display unit 52, configured to display a web page corresponding to the URL address and a source code corresponding to the web page.
It should be noted that the web page and the source code are both displayed on the same interface of the system, so that the user can compare them side by side.
In order to flexibly meet different requirements of users, the system for collecting webpage data comprises: a page reload instruction receiving unit, used for receiving a page reload instruction issued by a user and displaying the webpage corresponding to the URL address according to the page reload instruction. And/or, it comprises: a source code display instruction receiving unit, used for receiving a source code display instruction issued by a user and displaying the source code corresponding to the webpage according to the source code display instruction.
Optionally, since some websites require the user to input login information before displaying the corresponding web page, in order to reduce the operation steps of the user and to automatically and normally display the web page, the system for collecting web page data includes:
a login information judging unit, used for judging whether the webpage corresponding to the URL address needs login information. Specifically, the URL addresses that require login information are stored in advance. When the written URL address is the same as one of the pre-stored URL addresses that require login information, it is determined that the web page corresponding to the written URL address requires login information; otherwise, it is determined that the web page corresponding to the written URL address does not require login information.
And the login information writing unit is used for writing the login information acquired in advance into the corresponding position of the webpage corresponding to the URL address when the webpage corresponding to the URL address needs the login information so as to log in the webpage corresponding to the URL address.
Specifically, the login information of the webpage corresponding to the login address is obtained in advance, and after the URL address needing the login information is written in, the obtained login information is written in the corresponding position of the webpage, so that after the login information is verified successfully by the webpage, the webpage corresponding to the URL address can be displayed by the system.
And the web page data collecting unit 53 is configured to capture a corresponding source code according to the displayed web page to collect the web page data.
Specifically, when a web page is displayed, the source code corresponding to the currently displayed web page on the display screen is crawled, so that more web page data can be crawled in one crawl.
Optionally, when the source code corresponding to only a part of the web page currently displayed on the display screen is crawled, the web page data collecting unit 53 includes:
and the dwell time detection module is used for detecting the dwell time of the current mouse at the position of the webpage. Specifically, when the mouse stays at a certain position in the displayed web page, the starting time of the mouse stay is recorded, and the difference between the starting time and the current time (i.e., the stay duration) is counted at fixed intervals.
And the source code capturing module is used for capturing the source code corresponding to the position of the current mouse on the webpage when the staying time of the current mouse on the position of the webpage exceeds the preset time so as to collect webpage data. Optionally, since the position of the web page occupied by the mouse is not large, in order to capture more source codes, capturing the source code corresponding to the position of the current mouse on the web page means capturing the source code of the layout corresponding to the position of the current mouse on the web page.
In the dwell time detection module and the source code capturing module, when the dwell time of the current mouse at the position of the webpage exceeds the preset time, the source code corresponding to the position of the current mouse at the webpage is automatically captured, so that user operation is not needed, and the convenience of capturing webpage data is improved.
Optionally, when the source code corresponding to only a part of the web page currently displayed on the display screen is crawled, the web page data collecting unit 53 includes:
and the mouse position detection module is used for detecting the position of the current mouse on the webpage.
And the source code grabbing instruction receiving module is used for receiving a source code grabbing instruction and grabbing a source code corresponding to the position of the current mouse on the webpage according to the source code grabbing instruction. Wherein, the source code grabbing instruction can be sent out by pressing a mouse key (a left key and/or a right key).
In the mouse position detection module and the source code capture instruction receiving module, the retention time of the current mouse at the position of the webpage does not need to be concerned, and the source code corresponding to the current mouse at the position of the webpage can be captured as long as the source code capture instruction is received. Optionally, since the position of the web page occupied by the mouse is not large, in order to capture more source codes, capturing the source code corresponding to the position of the current mouse on the web page means capturing the source code of the layout corresponding to the position of the current mouse on the web page.
Further, in order to be able to capture more accurate web page data, the web page data collection system includes: the selected webpage data detection unit is used for detecting webpage data selected by a user on a displayed webpage; and the selected webpage data grabbing unit is used for grabbing the corresponding source code according to the webpage data selected by the user. Because only the webpage data selected by the user is captured, the captured source code is more in line with the requirements of the user.
Optionally, in order to be able to capture webpage data corresponding to a plurality of webpages, the system for collecting webpage data includes:
a multiple-webpage judging unit, used for judging whether a plurality of webpages exist in the website corresponding to the displayed webpage.
A page turning instruction sending unit, used for sending a page turning instruction to display the webpage after page turning when a plurality of webpages exist in the website corresponding to the displayed webpage.
A post-page-turning webpage data grabbing unit, used for grabbing the corresponding source code according to the webpage displayed after page turning so as to realize the collection of the webpage data. The page turning instruction may be issued by the user clicking a "next page" key, or by automatically clicking the "next page" key each time the automatic click interval elapses. So that automatically issued page turning instructions resemble those issued by a user, the automatic click interval should not be too short, for example, it should be more than 3 seconds. Nor should it be too long, which would make capturing the webpage data take too long, for example, it should be less than 8 minutes.
With the multiple-webpage judging unit, the page turning instruction sending unit and the post-page-turning webpage data grabbing unit, the web page data of a plurality of web pages can be captured by sending page turning instructions, so that the captured web page data is more comprehensive.
Further, in order to facilitate subsequent viewing of the crawled web page data, the web page data collection system comprises: and the webpage data storage unit is used for storing the collected webpage data. In particular, it may be stored in the form of a database, a file, or excel. The collected webpage data are stored in various modes, and convenience of subsequent analysis of the collected webpage data is improved.
In the second embodiment of the present invention, since the corresponding source code is fetched according to the displayed web page, it is convenient for the user to determine whether the currently fetched source code is a source code that needs to be fetched, so as to improve the accuracy of the fetched source code, and further improve the accuracy of the subsequent data analysis result.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The above describes only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for collecting data of a web page, the method comprising:
receiving a write-in instruction for a Uniform Resource Locator (URL) address, writing in the corresponding URL address, and filtering out files written in the JavaScript scripting language that do not need to be executed;
displaying a webpage corresponding to the URL address and a source code corresponding to the webpage;
capturing the corresponding source code according to the displayed webpage to realize the collection of webpage data, which specifically comprises: detecting the dwell time of the mouse at its current position on the webpage; and when the dwell time of the mouse at its current position on the webpage exceeds a preset time, capturing the source code corresponding to that position to realize the collection of webpage data;
after the receiving of the write-in instruction for the Uniform Resource Locator (URL) address and the writing in of the corresponding URL address, the method further comprises: selecting a browser matched with the URL address, receiving a configuration instruction for browser parameters, configuring the browser parameters according to the configuration instruction, and configuring the name, description and related field information of an item; wherein the browser parameters include: the HTTP send timeout, whether script execution is enabled, whether cascading style sheets are enabled, whether redirection is enabled, and ActiveXNative;
before the displaying of the webpage corresponding to the URL address, the method comprises: receiving a page reload instruction sent by a user, and displaying the webpage corresponding to the URL address according to the page reload instruction; and before the displaying of the source code corresponding to the webpage, the method comprises: receiving a source code display instruction sent by the user, and displaying the source code corresponding to the webpage according to the source code display instruction.
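The dwell-time trigger recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `DwellDetector`, the `capture` callback (a hypothetical hook that returns the source code at a position), and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Position = Tuple[int, int]


@dataclass
class DwellDetector:
    """Triggers one capture when the pointer rests at a position long enough."""

    threshold_s: float                  # the "preset time"; the value is an assumption
    capture: Callable[[Position], str]  # hypothetical hook: source code at a position
    _pos: Optional[Position] = None     # position currently being timed
    _since: float = 0.0                 # time at which the pointer arrived there
    _fired: bool = False                # whether this dwell already triggered a capture

    def on_mouse_sample(self, pos: Position, now_s: float) -> Optional[str]:
        """Feed one (position, timestamp) sample; return captured source or None."""
        if pos != self._pos:
            # Pointer moved: restart the dwell timer at the new position.
            self._pos, self._since, self._fired = pos, now_s, False
            return None
        if not self._fired and now_s - self._since > self.threshold_s:
            self._fired = True          # capture at most once per dwell
            return self.capture(pos)
        return None
```

Timestamps are passed in rather than read from a clock, so the logic can be exercised without a real browser or a real mouse.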
2. The method of claim 1, further comprising, before displaying the webpage corresponding to the URL address and the source code corresponding to the webpage:
judging whether the webpage corresponding to the URL address needs login information or not;
and when the webpage corresponding to the URL address needs login information, writing the login information acquired in advance into the corresponding position of the webpage corresponding to the URL address so as to log in the webpage corresponding to the URL address.
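The claim does not state how the need for login information is judged. As one possible heuristic, and purely as an assumption, the following Python sketch treats a page as requiring login when its source contains a password input field. The names `needs_login` and `_PasswordFieldFinder` are hypothetical.

```python
from html.parser import HTMLParser


class _PasswordFieldFinder(HTMLParser):
    """Records whether any <input type="password"> start tag is seen."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; a valueless attribute is None.
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.found = True


def needs_login(page_source: str) -> bool:
    """Heuristic (an assumption): a password field means login is required."""
    finder = _PasswordFieldFinder()
    finder.feed(page_source)
    return finder.found
```

A real collector would likely also need site-specific rules, since many pages show a login form without requiring login to view their content.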
3. The method according to claim 1 or 2, wherein the capturing of the corresponding source code according to the displayed webpage to realize the collection of webpage data specifically comprises:
detecting the current position of the mouse on the webpage;
and receiving a source code capture instruction, and capturing, according to the source code capture instruction, the source code corresponding to the current position of the mouse on the webpage.
4. The method according to claim 1 or 2, further comprising, after the capturing of the corresponding source code according to the displayed webpage to realize the collection of webpage data:
judging whether the website corresponding to the displayed webpage contains a plurality of webpages;
when the website corresponding to the displayed webpage contains a plurality of webpages, issuing a page-turning instruction to display the webpage reached after the page turn;
and capturing the corresponding source code according to the webpage reached after the page turn, so as to realize the collection of webpage data.
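The page-turning steps of claim 4 amount to a loop: capture, check for a further page, turn, and capture again. The Python sketch below shows that control flow only. The `capture` and `next_page` callables and the `max_pages` safety limit are hypothetical stand-ins and do not come from the patent.

```python
from typing import Callable, Iterator, Optional


def collect_all_pages(
    first_page: str,
    capture: Callable[[str], str],              # hypothetical: page -> captured source code
    next_page: Callable[[str], Optional[str]],  # hypothetical: page -> next page, or None
    max_pages: int = 100,                       # assumed safety limit against endless paging
) -> Iterator[str]:
    """Yield the captured source of each page, turning pages until none remain."""
    page: Optional[str] = first_page
    seen = 0
    while page is not None and seen < max_pages:
        yield capture(page)     # capture the currently displayed page
        seen += 1
        page = next_page(page)  # "page-turning instruction"; None means the last page
```

Writing the loop as a generator lets the caller stop early, for example once enough data has been collected.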
5. A system for collecting data on a web page, the system comprising:
the URL address write-in instruction receiving unit is used for receiving a write-in instruction for a URL address, writing in the corresponding URL address, and filtering out files written in the JavaScript scripting language that do not need to be executed;
the webpage display unit is used for displaying the webpage corresponding to the URL address and the source code corresponding to the webpage;
the webpage data collecting unit is used for capturing corresponding source codes according to the displayed webpage so as to realize the collection of the webpage data;
the web page data collecting unit includes:
the dwell time detection module is used for detecting the dwell time of the mouse at its current position on the webpage;
the source code capturing module is used for capturing the source code corresponding to the current position of the mouse on the webpage when the dwell time of the mouse at that position exceeds a preset time, so as to collect webpage data;
the browser selecting unit is used for selecting a browser matched with the URL address;
the configuration instruction receiving unit is used for receiving a configuration instruction for browser parameters, configuring the browser parameters according to the configuration instruction, and configuring the name, description and related field information of an item; wherein the browser parameters include: the HTTP send timeout, whether script execution is enabled, whether cascading style sheets are enabled, whether redirection is enabled, and ActiveXNative;
and the page reload instruction receiving unit is used for receiving a page reload instruction sent by a user, displaying the webpage corresponding to the URL address according to the page reload instruction, receiving a source code display instruction sent by the user, and displaying the source code corresponding to the webpage according to the source code display instruction.
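The browser parameters listed in claims 1 and 5 can be pictured as a small configuration record that a configuration instruction overrides. The Python sketch below is an illustration under assumptions: the field names, default values, and the `apply_config` helper are hypothetical, and the claims do not specify how the parameters are represented.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BrowserParams:
    """The five browser parameters named in the claims; defaults are assumptions."""

    http_timeout_ms: int = 30_000   # HTTP send timeout
    script_enabled: bool = True     # whether script execution is enabled
    css_enabled: bool = False       # whether cascading style sheets are enabled
    redirect_enabled: bool = True   # whether redirection is enabled
    activex_native: bool = False    # the ActiveXNative parameter


def apply_config(base: BrowserParams, instruction: dict) -> BrowserParams:
    """Return a copy of `base` with the instruction's fields overridden."""
    unknown = set(instruction) - set(asdict(base))
    if unknown:
        # Reject parameters the claims do not list instead of ignoring them silently.
        raise ValueError(f"unknown browser parameter(s): {sorted(unknown)}")
    return replace(base, **instruction)
```

Keeping the record immutable means every configuration instruction produces a new value, which makes it easy to store one configuration per collection item.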
6. The system of claim 5, wherein the system comprises:
the login information judging unit is used for judging whether the webpage corresponding to the URL address needs login information or not;
and the login information writing unit is used for writing the login information acquired in advance into the corresponding position of the webpage corresponding to the URL address when the webpage corresponding to the URL address needs the login information so as to log in the webpage corresponding to the URL address.
7. The system according to claim 5 or 6, wherein the web page data collecting unit comprises:
the mouse position detection module is used for detecting the current position of the mouse on the webpage;
and the source code capture instruction receiving module is used for receiving a source code capture instruction and capturing, according to the source code capture instruction, the source code corresponding to the current position of the mouse on the webpage.
8. The system according to claim 5 or 6, wherein the system further comprises:
the webpage judging unit is used for judging whether the website corresponding to the displayed webpage contains a plurality of webpages;
the page-turning instruction sending unit is used for issuing a page-turning instruction to display the webpage reached after the page turn when the website corresponding to the displayed webpage contains a plurality of webpages;
and the post-page-turn webpage data capturing unit is used for capturing the corresponding source code according to the webpage reached after the page turn, so as to realize the collection of webpage data.
CN201610578428.3A 2016-07-20 2016-07-20 Method and system for collecting webpage data Active CN107644028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610578428.3A CN107644028B (en) 2016-07-20 2016-07-20 Method and system for collecting webpage data

Publications (2)

Publication Number Publication Date
CN107644028A CN107644028A (en) 2018-01-30
CN107644028B true CN107644028B (en) 2020-09-04

Family

ID=61109212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610578428.3A Active CN107644028B (en) 2016-07-20 2016-07-20 Method and system for collecting webpage data

Country Status (1)

Country Link
CN (1) CN107644028B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670100B (en) * 2018-12-21 2020-06-26 第四范式(北京)技术有限公司 Page data capturing method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320387A (en) * 2008-07-11 2008-12-10 浙江大学 Web page text and image ranking method based on user caring time
CN102469111A (en) * 2010-10-29 2012-05-23 国际商业机器公司 Method and system for analyzing website access
CN103186670A (en) * 2013-03-27 2013-07-03 中金数据系统有限公司 Method and system for integrally acquiring webpage information
CN103593344A (en) * 2012-08-13 2014-02-19 北大方正集团有限公司 Information acquisition method and device
CN103927370A (en) * 2014-04-23 2014-07-16 焦点科技股份有限公司 Network information batch acquisition method of combined text and picture information
US8832055B1 (en) * 2005-06-16 2014-09-09 Gere Dev. Applications, LLC Auto-refinement of search results based on monitored search activities of users
CN104199874A (en) * 2014-08-20 2014-12-10 哈尔滨工程大学 Webpage recommendation method based on user browsing behaviors
CN105183453A (en) * 2015-08-07 2015-12-23 安一恒通(北京)科技有限公司 Webpage-based information acquiring method and apparatus
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion

Also Published As

Publication number Publication date
CN107644028A (en) 2018-01-30

Similar Documents

Publication Publication Date Title
US20150156332A1 (en) Methods and apparatus to monitor usage of mobile devices
CN101957818B (en) Method and system for collecting webpages in batches
US9141697B2 (en) Method, system and computer-readable storage medium for detecting trap of web-based perpetual calendar and building retrieval database using the same
CN108304410B (en) Method and device for detecting abnormal access page and data analysis method
CN106202101B (en) Advertisement identification method and device
CN104216921A (en) Addition prompting method, device and system for rapid links in browser
CN107085549B (en) Method and device for generating fault information
CN107450808B (en) Mouse pointer positioning method of browser and computing device
CN111552633A (en) Interface abnormal call testing method and device, computer equipment and storage medium
US9124623B1 (en) Systems and methods for detecting scam campaigns
CN112486708B (en) Page operation data processing method and processing system
CN106911554B (en) Historical information display method and device
CN111047147B (en) Automatic business process acquisition method and intelligent terminal
CN109359582A (en) Information search method, information search device and mobile terminal
CN110968822A (en) Page detection method and device, electronic equipment and storage medium
CN103929339B (en) A kind of web data acquisition method and system
CN111177623A (en) Information processing method and device
CN107644028B (en) Method and system for collecting webpage data
CN103475673A (en) Phishing website recognizing method and device and client side
CN104239326A (en) Method, device and system for zooming webpage fonts
CN106097403B (en) Method for acquiring network protected index data based on image curve calculation
CN108268507B (en) Browser-based processing method and device and electronic equipment
CN107306308B (en) Page response method and device
US20230056653A1 (en) Document analysis to identify document characteristics and appending the document characteristics to a record
CN111629005A (en) Anti-cheating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant