CN104408204A - Method and device for obtaining webpage page link address - Google Patents

Method and device for obtaining webpage page link address

Info

Publication number
CN104408204A
Authority
CN
China
Prior art keywords
page
web page
original web
event
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410802023.4A
Other languages
Chinese (zh)
Inventor
李浛天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410802023.4A
Publication of CN104408204A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for obtaining a web page link address. The method comprises: loading an original web page; triggering a click event for simulating a jump event of the original web page; monitoring whether the original web page jumps; when a jump is detected, preventing the original web page from jumping and intercepting its page jump request; and reading, from the page jump request, the link address of the target web page to which the original web page would jump. The invention solves the problem that obtaining web page link addresses is inefficient.

Description

Method and device for obtaining a web page link address
Technical field
The present invention relates to the field of computers and the Internet, and in particular to a method and device for obtaining a web page link address.
Background
With the arrival of the HTML5 era, more and more websites embed large amounts of JavaScript (js) code in their pages to achieve striking browsing effects, and many websites use js for navigation instead of plain <a> tags. Pages that are reached through js jumps are hard for a search engine to discover, and no existing page-parsing method can crawl them completely. Take the need to build site maps for certain websites as an example: if a website contains many js-navigated pages, existing crawler methods can hardly produce a complete site map quickly. For a single-page website built on a front-end framework, the full set of pages cannot be obtained either. The root cause is that there is no efficient way to determine, for a single page, which pages it can link to.
Existing search engine crawlers work by parsing an HTML page and extracting the value of the href attribute of every <a> tag. This approach is fast, needs no extra HTTP requests, and is easy to implement. With the development of HTML5, however, it is increasingly common to place js code in the href attribute of an <a> tag for navigation, or to place a js jump instruction such as window.location='targetUrl' in the click event of tags such as span, div, and button. Traditional crawlers cannot capture the pages linked in these ways.
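The limitation described above can be illustrated with a minimal sketch of href extraction. The function name extractHrefs, the regular expression, and the sample markup are assumptions made for illustration only; real crawlers typically use a proper HTML parser.

```javascript
// Naive href extraction, roughly what a traditional crawler does.
// extractHrefs and sampleHtml are hypothetical names used only in this sketch.
function extractHrefs(html) {
  var re = /<a\b[^>]*\bhref\s*=\s*(["'])(.*?)\1/gi;
  var links = [];
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[2]); // the text between the quotes of href
  }
  return links;
}

var sampleHtml =
  '<a href="/about.html">About</a>' +
  '<a href="javascript:go()">Products</a>' +
  '<span onclick="window.location=\'/news.html\'">News</span>';

// Only the two <a> tags are seen; one of them holds js code rather than a URL,
// and the jump target inside the span's click handler is never found.
var hrefs = extractHrefs(sampleHtml);
```

In this sample the extractor returns /about.html and javascript:go(), while /news.html stays invisible because it only exists inside a click handler.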
Another obvious approach to this problem is to simulate clicks in a browser: trigger a click event on every dom element, wait for the result, and monitor whether a page jump occurs during the waiting time. This approach can find every page that a page links to, but its efficiency is extremely poor. It needs a click on each dom element followed by a wait of 5 to 10 seconds to decide whether a page jump occurred, so for a single page the cost grows with the number of dom elements. A crawler has to parse many HTML pages in a short time, so this approach is clearly impractical.
No effective solution has yet been proposed for the low efficiency of obtaining web page link addresses in the prior art.
Summary of the invention
The main purpose of the present invention is to provide a method and device for obtaining a web page link address, so as to solve the problem in the prior art that obtaining web page link addresses is inefficient.
To achieve this purpose, according to one aspect of the embodiments of the present invention, a method for obtaining a web page link address is provided. The method comprises: loading an original web page; triggering a click event for simulating a jump event of the original web page; monitoring whether a jump event occurs on the original web page; when a jump event is detected, blocking the jump of the original web page and intercepting its page jump request; and reading, from the page jump request, the link address of the target web page to which the original web page would jump.
To achieve this purpose, according to another aspect of the embodiments of the present invention, a device for obtaining a web page link address is provided. The device comprises: a first loading module, for loading an original web page; a click module, for triggering a click event for simulating a jump event of the original web page; an interception module, for monitoring whether a jump event occurs on the original web page and, when a jump event is detected, blocking the jump of the original web page and intercepting its page jump request; and a reading module, for reading, from the page jump request, the link address of the target web page to which the original web page would jump.
According to the embodiments of the present invention, the method for obtaining a web page link address solves the problem in the related art that obtaining web page link addresses is inefficient, and achieves the effect of obtaining web page link addresses efficiently.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the present invention. The illustrative embodiments of the present invention and their description explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of the method for obtaining a web page link address according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the device for obtaining a web page link address according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention.
Detailed description
It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments described here can be implemented. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The terms used in this application are explained below:
js: JavaScript, a scripting language developed from Netscape's LiveScript, whose main purpose was to address problems of server-side languages.
dom: Document Object Model, a programming interface for HTML and XML documents. It provides a structured representation of a document, allows the content and presentation of the document to be changed, and connects web pages with scripts and other programming languages.
phantomjs: a headless browser based on the WebKit engine and scriptable with js, that is, a browser without a display interface. Accessing pages this way avoids the system resources that a browser would spend on rendering its interface, which makes it well suited to network testing.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of the method for obtaining a web page link address is provided. It should be noted that the steps shown in the flowchart may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps may be executed in an order different from the one shown or described here.
The method embodiment provided in Embodiment 1 of this application may be executed on a mobile terminal, a computer terminal, or a similar computing device.
Fig. 1 is a schematic flowchart of the method for obtaining a web page link address according to an embodiment of the present invention. For purposes of illustration, the architecture depicted is only one example of a suitable environment and does not imply any limitation on the scope of use or functionality of this application. Nor should the method be interpreted as having any dependency on or requirement for any component or combination of components shown in Fig. 1.
As shown in Fig. 1, the method may comprise:
S12: load the original web page and monitor whether a jump event occurs on the original web page;
In step S12, the original web page is loaded in a browser platform built with phantomjs, which parses the js in the page to be analysed. phantomjs provides a set of APIs for developers. Because the page is loaded in a browser platform built with phantomjs, the system resources that a browser would spend on rendering its interface are saved and unnecessary resources such as pictures and multimedia are no longer requested, which speeds up the loading of the original web page. The original web page is a HyperText Markup Language (HTML) page: the HTML page is loaded in the browser built with phantomjs, and the HTML page is monitored for jump requests.
S14: trigger a click event for simulating a jump event of the original web page;
In step S14, a click event for simulating a jump of the original web page is triggered in phantomjs, so that click behaviour is simulated in phantomjs. This makes new web pages easier to capture, and the link address of the target web page is finally obtained.
S16: when a jump event of the original web page is detected, block the jump of the original web page and intercept its page jump request;
In step S16, the original web page is an HTML page. When the HTML page executes a jump instruction, the jump event of the HTML page is detected, the jump of the HTML page to the target web page is blocked, the loading of the target web page is stopped, and the page jump request of the HTML page is intercepted. Blocking the jump prevents redundant content from being loaded, so no resources are spent on unnecessary page load requests.
S18: read, from the page jump request, the link address of the target web page to which the original web page would jump.
In step S18, the original web page is an HTML page. The jump request of the HTML page contains the link address of the target web page to which the original web page would jump, and this link address is a Uniform Resource Locator (URL). After the page jump request of the HTML page is intercepted, the URL of the target web page is read from it. Because this method analyses the page through click behaviour in phantomjs, it can parse the links contained in the web page very completely and is not affected by the dynamic syntax of js.
Embodiment 1 of this application thus provides a method for obtaining a web page link address: the original web page is loaded in a browser, a click event for simulating a jump event of the original web page is triggered, the jump of the original web page is blocked, and the link address of the target web page is read from the page jump request. Compared with the prior art, this overcomes the problem that some js-generated links are hard for a crawler to obtain, and achieves the purpose of obtaining web page link addresses effectively.
Specifically, before step S12, the method further comprises:
S11: load a blocking function for blocking the jump of the original web page, and load a listener function for monitoring jump events and obtaining the link address of the target web page; the blocking function is started after the original web page has loaded successfully.
In step S11, a browser platform is built with phantomjs in order to load the blocking function for blocking the jump of the original web page. phantomjs provides a set of APIs for developers; loading the original web page in a browser platform built with phantomjs saves the system resources that a browser would spend on rendering its interface and avoids requesting unnecessary resources such as pictures and multimedia, which speeds up the loading of the blocking function and the listener function. The HTML page is loaded with phantomjs so that the js in the page to be analysed is parsed, and the set of APIs provided by phantomjs is used. These include the blocking function navigationLocked, which blocks the jump of the original web page and stops the loading of the new web page in time, before the clicked link jumps. They also include the listener function onNavigationRequested, which monitors jump events and obtains the link address of the target web page. Because this method builds a browser platform with phantomjs and calls the blocking and listener functions of phantomjs, it is faster and more efficient than plain simulated clicking, and it consumes fewer network resources.
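Step S11 might be wired up roughly as follows. This is a sketch under stated assumptions, not the patent's implementation: installJumpCapture, capture.urls, and capture.lock are hypothetical helper names, and http://example.com/ is a placeholder address. Only page.onNavigationRequested, page.navigationLocked, and page.open come from the phantomjs web page module.

```javascript
// Sketch only: installs the listener and exposes a way to start the lock.
// installJumpCapture, urls and lock are hypothetical names for this example.
function installJumpCapture(page) {
  var urls = [];
  // phantomjs calls this callback for each navigation request of the page.
  page.onNavigationRequested = function (url, type, willNavigate, main) {
    // Ignore the initial page load; record only requests made while locked.
    if (page.navigationLocked) {
      urls.push(url);
    }
  };
  return {
    urls: urls,
    // Start blocking jumps; the text does this after the page has loaded.
    lock: function () { page.navigationLocked = true; }
  };
}

// This part runs only inside phantomjs, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var phantomPage = require('webpage').create();
  var capture = installJumpCapture(phantomPage);
  // Placeholder URL, not taken from the source document.
  phantomPage.open('http://example.com/', function (status) {
    if (status === 'success') {
      capture.lock();
      // Simulated clicks (steps S142 and S144) would be triggered here.
    }
    phantom.exit();
  });
}
```

Passing the page object in as a parameter keeps the listener logic separate from page creation, so the same helper can be exercised with a plain stub object outside phantomjs.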
Specifically, step S14 further comprises:
S142: create a click event for the Document Object Model (dom) elements of the original web page by calling an event-creation function;
In step S142, because phantomjs has no built-in click event, this step creates the click event for the dom elements of the original web page with the event-creation function createEvent. A click event is created for each dom element. The dom elements are indispensable information in this method, because only after the click event of every dom element has been triggered can the full set of web pages that a page links to be obtained.
S144: before monitoring whether a jump event occurs on the original web page, trigger the click event of the dom elements of the original web page by calling an event-dispatch function.
In step S144, the event-dispatch function dispatchEvent triggers the click event on each dom element. When a dom element is clicked, in most cases all of its events finish firing within a very short time, so triggering completes quickly and is faster and more efficient than plain simulated clicking.
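Steps S142 and S144 can be sketched with the standard dom calls named in the text. The helper name simulateClick and the choice of a bubbling, cancelable event are assumptions for illustration; in phantomjs this code would normally run inside the page context (for example through page.evaluate).

```javascript
// Sketch: create a click event with createEvent and fire it with dispatchEvent.
// simulateClick is a hypothetical helper name.
function simulateClick(doc, element) {
  var evt = doc.createEvent('MouseEvents'); // S142: create the event object
  evt.initEvent('click', true, true);       // type, bubbles, cancelable
  return element.dispatchEvent(evt);        // S144: trigger it on the element
}
```

A typical call inside the page would look like simulateClick(document, someElement), where someElement is one of the dom elements collected from the page.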
Specifically, step S16 comprises:
S162: after detecting that the click event of a dom element of the original web page has been triggered, determine whether the page jump request was triggered by calling a judgment function;
In step S162, after it is detected that the click event of a dom element in the original web page has been triggered, a judgment function is called to determine whether a page jump request was also triggered. If the list of click events bound to the dom element contains a page jump event, that event is triggered quickly (within about 200 milliseconds). Conversely, if no jump event is detected within this time, the dom element can be considered to have no page jump event bound to it. Once the jump event is triggered, the target link of the jump can be obtained, that is, the information about the other pages to which this page links.
S164: when the judgment result is that the page jump request was triggered, stop the loading of the page jump request by calling the blocking function, so that the jump of the original web page is blocked.
In step S164, when the judgment result is that a page jump request was triggered within the short time, the blocking function navigationLocked is called: navigationLocked is set to true to stop the loading of the page jump request, so that the loading of the new web page is stopped in time, before the clicked link jumps. Blocking the jump of the original web page to the target web page prevents redundant content from being loaded, so no resources are spent on unnecessary page load requests.
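The judgment in step S162 can be sketched as a short timed check. Everything here is illustrative: probeElement, its parameters, and the shape of the capture object (an object whose urls array is filled by an onNavigationRequested listener) are assumptions, and the sketch assumes the jump lock is already active so that a detected jump is not actually followed.

```javascript
// Sketch: click one element, wait a short window, then report whether a jump
// request was captured in that window. All names are hypothetical.
function probeElement(clickFn, capture, waitMs, schedule, done) {
  var before = capture.urls.length;   // requests seen before the click
  var timer = schedule || setTimeout; // injectable for testing
  clickFn();                          // trigger the simulated click
  timer(function () {
    var jumped = capture.urls.length > before;
    // Report the newest captured URL, or null if the element has no jump bound.
    done(jumped ? capture.urls[capture.urls.length - 1] : null);
  }, waitMs);
}
```

With waitMs set to 200, an element that reports null after the window is treated as having no page jump event bound to it, which matches the roughly 200 millisecond figure given in the text.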
Specifically, before step S142, the method further comprises:
S13: obtain the dom elements of the original web page by calling an element-acquisition function.
In step S13, when the current task queue is executed, the element-acquisition function is first called to obtain the dom elements of the original web page. The click event of a dom element is then triggered, and it is determined whether a page jump request was also triggered. Once the jump event is triggered, the link address of the target web page can be obtained, that is, the information about the other pages to which the original web page links. When no page jump request is triggered, the next task queue is executed. The dom elements are information that this problem necessarily requires and cannot be avoided; the only option is to configure a fixed set of tag types to click, so that unnecessary tags are not analysed and the overall time consumed is reduced. On this principle, the link address to which a dom element jumps can be obtained within about 200 milliseconds. Parsing the link information of one dom element within 200 milliseconds is 25 to 50 times faster than the prior art, which waits 5 to 10 seconds per element.
A click event is triggered on every dom element of the whole original web page, this behaviour is executed across multiple task queues, and the Uniform Resource Locator (URL) contained in each jump request is captured in the onNavigationRequested function. In this way the target URLs of the page jump requests of the whole original web page are captured.
For example, suppose the original web page has about 1,000 dom elements. With 10 task queues in this step, all dom elements can be parsed in about 20 to 40 seconds, whereas in the same 20 to 40 seconds the traditional method can parse only 2 to 8 dom elements. The efficiency is greatly improved.
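The figures quoted above can be checked with a little arithmetic, together with a sketch of the tag filtering and queue splitting the text mentions. The helper names filterByTag, splitIntoQueues, and estimateSeconds are hypothetical, and the round-robin split is only one possible way to fill the task queues.

```javascript
// Keep only elements whose tag type is in the configured list.
function filterByTag(elements, allowedTags) {
  var kept = [];
  for (var i = 0; i < elements.length; i++) {
    if (allowedTags.indexOf(elements[i].tagName.toLowerCase()) !== -1) {
      kept.push(elements[i]);
    }
  }
  return kept;
}

// Distribute items over queueCount task queues in round-robin order.
function splitIntoQueues(items, queueCount) {
  var queues = [];
  for (var q = 0; q < queueCount; q++) { queues.push([]); }
  for (var i = 0; i < items.length; i++) { queues[i % queueCount].push(items[i]); }
  return queues;
}

// Rough wall-clock estimate when the queues run side by side.
function estimateSeconds(elementCount, msPerElement, queueCount) {
  return Math.ceil(elementCount / queueCount) * msPerElement / 1000;
}

var proposedSeconds = estimateSeconds(1000, 200, 10); // 100 elements x 0.2 s
var traditionalMin = Math.floor(20 / 10); // elements done in 20 s at 10 s each
var traditionalMax = Math.floor(40 / 5);  // elements done in 40 s at 5 s each
var speedupLow = 5000 / 200;   // 5 s wait versus 200 ms
var speedupHigh = 10000 / 200; // 10 s wait versus 200 ms
```

Under these assumptions the estimate for 1,000 elements in 10 queues is 20 seconds, the lower end of the range in the text; the traditional method handles 2 to 8 elements in the same window, and the per-element speedup is 25 to 50 times.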
Specifically, after step S142, the method further comprises:
S143: rewrite the new-window function of the original web page into a non-new-window form, and rewrite the target attribute of the preset tags of the original web page to the self form (_self), so that the link address of the obtained target web page is opened in the original web page itself.
In step S143, the window.open function is rewritten into a non-new-window form, and the target attribute of all <a> tags is changed to _self, which ensures that onNavigationRequested can capture every jump request. In addition, configuration can restrict clicks to certain fixed tag types so that unnecessary tags are not analysed, which reduces the overall time consumed.
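Step S143 could look roughly like the following. The helper name forceSameWindow is hypothetical, and replacing window.open with an assignment to the window's location is one assumed way of turning a new-window jump into a same-window jump; in phantomjs this would run inside the page context.

```javascript
// Sketch: make every jump happen in the current window so that the page's
// own navigation listener can see it. forceSameWindow is a hypothetical name.
function forceSameWindow(win, doc) {
  // Rewrite window.open so that it navigates the current window instead.
  win.open = function (url) {
    win.location = url;
    return win;
  };
  // Change the target attribute of all <a> tags to _self.
  var anchors = doc.getElementsByTagName('a');
  for (var i = 0; i < anchors.length; i++) {
    anchors[i].setAttribute('target', '_self');
  }
  return anchors.length; // number of <a> tags rewritten
}
```

Inside a page the call would be forceSameWindow(window, document), made before the simulated clicks are dispatched.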
In summary, the method proposed by the present invention loads the original web page, triggers a click event for simulating a jump event of the original web page, monitors whether a jump event occurs on the original web page, blocks the jump when a jump event is detected, intercepts the page jump request of the original web page, and reads from it the link address of the target web page to which the original web page would jump. This application therefore provides an efficient way to obtain the link address of a target web page: after the original web page is loaded in phantomjs, a click event for simulating its jump event is triggered and the page is monitored; when a jump request is detected, the jump is prevented so that the target web page is not loaded, which avoids loading unnecessary content and spending resources on unnecessary page load requests, while the jump request is intercepted so that the link address of the target web page can be read. Because the original web page is loaded in phantomjs, unnecessary resources such as pictures and multimedia are no longer requested and the original web page loads faster; because the jump is prevented, no resources are spent on loading the target web page. The link address of the target web page is therefore obtained more quickly, which solves the problem in the prior art that obtaining web page link addresses is inefficient, overcomes the problem that some js-generated links are hard for a crawler to obtain, and achieves the effect of obtaining web page link addresses efficiently. The present invention is sufficient to meet the needs of a crawler in most cases, saves network request resources, and raises the efficiency of obtaining the link address of the target web page to a new level.
Embodiment 2
An embodiment of the present invention further provides a device for obtaining a web page link address. The device provided by this embodiment may run on a computer terminal or a mobile terminal, but is not limited thereto.
Fig. 2 is a schematic structural diagram of the device for obtaining a web page link address according to an embodiment of the present invention. As shown in Fig. 2, the device comprises a first loading module 21, a click module 22, an interception module 23, and a reading module 24.
The first loading module 21 is configured to load the original web page.
Specifically, in the first loading module 21, the original web page is loaded in a browser platform built with phantomjs, which parses the js in the page to be analysed. phantomjs provides a set of APIs for developers. Because the page is loaded in a browser platform built with phantomjs, the system resources that a browser would spend on rendering its interface are saved and unnecessary resources such as pictures and multimedia are no longer requested, which speeds up the loading of the original web page. The original web page is a HyperText Markup Language (HTML) page: the HTML page is loaded in the browser built with phantomjs, and the HTML page is monitored for jump requests.
The click module 22 is configured to trigger a click event for simulating a jump event of the original web page.
In the click module 22, a click event for simulating a jump event of the original web page is triggered in phantomjs, so that click behaviour is simulated in phantomjs. This makes new web pages easier to capture, and the link address of the target web page is finally obtained.
The interception module 23 is connected to the click module 22 and is configured to monitor whether a jump event occurs on the original web page and, when a jump event is detected, to block the jump of the original web page and intercept its page jump request.
Specifically, in the interception module 23, the original web page is an HTML page. When the HTML page executes a jump instruction, the jump event of the HTML page is detected, the jump of the HTML page to the target web page is blocked, the loading of the target web page is stopped, and the page jump request of the HTML page is intercepted. Blocking the jump prevents redundant content from being loaded, so no resources are spent on unnecessary page load requests.
The reading module 24 is connected to the interception module 23 and is configured to read, from the page jump request, the link address of the target web page to which the original web page would jump.
Specifically, in the reading module 24, the original web page is an HTML page. The jump request of the HTML page contains the link address of the target web page to which the original web page would jump, and this link address is a Uniform Resource Locator (URL). After the page jump request of the HTML page is intercepted, the URL of the target web page is read from it. Because this device analyses the page through click behaviour in phantomjs, it can parse the links contained in the web page very completely and is not affected by the dynamic syntax of js.
Embodiment 2 of this application thus provides a device for obtaining a web page link address. The device loads the original web page through the first loading module 21, triggers a click event for simulating a jump event of the original web page through the click module 22, intercepts the page jump request of the original web page through the interception module 23, and reads the link address of the target web page to which the original web page would jump through the reading module 24. Compared with the prior art, this overcomes the problem that some js-generated links are hard for a crawler to obtain, and achieves the purpose of obtaining web page link addresses efficiently.
Fig. 3 is a kind of preferred structure schematic diagram of the acquisition device of Webpage chained address according to the embodiment of the present invention, and as shown in Figure 3, this device comprises outside all structures shown in Fig. 2, also comprises: the second load-on module 31, is described this device below.
Second load-on module 31, be connected to the first load-on module 21, before loading the original web page page, load the blocking function function for blocking original web page page generation redirect, and load the monitoring function function of the chained address for monitoring redirect event and target acquisition Webpage; Wherein, after loading the success of the original web page page, blocking function function is started.
In the second load-on module 31, utilize phantomjs to build a blocking function function browsing platform to load for blocking original web page page generation redirect, phantomjs provides one group of API for developer, a browser platform is built by phantomjs, the original web page page is loaded, such accessed web page just eliminates browser interface and draws the system resource consumed, no longer ask the resources such as unnecessary picture, multimedia, accelerate the speed loading blocking function function and monitor function function; Wherein, phantomjs is utilized to load the html page, with this, js in html page that will analyze is resolved, the one group of API provided in phantomjs is provided, wherein, comprising: blocking function function navigationLocked, for blocking original web page page generation redirect, before the link generation redirect clicked, stopping the loading of new web page in time; Comprise monitoring function function onNavigationRequested in addition, for monitoring the chained address of redirect event and target acquisition Webpage, this method utilizes phantomjs to build one and browses platform and call blocking function function and monitor function function in phantomjs, clicks faster more efficient than the simulation of simplicity.And the Internet resources consumed are less.
Fig. 4 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention. As shown in Fig. 4, the device comprises all the structures shown in Fig. 2, wherein the click module 22 comprises a creating unit 41 and a trigger unit 42. The device is described below.
The creating unit 41 is configured to create a click event of a Document Object Model (dom) element in the original web page by calling an event creation function.
Specifically, in the creating unit 41, because phantomjs has no built-in click event, this step creates the click event of the dom element of the original web page by means of the event creation function createEvent. In creating the click event of each dom element, the dom element is indispensable information in this method, because only after the click event of every dom element has been triggered can the other web pages to which a page links be obtained completely.
The trigger unit 42 is connected to the creating unit 41 and is configured to trigger, before monitoring whether the jump event occurs on the original web page, the click event of the dom element of the original web page by calling an event dispatch function.
Specifically, in the trigger unit 42, the click event is triggered on each dom element by means of the event dispatch function dispatchEvent. When a dom element is clicked, in most cases all of its events are triggered within a very short time. Completing the triggering within such a short time is faster and more efficient than a simple simulated click.
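The createEvent and dispatchEvent steps described above can be sketched as follows. The helper name `simulateClick` is hypothetical. The sketch uses the standard DOM calls `document.createEvent('MouseEvents')`, `initMouseEvent`, and `element.dispatchEvent`; inside phantomjs it would be expected to run in the page context (for example through `page.evaluate`), which is an assumption rather than something the source states. The stub objects at the end exist only so that the snippet also runs outside a browser.

```javascript
// Illustrative sketch: create a synthetic click and dispatch it on one element.
function simulateClick(doc, element) {
  var evt = doc.createEvent('MouseEvents');
  // Arguments: type, canBubble, cancelable, view, detail,
  // screenX, screenY, clientX, clientY,
  // ctrlKey, altKey, shiftKey, metaKey, button, relatedTarget.
  evt.initMouseEvent('click', true, true, doc.defaultView, 1,
                     0, 0, 0, 0, false, false, false, false, 0, null);
  // dispatchEvent runs the click handlers bound to the element.
  return element.dispatchEvent(evt);
}

// Minimal stand-ins for document and element (hypothetical, for demonstration).
var clickLog = [];
var stubDoc = {
  defaultView: null,
  createEvent: function (kind) {
    return { kind: kind, initMouseEvent: function (type) { this.type = type; } };
  }
};
var stubElement = {
  dispatchEvent: function (evt) { clickLog.push(evt.type); return true; }
};
simulateClick(stubDoc, stubElement);
console.log(clickLog);
```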
Fig. 5 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention. As shown in Fig. 5, the device comprises all the structures shown in Fig. 2, wherein the interception module 23 further comprises a judging unit 51 and a blocking unit 52. The device is described below.
The judging unit 51 is configured to judge, by calling a discriminant function, whether a page jump request has been triggered.
In the judging unit 51, after it is detected that the click event of a dom element in the original web page has been triggered, the discriminant function is called to judge whether a page jump request has also been triggered. For example, if a page jump event exists in the list of click events bound to the dom element, it will be triggered quickly (within about 200 milliseconds); conversely, if no jump event is detected within this time, the dom element can be regarded as having no bound page jump event. After the jump event is triggered, the target link to be jumped to can be obtained, that is, the information about the other pages to which this page links.
The blocking unit 52 is connected to the judging unit 51 and is configured to, when the judgment result is that a page jump request has been triggered, prevent the page jump request from being loaded by calling the blocking function, so that the jump of the original web page is blocked.
In the blocking unit 52, when the judgment result is that a page jump request has been triggered within the short time, the blocking function navigationLocked is called, that is, navigationLocked is set to true, to prevent the page jump request from being loaded, so that the loading of the new web page is stopped in time before the clicked link jumps. Blocking the jump of the original web page prevents the unneeded content of the target web page from being loaded, so that excessive resources need not be consumed on unnecessary page load requests.
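One possible form of the judgment described above is sketched below: click an element, wait for the window of about 200 milliseconds mentioned in the text, then check whether the navigation listener recorded a new request. The names `judgeClick`, `capture.urls`, and the optional `schedule` parameter are hypothetical; the source does not specify how its discriminant function is implemented.

```javascript
// Illustrative sketch of a discriminant step.
// capture.urls is assumed to be an array that a navigation listener appends to.
// schedule defaults to setTimeout; it is a parameter so the wait can be replaced.
function judgeClick(click, capture, waitMs, done, schedule) {
  var before = capture.urls.length;
  click();
  (schedule || setTimeout)(function () {
    // Any URL recorded since the click means a page jump request was triggered.
    // An empty array means the element has no bound jump event.
    done(capture.urls.slice(before));
  }, waitMs);
}

// Demonstration with a stand-in click that "requests" one jump.
var demoJumps = { urls: [] };
judgeClick(
  function () { demoJumps.urls.push('http://example.com/target'); },
  demoJumps,
  200,
  function (newUrls) { console.log(newUrls); }
);
```

Comparing the array length before and after the click keeps the judgment independent of requests recorded for earlier elements.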
Fig. 6 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention. As shown in Fig. 6, in addition to all the structures shown in Fig. 4, the device further comprises an acquisition module 61. The device is described below.
The acquisition module 61 is connected to the click module 22 and is configured to obtain the dom element of the original web page by calling an acquisition function before the click event of the dom element of the original web page is created by calling the event creation function.
In the acquisition module 61, when the current task queue is executed, the acquisition function is first called to obtain a dom element of the original web page; the click event of that dom element is then triggered; and it is then judged whether a page jump request has also been triggered. After the jump event is triggered, the link address of the target web page to be jumped to can be obtained, that is, the information about the other pages to which the original web page links. When no page jump request is triggered, the next task queue is executed. The dom elements are information necessary for solving this problem and are difficult to avoid; the overall time consumed can only be reduced by configuring the device to click only certain fixed tag types, thereby eliminating the analysis of unnecessary tags. On this principle, the link address of the target web page to which one dom element jumps can be obtained within about 200 milliseconds. Parsing the link information contained in one dom element within 200 milliseconds is 25 to 50 times faster than the prior art.
A click event is triggered on each dom element of the entire web page, this behavior can be executed in multiple task queues, and the uniform resource locator (URL) contained in each jump request is monitored in the onNavigationRequested function. In this way, the goal of parsing the entire web page and capturing the target URLs of its page jump requests is achieved.
For example, if the original web page has about 1,000 dom elements and 10 task queues are created in this step, all of the dom elements can be parsed in only about 20 to 40 seconds, whereas under the same conditions a conventional device can parse only 2 to 8 dom elements in 20 to 40 seconds. Efficiency is greatly improved.
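The figures in this example can be checked with a short calculation. The sketch below assumes that the task queues run concurrently and that each dom element costs about 200 milliseconds, as stated earlier; the function names are hypothetical, and the round-robin split is only one way to distribute the elements.

```javascript
// Illustrative sketch: distribute element indexes over task queues and
// estimate the total time. Names and the round-robin policy are assumptions.
function splitIntoQueues(elementCount, queueCount) {
  var queues = [];
  for (var q = 0; q < queueCount; q++) {
    queues.push([]);
  }
  for (var i = 0; i < elementCount; i++) {
    queues[i % queueCount].push(i);
  }
  return queues;
}

// Estimated seconds when all queues run at the same time: the longest queue
// determines the total, and every element costs perElementMs.
function estimateSeconds(elementCount, queueCount, perElementMs) {
  var longestQueue = Math.ceil(elementCount / queueCount);
  return longestQueue * perElementMs / 1000;
}

console.log(estimateSeconds(1000, 10, 200)); // 20
```

With 1,000 elements in 10 queues, each queue holds 100 elements, and 100 × 200 ms gives 20 seconds, the lower end of the range stated in the text; a per-element cost of 400 ms would give the upper end of 40 seconds.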
Fig. 7 is a schematic diagram of another preferred structure of the device for obtaining a web page link address according to an embodiment of the present invention. As shown in Fig. 7, in addition to all the structures shown in Fig. 4, the interception module 23 of the device further comprises a calling unit 71. The device is described below.
The calling unit 71 is connected to the creating unit 41 and is configured to, after the click event of the dom element of the original web page has been created by calling the event creation function, rewrite the original web page into a non-new-window form by calling a non-new-window function, and rewrite the target attribute of preset tags of the original web page to the self form, wherein the non-new-window function and the target attribute of the preset tags are used to display the obtained link address of the target web page in the original web page.
In the calling unit 71, the function window.open is rewritten into a non-new-window form, and the target attribute of all <a> tags is changed to _self, which ensures that onNavigationRequested can capture every jump request. The device can also be configured to click only certain fixed tag types, eliminating the analysis of unnecessary tags and reducing the overall time consumed.
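The rewriting described above might look like the following sketch. The helper name `forceSameWindow` is hypothetical, and replacing `window.open` with an assignment to `location.href` is only one assumed way of obtaining a non-new-window form; the source does not give the replacement body. In phantomjs the function would be expected to run in the page context with the real `window` and `document`; the stubs below only let the snippet run elsewhere.

```javascript
// Illustrative sketch: keep every jump inside the current window.
function forceSameWindow(win, doc) {
  // Assumed replacement: navigate the current window instead of opening a new one.
  win.open = function (url) {
    win.location.href = url;
    return win;
  };
  // Change the target attribute of every <a> tag to _self.
  var anchors = doc.getElementsByTagName('a');
  for (var i = 0; i < anchors.length; i++) {
    anchors[i].setAttribute('target', '_self');
  }
  return anchors.length;
}

// Minimal stand-ins for window and document (hypothetical, for demonstration).
var stubAnchor = {
  attrs: { target: '_blank' },
  setAttribute: function (name, value) { this.attrs[name] = value; }
};
var stubWindow = { location: { href: 'http://example.com/' } };
var stubDocument = {
  getElementsByTagName: function (name) { return name === 'a' ? [stubAnchor] : []; }
};
forceSameWindow(stubWindow, stubDocument);
stubWindow.open('http://example.com/popup');
console.log(stubAnchor.attrs.target, stubWindow.location.href);
```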
In the device for obtaining a web page link address proposed by the present invention, the first loading module 21 loads the original web page; the click module 22 triggers a click event for simulating the occurrence of a jump event on the original web page; the interception module 23 monitors whether the jump event occurs on the original web page and, when it detects that the jump event occurs, blocks the jump of the original web page and intercepts the page jump request of the original web page; and the reading module 24 reads, from the page jump request, the link address of the target web page to which the original web page would jump. The present application thus provides a device that obtains the link address of a target web page efficiently. After the device loads the original web page in a web browser, it triggers a click event for simulating the jump event and monitors the original web page. When it detects a jump request from the original web page, it blocks the jump so as to stop the loading of the target web page, which prevents unneeded content from being loaded and avoids consuming excessive resources on unnecessary page load requests; at the same time, it intercepts the jump request of the original web page to read the link address of the target web page. Because the device uses phantomjs to build a browsing platform on which the original web page is loaded, unnecessary resources such as pictures and multimedia are no longer requested, which speeds up the loading of the original web page. Because the jump of the original web page is blocked to stop the loading of the target web page, no resources are consumed on unnecessary page load requests.
The speed of finally obtaining the link address of the target web page is therefore increased, which solves the prior-art problem of low efficiency in obtaining web page link addresses, overcomes the problem that some links generated by js are difficult for crawlers to obtain, and achieves the effect of obtaining web page link addresses efficiently. In most cases the present invention is sufficient to meet the needs of a crawler; it saves network request resources and raises the efficiency of obtaining the link address of the target web page to a higher level.
It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units, for instance, is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for obtaining a web page link address, characterized by comprising:
loading an original web page;
triggering a click event for simulating the occurrence of a jump event on the original web page;
monitoring whether the jump event occurs on the original web page, and, when it is detected that the jump event occurs on the original web page, blocking the jump of the original web page and intercepting a page jump request of the original web page; and
reading, from the page jump request, the link address of a target web page to which the original web page will jump.
2. The method according to claim 1, characterized in that, before the original web page is loaded, the method further comprises:
loading a blocking function for blocking the jump of the original web page, and loading a monitoring function for monitoring the jump event and capturing the link address of the target web page;
wherein the blocking function is enabled after the original web page has been loaded successfully.
3. The method according to claim 1, characterized in that triggering the click event for simulating the occurrence of the jump event on the original web page comprises:
creating a click event of a Document Object Model (dom) element of the original web page by calling an event creation function; and
triggering the click event of the dom element of the original web page by calling an event dispatch function.
4. The method according to claim 1, characterized in that, when it is detected that the jump event occurs on the original web page, the step of blocking the jump of the original web page comprises:
judging, by calling a discriminant function, whether the page jump request of the original web page has been triggered; and
when the judgment result is that the page jump request has been triggered, preventing the page jump request from being loaded by calling the blocking function, so that the jump of the original web page is blocked.
5. The method according to claim 3, characterized in that, before the click event of the dom element of the original web page is created by calling the event creation function, the method further comprises:
obtaining the dom element of the original web page by calling an acquisition function.
6. The method according to claim 3, characterized in that, after the click event of the dom element of the original web page is created by calling the event creation function, the method further comprises:
rewriting the original web page into a non-new-window form by calling a non-new-window function, and rewriting the target attribute of a preset tag of the original web page to the self form, wherein the non-new-window function and the target attribute of the preset tag are used to display the obtained link address of the target web page in the original web page.
7. A device for obtaining a web page link address, characterized by comprising:
a first loading module, configured to load an original web page;
a click module, configured to trigger a click event for simulating the occurrence of a jump event on the original web page;
an interception module, configured to monitor whether the jump event occurs on the original web page, and, when it is detected that the jump event occurs on the original web page, to block the jump of the original web page and intercept a page jump request of the original web page; and
a reading module, configured to read, from the page jump request, the link address of a target web page to which the original web page will jump.
8. The device according to claim 7, characterized in that the device further comprises:
a second loading module, configured to, before the original web page is loaded, load a blocking function for blocking the jump of the original web page, and load a monitoring function for monitoring the jump event and capturing the link address of the target web page;
wherein the blocking function is enabled after the original web page has been loaded successfully.
9. The device according to claim 7, characterized in that the click module comprises:
a creating unit, configured to create a click event of a Document Object Model (dom) element of the original web page by calling an event creation function; and
a trigger unit, configured to trigger the click event of the dom element of the original web page by calling an event dispatch function.
10. The device according to claim 7, characterized in that the interception module comprises:
a judging unit, configured to judge, by calling a discriminant function, whether the page jump request of the original web page has been triggered; and
a blocking unit, configured to, when the judgment result is that the page jump request has been triggered, prevent the page jump request from being loaded by calling the blocking function, so that the jump of the original web page is blocked.
11. The device according to claim 9, characterized in that the device further comprises:
an acquisition module, configured to obtain the dom element of the original web page by calling an acquisition function before the click event of the dom element of the original web page is created by calling the event creation function.
12. The device according to claim 9, characterized in that the click module further comprises:
a calling unit, configured to, after the click event of the dom element of the original web page is created by calling the event creation function, rewrite the original web page into a non-new-window form by calling a non-new-window function, and rewrite the target attribute of a preset tag of the original web page to the self form, wherein the non-new-window function and the target attribute of the preset tag are used to display the obtained link address of the target web page in the original web page.
CN201410802023.4A 2014-12-18 2014-12-18 Method and device for obtaining webpage page link address Pending CN104408204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410802023.4A CN104408204A (en) 2014-12-18 2014-12-18 Method and device for obtaining webpage page link address

Publications (1)

Publication Number Publication Date
CN104408204A true CN104408204A (en) 2015-03-11

Family

ID=52645835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410802023.4A Pending CN104408204A (en) 2014-12-18 2014-12-18 Method and device for obtaining webpage page link address

Country Status (1)

Country Link
CN (1) CN104408204A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6185701B1 (en) * 1997-11-21 2001-02-06 International Business Machines Corporation Automated client-based web application URL link extraction tool for use in testing and verification of internet web servers and associated applications executing thereon
CN104182478A (en) * 2014-08-01 2014-12-03 北京华清泰和科技有限公司 Website monitoring pre-warning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
依云 et al.: "When writing a web crawler in Python and encountering href='javascript:void(0)', how to obtain the actual content behind javascript:void(0)", 《HTTPS://WWW.ZHIHU.COM/QUESTION/20626694》 *
周骅: "Instructions for using phantomjs", 《HTTP://WWW.ZHOUHUA.INFO/2014/03/19/PHANTOMJS/?UTM_SOURCE=TUICOOL》 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202072A (en) * 2015-04-29 2016-12-07 阿里巴巴集团控股有限公司 The method and apparatus that display content is provided
CN106202072B (en) * 2015-04-29 2019-12-03 阿里巴巴集团控股有限公司 The method and apparatus of display content are provided
WO2017016458A1 (en) * 2015-07-24 2017-02-02 北京奇虎科技有限公司 Application internal page processing method and device
CN105100298A (en) * 2015-07-24 2015-11-25 北京奇虎科技有限公司 Page access method in application program and apparatus thereof
CN105260310A (en) * 2015-10-23 2016-01-20 天津橙子科技有限公司 User behavior simulation method for WEB application program
CN106919636A (en) * 2016-07-04 2017-07-04 阿里巴巴集团控股有限公司 link jump method and device
CN106250107A (en) * 2016-07-18 2016-12-21 福建天泉教育科技有限公司 A kind of data statistical approach and system
CN106844486A (en) * 2016-12-23 2017-06-13 北京奇虎科技有限公司 Crawl the method and device of dynamic web page
CN109426535A (en) * 2017-08-24 2019-03-05 武汉斗鱼网络科技有限公司 A kind of method jumping to page designated position, storage medium, equipment and system
CN110020044A (en) * 2017-09-22 2019-07-16 北京国双科技有限公司 A kind of crawling method and device of crawler
US10936171B2 (en) 2018-01-23 2021-03-02 International Business Machines Corporation Display of images with action zones
US10387012B2 (en) 2018-01-23 2019-08-20 International Business Machines Corporation Display of images with action zones
CN109739717A (en) * 2018-04-12 2019-05-10 京东方科技集团股份有限公司 A kind of method and device of page data acquisition, server
CN109739717B (en) * 2018-04-12 2021-01-26 京东方科技集团股份有限公司 Page data acquisition method and device and server
CN110708270A (en) * 2018-07-10 2020-01-17 阿里巴巴集团控股有限公司 Abnormal link detection method and device
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device
CN109840418A (en) * 2019-02-19 2019-06-04 Oppo广东移动通信有限公司 Jump control method, device, storage medium and the terminal of application program
CN109840418B (en) * 2019-02-19 2021-01-01 Oppo广东移动通信有限公司 Jump control method and device for application program, storage medium and terminal
CN110324410B (en) * 2019-06-18 2022-04-05 中国南方电网有限责任公司 Method, device, computer equipment and storage medium for initiating webpage request
CN110324410A (en) * 2019-06-18 2019-10-11 中国南方电网有限责任公司 Initiate method, apparatus, computer equipment and the storage medium of web-page requests
CN110611713A (en) * 2019-09-17 2019-12-24 深圳市网心科技有限公司 Data downloading method and system, electronic equipment and storage medium
CN112749351A (en) * 2019-10-29 2021-05-04 金色熊猫有限公司 Link address determination method, link address determination device, computer-readable storage medium and equipment
CN112749351B (en) * 2019-10-29 2023-07-28 金色熊猫有限公司 Link address determination method, device, computer readable storage medium and equipment
EP3848824A1 (en) * 2020-01-07 2021-07-14 Baidu Online Network Technology (Beijing) Co., Ltd. Landing page processing method, apparatus, device and medium
CN112632358A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium
CN112637361A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Page proxy method, device, electronic equipment and storage medium
CN112632358B (en) * 2020-12-29 2021-09-14 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium
CN112637361B (en) * 2020-12-29 2022-09-16 北京天融信网络安全技术有限公司 Page proxy method, device, electronic equipment and storage medium
CN114491356A (en) * 2021-12-27 2022-05-13 北京金堤科技有限公司 Data acquisition method and device, computer storage medium and electronic equipment
WO2024045954A1 (en) * 2022-08-31 2024-03-07 华为云计算技术有限公司 Method and apparatus for obtaining secondary page, and computer device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150311
