CN103177115A - Method and device of extracting page link of webpage - Google Patents


Info

Publication number
CN103177115A
Authority
CN
China
Prior art keywords
control
webpage
scheduled event
page link
needs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101161237A
Other languages
Chinese (zh)
Other versions
CN103177115B (en)
Inventor
徐锐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201310116123.7A
Publication of CN103177115A
Application granted
Publication of CN103177115B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for extracting page links from a web page. The method comprises: determining the position of a control in the web page that needs to be triggered by a scheduled event; generating, at the determined position, the scheduled event that triggers the control, so that the control is triggered to change; calling an application interface to read all controls in the web page on which page link extraction needs to be performed; and obtaining the extracted page links from all the controls that were read.

Description

A method and apparatus for extracting page links from a web page
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for extracting page links from a web page.
Background technology
Traditional page download techniques generally download the original web page content directly, which easily causes page content to be lost. In particular, more and more web pages now request data asynchronously. For such pages, direct download cannot obtain the full content of the page and misses some key links and information. The page therefore needs to be rendered, and more complete page information obtained from the rendering result.
Rendering a web page mainly involves creating vector graphics and bitmap graphics, adjusting page colors, producing buttons, navigation bars and animations, applying filters to images, and similar operations. Page rendering reduces the loss of page links during download to some extent, but some problems remain.
For example, although a rendering program can automatically parse and execute the JavaScript code in a web page, it cannot automatically execute JavaScript code that only runs when triggered by a mouse event, so page links are still lost.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and apparatus for extracting web (WEB) page links that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an embodiment provides a method for extracting page links from a web page, comprising:
determining the position of a control in the web page that needs to be triggered by a scheduled event; generating, at the determined position, the scheduled event that triggers the control, so as to trigger the control to change; calling an application interface to read all controls in the web page on which page link extraction needs to be performed; and obtaining the extracted page links from all the controls that were read.
Determining the position of the control that needs to be triggered by a scheduled event comprises: locating the position of that control in the web page according to coordinate information of the control known in advance.
Before determining the position of the control that needs to be triggered by a scheduled event, the method further comprises:
determining the region of the web page on which page link extraction needs to be performed; calling the application interface to read, in an initial extraction operation, all controls of the web page within the region; and matching the controls read in the initial extraction operation against the obtained target keywords and/or target pictures to judge whether a control that needs to be triggered by a scheduled event exists. When such a control exists in the page, the position of the matched control is taken as the position of the control that needs to be triggered by the scheduled event.
The method further comprises: when no control that needs to be triggered by a scheduled event exists in the page, obtaining the extracted page links from all the controls read in the initial extraction operation.
Before determining the region of the web page on which page link extraction needs to be performed, the method further comprises: loading the web page in a browser according to the URL of the web page in the received input parameters; and, after the web page has finished loading in the browser, determining the region of the web page on which page link extraction needs to be performed.
The method further comprises: recording the process identifier (ID) of the browser process that loads the web page; monitoring the information displayed in the status bar of the browser corresponding to the recorded process ID; and, when the displayed information indicates that page loading is complete, confirming that the web page has finished loading in the browser.
Determining the region of the web page on which page link extraction needs to be performed comprises: finding the display window of the browser corresponding to the recorded process ID, and taking the region occupied by the web page in the display window as the region on which page link extraction needs to be performed.
Matching the controls read in the initial extraction operation against the obtained target keywords and/or target pictures comprises obtaining the targets as follows:
when the received input parameters include keywords and/or pictures, the keywords and/or pictures in the input parameters are used as the obtained target keywords and/or target pictures; when the received input parameters include no keywords and/or pictures, the keywords and/or pictures in a preset configuration file are read and used as the obtained target keywords and/or target pictures.
Matching the controls read in the initial extraction operation against the obtained target keywords and/or target pictures further comprises:
matching the control names against one or more keywords; when a control name contains at least one keyword, the match is confirmed as successful and the matched control is selected as the control that needs to be triggered by a scheduled event; when none of the names of the controls that were read contains a keyword, the match is confirmed as failed and no control that needs to be triggered by a scheduled event exists; and/or performing pattern recognition on the web page content in the region to identify all pictures in the region, and matching the identified pictures against one or more target pictures; when the identified pictures include at least one target picture, the match is confirmed as successful and the matched control is selected as the control that needs to be triggered by a scheduled event; when none of the identified pictures is a target picture, the match is confirmed as failed and no control that needs to be triggered by a scheduled event exists.
Generating, at the determined position, the scheduled event that triggers the control comprises: generating a trigger message for the scheduled event and passing the trigger message to the control.
The scheduled event is a mouse event, and triggering the control to change comprises: triggering the value of the control's DOM node in the browser's Document Object Model (DOM) tree to be updated to a page link from which the resource can be obtained directly.
Calling the application interface to read all controls in the web page on which page link extraction needs to be performed comprises: calling a user interface (UI) application interface to read those controls.
Obtaining the extracted page links from all the controls read in the initial extraction operation comprises: deleting duplicate controls from the controls read in the initial extraction operation, and then obtaining the extracted page links.
Obtaining the extracted page links from all the controls that were read comprises: deleting duplicate controls from the controls that were read, and then obtaining the extracted page links.
According to another aspect of the present invention, an embodiment provides an apparatus for extracting page links from a web page, comprising:
a control position determination unit, adapted to determine the position of a control in the web page that needs to be triggered by a scheduled event;
a control trigger unit, adapted to generate, at the determined position, the scheduled event that triggers the control, so as to trigger the control to change;
a control extraction unit, adapted to call an application interface to read all controls in the web page on which page link extraction needs to be performed;
a page link obtaining unit, adapted to obtain the extracted page links from all the controls that were read.
The control position determination unit is adapted to locate the position of the control that needs to be triggered by a scheduled event according to coordinate information of the control known in advance.
The apparatus further comprises an extraction region determining unit and a control matching unit. The extraction region determining unit is adapted to determine the region of the web page on which page link extraction needs to be performed, and the control extraction unit is adapted to call the application interface to read all controls in the region. The control matching unit is adapted to match the controls read in the initial extraction operation against the obtained target keywords and/or target pictures and to judge whether a control that needs to be triggered by a scheduled event exists. The control position determination unit is further adapted, when such a control exists in the page, to take the position of the matched control as the position of the control that needs to be triggered by the scheduled event.
The page link obtaining unit is further adapted, when no control that needs to be triggered by a scheduled event exists in the page, to obtain the extracted page links from all the controls read in the initial extraction operation.
The apparatus further comprises a receiving unit and a web page loading unit. The receiving unit is adapted to receive input parameters. The web page loading unit is adapted to load the web page in a browser according to the URL of the web page in the received input parameters. The extraction region determining unit is further adapted to determine, after the web page has finished loading in the browser, the region of the web page on which page link extraction needs to be performed.
The apparatus further comprises: a recording unit, adapted to record the process ID of the browser process that loads the web page; and a browser status monitoring unit, adapted to monitor the information displayed in the status bar of the browser corresponding to the recorded process ID and, when the displayed information indicates that page loading is complete, to confirm that the web page has finished loading in the browser.
The extraction region determining unit is adapted to find the display window of the browser corresponding to the recorded process ID and to take the region occupied by the web page in the display window as the region on which page link extraction needs to be performed.
The control matching unit is adapted, when the received input parameters include keywords and/or pictures, to use the keywords and/or pictures in the input parameters as the obtained target keywords and/or target pictures, and, when the received input parameters include no keywords and/or pictures, to read the keywords and/or pictures in a preset configuration file and use them as the obtained target keywords and/or target pictures.
The control matching unit is adapted to match the control names against one or more keywords and, when a control name contains at least one keyword, to confirm that the match is successful and select the matched control as the control that needs to be triggered by a scheduled event; and/or to perform pattern recognition on the web page content in the region to identify all pictures in the region, match the identified pictures against one or more target pictures and, when the identified pictures include at least one target picture, confirm that the match is successful and select the matched control as the control that needs to be triggered by a scheduled event.
The control trigger unit is adapted to generate a trigger message for the scheduled event and to pass the trigger message to the control.
The scheduled event is a mouse event, and the control trigger unit is further adapted to trigger the value of the control's DOM node in the browser's Document Object Model (DOM) tree to be updated to a page link from which the resource can be obtained directly.
The control extraction unit is adapted to call a user interface (UI) application interface to read all controls in the web page on which page link extraction needs to be performed.
The page link obtaining unit is further adapted to delete duplicate controls from the controls that were read and then obtain the extracted page links.
For a web page containing a control that needs to be triggered by a scheduled event, the embodiment of the present invention automatically generates the scheduled event that triggers the control and thereby causes the control to change. As a result, when page links are extracted through the application interface, the JavaScript code corresponding to the control is dynamically parsed and executed. This avoids the loss of page links caused by that code not being run and thus ensures the completeness of the extracted page links.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Description of drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows an apparatus for extracting page links from a web page according to an embodiment of the invention;
Fig. 2 shows another apparatus for extracting page links from a web page according to an embodiment of the invention;
Fig. 3 shows a control in the page before it is triggered, according to an embodiment of the invention;
Fig. 4 shows the control in the page after it has been triggered, according to an embodiment of the invention;
Fig. 5 shows a flowchart of a method for extracting page links from a web page according to another embodiment of the invention; and
Fig. 6 shows a schematic operational flow of a software upgrade platform that applies this scheme.
Embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
One embodiment of the invention provides an apparatus for extracting page links. Referring to Fig. 1, it comprises: a control position determination unit 111, a control trigger unit 112, a control extraction unit 113, a page link obtaining unit 114, an extraction region determining unit 115, a control matching unit 116, a receiving unit 117, a web page loading unit 118, a recording unit 119 and a browser status monitoring unit 120. These units are described in turn below.
The control position determination unit 111 is adapted to determine the position of a control in the web page that needs to be triggered by a scheduled event. It can determine the position of the control in at least the following three modes:
Mode one: direct location
The control position determination unit 111 is adapted to directly locate the position of the control that needs to be triggered by a scheduled event according to coordinate information of the control known in advance.
The coordinate information directly indicates the position of the control in the page. For example, it can be the control's horizontal coordinate X and vertical coordinate Y in the page, so that (X, Y) directly identifies the coordinate point where the control is located.
In a specific implementation, the receiving unit 117 receives the input parameters. For example, the input parameters include a parameter u, which indicates the URL (Uniform Resource Locator) of the web (WEB) page on which page link extraction is to be performed.
The web page loading unit 118 is adapted to load the web page in a browser according to the URL of the web page in the received input parameters. After the web page has finished loading in the browser, the control position determination unit 111 uses the coordinate information of the control to locate the control in the page and thereby finds the control that needs to be triggered by a scheduled event.
This way of locating is best suited to scenarios in which the control that needs to be triggered by a scheduled event has a fixed position in the page; in that case the position of the control can be determined quickly and accurately.
For scenarios in which it is not known, before the page link extraction operation is performed, whether the page contains a control that needs to be triggered by a scheduled event, mode two and mode three below can be used. When mode two or mode three is used, the control position determination unit 111 needs to operate together with the extraction region determining unit 115 and the control matching unit 116. It should be understood that when mode one is used, the extraction region determining unit 115 and the control matching unit 116 can be omitted; see Fig. 2.
Mode two: keyword matching
In a specific implementation, as in mode one, the receiving unit 117 first receives the input parameters. For example, the input parameters include a parameter u, which indicates the URL of the WEB page on which page link extraction is to be performed.
The web page loading unit 118 is adapted to load the web page in a browser according to the URL of the web page in the received input parameters. After the web page has finished loading in the browser, the extraction region determining unit 115 determines the region of the web page on which page link extraction needs to be performed. For example, the extraction region determining unit 115 finds the display window of the browser; in the IE browser, the display window is obtained by finding the window named Internet Explorer_Server. The region occupied by the web page in the display window is then taken as the region on which page link extraction needs to be performed.
In mode two, once this region has been determined, the control extraction unit 113 performs an initial extraction operation, that is, it calls the application interface to read all controls in the region. After the control has been triggered, the control extraction unit 113 reads all controls in the region again.
The control matching unit 116 matches the controls read in the initial extraction operation against the obtained target keywords and judges whether a control that needs to be triggered by a scheduled event exists. A control name usually contains information corresponding to a keyword, so the control matching unit 116 matches the control names against one or more keywords. When a control name contains at least one keyword, the match is confirmed as successful and the matched control is selected as the control that needs to be triggered by a scheduled event. When none of the names of the controls that were read contains a keyword, the match is confirmed as failed and no control that needs to be triggered by a scheduled event exists.
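The keyword matching described above can be illustrated with a minimal Python sketch. The `Control` record, its field names and the sample control names are hypothetical and chosen only for illustration; the source does not specify how controls are represented once they have been read through the application interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Hypothetical record for one control read in the initial extraction."""
    name: str  # control name as exposed by the UI interface
    x: int     # horizontal position in the page
    y: int     # vertical position in the page


def match_controls(controls, keywords):
    """Return the controls whose name contains at least one keyword.

    An empty result corresponds to a failed match: no control that
    needs to be triggered by a scheduled event exists.
    """
    usable = [k for k in keywords if k]  # ignore empty keywords
    return [c for c in controls if any(k in c.name for k in usable)]


controls = [
    Control("www.xxxx.com/productForPC.shtml#download immediately", 120, 340),
    Control("About us", 40, 900),
]
matched = match_controls(controls, ["download immediately"])
positions = [(c.x, c.y) for c in matched]  # positions to trigger later
```

The positions of the matched controls are what the control position determination unit would then use as the trigger positions.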
When a control that needs to be triggered by a scheduled event exists in the page, the control position determination unit 111 takes the position of the matched control as the position of the control that needs to be triggered by the scheduled event.
Mode three: picture matching
Mode three is processed in essentially the same way as mode two. The main difference is that in mode three the control matching unit 116 matches the controls read in the initial extraction operation against the obtained target pictures to judge whether a control that needs to be triggered by a scheduled event exists. The control matching unit 116 performs pattern recognition on the web page content in the region to identify all pictures in the region, and matches the identified pictures against one or more target pictures. When the identified pictures include at least one target picture, the match is confirmed as successful and the matched control is selected as the control that needs to be triggered by a scheduled event. When none of the identified pictures is a target picture, the match is confirmed as failed and no control that needs to be triggered by a scheduled event exists.
Because the content of most web pages changes or is updated quickly, mode two and mode three provide a more flexible way of locating that can accurately find the position of the control in the web page that needs to be triggered by a scheduled event.
When an extraction operation is performed on a web page, modes one to three can all be used together, any two of them can be combined, or any one of them can be used alone.
The target keywords and target pictures used by the control matching unit 116 can be obtained in at least the following two ways:
Approach one: from the parameters received by the receiving unit 117.
In addition to the parameter u, the input parameters received by the receiving unit 117 can include a parameter k, which contains keywords and/or pictures. The control matching unit 116 uses the keywords and/or pictures in the input parameter k as the obtained target keywords and/or target pictures. The parameter k can contain one or more keywords and/or pictures; for example, when it contains several keywords, they can be separated by the symbol "|". For scenarios in which a mouse event is needed to trigger the control, such as a control triggered by a mouse click (onclick) event or a mouse-over (onmouseover) event, one example of a keyword is the text "download immediately" in the control name.
When needed, the input parameters received by the receiving unit 117 can also include a parameter o and a parameter t. The parameter o indicates where and how the extracted page links are stored, for example storing the extracted page links one per line in C: ret.txt. The parameter t indicates the time limit for accessing the WEB page, for example 60 seconds; when the page load time exceeds 60 seconds, the page load is confirmed as timed out.
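As an illustration of these input parameters, the following Python sketch parses u, k, o and t from a command line. The source only names the parameters and does not say how they are passed, so the `-u`/`-k`/`-o`/`-t` flag spelling, the attribute names and the default values are assumptions made for this example.

```python
import argparse


def parse_args(argv):
    """Parse the input parameters u, k, o and t (flag syntax is assumed)."""
    parser = argparse.ArgumentParser(description="Extract page links from a web page")
    parser.add_argument("-u", dest="url", required=True,
                        help="URL of the page to extract links from")
    parser.add_argument("-k", dest="keywords", default="",
                        help="target keywords separated by '|'")
    parser.add_argument("-o", dest="output", default=None,
                        help="file in which to store the extracted links")
    parser.add_argument("-t", dest="timeout", type=int, default=60,
                        help="page access time limit in seconds")
    args = parser.parse_args(argv)
    # Several keywords are separated by '|'; drop empty fragments.
    args.keywords = [k for k in args.keywords.split("|") if k]
    return args


args = parse_args(["-u", "http://example.com/",
                   "-k", "download immediately|install",
                   "-t", "30"])
```

When `-k` is omitted, `args.keywords` is an empty list, which a caller can treat as "no keywords in the input parameters".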
Approach two: from a preset configuration file.
Keywords and/or pictures are preset in a configuration file of the system, and the control matching unit 116 obtains the target keywords and/or target pictures by reading this configuration file.
Because the input parameters reflect the actual requirements of the current page rendering scenario, the embodiment of the present invention prefers approach one for obtaining the target keywords and/or target pictures. Approach two is used only when approach one yields nothing, that is, when the input parameters include no keywords and/or pictures. In other words, when the received input parameters include keywords and/or pictures, the control matching unit 116 uses them as the obtained target keywords and/or target pictures; when they do not, it reads the keywords and/or pictures in the preset configuration file and uses those instead.
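The preference order between the two approaches can be sketched as follows. The JSON layout of the configuration file and the `keywords`/`pictures` key names are assumptions for this example; the source does not describe the format of the preset configuration file.

```python
import json


def resolve_targets(params, config_path):
    """Return (keywords, pictures), preferring the input parameters.

    `params` is a dict of received input parameters. The configuration
    file is read only when the parameters carry no keywords or pictures.
    """
    keywords = params.get("keywords") or []
    pictures = params.get("pictures") or []
    if keywords or pictures:
        return keywords, pictures  # approach one: input parameters
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)      # approach two: preset configuration file
    return config.get("keywords", []), config.get("pictures", [])
```

Because the file is opened only in the fallback branch, a missing configuration file does not matter when the input parameters already supply targets.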
Because most systems support running several browser processes at the same time, the apparatus further comprises a recording unit 119 so that the browser page on which page link extraction needs to be performed can be picked out. When the web page loading unit 118 opens the WEB page, the recording unit 119 obtains the process identifier (ID) of the browser process for that page. The browser status monitoring unit 120 monitors the information displayed in the status bar of the browser corresponding to the recorded process ID and, when the displayed information indicates that page loading is complete, confirms that the web page has finished loading in the browser. The extraction region determining unit 115 can then find the display window of the browser corresponding to the recorded process ID and take the region occupied by the web page in the display window as the region on which page link extraction needs to be performed.
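The status bar monitoring can be pictured as a polling loop with a time limit. In this sketch, `read_status` stands for whatever mechanism returns the status bar text of the browser with the recorded process ID; it is injected as a callable because the source does not say how the text is read. The completion text "Done" and the polling interval are likewise assumptions.

```python
import time


def wait_until_loaded(read_status, timeout_s=60.0, poll_s=0.5, done_text="Done"):
    """Poll the status bar text until it signals that loading is complete.

    Returns True when `done_text` appears in the status text, or False
    when `timeout_s` seconds pass first (page load timed out).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if done_text in read_status():
            return True
        time.sleep(poll_s)
    return False
```

A caller would pass the time limit from the input parameters as `timeout_s` and skip link extraction when the function returns False.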
The control extraction unit 113 calls a user interface (User Interface, UI) application interface to read all controls in the web page on which page link extraction needs to be performed. In one example, the embodiment of the present invention calls the UI interface provided by the Microsoft Active Accessibility (MSAA) mechanism of the system to read all controls in the browser's display window.
The control trigger unit 112 generates, at the determined position, the scheduled event that triggers the control, so as to trigger the control to change. Specifically, the control trigger unit 112 generates a trigger message for the scheduled event and passes the trigger message to the control. For example, when the scheduled event is a mouse event, the control trigger unit 112 generates a trigger message for an onclick or onmouseover event at the position of the located control and passes it to the control, simulating the mouse event. This triggers the value of the control's DOM node in the browser's Document Object Model (DOM) tree to be updated to a page link from which the resource can be obtained directly; this page link can be a URL. Fig. 3 shows a control in the page before it is triggered. The name this control displays in the browser page is "www.xxxx.com/productForPC.shtml#download immediately"; it is a control that needs to be triggered by a mouse event, and its name contains the keyword "download immediately". Fig. 4 shows the control after it has been triggered. By generating the trigger message to simulate a mouse click on the control, the value of the control's DOM node is changed to a page link from which the resource can be obtained directly (such as a URL), namely "Dl dir.xx.com/xxfile/xx/XX2012/XPlusDesktop4.4.exe" in the example of Fig. 4. Neither the traditional approach of directly downloading the web page content nor existing page rendering can extract the URL shown in Fig. 4, whereas the scheme of this embodiment can extract all URLs in the web page and is suitable for the link extraction needs of all WEB pages.
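The source does not specify the format of the trigger message. As one possible reading, and purely as an assumption, the sketch below builds the sequence of standard Win32 mouse messages that would simulate a click at a given position. The message constants are the standard Win32 values; the function names and the choice of a move/down/up sequence are illustrative.

```python
# Standard Win32 mouse message identifiers and key-state flag.
WM_MOUSEMOVE = 0x0200
WM_LBUTTONDOWN = 0x0201
WM_LBUTTONUP = 0x0202
MK_LBUTTON = 0x0001


def make_lparam(x, y):
    """Pack client coordinates as MAKELPARAM does: x in the low word, y in the high word."""
    return ((y & 0xFFFF) << 16) | (x & 0xFFFF)


def build_click_messages(x, y):
    """Return (message, wParam, lParam) tuples simulating a left click at (x, y)."""
    lparam = make_lparam(x, y)
    return [
        (WM_MOUSEMOVE, 0, lparam),             # move the pointer over the control
        (WM_LBUTTONDOWN, MK_LBUTTON, lparam),  # press the left button
        (WM_LBUTTONUP, 0, lparam),             # release the left button
    ]
```

On Windows, each tuple could be delivered to the browser's display window with a call such as `ctypes.windll.user32.PostMessageW(hwnd, msg, wparam, lparam)`, where `hwnd` is the handle of that window; whether posted messages are enough to fire a particular page's handler depends on the browser and is not confirmed by the source.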
The page link obtaining unit 114 obtains the extracted page links from all the controls that were read. When the controls are located using mode one, the control extraction unit only needs to be called to read the page once during page link extraction, and the page link obtaining unit 114 obtains the extracted page links from the data read in that single read. When mode two or mode three is used and the page contains no control that needs to be triggered by a scheduled event, only the initial extraction operation needs to be performed, and the page link obtaining unit 114 obtains the extracted page links from the data produced by the initial extraction operation. When the page does contain a control that needs to be triggered by a scheduled event, the page link obtaining unit 114 needs to call the control extraction unit to read the page a second time, and obtains the extracted page links from the data read in that later read.
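The one-read and two-read cases for modes two and three can be summarised in a short Python sketch. The controls are modelled as dictionaries with hypothetical `name` and `value` fields, and `read_controls` and `trigger` are injected callables standing in for the control extraction unit and the control trigger unit; none of these names come from the source.

```python
def extract_links(read_controls, trigger, keywords):
    """Return page links, re-reading the page only if a control was triggered.

    read_controls() returns a list of dicts with 'name' and 'value' keys;
    trigger(control) simulates the scheduled event on one control.
    """
    initial = read_controls()  # initial extraction operation
    targets = [c for c in initial
               if any(k in c["name"] for k in keywords)]
    if not targets:
        # No control needs triggering: one read is enough.
        return [c["value"] for c in initial]
    for control in targets:
        trigger(control)
    # Controls were triggered: read the page again and use the later read.
    return [c["value"] for c in read_controls()]
```

With mode one the matching step is skipped, since the position of the control is known in advance, so the page is triggered first and then read once.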
In addition, under mode one, mode two or mode three, when the extracted page links contain redundant duplicate data, the page link obtaining unit 114 can also delete the duplicate controls from the controls that were read before obtaining the extracted page links.
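A minimal way to drop duplicates while keeping the order in which links were first read is shown below. The function name is hypothetical, and preserving first-seen order is a choice made for this example rather than something the source requires.

```python
def dedupe_links(links):
    """Remove duplicate links, keeping the first occurrence of each."""
    # dict preserves insertion order, so the keys come back in first-seen order.
    return list(dict.fromkeys(links))
```

For instance, a list in which the same download URL was read from two controls is reduced to a single entry for that URL.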
From the above mentioned, the embodiment of the present invention is to comprising the webpage of the control that needs the scheduled event triggering, trigger the scheduled event of this control by automatic generation, trigger the technological means that this control changes, can be when utilizing application interface to extract page link, dynamic analysis is also carried out JavaScript code corresponding to this control, has avoided due to the loss that can't move the page link that JavaScript code corresponding to this control cause, thereby has guaranteed the integrality of the page link that extracts.
Another embodiment of the present invention further provides a method of extracting page links from a web page. Referring to Fig. 5, the method includes the following steps:
S500: Load the web page in a browser according to the URL of the web page in the received input parameters.
In this step, the process identifier (ID) of the browser process that loads the web page is recorded.
S501: Determine that the web page has finished loading in the browser.
In this step, the information displayed in the status bar of the browser corresponding to the recorded process ID is monitored; when the displayed information indicates that page loading is complete, the web page is confirmed to have finished loading in the browser.
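The monitoring in step S501 amounts to polling the status-bar text until it reports completion. A minimal sketch is given below. The `read_status_text` callable is a hypothetical stand-in for whatever platform call reads the status bar of the browser window owned by the recorded process ID, and the `"Done"` marker and the timeout values are assumptions, not part of the description.

```python
import time


def wait_until_loaded(read_status_text, done_marker="Done",
                      timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll the status-bar text until it contains done_marker.

    Returns True when loading is reported complete, False on timeout.
    """
    waited = 0.0
    while waited <= timeout:
        if done_marker in read_status_text():
            return True
        sleep(interval)  # injected so callers and tests can avoid real delays
        waited += interval
    return False
```

Injecting the reader and the sleep function keeps the polling logic independent of any particular browser or windowing system.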
After step S501 is performed, the position of the control in the web page that needs to be triggered by a scheduled event must be determined.
In one mode, this embodiment locates the control that needs to be triggered by a scheduled event directly from coordinate information of the control that is known in advance.
In another mode, as shown in Fig. 5, the region of the web page from which page links are to be extracted is first determined in step S502.
In step S502, the display window of the browser corresponding to the recorded process ID is looked up, and the region that the web page occupies in the display window is taken as the region from which page links are to be extracted.
S503: In an initial extraction operation, call the application interface to read all controls in the region.
In this step, the UI application interface is called to read all controls in the web page from which page links are to be extracted.
S504: Match the controls read in the initial extraction operation against the obtained target keywords and/or target pictures.
Obtaining the target keywords and/or target pictures may include the following: when the received input parameters include keywords and/or pictures, those keywords and/or pictures are used as the target keywords and/or target pictures; when the received input parameters include no keywords and/or pictures, the keywords and/or pictures in a preset configuration file are read and used as the target keywords and/or target pictures.
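The fallback from input parameters to the preset configuration file can be sketched as follows. The dictionary keys `keywords` and `pictures`, the JSON file format, and the function name `resolve_targets` are assumptions made for illustration; the description does not define the format of the input parameters or of the configuration file.

```python
import json


def resolve_targets(input_params, config_path):
    """Return (keywords, pictures) taken from the input parameters,
    falling back to the preset configuration file when neither is given."""
    keywords = input_params.get("keywords") or []
    pictures = input_params.get("pictures") or []
    if keywords or pictures:
        return list(keywords), list(pictures)
    # Nothing supplied by the caller: read the preset configuration file.
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    return list(config.get("keywords", [])), list(config.get("pictures", []))
```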
In this step, matching is used to judge whether a control that needs to be triggered by a scheduled event exists. For example, one or more keywords are matched against the control names: when a control name contains at least one keyword, the match succeeds and the matched control is selected as a control that needs to be triggered by a scheduled event; when none of the names of the controls read contains a keyword, the match fails and no control that needs to be triggered by a scheduled event exists. And/or, pattern recognition is performed on the web page content in the region to identify all pictures in the region, and one or more target pictures are matched against the identified pictures: when the identified pictures include at least one target picture, the match succeeds and the matched control is selected as a control that needs to be triggered by a scheduled event; when none of the identified pictures is a target picture, the match fails and no control that needs to be triggered by a scheduled event exists.
When a control that needs to be triggered by a scheduled event exists in the page, the position of the matched control is taken as the position of the control that needs to be triggered by a scheduled event.
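The keyword branch of the matching in step S504 can be illustrated with the sketch below. The dictionary layout of a control (`name`, `position`), the function name, and the case-insensitive comparison are assumptions; picture matching by pattern recognition is not shown.

```python
def find_trigger_controls(controls, keywords):
    """Return (name, position) for every control whose name contains
    at least one keyword; an empty list means the match failed."""
    lowered = [k.lower() for k in keywords if k]
    return [
        (c["name"], c["position"])
        for c in controls
        # Assumption: matching ignores letter case.
        if any(k in c["name"].lower() for k in lowered)
    ]
```

An empty result corresponds to the case in which no control needs to be triggered by a scheduled event, so only the initial extraction is used.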
S505: When a control that needs to be triggered by a scheduled event exists in the page, generate the scheduled event that triggers the control at the determined position so that the control changes, and then perform step S507.
To trigger the control to change, a trigger message for the scheduled event is generated and passed to the control, and the control changes in response to the trigger message. For example, when the scheduled event is a mouse event, the DOM node value of the control in the browser's DOM tree structure is updated to a page link from which the resource can be obtained directly.
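The effect of passing a trigger message to a control can be pictured with a toy model. The `DomNode` class and the message dictionary below are hypothetical and are not a browser API; they only show a node whose value is rewritten to a direct link once a click message reaches its handler.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DomNode:
    # Toy stand-in for a control's node in the DOM tree.
    name: str
    value: str = ""
    on_click: Optional[Callable[["DomNode"], None]] = None

    def dispatch(self, message: dict) -> None:
        """Deliver a trigger message; a click runs the handler,
        which may rewrite the node value."""
        if message.get("type") == "click" and self.on_click is not None:
            self.on_click(self)


def make_click_message(position):
    """Build a mouse-click trigger message for the given (x, y) position."""
    return {"type": "click", "x": position[0], "y": position[1]}
```

In a real browser the handler would be the page's own JavaScript; here it is an ordinary Python callable supplied by the caller.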
S506: When no control that needs to be triggered by a scheduled event exists in the page, obtain the extracted page links from all the controls read in the initial extraction operation, after deleting the duplicate controls among them.
S507: Call the application interface again to read all controls in the web page from which page links are to be extracted; obtain the extracted page links from all the controls read, after deleting the duplicates of controls read in the initial extraction operation.
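Steps S503 to S507 can be summarized in one short control flow. In the sketch below the three callables are hypothetical placeholders for the application interface, the matching step, and the event generation, and each control is reduced to its link string for brevity.

```python
def extract_page_links(read_controls, find_trigger_position, fire_event):
    """Run S503 to S507 with injected callables and return the
    de-duplicated page links in first-seen order."""
    initial = read_controls()                   # S503: initial extraction
    position = find_trigger_position(initial)   # S504: keyword/picture matching
    if position is None:
        controls = initial                      # S506: nothing to trigger
    else:
        fire_event(position)                    # S505: generate the scheduled event
        controls = read_controls()              # S507: read the controls again
    links = []
    for link in controls:
        if link not in links:                   # drop duplicate entries
            links.append(link)
    return links
```

The second read happens only when a control was actually triggered, which matches the single-pass behaviour of step S506.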
For the specific manner of performing each step in the method embodiment of the present invention, refer to the device embodiment of the present invention; the details are not repeated here.
As described above, for a web page that contains a control that needs to be triggered by a scheduled event, the embodiment of the present invention automatically generates the scheduled event that triggers the control, causing the control to change. As a result, when the application interface is used to extract page links, the JavaScript code corresponding to the control is dynamically parsed and executed. This avoids the loss of page links that would occur if the JavaScript code corresponding to the control could not be run, and thereby ensures the completeness of the extracted page links.
Referring to Fig. 6, a schematic diagram of the operating flow of a software update platform that applies this scheme is shown. The software update platform includes a seed scheduler, a url grabber, an html parser, a url filter, a url detector, an action processor, and a database (DB).
The input parameter of the software update platform is a user-defined seed (with no restriction on parent page, keyword, or domain name). The software update platform stores the obtained seed in the database, and the web page from which page links are to be extracted can be obtained from the seed.
The database maintains a seed scheduling queue according to the stored seed information; when a new seed appears, it is automatically added to the scheduling queue.
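One way to picture the seed scheduling queue is a priority queue keyed on the next crawl time. The description does not say how the queue is ordered, so the ordering, the `SeedQueue` class, and its method names are all assumptions made for this sketch.

```python
import heapq


class SeedQueue:
    """Keep seeds ordered by next crawl time; a newly stored seed
    is simply pushed onto the queue."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so equal times pop in insertion order

    def add(self, seed_id, next_crawl_time):
        heapq.heappush(self._heap, (next_crawl_time, self._count, seed_id))
        self._count += 1

    def pop_due(self, now):
        """Return the ids of all seeds whose next crawl time is at or before now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```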
The input of the seed scheduler is the seed information in the database, such as the information of a newly added seed, which includes the seed ID, ID attributes, scheduling interval, update detection mode, parsing mode, whether to crawl, next crawl time, whether to parse and detect updates, failure handling mode, and so on. The output of the seed scheduler is data in xml format containing the seed information.
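The xml output of the seed scheduler might look like the result of the sketch below. The element and attribute names are hypothetical, since the description lists the seed fields but does not define an xml schema.

```python
import xml.etree.ElementTree as ET


def seed_to_xml(seed):
    """Serialize a flat seed record into an XML string,
    with the seed ID as an attribute and one child element per field."""
    root = ET.Element("seed", id=str(seed["id"]))
    for key, value in seed.items():
        if key != "id":
            ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A downstream stage such as the url grabber could then recover the fields with `ET.fromstring` and `findtext`.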
The seed scheduler implements the functions of the receiving unit, the web page loading unit, the recording unit, and the browser status monitoring unit in the above embodiment.
The input of the url grabber is the xml-format data output by the seed scheduler. The output of the url grabber contains the fetched html, js, xml, txt, and ini information and is output in xml format. The input of the html parser is the output of the url grabber. The html parser parses the pages fetched by the url grabber using the parsing mode defined in the input parameters and extracts the links (equivalent to the controls described above). The output of the html parser is xml containing the extracted links.
The url grabber and the html parser implement the functions of the extraction region determining unit and the control extraction unit described above.
The input of the url detector is the output of the html parser. Using the update detection mode defined in the input parameters, the url detector checks whether a control that needs to be triggered by a scheduled event exists. The output of the url detector is xml-format data containing the detection result. The url detector implements the function of the control matching unit described above.
The action processor acts on the detection result. For example, when the detection result indicates that the web page contains a control that needs to be triggered by a scheduled event, the action processor generates the scheduled event that triggers the control so that the control changes; the action processor then calls the url grabber and the html parser to extract the urls in the page again and updates the database with the newly extracted urls. When the detection result indicates that the web page contains no control that needs to be triggered by a scheduled event, the action processor takes the url information output by the url detector as the urls extracted from the page and reports it to the database.
The action processor implements the functions of the control position determination unit, the control trigger unit, and the page link obtaining unit described above.
The software update platform also provides a data package download function. The platform has a query interface through which data packages can be downloaded from the database, so that new download and parsing modules can be connected easily. A management center may also be provided in the platform to manage, operate, and maintain each device in the platform.
The devices in the software update platform transfer data to one another through Gearman. Gearman enables multi-machine crawling, multi-machine parsing, and multi-machine detection, and decouples the seed scheduler, url grabber, html parser, url filter, url detector, and action processor. The data of each device can be stored in the database and retained permanently for later review and statistics. In addition, the platform allows the detection interval and detection mode of a seed or of software to be customized through the input parameters, which gives it considerable flexibility.
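The decoupling that a job queue provides can be illustrated in a single process. The sketch below does not use Gearman or its API; it substitutes Python's standard `queue` and `threading` modules, and the `stage` and `run_pipeline` names are hypothetical. Each stage only knows its input and output queue, which is the property that lets stages be moved to separate machines.

```python
import queue
import threading

STOP = object()  # sentinel that shuts a stage down


def stage(func, inbox, outbox):
    """Run func on every job from inbox and forward the result to outbox
    until the STOP sentinel arrives."""
    def worker():
        while True:
            job = inbox.get()
            if job is STOP:
                outbox.put(STOP)  # pass the shutdown signal downstream
                return
            outbox.put(func(job))
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def run_pipeline(jobs, funcs):
    """Chain the stage functions with queues and return the results
    in input order (one worker per stage keeps FIFO order)."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]
    for job in jobs:
        queues[0].put(job)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For example, a grabbing function followed by a parsing function can be passed as `funcs`, and neither needs to call the other directly.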
The embodiments of the present invention disclose the following devices:
A. In this embodiment, the control position determination unit is adapted to locate the control in the web page that needs to be triggered by a scheduled event according to coordinate information of the control that is known in advance.
B. In this embodiment, the device further includes an extraction region determining unit and a control matching unit.
The extraction region determining unit is adapted to determine the region of the web page from which page links are to be extracted;
The control extraction unit is adapted to call the application interface to read all controls in the region;
The control matching unit is adapted to match the controls read in the initial extraction operation against the obtained target keywords and/or target pictures and to judge whether a control that needs to be triggered by a scheduled event exists.
The control position determination unit is further adapted to, when a control that needs to be triggered by a scheduled event exists in the page, take the position of the matched control as the position of the control that needs to be triggered by a scheduled event.
C. In this embodiment, the page link obtaining unit is further adapted to, when no control that needs to be triggered by a scheduled event exists in the page, obtain the extracted page links from all the controls read in the initial extraction operation.
D. In this embodiment, the device further includes a receiving unit and a web page loading unit.
The receiving unit is adapted to receive input parameters;
The web page loading unit is adapted to load the web page in a browser according to the URL of the web page in the received input parameters;
The extraction region determining unit is further adapted to determine, after the web page has finished loading in the browser, the region of the web page from which page links are to be extracted.
E. In this embodiment, the device further includes:
a recording unit, adapted to record the process identifier (ID) of the browser process that loads the web page;
a browser status monitoring unit, adapted to monitor the information displayed in the status bar of the browser corresponding to the recorded process ID and, when the displayed information indicates that page loading is complete, confirm that the web page has finished loading in the browser.
F. In this embodiment, the extraction region determining unit is adapted to look up the display window of the browser corresponding to the recorded process ID and take the region that the web page occupies in the display window as the region from which page links are to be extracted.
G. In this embodiment, the control matching unit is adapted to: when the received input parameters include keywords and/or pictures, use those keywords and/or pictures as the target keywords and/or target pictures; and when the received input parameters include no keywords and/or pictures, read the keywords and/or pictures in a preset configuration file and use them as the target keywords and/or target pictures.
H. In this embodiment, the control matching unit is adapted to match one or more keywords against the control names and, when a control name contains at least one keyword, confirm that the match succeeds and select the matched control as a control that needs to be triggered by a scheduled event; and/or to perform pattern recognition on the web page content in the region to identify all pictures in the region, match one or more target pictures against the identified pictures and, when the identified pictures include at least one target picture, confirm that the match succeeds and select the matched control as a control that needs to be triggered by a scheduled event.
I. In this embodiment, the control trigger unit is adapted to generate a trigger message for the scheduled event and pass the trigger message to the control.
J. In this embodiment, the scheduled event is a mouse event, and the control trigger unit is further adapted to trigger the DOM node value of the control in the browser's Document Object Model (DOM) tree structure to be updated to a page link from which the resource can be obtained directly.
K. In this embodiment, the control extraction unit is adapted to call a user interface (UI) application interface to read all controls in the web page from which page links are to be extracted.
L. In this embodiment, the page link obtaining unit is further adapted to obtain the extracted page links after deleting the duplicate controls among the controls read.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid in the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for extracting page links from a web page according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (14)

1. A method of extracting page links from a web page, comprising:
determining the position of a control in the web page that needs to be triggered by a scheduled event;
generating, at the determined position, the scheduled event that triggers the control, so as to trigger the control to change;
calling an application interface to read all controls in the web page from which page links are to be extracted; and
obtaining the extracted page links from all the controls read.
2. The method according to claim 1, wherein determining the position of the control in the web page that needs to be triggered by a scheduled event comprises:
locating the control in the web page that needs to be triggered by a scheduled event according to coordinate information of the control that is known in advance.
3. The method according to claim 1, wherein, before determining the position of the control in the web page that needs to be triggered by a scheduled event, the method further comprises:
determining the region of the web page from which page links are to be extracted;
calling the application interface, in an initial extraction operation, to read all controls of the web page in the region; and
matching the controls read in the initial extraction operation against obtained target keywords and/or target pictures, and judging whether a control that needs to be triggered by a scheduled event exists;
and wherein determining the position of the control in the web page that needs to be triggered by a scheduled event comprises:
when a control that needs to be triggered by a scheduled event exists in the page, taking the position of the matched control as the position of the control that needs to be triggered by a scheduled event.
4. The method according to claim 3, wherein the method further comprises:
when no control that needs to be triggered by a scheduled event exists in the page, obtaining the extracted page links from all the controls read in the initial extraction operation.
5. The method according to claim 3, wherein, before determining the region of the web page from which page links are to be extracted, the method further comprises:
loading the web page in a browser according to the uniform resource locator (URL) of the web page in received input parameters; and
after the web page has finished loading in the browser, determining the region of the web page from which page links are to be extracted.
6. The method according to claim 5, wherein the method further comprises:
recording the process identifier (ID) of the browser process that loads the web page; and
monitoring the information displayed in the status bar of the browser corresponding to the recorded process ID and, when the displayed information indicates that page loading is complete, confirming that the web page has finished loading in the browser.
7. The method according to claim 6, wherein determining the region of the web page from which page links are to be extracted comprises:
looking up the display window of the browser corresponding to the recorded process ID, and taking the region that the web page occupies in the display window as the region from which page links are to be extracted.
8. The method according to claim 3, wherein matching the controls read in the initial extraction operation against the obtained target keywords and/or target pictures comprises:
when the received input parameters include keywords and/or pictures, using those keywords and/or pictures as the target keywords and/or target pictures; and when the received input parameters include no keywords and/or pictures, reading the keywords and/or pictures in a preset configuration file and using them as the target keywords and/or target pictures.
9. The method according to claim 3, wherein matching the controls read in the initial extraction operation against the obtained target keywords and/or target pictures comprises:
matching one or more keywords against the control names; when a control name contains at least one keyword, confirming that the match succeeds and selecting the matched control as a control that needs to be triggered by a scheduled event; when none of the names of the controls read contains a keyword, confirming that the match fails and that no control that needs to be triggered by a scheduled event exists; and/or
performing pattern recognition on the web page content in the region to identify all pictures in the region, and matching one or more target pictures against the identified pictures; when the identified pictures include at least one target picture, confirming that the match succeeds and selecting the matched control as a control that needs to be triggered by a scheduled event; when none of the identified pictures is a target picture, confirming that the match fails and that no control that needs to be triggered by a scheduled event exists.
10. The method according to claim 1, wherein generating, at the determined position, the scheduled event that triggers the control comprises:
generating a trigger message for the scheduled event and passing the trigger message to the control.
11. The method according to claim 1, wherein the scheduled event is a mouse event, and triggering the control to change comprises:
triggering the DOM node value of the control in the browser's Document Object Model (DOM) tree structure to be updated to a page link from which the resource can be obtained directly.
12. The method according to claim 1, wherein calling the application interface to read all controls in the web page from which page links are to be extracted comprises:
calling a user interface (UI) application interface to read all controls in the web page from which page links are to be extracted.
13. The method according to claim 4, wherein
obtaining the extracted page links from all the controls read in the initial extraction operation comprises: obtaining the extracted page links after deleting the duplicate controls among the controls read in the initial extraction operation; and
obtaining the extracted page links from all the controls read comprises:
obtaining the extracted page links after deleting the duplicate controls among the controls read.
14. A device for extracting page links from a web page, comprising:
a control position determination unit, adapted to determine the position of a control in the web page that needs to be triggered by a scheduled event;
a control trigger unit, adapted to generate, at the determined position, the scheduled event that triggers the control, so as to trigger the control to change;
a control extraction unit, adapted to call an application interface to read all controls in the web page from which page links are to be extracted; and
a page link obtaining unit, adapted to obtain the extracted page links from all the controls read.
CN201310116123.7A 2013-04-03 2013-04-03 A kind of method and apparatus extracting Webpage link Active CN103177115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310116123.7A CN103177115B (en) 2013-04-03 2013-04-03 A kind of method and apparatus extracting Webpage link

Publications (2)

Publication Number Publication Date
CN103177115A true CN103177115A (en) 2013-06-26
CN103177115B CN103177115B (en) 2016-06-29

Family

ID=48636976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310116123.7A Active CN103177115B (en) 2013-04-03 2013-04-03 A kind of method and apparatus extracting Webpage link

Country Status (1)

Country Link
CN (1) CN103177115B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657377A (en) * 2013-11-20 2015-05-27 阿里巴巴集团控股有限公司 Multi-channel webpage control positioning method and device
CN104995619A (en) * 2013-08-23 2015-10-21 华为终端有限公司 Webpage processing method and device
CN105183453A (en) * 2015-08-07 2015-12-23 安一恒通(北京)科技有限公司 Webpage-based information acquiring method and apparatus
CN105607895A (en) * 2014-11-21 2016-05-25 阿里巴巴集团控股有限公司 Operation method and device of application program on the basis of application program programming interface
CN106202244A (en) * 2016-06-28 2016-12-07 深圳中兴网信科技有限公司 Web page message return method and web page message return system
CN109981349A (en) * 2019-02-27 2019-07-05 华为技术有限公司 Call chain information query method and equipment
CN112260853A (en) * 2020-09-17 2021-01-22 北京大米科技有限公司 Disaster recovery switching method and device, storage medium and electronic equipment
CN112632358A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885538B (en) * 2016-09-28 2020-12-22 北京京东尚科信息技术有限公司 Method and device for adding hot area links on picture

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101364988A (en) * 2008-09-26 2009-02-11 深圳市迅雷网络技术有限公司 Method and apparatus determining webpage security
CN102955913A (en) * 2011-08-25 2013-03-06 腾讯科技(深圳)有限公司 Method and system for detecting hung Trojans of web page

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104995619A (en) * 2013-08-23 2015-10-21 华为终端有限公司 Webpage processing method and device
US10929497B2 (en) 2013-08-23 2021-02-23 Huawei Device Co., Ltd. Replacing a web page while maintaining a communication link
CN104657377B (en) * 2013-11-20 2018-04-03 阿里巴巴集团控股有限公司 A kind of multichannel webpage control localization method and device
CN104657377A (en) * 2013-11-20 2015-05-27 阿里巴巴集团控股有限公司 Multi-channel webpage control positioning method and device
CN105607895B (en) * 2014-11-21 2021-03-02 阿里巴巴集团控股有限公司 Application program operation method and device based on application program programming interface
CN105607895A (en) * 2014-11-21 2016-05-25 阿里巴巴集团控股有限公司 Operation method and device of application program on the basis of application program programming interface
CN105183453A (en) * 2015-08-07 2015-12-23 安一恒通(北京)科技有限公司 Webpage-based information acquiring method and apparatus
CN106202244A (en) * 2016-06-28 2016-12-07 深圳中兴网信科技有限公司 Web page message return method and web page message return system
CN109981349A (en) * 2019-02-27 2019-07-05 华为技术有限公司 Call chain information query method and equipment
CN109981349B (en) * 2019-02-27 2022-02-25 华为云计算技术有限公司 Call chain information query method and device
US11809300B2 (en) 2019-02-27 2023-11-07 Huawei Cloud Computing Technologies Co., Ltd. Trace chain information query method and device
CN112260853A (en) * 2020-09-17 2021-01-22 北京大米科技有限公司 Disaster recovery switching method and device, storage medium and electronic equipment
CN112260853B (en) * 2020-09-17 2023-07-21 北京大米科技有限公司 Disaster recovery switching method and device, storage medium and electronic equipment
CN112632358A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Resource link obtaining method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103177115B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN103177115A (en) Method and device of extracting page link of webpage
US8065667B2 (en) Injecting content into third party documents for document processing
US8230320B2 (en) Method and system for social bookmarking of resources exposed in web pages that don't follow the representational state transfer architectural style (REST)
US8762556B2 (en) Displaying content on a mobile device
US8296722B2 (en) Crawling of object model using transformation graph
CN104077387A (en) Webpage content display method and browser device
US20060190561A1 (en) Method and system for obtaining script related information for website crawling
CN105868096B (en) Method, device and equipment for showing web page test results in a browser
US20130318514A1 (en) Map generator for representing interrelationships between app features forged by dynamic pointers
CN110209966B (en) Webpage refreshing method, webpage system and electronic equipment
KR20060079080A (en) Methods and apparatus for evaluating aspects of a web page
US20130132422A1 (en) System and method for creating and controlling an application operating on a plurality of computer platform types
CN110851681B (en) Crawler processing method, crawler processing device, server and computer readable storage medium
CN101876897A (en) System and method used for processing Widget on Web browser
CN104036011A (en) Webpage element display method and browser device
CN103678487A (en) Method and device for generating web page snapshot
CN103034495A (en) Browser capable of isolating plug-in in webpage and webpage plug-in isolating method
CN103678509A (en) Method and device for generating webpage template
CN112835809A (en) Test data setting method, device, equipment and medium based on browser
CN103678510A (en) Method and device for providing visualized label for webpage
CN102902784A (en) Web page classification storage system and method
CN103853717A (en) Web crawler
CN114491560A (en) Vulnerability detection method and device, storage medium and electronic equipment
CN113868502A (en) Page crawler method and device, electronic equipment and readable storage medium
CN103544271A (en) Picture processing window loading method and device for browsers

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi Software (Beijing) Co.,Ltd.
