CN101984429A - Method and device for acquiring destination page, search engine and browser

Info

Publication number: CN101984429A
Application number: CN 201010531460
Authority: CN
Inventors: 潘云泓
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2010-11-04
Filing date: 2010-11-04
Publication date: 2011-03-09
Anticipated expiration: 2030-11-04
Also published as: CN101984429B

Abstract

The invention provides a method and a device for acquiring a destination page, a search engine and a browser. The method comprises the following steps: the search engine captures a foundation page corresponding to a received uniform resource locator (URL) and a script of the foundation page; the captured foundation page and script are analyzed to generate more than one state path that corresponds to the foundation page and comprises dynamic information; and the destination page is captured by using the generated state path, wherein the state path comprises the URL of the foundation page, position information of a document object model (DOM) event for generating the dynamic information in the foundation page, and a callback function index corresponding to the DOM event. The search engine can thus capture dynamic contents in the page when searching the destination page.

Description

Method and device for acquiring target page, search engine and browser

Technical Field

The invention relates to the internet technology, in particular to a method and a device for acquiring a target page, a search engine and a browser.

Background

With the rapid development of networks, the internet becomes a carrier of a large amount of information, and how to effectively extract and utilize the information becomes a great challenge. Search engines, as a tool to assist people in retrieving information, have become portals and guides for users to access the internet. The web crawler (Spider) is a program for automatically extracting web pages and is an important component of a search engine.

The traditional web crawler starts from Uniform Resource Locators (URLs) of one or a plurality of initial web pages, captures basic pages of the URLs, analyzes the content of the current basic pages to obtain the URLs of target pages, and performs data processing, including establishing web page summaries, snapshots, indexes and storage, and then returns the web page summaries, snapshots, indexes and storage to a browser for selection by a user.

However, when the traditional web crawler acquires the URL of the target page, only the static page can be captured, but with the continuous development of the internet technology, the content of the page is converted from the former static mode to the dynamic mode to generate data, and the traditional web crawler technology obviously cannot meet the conversion requirement, that is, cannot capture the dynamic content of the page.

Disclosure of Invention

The invention provides a method and a device for acquiring a target page, a search engine and a browser, so that the search engine can capture dynamic content in the page when searching the target page.

The specific technical scheme is as follows:

A method for acquiring a target page comprises the following steps:

A. capturing a basic page corresponding to the received uniform resource locator URL and a script of the basic page;

B. analyzing the captured basic page and the script, generating more than one state path containing dynamic information corresponding to the basic page, and capturing a target page by using the generated state paths; wherein the state path comprises: the URL of the basic page, position information of a Document Object Model (DOM) event generating dynamic information in the basic page, and a callback function index corresponding to the DOM event.

Wherein, the step B specifically comprises:

downloading each DOM node in the grabbing process of the basic page and the script, and sequentially executing steps B11 to B13 on the downloaded DOM nodes until the downloading of all the DOM nodes is finished, and then executing step B14;

B11, judging whether the currently downloaded DOM node is a script tag; if so, going to step B11 for the next downloaded DOM node; otherwise, executing step B12;

B12, judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B11 for the next downloaded DOM node; if so, executing step B13;

B13, generating a state path by using the DOM event contained in the currently downloaded DOM node, storing the generated state path in a state path queue, and going to step B11 for the next downloaded DOM node;

B14, acquiring the target pages corresponding to each state path in the state path queue one by one, judging whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.

Or, the step B specifically includes:

downloading each DOM node in the grabbing process of the basic page and the script, and sequentially executing the steps B21 to B23 on the downloaded DOM nodes until the downloading of all the DOM nodes is finished;

B21, judging whether the currently downloaded DOM node is a script tag; if so, going to step B21 for the next downloaded DOM node; otherwise, executing step B22;

B22, judging whether the currently downloaded DOM node contains a DOM event and a callback function; if not, going to step B21 for the next downloaded DOM node; if so, executing step B23;

B23, generating a state path by using the DOM event contained in the currently downloaded DOM node;

B24, acquiring the target page corresponding to the state path and judging whether new page content is generated or a page jump occurs; if so, determining that the state path is a state path corresponding to the basic page; in either case, going to step B21 for the next downloaded DOM node.

In the above manners, judging whether a page jump occurs includes: if the URL of the acquired target page is different from that of the basic page, determining that a page jump occurs.

Specifically, the determining whether to generate new page content includes: carrying out sentence signature or character string comparison on the acquired target page and the basic page, and if the comparison result shows that the target page and the basic page have different page contents, determining to generate new page contents; or,

and calculating the similarity between the acquired target page and the basic page, and determining to generate new page content if the calculation result shows that the target page and the basic page have different page contents.

Wherein the position information of the DOM event comprises: DOM node identification, path Xpath of DOM node and DOM event identification.

Still further, after the step B, the method further comprises:

C. storing the state path corresponding to the basic page generated in step B and the snapshot of the captured target page, and establishing and storing an index of the target page.

A method for obtaining a target page, based on the above method, comprises the following steps:

after receiving a search request from a browser, matching keywords contained in the search request with indexes of stored target pages, including a state path corresponding to the matched target page in a search result, and returning the search result to the browser, so that the browser can obtain the corresponding target page by using the state path selected by a user.

In addition, the search result may further include: snapshot information of the matched target page;

and after receiving snapshot information of the target page selected by the user and returned by the browser, returning a corresponding snapshot of the target page to the browser.

Furthermore, after the state path corresponding to the matched target page is included in the search result and returned to the browser, the method further includes:

and after receiving the state path selected by the user and sent by the browser, sending a target page request to a target page site according to the state path selected by the user, so that the target page site can push a target page to the browser.

A method for acquiring a target page comprises the following steps:

the browser receives a search result containing a state path returned by a search engine after sending a search request to the search engine;

sending a target page request to a target page site according to a state path selected by a user;

receiving a target page pushed by the target page site;

wherein the search result containing the state path is returned by the search engine using the method of claim 8.

An apparatus for obtaining a target page, the apparatus comprising:

the first grabbing unit is used for grabbing a basic page corresponding to the received uniform resource locator URL and a script of the basic page;

the analysis unit is used for analyzing the basic page and the script captured by the first grabbing unit and generating more than one state path which corresponds to the basic page and contains dynamic information; wherein the state path comprises: the URL of the basic page, position information of a Document Object Model (DOM) event generating dynamic information in the basic page, and a callback function index corresponding to the DOM event;

and the second grabbing unit is used for grabbing the target page by using the state path generated by the analysis unit.

Wherein, the analysis unit specifically includes: the device comprises a first judgment module, a second judgment module, a first path generation module and a first path determination module;

the first grabbing unit downloads each DOM node in the grabbing process of the basic page and the script thereof, sends the currently downloaded DOM node to the first judging module, and sends a confirmation notice to the first path determining module after finishing the downloading of all the DOM nodes;

the first judging module is used for judging whether the currently downloaded DOM node is a script tag or not, if so, triggering the first grabbing unit to download the next DOM node, and otherwise, sending a judgment notice to the second judging module;

the second judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function, if not, the first capturing unit is triggered to download the next DOM node, and if so, an execution notice is sent to the first path generating module;

the first path generation module is used for generating a state path by using the currently downloaded DOM node after receiving the execution notification, storing the generated state path in a state path queue and triggering the first capture unit to download the next DOM node;

the first path determining module is configured to, when receiving the confirmation notice, trigger the second grabbing unit to obtain the target pages corresponding to each state path in the state path queue one by one, judge, according to the obtaining result of the second grabbing unit, whether new page content is generated or a page jump occurs, and determine each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.

Specifically, the analysis unit may include: the device comprises a third judging module, a fourth judging module, a second path generating module and a second path determining module;

the first grabbing unit downloads each DOM node in the grabbing process of the basic page and the script thereof, and sends the currently downloaded DOM node to the third judging module until the downloading of all DOM nodes is finished;

the third judging module is used for judging whether the currently downloaded DOM node is a script tag or not, if so, the first grabbing unit is triggered to download the next DOM node, and otherwise, a judgment notice is sent to the fourth judging module;

the fourth judging module is used for judging whether the currently downloaded DOM node contains a DOM event and a callback function, if not, the first capturing unit is triggered to download the next DOM node, and if so, an execution notification is sent to the second path generating module;

the second path generating module is configured to generate a state path by using a DOM event included in a currently downloaded DOM node when receiving the execution notification, and send the generated state path to the second path determining module;

and the second path determining module is used for triggering the second capturing unit to obtain a target page corresponding to the state path when the state path is received, judging whether new page content is generated or page jump is generated according to the obtaining result of the second capturing unit, if so, determining that the state path is the state path corresponding to the basic page, and triggering the first capturing unit to download a next DOM node, otherwise, triggering the first capturing unit to download the next DOM node.

Wherein, judging whether the page jump occurs comprises: and if the obtained URL of the target page is different from that of the basic page, determining that page jump occurs.

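Determining whether to generate new page content includes: carrying out sentence signature or character string comparison on the acquired target page and the basic page, and if the comparison result shows that the target page and the basic page have different page contents, determining to generate new page contents; or, calculating the similarity between the acquired target page and the basic page, and determining to generate new page content if the calculation result shows that the target page and the basic page have different page contents.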

Specifically, the location information of the DOM event includes: DOM node identification, path Xpath of DOM node and DOM event identification.

Still further, the apparatus further comprises:

and the storage unit is used for storing the state path corresponding to the basic page generated by the analysis unit and the snapshot of the target page captured by the second capture unit, and establishing and storing the index of the target page.

A search engine, the search engine comprising: the device for acquiring the target page, the user interface unit and the search processing unit are arranged;

the user interface unit is used for receiving a search request from a browser and sending a keyword contained in the search request to the search processing unit; returning the search result sent by the search processing unit to the browser, so that the browser can obtain a corresponding target page by using the state path selected by the user;

and the search processing unit is used for matching the keyword with the index of the target page stored in the storage unit of the device, and sending the state path corresponding to the matched target page to the user interface unit by including the state path in the search result.

Furthermore, the search result further includes: snapshot information of the matched target page;

the user interface unit is also used for sending the snapshot information of the target page selected by the user and returned by the browser to the search processing unit; returning the snapshot of the target page sent by the search processing unit to the browser;

the search processing unit is further configured to obtain a snapshot of the corresponding target page from the storage unit according to the snapshot information of the target page selected by the user, and send the snapshot to the user interface unit.

Still further, the search engine further comprises: a path analysis unit and a network interface unit;

the user interface unit is also used for sending the state path to the path analysis unit after receiving the state path selected by the user and sent by the browser;

the path analysis unit is used for generating a target page request according to the received state path;

and the network interface unit is used for sending the target page request generated by the path analysis unit to a target page site.

A browser, the browser comprising: the system comprises a network side interface unit, a path analysis unit and a user side interface unit;

the network side interface unit is configured to receive a search result including a state path sent by the search engine according to claim 19, and to send the target page request sent by the path analysis unit to a target page site;

the user side interface unit is used for displaying the search result received by the network side interface unit to a user; sending the state path selected by the user to the path analysis unit;

and the path analysis unit is used for generating a target page request according to the state path selected by the user and sending the target page request to the network side interface unit.

According to the technical scheme, the concept of the state path is introduced based on the analysis of the basic page and the script thereof, namely the state path containing the dynamic information corresponding to the basic page is generated, and the target page pointed by the state path contains the dynamic content of the page, so that the subsequent search engine can capture the dynamic content in the page when searching the target page.

Drawings

FIG. 1 is a flowchart of the main method provided by the present invention;

FIG. 2 is a flowchart of a detailed method provided in a first embodiment of the present invention;

FIG. 3 is a flowchart of generating a state path according to a second embodiment of the present invention;

FIG. 4 is a flowchart of generating a state path according to a third embodiment of the present invention;

FIG. 5 is a flowchart of a browser obtaining a target page according to a fourth embodiment of the present invention;

FIG. 6 is a flowchart of a browser obtaining a target page according to a fifth embodiment of the present invention;

FIG. 7 is a flowchart of a browser obtaining a target page snapshot according to a sixth embodiment of the present invention;

FIG. 8 is a schematic diagram of the structure of the apparatus according to the present invention;

FIG. 9 is a schematic diagram of one structure of the analysis unit of FIG. 8;

FIG. 10 is a schematic diagram of another structure of the analysis unit of FIG. 8;

FIG. 11 is a schematic diagram of the structure of a search engine according to the present invention;

FIG. 12 is a schematic diagram of the structure of a browser according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The main method provided by the invention can be shown as figure 1, and comprises the following steps:

Step 101: capturing a basic page corresponding to the received URL and a script of the basic page.

Step 102: analyzing the captured basic page and the script to generate more than one state path containing dynamic information corresponding to the basic page; wherein the state path includes: the URL of the basic page, position information of a Document Object Model (DOM) event generating dynamic information in the basic page, and a callback function index corresponding to the DOM event generating the dynamic information.

Step 103: grabbing the target page by using the generated state path.

The method flow shown in fig. 1 is an operation performed by a search engine, and further, the search engine stores a generated state path, so that after receiving a search request of a browser, a search result including the state path is returned to the browser, so that the browser obtains a corresponding target page by using the state path selected by a user.

The above method is described in detail below by way of specific examples.

Embodiment One

FIG. 2 is a flowchart of a detailed method according to a first embodiment of the present invention. As shown in FIG. 2, the method may specifically include the following steps:

Step 201: the search engine receives the URL.

The search engine may automatically batch grab URLs in the background.

Step 202: and capturing a basic page corresponding to the received URL and a script of the basic page.

The basic page and the script can correspond in either of two ways. First, a script document exists in an HTML tag contained in the source code of the basic page. Second, an HTML (hypertext markup language) tag contained in the source code of the basic page holds a link to a script document; that is, the basic page and the script document are two different documents, but a reference relationship exists between them.
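The two relationships can be told apart while the basic page is parsed. The following is a minimal sketch only, assuming Python's standard `html.parser`; the class name `ScriptCollector` and the sample markup are hypothetical and are not part of the invention.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Separates embedded script text from links to external script documents."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []   # case 1: script text embedded in the page
        self.linked_scripts = []   # case 2: script documents referenced by a link
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.linked_scripts.append(src)
        else:
            self._in_inline = True
            self.inline_scripts.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so it is accumulated.
        if self._in_inline:
            self.inline_scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


# Hypothetical basic page containing one linked and one embedded script.
page = (
    '<html><head><script src="/js/app.js"></script>'
    "<script>function fun1(){}</script></head></html>"
)
collector = ScriptCollector()
collector.feed(page)
collector.close()
```

In this sketch a crawler would still have to download each entry of `linked_scripts` separately before the scripts can be analyzed together with the basic page.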

Step 203: analyzing the DOM nodes downloaded from the captured basic page, judging whether the scripts corresponding to the DOM events in the DOM nodes generate dynamic information, generating, according to the analysis result, more than one state path containing the dynamic information corresponding to the basic page, and acquiring a target page by using the state paths; wherein the state path includes: the URL of the basic page, position information of the DOM event generating dynamic information in the basic page, and a callback function index corresponding to the DOM event generating the dynamic information.

Scripting languages involved in the present invention include, but are not limited to: JavaScript, VBScript, Perl, or Python.

Wherein, the position information of the DOM event may include: DOM node identification, DOM node path (Xpath) and DOM event identification. Wherein, the DOM node identification may be: ID of DOM node or name of DOM node.

And the callback function index in the state path is used for referencing the callback function corresponding to the DOM event. All callback functions in the script are provided with indexes, and the corresponding relation between the indexes and the specific callback functions can be stored through data structures such as a global function table, a mapping function and the like. And querying a data structure containing the corresponding relation between the index and the specific callback function through the callback function index in the state path, so as to obtain the callback function corresponding to the DOM event. The callback function herein may include: anonymous callback functions and non-anonymous callback functions.
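One way to picture the index-to-callback correspondence is a small lookup table. This is a sketch under stated assumptions: the class `CallbackTable`, its method names, and the `anonN` naming scheme for anonymous callbacks are all hypothetical, and the callbacks are held as source strings because they would be compiled by a separate script parsing engine.

```python
class CallbackTable:
    """Maps a callback function index to the source of the callback it refers to."""

    def __init__(self):
        self._by_index = {}
        self._anon_counter = 0

    def register(self, source, name=None):
        """Stores a callback and returns the index to place in a state path."""
        if name is None:
            # Anonymous callbacks have no name, so an index is synthesized.
            self._anon_counter += 1
            index = f"anon{self._anon_counter}"
        else:
            index = name
        self._by_index[index] = source
        return index

    def lookup(self, index):
        """Resolves the index carried by a state path back to the callback."""
        return self._by_index[index]


table = CallbackTable()
named_index = table.register("function fun1() { loadMore(); }", name="fun1")
anon_index = table.register("function () { showTab(2); }")
```

A state path then only needs to carry `named_index` or `anon_index`; the callback itself is recovered with `table.lookup(...)` when the path is followed.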

For a state path, the corresponding target page is acquired after the callback function corresponding to the DOM event is compiled and executed.

The specific implementation of this step is described in detail in Embodiment Two and Embodiment Three.

For a base page, it may correspond to N state paths and to N target pages, where N may be an integer greater than one.

For example, for a base page with a URL of www.baidu.com, the two resulting state paths may be:

{base_url:http://www.baidu.com, id:idsample1, xpath:html/body/a/, event:click, type:new_content, callback:fun1}

{base_url:http://www.baidu.com, id:idsample2, xpath:html/body/li/a/, event:click, type:new_link, callback:fun2}

It should be noted that the present invention does not limit the specific format of the state path, and the above is only one example.
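Since the format is not fixed, the example above can be modelled as a simple record. The sketch below is one possible representation, not the format of the invention: the class `StatePath`, its field names, and the assumption that field values contain no commas are all illustrative choices.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StatePath:
    base_url: str   # URL of the base page
    node_id: str    # DOM node identification
    xpath: str      # path of the DOM node
    event: str      # DOM event identification
    type: str       # new_content or new_link, as in the example
    callback: str   # callback function index

    def serialize(self):
        """Renders the record in the brace-delimited example format."""
        return (
            "{" + f"base_url:{self.base_url}, id:{self.node_id}, "
            f"xpath:{self.xpath}, event:{self.event}, "
            f"type:{self.type}, callback:{self.callback}" + "}"
        )

    @classmethod
    def parse(cls, text):
        """Parses the example format; accepts ASCII or full-width commas."""
        body = text.strip()[1:-1]
        fields = {}
        for item in re.split(r"[,\uFF0C]", body):
            # Split on the first colon only, so URLs keep their own colons.
            key, value = item.split(":", 1)
            fields[key.strip()] = value.strip()
        return cls(
            fields["base_url"], fields["id"], fields["xpath"],
            fields["event"], fields["type"], fields["callback"],
        )


path = StatePath(
    "http://www.baidu.com", "idsample1", "html/body/a/",
    "click", "new_content", "fun1",
)
round_tripped = StatePath.parse(path.serialize())
```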

Step 204: and storing the state path corresponding to the basic page and the target page snapshot corresponding to the state path, and establishing and storing an index of the target page so as to be found by a search engine in the following process and return the index to the browser as a search result.

In this embodiment, the base page and its script captured in step 202 may be stored, the state path generated in step 203 may be stored, and the target page snapshot acquired in step 203 may be stored. The storing of the basic page may specifically include: a base page URL, a base page snapshot, etc.

The process of obtaining the target page by the search engine can be executed periodically or manually. When the state path corresponding to the base page is generated each time, if the stored state path exists, the generated state path corresponding to the base page may be compared with the stored state path corresponding to the base page, and if the state path corresponding to the base page is different from the stored state path corresponding to the base page, the stored state path corresponding to the base page is updated in time.
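The compare-then-update rule described above can be sketched as follows, assuming the stored state paths are kept in a dictionary keyed by base page URL; the function name `refresh_state_paths` and the in-memory `store` are hypothetical stand-ins for whatever storage the search engine uses.

```python
def refresh_state_paths(store, base_url, new_paths):
    """Replaces the stored state paths of a base page only if they changed.

    Returns True when the store was updated, False when nothing differed.
    """
    new_set = frozenset(new_paths)
    if store.get(base_url) == new_set:
        return False
    store[base_url] = new_set
    return True


store = {}
first = refresh_state_paths(store, "http://www.baidu.com", ["path-a", "path-b"])
second = refresh_state_paths(store, "http://www.baidu.com", ["path-b", "path-a"])
third = refresh_state_paths(store, "http://www.baidu.com", ["path-a", "path-c"])
```

Comparing as sets is a design choice in this sketch: it treats a re-crawl that finds the same state paths in a different order as unchanged.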

In addition, the search engine can periodically check whether the target page is updated according to the index of the target page, and update the stored index of the target page in time. Similarly, if the target page snapshot acquired each time is different from the stored target page snapshot, the stored target page snapshot may be replaced with the newly acquired target page snapshot.

The three types of contents stored above may be stored separately or in combination.

The above steps 201 to 204 are all operations of the search engine in the background, and if the search engine receives a search request from the browser, the following steps are continuously executed in the foreground.

Step 205: after receiving a search request from a browser, matching keywords contained in the search request with indexes of all target pages, including a state path corresponding to the matched target page in a search result, and returning the search result to the browser, so that the browser can obtain the corresponding target page by using the state path selected by a user.

When a search engine receives a search request containing a keyword, in addition to the index of a target page participating in matching, the index of a basic page also participates in matching, that is, the basic page is also included in a search result, which is the same as the prior art and is not described in detail again.

Furthermore, the search result may also include snapshot information of the target page, or may also include an index of the target page.
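The matching in step 205 can be illustrated with a toy inverted index that returns state paths rather than plain URLs. This is a simplified sketch: the class `TargetPageIndex`, whitespace tokenization, and the string form of the state paths are assumptions made for illustration only.

```python
from collections import defaultdict


class TargetPageIndex:
    """Toy keyword index whose postings are state paths of target pages."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, state_path, page_text):
        """Indexes the text of a target page under its state path."""
        for word in page_text.lower().split():
            self._postings[word].add(state_path)

    def search(self, query):
        """Returns the state paths whose target pages contain every keyword."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)


index = TargetPageIndex()
index.add("{base_url:http://www.example.com, id:tab1, callback:fun1}", "weather forecast today")
index.add("{base_url:http://www.example.com, id:tab2, callback:fun2}", "sports results today")
results = index.search("today weather")
```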

For how the browser uses the state path selected by the user to obtain the corresponding target page in this step, see Embodiment Four and Embodiment Five.

The manner of generating the state path in step 203 may adopt two manners, i.e., embodiment two and embodiment three.

Embodiment Two

FIG. 3 is a flowchart of generating a state path according to a second embodiment of the present invention. As shown in FIG. 3, the method may specifically include the following steps:

Step 301: downloading each DOM node during the grabbing of the basic page and its script.

Step 302: judging whether downloading of the DOM node is finished, if so, finishing the capturing process of the basic page, and turning to the step 306; otherwise, step 303 is performed on the currently downloaded DOM node.

Step 303: judging whether the currently downloaded DOM node is a script tag, if so, turning to step 302 for the next downloaded DOM node; otherwise, step 304 is performed.

For a node of a script tag, a script corresponding to the script tag may be sent to a script parsing engine for compiling and executing.

Step 304: judging whether the DOM node contains a DOM event and a call-back function, if not, skipping the analysis of the DOM node, and turning to the step 302 for the next downloaded DOM node; if so, step 305 is performed.

If the DOM node does not contain the DOM event and the call-back function, the DOM node does not cause page jump and new page content, namely, page dynamic information is not generated, the DOM node can be skipped over, and if the next DOM node exists, the analysis of the next DOM node is started.

Step 305: generating a state path by using a DOM event contained in the DOM node, and storing the generated state path in a state path queue; go to step 302 for the next downloaded DOM node.

Step 306: acquiring the target pages corresponding to each state path in the state path queue one by one, judging whether new page content is generated or a page jump occurs, and determining each state path that generates new page content or causes a page jump as a state path corresponding to the basic page.

The state paths that generate new page content or cause a page jump are then stored together with their corresponding target pages.

Whether a page jump occurs may be judged as follows: if the URLs of the target page and the basic page are different, it is determined that a page jump occurs. Whether new page content is generated may be judged as follows: sentence signature or character string comparison is carried out on the target page and the basic page, or the similarity of the target page and the basic page is calculated; if the comparison result or the similarity calculation result shows that the target page and the basic page have different page contents, it is determined that new page content is generated. In the comparison of sentence signatures, the sentence signatures may be calculated in an existing manner, such as MD5, which is not limited here.
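The three checks named above can be sketched with the Python standard library. The function names, the sentence-splitting rule, and the 0.9 similarity threshold are assumptions chosen for illustration; the text only states that signatures may use an existing manner such as MD5.

```python
import hashlib
import re
from difflib import SequenceMatcher


def page_jump_occurred(base_url, target_url):
    """A page jump is assumed when the target URL differs from the base URL."""
    return target_url != base_url


def sentence_signatures(text):
    """Returns the set of MD5 signatures of the sentences in a page's text."""
    sentences = [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sentences}


def new_content_by_signature(base_text, target_text):
    """New content exists if the target has a sentence the base page lacks."""
    return bool(sentence_signatures(target_text) - sentence_signatures(base_text))


def new_content_by_similarity(base_text, target_text, threshold=0.9):
    """New content is assumed when similarity falls below the threshold."""
    return SequenceMatcher(None, base_text, target_text).ratio() < threshold


base = "Hello world. Welcome to the site."
target = "Hello world. Welcome to the site. Latest news loaded."
```

With these inputs, `new_content_by_signature(base, target)` reports new content because the third sentence has no matching signature in the base page.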

In the second embodiment, the state paths generated from all the DOM events are first stored in the state path queue. Because the state path of a DOM event does not necessarily generate page dynamic information, some of the queued state paths may be invalid; each state path in the state path queue is therefore judged one by one to determine whether its corresponding target page contains dynamic information. Steps 303 to 305 analyze each DOM node and preliminarily generate state paths, that is, steps 303 to 305 are performed on each downloaded DOM node until all DOM nodes are downloaded, and step 306 is then performed to finally determine the state paths corresponding to the base page.
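The queue-then-verify flow of steps 301 to 306 can be sketched as below. This is a simplified illustration, not the patented implementation: it uses Python's `html.parser`, treats any attribute starting with `on` as a DOM event, omits the Xpath, stores the handler text in place of a callback function index, and replaces script execution with a hypothetical `fake_fetch` stub.

```python
from collections import deque
from html.parser import HTMLParser


class EventNodeScanner(HTMLParser):
    """Steps 303 to 305: skip script tags, queue one candidate path per event."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.queue = deque()

    def handle_starttag(self, tag, attrs):
        if tag == "script":                      # step 303
            return
        node_id = dict(attrs).get("id") or ""
        for name, value in attrs:                # step 304
            if name.startswith("on") and value:
                self.queue.append({              # step 305
                    "base_url": self.base_url,
                    "id": node_id,
                    "event": name[2:],
                    "callback": value,
                })


def confirm_paths(queue, base_url, base_content, fetch_target):
    """Step 306: keep only paths that yield new content or a page jump."""
    confirmed = []
    while queue:
        path = queue.popleft()
        target_url, target_content = fetch_target(path)
        if target_url != base_url or target_content != base_content:
            confirmed.append(path)
    return confirmed


BASE_URL = "http://www.example.com"
BASE_CONTENT = "home"
PAGE = (
    "<body><script>function fun1(){}</script>"
    '<a id="a1" onclick="fun1()">more</a>'
    '<a id="a2" onclick="noop()">same</a><p>text</p></body>'
)


def fake_fetch(path):
    # Stand-in for executing the callback in a script parsing engine.
    if path["callback"] == "fun1()":
        return BASE_URL, BASE_CONTENT + " extra items"
    return BASE_URL, BASE_CONTENT


scanner = EventNodeScanner(BASE_URL)
scanner.feed(PAGE)
scanner.close()
candidate_count = len(scanner.queue)
confirmed = confirm_paths(scanner.queue, BASE_URL, BASE_CONTENT, fake_fetch)
```

Here two candidate state paths are queued, and only the one whose callback changes the page content survives the verification pass.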

Embodiment Three

FIG. 4 is a flowchart of generating a state path according to a third embodiment of the present invention. As shown in FIG. 4, the method may specifically include the following steps:

Step 401: downloading each DOM node during the grabbing of the basic page and its script.

Step 402: judging whether downloading of the DOM node is finished or not, if so, finishing the grabbing process of the basic page; otherwise, step 403 is performed on the currently downloaded DOM node.

Step 403: judging whether the currently downloaded DOM node is a script tag, if so, turning to step 402 for the next downloaded DOM node; otherwise, step 404 is performed.

Step 404: judging whether the DOM node contains a DOM event and a call-back function, if not, skipping the analysis of the DOM node, and turning to the step 402 for the next downloaded DOM node; if so, step 405 is performed.

Step 405: a state path is generated using DOM events in the DOM node.

In this step, a state path may be generated for all DOM events or, more preferably, only for DOM events in a preset DOM event list. The DOM events in the preset DOM event list may include: onclick, ondblclick, onmouseover, onmousemove, onmouseout, onblur, onfocus, onchange, onsubmit, onselect, etc., all of which are DOM events that may generate dynamic information for a page.
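Restricting state path generation to a preset list amounts to a simple filter over a node's attributes. In this sketch the constant `PRESET_DOM_EVENTS` uses the standard DOM attribute spellings of the events listed above, and the function name `events_to_track` and the dictionary representation of a node's attributes are assumptions.

```python
PRESET_DOM_EVENTS = frozenset({
    "onclick", "ondblclick", "onmouseover", "onmousemove", "onmouseout",
    "onblur", "onfocus", "onchange", "onsubmit", "onselect",
})


def events_to_track(node_attrs, preset=PRESET_DOM_EVENTS):
    """Returns (event, handler) pairs of a node that appear in the preset list."""
    return [
        (name.lower(), handler)
        for name, handler in node_attrs.items()
        if name.lower() in preset and handler
    ]


tracked = events_to_track({
    "onclick": "fun1()",   # in the preset list, so a state path is generated
    "onload": "init()",    # not in the preset list, so it is ignored
    "class": "button",     # not an event attribute at all
})
```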

Step 406: Obtain the target page corresponding to the state path and judge whether new page content is generated or a page jump occurs; if so, perform step 407; otherwise, go to step 402 for the next downloaded DOM node.

Step 407: Determine the state path as a state path corresponding to the base page, store the state path and its corresponding target page, and go to step 402 for the next downloaded DOM node.

Unlike the second embodiment, in the third embodiment each state path is checked as soon as it is generated to determine whether its corresponding target page contains dynamic information (step 406); if so, the state path and its corresponding target page are stored. Steps 403 to 407 form the process of analyzing each downloaded DOM node and generating state paths; that is, steps 403 to 407 are performed on each downloaded DOM node until all DOM nodes have been downloaded.

This concludes the flow of the third embodiment.
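The per-node loop of steps 401 to 407 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the names `DomNode`, `StatePath`, `analyze_streaming` and `fetch_target` are hypothetical, a page is modeled as a `(url, content)` pair, and the check in step 406 is reduced to a plain URL and string comparison.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Subset of the preset DOM event list mentioned in step 405.
PRESET_EVENTS = {"onclick", "ondblclick", "onmouseover", "onchange", "onsubmit"}

Page = Tuple[str, str]  # (URL, content); a simplified stand-in for a captured page


@dataclass
class DomNode:
    tag: str
    xpath: str
    # Hypothetical layout: DOM event name -> callback function index.
    events: Dict[str, int] = field(default_factory=dict)


@dataclass(frozen=True)
class StatePath:
    base_url: str
    xpath: str
    event: str
    callback_index: int


def analyze_streaming(base: Page, nodes: List[DomNode],
                      fetch_target: Callable[[StatePath], Page]) -> Dict[StatePath, Page]:
    """Check each state path as soon as it is generated (third embodiment)."""
    kept: Dict[StatePath, Page] = {}
    for node in nodes:                                   # steps 401 and 402
        if node.tag == "script":                         # step 403: skip script tags
            continue
        for event, index in node.events.items():         # step 404: no events, nothing to do
            if event not in PRESET_EVENTS:
                continue
            path = StatePath(base[0], node.xpath, event, index)   # step 405
            target = fetch_target(path)                  # step 406
            jumped = target[0] != base[0]
            changed = target[1] != base[1]
            if jumped or changed:
                kept[path] = target                      # step 407: store path and page
    return kept
```

In this sketch `fetch_target` stands in for whatever component actually executes the callback function and returns the resulting page.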

In the second and third embodiments, the target page corresponding to a state path is obtained, and the judgment of whether new page content is generated or a page jump occurs is made, as follows: the callback function index corresponding to the DOM event is sent to the script parsing engine; the script parsing engine obtains the corresponding callback function according to the index; the target page corresponding to the state path is obtained from the result of compiling and executing that callback function; and it is then judged whether new page content is generated or a page jump occurs. For an anonymous function, the script parsing engine compiles and executes the callback function in real time after obtaining it; for a non-anonymous function, the script parsing engine may reuse the earlier compilation and execution result of the callback function.
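The different handling of anonymous and non-anonymous callback functions can be pictured as a lookup by index with a result cache keyed by function name. The sketch below is an assumption-laden illustration: `ScriptEngineSketch` is a hypothetical class, callbacks are modeled as ordinary Python callables, and "compile and execute" is reduced to calling them.

```python
from typing import Callable, Dict, Optional, Tuple


class ScriptEngineSketch:
    """Toy stand-in for a script parsing engine (hypothetical, for illustration)."""

    def __init__(self, callbacks: Dict[int, Tuple[Optional[str], Callable[[], str]]]):
        # callback function index -> (function name, or None if anonymous; callable)
        self._callbacks = callbacks
        self._cache: Dict[str, str] = {}
        self.executions = 0  # counts how often a callback was actually executed

    def run(self, index: int) -> str:
        name, func = self._callbacks[index]
        # Non-anonymous function: reuse an earlier result when one exists.
        if name is not None and name in self._cache:
            return self._cache[name]
        # Anonymous function, or first use of a named one: execute now.
        self.executions += 1
        result = func()
        if name is not None:
            self._cache[name] = result
        return result
```

Caching by name is only one possible reading of "reusing the earlier result"; a real engine would also have to account for callbacks whose result depends on the current page state.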

The way the browser obtains the target page using the state path falls into two cases, depending on whether the browser can parse state paths; the two cases are described in the fourth and fifth embodiments respectively.

Embodiment IV

When the browser can parse state paths, the corresponding flow is shown in Fig. 5 and includes the following steps:

Step 501: The browser sends a search request (Query) containing a keyword to the search engine.

Step 502: The search engine executes step 205 and returns search results containing state paths to the browser.

Step 503: The browser sends a target page request to the target page site according to the state path selected by the user.

When the user clicks the state path of a target page, the browser parses the clicked state path and sends a target page request to the target page site according to it.

Step 504: The target page site pushes the target page to the browser.

Embodiment V

When the browser cannot parse state paths, the corresponding flow is shown in Fig. 6 and includes the following steps:

Step 601: The browser sends a search request containing a keyword to the search engine.

Step 602: The search engine executes step 205 and returns search results containing state paths to the browser.

Step 603: The browser sends the state path selected by the user to the search engine.

Step 604: The search engine sends a target page request to the target page site according to the state path selected by the user.

Because the browser cannot parse state paths, it only forwards the state path selected by the user to the search engine; the search engine parses the state path and sends the target page request to the target page site accordingly.

Step 605: The target page site pushes the target page to the browser.

The target page request sent by the search engine contains browser information, so that the target page site can push the target page to the browser.

This concludes the flow of the fifth embodiment.
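The difference between the fourth and fifth embodiments is only who parses the state path and issues the target page request. The following sketch traces the messages of both flows; the function name `target_page_flow`, the party names and the payload strings are all hypothetical labels chosen for illustration.

```python
from typing import List, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, payload)


def target_page_flow(state_path: str, browser_parses_paths: bool) -> List[Message]:
    """Return the message sequence for obtaining a target page."""
    trace: List[Message] = [
        ("browser", "search_engine", "query"),                # step 501 / 601
        ("search_engine", "browser", "results+state_paths"),  # step 502 / 602
    ]
    if browser_parses_paths:
        # Fourth embodiment: the browser parses the path itself (step 503).
        trace.append(("browser", "target_site", f"request:{state_path}"))
    else:
        # Fifth embodiment: the search engine parses the path on the
        # browser's behalf and includes browser information (steps 603, 604).
        trace.append(("browser", "search_engine", f"state_path:{state_path}"))
        trace.append(("search_engine", "target_site",
                      f"request:{state_path};reply_to=browser"))
    trace.append(("target_site", "browser", "target_page"))   # step 504 / 605
    return trace
```

In both cases the target page site delivers the page directly to the browser; the fifth embodiment adds one extra hop through the search engine.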

In another case, if the search result returned by the search engine in step 205 of the first embodiment includes target page snapshot information and the user clicks a target page snapshot, the browser and the search engine may interact as described in the sixth embodiment.

Embodiment VI

Fig. 7 is a flowchart of a browser obtaining a target page snapshot according to the sixth embodiment. As shown in Fig. 7, the method may include the following steps:

Step 701: The browser sends a search request containing a keyword to the search engine.

Step 702: The search engine executes step 205 and returns search results containing state paths and target page snapshot information to the browser.

Step 703: The browser sends the target page snapshot information selected by the user to the search engine.

Step 704: The search engine determines the corresponding target page snapshot and returns it to the browser.

Because the search engine stores all target page snapshots locally, no interaction with the target page site is needed; the corresponding target page snapshot is read directly from local storage and returned to the browser.

The above is a detailed description of the method provided by the present invention. The apparatus for acquiring a target page provided by the present invention is described in detail below. As shown in Fig. 8, the apparatus may include: a first capture unit 800, an analysis unit 810, and a second capture unit 820.

The first capture unit 800 is configured to capture the base page corresponding to a received URL and the script of the base page.

The analysis unit 810 is configured to analyze the base page and the script captured by the first capture unit 800 and generate one or more state paths, containing dynamic information, that correspond to the base page. Each state path includes: the URL of the base page, position information of the DOM event that generates dynamic information in the base page, and the callback function index corresponding to the DOM event.

The second capture unit 820 is configured to capture the target page using the state path generated by the analysis unit 810.

The analysis unit 810 may adopt either of two structures. The first structure, shown in Fig. 9, specifically includes: a first judging module 811, a second judging module 812, a first path generating module 813, and a first path determining module 814.

The first capture unit 800 downloads each DOM node in the process of capturing the base page and its script, and sends the currently downloaded DOM node to the first judging module 811; once all DOM nodes have been downloaded, it sends a determination notification to the first path determining module 814.

The first judging module 811 is configured to judge whether the currently downloaded DOM node is a script tag; if so, to trigger the first capture unit 800 to download the next DOM node; otherwise, to send a judgment notification to the second judging module 812.

The second judging module 812 is configured to judge whether the currently downloaded DOM node contains a DOM event and a callback function; if not, to trigger the first capture unit 800 to download the next DOM node; if so, to send an execution notification to the first path generating module 813.

The first path generating module 813 is configured to, after receiving the execution notification, generate a state path using the DOM event contained in the currently downloaded DOM node, store the generated state path in the state path queue, and trigger the first capture unit 800 to download the next DOM node.

The first path determining module 814 is configured to, on receiving the determination notification, trigger the second capture unit 820 to obtain, one by one, the target pages corresponding to the state paths in the state path queue, judge from the result obtained by the second capture unit 820 whether new page content is generated or a page jump occurs, and determine each state path that generates new page content or a page jump as a state path corresponding to the base page.
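The first structure works in two passes: state paths are first collected into a queue, and only afterwards checked one by one. A minimal sketch of that two-pass arrangement follows. The functions `collect_paths` and `filter_paths` are hypothetical names, a DOM node is modeled as a `(tag, xpath, events)` tuple, and a page as a `(url, content)` pair; none of these layouts come from the source.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple

StatePath = Tuple[str, str, str, int]    # (base URL, node XPath, event name, callback index)
Page = Tuple[str, str]                   # (URL, content)
Node = Tuple[str, str, Dict[str, int]]   # (tag, XPath, {event name: callback index})


def collect_paths(base_url: str, nodes: Iterable[Node]) -> Deque[StatePath]:
    """First pass: queue a candidate state path for every DOM event found."""
    queue: Deque[StatePath] = deque()
    for tag, xpath, events in nodes:
        if tag == "script":                      # role of judging module 811
            continue
        for event, index in events.items():      # role of judging module 812
            queue.append((base_url, xpath, event, index))   # role of module 813
    return queue


def filter_paths(base: Page, queue: Deque[StatePath],
                 fetch_target: Callable[[StatePath], Page]) -> List[StatePath]:
    """Second pass: keep only paths whose target page differs from the base page."""
    valid: List[StatePath] = []
    while queue:                                 # role of determining module 814
        path = queue.popleft()
        url, content = fetch_target(path)
        if url != base[0] or content != base[1]:
            valid.append(path)
    return valid
```

Separating collection from validation keeps the download loop simple, at the cost of holding every candidate path in the queue until the whole base page has been processed.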

The second structure of the analysis unit 810, shown in Fig. 10, may specifically include: a third judging module 911, a fourth judging module 912, a second path generating module 913, and a second path determining module 914.

The first capture unit 800 downloads each DOM node in the process of capturing the base page and its script, and sends the currently downloaded DOM node to the third judging module 911, until all DOM nodes have been downloaded.

The third judging module 911 is configured to judge whether the currently downloaded DOM node is a script tag; if so, to trigger the first capture unit 800 to download the next DOM node; otherwise, to send a judgment notification to the fourth judging module 912.

The fourth judging module 912 is configured to judge whether the currently downloaded DOM node contains a DOM event and a callback function; if not, to trigger the first capture unit 800 to download the next DOM node; if so, to send an execution notification to the second path generating module 913.

The second path generating module 913 is configured to, on receiving the execution notification, generate a state path using the DOM event contained in the currently downloaded DOM node and send the generated state path to the second path determining module 914.

The second path determining module 914 is configured to, on receiving the state path, trigger the second capture unit 820 to obtain the target page corresponding to the state path and judge from the result obtained by the second capture unit 820 whether new page content is generated or a page jump occurs; if so, to determine the state path as a state path corresponding to the base page. In either case, it then triggers the first capture unit 800 to download the next DOM node.

Specifically, in both structures, judging whether a page jump occurs may include: if the URL of the obtained target page differs from the URL of the base page, determining that a page jump has occurred.

Judging whether new page content is generated may include: comparing the obtained target page with the base page by sentence signature or by character string, and determining that new page content is generated if the comparison shows that the two pages have different content; or calculating the similarity between the obtained target page and the base page, and determining that new page content is generated if the calculation shows that the two pages have different content.
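The three checks described above (URL comparison, sentence signature comparison, and similarity calculation) might look roughly like the sketch below. The source does not specify a signature algorithm or a similarity measure, so the choices here are assumptions: SHA-256 hashes of punctuation-delimited sentences serve as signatures, `difflib.SequenceMatcher` supplies the similarity, and the 0.9 threshold is arbitrary.

```python
import hashlib
import re
from difflib import SequenceMatcher
from typing import Set


def page_jumped(base_url: str, target_url: str) -> bool:
    """A page jump is assumed whenever the two URLs differ."""
    return base_url != target_url


def sentence_signatures(text: str) -> Set[str]:
    """Hash each sentence; splitting on basic punctuation is a simplification."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences}


def has_new_content_by_signature(base_text: str, target_text: str) -> bool:
    """New content exists if the target has a sentence the base page lacks."""
    return bool(sentence_signatures(target_text) - sentence_signatures(base_text))


def has_new_content_by_similarity(base_text: str, target_text: str,
                                  threshold: float = 0.9) -> bool:
    """New content is assumed when similarity falls below the (arbitrary) threshold."""
    return SequenceMatcher(None, base_text, target_text).ratio() < threshold
```

Signature comparison tolerates reordered sentences, while the similarity check is more sensitive to small edits; which one suits a given crawler depends on how noisy the pages are.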

The position information of the DOM event in the state path includes: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.
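Taken together, a state path consists of the base page URL, the three-part position information, and the callback function index. One possible in-memory representation, with a string encoding that a browser or search engine could parse back, is sketched below. The field names and the `#`-fragment encoding are purely illustrative assumptions; the source does not define a serialization format.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode


@dataclass(frozen=True)
class EventPosition:
    node_id: str   # DOM node identifier
    xpath: str     # XPath of the DOM node
    event: str     # DOM event identifier, e.g. "onclick"


@dataclass(frozen=True)
class StatePath:
    base_url: str
    position: EventPosition
    callback_index: int

    def encode(self) -> str:
        """Serialize as base URL plus an encoded fragment (hypothetical format)."""
        fragment = urlencode({
            "node": self.position.node_id,
            "xpath": self.position.xpath,
            "event": self.position.event,
            "cb": self.callback_index,
        })
        return f"{self.base_url}#{fragment}"

    @classmethod
    def decode(cls, text: str) -> "StatePath":
        """Parse a string produced by encode()."""
        base_url, _, fragment = text.rpartition("#")
        fields = dict(parse_qsl(fragment, keep_blank_values=True))
        position = EventPosition(fields["node"], fields["xpath"], fields["event"])
        return cls(base_url, position, int(fields["cb"]))
```

For example, `StatePath.decode(path.encode()) == path` holds for any path built this way, because `urlencode` percent-encodes the `/`, `[` and `]` characters that appear in XPath expressions.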

Furthermore, the apparatus may further include:

a storage unit 830, configured to store the state paths corresponding to the base page generated by the analysis unit 810 and the snapshot of the target page captured by the second capture unit 820, and to establish and store an index of the target page.

In addition, the storage unit 830 stores the base page captured by the first capture unit 800; the base page, the state paths, and the snapshot of the target page may be stored separately or together.

Fig. 11 is a schematic structural diagram of a search engine provided by the present invention. As shown in Fig. 11, the search engine includes: the apparatus shown in Fig. 8, a user interface unit 1101, and a search processing unit 1102.

The user interface unit 1101 is configured to receive a search request from a browser and send the keyword contained in the search request to the search processing unit 1102, and to return the search result sent by the search processing unit 1102 to the browser, so that the browser can obtain the corresponding target page using the state path selected by the user.

The search processing unit 1102 is configured to match the keyword against the index of target pages stored in the storage unit 830, include the state path corresponding to each matched target page in the search result, and send the search result to the user interface unit 1101.

Preferably, the search result may further include snapshot information of the matched target page. In this case:

The user interface unit 1101 is further configured to send the snapshot information of the target page selected by the user, as returned by the browser, to the search processing unit 1102, and to return the snapshot of the target page sent by the search processing unit 1102 to the browser.

The search processing unit 1102 is further configured to obtain the snapshot of the corresponding target page from the storage unit 830 according to the snapshot information of the target page selected by the user, and send the snapshot to the user interface unit 1101.
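The interplay between the storage unit 830 and the search processing unit 1102 can be pictured as a keyword index over target pages plus a snapshot store. The sketch below is a deliberately naive illustration: `SearchStoreSketch` is a hypothetical class, state paths are plain strings, the index is built by whitespace tokenization, and snapshot identifiers are generated sequentially.

```python
from typing import Dict, List, Set


class SearchStoreSketch:
    """Toy model of a target page index with local snapshots (illustrative only)."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = {}    # keyword -> state paths
        self.snapshots: Dict[str, str] = {}     # snapshot id -> stored page text
        self.snapshot_ids: Dict[str, str] = {}  # state path -> snapshot id

    def add(self, state_path: str, page_text: str) -> None:
        """Store a target page snapshot and index its words under the state path."""
        snapshot_id = f"snap-{len(self.snapshots)}"
        self.snapshots[snapshot_id] = page_text
        self.snapshot_ids[state_path] = snapshot_id
        for word in page_text.lower().split():
            self.index.setdefault(word, set()).add(state_path)

    def search(self, keyword: str) -> List[Dict[str, str]]:
        """Return state paths and snapshot information for matching target pages."""
        paths = sorted(self.index.get(keyword.lower(), set()))
        return [{"state_path": p, "snapshot": self.snapshot_ids[p]} for p in paths]

    def get_snapshot(self, snapshot_id: str) -> str:
        """Serve a snapshot from local storage, without contacting the target site."""
        return self.snapshots[snapshot_id]
```

A search result entry here carries both the state path and the snapshot information, so the browser can either request the live target page or ask the search engine for the stored snapshot.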

Furthermore, when the browser cannot parse state paths, the search engine needs this capability in order to help push the target page to the browser. In this case, the search engine may further include: a path parsing unit 1103 and a network interface unit 1104.

The user interface unit 1101 is further configured to, after receiving the state path selected by the user and sent by the browser, send the state path to the path parsing unit 1103.

The path parsing unit 1103 is configured to generate a target page request according to the received state path.

The network interface unit 1104 is configured to send the target page request generated by the path parsing unit 1103 to the target page site.

Fig. 12 is a schematic structural diagram of a browser with a state path parsing function. As shown in Fig. 12, the browser may include: a network side interface unit 1201, a path parsing unit 1202, and a user side interface unit 1203.

The network side interface unit 1201 is configured to receive the search result containing state paths sent by the search engine shown in Fig. 11, and to send the target page request generated by the path parsing unit 1202 to the target page site.

The user side interface unit 1203 is configured to display the search result received by the network side interface unit 1201 to the user, and to send the state path selected by the user to the path parsing unit 1202.

The path parsing unit 1202 is configured to generate a target page request according to the state path selected by the user and send the target page request to the network side interface unit 1201.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for obtaining a target page is characterized by comprising the following steps:

2. The method according to claim 1, wherein step B specifically comprises:

3. The method according to claim 1, wherein step B specifically comprises:

4. The method of claim 2 or 3, wherein determining whether a page jump has occurred comprises: if the URL of the obtained target page is different from that of the base page, determining that a page jump has occurred.

5. The method of claim 2 or 3, wherein determining whether to generate new page content comprises: carrying out sentence signature or character string comparison on the acquired target page and the basic page, and if the comparison result shows that the target page and the basic page have different page contents, determining to generate new page contents; or,

6. A method according to any of claims 1 to 3, wherein the position information of the DOM event comprises: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.

7. A method according to any one of claims 1 to 3, characterized in that after said step B, the method further comprises:

8. A method for obtaining a target page, the method according to claim 7 being followed by:

9. The method of claim 8, wherein the search results further comprise: snapshot information of the matched target page;

10. The method of claim 8, wherein after including the status path corresponding to the matched target page in the search result and returning the search result to the browser, the method further comprises:

11. A method for obtaining a target page is characterized by comprising the following steps:

receiving a target page pushed by the target page site;

12. An apparatus for obtaining a target page, the apparatus comprising:

13. The apparatus according to claim 12, wherein the analysis unit specifically comprises: a first judging module, a second judging module, a first path generating module, and a first path determining module;

14. The apparatus according to claim 12, wherein the analysis unit specifically comprises: a third judging module, a fourth judging module, a second path generating module, and a second path determining module;

15. The apparatus of claim 13 or 14, wherein determining whether a page jump occurs comprises: if the URL of the obtained target page is different from that of the base page, determining that a page jump has occurred.

16. The apparatus of claim 13 or 14, wherein determining whether to generate new page content comprises: carrying out sentence signature or character string comparison on the acquired target page and the basic page, and if the comparison result shows that the target page and the basic page have different page contents, determining to generate new page contents; or,

17. The apparatus according to any of claims 12 to 14, wherein the position information of the DOM event comprises: a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.

18. The apparatus of any one of claims 12 to 14, further comprising:

19. A search engine, comprising: the apparatus, user interface unit, and search processing unit of claim 18;

20. The search engine of claim 19, further comprising, in the search results: snapshot information of the matched target page;

21. The search engine of claim 19, further comprising: a path parsing unit and a network interface unit;

22. A browser, comprising: a network side interface unit, a path parsing unit, and a user side interface unit;