CN103838747A - Network service construction method and device and webpage data extraction method and device - Google Patents

Info

Publication number
CN103838747A
Authority
CN
China
Prior art keywords
parameter
xpath
service
code
http
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210479166.7A
Other languages
Chinese (zh)
Other versions
CN103838747B (en)
Inventor
邹纲
皮冰锋
张军
钟朝亮
于浩
松尾昭彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201210479166.7A
Publication of CN103838747A
Application granted
Publication of CN103838747B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a network service construction method and device and a web page data extraction method and device. The network service construction method comprises: collecting data related to access to a deep web page, the data including HTTP messages, JS events and stack snapshots; searching the collected HTTP messages for service-related parameters and dividing the parameters found into user input parameters, explicit parameters and implicit parameters; constructing a first XPath that can generate the explicit parameters; constructing JS code that can generate the implicit parameters; constructing a second XPath that can generate the final deep web page return result; and constructing, according to the order of the JS events, the stack snapshots and the observed HTTP messages, a structure that represents the internal flow of the service. The network service is composed of the user input parameters, the JS code, the first XPath, the second XPath and the structure representing the internal flow of the service.

Description

Network service construction method and apparatus, and web page data extraction method and apparatus
Technical field
The present invention relates generally to data extraction techniques for deep web pages and to the construction of related network services. In particular, the present invention relates to a method and apparatus for constructing a network service that extracts data from deep web pages, and to a corresponding web page data extraction method and apparatus.
Background art
In recent years, with the development of the Internet and its applications, the amount of useful information that can be obtained and used on the network has grown geometrically, greatly increasing the sources from which people obtain information. Web pages can be roughly divided into surface web pages and deep web pages. Surface web pages are easy for users to use and can also provide APIs that facilitate machine processing. Deep web pages are readable by users but unfriendly to machines, and seldom have corresponding APIs. Therefore, data extraction methods have been proposed that encapsulate access to deep web pages as a network service (Web Service). This makes it easier for machines to access deep web pages and allows more and higher-level network services to be built on top of such a service, for example by integrating existing network services to provide a new one.
However, with the development of the network, JavaScript (hereinafter JS) has come to be widely used in web pages. The resulting problem is that, in deep web page access, many login or query parameters related to the access may be the dynamic results of executing JS code, and traditional data extraction techniques have difficulty handling deep web page access that involves the dynamic behavior of JS code. It is therefore difficult to construct a corresponding network service to extract the data of such deep web pages.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description given later.
In view of the above problems of the prior art, an object of the present invention is to provide a network service construction method and apparatus and a web page data extraction method and apparatus. According to the technical solution of the present invention, deep web page access that involves the dynamic behavior of JS code can be constructed as a network service, and the data of the deep web page can be extracted by using that network service.
To achieve the above object, according to one aspect of the present invention, a network service construction method is provided, comprising: collecting data related to access to a deep web page, the data comprising HTTP messages, JS events and stack snapshots; searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters and implicit parameters; constructing a first XPath that can generate the explicit parameters; constructing JS code that can generate the implicit parameters; constructing a second XPath that can generate the final deep web page return result; and constructing, according to the order of the JS events, the stack snapshots and the observed HTTP messages, a structure that represents the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths and the structure representing the internal flow of the service constitute the network service.
According to another aspect of the present invention, a data extraction method for deep web pages is provided, comprising: constructing an HTTP request according to the user's input parameters; obtaining the HTTP response corresponding to the HTTP request; according to the obtained HTTP response, generating explicit parameters with a first XPath and generating implicit parameters with JS code; constructing an HTTP request according to at least one of the user input parameters, the explicit parameters and the implicit parameters; obtaining the HTTP response corresponding to the HTTP request; and repeating the above generating, constructing and obtaining steps according to a structure representing the internal flow of the service, until the final deep web page return result is generated from the obtained HTTP response with a second XPath.
According to a further aspect of the present invention, a network service construction apparatus is provided, comprising: a collecting device configured to collect data related to access to a deep web page, the data comprising HTTP messages, JS events and stack snapshots; a parameter search device configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters and implicit parameters; a first XPath constructing device configured to construct a first XPath that can generate the explicit parameters; a JS code constructing device configured to construct JS code that can generate the implicit parameters; a second XPath constructing device configured to construct a second XPath that can generate the final deep web page return result; and a structure constructing device configured to construct, according to the order of the JS events, the stack snapshots and the observed HTTP messages, a structure representing the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths and the structure representing the internal flow of the service constitute the network service.
According to a further aspect of the present invention, a data extraction apparatus for deep web pages is provided, comprising: a first constructing device for constructing an HTTP request according to the user's input parameters; an obtaining device for obtaining the HTTP response corresponding to an HTTP request; a parameter generating device for generating, according to the obtained HTTP response, explicit parameters with a first XPath and implicit parameters with JS code; a second constructing device for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters and the implicit parameters; and a control device for instructing, according to a structure representing the internal flow of the service, the parameter generating device, the second constructing device and the obtaining device to perform their operations until the final deep web page return result is generated from the obtained HTTP response with a second XPath.
In addition, according to another aspect of the present invention, a storage medium is provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above methods according to the present invention.
In addition, according to yet another aspect of the present invention, a program product is provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above methods according to the present invention.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will be more easily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals. In the drawings:
Fig. 1 shows a flowchart of the network service construction method according to an embodiment of the present invention;
Fig. 2 shows a flowchart of the deep web page data extraction method according to an embodiment of the present invention;
Fig. 3 shows a block diagram of the network service construction apparatus according to an embodiment of the present invention;
Fig. 4 shows a block diagram of the deep web page data extraction apparatus according to an embodiment of the present invention; and
Fig. 5 shows a schematic block diagram of a computer that can be used to implement the methods and apparatuses according to embodiments of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. In addition, it should be appreciated that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, in order to avoid obscuring the present invention with unnecessary detail, only the apparatus structures and/or processing steps closely related to the solution according to the invention are shown in the drawings, and other details of little relevance to the invention are omitted. It should further be pointed out that elements and features described in one drawing or one embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.
As mentioned above, the problem faced by the conventional techniques is that more and more deep web page access involves the dynamic execution of JS code. To solve this problem, the JS code must be executed dynamically in the corresponding network service.
In general, the process of accessing a deep web page includes, for example, logging in, querying and displaying results. At the bottom layer, this process is a series of HTTP messages, such as HTTP requests and the corresponding HTTP responses, and it may involve the dynamic execution of JS code. For most deep web page access, the process is fixed and only the parameter values differ. The data of a deep web page can therefore be obtained by reproducing this series of steps: sending an HTTP request, extracting parameters from the HTTP response obtained, and constructing the next HTTP request, until the HTTP message containing the final return result is obtained.
Therefore, a network service capable of extracting deep web page data can be constructed by collecting observation data, constructing a pattern, and encapsulating it as a network service.
The flow of the network service construction method according to an embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 shows a flowchart of the network service construction method according to an embodiment of the present invention. As shown in Fig. 1, the network service construction method according to the invention comprises the following steps: collecting data related to access to a deep web page, the data comprising HTTP messages, JS events and stack snapshots (step S1); searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters and implicit parameters (step S2); constructing a first XPath that can generate the explicit parameters (step S3); constructing JS code that can generate the implicit parameters (step S4); constructing a second XPath that can generate the final deep web page return result (step S5); and constructing, according to the order of the JS events, the stack snapshots and the observed HTTP messages, a structure representing the internal flow of the service (step S6). The user input parameters, the JS code, the first and second XPaths and the structure representing the internal flow of the service thus constitute the network service.
In step S1, data related to access to a deep web page are collected, the data comprising HTTP messages, JS events and stack snapshots.
As mentioned above, the aim is to access the data of the deep web page by reproducing HTTP messages. Therefore, the HTTP message sequence is first collected and analyzed.
A user or a window may trigger JavaScript events. A DOM tree node can receive JS events such as keystrokes and mouse presses from the user interface. Once a window is ready, it produces an onLoad event. These JS events can be captured by, for example, a Firefox plug-in. The events are recorded in the structure representing the internal flow of the service and are later triggered in order by that structure.
At the bottom layer, the execution of all JS code is a process of function calls. Correspondingly, there is a JS stack that records the pushing and popping of functions. By using, for example, the JDS (JavaScript debug service) provided by Firefox, a stack snapshot can be captured whenever the JS stack changes. These stack snapshots make it possible to trace the JS activity, which can thus be described as a history of JS functions being pushed onto and popped off the stack. From this history, the execution order and the operations of the JS code can be obtained.
It should be noted that the above process may use, but does not depend on, a Firefox plug-in; any tool that helps obtain JS events and stack snapshots can be used here.
In step S1, after the above collection, the collected data may also be classified by page.
Access to a deep web page can be regarded as a process of generating a series of web pages: the next web page is generated from the current one, and so on, until the last web page is produced. In addition, the JS execution environment is based on the web page after all of its JS code has been completely loaded. The collected data, however, are organized in chronological order, with no obvious boundaries between web pages. The collected data stream is therefore divided into page units, each of which contains one page and all the JS events, HTTP messages and JS stack snapshots that occur on that page. The time at which the window.onload event occurs can be used as the page boundary for dividing the collected data stream by page.
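The page-unit division can be illustrated with a minimal Python sketch. The `Record` type, its `kind` labels and the choice that a record stamped exactly at an onload time belongs to the unit opened by that onload are assumptions made for illustration; the description only states that the window.onload time serves as the page boundary.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Record:
    """One collected item; kind is 'http', 'js_event' or 'stack' (hypothetical labels)."""
    timestamp: float
    kind: str
    payload: str


def split_into_page_units(records, onload_times):
    """Group time-ordered records into page units bounded by window.onload times."""
    boundaries = sorted(onload_times)
    units = [[] for _ in range(len(boundaries) + 1)]
    for rec in sorted(records, key=lambda r: r.timestamp):
        # bisect_right counts the onload boundaries at or before this record,
        # which is the index of the page unit the record falls into.
        units[bisect_right(boundaries, rec.timestamp)].append(rec)
    return units


records = [
    Record(0.5, "http", "GET /login"),
    Record(1.2, "js_event", "keydown"),
    Record(2.5, "http", "POST /query"),
]
units = split_into_page_units(records, onload_times=[1.0, 2.0])
```

With two onload times, the three records land in three separate page units.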
In step S1, after the above classification, the data of pages unrelated to the user's operations may also be removed.
During access to a deep web page, many HTTP messages are produced, but not all of them are essential for extracting the deep web page data. There is therefore no need to track every parameter that appears. It suffices to analyze the user's behavior and to retain for analysis only the data related to the user's operations.
In deep web page access, the user's intention can be detected through the following JS events. For example, a user login is always accompanied by a series of keystrokes on Input elements in the HTML and by a button click. Likewise, a user query is always accompanied by some key events. In addition, a User Data Annotator is provided, for example in the Firefox browser window. A page that has been annotated is obviously one that the user wants.
Therefore, the JS events and the User Data Annotator can be used to select the pages related to the user's operations, thereby determining the pages unrelated to the user's operations, and the data of those unrelated pages are removed. The pages related to the user's operations comprise the pages on which JS events occur, the pages that follow them, and the pages marked by the User Data Annotator.
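The selection rule just described can be sketched as follows. Pages are identified here by their index in the page sequence, which is an assumption for illustration, as is the function name.

```python
def select_relevant_pages(page_count, pages_with_js_events, annotated_pages):
    """Return the sorted indices of pages related to the user's operations.

    A page is kept if a JS event occurred on it, if it directly follows
    such a page, or if the user annotated it.
    """
    relevant = set()
    for idx in pages_with_js_events:
        relevant.add(idx)
        if idx + 1 < page_count:
            relevant.add(idx + 1)  # the page generated by the user's action
    relevant.update(p for p in annotated_pages if 0 <= p < page_count)
    return sorted(relevant)


kept = select_relevant_pages(6, pages_with_js_events={0, 2}, annotated_pages={5})
```

In this example page 4 is neither the target of a JS event, nor the successor of such a page, nor annotated, so its data would be removed.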
In step S2, the collected HTTP messages are searched for service-related parameters, and the parameters found are divided into user input parameters, explicit parameters and implicit parameters.
Service-related parameters can be searched for in the following locations of an HTTP message: the URL query part, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data corresponding to an XML HTTP request.
Taking the URL query part as an example, in http://translate.google.com.hk/translate_a/t?client=t&text=home, client=t&text=home is the parameter string being passed.
In the browser, JS can send parameters to the server by manipulating the cookie field.
Parameters are normally expressed as name-value pairs. An HTTP request can therefore be restored directly from the corresponding parameters, and likewise a URL can be restored from the corresponding parameters.
Therefore, by searching the collected HTTP messages for service-related parameters, the HTTP messages can later be reproduced from these parameters, so that the deep web page access can be reproduced with these HTTP messages and the deep web page data can be extracted.
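A minimal sketch of the parameter search and of restoring a URL from its name-value pairs, using only the Python standard library. The function names and the cookie and POST values are hypothetical; the URL is the one given in the description.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit


def find_parameters(url, cookie_header="", post_body=""):
    """Collect candidate parameters from the locations listed above."""
    parts = urlsplit(url)
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return {
        "query": parse_qsl(parts.query, keep_blank_values=True),
        "path": [seg for seg in parts.path.split("/") if seg],
        "cookie": [(name, morsel.value) for name, morsel in cookie.items()],
        "post": parse_qsl(post_body, keep_blank_values=True),
    }


def rebuild_url(base, query_pairs):
    """Restore a URL from its base and its name-value pairs."""
    return base + "?" + urlencode(query_pairs)


URL = "http://translate.google.com.hk/translate_a/t?client=t&text=home"
found = find_parameters(URL, cookie_header="sid=abc123", post_body="q=1")
restored = rebuild_url("http://translate.google.com.hk/translate_a/t", found["query"])
```

Because the parameters are kept as ordered name-value pairs, the rebuilt URL is identical to the observed one in this example.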
The parameters found can be divided into three classes: user input parameters, explicit parameters and implicit parameters.
A user input parameter is a user input of the future network service.
An explicit parameter can be extracted from the original page by using an XPath.
An implicit parameter cannot be extracted from the original page with an XPath as an explicit parameter can; it is the result of executing JS code and can therefore only be obtained by executing JS code.
Therefore, in step S3, a first XPath that can generate the explicit parameters is constructed.
Constructing, from the web page data, a first XPath that can generate an explicit parameter, and using that first XPath to generate the explicit parameter, are both techniques well known to those skilled in the art and are not described in detail here.
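For readers unfamiliar with the technique, the following sketch shows an explicit parameter being read from a page with an XPath expression. It uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so a real deployment would likely need an HTML-tolerant parser. The page content, the parameter name `token` and the expression are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical, well-formed page containing a hidden form field.
PAGE = """<html><body>
<form action="/query">
  <input type="hidden" name="token" value="a1b2c3"/>
  <input type="text" name="keyword"/>
</form>
</body></html>"""

# Hypothetical first XPath locating the node that carries the parameter.
FIRST_XPATH = ".//input[@name='token']"


def extract_explicit_parameter(page_source, xpath, attribute="value"):
    """Return the attribute of the first node matched by xpath, or None."""
    root = ET.fromstring(page_source)
    node = root.find(xpath)
    return None if node is None else node.get(attribute)


token = extract_explicit_parameter(PAGE, FIRST_XPATH)
```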
In step S4, JS code that can generate the implicit parameters is constructed. In essence, the purpose of step S4 is to inspect and remove unnecessary JS code, simplify the JS code, and replace browser objects with JS objects so that the JS code can be executed outside the browser environment.
Specifically, the JS code that can generate the implicit parameters can be constructed as follows.
It should be noted that the JS code is constructed from the data collected for each page. As mentioned above, the execution environment of JS code is a page. The JS code to be simplified can be obtained from the HTML source code and the JS messages.
In step S41, JS activity history information is obtained from the stack snapshots.
As mentioned above, the stack snapshots reveal the calling process of the JS functions and thus the execution order of the JS code. The stack snapshots are converted into a history of functions being pushed onto and popped off the stack, which serves as the JS activity history information.
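One way to derive such a history is to compare consecutive snapshots. The sketch below assumes each snapshot is a list of function names ordered from the bottom of the stack to the top; that representation and the function names are assumptions, not something the description specifies.

```python
def snapshots_to_history(snapshots):
    """Convert consecutive stack snapshots into a push/pop history."""
    history = []
    previous = []
    for current in snapshots:
        # Frames shared at the bottom of both stacks did not change.
        common = 0
        while (common < len(previous) and common < len(current)
               and previous[common] == current[common]):
            common += 1
        # Frames above the shared part of the old stack were popped, top first.
        for name in reversed(previous[common:]):
            history.append(("pop", name))
        # Frames above the shared part of the new stack were pushed.
        for name in current[common:]:
            history.append(("push", name))
        previous = list(current)
    return history


history = snapshots_to_history([
    ["onClick"],
    ["onClick", "buildQuery"],
    ["onClick"],
    [],
])
```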
In step S42, a JS syntax tree is parsed out of the JS code contained in the HTML source code and the HTTP messages.
Step S42 may use the Rhino tool, which is commonly used in the art, to parse out the JS syntax tree.
In the JS syntax tree, variables in functions are operated on mainly in SETVAR, GETVAR, SETPROP and GETPROP nodes.
In step S43, the JS syntax tree is traversed and marked according to the JS activity history information.
Specifically, functions unrelated to the JS activity history information are marked as useless (step S431). That is, functions that do not appear in the JS event flow are marked as useless: according to the stack snapshots, these functions are neither pushed onto nor popped off the stack while the JS code related to the user's operations is executed.
In the GETVAR, SETVAR, GETPROP and SETPROP nodes, the dependencies between functions and variables and between variables and variables in the JS syntax tree are determined, and the results are attached to the corresponding nodes (step S432).
Browser objects unrelated to the service are marked (step S433).
The JS environment of a browser has many objects, some of which are introduced from objects inside the browser. Therefore, for the JS code to run without a browser environment, the essential browser objects must be re-implemented. In fact, many browser objects relate to manipulating UI elements, CSS, the browser status bar, windows and so on, and do not need to be implemented. They are therefore marked in this step.
Key objects used to generate HTTP requests are marked (step S434).
Because the series of HTTP messages is to be reproduced by sending HTTP requests and obtaining HTTP responses, the key objects used to generate HTTP requests are needed.
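The first marking sub-step (S431) can be sketched as below. The history is assumed to be a list of ("push" or "pop", function name) pairs, and the function names are hypothetical; the other sub-steps depend on a real syntax tree and are not shown.

```python
def mark_functions(defined_functions, history):
    """Mark each function found in the syntax tree as 'used' or 'useless'.

    A function counts as used only if the activity history shows it
    being pushed onto the JS stack.
    """
    called = {name for action, name in history if action == "push"}
    return {
        name: ("used" if name in called else "useless")
        for name in defined_functions
    }


marks = mark_functions(
    ["onClick", "buildQuery", "animateBanner"],
    [("push", "onClick"), ("push", "buildQuery"),
     ("pop", "buildQuery"), ("pop", "onClick")],
)
```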
In step S44, the JS syntax tree is simplified by retaining only the functions and nodes in it that are related to the service.
Specifically, all functions marked as useless are removed (step S441).
The JS syntax tree is traversed to locate the key objects and their corresponding key nodes (step S442).
The variables related to the key nodes are put into a dependency set (step S443).
The key node, and its parent and sibling nodes up to the root node, are checked and processed as follows: if a variable in the dependency set depends on a variable in the current node, all variables in the current node are added to the dependency set; otherwise the current node is deleted. Nodes that set up events and nodes that call browser objects unrelated to the service are removed (step S444).
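The dependency-set processing resembles a backward slice. The sketch below simplifies the syntax tree to an ordered list of statement nodes, each with the variables it defines and uses; this flattening, the `Node` type and the JS snippets in the labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    defines: frozenset
    uses: frozenset


def slice_for_key_node(nodes, key_index):
    """Keep only the nodes that the key node transitively depends on."""
    key = nodes[key_index]
    dependency_set = set(key.uses)
    kept = [key]
    # Walk backwards from the key node towards the start of the code.
    for node in reversed(nodes[:key_index]):
        if node.defines & dependency_set:
            # The node feeds a needed variable: keep it and track its variables.
            dependency_set |= node.defines | node.uses
            kept.append(node)
        # Otherwise the node is dropped from the simplified code.
    kept.reverse()
    return kept, dependency_set


nodes = [
    Node("ts = Date.now()", frozenset({"ts"}), frozenset()),
    Node("banner.style = 'x'", frozenset({"banner"}), frozenset()),
    Node("sig = hash(ts)", frozenset({"sig"}), frozenset({"ts"})),
    Node("xhr.send(sig)", frozenset(), frozenset({"sig"})),
]
kept, deps = slice_for_key_node(nodes, key_index=3)
```

Here the node that only touches the banner is deleted, while the two assignments that feed the request are retained.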
A constant table is established to store the constants related to the service (step S445).
In some JS code, constants are assigned to variables. For this case, a constant table can be established to retain the assigned constants, so that they can be assigned to the variables when the JS code is executed and the implicit parameters can be produced.
Browser objects related to the service are replaced with custom objects (step S446). Because the regenerated JS code is to be executed outside the browser, the browser objects no longer exist. To generate the implicit parameters correctly, custom JS objects are needed to replace the browser objects and perform their functions.
JS code is generated on the basis of the simplified JS syntax tree, the constant table and the custom objects (step S447). This is the inverse of parsing the JS syntax tree out of the JS code in step S42 and is well known to those skilled in the art.
At this point, the JS code that can generate the implicit parameters has been constructed, and this JS code has been simplified.
Through the above processing, the first XPath that can generate the explicit parameters and the JS code that can generate the implicit parameters have been obtained. However, as mentioned above, the whole deep web page access is a sequence of HTTP messages. Therefore, in step S5, a second XPath that can generate the final deep web page return result is constructed, so that the result of the data extraction can be presented to the user. Then, in step S6, a structure representing the internal flow of the service is constructed according to the order of the JS events, the stack snapshots and the observed HTTP messages. This structure indicates the order in which an HTTP request is generated from at least one of the user input parameters, explicit parameters and implicit parameters; how, based on the corresponding HTTP response, the first XPath is used to generate the explicit parameters and the JS code is used to generate the implicit parameters; and how a new HTTP request is created from the newly generated parameters, and so on, until the final HTTP response is obtained and the second XPath is used to generate the final deep web page return result. As for the execution of the JS code, the DOM model is event-driven; that is, event handlers are activated by mouse clicks, key presses or timers. In the absence of a browser environment and user operations, the deep web page access is reproduced by activating these event handlers programmatically, according to the structure representing the internal flow of the service.
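The programmatic activation of event handlers can be pictured with a small dispatcher that replays recorded events in order. This is only an analogy written in Python; in the described service the handlers are JS functions, and the class, event names and payloads here are hypothetical.

```python
class EventDispatcher:
    """Minimal stand-in for an event-driven DOM: handlers keyed by event type."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def fire(self, event_type, payload=None):
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


def replay(dispatcher, recorded_events):
    """Activate handlers in the recorded order, with no browser or user present."""
    results = []
    for event_type, payload in recorded_events:
        results.extend(dispatcher.fire(event_type, payload))
    return results


dispatcher = EventDispatcher()
dispatcher.on("keydown", lambda text: "typed:" + text)
dispatcher.on("click", lambda target: "submitted:" + target)
outcome = replay(dispatcher, [("keydown", "home"), ("click", "searchButton")])
```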
Through the above steps, the user input parameters, the first and second XPaths, the simplified JS code and the structure representing the internal flow of the service are obtained, and together they constitute the network service.
The deep web page data extraction method according to an embodiment of the present invention is described below with reference to Fig. 2.
Fig. 2 shows a flowchart of the deep web page data extraction method according to an embodiment of the present invention. As shown in Fig. 2, the deep web page data extraction method according to the invention comprises the following steps: constructing an HTTP request according to the user's input parameters (step S21); obtaining the HTTP response corresponding to the HTTP request (step S22); according to the obtained HTTP response, generating explicit parameters with the first XPath and generating implicit parameters with the JS code (step S23); constructing an HTTP request according to at least one of the user input parameters, the explicit parameters and the implicit parameters (step S24); obtaining the HTTP response corresponding to the HTTP request (step S25); and repeating the above generating, constructing and obtaining steps according to the structure representing the internal flow of the service, until the final deep web page return result is generated from the obtained HTTP response with the second XPath (step S26).
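The loop of steps S21 to S26 can be sketched as follows. Several things here are assumptions for illustration: the flow structure is a list of step dictionaries, the implicit parameter generator is a Python callable standing in for the simplified JS code, and `fake_fetch` replaces real HTTP traffic with canned, well-formed responses.

```python
import xml.etree.ElementTree as ET


def run_service(user_input, steps, result_xpath, fetch):
    """Replay the recorded flow and return the final extracted results."""
    params = dict(user_input)
    first = steps[0]
    response = fetch(first["url"], {k: params[k] for k in first["send"]})  # S21, S22
    for step in steps[1:]:
        root = ET.fromstring(response)
        for name, xpath in step.get("explicit", {}).items():   # S23: first XPath
            params[name] = root.find(xpath).get("value")
        for name, generate in step.get("implicit", {}).items():  # S23: JS stand-in
            params[name] = generate(params)
        response = fetch(step["url"], {k: params[k] for k in step["send"]})  # S24, S25
    # S26: the second XPath yields the final return result.
    return [el.text for el in ET.fromstring(response).findall(result_xpath)]


def fake_fetch(url, params):
    """Hypothetical server: login returns a token, query checks it."""
    if url == "/login":
        return '<html><input name="token" value="t-42"/></html>'
    if (url == "/query" and params.get("token") == "t-42"
            and params.get("sig") == "home:t-42"):
        return "<html><ul><li>result A</li><li>result B</li></ul></html>"
    return "<html><ul/></html>"


STEPS = [
    {"url": "/login", "send": ["user"]},
    {"url": "/query", "send": ["keyword", "token", "sig"],
     "explicit": {"token": ".//input[@name='token']"},
     "implicit": {"sig": lambda p: p["keyword"] + ":" + p["token"]}},
]

results = run_service({"user": "alice", "keyword": "home"}, STEPS, ".//li", fake_fetch)
```

Passing `fetch` as an argument keeps the loop independent of any particular HTTP client, so the same flow structure can be exercised against canned responses or a real server.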
Below, describe according to the network service of the embodiment of the present invention and build equipment with reference to Fig. 3.
Fig. 3 shows the block diagram that builds equipment according to the network service of the embodiment of the present invention.As shown in Figure 3, build equipment 300 according to network service of the present invention and comprise: gathering-device 31, be configured to collect the data relevant with the access of deep layer net page, described data comprise HTTP message, JS event, storehouse snapshot; Parameter search device 32, is configured in collected HTTP message, search service correlation parameter, and the parameter searching is divided into user's input parameter, explicit parament, implicit expression parameter; The one XPath construction device 33, is configured to build an XPath that can generate explicit parament; JS code construction device 34, is configured to build the JS code that can generate implicit expression parameter; The 2nd XPath construction device 35, is configured to structure and can generates the 2nd XPath that final deep layer net page returns results; And structure construction device 36, be configured to, according to the order of JS event, storehouse snapshot and observed HTTP message, build the structure that represents service internal process; The structure of wherein said user's input parameter, JS code, the first and second XPath, expression service internal process has formed described network service.
In a specific embodiment, network service builds equipment 300 and also comprises sorter 37, and described sorter 37 is configured the data of being collected by described gathering-device 31 to classify by the page.
In a specific embodiment, network service builds equipment 300 and also comprises removal device 38, and described removal device 38 is configured to remove from the sorted page of described sorter 37 related data that operates the irrelevant page with user.
In a specific embodiment, removal device 38 is further configured to and utilizes JS event and user data marker, select operate the relevant page with user, thereby determine and user operate the irrelevant page.
In a specific embodiment, operate the relevant page with user and comprise the page of generation JS event and the page of the next page and user data marker mark thereof.
In a specific embodiment, parameter search device 32 is from the following location finding service related parameters of HTTP message: the POST data corresponding to POST body, XML HTTP request of URL query portion, cookie field, URL path, HTTP POST message.
In a specific embodiment, the JS code construction device 34 constructs, for the data collected for each page, JS code capable of generating the implicit parameters.
In a specific embodiment, the JS code construction device 34 comprises: a history information obtaining unit 341, for obtaining JS activity history information from the stack snapshots; a parsing unit 342, for parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages; a marking unit 343, for traversing and marking the JS syntax tree according to the JS activity history information; a simplification unit 344, for simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are relevant to the service; a constant table building unit 345, for building a constant table to store constants relevant to the service; a replacement unit 346, for replacing browser objects relevant to the service with custom objects; and a generation unit 347, for generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
In a specific embodiment, the marking unit 343 is configured to: mark functions unrelated to the JS activity history information as useless; determine, in GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables and variables, and attach the results to the corresponding nodes; mark browser objects irrelevant to the service; and mark key objects used to generate HTTP requests.
In a specific embodiment, the simplification unit 344 is configured to: remove all functions marked as useless; traverse the JS syntax tree to locate the key objects and their corresponding key nodes; put the variables related to the key nodes into a dependency set; check each key node, and its parent nodes and sibling nodes up to the root node, to perform the following processing: if a variable in the dependency set depends on a variable in the current node, add all variables in the current node to the dependency set, otherwise delete the current node; and remove nodes concerning event setup and nodes that call browser objects irrelevant to the service.
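The dependency-set rule can be illustrated with a small backward pass. This is a simplified sketch, not the patented procedure: it works on a flat list of statements rather than a real JS syntax tree, it assumes at least one key statement exists, and the names `Stmt`, `defines`, `uses`, and `is_key` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stmt:
    """A toy stand-in for a syntax tree node."""
    text: str
    defines: set          # variables written by this statement
    uses: set             # variables read by this statement
    is_key: bool = False  # True if the statement produces the HTTP request


def simplify(stmts):
    """Keep the last key statement and the earlier statements it depends on."""
    key_index = max(i for i, s in enumerate(stmts) if s.is_key)
    deps = set(stmts[key_index].uses)  # variables related to the key node
    kept = [stmts[key_index]]
    for stmt in reversed(stmts[:key_index]):
        if stmt.defines & deps:
            # A variable in the dependency set depends on this statement:
            # add all of its variables to the dependency set and keep it.
            deps |= stmt.defines | stmt.uses
            kept.append(stmt)
        # Otherwise the statement is dropped.
    kept.reverse()
    return kept


# Hypothetical page script, for illustration only.
stmts = [
    Stmt("var ts = Date.now();", {"ts"}, set()),
    Stmt("var banner = loadAd();", {"banner"}, set()),
    Stmt("var sig = hash(ts + key);", {"sig"}, {"ts", "key"}),
    Stmt("xhr.send('sig=' + sig);", set(), {"sig"}, is_key=True),
]
simplified = [s.text for s in simplify(stmts)]
```

In this example the `banner` statement is deleted because nothing in the dependency set depends on it, while the `ts` and `sig` statements survive because the request depends on them.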
Since the processing in each device included in the network service construction equipment 300 according to the present invention is similar to the processing in steps S1-S6 of the network service construction method described above, a detailed description of these devices is omitted here for brevity.
Deep web page data extraction equipment according to an embodiment of the present invention is described below with reference to Fig. 4.
Fig. 4 shows a block diagram of deep web page data extraction equipment 400 according to an embodiment of the present invention. As shown in Fig. 4, the deep web page data extraction equipment 400 according to the present invention comprises: a first construction device 41, for constructing an HTTP request according to user input parameters; an acquisition device 42, for obtaining the HTTP response corresponding to an HTTP request; a parameter generation device 43, for generating, according to the obtained HTTP response, explicit parameters using the first XPath and implicit parameters using the JS code; a second construction device 44, for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; and a control device 45, for instructing, according to the structure representing the internal flow of the service, the parameter generation device 43, the second construction device 44, and the acquisition device 42 to perform their operations, until the final result returned by the deep web page is generated from the obtained HTTP response using the second XPath.
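The control loop described above can be sketched in Python. Several assumptions apply and none of them comes from the patent: responses are assumed to be well-formed XML so that the limited XPath support in `xml.etree.ElementTree` suffices, each explicit parameter is read from an attribute, a Python callable stands in for the generated JS code, and the step dictionary layout, `extract`, and `fake_fetch` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def extract(user_params, steps, fetch, final_path):
    """Replay the recorded flow and return the text of the final result nodes."""
    params = dict(user_params)
    first = steps[0]
    # First request: built from the user input parameters only.
    response = fetch(first["url"], {k: params[k] for k in first["params"]})
    for step in steps[1:]:
        page = ET.fromstring(response)
        # Explicit parameters: read from the previous response (first XPath).
        for name, (path, attr) in step.get("explicit", {}).items():
            params[name] = page.find(path).get(attr)
        # Implicit parameters: a Python callable stands in for the JS code.
        for name, compute in step.get("implicit", {}).items():
            params[name] = compute(params)
        # Next request: built from whichever parameters the step needs.
        response = fetch(step["url"], {k: params[k] for k in step["params"]})
    # Final result: selected from the last response (second XPath).
    return [node.text for node in ET.fromstring(response).findall(final_path)]


def fake_fetch(url, params):
    """Canned responses so the sketch runs without network access."""
    if url == "/form":
        return "<html><body><input name='token' value='t42'/></body></html>"
    return ("<html><body><div class='result'>%s</div>"
            "<div class='result'>%s</div></body></html>"
            % (params["token"], params["sig"]))


steps = [
    {"url": "/form", "params": ["q"]},
    {"url": "/search", "params": ["q", "token", "sig"],
     "explicit": {"token": (".//input[@name='token']", "value")},
     "implicit": {"sig": lambda p: str(len(p["q"]))}},
]
results = extract({"q": "laptop"}, steps, fake_fetch, ".//div[@class='result']")
```

Passing `fetch` in as an argument keeps the loop independent of any particular HTTP client; a real deployment would also need an HTML-tolerant parser and a JS engine, which this sketch leaves out.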
Since the processing in each device included in the deep web page data extraction equipment 400 according to the present invention is similar to the processing in steps S21-S26 of the deep web page data extraction method described above, a detailed description of these devices is omitted here for brevity.
In addition, it should be noted that each component device and unit in the above equipment may be configured by software, firmware, hardware, or a combination thereof. The specific means or manners available for such configuration are well known to those skilled in the art and are not repeated here. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 500 shown in Fig. 5), and the computer, when various programs are installed, is able to perform various functions.
Fig. 5 is a schematic block diagram of a computer that can be used to implement the method and equipment according to embodiments of the present invention.
In Fig. 5, a central processing unit (CPU) 501 performs various processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores, as needed, data required when the CPU 501 performs the various processing. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.
The following components are connected to the input/output interface 505: an input section 506 (including a keyboard, a mouse, etc.), an output section 507 (including a display, such as a cathode-ray tube (CRT) or liquid crystal display (LCD), and a speaker, etc.), a storage section 508 (including a hard disk, etc.), and a communication section 509 (including a network interface card such as a LAN card, a modem, etc.). The communication section 509 performs communication processing via a network such as the Internet. A drive 510 may also be connected to the input/output interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it is installed into the storage section 508 as needed.
In the case where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 511 shown in Fig. 5, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 502, a hard disk included in the storage section 508, or the like, in which the program is stored and which is distributed to the user together with the equipment containing it.
The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to the embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the protection scope of the present invention.
Remarks
1. A network service construction method, comprising:
Collecting data related to access to a deep web page, the data comprising HTTP messages, JS events, and stack snapshots;
Searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters;
Constructing a first XPath capable of generating the explicit parameters;
Constructing JS code capable of generating the implicit parameters;
Constructing a second XPath capable of generating the final result returned by the deep web page; and
Constructing a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the observed order of the HTTP messages;
Wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
2. The network service construction method according to remark 1, wherein, after the collecting step, the collected data are classified by page.
3. The network service construction method according to remark 2, wherein, after the classifying step, data related to pages irrelevant to user operations are removed.
4. The network service construction method according to remark 3, wherein pages relevant to user operations are selected using JS events and user data markers, thereby determining the pages irrelevant to user operations.
5. The network service construction method according to remark 4, wherein the pages relevant to user operations comprise pages on which a JS event occurs, the pages that follow them, and pages marked with user data markers.
6. The network service construction method according to remark 1, wherein the service-related parameters are searched for in the following locations of an HTTP message: the URL query portion, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data corresponding to an XML HTTP request.
7. The network service construction method according to remark 2, wherein constructing the JS code capable of generating the implicit parameters comprises:
For the data collected for each page,
Obtaining JS activity history information from the stack snapshots;
Parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
Traversing and marking the JS syntax tree according to the JS activity history information;
Simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are relevant to the service;
Building a constant table to store constants relevant to the service;
Replacing browser objects relevant to the service with custom objects;
Generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
8. The network service construction method according to remark 7, wherein marking the JS syntax tree comprises:
Marking functions unrelated to the JS activity history information as useless;
Determining, in GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables and variables, and attaching the results to the corresponding nodes;
Marking browser objects irrelevant to the service;
Marking key objects used to generate HTTP requests.
9. The network service construction method according to remark 8, wherein simplifying the JS syntax tree comprises:
Removing all functions marked as useless;
Traversing the JS syntax tree to locate the key objects and their corresponding key nodes;
Putting the variables related to the key nodes into a dependency set;
Checking each key node, and its parent nodes and sibling nodes up to the root node, to perform the following processing:
If a variable in the dependency set depends on a variable in the current node, adding all variables in the current node to the dependency set;
Otherwise, deleting the current node;
Removing nodes concerning event setup and nodes that call browser objects irrelevant to the service.
10. A method for extracting data of a deep web page using a network service constructed by the method of any one of remarks 1-9, comprising:
Constructing an HTTP request according to user input parameters;
Obtaining the HTTP response corresponding to the HTTP request;
Generating, according to the obtained HTTP response, explicit parameters using the first XPath and implicit parameters using the JS code;
Constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
Obtaining the HTTP response corresponding to the HTTP request;
Repeating the above generating, constructing, and obtaining steps according to the structure representing the internal flow of the service, until the final result returned by the deep web page is generated from the obtained HTTP response using the second XPath.
11. Network service construction equipment, comprising:
A collection device, configured to collect data related to access to a deep web page, the data comprising HTTP messages, JS events, and stack snapshots;
A parameter search device, configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters;
A first XPath construction device, configured to construct a first XPath capable of generating the explicit parameters;
A JS code construction device, configured to construct JS code capable of generating the implicit parameters;
A second XPath construction device, configured to construct a second XPath capable of generating the final result returned by the deep web page; and
A structure construction device, configured to construct a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the observed order of the HTTP messages;
Wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
12. The network service construction equipment according to remark 11, further comprising a classification device configured to classify the data collected by the collection device by page.
13. The network service construction equipment according to remark 12, further comprising a removal device configured to remove, from the pages classified by the classification device, data related to pages irrelevant to user operations.
14. The network service construction equipment according to remark 13, wherein the removal device is further configured to select pages relevant to user operations using JS events and user data markers, thereby determining the pages irrelevant to user operations.
15. The network service construction equipment according to remark 14, wherein the pages relevant to user operations comprise pages on which a JS event occurs, the pages that follow them, and pages marked with user data markers.
16. The network service construction equipment according to remark 11, wherein the parameter search device searches for service-related parameters in the following locations of an HTTP message: the URL query portion, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data corresponding to an XML HTTP request.
17. The network service construction equipment according to remark 12, wherein the JS code construction device constructs, for the data collected for each page, JS code capable of generating the implicit parameters, the JS code construction device comprising:
A history information obtaining unit, for obtaining JS activity history information from the stack snapshots;
A parsing unit, for parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
A marking unit, for traversing and marking the JS syntax tree according to the JS activity history information;
A simplification unit, for simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are relevant to the service;
A constant table building unit, for building a constant table to store constants relevant to the service;
A replacement unit, for replacing browser objects relevant to the service with custom objects;
A generation unit, for generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
18. The network service construction equipment according to remark 17, wherein the marking unit is configured to:
Mark functions unrelated to the JS activity history information as useless;
Determine, in GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables and variables, and attach the results to the corresponding nodes;
Mark browser objects irrelevant to the service;
Mark key objects used to generate HTTP requests.
19. The network service construction equipment according to remark 18, wherein the simplification unit is configured to:
Remove all functions marked as useless;
Traverse the JS syntax tree to locate the key objects and their corresponding key nodes;
Put the variables related to the key nodes into a dependency set;
Check each key node, and its parent nodes and sibling nodes up to the root node, to perform the following processing:
If a variable in the dependency set depends on a variable in the current node, add all variables in the current node to the dependency set;
Otherwise, delete the current node;
Remove nodes concerning event setup and nodes that call browser objects irrelevant to the service.
20. Equipment for extracting data of a deep web page using a network service constructed by the equipment of any one of remarks 11-19, comprising:
A first construction device, for constructing an HTTP request according to user input parameters;
An acquisition device, for obtaining the HTTP response corresponding to an HTTP request;
A parameter generation device, for generating, according to the obtained HTTP response, explicit parameters using the first XPath and implicit parameters using the JS code;
A second construction device, for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
A control device, for instructing, according to the structure representing the internal flow of the service, the parameter generation device, the second construction device, and the acquisition device to perform their operations, until the final result returned by the deep web page is generated from the obtained HTTP response using the second XPath.

Claims (10)

1. A network service construction method, comprising:
Collecting data related to access to a deep web page, the data comprising HTTP messages, JS events, and stack snapshots;
Searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters;
Constructing a first XPath capable of generating the explicit parameters;
Constructing JS code capable of generating the implicit parameters;
Constructing a second XPath capable of generating the final result returned by the deep web page; and
Constructing a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the observed order of the HTTP messages;
Wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
2. The network service construction method as claimed in claim 1, wherein, after the collecting step, the collected data are classified by page.
3. The network service construction method as claimed in claim 2, wherein, after the classifying step, data related to pages irrelevant to user operations are removed.
4. The network service construction method as claimed in claim 3, wherein pages relevant to user operations are selected using JS events and user data markers, thereby determining the pages irrelevant to user operations.
5. The network service construction method as claimed in claim 2, wherein constructing the JS code capable of generating the implicit parameters comprises:
For the data collected for each page,
Obtaining JS activity history information from the stack snapshots;
Parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
Traversing and marking the JS syntax tree according to the JS activity history information;
Simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are relevant to the service;
Building a constant table to store constants relevant to the service;
Replacing browser objects relevant to the service with custom objects;
Generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
6. The network service construction method as claimed in claim 5, wherein marking the JS syntax tree comprises:
Marking functions unrelated to the JS activity history information as useless;
Determining, in GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables and variables, and attaching the results to the corresponding nodes;
Marking browser objects irrelevant to the service;
Marking key objects used to generate HTTP requests.
7. The network service construction method as claimed in claim 6, wherein simplifying the JS syntax tree comprises:
Removing all functions marked as useless;
Traversing the JS syntax tree to locate the key objects and their corresponding key nodes;
Putting the variables related to the key nodes into a dependency set;
Checking each key node, and its parent nodes and sibling nodes up to the root node, to perform the following processing:
If a variable in the dependency set depends on a variable in the current node, adding all variables in the current node to the dependency set;
Otherwise, deleting the current node;
Removing nodes concerning event setup and nodes that call browser objects irrelevant to the service.
8. A method for extracting data of a deep web page using a network service constructed by the method of any one of claims 1-7, comprising:
Constructing an HTTP request according to user input parameters;
Obtaining the HTTP response corresponding to the HTTP request;
Generating, according to the obtained HTTP response, explicit parameters using the first XPath and implicit parameters using the JS code;
Constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
Obtaining the HTTP response corresponding to the HTTP request;
Repeating the above generating, constructing, and obtaining steps according to the structure representing the internal flow of the service, until the final result returned by the deep web page is generated from the obtained HTTP response using the second XPath.
9. Network service construction equipment, comprising:
A collection device, configured to collect data related to access to a deep web page, the data comprising HTTP messages, JS events, and stack snapshots;
A parameter search device, configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters;
A first XPath construction device, configured to construct a first XPath capable of generating the explicit parameters;
A JS code construction device, configured to construct JS code capable of generating the implicit parameters;
A second XPath construction device, configured to construct a second XPath capable of generating the final result returned by the deep web page; and
A structure construction device, configured to construct a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the observed order of the HTTP messages;
Wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
10. Equipment for extracting data of a deep web page using a network service constructed by the equipment as claimed in claim 9, comprising:
A first construction device, for constructing an HTTP request according to user input parameters;
An acquisition device, for obtaining the HTTP response corresponding to an HTTP request;
A parameter generation device, for generating, according to the obtained HTTP response, explicit parameters using the first XPath and implicit parameters using the JS code;
A second construction device, for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
A control device, for instructing, according to the structure representing the internal flow of the service, the parameter generation device, the second construction device, and the acquisition device to perform their operations, until the final result returned by the deep web page is generated from the obtained HTTP response using the second XPath.
CN201210479166.7A 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment Expired - Fee Related CN103838747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210479166.7A CN103838747B (en) 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment

Publications (2)

Publication Number Publication Date
CN103838747A true CN103838747A (en) 2014-06-04
CN103838747B CN103838747B (en) 2017-07-07

Family

ID=50802261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210479166.7A Expired - Fee Related CN103838747B (en) 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment

Country Status (1)

Country Link
CN (1) CN103838747B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080040327A1 (en) * 2006-08-14 2008-02-14 International Business Machines Corporation System and method for searching deep web services
CN101251852A (en) * 2008-01-11 2008-08-27 孟小峰 Integrating system and method of Web data facing to field
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage
CN102663041A (en) * 2012-03-28 2012-09-12 重庆大学 Automatic extraction method oriented to data of deep web pages
CN102682119A (en) * 2012-05-16 2012-09-19 崔志明 Deep webpage data acquiring method based on dynamic knowledge

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202244A (en) * 2016-06-28 2016-12-07 深圳中兴网信科技有限公司 Web page message return method and web page message return system
CN111368104A (en) * 2018-12-26 2020-07-03 阿里巴巴集团控股有限公司 Information processing method, device and equipment
CN111368104B (en) * 2018-12-26 2023-05-26 阿里巴巴集团控股有限公司 Information processing method, device and equipment
CN113778389A (en) * 2020-09-23 2021-12-10 北京沃东天骏信息技术有限公司 Interface idempotent judging method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN103838747B (en) 2017-07-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20170707; termination date: 20181122)