CN103838747B - Network service construction method and equipment and webpage data extracting method and equipment - Google Patents

Publication number
CN103838747B
Authority
CN
China
Prior art keywords
parameter
xpath
service
network service
page
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210479166.7A
Other languages
Chinese (zh)
Other versions
CN103838747A (en)
Inventor
邹纲
皮冰锋
张军
钟朝亮
于浩
松尾昭彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Application filed by Fujitsu Ltd
Priority to CN201210479166.7A
Publication of CN103838747A
Application granted
Publication of CN103838747B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a network service construction method and equipment, and a web page data extraction method and equipment. The network service construction method comprises: collecting data related to access to a deep web page, the data including HTTP messages, JS events, and stack snapshots; searching the collected HTTP messages for service-related parameters and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters; building a first XPath capable of generating the explicit parameters; building JS code capable of generating the implicit parameters; building a second XPath capable of generating the final deep web page return result; and building, according to the order of the JS events, stack snapshots, and observed HTTP messages, a structure representing the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.

Description

Network service construction method and equipment and webpage data extracting method and equipment
Technical field
The present invention relates generally to data extraction technology for deep web pages and to methods of constructing the associated network services. Specifically, the present invention relates to a method and equipment for constructing a network service that extracts data from deep web pages, and to a corresponding web page data extraction method and equipment.
Background
In recent years, with the development of the Internet and its applications, the useful information that can be obtained and used on the network has grown geometrically, greatly expanding the sources from which people obtain information. Web pages can be broadly divided into surface web pages and deep web pages. Surface web pages are easy for users to use and can also provide APIs that facilitate machine processing. Deep web pages are readable by users but unfriendly to machines, and few of them have corresponding APIs. Data extraction methods have therefore been proposed that encapsulate access to a deep web page as a network service (Web Service), so that machines can access the deep web page and further, higher-level network services can be generated on top of such services, for example by integrating existing network services to provide a new one.
However, as the network has developed, JavaScript (hereinafter JS) has come into wide use in web pages. The resulting problem is that, when a deep web page is accessed, many of the login or query parameters involved may be the dynamic result of executing JS code, and traditional data extraction technology has difficulty accessing deep web pages that rely on dynamically executed JS code. It is therefore difficult to construct a corresponding network service to extract the data of such deep web pages.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.
In view of the above problem of the prior art, the object of the present invention is to propose a network service construction method and equipment and a web page data extraction method and equipment. With the technical solution of the present invention, access to a deep web page that involves dynamic execution of JS code can be constructed as a network service, and the data of the deep web page can be extracted using that network service.
To achieve the above object, according to one aspect of the present invention there is provided a network service construction method, comprising: collecting data related to access to a deep web page, the data including HTTP messages, JS events, and stack snapshots; searching the collected HTTP messages for service-related parameters and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters; building a first XPath capable of generating the explicit parameters; building JS code capable of generating the implicit parameters; building a second XPath capable of generating the final deep web page return result; and building, according to the order of the JS events, stack snapshots, and observed HTTP messages, a structure representing the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
According to another aspect of the present invention, there is provided a data extraction method for a deep web page, comprising: building an HTTP request according to user input parameters; obtaining the HTTP response corresponding to the HTTP request; according to the obtained HTTP response, generating explicit parameters using a first XPath and generating implicit parameters using JS code; building an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; obtaining the HTTP response corresponding to that HTTP request; and repeating the above generating, building, and obtaining steps according to a structure representing the internal flow of the service, until a final deep web page return result is generated from the obtained HTTP response using a second XPath.
According to a further aspect of the present invention, there is provided network service construction equipment, comprising: a collection device configured to collect data related to access to a deep web page, the data including HTTP messages, JS events, and stack snapshots; a parameter search device configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters; a first XPath construction device configured to build a first XPath capable of generating the explicit parameters; a JS code construction device configured to build JS code capable of generating the implicit parameters; a second XPath construction device configured to build a second XPath capable of generating the final deep web page return result; and a structure construction device configured to build, according to the order of the JS events, stack snapshots, and observed HTTP messages, a structure representing the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
According to a further aspect of the present invention, there is provided data extraction equipment for a deep web page, comprising: a first construction device for building an HTTP request according to user input parameters; an acquisition device for obtaining the HTTP response corresponding to an HTTP request; a parameter generation device for generating, according to the obtained HTTP response, explicit parameters using a first XPath and implicit parameters using JS code; a second construction device for building an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; and a control device for instructing, according to a structure representing the internal flow of the service, the parameter generation device, the second construction device, and the acquisition device to perform their operations until a final deep web page return result is generated from the obtained HTTP response using a second XPath.
In addition, according to another aspect of the present invention, a storage medium is also provided. The storage medium includes machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above methods according to the present invention.
Furthermore, according to yet another aspect of the present invention, a program product is also provided. The program product includes machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above methods according to the present invention.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals. In the drawings:
Fig. 1 shows a flow chart of a network service construction method according to an embodiment of the present invention;
Fig. 2 shows a flow chart of a deep web page data extraction method according to an embodiment of the present invention;
Fig. 3 shows a block diagram of network service construction equipment according to an embodiment of the present invention;
Fig. 4 shows a block diagram of deep web page data extraction equipment according to an embodiment of the present invention; and
Fig. 5 shows a schematic block diagram of a computer that can be used to implement the methods and equipment according to embodiments of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be appreciated that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the equipment structures and/or processing steps closely related to the solution of the present invention, and omit other details of little relevance to it. In addition, elements and features described in one drawing or embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments.
As described above, the problem faced by conventional technology is that more and more deep web page accesses involve dynamic execution of JS code. To solve this problem, the JS code must be executed dynamically within the corresponding network service.
In general, the process of accessing a deep web page includes, for example, logging in, querying, and displaying results. At the bottom level, this process is a series of HTTP messages, such as HTTP requests and the corresponding HTTP responses, and it may involve dynamic execution of JS code. For most deep web page accesses the process is fixed and only the parameter values differ. The data of the deep web page can therefore be obtained by reproducing this series of steps: send an HTTP request, extract parameters from the HTTP response obtained in order to build the next HTTP request, and continue until the HTTP message containing the final return result is obtained.
Accordingly, a network service capable of extracting deep web page data can be constructed by collecting observation data, forming a pattern, and encapsulating the pattern as a network service.
The flow of a network service construction method according to an embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 shows a flow chart of the network service construction method according to an embodiment of the present invention. As shown in Fig. 1, the network service construction method according to the present invention comprises the following steps: collecting data related to access to a deep web page, the data including HTTP messages, JS events, and stack snapshots (step S1); searching the collected HTTP messages for service-related parameters and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters (step S2); building a first XPath capable of generating the explicit parameters (step S3); building JS code capable of generating the implicit parameters (step S4); building a second XPath capable of generating the final deep web page return result (step S5); and building, according to the order of the JS events, stack snapshots, and observed HTTP messages, a structure representing the internal flow of the service (step S6). The user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service together constitute the network service.
In step S1, data related to access to the deep web page is collected, the data including HTTP messages, JS events, and stack snapshots.
As noted above, the data of the deep web page is to be accessed by reproducing HTTP messages. The HTTP message sequence is therefore collected first and then analyzed.
A user or a window may trigger JavaScript events. DOM tree nodes can receive JS events, such as keyboard strokes and mouse presses coming from the user interface. A window generates an onLoad event once it is ready. All of these JS events can be captured, for example, by a Firefox plug-in. The events are recorded by the structure representing the internal flow of the service and are later triggered in order.
At the bottom level, the execution flow of the entire JS code is a process of function calls. There is accordingly a JS stack that records the pushing (push) and popping (pop) of functions. Using the JDS (JavaScript debugging service) provided by Firefox, a stack snapshot can be captured whenever the JS stack changes. These stack snapshots make it possible to trace JS activity, which can thus be described as a history of JS functions being pushed onto and popped from the stack. From this history, the execution order and operations of the JS code can be obtained.
It should be noted that the above process uses, but does not depend on, Firefox plug-ins; any tool that helps obtain JS events and stack snapshots can be used here.
In step S1, after the above collecting step, the collected data may also be classified by page.
Access to a deep web page can be regarded as a process of generating a series of web pages: the current web page generates the next web page, and so on, until the last web page is produced. In addition, the JS execution environment is a web page on which all JS code has been fully loaded. The collected data, however, is organized in chronological order, with no obvious boundaries between web pages. The collected data stream is therefore divided into page units, each of which includes one page and all the JS events, HTTP messages, and JS stack snapshots that occurred on that page. The times at which window.onload events occur can be used as page boundaries for splitting the collected data stream by page.
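As an illustration only, the page segmentation described above might be sketched in Python as follows. The Record and PageUnit types and the split_into_page_units function are hypothetical names that do not appear in the original disclosure, and treating each window.onload event as the start of a new page unit is one possible reading of "page boundary", not the disclosed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One collected observation (hypothetical layout)."""
    time: float   # capture timestamp in seconds
    kind: str     # "http", "js_event", or "stack"
    detail: str   # e.g. the event name or the request line


@dataclass
class PageUnit:
    """All records attributed to one page."""
    records: list = field(default_factory=list)


def split_into_page_units(records):
    """Cut a record stream into page units at each window.onload event."""
    units = [PageUnit()]
    for rec in sorted(records, key=lambda r: r.time):
        is_onload = rec.kind == "js_event" and rec.detail == "window.onload"
        # Assumption: an onload event opens a new unit unless the current one is empty.
        if is_onload and units[-1].records:
            units.append(PageUnit())
        units[-1].records.append(rec)
    return units
```

In this sketch the request that fetched a page lands in the preceding unit, so a real collector would probably need a slightly different boundary rule.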
In step S1, after the above classifying step, data related to pages that are unrelated to user operations may also be removed.
Many HTTP messages are produced while a deep web page is being accessed, but not all of them are essential for extracting the deep web page data. There is therefore no need to track every parameter that occurs. It suffices to analyze the user's behavior and to retain and analyze the data related to user operations.
During access to a deep web page, user intent can be detected from the following JS events. For example, a user login is invariably accompanied by a series of keystrokes on Input elements in the HTML and by a button click. Likewise, a user query is invariably accompanied by some key events. In addition, a User Data Annotator is provided, for example, on the Firefox browser window. A page that has been annotated is clearly one that the user wants.
JS events and the User Data Annotator can therefore be used to select the pages related to user operations, thereby identifying the pages unrelated to user operations, whose related data is then removed. The pages related to user operations include each page on which JS events occurred together with its next page, and the pages marked with the User Data Annotator.
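The selection rule just described can be sketched as follows, purely for illustration. The dictionary keys user_events and annotated are hypothetical, and the sketch assumes that only user-triggered events (keystrokes, clicks) are counted, not window.onload.

```python
def select_relevant_pages(pages):
    """Return the indexes of pages related to user operations.

    pages: ordered list of dicts with the hypothetical keys
      'user_events' (number of user-triggered JS events on the page) and
      'annotated'   (True if marked with the User Data Annotator).
    """
    keep = set()
    for i, page in enumerate(pages):
        if page["user_events"] > 0:
            keep.add(i)                 # the page where JS events occurred
            if i + 1 < len(pages):
                keep.add(i + 1)         # and its next page
        if page["annotated"]:
            keep.add(i)                 # pages the user explicitly annotated
    return sorted(keep)
```

Every page whose index is not returned would be treated as unrelated to user operations and its data discarded.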
In step S2, the collected HTTP messages are searched for service-related parameters, and the parameters found are divided into user input parameters, explicit parameters, and implicit parameters.
Service-related parameters can be searched for in the following locations of an HTTP message: the URL query portion, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data corresponding to an XMLHttpRequest.
As an example of the URL query portion, in http://translate.google.com.hk/translate_a/t?client=t&text=home the string client=t&text=home is the parameter being transmitted.
In a browser, JS can send parameters to the server by manipulating the cookie field.
Parameters are normally expressed as name-value pairs. An HTTP request can therefore be restored directly from the corresponding parameters, and so can a URL.
Therefore, once the service-related parameters have been found in the collected HTTP messages, they can later be used to reproduce the HTTP messages, which in turn reproduce the access to the deep web page and extract its data.
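A minimal sketch of this parameter search, using only the Python standard library, is given below for illustration. The function name find_parameters and the (location, name, value) tuple layout are assumptions of this sketch, and the POST body is assumed to be form-encoded.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlsplit


def find_parameters(url, cookie_header="", post_body=""):
    """List candidate parameters of one HTTP request as (location, name, value)."""
    parts = urlsplit(url)
    params = []
    # URL query portion: name=value pairs after the "?".
    params += [("query", k, v)
               for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    # Cookie field.
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    params += [("cookie", name, morsel.value) for name, morsel in cookie.items()]
    # URL path: each non-empty segment is a candidate, keyed by its position.
    segments = [s for s in parts.path.split("/") if s]
    params += [("path", str(i), seg) for i, seg in enumerate(segments)]
    # POST body (assumed to be application/x-www-form-urlencoded).
    params += [("body", k, v)
               for k, v in parse_qsl(post_body, keep_blank_values=True)]
    return params
```

Because each parameter is kept as a name-value pair together with its location, a request can later be rebuilt from the list, as noted above.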
The parameters found are divided into three classes: user input parameters, explicit parameters, and implicit parameters.
User input parameters are the user inputs of the future network service.
Explicit parameters can be extracted from the original page using XPath.
Implicit parameters cannot be extracted from the original page using XPath as explicit parameters can; they are the result of executing JS code and can therefore be obtained only by executing JS code.
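The disclosure does not spell out how a parameter is assigned to one of these three classes. The following sketch shows one plausible heuristic under stated assumptions: a value typed by the user is a user input parameter, a value that appears literally in the source of the preceding page is explicit, and anything else is presumed to have been computed by JS. The function name and arguments are hypothetical.

```python
def classify_parameter(value, user_inputs, previous_page_source):
    """Classify one parameter value (assumed heuristic, not the disclosed rule)."""
    if value in user_inputs:
        return "user_input"
    # A value present verbatim in the page can in principle be reached by XPath.
    if value and value in previous_page_source:
        return "explicit"
    # Otherwise assume the value was produced by executing JS code.
    return "implicit"
```

A production classifier would need to handle encoded or transformed values, which this substring test ignores.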
Accordingly, in step S3, a first XPath capable of generating the explicit parameters is built.
Building, from the web page data, a first XPath capable of generating explicit parameters, and generating the explicit parameters with that XPath, are both techniques well known to those skilled in the art and are not described further here.
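For readers unfamiliar with the technique, the following sketch shows an explicit parameter being read from a page with an XPath expression. The page content, the parameter name token, and the expression are invented for illustration. Python's xml.etree.ElementTree supports only a subset of XPath and requires well-formed markup, so real HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment containing a hidden form field.
PAGE = """<html><body>
<form action="/search">
  <input type="hidden" name="token" value="a1b2c3"/>
  <input type="text" name="q"/>
</form>
</body></html>"""

# Hypothetical "first XPath" locating the element that carries the parameter.
FIRST_XPATH = ".//input[@name='token']"


def extract_explicit_parameter(page_source, xpath):
    """Return the value attribute of the first element matching xpath, or None."""
    root = ET.fromstring(page_source)
    node = root.find(xpath)
    return None if node is None else node.get("value")
```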
In step S4, JS code capable of generating the implicit parameters is built. In essence, the purpose of step S4 is to examine the JS code and remove what is unnecessary, that is, to simplify it, and to replace browser objects with JS objects so that the JS code can execute outside the browser environment.
Specifically, the JS code capable of generating the implicit parameters can be built as follows.
It should be noted that the JS code is built separately for the collected data of each page. As described above, the execution environment of JS code is a page. The JS code to be simplified can be obtained from the HTML source code and from JS messages.
In step S41, JS activity history information is obtained from the stack snapshots.
As described above, the stack snapshots reveal the calling process of the JS functions and hence the execution order of the JS code. The stack snapshots are converted into a history of functions being pushed onto and popped from the stack, which serves as the JS activity history information.
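One way to picture the conversion in step S41 is to compare consecutive snapshots and record what was pushed and popped between them. The sketch below assumes each snapshot is a list of function names ordered from the bottom of the stack to the top; this representation and the function name snapshots_to_history are assumptions, not part of the disclosure.

```python
def snapshots_to_history(snapshots):
    """Convert consecutive stack snapshots into a push/pop history."""
    history = []
    previous = []
    for current in snapshots:
        # Length of the prefix (bottom of the stack) shared by both snapshots.
        common = 0
        while (common < len(previous) and common < len(current)
               and previous[common] == current[common]):
            common += 1
        # Frames above the shared prefix were popped, topmost first.
        for name in reversed(previous[common:]):
            history.append(("pop", name))
        # New frames above the shared prefix were pushed, bottom first.
        for name in current[common:]:
            history.append(("push", name))
        previous = current
    return history
```

The set of function names appearing in the resulting history identifies the functions that were actually active during the observed access.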
In step S42, a JS syntax tree is parsed from the JS code contained in the HTML source code and the HTTP messages.
Step S42 can use Rhino, a tool commonly used in the art, to parse the JS syntax tree.
In the JS syntax tree, variables are operated on within functions mainly at SETVAR, GETVAR, SETPROP, and GETPROP nodes.
In step S43, the JS syntax tree is traversed and marked according to the JS activity history information.
Specifically, functions unrelated to the JS activity history information are marked as useless (step S431). Functions that do not appear in the JS event flow are marked as useless. That is, according to the stack snapshots, these functions were neither pushed onto nor popped from the stack during execution of the JS code related to user operations.
At the GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables, and between variables and variables, in the JS syntax tree are determined, and the results are attached to the corresponding nodes (step S432).
Browser objects unrelated to the service are marked (step S433).
The JS environment of a browser has many objects, some of which are introduced from objects inside the browser. If the JS code is to run without a browser environment, the essential browser objects must therefore be reimplemented. In practice, many browser objects concern the manipulation of UI elements, CSS, the browser status bar, windows, and so on, and do not need to be implemented. They are therefore marked in this step.
Key objects used for producing HTTP requests are marked (step S434).
Because the aim is to reproduce a series of HTTP messages by sending HTTP requests and obtaining HTTP responses, the key objects used for producing HTTP requests are needed.
In step S44, the JS syntax tree is simplified by retaining only the functions and nodes related to the service.
Specifically, all functions marked as useless are removed (step S441).
The JS syntax tree is traversed to locate the key objects and their corresponding key nodes (step S442).
The variables related to the key nodes are put into a dependency set (step S443).
Each key node is examined together with its parent nodes and sibling nodes up to the root node, and each node is processed as follows: if a variable in the dependency set depends on a variable in the current node, all variables in the current node are added to the dependency set; otherwise the current node is deleted. Nodes that set up events and nodes that call browser objects unrelated to the service are also removed (step S444).
A constant table is set up to preserve the constants related to the service (step S445).
In some JS code, constants are assigned to variables. In this case a constant table can be set up to retain the assigned constants, so that they can be assigned to the variables when the JS code is executed to produce the implicit parameters.
The browser objects related to the service are replaced with custom objects (step S446). The regenerated JS code executes outside the browser, so browser objects no longer exist. To generate the implicit parameters normally, custom JS objects must take over the functions of the browser objects.
JS code is generated from the simplified JS syntax tree, the constant table, and the custom objects (step S447). This process is the inverse of parsing the JS syntax tree from the JS code in step S42 above and is well known to those skilled in the art.
At this point, the JS code capable of generating the implicit parameters has been constructed, and it is simplified JS code.
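The dependency-set pruning of steps S443 and S444 resembles a backward slice from the key node. The sketch below illustrates the idea on a flat list of statements rather than on a real syntax tree, which is a deliberate simplification. Each statement is modelled as a pair of sets (variables written, variables read); this model and the function name slice_statements are assumptions of the sketch.

```python
def slice_statements(statements, key_index):
    """Keep only statements that the key statement depends on.

    statements: list of (targets, sources) pairs of sets, in program order.
    key_index:  index of the key node, e.g. the statement sending the request.
    Returns (sorted indexes of kept statements, final dependency set).
    """
    _, key_sources = statements[key_index]
    dependency = set(key_sources)          # S443: variables used by the key node
    kept = [key_index]
    # S444 (simplified): walk backwards from the key node.
    for i in range(key_index - 1, -1, -1):
        targets, sources = statements[i]
        if targets & dependency:
            # The dependency set relies on this statement: keep it and
            # add all of its variables to the dependency set.
            dependency |= targets | sources
            kept.append(i)
        # Otherwise the statement is dropped.
    return sorted(kept), dependency
```

For instance, with statements standing for ts = now(), color = "red", sig = hash(ts, q), title = color, and send(sig, q), only the first, third, and last would survive, since the other two affect presentation alone.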
By the above processing, the first XPath capable of generating the explicit parameters and the JS code capable of generating the implicit parameters have been obtained. However, as described above, the whole deep web page access is a sequence of HTTP messages. Therefore, in step S5, a second XPath capable of generating the final deep web page return result is built, so that the result of the data extraction can be presented to the user. Then, in step S6, a structure representing the internal flow of the service is built according to the order of the JS events, stack snapshots, and observed HTTP messages. This structure indicates in what order an HTTP request is generated from at least one of the user input parameters, explicit parameters, and implicit parameters; how, based on the corresponding HTTP response, explicit parameters are generated using the first XPath and implicit parameters are generated using the JS code; and how a new HTTP request is created from the newly generated parameters, the cycle repeating until the final HTTP response is obtained and the final deep web page return result is generated using the second XPath. As for execution of the JS code, the DOM model is event-driven; that is, event handler functions are activated by mouse clicks, key presses, or timers. In the absence of a browser environment and user operations, these event handler functions are activated programmatically (according to the structure representing the internal flow of the service) in order to reproduce the deep web page access.
The above steps yield the user input parameters, the first and second XPaths, the simplified JS code, and the structure representing the internal flow of the service, which together constitute the network service.
A deep web page data extraction method according to an embodiment of the present invention is described next with reference to Fig. 2.
Fig. 2 shows a flow chart of the deep web page data extraction method according to an embodiment of the present invention. As shown in Fig. 2, the deep web page data extraction method according to the present invention comprises the following steps: building an HTTP request according to user input parameters (step S21); obtaining the HTTP response corresponding to the HTTP request (step S22); according to the obtained HTTP response, generating explicit parameters using the first XPath and generating implicit parameters using the JS code (step S23); building an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters (step S24); obtaining the HTTP response corresponding to that HTTP request (step S25); and repeating the above generating, building, and obtaining steps according to the structure representing the internal flow of the service, until the final deep web page return result is generated from the obtained HTTP response using the second XPath (step S26).
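The loop formed by steps S21 to S26 can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the flow dictionary stands in for the structure representing the internal flow of the service, its keys (first_request, steps, explicit, implicit, build_request, final_extract) are hypothetical, and fetch is any caller-supplied function that sends a request and returns the response, so no real network library is assumed.

```python
def extract_deep_web_data(user_params, flow, fetch):
    """Replay a recorded access and return the final extracted result."""
    params = dict(user_params)
    # S21, S22: build the first request from user input and obtain its response.
    response = fetch(flow["first_request"](params))
    # S26: repeat generate, build, obtain in the order given by the flow structure.
    for step in flow["steps"]:
        # S23: explicit parameters via the first XPath (here a callable).
        params.update(step["explicit"](response))
        # S23: implicit parameters via the simplified JS code (here a callable).
        params.update(step["implicit"](params))
        # S24, S25: build the next request and obtain its response.
        response = fetch(step["build_request"](params))
    # Final result via the second XPath (here a callable).
    return flow["final_extract"](response)
```

Passing fetch in as an argument keeps the sketch independent of any particular HTTP client and makes the replay logic testable without a live site.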
Network service construction equipment according to an embodiment of the present invention is described next with reference to Fig. 3.
Fig. 3 shows a block diagram of the network service construction equipment according to an embodiment of the present invention. As shown in Fig. 3, the network service construction equipment 300 according to the present invention includes: a collection device 31 configured to collect data related to access to a deep web page, the data including HTTP messages, JS events, and stack snapshots; a parameter search device 32 configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters; a first XPath construction device 33 configured to build a first XPath capable of generating the explicit parameters; a JS code construction device 34 configured to build JS code capable of generating the implicit parameters; a second XPath construction device 35 configured to build a second XPath capable of generating the final deep web page return result; and a structure construction device 36 configured to build, according to the order of the JS events, stack snapshots, and observed HTTP messages, a structure representing the internal flow of the service; wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
In a specific embodiment, network service builds equipment 300 also includes sorter 37, the sorter 37 It is configured to be classified the data collected by the collection device 31 by the page.
In a specific embodiment, network service builds equipment 300 also includes removal device 38, the removal device 38 It is configured as being removed from the sorted page of the sorter 37 related data of the page unrelated with user's operation.
In a specific embodiment, removal device 38 is further configured to be marked using JS events and user data Device, selects the page relevant with user's operation, so that it is determined that the page unrelated with user's operation.
In a specific embodiment, the page relevant with user's operation includes occurring the page of JS events and its next The page of the page and user data marker mark.
In a specific embodiment, the parameter search device 32 searches for service-related parameters in the following locations of an HTTP message: the URL query part, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data of an XMLHttpRequest.
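As a non-limiting illustration, the search locations listed above can be enumerated with the Python standard library. The sketch assumes a form-encoded POST body and a request header dictionary with a `Cookie` key; the function name `candidate_parameters` and the `(location, name, value)` tuple layout are hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlsplit


def candidate_parameters(method, url, headers, body=""):
    """Collect (location, name, value) triples from one recorded HTTP request."""
    parts = urlsplit(url)
    found = []
    # URL query part, e.g. ?q=phone&page=2
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        found.append(("query", name, value))
    # URL path segments have no names, so their position is used instead.
    segments = [s for s in parts.path.split("/") if s]
    for index, segment in enumerate(segments):
        found.append(("path", str(index), segment))
    # Cookie header field
    cookie = SimpleCookie()
    cookie.load(headers.get("Cookie", ""))
    for name, morsel in cookie.items():
        found.append(("cookie", name, morsel.value))
    # POST body, including bodies sent through XMLHttpRequest
    if method.upper() == "POST":
        for name, value in parse_qsl(body, keep_blank_values=True):
            found.append(("post", name, value))
    return found
```

Each triple is only a candidate; deciding whether it is a user input, explicit, or implicit parameter is a separate step.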
In a specific embodiment, the JS code construction device 34 constructs, for the data collected for each page, JS code capable of generating the implicit parameters.
In a specific embodiment, the JS code construction device 34 includes: a history information obtaining unit 341 for obtaining JS activity history information according to the stack snapshots; a parsing unit 342 for parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages; a marking unit 343 for traversing and marking the JS syntax tree according to the JS activity history information; a simplification unit 344 for simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are related to the service; a constant table establishing unit 345 for establishing a constant table to store service-related constants; a replacement unit 346 for replacing service-related browser objects with custom objects; and a generation unit 347 for generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
In a specific embodiment, the marking unit 343 is configured to: mark functions unrelated to the JS activity history information as useless; determine, at GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables, and attach the results to the corresponding nodes; mark browser objects unrelated to the service; and mark the key objects used to produce HTTP requests.
In a specific embodiment, the simplification unit 344 is configured to: remove all functions marked as useless; traverse the JS syntax tree to locate the key objects and their corresponding key nodes; put the variables related to the key nodes into a dependency set; examine each key node and its parent nodes and sibling nodes up to the root node, performing the following processing: if a variable in the dependency set depends on a variable in the current node, add all variables in the current node to the dependency set, otherwise delete the current node; and remove nodes that set up events and nodes that call browser objects unrelated to the service.
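The dependency-set processing can be illustrated with a simplified Python sketch. Several assumptions are made that go beyond the embodiment: the syntax tree is flattened into an ordered list of statements, each node is annotated with the variables it defines and uses, and "a variable in the dependency set depends on a variable in the current node" is read as "the current node assigns a variable that is already in the dependency set". The names `Node` and `simplify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    text: str                        # source text of the statement
    defines: frozenset = frozenset()  # variables assigned by the statement
    uses: frozenset = frozenset()     # variables read by the statement


def simplify(nodes, key_index):
    """Keep the key node and only the earlier nodes it depends on."""
    key = nodes[key_index]
    dependency_set = set(key.uses)   # variables related to the key node
    kept = [key]
    # Walk backwards from the key node towards the start of the program.
    for node in reversed(nodes[:key_index]):
        if node.defines & dependency_set:
            # The node feeds the dependency set: keep it and add its variables.
            dependency_set |= node.defines | node.uses
            kept.append(node)
        # Otherwise the node is dropped (deleted from the simplified tree).
    kept.reverse()
    return kept, dependency_set


# Hypothetical page script; the last statement is the key node that sends the request.
program = [
    Node("var a = 1", frozenset({"a"})),
    Node("var banner = 'x'", frozenset({"banner"})),
    Node("var token = hash(a)", frozenset({"token"}), frozenset({"a"})),
    Node("showBanner(banner)", frozenset(), frozenset({"banner"})),
    Node("xhr.send(token)", frozenset(), frozenset({"token"})),
]
kept, deps = simplify(program, 4)
```

In this example the banner-related statements are dropped, while the statements that compute `token` survive together with the key node.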
Since the processing in each device included in the network service construction equipment 300 according to the present invention is similar to the processing in steps S1-S6 of the network service construction method described above, a detailed description of these devices is omitted here for brevity.
Deep web page data extraction equipment according to an embodiment of the present invention is described below with reference to Fig. 4.
Fig. 4 shows a block diagram of deep web page data extraction equipment 400 according to an embodiment of the present invention. As shown in Fig. 4, the deep web page data extraction equipment 400 according to the present invention includes: a first constructing device 41 for constructing an HTTP request according to the user input parameters; an acquisition device 42 for acquiring the HTTP response corresponding to the HTTP request; a parameter generation device 43 for generating, according to the acquired HTTP response, the explicit parameters using the first XPath and the implicit parameters using the JS code; a second constructing device 44 for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; and a control device 45 for instructing, according to the structure representing the internal flow of the service, the parameter generation device 43, the second constructing device 44, and the acquisition device 42 to perform their operations until the final deep web page return result is generated from the acquired HTTP response using the second XPath.
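The cooperation of devices 41 to 45 can be sketched as a simple loop in Python. This is a sketch under stated assumptions rather than the embodiment itself: the structure representing the internal flow of the service is modeled as an ordered list of step dictionaries, plain Python callables stand in for the first XPath, the generated JS code, and the second XPath, and the HTTP transport is injected as a `fetch` function so that no network access is needed. All names (`run_service`, `steps`, `fetch`, `extract_result`) are hypothetical.

```python
def run_service(user_params, steps, fetch, extract_result):
    """Drive the request/response loop of the extraction equipment.

    steps: ordered list of dicts with the hypothetical keys
        "url":      target of the request for this step
        "explicit": {name: f(previous_response) -> value}  # stands in for the first XPath
        "implicit": {name: f(params) -> value}             # stands in for the JS code
    fetch: f(url, params) -> response text
    extract_result: f(last_response) -> result             # stands in for the second XPath
    """
    params = dict(user_params)
    # First constructing device and acquisition device.
    response = fetch(steps[0]["url"], params)
    # Control device: repeat generation, construction, and acquisition.
    for step in steps[1:]:
        for name, from_response in step.get("explicit", {}).items():
            params[name] = from_response(response)
        for name, from_params in step.get("implicit", {}).items():
            params[name] = from_params(params)
        response = fetch(step["url"], params)
    return extract_result(response)
```

Injecting `fetch` keeps the loop independent of any particular HTTP library; in practice it would wrap whatever client sends the constructed request.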
Since the processing in each device included in the deep web page data extraction equipment 400 according to the present invention is similar to the processing in steps S21-S26 of the deep web page data extraction method described above, a detailed description of these devices is omitted here for brevity.
In addition, it should be noted that each constituent device and unit in the above equipment may be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration is well known to those skilled in the art and is not repeated here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 500 shown in Fig. 5); with various programs installed, the computer is capable of performing various functions.
Fig. 5 shows a schematic block diagram of a computer that can be used to implement the methods and equipment according to embodiments of the present invention.
In Fig. 5, a central processing unit (CPU) 501 performs various processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores, as needed, the data required when the CPU 501 performs the various processing. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.
The following components are connected to the input/output interface 505: an input section 506 (including a keyboard, a mouse, etc.), an output section 507 (including a display, such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker, etc.), the storage section 508 (including a hard disk, etc.), and a communication section 509 (including a network interface card such as a LAN card, a modem, etc.). The communication section 509 performs communication processing via a network such as the Internet. A drive 510 may also be connected to the input/output interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it is installed into the storage section 508 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 511 shown in Fig. 5, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 502, a hard disk included in the storage section 508, or the like, which stores the program and is distributed to the user together with the equipment containing it.
The present invention also provides a program product storing machine-readable instruction code. When read and executed by a machine, the instruction code can carry out the above methods according to embodiments of the present invention.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of its specific embodiments, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
Supplementary Notes
1. A network service construction method, including:
collecting data related to access to deep web pages, the data including HTTP messages, JS events, and stack snapshots;
searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters;
constructing a first XPath capable of generating the explicit parameters;
constructing JS code capable of generating the implicit parameters;
constructing a second XPath capable of generating the final deep web page return result; and
constructing a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the order of the observed HTTP messages;
wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
2. The network service construction method according to note 1, wherein after the collecting step, the collected data are classified by page.
3. The network service construction method according to note 2, wherein after the classifying step, the data related to pages unrelated to user operations are removed.
4. The network service construction method according to note 3, wherein JS events and a user data marker are used to select the pages related to user operations, thereby determining the pages unrelated to user operations.
5. The network service construction method according to note 4, wherein the pages related to user operations include the pages on which JS events occurred, the pages that immediately follow them, and the pages marked by the user data marker.
6. The network service construction method according to note 1, wherein service-related parameters are searched for in the following locations of an HTTP message: the URL query part, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data of an XMLHttpRequest.
7. The network service construction method according to note 2, wherein constructing the JS code capable of generating the implicit parameters includes:
for the data collected for each page,
obtaining JS activity history information according to the stack snapshots;
parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
traversing and marking the JS syntax tree according to the JS activity history information;
simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are related to the service;
establishing a constant table to store service-related constants;
replacing service-related browser objects with custom objects; and
generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
8. The network service construction method according to note 7, wherein marking the JS syntax tree includes:
marking functions unrelated to the JS activity history information as useless;
determining, at GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables, and attaching the results to the corresponding nodes;
marking browser objects unrelated to the service; and
marking the key objects used to produce HTTP requests.
9. The network service construction method according to note 8, wherein simplifying the JS syntax tree includes:
removing all functions marked as useless;
traversing the JS syntax tree to locate the key objects and their corresponding key nodes;
putting the variables related to the key nodes into a dependency set;
examining each key node and its parent nodes and sibling nodes up to the root node, performing the following processing:
if a variable in the dependency set depends on a variable in the current node, adding all variables in the current node to the dependency set;
otherwise deleting the current node; and
removing nodes that set up events and nodes that call browser objects unrelated to the service.
10. A method of extracting data of deep web pages using a network service constructed by the method according to any one of notes 1-9, including:
constructing an HTTP request according to the user input parameters;
acquiring the HTTP response corresponding to the HTTP request;
generating, according to the acquired HTTP response, the explicit parameters using the first XPath and the implicit parameters using the JS code;
constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
acquiring the HTTP response corresponding to the HTTP request; and
repeating the above generating, constructing, and acquiring steps according to the structure representing the internal flow of the service, until the final deep web page return result is generated from the acquired HTTP response using the second XPath.
11. Network service construction equipment, including:
a collection device configured to collect data related to access to deep web pages, the data including HTTP messages, JS events, and stack snapshots;
a parameter search device configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters;
a first XPath construction device configured to construct a first XPath capable of generating the explicit parameters;
a JS code construction device configured to construct JS code capable of generating the implicit parameters;
a second XPath construction device configured to construct a second XPath capable of generating the final deep web page return result; and
a structure construction device configured to construct a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the order of the observed HTTP messages;
wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service.
12. The network service construction equipment according to note 11, further including a classification device configured to classify the data collected by the collection device by page.
13. The network service construction equipment according to note 12, further including a removal device configured to remove, from the pages classified by the classification device, the data related to pages unrelated to user operations.
14. The network service construction equipment according to note 13, wherein the removal device is further configured to use JS events and a user data marker to select the pages related to user operations, thereby determining the pages unrelated to user operations.
15. The network service construction equipment according to note 14, wherein the pages related to user operations include the pages on which JS events occurred, the pages that immediately follow them, and the pages marked by the user data marker.
16. The network service construction equipment according to note 11, wherein the parameter search device searches for service-related parameters in the following locations of an HTTP message: the URL query part, the cookie field, the URL path, the POST body of an HTTP POST message, and the POST data of an XMLHttpRequest.
17. The network service construction equipment according to note 12, wherein the JS code construction device constructs, for the data collected for each page, JS code capable of generating the implicit parameters, and the JS code construction device includes:
a history information obtaining unit for obtaining JS activity history information according to the stack snapshots;
a parsing unit for parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
a marking unit for traversing and marking the JS syntax tree according to the JS activity history information;
a simplification unit for simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are related to the service;
a constant table establishing unit for establishing a constant table to store service-related constants;
a replacement unit for replacing service-related browser objects with custom objects; and
a generation unit for generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
18. The network service construction equipment according to note 17, wherein the marking unit is configured to:
mark functions unrelated to the JS activity history information as useless;
determine, at GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables, and attach the results to the corresponding nodes;
mark browser objects unrelated to the service; and
mark the key objects used to produce HTTP requests.
19. The network service construction equipment according to note 18, wherein the simplification unit is configured to:
remove all functions marked as useless;
traverse the JS syntax tree to locate the key objects and their corresponding key nodes;
put the variables related to the key nodes into a dependency set;
examine each key node and its parent nodes and sibling nodes up to the root node, performing the following processing:
if a variable in the dependency set depends on a variable in the current node, add all variables in the current node to the dependency set;
otherwise delete the current node; and
remove nodes that set up events and nodes that call browser objects unrelated to the service.
20. Equipment for extracting data of deep web pages using a network service constructed by the equipment according to any one of notes 11-19, including:
a first constructing device for constructing an HTTP request according to the user input parameters;
an acquisition device for acquiring the HTTP response corresponding to the HTTP request;
a parameter generation device for generating, according to the acquired HTTP response, the explicit parameters using the first XPath and the implicit parameters using the JS code;
a second constructing device for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; and
a control device for instructing, according to the structure representing the internal flow of the service, the parameter generation device, the second constructing device, and the acquisition device to perform their operations until the final deep web page return result is generated from the acquired HTTP response using the second XPath.

Claims (10)

1. A network service construction method, including:
collecting data related to access to deep web pages, the data including HTTP messages, JS events, and stack snapshots;
searching the collected HTTP messages for service-related parameters, and dividing the parameters found into user input parameters, explicit parameters, and implicit parameters;
constructing a first XPath capable of generating the explicit parameters;
constructing JS code capable of generating the implicit parameters;
constructing a second XPath capable of generating the final deep web page return result; and
constructing a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the order of the observed HTTP messages;
wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service;
wherein the explicit parameters can be extracted from the original page using XPath, and the implicit parameters can be obtained only by executing JS code.
2. The network service construction method according to claim 1, wherein after the collecting step, the collected data are classified by page.
3. The network service construction method according to claim 2, wherein after the classifying step, the data related to pages unrelated to user operations are removed.
4. The network service construction method according to claim 3, wherein JS events and a user data marker are used to select the pages related to user operations, thereby determining the pages unrelated to user operations.
5. The network service construction method according to claim 2, wherein constructing the JS code capable of generating the implicit parameters includes:
for the data collected for each page,
obtaining JS activity history information according to the stack snapshots;
parsing a JS syntax tree from the JS code contained in the HTML source code and the HTTP messages;
traversing and marking the JS syntax tree according to the JS activity history information;
simplifying the JS syntax tree by retaining only the functions and nodes in the JS syntax tree that are related to the service;
establishing a constant table to store service-related constants;
replacing service-related browser objects with custom objects; and
generating the JS code based on the simplified JS syntax tree, the constant table, and the custom objects.
6. The network service construction method according to claim 5, wherein marking the JS syntax tree includes:
marking functions unrelated to the JS activity history information as useless;
determining, at GETVAR, SETVAR, GETPROP, and SETPROP nodes, the dependencies between functions and variables and between variables, and attaching the results to the corresponding nodes;
marking browser objects unrelated to the service; and
marking the key objects used to produce HTTP requests.
7. The network service construction method according to claim 6, wherein simplifying the JS syntax tree includes:
removing all functions marked as useless;
traversing the JS syntax tree to locate the key objects and their corresponding key nodes;
putting the variables related to the key nodes into a dependency set;
examining each key node and its parent nodes and sibling nodes up to the root node, performing the following processing:
if a variable in the dependency set depends on a variable in the current node, adding all variables in the current node to the dependency set;
otherwise deleting the current node; and
removing nodes that set up events and nodes that call browser objects unrelated to the service.
8. A method of extracting data of deep web pages using a network service constructed by the method according to any one of claims 1-7, including:
constructing an HTTP request according to the user input parameters;
acquiring the HTTP response corresponding to the HTTP request;
generating, according to the acquired HTTP response, the explicit parameters using the first XPath and the implicit parameters using the JS code;
constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters;
acquiring the HTTP response corresponding to the HTTP request; and
repeating, according to the structure representing the internal flow of the service, the above generating, constructing, and acquiring steps performed in sequence, until the final deep web page return result is generated from the acquired HTTP response using the second XPath.
9. Network service construction equipment, including:
a collection device configured to collect data related to access to deep web pages, the data including HTTP messages, JS events, and stack snapshots;
a parameter search device configured to search the collected HTTP messages for service-related parameters and to divide the parameters found into user input parameters, explicit parameters, and implicit parameters;
a first XPath construction device configured to construct a first XPath capable of generating the explicit parameters;
a JS code construction device configured to construct JS code capable of generating the implicit parameters;
a second XPath construction device configured to construct a second XPath capable of generating the final deep web page return result; and
a structure construction device configured to construct a structure representing the internal flow of the service according to the JS events, the stack snapshots, and the order of the observed HTTP messages;
wherein the user input parameters, the JS code, the first and second XPaths, and the structure representing the internal flow of the service constitute the network service;
wherein the explicit parameters can be extracted from the original page using XPath, and the implicit parameters can be obtained only by executing JS code.
10. Equipment for extracting data of deep web pages using a network service constructed by the equipment according to claim 9, including:
a first constructing device for constructing an HTTP request according to the user input parameters;
an acquisition device for acquiring the HTTP response corresponding to the HTTP request;
a parameter generation device for generating, according to the acquired HTTP response, the explicit parameters using the first XPath and the implicit parameters using the JS code;
a second constructing device for constructing an HTTP request according to at least one of the user input parameters, the explicit parameters, and the implicit parameters; and
a control device for instructing, according to the structure representing the internal flow of the service, the parameter generation device, the second constructing device, and the acquisition device to perform their operations until the final deep web page return result is generated from the acquired HTTP response using the second XPath.
CN201210479166.7A 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment Expired - Fee Related CN103838747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210479166.7A CN103838747B (en) 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210479166.7A CN103838747B (en) 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment

Publications (2)

Publication Number Publication Date
CN103838747A CN103838747A (en) 2014-06-04
CN103838747B true CN103838747B (en) 2017-07-07

Family

ID=50802261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210479166.7A Expired - Fee Related CN103838747B (en) 2012-11-22 2012-11-22 Network service construction method and equipment and webpage data extracting method and equipment

Country Status (1)

Country Link
CN (1) CN103838747B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202244B (en) * 2016-06-28 2020-01-21 深圳中兴网信科技有限公司 Webpage message returning method and webpage message returning system
CN111368104B (en) * 2018-12-26 2023-05-26 阿里巴巴集团控股有限公司 Information processing method, device and equipment
CN113778389A (en) * 2020-09-23 2021-12-10 北京沃东天骏信息技术有限公司 Interface idempotent judging method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251852A (en) * 2008-01-11 2008-08-27 孟小峰 Integrating system and method of Web data facing to field
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage
CN102663041A (en) * 2012-03-28 2012-09-12 重庆大学 Automatic extraction method oriented to data of deep web pages
CN102682119A (en) * 2012-05-16 2012-09-19 崔志明 Deep webpage data acquiring method based on dynamic knowledge

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7533085B2 (en) * 2006-08-14 2009-05-12 International Business Machines Corporation Method for searching deep web services

Also Published As

Publication number Publication date
CN103838747A (en) 2014-06-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170707

Termination date: 20181122
