CN108196874A - Webpage analysis method, device, storage medium, and program product - Google Patents


Publication number
CN108196874A
Authority
CN
China
Prior art keywords
web page
page element
data
condition
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711481065.2A
Other languages
Chinese (zh)
Other versions
CN108196874B (en
Inventor
邹荣珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201711481065.2A priority Critical patent/CN108196874B/en
Publication of CN108196874A publication Critical patent/CN108196874A/en
Application granted granted Critical
Publication of CN108196874B publication Critical patent/CN108196874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the present application disclose a webpage analysis method and device for analyzing webpages quickly and simply. The method includes: matching web data to be analyzed against preset screening conditions, taking the preset screening conditions that match the web data to be analyzed as target conditions, and obtaining the data corresponding to each target condition in the web data to be analyzed; determining the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements; taking the data corresponding to any target condition in the web data to be analyzed as the data of the basic web page element corresponding to that target condition; and outputting the basic web page element corresponding to each target condition, together with the data of that basic web page element, as the webpage analysis result.

Description

Webpage analysis method, device, storage medium, and program product
Technical field
This application relates to the technical field of data processing, and in particular to a webpage analysis method, device, storage medium, and program product.
Background
With the development and maturation of the Internet and related technologies, extracting data from webpages has become an important means of obtaining information. Extracting data from a webpage requires analyzing the webpage structure: webpage analysis yields the specific location of the data in the page, so that the data can be extracted from the page.
At present, the more common webpage analysis methods are based on the Document Object Model (DOM). According to the DOM specification, every component of a web document is a node: the entire web document is a document node, each web page tag is an element node, the text contained in an element is a text node, each attribute is an attribute node, and a comment is a comment node. These nodes are all related to one another. DOM-based webpage analysis proceeds as follows: the webpage source code is analyzed to obtain the relationships between the nodes defined in it, the interfaces provided by the DOM specification are called to convert those relationships into a DOM tree, and the required data is then obtained by looking up nodes in the DOM tree.
However, because the tag elements and inline code in webpage source code are extremely rich, and styles and layouts vary endlessly, the DOM-based webpage analysis method described above is complex to implement and error-prone.
Summary of the invention
In view of this, the embodiments of the present application provide a webpage analysis method, device, storage medium, and program product to reduce the complexity and error rate of webpage analysis.
To solve the above problem, the technical solutions provided by the embodiments of the present application are as follows:
A webpage analysis method, the method including:
matching web data to be analyzed against preset screening conditions, taking the preset screening conditions that match the web data to be analyzed as target conditions, and obtaining the data corresponding to each target condition in the web data to be analyzed;
determining the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements;
taking the data corresponding to any target condition in the web data to be analyzed as the data of the basic web page element corresponding to that target condition;
outputting the basic web page element corresponding to each target condition, and the data of that basic web page element, as the webpage analysis result.
Optionally, the method further includes:
determining, according to the hierarchical composition relationship between web page elements, the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition;
outputting the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition as the webpage analysis result.
Optionally, determining, according to the hierarchical composition relationship between web page elements, the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition includes:
according to the hierarchical composition relationship between web page elements, taking the upper-level web page element of the basic web page element corresponding to the target condition as a result web page element, then repeatedly taking the upper-level web page element of the current result web page element as a further result web page element until the result web page element is the top-level web page element, and taking all of the result web page elements as the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
Optionally, the method further includes:
taking the basic web page element corresponding to the target condition as a web page element whose data has been obtained;
according to the hierarchical composition relationship between web page elements, when every next-level web page element of a parent web page element is detected to be a web page element whose data has been obtained, generating the data of the parent web page element from the data of its next-level web page elements, where the parent web page element is the upper-level web page element of a web page element whose data has been obtained;
taking the parent web page element as a web page element whose data has been obtained, and repeating the step of generating the data of a parent web page element from the data of its next-level web page elements when every next-level web page element of that parent web page element is detected to be a web page element whose data has been obtained, until the parent web page element is the top-level web page element;
outputting the data of each parent web page element as the webpage analysis result.
Optionally, the method further includes:
matching the webpage to be analyzed against boundary filtering conditions, taking the boundary filtering conditions that match the web data to be analyzed as boundary conditions, and obtaining the data corresponding to each boundary condition in the web data to be analyzed;
determining the web page element corresponding to each boundary condition according to the correspondence between the boundary filtering conditions and web page elements;
taking the data corresponding to any boundary condition in the web data to be analyzed as the data of the web page element corresponding to that boundary condition;
taking the web page element corresponding to the boundary condition as a web page element whose data has been obtained, and repeating the step of generating the data of a parent web page element from the data of its next-level web page elements when every next-level web page element of that parent web page element is detected to be a web page element whose data has been obtained, until the parent web page element is the top-level web page element.
Optionally, the preset screening conditions include data screening conditions that describe preset basic web page elements.
A webpage analysis device, the device including:
a first matching unit, configured to match web data to be analyzed against preset screening conditions, take the preset screening conditions that match the web data to be analyzed as target conditions, and obtain the data corresponding to each target condition in the web data to be analyzed;
a first determination unit, configured to determine the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements;
a second determination unit, configured to take the data corresponding to any target condition in the web data to be analyzed as the data of the basic web page element corresponding to that target condition;
an output unit, configured to output the basic web page element corresponding to each target condition, and the data of that basic web page element, as the webpage analysis result.
Optionally, the device further includes:
a third determination unit, configured to determine, according to the hierarchical composition relationship between web page elements, the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition;
the output unit being further configured to output the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition as the webpage analysis result.
Optionally, the third determination unit is specifically configured to:
according to the hierarchical composition relationship between web page elements, take the upper-level web page element of the basic web page element corresponding to the target condition as a result web page element, then repeatedly take the upper-level web page element of the current result web page element as a further result web page element until the result web page element is the top-level web page element, and take all of the result web page elements as the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
Optionally, the device further includes:
a fourth determination unit, configured to take the basic web page element corresponding to the target condition as a web page element whose data has been obtained;
a generation unit, configured to, according to the hierarchical composition relationship between web page elements, generate the data of a parent web page element from the data of its next-level web page elements when every next-level web page element of the parent web page element is detected to be a web page element whose data has been obtained, where the parent web page element is the upper-level web page element of a web page element whose data has been obtained;
a first trigger unit, configured to take the parent web page element as a web page element whose data has been obtained and trigger the generation unit to repeat the generation of the data of a parent web page element from the data of its next-level web page elements when every next-level web page element of that parent web page element is detected to be a web page element whose data has been obtained, until the parent web page element is the top-level web page element;
the output unit being further configured to output the data of each parent web page element as the webpage analysis result.
Optionally, the device further includes:
a second matching unit, configured to match the webpage to be analyzed against boundary filtering conditions, take the boundary filtering conditions that match the web data to be analyzed as boundary conditions, and obtain the data corresponding to each boundary condition in the web data to be analyzed;
a fifth determination unit, configured to determine the web page element corresponding to each boundary condition according to the correspondence between the boundary filtering conditions and web page elements;
a sixth determination unit, configured to take the data corresponding to any boundary condition in the web data to be analyzed as the data of the web page element corresponding to that boundary condition;
a second trigger unit, configured to take the web page element corresponding to the boundary condition as a web page element whose data has been obtained and trigger the generation unit to repeat the generation of the data of a parent web page element from the data of its next-level web page elements when every next-level web page element of that parent web page element is detected to be a web page element whose data has been obtained, until the parent web page element is the top-level web page element.
Optionally, the preset screening conditions include data screening conditions that describe preset basic web page elements.
A computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to perform the webpage analysis method described above.
A computer program product which, when run on a terminal device, causes the terminal device to perform the webpage analysis method described above.
It can be seen that the embodiments of the present application have the following beneficial effects:
In the embodiments of the present application, web data to be analyzed is matched against preset screening conditions, and the preset screening conditions that match the web data to be analyzed are taken as target conditions. Because each target condition corresponds to a basic web page element, the basic web page elements contained in the web data to be analyzed can be obtained quickly. The matching process also yields the data corresponding to each target condition in the web data to be analyzed; because each target condition corresponds to a basic web page element, that data is the data of the corresponding basic web page element. The basic web page element corresponding to each target condition and its data are then output as the webpage analysis result, which completes the analysis of the webpage. This process does not need to analyze the complex, multi-layer nested tag descriptions in the webpage source code, which reduces the complexity and error rate of webpage analysis.
Description of the drawings
Fig. 1 is a flowchart of a webpage analysis method embodiment provided in the embodiments of the present application;
Fig. 2 is a schematic diagram of a hierarchical composition relationship between web page elements provided in the embodiments of the present application;
Fig. 3 is a schematic diagram of another hierarchical composition relationship between web page elements provided in the embodiments of the present application;
Fig. 4 is a schematic diagram of yet another hierarchical composition relationship between web page elements provided in the embodiments of the present application;
Fig. 5 is a flowchart of another webpage analysis method embodiment provided in the embodiments of the present application;
Fig. 6 is a flowchart of another webpage analysis method embodiment provided in the embodiments of the present application;
Fig. 7 is a flowchart of another webpage analysis method embodiment provided in the embodiments of the present application;
Fig. 8 is a schematic diagram of the implementation process of the webpage analysis method provided in the embodiments of the present application;
Fig. 9 is a schematic diagram of a webpage analysis device embodiment provided in the embodiments of the present application.
Detailed description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and specific implementations.
To improve the efficiency of webpage analysis, the embodiments of the present application provide a webpage analysis method and device. By matching web data to be analyzed against preset screening conditions, the basic web page elements in the web data that match the preset screening conditions, and the data of those basic web page elements, can be obtained. In addition, through the hierarchical composition relationship between web page elements, the ancestor web page elements that have a hierarchical composition relationship with the obtained basic web page elements can be obtained, and the data of the parent web page elements that these basic web page elements compose can be generated from the data of the obtained basic web page elements. Furthermore, abnormal data segments can be found and skipped promptly according to boundary filtering conditions, so that the data of the upper-level web page element of a web page element with abnormal data is matched directly. The webpage analysis method provided by the embodiments of the present application thus analyzes a webpage without analyzing the complex, multi-layer nested tag descriptions in the webpage source code, which reduces the complexity and error rate of webpage analysis.
Various non-limiting embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, which shows a webpage analysis method embodiment provided in the embodiments of the present application, this embodiment may include the following steps:
Step 101: Match the web data to be analyzed against preset screening conditions, take the preset screening conditions that match the web data to be analyzed as target conditions, and obtain the data corresponding to each target condition in the web data to be analyzed.
In the embodiments of the present application, webpage analysis means analyzing the web data to be analyzed in a webpage. The web data to be analyzed may be all of the web data in the webpage or only part of it, and can be determined according to actual needs. The present application does not limit how the web data to be analyzed is obtained.
A webpage contains web page elements, for example the web object, web page tag, header tag, title tag, keyword tag, network address, domain name, access port, and access path. Web page elements have hierarchical composition relationships: for example, the title tag and the keyword tag compose the header tag, and the domain name, access port, and access path compose the network address. Within the hierarchical composition relationship between web page elements, the elements at the lowest level are regarded as basic web page elements. That is, a basic web page element can compose an upper-level web page element, but no other web page element composes a basic web page element.
Each basic web page element corresponds to a data screening condition, which can be described by a regular expression and/or a function implemented in a high-level programming language. A data screening condition describes a basic web page element. For example, the regular expression $1 ~ /<title>[^<]*</title>/i describes the title tag, which is a basic web page element, and the web data to be analyzed that matches this regular expression is the title tag.
In some possible implementations of the embodiments of the present application, the preset screening conditions may include data screening conditions that describe preset basic web page elements. That is, the preset screening conditions may be the data screening conditions corresponding to all basic web page elements, or those corresponding to only some of them. The preset basic web page elements can be set according to actual needs, so that only certain web page elements are analyzed as required.
Matching the web data to be analyzed against the preset screening conditions may yield one or more preset screening conditions that match successfully; these matched preset screening conditions are taken as target conditions. At the same time, the data corresponding to each target condition in the web data to be analyzed can be obtained. This data may be the matched data range within the web data to be analyzed, or the specific data values in the web data to be analyzed. For example, if the web data matched by target condition 1 is the 1st to 10th bytes of the web data to be analyzed, then the data corresponding to target condition 1 is either the range from the 1st byte to the 10th byte or the specific data in those bytes.
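The two kinds of data screening condition mentioned above (a regular expression, or a function in a high-level language) and the "matched range versus specific data" distinction can be illustrated with a minimal Python sketch. The names `regex_condition` and `port_condition` and the port pattern are assumptions made for this illustration and do not come from the patent; Python string offsets are also character positions, not byte positions.

```python
import re
from typing import Callable, Optional, Tuple

Span = Tuple[int, int]                      # (start, end) offsets of the matched data
Condition = Callable[[str], Optional[Span]]

def regex_condition(pattern: str) -> Condition:
    """Build a screening condition from a regular expression (hypothetical helper)."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def check(data: str) -> Optional[Span]:
        m = compiled.search(data)
        return m.span() if m else None

    return check

def port_condition(data: str) -> Optional[Span]:
    """Function-style condition: digits after ':' that form a valid port number."""
    for m in re.finditer(r":(\d{1,5})(?=/|$)", data):
        if 1 <= int(m.group(1)) <= 65535:
            return m.span(1)
    return None

# Regex-style condition modeled on the title tag example in the text.
title_condition = regex_condition(r"<title>[^<]*</title>")

page = "<title>Home</title>"
span = title_condition(page)                # matched range
value = page[span[0]:span[1]]               # specific data inside that range
```

Returning the range rather than the text keeps both options from the description available: the caller can report the range itself or slice the data out of it.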
Step 102: Determine the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements.
As described above, basic web page elements and data screening conditions are in one-to-one correspondence, so the preset screening conditions are also in one-to-one correspondence with basic web page elements. Because the target conditions are themselves preset screening conditions, this step can determine the basic web page element corresponding to each target condition. These basic web page elements can be regarded as the basic web page elements contained in the web data to be analyzed, that is, the basic web page elements contained in the webpage.
Step 103: Take the data corresponding to any target condition in the web data to be analyzed as the data of the basic web page element corresponding to that target condition.
Each target condition corresponds to one basic web page element and also to data in the web data to be analyzed, so that data is the data of the basic web page element. This step therefore obtains the data of the basic web page element corresponding to each target condition.
Step 104: Output the basic web page element corresponding to each target condition, and the data of that basic web page element, as the webpage analysis result.
Through the above steps, the basic web page elements corresponding to the target conditions and the data of these basic web page elements are obtained and output as the webpage analysis result.
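Steps 101 to 104 can be sketched end to end as follows. This is only an illustration under stated assumptions: the condition names, the two regular expressions, and the `analyze` function are hypothetical, and the patent does not prescribe any particular data structure for the condition-to-element correspondence.

```python
import re

# Hypothetical preset screening conditions, keyed by condition name.
CONDITIONS = {
    "cond_title": re.compile(r"<title>([^<]*)</title>", re.IGNORECASE),
    "cond_domain": re.compile(r"https?://([^/:\s]+)"),
    "cond_keywords": re.compile(r'<meta name="keywords" content="([^"]*)"'),
}

# One-to-one correspondence between conditions and basic web page elements.
CONDITION_TO_ELEMENT = {
    "cond_title": "title tag",
    "cond_domain": "domain name",
    "cond_keywords": "keyword tag",
}

def analyze(page_data: str) -> dict:
    """Return {basic web page element: data} for every matching condition."""
    result = {}
    for name, pattern in CONDITIONS.items():
        m = pattern.search(page_data)          # step 101: match, keep target conditions
        if m is None:
            continue
        element = CONDITION_TO_ELEMENT[name]   # step 102: condition -> basic element
        result[element] = m.group(1)           # step 103: matched data becomes element data
    return result                              # step 104: output as the analysis result

sample = "http://example.com:8080/index.html <title>Home</title>"
analysis = analyze(sample)
```

Note that no tag nesting is parsed at any point; each condition is applied to the raw data independently, which is the simplification the description argues for.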
In summary, this embodiment matches the web data to be analyzed against preset screening conditions and takes the matching preset screening conditions as target conditions. Because each target condition corresponds to a basic web page element, the basic web page elements contained in the web data to be analyzed are obtained quickly, and the data that each target condition matches in the web data to be analyzed is the data of the corresponding basic web page element. Outputting these basic web page elements and their data as the webpage analysis result completes the analysis of the webpage without analyzing the complex, multi-layer nested tag descriptions in the webpage source code, which reduces the complexity and error rate of webpage analysis.
In some possible implementations of the embodiments of the present application, the hierarchical composition relationship between web page elements can be built in advance and represented as a tree structure.
As one example, refer to Fig. 2, which is an example diagram of the hierarchical composition relationship between web page elements provided by the present application. In this example, all of the web data in a webpage, taken as a whole, is defined as the top-level web page element; for ease of description, this web page element is denoted "webpage".
A webpage may include two parts of data, the network address and the specific content of the webpage; for ease of description, the specific content of the webpage is denoted "web object". The web page element "webpage" therefore has two next-level web page elements: network address and web object.
A network address may include three parts: the domain name, the access port, and the access path. The web page element "network address" therefore has three next-level web page elements: domain name, access port, and access path. A web object may include three parts: web page tags, embedded script code, and the external resources accessed by the webpage. The web page element "web object" therefore has three next-level web page elements: web page tag, script code, and external resource.
A web page tag may include the tag that defines the webpage header (header tag for short) and the tag that defines the webpage body (body tag for short), so the web page element "web page tag" has two next-level web page elements: header tag and body tag. Embedded script code may include variables, constants, and functions, so the web page element "script code" has three next-level web page elements: variable, constant, and function. The external resources accessed by the webpage may include files, images, video, and audio, so the web page element "external resource" has four next-level web page elements: file, image, video, and audio.
A header tag may include a title tag and a keyword tag, so the web page element "header tag" has two next-level web page elements: title tag and keyword tag.
As another example, refer to Fig. 3, which is another example diagram of the hierarchical composition relationship between web page elements provided by the present application. Unlike the example shown in Fig. 2, in this example the web page element "webpage" has four next-level web page elements: network address, web page tag, script code, and external resource.
As yet another example, refer to Fig. 4, which is another example diagram of the hierarchical composition relationship between web page elements provided by the present application. Unlike the examples shown in Fig. 2 and Fig. 3, in this example the next-level web page element of "webpage" is the web object, and the web page element "web object" has three next-level web page elements: web page tag, script code, and external resource.
The example shown in Fig. 4 shows that, in the embodiments of the present application, the hierarchical composition relationship can be established for only some of the web page elements.
The hierarchical composition relationship between web page elements can be built according to the actual situation, and the present application does not limit this. Once the hierarchical composition relationship between web page elements has been defined, the tree structure of the webpage is built according to it.
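The hierarchy described for Fig. 2 can be written down as a small tree. The sketch below uses a plain dictionary from each element to its next-level elements; this representation and the helper names `parent_map` and `basic_elements` are assumptions for illustration, and the element names are English renderings of those in the description.

```python
# Hierarchical composition relationship described for Fig. 2:
# each key is a web page element, each value lists its next-level elements.
CHILDREN = {
    "webpage": ["network address", "web object"],
    "network address": ["domain name", "access port", "access path"],
    "web object": ["web page tag", "script code", "external resource"],
    "web page tag": ["header tag", "body tag"],
    "script code": ["variable", "constant", "function"],
    "external resource": ["file", "image", "video", "audio"],
    "header tag": ["title tag", "keyword tag"],
}

def parent_map(children: dict) -> dict:
    """Invert the tree: map each element to its upper-level element."""
    return {child: parent for parent, kids in children.items() for child in kids}

def basic_elements(children: dict) -> list:
    """Basic web page elements are the leaves: elements with no next-level elements."""
    return sorted(c for kids in children.values() for c in kids if c not in children)

PARENT_OF = parent_map(CHILDREN)
LEAVES = basic_elements(CHILDREN)
```

The variants in Fig. 3 and Fig. 4 would only change the entries of `CHILDREN`; the helpers stay the same.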
Based on the hierarchical composition relationship between web page elements described above, refer to Fig. 5, which shows another webpage analysis method embodiment provided in the embodiments of the present application. Building on the webpage analysis method above, this embodiment can also obtain the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition and output them as the webpage analysis result. This embodiment may include the following steps:
Step 501: Determine, according to the hierarchical composition relationship between web page elements, the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
In webpage analysis, some scenarios also need to determine which upper-level web page elements the basic web page elements contained in the webpage have. These upper-level elements include not only the web page element that the basic web page element directly composes, that is, its upper-level web page element, but also the upper-level web page element of that upper-level web page element, and so on.
Therefore, in some possible implementations of the embodiments of the present application, step 501 may be implemented as follows: according to the hierarchical composition relationship between web page elements, take the upper-level web page element of the basic web page element corresponding to the target condition as a result web page element, then repeatedly take the upper-level web page element of the current result web page element as a further result web page element until the result web page element is the top-level web page element, and take all of the result web page elements as the ancestor web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
That is, web page elements are looked up layer by layer: starting from the basic web page element corresponding to the target condition, the upper-level web page element is looked up step by step until the top-level web page element is reached. The basic web page elements corresponding to the target conditions can be regarded as the basic web page elements contained in the webpage. For example, if they include the domain name, access port, access path, and title tag, the ancestor web page elements are looked up separately for each of them: the ancestor web page elements of the domain name are the network address and the webpage, the ancestor web page elements of the access port and the access path are likewise the network address and the webpage, and the ancestor web page elements of the title tag are the header tag, web page tag, web object, and webpage.
Step 502: Output, as the web page analysis result, the upper-layer web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
In this embodiment, using the hierarchical composition relationship between web page elements established in advance, a layer-by-layer search quickly yields the upper-layer web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition, and these are output as the web page analysis result. In this way, the web page analysis also reveals the upper-layer web page elements of the basic web page elements contained in the web page.
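The layer-by-layer search of steps 501 and 502 can be sketched as follows. This is only an illustrative sketch: the `UPPER_LEVEL` table, the function name, and the symbol `html_tag` for the web page tag are assumptions made for the example, and each element is assumed to have a single immediate upper-level element.

```python
# Hypothetical table: element -> its immediate upper-level element.
# The hierarchy follows the example in the text; the names are assumptions.
UPPER_LEVEL = {
    "domain_name": "html_url",
    "access_port": "html_url",
    "url_path": "html_url",
    "html_url": "HTML_TOP",
    "html_title": "html_head",
    "html_head": "html_tag",
    "html_tag": "html_object",
    "html_object": "HTML_TOP",
}


def upper_layer_elements(basic_element):
    """Walk upward one level at a time until the top-level element is reached."""
    result = []
    current = basic_element
    while current in UPPER_LEVEL:  # the top-level element has no entry
        current = UPPER_LEVEL[current]
        result.append(current)
    return result


print(upper_layer_elements("domain_name"))
print(upper_layer_elements("html_title"))
```

Storing the relationship as a child-to-upper-level table keeps each lookup step constant-time, which matches the idea of finding upper-layer elements without re-examining the nested tags of the page.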
In the embodiments of the present application, once the data of the basic web page elements corresponding to the target conditions has been obtained, the data of other web page elements can also be obtained according to the hierarchical composition relationship between web page elements, provided that a condition is met. The condition may be that every next-level web page element of a parent web page element is a web page element whose data has been obtained, where the parent web page element is the immediate upper-level web page element of web page elements whose data has been obtained.
For example, suppose the embodiment of Fig. 1 determines that the basic web page elements contained in the web page are the domain name, access port, and access path, and obtains the data of these basic web page elements; these basic web page elements are then web page elements whose data has been obtained. According to the hierarchical composition relationship between web page elements, the immediate upper-level web page element of the domain name, access port, and access path is the web address, i.e., the web address is the upper-level web page element of web page elements whose data has been obtained. As a parent web page element, the web address satisfies the condition that each of its next-level web page elements has obtained data, so the data of the web address can be generated from the data of the domain name, access port, and access path. The web address can then also be treated as a web page element whose data has been obtained.
Referring to Fig. 6, in order to obtain the data of those upper-layer web page elements of the basic web page elements that meet the condition, the embodiments of the present application further provide another embodiment of the web page analysis method, which may include the following steps:
Step 601: Take the basic web page element corresponding to the target condition as a web page element whose data has been obtained.
In the above embodiments, the data of the basic web page element corresponding to the target condition has already been obtained, so that basic web page element can serve as a web page element whose data has been obtained.
Step 602: According to the hierarchical composition relationship between web page elements, when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generate the data of the parent web page element from the data of its next-level web page elements, where the parent web page element is the immediate upper-level web page element of web page elements whose data has been obtained.
In this embodiment, web page elements whose data has been obtained may be input one by one. According to the hierarchical composition relationship between web page elements, it is checked whether the elements input so far can form an upper-level web page element. If not, further web page elements whose data has been obtained continue to be input until the input elements can form an upper-level web page element. At that point it is detected that every next-level web page element of the parent web page element has obtained data; the parent web page element can be understood as the immediate upper-level web page element of the web page elements whose data has been obtained.
The data of the parent web page element can then be generated from the data of each of its next-level web page elements. For example, the data of the web address can be generated from the data of the domain name, access port, and access path, and the data of the web page can be generated from the data of the web address and the web page object.
Step 603: Take the parent web page element as a web page element whose data has been obtained.
Step 604: Detect whether the parent web page element is the top-level web page element; if so, go to step 605; if not, return to step 602.
That is, after the data of the parent web page element is obtained, the parent web page element can itself serve as a web page element whose data has been obtained. Step 602 is then repeated: whenever every next-level web page element of a parent web page element is detected to have obtained data, the data of that parent web page element is generated from the data of its next-level web page elements, until the parent web page element is the top-level web page element.
Step 605: Output the data of each parent web page element as the web page analysis result.
In this embodiment, the data of upper-layer web page elements can be generated level by level from the data of the basic web page elements contained in the web page. Using the hierarchical composition relationship between web page elements allows the web page to be analyzed quickly, avoids analyzing the multi-layer nested tags in the web page structure, and ensures that web page analysis is flexible and efficient.
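Steps 601 to 605 can be illustrated with the following minimal sketch. The `NEXT_LEVEL` table and the function name are hypothetical, and the way a parent's data is built (a dictionary of its next-level elements' data) is an assumption, since the text does not specify how the data is combined.

```python
# Hypothetical table: parent element -> its next-level elements
# (symbol names follow the grammar examples in this document).
NEXT_LEVEL = {
    "html_url": ["domain_name", "access_port", "url_path"],
    "HTML_TOP": ["html_url", "html_object"],
}
TOP = "HTML_TOP"


def generate_parent_data(basic_data):
    """Return the data of every parent element that can be generated bottom-up."""
    obtained = dict(basic_data)  # step 601: basic elements already have data
    parents = {}
    progress = True
    while progress and TOP not in parents:  # step 604: stop at the top level
        progress = False
        for parent, children in NEXT_LEVEL.items():
            if parent in obtained:
                continue
            # Step 602: every next-level element must already have data.
            if all(child in obtained for child in children):
                # Assumption: parent data is the collection of child data.
                data = {child: obtained[child] for child in children}
                obtained[parent] = data  # step 603: parent now has data
                parents[parent] = data
                progress = True
    return parents  # step 605: data of each parent element


result = generate_parent_data({
    "domain_name": "example.com",
    "access_port": "80",
    "url_path": "/index.html",
})
print(result)
```

With only the three web address components available, the sketch generates `html_url` but not `HTML_TOP`, because the web page object has no data yet.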
In addition, the web page analysis process may encounter exceptions in some cases, so the embodiments of the present application also provide a way to skip an abnormal data segment and continue the web page analysis. Referring to Fig. 7, the embodiments of the present application further provide an embodiment of the web page analysis method that, on the basis of the embodiment corresponding to Fig. 6, may also include the following steps:
Step 701: Match the web page data to be analyzed against the boundary filtering conditions, take the boundary filtering conditions that match the web page data to be analyzed as boundary conditions, and obtain the data corresponding to each boundary condition in the web page data to be analyzed.
Different levels of web page elements are separated by definite end boundaries, and a boundary filtering condition can describe the end boundary of the web page elements of a given layer. For example, the next-level web page elements of the web address are the domain name, access port, and access path. These three web page elements belong to the same level and share one end boundary. That end boundary can be described by a boundary filtering condition, which corresponds to the upper-level web page element of that layer; here, the boundary filtering condition corresponds to the web address. Boundary filtering conditions can be described with regular expressions.
By matching the web page data to be analyzed against each boundary filtering condition, some boundary filtering conditions may be matched successfully; the boundary filtering conditions that match the web page data to be analyzed can be used as boundary conditions. At the same time, the data corresponding to each boundary condition in the web page data to be analyzed can be obtained. This data may be a data matching range within the web page data to be analyzed, or a specific data value in the web page data to be analyzed. The data corresponding to a boundary condition can be the data corresponding to the web page elements of that layer, and can therefore be regarded as the data of the web page element corresponding to the boundary condition.
Step 702: Determine the web page element corresponding to the boundary condition according to the correspondence between boundary filtering conditions and web page elements.
As described above, web page elements and boundary filtering conditions have a one-to-one correspondence, so boundary conditions also correspond one-to-one with web page elements, and the web page element corresponding to a boundary condition can be determined.
Step 703: Take the data corresponding to any boundary condition in the web page data to be analyzed as the data of the web page element corresponding to that boundary condition.
The data corresponding to a boundary condition in the web page data to be analyzed can thus be used as the data of the web page element corresponding to that boundary condition. For example, if the web page element corresponding to the boundary condition is the web address, then the data corresponding to the boundary condition in the web page data to be analyzed is the data of the web address. That data can be a data matching range within the web page data to be analyzed, or a specific data value in it. For example, the data of the web address may be the data matching range from the 30th byte to the 40th byte of the web page data to be analyzed, or the specific data contained in the 30th to 40th bytes.
In this embodiment, the data of the web page element corresponding to a boundary condition can be obtained directly. That is, even if the data of the next-level web page elements of that web page element is missing, the data of the web page element corresponding to the boundary condition can still be obtained. For example, according to the above embodiments, the data of the web address is generated from the data of the domain name, access port, and access path; if the data of any of these web page elements is missing, the data of the web address cannot be generated. To skip such web page element analysis exceptions, the boundary filtering condition can be used to jump out of the analysis of that layer of web page elements and directly obtain the analysis result of the web page element one layer up. For example, if the data of any one or more of the domain name, access port, and access path is missing, the data of the web address can still be obtained directly once the boundary filtering condition corresponding to the web address is detected.
Step 704: Take the web page element corresponding to the boundary condition as a web page element whose data has been obtained, and go to step 602.
The web page element corresponding to the boundary condition can be taken as a web page element whose data has been obtained, and the following is then repeated: whenever every next-level web page element of a parent web page element is detected to have obtained data, the data of that parent web page element is generated from the data of its next-level web page elements, until the parent web page element is the top-level web page element.
For a description of step 602, refer to the above embodiment; details are not repeated here.
In this embodiment, the hierarchical composition relationship between web page elements is used to determine the boundary of each layer of web page elements, so that exceptions are detected in time during the analysis of each web page element and abnormal data segments are skipped. Even when the data of some web page elements is missing, the boundary filtering conditions allow the analysis of other web page elements to continue so that a web page analysis result is still obtained.
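As a rough illustration of steps 701 to 703, the sketch below matches one boundary filtering condition and returns its data matching range. The `BOUNDARY_CONDITIONS` table and the function name are hypothetical; the only boundary pattern used is the `</html>` expression that this document gives for the web page object, and Python's `re` module stands in for whatever matching engine an implementation would use.

```python
import re

# Hypothetical table: web page element -> its boundary filtering condition.
BOUNDARY_CONDITIONS = {
    "html_object": re.compile(r"</html>", re.IGNORECASE),
}


def match_boundaries(page_data):
    """Return {element: (start, end)} for every boundary condition that matches."""
    hits = {}
    for element, pattern in BOUNDARY_CONDITIONS.items():
        match = pattern.search(page_data)
        if match:
            # The data matching range serves as the element's data (step 703).
            hits[element] = match.span()
    return hits


# The title tag is deliberately left unclosed to simulate missing lower-level data.
page = "<html><head><title>Demo</head></HTML>"
print(match_boundaries(page))
```

Even though the lower-level title element in `page` is malformed, the boundary condition still identifies `html_object`, so the element can be marked as having data and step 602 can continue from there.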
In practical applications, grammar rules can be used to define the hierarchical composition relationship between web page elements and the correspondence between data screening conditions and basic web page elements. Compiling the grammar rules generates a lexical analyzer and a syntax analyzer, which together carry out the web page analysis method provided by the embodiments of the present application. The following example illustrates how the method can be implemented in practice.
Grammar rules can be expressed as productions. The general form of a production is:
vn: v1(p1) ... vk(pk); or, vn: v1 ... vk;
Here, ":" is the reduction symbol, which divides the production into two parts. The left part of the production is a non-terminal symbol vn, and the right part contains one or more symbols v1, ..., vk. The symbols on the right part may carry data screening conditions p1, ..., pk. A non-terminal symbol is a symbol that can be subdivided further, and a terminal symbol is a symbol that cannot be subdivided. The semantics of a production are: the left-part symbol of the production is obtained by reducing the symbols of the right part.
In the grammar rules of the embodiments of the present application, web page elements can be abstracted as symbols, referred to as web page element symbols; different web page elements are abstracted as different web page element symbols.
For example, the web page can be abstracted as the web page element symbol HTML_TOP, the web address as the web page element symbol html_url, the web page object as the web page element symbol html_object, and so on.
Grammar rules can define the hierarchical composition relationship between web page elements. For example, for the hierarchical composition relationship between web page elements shown in Fig. 2, the grammar rule that forms the top-level web page element may be:
HTML_TOP: html_url html_object;
Here, the web page element symbol html_url represents the web address and the web page element symbol html_object represents the web page object; combining the two generates the web page element symbol HTML_TOP of the web page.
The grammar rule that forms the web address may be:
html_url: domain_name access_port url_path;
Here, the web page element symbol domain_name represents the domain name, the web page element symbol access_port represents the access port, and the web page element symbol url_path represents the access path; combining the three generates the web page element symbol html_url of the web address.
The above is only an exemplary description of grammar rules that define the hierarchical composition relationship between web page elements. It does not exhaust the grammar rules that can define such relationships, and the embodiments of the present application do not limit the grammar rules.
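To make the production form concrete, the following sketch reads productions of the condition-free form `vn: v1 ... vk;` into a table. The function name and the in-memory representation are assumptions made for illustration; a real compiler front end would also need to handle the `v(p)` form with data screening conditions, which this sketch does not attempt.

```python
def parse_productions(text):
    """Parse 'left: sym1 sym2 ...;' rules into {left: [sym1, sym2, ...]}."""
    rules = {}
    for rule in text.split(";"):
        rule = rule.strip()
        if not rule:
            continue
        # ":" is the reduction symbol separating the left and right parts.
        left, right = rule.split(":", 1)
        rules[left.strip()] = right.split()
    return rules


GRAMMAR = """
HTML_TOP: html_url html_object;
html_url: domain_name access_port url_path;
"""

rules = parse_productions(GRAMMAR)
print(rules)
```

The resulting table is exactly the parent-to-next-level relationship that the level-by-level data generation relies on.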
Grammar rules can also define the correspondence between data screening conditions and basic web page elements, and the correspondence between boundary filtering conditions and web page elements.
For example, the grammar rule that defines the correspondence between a data screening condition and a basic web page element may be:
html_title: html_data($1 ~ /<title>[^<]*</title>/i);
Here, $1 ~ /<title>[^<]*</title>/i is the data screening condition, expressed as a regular expression. The terminal symbol html_data is the symbol of the HTML (HyperText Markup Language) input data stream, and the web page element symbol html_title represents the basic web page element "title tag". This grammar rule expresses the correspondence between the title tag and the data screening condition.
As another example, the grammar rule that defines the correspondence between a boundary filtering condition and a web page element may be:
html_object_boundary: html_data($1 ~ /</html>/i);
This grammar rule indicates that the boundary filtering condition $1 ~ /</html>/i corresponds to the web page object.
The above is only an exemplary description of grammar rules that define the correspondence between data screening conditions and basic web page elements, and of grammar rules that define the correspondence between boundary filtering conditions and web page elements. It does not exhaust such grammar rules, and the embodiments of the present application do not limit them either.
In addition, web page element loading rules can be used to determine the preset basic web page elements and thereby the preset screening conditions. For example, a web page element loading rule can specify that only the title tag is analyzed.
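The interplay between data screening conditions and loading rules can be sketched as follows. The `SCREENING_CONDITIONS` table, the `analyze` function, and the representation of a loading rule as a list of element symbols are assumptions for illustration; the two regular expressions are the title and keywords patterns used as examples in this document.

```python
import re

# Hypothetical table: basic web page element -> its data screening condition.
SCREENING_CONDITIONS = {
    "html_title": re.compile(r"<title>[^<]*</title>", re.IGNORECASE),
    "html_keywords": re.compile(r'<meta name="keywords"[^>]*>', re.IGNORECASE),
}


def analyze(page_data, loaded_elements):
    """Apply only the screening conditions selected by the loading rule."""
    results = {}
    for element in loaded_elements:
        match = SCREENING_CONDITIONS[element].search(page_data)
        if match:
            results[element] = match.group(0)  # data of the basic element
    return results


page = '<head><title>Demo</title><meta name="keywords" content="a,b"></head>'
# Loading rule (assumed form): analyze the title tag only.
print(analyze(page, ["html_title"]))
```

All conditions stay defined in one place, and the loading rule merely selects which of them become preset screening conditions for a given analysis run.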
Referring to Fig. 8, based on the grammar rules defined above, the web page analysis method disclosed in the embodiments of the present application can be implemented on the basis of a compiler. Specifically, the grammar rules can be compiled into a lexical analyzer and a syntax analyzer for each layer of web page elements; the lexical analyzer and the syntax analyzers then analyze the web page data to be analyzed and output the analysis result.
The compilation process of the lexical analyzer may include:
First, the productions in the grammar rules are analyzed and the lexical elements are extracted. A lexical element is the right part of a production, for example: "html_data($1 ~ /<title>[^<]*</title>/i)".
Lexical elements extracted from grammar rules that define the correspondence between boundary filtering conditions and web page elements have the higher priority and are given, for example, the high-priority label L1.
Lexical elements extracted from grammar rules that define the correspondence between data screening conditions and basic web page elements have the lower priority and are given, for example, the low-priority label L0.
The regular expressions in the extracted lexical elements are collected and used to build a finite automaton, yielding a lexical analyzer that integrates the finite automaton. In other words, the lexical analyzer that integrates this finite automaton is the lexical analyzer compiled from the grammar rules. The finite automaton may be a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA).
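The compilation of the lexical analyzer might look roughly like the sketch below. It is not the automaton construction itself: Python's `re` module (a backtracking matcher) stands in for the DFA or NFA, and the `LEXICAL_ELEMENTS` list, the `build_lexer` function, and the token tuple layout are assumptions made for the example.

```python
import re

# (priority label, regular expression) pairs: L1 = boundary filtering condition,
# L0 = data screening condition. The patterns come from this document's examples.
LEXICAL_ELEMENTS = [
    ("L0", r"<title>[^<]*</title>"),
    ("L0", r'<meta name="keywords"[^>]*>'),
    ("L1", r"</html>"),
]


def build_lexer(elements):
    """Compile every regular expression once and return a matching function."""
    compiled = [(label, src, re.compile(src, re.IGNORECASE)) for label, src in elements]

    def lex(page_data):
        tokens = []
        for label, src, pattern in compiled:
            for match in pattern.finditer(page_data):
                tokens.append((match.start(), match.end(), label, src))
        # Order the hits by their data matching range.
        return sorted(tokens)

    return lex


lex = build_lexer(LEXICAL_ELEMENTS)
page = '<html><head><title>Demo</title><meta name="keywords" content="a"></head></html>'
for token in lex(page):
    print(token)
```

Each token carries its priority label, so a later stage can tell a boundary hit (L1) from an ordinary data screening hit (L0).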
The compilation process of the syntax analyzer may include:
The syntax analyzer can be generated from the hierarchical composition relationship between web page elements defined by the grammar rules. The generated syntax analyzer includes a pushdown automaton that analyzes grammar states. The pushdown automaton includes a controller, an automaton state stack, a web page element symbol stack, an automaton state transition table (GOTO table), and an action table (ACTION table), together with an input and an output. The input is the symbol sequence provided by the lexical analyzer, ordered by the data matching range of each hit. The automaton state stack and the web page element symbol stack hold the grammar states, which are generated by the pushdown automaton; the initial state is S0. The ACTION table covers the case where a symbol is input in the current state without jumping to a new state: the action is obtained directly by table lookup and can only be a reduction. The GOTO table covers the case where a symbol is input in the current state and the automaton jumps to a new state, after which the action is looked up; that action can be a reduction, a shift, or an accept.
The operation of the lexical analyzer and the syntax analyzer is explained below with an example. The process may include:
Step 1: Input the web page data to be analyzed:
The web page data to be analyzed, carried by the terminal symbol html_data, is input into the lexical analyzer.
Step 2: Lexical analyzer parsing:
The web page data to be analyzed is matched against the regular expressions in the lexical analyzer. The following processing is performed according to the priority label of each regular expression that is hit:
When the priority label of the regular expression is L1: return the lexical element that includes the regular expression, together with the data matching range.
Matching a regular expression whose priority label is L1 means that a boundary filtering condition has been matched. The returned lexical element corresponds to a boundary condition, and the returned data matching range can be regarded as the data corresponding to the boundary condition in the web page data to be analyzed.
When the priority label of the regular expression is L0: return the lexical element that includes the regular expression, together with the data matching range. Matching a regular expression whose priority label is L0 means that a preset screening condition has been matched. The returned lexical element corresponds to a target condition, and the returned data matching range can be regarded as the data corresponding to the target condition in the web page data to be analyzed.
The lexical analyzer outputs a sequence of lexical elements: html_data(x1), html_data(x2), html_data(x3), html_data(x4), .... One lexical element is taken from the sequence and submitted as the input symbol to the syntax analyzer of the base layer.
Step 3: Syntax analyzer parsing:
On entering the syntax analyzer of the current layer, the input symbol is fed into the automaton of the syntax analyzer, and it is determined whether the input symbol includes a regular expression with priority L1. If it does, the syntactic analysis of the current level ends, the web page element symbol corresponding to the input symbol is output, and step 4 is performed. For example, the input symbol html_data($1 ~ /</html>/i) includes a boundary filtering condition, i.e., a regular expression with priority L1, so the web page element symbol html_object corresponding to this input symbol is output and step 4 is performed.
If the input symbol does not include a regular expression with priority L1, it is determined whether the input symbol includes a terminal symbol. If so, the ACTION table is queried; if not, the GOTO table is queried. The action obtained from the lookup is then performed. If the action is a reduction and the reduction generates a web page element symbol of the level above the current one, step 4 is performed. Otherwise, for example when the automaton jumps to the next state and obtains a new input symbol: if the new input symbol needs to be analyzed by a lower-layer syntax analyzer, the symbol is output from the syntax analyzer of this layer and step 5 is performed; if not, analysis continues with the new input symbol until the data analysis is complete.
The operation of the syntax analyzer is briefly described below with an example.
Initially, the state stack of the syntax analyzer holds S0. The input symbol html_data($1 ~ /<title>[^<]*</title>/i) arrives from the lexical analyzer. This input symbol does not include a regular expression with priority L1, and html_data is a terminal symbol, so the ACTION table is queried. Each ACTION table entry records the following items: "current state, acceptable input symbol, action taken, newly generated symbol". The lookup shows that in state S0, with the acceptable input symbol html_data($1 ~ /<title>[^<]*</title>/i), the action taken is "reduce" and the generated web page element symbol is html_title (this step can be regarded as determining the basic web page element corresponding to the target condition and obtaining the data of that basic web page element). html_title becomes the next input symbol of the syntax analyzer of this layer.
Because html_title is a non-terminal symbol, the GOTO table is queried. Each GOTO table entry records the following items: "current state, acceptable input symbol, new state to jump to, action taken". The lookup shows that in state S0, with input symbol html_title, the automaton jumps to the new state S1 and the action taken is "shift", so the next new input symbol is obtained from the lexical analyzer.
The next input symbol is "html_data($1 ~ /<meta name="keywords"[^>]*>/i)". Querying the ACTION table yields the web page element symbol html_keywords, which becomes the new input symbol. Because html_keywords is a non-terminal symbol, the GOTO table is queried: the current state S1 jumps to the new state S2, and the action taken is "reduce", which generates the new web page element symbol html_head. Generating html_head includes generating the data of the web page element head from the data of the web page element title and the data of the web page element keywords, i.e., this step can be regarded as generating the data of a parent web page element from the data of each of its next-level web page elements. The web page element symbol html_head generated by the reduction becomes the next input symbol.
The input symbol html_head is a non-terminal symbol, so the GOTO table is queried: state S2 jumps to the new state S3, and the action found is "accept", indicating that html_head is the target symbol of the syntax analyzer of the current level. The syntax analyzer of the current level therefore ends, the web page element symbol html_head becomes the input symbol of the syntax analyzer one layer up, and step 4 is performed.
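The walk-through above can be replayed with a small table-driven sketch. It mirrors the narrated state sequence S0 to S3 only: it is not a full LR parser (nothing is popped from the stacks on reduction), and the token names `title_pattern` and `keywords_pattern`, the table layout, and the function name are assumptions made for illustration.

```python
from collections import deque

# ACTION table: (state, terminal token) -> symbol generated by the reduction.
ACTION = {
    ("S0", "title_pattern"): "html_title",
    ("S1", "keywords_pattern"): "html_keywords",
}
# GOTO table: (state, non-terminal) -> (new state, action, symbol generated if reducing).
GOTO = {
    ("S0", "html_title"): ("S1", "shift", None),
    ("S1", "html_keywords"): ("S2", "reduce", "html_head"),
    ("S2", "html_head"): ("S3", "accept", None),
}


def run_layer(tokens):
    """Replay one layer's analysis; return the accepted symbol and the state stack."""
    state_stack = ["S0"]
    symbol_stack = []
    pending = deque(tokens)  # symbol sequence from the lexical analyzer
    symbol = pending.popleft()
    is_terminal = True
    while True:
        if is_terminal:
            # Terminal input: ACTION lookup, which can only be a reduction.
            symbol = ACTION[(state_stack[-1], symbol)]
            is_terminal = False
            continue
        new_state, action, produced = GOTO[(state_stack[-1], symbol)]
        state_stack.append(new_state)
        symbol_stack.append(symbol)
        if action == "accept":
            return symbol, state_stack
        if action == "reduce":
            symbol = produced  # the reduced symbol becomes the next input
        else:  # shift: fetch the next symbol from the lexical analyzer
            symbol = pending.popleft()
            is_terminal = True


accepted, states = run_layer(["title_pattern", "keywords_pattern"])
print(accepted, states)
```

The accepted symbol `html_head` is what would be handed to the syntax analyzer one layer up in step 4.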
Step 4: Switch out to the upper-layer syntax analyzer for parsing:
Switch out to the syntax analyzer of the level to which the web page element symbol output by the syntax analyzer in step 3 belongs, and start performing step 3 with that web page element symbol as the input symbol. Switching out to the upper-layer syntax analyzer means, for example, switching from the syntax analyzer of the level containing the web page elements keywords and title to the syntax analyzer of the level containing the web page element head.
Step 5: Switch into the lower-layer syntax analyzer for parsing:
Switch into the syntax analyzer of the level to which the symbol output by the syntax analyzer in step 3 belongs, and start performing step 3 with that symbol as the input symbol. Switching into the lower-layer syntax analyzer means, for example, switching from the syntax analyzer of the level containing the web page element head to the syntax analyzer of the level containing the web page elements keywords and title.
The embodiments of the present application perform web page analysis with a compiler. Regular expressions allow web page elements to be described flexibly, and simple, clear grammar rules describe the hierarchical composition relationship between web page elements. The complex implementation of web page analysis is handed over to an automatically generated lexical analyzer and syntax analyzer: the lexical analyzer matches the regular expressions, and the syntax analyzer uses a pushdown automaton to determine the web page elements corresponding to the regular expressions and to analyze the relationships between web page elements. This ensures that web page analysis is flexible and efficient.
The hierarchical composition relationship between web page elements clearly defines the abstraction and ownership relationships between web page elements, which avoids complicated description and analysis of the nested multi-layer tags in the web page structure.
Meanwhile, because boundary filtering conditions are determined, an exception that occurs while analyzing any layer of the web page can be detected in time and the abnormal data segment skipped, which limits the influence of abnormal data to the current depth-first syntactic analysis.
In addition, setting the preset screening conditions through web page element loading rules achieves the effect of defining the grammar rules once and using them for analysis on demand.
Referring to Fig. 9, the embodiments of the present application further provide an embodiment of a web page analysis device, which may include:
a first matching unit 901, configured to match the web page data to be analyzed against the preset screening conditions, take the preset screening conditions that match the web page data to be analyzed as target conditions, and obtain the data corresponding to each target condition in the web page data to be analyzed;
a first determination unit 902, configured to determine the basic web page element corresponding to each target condition according to the correspondence between preset screening conditions and basic web page elements;
a second determination unit 903, configured to take the data corresponding to any target condition in the web page data to be analyzed as the data of the basic web page element corresponding to that target condition;
an output unit 904, configured to output, as the web page analysis result, the basic web page element corresponding to each target condition and the data of that basic web page element.
In some possible implementations of the embodiments of the present application, the device may further include:
a third determination unit, configured to determine, according to the hierarchical composition relationship between web page elements, the upper-layer web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition;
the output unit is further configured to output, as the web page analysis result, the upper-layer web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
In some possible implementations of the embodiments of the present application, the third determination unit is specifically configured to:
according to the hierarchical composition relationship between web page elements, take the immediate upper-level web page element of the basic web page element corresponding to the target condition as a result web page element, then take the immediate upper-level web page element of each result web page element as a result web page element, until the result web page element is the top-level web page element; and take all result web page elements as the upper-layer web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
In some possible implementations of the embodiments of the present application, the device may further include:
a fourth determination unit, configured to take the basic web page element corresponding to the target condition as a web page element whose data has been obtained;
a generation unit, configured to, according to the hierarchical composition relationship between web page elements, when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generate the data of the parent web page element from the data of its next-level web page elements, where the parent web page element is the immediate upper-level web page element of web page elements whose data has been obtained;
a first trigger unit, configured to take the parent web page element as a web page element whose data has been obtained and trigger the generation unit to repeat the following until the parent web page element is the top-level web page element: when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generate the data of the parent web page element from the data of its next-level web page elements;
the output unit is further configured to output the data of each parent web page element as the web page analysis result.
In some possible implementations of the embodiments of the present application, the device may further include:
a second matching unit, configured to match the web page data to be analyzed against the boundary filtering conditions, take the boundary filtering conditions that match the web page data to be analyzed as boundary conditions, and obtain the data corresponding to each boundary condition in the web page data to be analyzed;
a fifth determination unit, configured to determine the web page element corresponding to each boundary condition according to the correspondence between boundary filtering conditions and web page elements;
a sixth determination unit, configured to take the data corresponding to any boundary condition in the web page data to be analyzed as the data of the web page element corresponding to that boundary condition;
a second trigger unit, configured to take the web page element corresponding to the boundary condition as a web page element whose data has been obtained and trigger the generation unit to repeat the following until the parent web page element is the top-level web page element: when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generate the data of the parent web page element from the data of its next-level web page elements.
In the embodiment of the present application in some possible realization methods, default screening conditions can include preset for describing The data screening condition of basic web page element.
In this way, the embodiment of the present application is matched by being analysed to web data with default screening conditions, obtaining can be with treating The matched default screening conditions of web data are analyzed as goal condition, which has corresponding close with basic web page element System, so as to be quickly obtained possessed basis web page element in web data to be analyzed, meanwhile, in web data to be analyzed With goal condition corresponding data in web data to be analyzed in the matching process of default screening conditions, can also be obtained, by Basic web page element is corresponding with, therefore the corresponding data of goal condition are that goal condition is corresponding with basic webpage in goal condition The data of element, so as to obtain the corresponding basic web page element of goal condition and the corresponding basic web page element of the goal condition Data exported as web page analysis result, the analysis to webpage is realized, in this process without to multilayer in web page source code The nested complicated description of label is analyzed, so as to reduce the complexity of web page analysis and error rate.
The embodiments of the present application also provide a computer-readable storage medium in which instructions are stored; when the instructions are run on a terminal device, the terminal device is caused to perform the web page analysis method provided by any of the above embodiments.
The embodiments of the present application also provide a computer program product; when the computer program product is run on a terminal device, the terminal device is caused to perform the web page analysis method provided by any of the above embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the system or device disclosed in an embodiment corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A web page analysis method, characterized in that the method comprises:
matching web page data to be analyzed against preset screening conditions, obtaining the preset screening conditions that match the web page data to be analyzed as target conditions, and obtaining the data corresponding to each target condition in the web page data to be analyzed;
determining the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements;
taking the data corresponding to any target condition in the web page data to be analyzed as the data of the basic web page element corresponding to that target condition;
outputting the basic web page element corresponding to each target condition and the data of the basic web page element corresponding to that target condition as a web page analysis result.
2. The method according to claim 1, characterized in that the method further comprises:
determining, according to the hierarchical composition relationship between web page elements, the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition;
outputting the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition as a web page analysis result.
3. The method according to claim 2, characterized in that determining, according to the hierarchical composition relationship between web page elements, the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition comprises:
according to the hierarchical composition relationship between web page elements, taking the upper-level web page element of the basic web page element corresponding to the target condition as a result web page element, then taking the upper-level web page element of the result web page element as a result web page element, until the result web page element is the top-level web page element, and taking all the result web page elements as the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition.
4. The method according to claim 1, characterized in that the method further comprises:
taking the basic web page element corresponding to the target condition as a web page element whose data has been obtained;
according to the hierarchical composition relationship between web page elements, when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generating the data of the parent web page element from the data of the next-level web page elements of the parent web page element, the parent web page element being the upper-level web page element of the web page element whose data has been obtained;
taking the parent web page element as a web page element whose data has been obtained, and repeating the step of, when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generating the data of the parent web page element from the data of the next-level web page elements of the parent web page element, until the parent web page element is the top-level web page element;
outputting the data of each parent web page element as a web page analysis result.
5. The method according to claim 4, characterized in that the method further comprises:
matching the web page data to be analyzed against boundary screening conditions, obtaining the boundary screening conditions that match the web page data to be analyzed as boundary conditions, and obtaining the data corresponding to each boundary condition in the web page data to be analyzed;
determining the web page element corresponding to each boundary condition according to the correspondence between the boundary screening conditions and web page elements;
taking the data corresponding to any boundary condition in the web page data to be analyzed as the data of the web page element corresponding to that boundary condition;
taking the web page element corresponding to the boundary condition as a web page element whose data has been obtained, and repeating the step of, when it is detected that every next-level web page element of a parent web page element is a web page element whose data has been obtained, generating the data of the parent web page element from the data of the next-level web page elements of the parent web page element, until the parent web page element is the top-level web page element.
6. The method according to claim 1, characterized in that the preset screening conditions include preset data screening conditions for describing basic web page elements.
7. A web page analysis device, characterized in that the device comprises:
a first matching unit, configured to match web page data to be analyzed against preset screening conditions, obtain the preset screening conditions that match the web page data to be analyzed as target conditions, and obtain the data corresponding to each target condition in the web page data to be analyzed;
a first determination unit, configured to determine the basic web page element corresponding to each target condition according to the correspondence between the preset screening conditions and basic web page elements;
a second determination unit, configured to take the data corresponding to any target condition in the web page data to be analyzed as the data of the basic web page element corresponding to that target condition;
an output unit, configured to output the basic web page element corresponding to each target condition and the data of the basic web page element corresponding to that target condition as a web page analysis result.
8. The device according to claim 7, characterized in that the device further comprises:
a third determination unit, configured to determine, according to the hierarchical composition relationship between web page elements, the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition;
the output unit being further configured to output the parent web page elements that have a hierarchical composition relationship with the basic web page element corresponding to the target condition as a web page analysis result.
9. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the web page analysis method according to any one of claims 1-6.
10. A computer program product, characterized in that when the computer program product is run on a terminal device, the terminal device is caused to perform the web page analysis method according to any one of claims 1-6.
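As an informal illustration of the upward traversal recited in claims 2 and 3, the sketch below collects every upper-level element of a basic web page element until the top-level element is reached. It assumes the hierarchical composition relationship is available as a tree-shaped mapping from each element to its upper-level element; the names `collect_parent_elements` and `parent_of` are hypothetical and are not part of the claims.

```python
def collect_parent_elements(element, parent_of):
    """Collect the upper-level elements of `element` up to the top-level one.

    parent_of: dict mapping each element to its upper-level element; the
    top-level element has no entry (assumed representation, must be a tree).
    """
    result_elements = []
    current = element
    while current in parent_of:  # stop once the top-level element is reached
        current = parent_of[current]  # move one level up
        result_elements.append(current)  # each step yields a result element
    return result_elements


# Illustrative hierarchy: title -> header -> page (top level).
parent_of = {"title": "header", "date": "header", "header": "page"}
print(collect_parent_elements("title", parent_of))
```

For the illustrative hierarchy above, the call returns the elements in bottom-up order, ending with the top-level element.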
CN201711481065.2A 2017-12-29 2017-12-29 Webpage analysis method and device, storage medium and program product Active CN108196874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711481065.2A CN108196874B (en) 2017-12-29 2017-12-29 Webpage analysis method and device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711481065.2A CN108196874B (en) 2017-12-29 2017-12-29 Webpage analysis method and device, storage medium and program product

Publications (2)

Publication Number Publication Date
CN108196874A true CN108196874A (en) 2018-06-22
CN108196874B CN108196874B (en) 2021-03-16

Family

ID=62586766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711481065.2A Active CN108196874B (en) 2017-12-29 2017-12-29 Webpage analysis method and device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN108196874B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949916A (en) * 2020-08-20 2020-11-17 深信服科技股份有限公司 Webpage analysis method, device, equipment and storage medium
CN112148957A (en) * 2019-06-26 2020-12-29 北京百度网讯科技有限公司 Webpage access data analysis method, device and equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070094246A1 (en) * 2005-10-25 2007-04-26 International Business Machines Corporation System and method for searching dates efficiently in a collection of web documents
CN101719124A (en) * 2008-10-09 2010-06-02 李晶心 System of infinite layering multi-path acquisition based on regular matching
CN103440315A (en) * 2013-08-27 2013-12-11 北京工业大学 Web page cleaning method based on theme
CN104199096A (en) * 2014-09-12 2014-12-10 吉林大学 Extraction method and device of horizons of seismic data cube
CN105095525A (en) * 2015-09-28 2015-11-25 北京奇虎科技有限公司 Method and device for acquiring web page data
CN106599246A (en) * 2016-12-20 2017-04-26 维沃移动通信有限公司 Display content interception method, mobile terminal and control server

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148957A (en) * 2019-06-26 2020-12-29 北京百度网讯科技有限公司 Webpage access data analysis method, device and equipment and readable storage medium
CN111949916A (en) * 2020-08-20 2020-11-17 深信服科技股份有限公司 Webpage analysis method, device, equipment and storage medium
CN111949916B (en) * 2020-08-20 2024-04-09 深信服科技股份有限公司 Webpage analysis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN108196874B (en) 2021-03-16

Similar Documents

Publication Publication Date Title
Laender et al. A brief survey of web data extraction tools
RU2610241C2 (en) Method and system for text synthesis based on information extracted as rdf-graph using templates
Chen et al. Improving automated documentation to code traceability by combining retrieval techniques
US11269601B2 (en) Internet-based machine programming
US20130268263A1 (en) Method for processing natural language and mathematical formula and apparatus therefor
US10789302B2 (en) Method and system for extracting user-specific content
JP4659946B2 (en) How to generate a wrapper grammar
CN105122208A (en) Source program analysis system, source program analysis method, and recording medium on which program is recorded
CN108563561B (en) Program implicit constraint extraction method and system
CN106649557A (en) Semantic association mining method for defect report and mail list
CN108196874A (en) A kind of webpage analysis method, device and storage medium, program product
CN106156035B (en) A kind of generic text method for digging and system
CN106657075B (en) Multi-layer protocol analytic method, device and data matching method and device
CN108694192A (en) The judgment method and device of type of webpage
CN104408198B (en) The acquisition methods and device of Webpage content
Coulter et al. An evolutionary perspective of software engineering research through co-word analysis
US10325000B2 (en) System for automatically generating wrapper for entire websites
KR100910895B1 (en) Automatic system and method for examining content of law amendent and for enacting or amending law
Kubis A query language for WordNet-like lexical databases
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
Kiomourtzis et al. NOMAD: Linguistic Resources and Tools Aimed at Policy Formulation and Validation.
Song et al. Data extraction and annotation for dynamic web pages
US20180052917A1 (en) Computer-implemented methods and systems for categorization and analysis of documents and records
CN118550546B (en) Python project dependent conflict detection and resolution method and device
Rahman et al. Pattern analysis of TXL programs

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant