CN114817811B

CN114817811B - Website analysis method and device

Info

Publication number: CN114817811B
Application number: CN202210494646.4A
Authority: CN
Inventors: 薛秋雨; 柳超
Original assignee: Yancheng Tianyanchawei Technology Co ltd
Current assignee: Yancheng Tianyanchawei Technology Co ltd
Priority date: 2022-05-07
Filing date: 2022-05-07
Publication date: 2024-03-19
Anticipated expiration: 2042-05-07
Also published as: CN114817811A

Abstract

The invention discloses a website analysis method and a website analysis device, wherein the method comprises the following steps: analyzing the document structure of a main page of a target website to obtain a page tag set of the main page, and determining at least one link page according to the page tag set of the main page; acquiring a page tag set of at least one link page, and determining a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page; acquiring all link pages associated with a main page of the target website and a tag path of each link page based on page link rules applicable to the target website; and generating a list block structure of each link page step by step based on the label path, and generating a webpage data structure of the target website according to the list block structure of each link page.

Description

Website analysis method and device

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a website parsing method and apparatus, a computer readable storage medium, an electronic device, and a computer program product.

Background

Currently, when data collection is performed, because of different web page hierarchies of websites, a series of rules designed in advance need to be used for analyzing each website, such as a link rule, a title rule and a page turning rule. When the number of websites is large, developers need to consume more effort and time to design rules.

In this case, there is a need to automatically parse links, titles, and page turning links and rules for web site list pages.

Disclosure of Invention

In view of this, the present invention proposes a website parsing method and apparatus, a computer readable storage medium, an electronic device, and a computer program product, which aim to automatically parse links, titles, and page turning links in a website list page by processing the document structure of the page. The technical scheme can improve the parsing precision and the collection efficiency of the data, and greatly reduces the time spent on manual work.

According to an aspect of an embodiment of the present invention, there is provided a website parsing method, including:

analyzing the document structure of a main page of a target website to obtain a page tag set of the main page, and determining at least one link page according to the page tag set of the main page;

Acquiring a page tag set of at least one link page, and determining a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page;

acquiring all link pages associated with a main page of the target website and a tag path of each link page based on page link rules applicable to the target website; and

generating a list block structure of each link page step by step based on the tag path, and generating a web page data structure of the target website according to the list block structure of each link page.

According to another aspect of an embodiment of the present invention, there is provided a website parsing apparatus, including:

the analyzing unit is used for analyzing the document structure of the main page of the target website to obtain a page tag set of the main page, and determining at least one link page according to the page tag set of the main page;

the determining unit is used for acquiring a page tag set of at least one link page, and determining a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page;

An obtaining unit, configured to obtain all link pages associated with a main page of the target website and a tag path of each link page based on a page link rule applicable to the target website; and

the generating unit is used for generating a list block structure of each link page step by step based on the tag path, and generating a web page data structure of the target website according to the list block structure of each link page.

According to a further aspect of embodiments of the present invention, there is provided a computer readable storage medium storing a computer program for performing the method according to any one of the embodiments described above.

According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the embodiments.

According to a further aspect of embodiments of the present invention, there is provided a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to perform a method for implementing any of the embodiments described above.

According to the website analysis method and device, the computer readable storage medium, the electronic equipment and the computer program product provided by the embodiment of the invention, on one hand, the acquisition time required by manually positioning the webpage is greatly reduced, the information acquisition efficiency is improved, and on the other hand, when the website structure is changed and upgraded, the acquisition rule is not required to be manually changed, the labor cost of secondary development is greatly reduced, so that data acquisition personnel have more energy to perform data value mining, and the efficiency and the output value are improved.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing embodiments of the present invention in more detail with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this specification. Together with the embodiments of the invention, they illustrate the invention and do not constitute a limitation of the invention. In the drawings, like reference numerals generally refer to like parts or steps.

FIG. 1 is a flowchart of a website parsing method according to an exemplary embodiment of the present invention;

FIG. 2 is a schematic diagram of a relationship between a target probability and a header length in a single page website structure according to an exemplary embodiment of the present invention;

FIG. 3 is a schematic diagram of a data list structure provided by an exemplary embodiment of the present invention;

FIG. 4 is a flowchart of a website parsing method according to another exemplary embodiment of the present invention;

FIG. 5 is a schematic diagram of a website resolution apparatus according to an exemplary embodiment of the present invention;

FIG. 6 shows the structure of an electronic device provided in an exemplary embodiment of the present invention.

Detailed Description

Hereinafter, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present invention and not all embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein.

It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.

It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present invention are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.

It should also be understood that in embodiments of the present invention, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.

It should also be appreciated that any component, data, or structure referred to in an embodiment of the invention may be generally understood as one or more without explicit limitation or the contrary in the context.

In addition, the term "and/or" in the present invention merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In the present invention, the character "/" generally indicates that the objects before and after it are in an "or" relationship.

It should also be understood that the description of the embodiments of the present invention emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.

Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.

The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations with electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.

Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.

Exemplary method

When data is crawled at large scale, writing crawlers for web page lists takes considerable effort, and as the number of crawlers grows, so does the developers' maintenance cost. The website parsing method provided by the embodiment of the invention can automatically collect data based on web page links, thereby reducing the difficulty of manual development. Even if a website changes, no manual maintenance is needed, which greatly reduces labor cost and addresses problems such as low crawler development efficiency and high maintenance cost. The method in the embodiment of the invention improves data parsing precision and collection efficiency, and greatly reduces manual development time.

Fig. 1 is a flowchart of a website parsing method according to an exemplary embodiment of the present invention. The embodiment can be applied to an electronic device, as shown in fig. 1, and includes the following steps:

Step 101, parsing the document structure of the main page of the target website to obtain a page tag set of the main page, and determining at least one link page according to the page tag set of the main page. In order to acquire data of a website, it is necessary to determine the web page hierarchy of the website (e.g., the rule of the page-turning or next-page anchor between a web page at one level and the web page at the next level) and to automatically acquire data of the website according to that hierarchy. In the present invention, the main page may be the home page of the target website or another page (e.g., a link page on the home page may be used as the main page).

In one embodiment, parsing a document structure of a main page of a target website to obtain a page tag set of the main page includes: acquiring a network address of a main page of a target website; acquiring a webpage source code of a main page based on a network address of the main page of a target website; constructing a document structure of a main page based on a webpage source code, wherein the document structure is a Document Object Model (DOM) tree; and analyzing the document structure of the main page of the target website to obtain all tags of the main page, and determining a page tag set of the main page according to all the tags.

Preferably, a target website is selected from a plurality of websites through user input or system setting, and a network address of the target website is obtained. In general, the network address of the target website may be considered as the network address of the main page of the target website. The target website is accessed based on the network address of its main page, and the web page source code of the main page is acquired through that access. The web page source code is code written in a markup language. Such code typically includes a plurality of tags that have a certain structural relationship to one another. Accordingly, the document structure of the main page is constructed based on the web page source code; for example, the hierarchical structure of the code is determined from the plurality of tags in the web page source code, and the document structure of the main page is constructed from that hierarchical structure.

Preferably, the document structure is a document object model DOM tree. It should be appreciated that the document structure may be any reasonable representation structure. Since a document structure such as a document object model DOM tree represents all tags of pages and a hierarchical structure by a predetermined structure, parsing the document structure of a main page of a target website can acquire all tags of the main page. And forming all tags in the document structure into a page tag set of the main page. It should be noted that, the document structure of the document object model DOM tree is adopted to obtain all the tags of the main page, and the page tag set of the main page is determined according to all the tags, so that the accuracy of obtaining all the tags of the main page can be improved, and the obtaining difficulty is reduced.
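The tag collection described above can be illustrated with a minimal sketch. It assumes Python's standard html.parser module in place of a full DOM implementation; the names TagCollector and page_tag_set are hypothetical and do not come from the specification.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Records every element together with its depth in the document tree."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open elements
        self.tags = []   # (depth, tag name, attribute dict) per element

    def handle_starttag(self, tag, attrs):
        self.tags.append((len(self.stack), tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def page_tag_set(html_source):
    """Return all tags of a page in document order (hypothetical helper)."""
    collector = TagCollector()
    collector.feed(html_source)
    collector.close()
    return collector.tags
```

Each entry keeps the depth of the tag, so the hierarchical relationship between tags remains available to later steps.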

In one embodiment, determining at least one link page from a page tag set of a main page includes: querying whether a page link anchor exists in the page tag set of the main page according to a pre-acquired anchor rule for determining the link page; when a page link anchor is found, determining at least one link page based on the page link anchor. Since the page tag set of the main page includes various types of tags, e.g., page-turning or next-page tags, title tags, etc., a search can be performed in the page tag set of the main page to determine whether a page link anchor (e.g., a page-turning or next-page tag) exists. Typically, the markup language presets anchor rules for link pages, such as which tags are used to represent the link to a page-turning or next page. The page link anchor may include anchor text and a page link. For this purpose, the page tag set of the main page is searched for page link anchors according to the pre-acquired anchor rule for determining the link page.

In particular, the anchor rule for determining a link page can be understood as locating a page link anchor (also referred to as a page-turning anchor) among a plurality of tags. In order to accurately locate page-turning anchors, the anchor texts of various page-turning or next-page links can be counted or enumerated in advance, so that the different types of page-turning or next-page anchor text can be derived. For example, about 30 enumerated page-turning or next-page anchor texts (next page, last page, etc.) are traversed, and the page-turning anchor is located through the DOM tree.

According to one embodiment of the present invention, the anchor texts of various page-turning or next-page links may be pre-stored. Then, the DOM tree is searched for a page-turning or next-page anchor text, that is, for a tag that contains such anchor text (which may also be referred to as a page-turning tag). After the page-turning tag is found, a page link, such as a uniform resource locator (URL), may be obtained. There are two cases for obtaining the page link from the page-turning tag that was found: in the first case, the URL is written directly in the page-turning tag, and by looking up the href keyword, the URL of the next page is found in or after the value of href; in the second case, the page-turning link is produced through JavaScript, and by looking up the onclick keyword, the URL of the next page may be found in or after the value of onclick. In this way, the present embodiment can locate not only the next page from the first page but also each subsequent page from the page before it.
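The two lookup cases (href and onclick) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the anchor text list NEXT_PAGE_TEXTS is a short hypothetical stand-in for the roughly 30 enumerated texts, and the onclick case simply takes the first quoted string, which is an assumption about how the URL is embedded.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately short list of next-page anchor texts.
NEXT_PAGE_TEXTS = {"next", "next page", "下一页", "下页"}


def extract_url(attrs):
    """Return the link carried by an <a> tag, or None if none is found."""
    href = attrs.get("href") or ""
    # Case 1: the URL is written directly in href.
    if href and href != "#" and not href.lower().startswith("javascript:"):
        return href
    # Case 2: the URL is passed to JavaScript; take the first quoted string
    # in onclick (an assumption made for this sketch).
    match = re.search(r"""['"]([^'"]+)['"]""", attrs.get("onclick") or "")
    return match.group(1) if match else None


class NextPageFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None  # attributes of the <a> tag being read
        self.text = []       # text fragments inside that tag
        self.links = []      # next-page links found so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current = dict(attrs)
            self.text = []

    def handle_data(self, data):
        if self.current is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self.current is None:
            return
        label = "".join(self.text).strip().lower()
        if label in NEXT_PAGE_TEXTS:
            url = extract_url(self.current)
            if url:
                self.links.append(url)
        self.current = None


def find_next_page_links(html_source):
    finder = NextPageFinder()
    finder.feed(html_source)
    finder.close()
    return finder.links
```

Exact matching on normalized anchor text keeps the sketch short; a production crawler would likely need fuzzier matching and resolution of relative URLs.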

In one embodiment, when no page link anchor is found, the method of the embodiment of the invention further comprises the following steps: acquiring a plurality of page link rules, wherein the page link rules are determined based on statistical processing of a large number of page link anchors; and determining at least one link page of the main page based on the plurality of page link rules. Each link page is an accessible page, and the hash value of each link page is different from the hash value of the main page. For example, when at least one link page of the main page is being determined, an access request or a download request is sent to the candidate link page. If the data returned by the access request or the download request is accessible content (for example, not a 404 error) and the hash value of the candidate page is different from the hash value of the current page, the candidate page is determined to be the next page, and its link is determined to be the page link of the next page.
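The accessibility and hash check can be sketched as below. The choice of SHA-256, the treatment of any status code of 400 or above as inaccessible, and the fetch callback are assumptions made for illustration; the specification only requires that the page be accessible and that its hash differ from that of the current page.

```python
import hashlib


def content_hash(body: bytes) -> str:
    # SHA-256 is an arbitrary choice; the text only requires "a hash value".
    return hashlib.sha256(body).hexdigest()


def is_next_page(status_code, body, main_page_body):
    """True if the candidate is accessible and differs from the main page."""
    if status_code >= 400 or not body:
        return False
    return content_hash(body) != content_hash(main_page_body)


def first_valid_candidate(candidates, fetch, main_page_body):
    """Try candidate URLs in order.

    fetch is a hypothetical callback returning (status_code, body_bytes).
    """
    for url in candidates:
        status, body = fetch(url)
        if is_next_page(status, body, main_page_body):
            return url
    return None
```

Passing the download function in as a callback keeps the check itself free of network code, so it can be exercised with canned responses.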

In one embodiment, after parsing the document structure of the main page of the target website to obtain the page tag set of the main page, the method further comprises: determining whether at least one link page exists in the main page according to a page tag set of the main page; when the main page is determined to have no link page, determining a plurality of titles of the main page and a title level of each title according to the page tag set; and generating a webpage data structure of the target website based on the titles and the title levels of the main page. For example, in the parsing method, when the first page of the target website does not need to be turned or the target website has only one page of data, algorithm deduction can be performed on each title in the list page set of the main page.

From the above, in the present invention, when analyzing the website, it is necessary to determine whether the target website adopts a hierarchical web structure. When a target website adopts a hierarchical webpage structure, the method acquires a page tag set of at least one link page, and determines a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page; acquiring all link pages associated with a main page of the target website and a tag path of each link page based on page link rules applicable to the target website; and generating a list block structure of each link page step by step based on the label path, and generating a webpage data structure of the target website according to the list block structure of each link page. When the target website does not adopt the hierarchical webpage structure, for example, the first page of the target website does not need to be turned or the target website has data of only one page, determining whether at least one link page exists in the main page according to the page tag set of the main page; when the main page is determined to have no link page, determining a plurality of titles of the main page and a title level of each title according to the page tag set; and generating a webpage data structure of the target website based on the titles and the title levels of the main page.

In general, a valid title tends to be longer, as shown in FIG. 2. FIG. 2 is a schematic diagram of a relationship between a target probability and a title length in a single page website structure according to an exemplary embodiment of the present invention. The probability that a title is a valid target is positively correlated with the title length. However, the title length has two thresholds, a lower limit and an upper limit. The title length may be the number of characters included in the title. Taking a lower limit of 5 and an upper limit of 105 as an example, when the title length of a target a tag is greater than 105 or less than 5, the weight of the list set containing it is reduced. The set with the highest weight in the final calculation is determined to be the list set. The same calculation is also applicable when more than one list block is output.
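A minimal sketch of such a length-based weighting is given below. Only the thresholds 5 and 105 come from the text; the scoring formula (the average title length, with a fixed hypothetical PENALTY for out-of-range titles) is an assumption chosen to keep the weight positively correlated with length.

```python
MIN_LEN, MAX_LEN = 5, 105  # thresholds taken from the example in the text
PENALTY = 10               # hypothetical down-weighting amount


def title_weight(title):
    """In-range titles score their length; out-of-range titles are penalized."""
    n = len(title.strip())
    if n < MIN_LEN or n > MAX_LEN:
        return -PENALTY
    return n


def set_weight(titles):
    """Average weight of a candidate list set."""
    if not titles:
        return float("-inf")
    return sum(title_weight(t) for t in titles) / len(titles)


def pick_list_set(candidate_sets):
    """candidate_sets maps a set name to its titles; return the best name."""
    return max(candidate_sets, key=lambda name: set_weight(candidate_sets[name]))
```

With this scoring, a navigation bar made of short labels ranks below a block of article titles.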

Step 102, obtaining a page tag set of at least one link page, and determining a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page. That is, the page turning parameter of the page-turning or next page is determined by analyzing the differences between the page tag set of the at least one link page and the page tag set of the main page.

In one embodiment, determining the page link rule applicable to the target website specifically includes: comparing the page tag set of each of the at least one link page with the page tag set of the main page to determine the number of dynamic parameters that differ; when the number of dynamic parameters is 1, determining that the dynamic parameter is the page turning parameter; when the number of dynamic parameters is greater than 1, acquiring the dynamic parameters of the main page and the dynamic parameters of the link pages of at least two levels below the main page, and performing enumeration verification on each dynamic parameter, so that the dynamic parameter whose value changes is determined to be the page turning parameter; and determining the page link rule applicable to the target website based on the page turning parameter and the value of the page turning parameter. For example, after linking to the second page from the first page, the dynamic parameters of the second page are obtained and compared with the dynamic parameters of the first page (i.e., a diff). If the difference contains only one dynamic parameter, that parameter is the page turning parameter of the page-turning or next page. If the difference contains a plurality of dynamic parameters, the dynamic parameters of the main page, of the next page of the main page, and of the page after that are acquired, that is, the dynamic parameters of the link pages of at least two levels below the main page. Enumeration verification is performed on each dynamic parameter, and the dynamic parameter whose value changes is determined to be the page turning parameter. For example, the dynamic parameter of the first page is page, and the dynamic parameters of the next page are page and index. In this case, determining from these two pages alone whether the page turning parameter is page or index may lead to a misjudgment. For this purpose, the dynamic parameters of the page after the next page are also obtained; when they are likewise page and index, the parameter whose value has changed is determined to be the page turning parameter. In addition, to further improve accuracy, the dynamic parameters of link pages at more levels may be checked.
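The comparison can be illustrated with a short sketch. It assumes that the dynamic parameters are URL query parameters, which the text does not state explicitly; the helper names and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def query_params(url):
    """Query parameters of a URL as a dict (assumed to be the dynamic parameters)."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def changed_params(url_a, url_b):
    """Names of the parameters whose value differs between two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


def page_turn_param(main_url, next_urls):
    """next_urls holds the URLs of successive pages below the main page."""
    candidates = changed_params(main_url, next_urls[0])
    if len(candidates) == 1:
        return next(iter(candidates))
    # Enumeration check: keep only parameters that keep changing between
    # every later pair of consecutive pages.
    chain = [main_url] + list(next_urls)
    for prev, cur in zip(chain[1:], chain[2:]):
        candidates &= changed_params(prev, cur)
    return next(iter(candidates)) if len(candidates) == 1 else None
```

When only two pages are available and several parameters differ, the function returns None rather than guessing, which corresponds to the misjudgment risk mentioned above.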

In one embodiment, because the value of the dynamic parameter of a page-turning or next page is generally a number or a numeric hash value, a dynamic parameter satisfying this condition is the page turning parameter of the page-turning or next page. Finally, all links of the whole website are obtained by combining the values of the page turning parameter.
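As a rough illustration of combining page turning parameter values into links, the sketch below substitutes successive numbers into a query parameter. It assumes a numeric parameter and a known last page number, neither of which is guaranteed by the text; page_urls and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(template_url, param, first, last):
    """Build one URL per page by substituting numbers into the given parameter."""
    parts = urlsplit(template_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for n in range(first, last + 1):
        # Replace only the page turning parameter; keep the others unchanged.
        new_query = [(k, str(n) if k == param else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls
```

In practice the last page is usually unknown in advance, so a crawler would stop when a generated link fails the accessibility or hash check instead of using a fixed upper bound.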

Step 103, based on a page link rule applicable to the target website, acquiring all link pages associated with the main page of the target website and a tag path of each link page, specifically including: determining a page link anchor point of the target website based on page link rules applicable to the target website; accessing step by step based on the page link anchor points to acquire all link pages of the target website which are associated with the main page; and acquiring the webpage source code of each link page, and determining the label path of each link page based on the webpage source code.

Specifically, the page link rule applicable to the target website indicates the manner in which the target website uses the page link anchor, and therefore, the page link anchor of the target website can be determined based on the page link rule applicable to the target website. As described above, the plurality of pages of the target website constitute a hierarchical structure through the page link anchor, and for this purpose, step-by-step access is performed from the main page based on the page link anchor to acquire all the link pages of the target website associated with the main page.

It should be appreciated that the main page and all of the link pages associated with the main page may also form a tree structure. The web page source code of each link page is acquired, and the tag path of each link page is determined based on the web page source code. As above, the DOM tree of each link page may be constructed by parsing the web page source code of that link page. The DOM tree includes all tags of the link page, and a tag path of the link page may be constructed based on the DOM tree and all tags (e.g., /html[1]/body[1]/div[2]/ul[3]/li[1]/a[2]). Thus, a tag path for each link page can be determined based on the web page source code.
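A sketch of how indexed tag paths for a tags might be derived is shown below. It uses Python's html.parser as a simplified stand-in for a DOM tree; TagPathBuilder and anchor_tag_paths are hypothetical names, and the handling of malformed markup is deliberately minimal.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathBuilder(HTMLParser):
    """Builds an indexed path such as /html[1]/body[1]/a[2] for every <a>."""

    def __init__(self):
        super().__init__()
        self.stack = []       # [(tag, index)] for the currently open elements
        self.counters = [{}]  # child-tag counters; last entry = innermost element
        self.anchor_paths = []

    def handle_starttag(self, tag, attrs):
        siblings = self.counters[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        path = self.stack + [(tag, siblings[tag])]
        if tag == "a":
            self.anchor_paths.append("".join(f"/{t}[{i}]" for t, i in path))
        if tag not in VOID_TAGS:
            self.stack = path
            self.counters.append({})

    def handle_endtag(self, tag):
        if not any(t == tag for t, _ in self.stack):
            return  # stray closing tag
        while self.stack:
            closed, _ = self.stack.pop()
            self.counters.pop()
            if closed == tag:
                break


def anchor_tag_paths(html_source):
    builder = TagPathBuilder()
    builder.feed(html_source)
    builder.close()
    return builder.anchor_paths
```

The index in each segment counts same-named siblings under the same parent, matching the bracketed numbers in the example path above.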

Step 104, generating a list block structure of each link page step by step based on the label path, and generating a webpage data structure of the target website according to the list block structure of each link page. The list block structure of each link page is generated step by step based on the label path, and the list block structure comprises the following steps: constructing a label path set by the label path of each link page; based on matching the common part in the label paths, carrying out path aggregation on the label paths in the label path set; and determining the minimum common parent node of the label paths subjected to path aggregation as a list block, thereby generating a list block structure of each link page.

For example, all a tags in the DOM tree of the link page are traversed and their XML paths are marked; these are converted into tag paths (in a format similar to /html[1]/body[1]/div[2]/ul[3]/li[1]/a[2]). FIG. 3 is a schematic diagram of a data list structure according to an exemplary embodiment of the present invention. As shown in FIG. 3, the data list structure is mainly divided into three types: 1) list block 1, in which all a tags are at the same level; 2) list block 2, which includes a plurality of tags, where under each tag the parent node of a given a tag has a plurality of similar sibling nodes, and for this type of list block it is verified that the given a tag has no more than three a tags at its own level under its parent node; and 3) list block 3, which consists of a plurality of small target lists, for example, a plurality of tags with a plurality of a tags under each tag.

In one embodiment, performing path aggregation on the tag paths in a tag path set based on matching the common part of the tag paths comprises: when all page link anchors in the tag paths are at the same level, performing common prefix matching on the common part of the tag paths, and performing prefix path aggregation on the tag paths in the tag path set; when the parent node of a page link anchor in the tag path has a plurality of sibling nodes at the same level, performing common suffix matching on the common part of the tag paths, and performing suffix path aggregation on the tag paths in the tag path set; when a plurality of lists exist in the tag paths, merging the tag paths, performing common prefix matching, and performing prefix path aggregation on the tag paths in the tag path set. For example, the common part of the tag paths /html[1]/body[1]/div[2]/ul[3]/li[1]/a[1] and /html[1]/body[1]/div[2]/ul[3]/li[1]/a[2] is /html[1]/body[1]/div[2]/ul[3]/li[1], which can be considered a common prefix.
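Common prefix and common suffix matching on tag paths can be sketched segment by segment, as below. The function names are hypothetical; comparing whole path segments rather than characters is a design choice of this sketch, so that a[1] and a[12] are never treated as sharing a prefix.

```python
def split_path(path):
    """Split a tag path into its segments, ignoring empty pieces."""
    return [seg for seg in path.split("/") if seg]


def common_prefix(paths):
    """Longest run of leading segments shared by all paths."""
    prefix = []
    for parts in zip(*(split_path(p) for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/" + "/".join(prefix)


def common_suffix(paths):
    """Longest run of trailing segments shared by all paths."""
    suffix = []
    for parts in zip(*(split_path(p)[::-1] for p in paths)):
        if len(set(parts)) != 1:
            break
        suffix.append(parts[0])
    return "/".join(reversed(suffix))
```

For the two paths in the example above, common_prefix returns /html[1]/body[1]/div[2]/ul[3]/li[1].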

The tag path list is aggregated by extracting the common prefix and/or common suffix from the tag path set generated above. The smallest common parent node of all target tags (i.e., the longest common prefix of the XML paths of a list block) is one list block, so that the entire set of target a tags of the link page is divided into multiple list block sets. The aggregation schemes are as follows: for list block 1 shown in FIG. 3, common prefix aggregation is used; for list block 2, common suffix aggregation is used; for list block 3, merging is performed first and then common prefix aggregation is used. If the number of target a tags in a single group after aggregation is less than 3, the aggregation is considered to have failed, and another aggregation scheme is tried. For example, it is determined whether the XML paths of the a tags under a parent node are identical; if they are identical, the tags are aggregated, and if they differ, the next aggregation is attempted or the aggregation scheme is switched. When the number of target a tags in a single group is less than 3 under all aggregation schemes, the target a tags in that group are not aggregated, but are presented or provided individually.
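The fallback between aggregation schemes can be sketched roughly as follows. This is a simplified illustration: it covers only the prefix scheme (list block 1) and a suffix-style scheme (list block 2), implements the latter by dropping the index of the parent node, and omits the merge step for list block 3. Only the threshold of 3 comes from the text; the rest is an assumption.

```python
import re
from collections import defaultdict

MIN_GROUP = 3  # groups with fewer a tags count as a failed aggregation
INDEX = re.compile(r"\[\d+\]$")


def parent(path):
    return path.rsplit("/", 1)[0]


def suffix_key(path):
    """Drop the parent's index so sibling parents collapse into one key,
    e.g. .../ul[2]/li[3]/a[1] -> .../ul[2]/li/a[1]."""
    parts = path.split("/")
    if len(parts) >= 3:
        parts[-2] = INDEX.sub("", parts[-2])
    return "/".join(parts)


def group_by(paths, key):
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return groups


def aggregate(paths):
    """Return (list blocks, a tag paths left unaggregated)."""
    blocks, leftovers = {}, []
    # Scheme 1: a tags directly under the same parent (prefix aggregation).
    for key, members in group_by(paths, parent).items():
        if len(members) >= MIN_GROUP:
            blocks[key] = members
        else:
            leftovers.extend(members)
    # Scheme 2: a tags under similar sibling parents (suffix aggregation).
    singles = []
    for key, members in group_by(leftovers, suffix_key).items():
        if len(members) >= MIN_GROUP:
            blocks[key] = members
        else:
            singles.extend(members)
    return blocks, singles
```

Paths that fail both schemes are returned separately, mirroring the rule that such a tags are presented individually.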

In one embodiment, generating a web page data structure of the target website from the list block structure of each link page includes: forming a page tree structure from the main page and each link page based on the hierarchical relationship between the main page and each link page; and generating the web page data structure of the target website according to the list block structure of each link page and the page tree structure. For example, for each aggregated list XML path obtained in the above process, the hash value of the corresponding list block link content is recorded, where the list block link content includes data-related information such as data links and data titles. The list block sets corresponding to all paths in the DOM tree of the next page are then extracted, and the hash values of the list block link contents at the same paths of the two pages are compared. List blocks whose hash values differ are the list blocks that need to be parsed, and their XML paths are the data path rules of the target website.
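The hash comparison between two pages can be sketched as follows. The text does not name a hash function, so SHA-256 is an assumption here, as are the names block_hash and changed_blocks and the representation of list block link content as (link, title) pairs.

```python
import hashlib
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (data link, data title); an assumed representation


def block_hash(links: List[Link]) -> str:
    """Hash the link content of one list block (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for href, title in links:
        digest.update(href.encode("utf-8") + b"\x00" + title.encode("utf-8") + b"\x01")
    return digest.hexdigest()


def changed_blocks(page1: Dict[str, List[Link]], page2: Dict[str, List[Link]]) -> List[str]:
    """Return the list block paths present on both pages whose content hashes differ."""
    shared = page1.keys() & page2.keys()
    return sorted(p for p in shared if block_hash(page1[p]) != block_hash(page2[p]))


nav = "/html[1]/body[1]/div[1]"
data = "/html[1]/body[1]/div[2]/ul[1]"
first_page = {nav: [("/about", "About")], data: [("/n/1", "News 1"), ("/n/2", "News 2")]}
next_page = {nav: [("/about", "About")], data: [("/n/3", "News 3"), ("/n/4", "News 4")]}
print(changed_blocks(first_page, next_page))
```

In this sketch the navigation block is identical on both pages and is therefore ignored, while the block whose links change from page to page is reported as the data list block.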

Fig. 4 is a flowchart of a website parsing method according to another exemplary embodiment of the present invention. As shown in fig. 4, the method starts at step 401.

Step 402, a web page link is entered or retrieved.

Step 403, downloading the web page source code and constructing the DOM tree based on the web page links.

Step 404, determining whether a page link anchor for page turning/next page exists in the DOM tree, that is, locating the page link anchor for page turning/next page. An enumerated list of nearly 30 page turning/next page anchor texts (e.g., "next page", "last page") is traversed, and the tag of the page turning/next page link is located in the DOM tree.

When the page link for page turning/next page is located, step 405 is performed; when it cannot be located, step 406 is performed.
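The anchor text lookup of step 404 can be sketched with the Python standard library as follows. The set NEXT_PAGE_TEXTS contains only a few illustrative entries (the text mentions nearly 30 but does not list them), and the names NextPageFinder and find_next_page_links are hypothetical. An empty result corresponds to the branch that continues with step 406.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Illustrative subset only; the full enumerated list is not given in the text.
NEXT_PAGE_TEXTS = {"next page", "next", "last page"}


class NextPageFinder(HTMLParser):
    """Collect (anchor text, href) pairs whose text matches a page turning phrase."""

    def __init__(self) -> None:
        super().__init__()
        self._href: Optional[str] = None
        self._text: List[str] = []
        self.matches: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text.lower() in NEXT_PAGE_TEXTS:
                self.matches.append((text, self._href))
            self._href = None


def find_next_page_links(html: str) -> List[Tuple[str, str]]:
    finder = NextPageFinder()
    finder.feed(html)
    finder.close()
    return finder.matches


page = '<html><body><a href="/about">About</a><a href="/list?page=2">Next page</a></body></html>'
print(find_next_page_links(page))
```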

Step 406, deriving the dynamic parameters of the page turning/next page link. The JavaScript in the web page source code is analyzed to produce a parameter generation rule. A download request is sent to each link obtained in this way; if the returned data is not inaccessible content (for example, a 404 error) and the hash value of the page differs from that of the first page, the link is the page link for page turning/next page.
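The acceptance test at the end of step 406 can be sketched as below. The download function is passed in as a parameter (fetch, a hypothetical callable returning a status code and a body) so that the sketch does not depend on any particular HTTP library; treating any status other than 200 as inaccessible and using SHA-256 as the page hash are both simplifying assumptions.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple

Fetch = Callable[[str], Tuple[int, bytes]]  # hypothetical: url -> (status code, body)


def page_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


def verify_candidates(first_page_body: bytes, candidates: Iterable[str], fetch: Fetch) -> List[str]:
    """Keep candidate links that are accessible and differ from the first page."""
    first_hash = page_hash(first_page_body)
    accepted: List[str] = []
    for url in candidates:
        status, body = fetch(url)
        if status != 200:
            continue  # inaccessible content, e.g. a 404 error
        if page_hash(body) == first_hash:
            continue  # same content as the first page, so not a real next page
        accepted.append(url)
    return accepted


# Stand-in for real downloads, used only to demonstrate the check.
fake_site: Dict[str, Tuple[int, bytes]] = {
    "/list?page=1": (200, b"first"),
    "/list?page=2": (200, b"second"),
    "/missing": (404, b"not found"),
}
print(verify_candidates(b"first", ["/list?page=2", "/list?page=1", "/missing"], fake_site.__getitem__))
```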

Step 405, parsing the dynamic parameters of the page turning/next page link. After the dynamic parameters of the second page are obtained, they are compared (diffed) with the dynamic parameters of the first page. If only one dynamic parameter differs, that dynamic parameter is the page turning parameter. If a plurality of dynamic parameters differ, enumeration verification is performed on them; the value of a page turning/next page parameter is generally a number or a numeric hash value, and the dynamic parameter that meets this condition is the page turning parameter. Finally, by combining the values of the page turning parameter, all page links of the entire website can be obtained.
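The parameter diff of step 405 can be sketched as follows, assuming the dynamic parameters are carried in the URL query string. The function names are illustrative, and the numeric check is only a simple stand-in for the enumeration verification described above, which is not specified in detail.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def query_params(url: str) -> Dict[str, str]:
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def changed_params(first_url: str, second_url: str) -> List[str]:
    """Names of the query parameters whose values differ between the two pages."""
    a, b = query_params(first_url), query_params(second_url)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def pick_page_param(first_url: str, second_url: str) -> Optional[str]:
    """Return the page turning parameter, or None if it cannot be decided."""
    changed = changed_params(first_url, second_url)
    if len(changed) == 1:
        return changed[0]
    # Simplified stand-in for enumeration verification: keep numeric values only.
    second = query_params(second_url)
    numeric = [k for k in changed if second.get(k, "").isdigit()]
    return numeric[0] if len(numeric) == 1 else None


def page_url(url: str, param: str, value: int) -> str:
    """Build the link of another page by substituting the page turning parameter."""
    params = query_params(url)
    params[param] = str(value)
    return urlunsplit(urlsplit(url)._replace(query=urlencode(params)))


first = "https://example.com/list?cat=5&page=1"
second = "https://example.com/list?cat=5&page=2"
param = pick_page_param(first, second)
print(param, page_url(first, param, 3))
```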

Step 407, downloading the source code of the page turning/next page and constructing DOM data.

In step 408, the rule of the target list block is obtained, that is, rule extraction is performed on the target list block to be collected. All a tags in the DOM tree of the link page are traversed and their XML paths are marked and converted into tag paths (in a format such as /html[1]/body[1]/div[2]/ul[3]/li[1]/a[2]). Fig. 3 is a schematic diagram of a data list structure according to an exemplary embodiment of the present invention. As shown in fig. 3, data list structures fall mainly into three types: 1) list block 1, in which all a tags are at the same level; 2) list block 2, which includes a plurality of tags, where under each tag the parent node of a given a tag has a plurality of similar sibling nodes, and where, for this type of list block, it is verified that the parent node of the given a tag contains no more than three a tags at the same level as the given a tag; and 3) list block 3, which consists of a plurality of small target lists, for example a plurality of tags each containing a plurality of a tags.
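The traversal of a tags and the conversion to indexed tag paths can be sketched with the standard library HTML parser as follows. The class and function names are illustrative, indices count same-named siblings starting from 1 as in the format shown above, and the handling of void elements and missing end tags is a simplification rather than a full DOM construction.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple

# Elements that never have an end tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record the indexed tag path of every a tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        # Each stack entry: (path segment, counts of child tag names seen so far).
        self._stack: List[Tuple[str, Dict[str, int]]] = []
        self._root_counts: Dict[str, int] = {}
        self.a_paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        counts = self._stack[-1][1] if self._stack else self._root_counts
        counts[tag] = counts.get(tag, 0) + 1
        segment = f"{tag}[{counts[tag]}]"
        if tag == "a":
            self.a_paths.append("/" + "/".join([s for s, _ in self._stack] + [segment]))
        if tag not in VOID_TAGS:
            self._stack.append((segment, {}))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag, tolerating missing end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0].split("[")[0] == tag:
                del self._stack[i:]
                break


def tag_paths_of_links(html: str) -> List[str]:
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.a_paths


doc = ("<html><body><div></div><div><ul><li>"
       "<a href='/1'>x</a><br><a href='/2'>y</a>"
       "</li></ul></div></body></html>")
print(tag_paths_of_links(doc))
```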

The tag path list is obtained by aggregating the tag path set generated above through common prefix and/or common suffix extraction. The smallest common parent node of all target tags in a group (i.e., the longest common prefix of the XML paths of a list block) is taken as one list block, so that the entire set of target a tags of the link page is divided into a plurality of list block sets. The aggregation schemes are as follows: for list block 1 shown in fig. 3, common prefix aggregation is used; for list block 2, common suffix aggregation is used; for list block 3, the paths are first merged and then common prefix aggregation is used. If the number of target a tags in any single group after aggregation is less than 3, the aggregation is regarded as failed and another aggregation scheme is tried. When the number of target a tags in a given single group is less than 3 under all aggregation schemes, the target a tags in that group are not aggregated but are presented or provided separately.
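The text does not spell out how common suffix aggregation is computed for list block 2. One possible reading, shown here purely as an assumption, is to ignore the index of the parent of each a tag so that paths such as .../li[1]/a[1] and .../li[2]/a[1], which share both their leading part and their trailing part, fall into one group. The names wildcard_parent_index and suffix_aggregate are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

MIN_GROUP_SIZE = 3  # same threshold as for the other aggregation schemes


def wildcard_parent_index(path: str) -> str:
    """Replace the index of the a tag's parent with '*', e.g. li[2] -> li[*]."""
    segments = path.strip("/").split("/")
    if len(segments) >= 2:
        segments[-2] = re.sub(r"\[\d+\]$", "[*]", segments[-2])
    return "/" + "/".join(segments)


def suffix_aggregate(paths: List[str]) -> Dict[str, List[str]]:
    """Group a-tag paths that differ only in the index of their parent node."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[wildcard_parent_index(path)].append(path)
    return {pattern: members for pattern, members in groups.items()
            if len(members) >= MIN_GROUP_SIZE}


items = [f"/html[1]/body[1]/ul[1]/li[{i}]/a[1]" for i in (1, 2, 3)]
stray = "/html[1]/body[1]/div[1]/a[1]"
print(suffix_aggregate(items + [stray]))
```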

For each list XML path obtained in the above process, the hash value of the corresponding list block link content is recorded, where the list block link content includes data-related information such as data links and data titles. The list block sets corresponding to all paths in the DOM tree of the next page are then extracted, and the hash values of the list block link contents at the same paths of the two pages are compared. List blocks whose hash values differ are the list blocks that need to be parsed, and their XML paths are the data path rules of the target website.

Step 409, outputting the page turning/next page rule and the list area rule.

Step 410, collecting the whole site (website) based on the page turning/next page rule and the list area rule.

Step 411, end.

Exemplary apparatus

Fig. 5 is a schematic structural diagram of a website parsing apparatus according to an exemplary embodiment of the present invention. As shown in fig. 5, the apparatus includes:

the parsing unit 501 is configured to parse the document structure of the main page of the target website to obtain a page tag set of the main page, and determine at least one link page according to the page tag set of the main page.

Preferably, the parsing unit 501 includes:

the first acquisition subunit is used for acquiring the network address of the main page of the target website;

The second acquisition subunit is used for acquiring the webpage source code of the main page based on the network address of the main page of the target website;

the construction subunit is used for constructing a document structure of the main page based on the webpage source code, wherein the document structure is a Document Object Model (DOM) tree;

the first determining subunit is used for analyzing the document structure of the main page of the target website to obtain all tags of the main page, and determining a page tag set of the main page according to all the tags.

Preferably, the parsing unit 501 further includes:

the query subunit is used for querying whether page link anchor points exist in the page tag set of the main page according to the pre-acquired anchor point rule for determining the link page;

and the second determining subunit is used for determining at least one link page based on the page link anchor point when the page link anchor point is queried.

The third determining subunit is used for acquiring a plurality of page link rules when the page link anchor points are not queried, wherein the plurality of page link rules are determined based on statistical processing of massive page link anchor points;

and a fourth determining subunit, configured to determine at least one link page of the main page based on the generation rule of the page parameter and the multiple page link rules, where each link page is an accessible page and a hash value of each link page is different from a hash value of the main page.

Preferably, the parsing unit 501 is specifically further configured to: determining whether at least one link page exists in the main page according to a page tag set of the main page; when the main page is determined to have no link page, determining a plurality of titles of the main page and a title level of each title according to the page tag set; and generating a webpage data structure of the target website based on the titles and the title levels of the main page.
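For the case in which the main page has no link page, the extraction of titles and title levels can be sketched as follows. The text does not say which tags carry the titles; this sketch assumes the HTML heading tags h1 to h6, and the names HeadingCollector and heading_outline are illustrative.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class HeadingCollector(HTMLParser):
    """Collect (level, title) pairs from h1..h6 tags (an assumed title source)."""

    def __init__(self) -> None:
        super().__init__()
        self._level: Optional[int] = None
        self._text: List[str] = []
        self.headings: List[Tuple[int, str]] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def heading_outline(html: str) -> List[Tuple[int, str]]:
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


single_page = "<h1>Company</h1><p>Intro</p><hr><h2>Products</h2><h2>Contact</h2>"
print(heading_outline(single_page))
```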

The determining unit 502 is configured to obtain a page tag set of at least one link page, and determine a page link rule applicable to the target website according to the page tag set of the at least one link page and the page tag set of the main page.

Preferably, the determining unit 502 includes:

a comparing subunit, configured to compare the page tag sets of the at least one link page with the page tag sets of the main page, respectively, so as to determine the number of different dynamic parameters; when the number of the dynamic parameters is 1, determining that the dynamic parameters are page turning parameters; when the number of the dynamic parameters is greater than 1, acquiring the dynamic parameters of the main page and the dynamic parameters of the link pages of at least two levels below the main page, and carrying out enumeration verification on each dynamic parameter, so that the dynamic parameters with the changed parameter values are determined to be page turning parameters;

And a fifth determining subunit for determining a page link rule applicable to the target website based on the page turning parameter and the page turning parameter value.

The acquiring unit 503 is configured to acquire all the link pages associated with the main page and the tag path of each link page of the target website based on the page link rule applicable to the target website.

Preferably, the acquisition unit 503 includes:

a sixth determining subunit, configured to determine a page link anchor point of the target website based on a page link rule applicable to the target website;

the fourth acquisition subunit is used for accessing step by step based on the page link anchor points so as to acquire all link pages of the target website, which are associated with the main page;

and a fifth acquisition subunit, configured to acquire a web page source code of each link page, and determine a tag path of each link page based on the web page source code.
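The step-by-step access performed by the acquisition unit can be sketched as a breadth-first traversal. The callable get_links is hypothetical and stands for downloading a page and returning the links found through its page link anchors; the max_pages limit is an added safeguard that the text does not mention.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl_step_by_step(main_url: str,
                       get_links: Callable[[str], List[str]],
                       max_pages: int = 100) -> Dict[str, int]:
    """Visit pages level by level and return each page's level below the main page."""
    levels: Dict[str, int] = {main_url: 0}
    queue = deque([main_url])
    while queue and len(levels) < max_pages:
        url = queue.popleft()
        for link in get_links(url):
            if link not in levels and len(levels) < max_pages:
                levels[link] = levels[url] + 1
                queue.append(link)
    return levels


# Stand-in link graph used only to demonstrate the traversal.
site = {"/": ["/p2", "/p3"], "/p2": ["/p3", "/p4"], "/p3": [], "/p4": ["/"]}
print(crawl_step_by_step("/", lambda url: site.get(url, [])))
```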

The generating unit 504 is configured to generate a list block structure of each link page step by step based on the tag path, and generate a web page data structure of the target website according to the list block structure of each link page.

Preferably, the generating unit 504 includes:

a forming subunit, configured to form a label path set from a label path of each link page;

An aggregation subunit, configured to perform path aggregation on the label paths in the label path set based on matching the common portion in the label paths;

and a seventh determining subunit, configured to determine the label path minimum common parent node that is subjected to path aggregation as a list block, thereby generating a list block structure of each link page.

Preferably, the aggregation subunit is specifically configured to perform common prefix matching on a common portion in the label path when all page link anchors in the label path are at the same level, and perform prefix path aggregation on the label paths in the label path set;

when a plurality of peer nodes exist in a father node of a page link anchor point in the tag path, carrying out common suffix matching on a common part in the tag path, and carrying out suffix path aggregation on the tag paths in the tag path set;

when a plurality of lists exist in the label paths, combining the label paths, performing common prefix matching, and performing prefix path aggregation on the label paths in the label path set.

Preferably, the aggregation subunit is specifically further configured to form a page tree structure from the main page and each link page based on the hierarchical relationship between the main page and each link page;

and to generate a web page data structure of the target website according to the list block structure of each link page and the page tree structure.
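One possible in-memory shape for the resulting web page data structure is sketched below: a page tree in which every node carries the list blocks of its page. The classes ListBlock and PageNode and the function build_page_tree are illustrative, and the sketch assumes the page hierarchy contains no cycles.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ListBlock:
    path: str                      # tag path of the list block
    links: List[str] = field(default_factory=list)


@dataclass
class PageNode:
    url: str
    list_blocks: List[ListBlock] = field(default_factory=list)
    children: List["PageNode"] = field(default_factory=list)


def build_page_tree(url: str,
                    children_of: Dict[str, List[str]],
                    blocks_of: Dict[str, List[ListBlock]]) -> PageNode:
    """Combine the page hierarchy and the per-page list blocks into one tree."""
    node = PageNode(url=url, list_blocks=blocks_of.get(url, []))
    for child in children_of.get(url, []):
        node.children.append(build_page_tree(child, children_of, blocks_of))
    return node


hierarchy = {"/": ["/list?page=2"], "/list?page=2": []}
blocks = {"/list?page=2": [ListBlock("/html[1]/body[1]/ul[1]", ["/n/3", "/n/4"])]}
tree = build_page_tree("/", hierarchy, blocks)
print(tree)
```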

Exemplary electronic device

Fig. 6 is a schematic structural diagram of an electronic device provided in an exemplary embodiment of the present invention. The electronic device may be either or both of a first device and a second device, or a stand-alone device independent of them, which may communicate with the first device and the second device to receive acquired input signals from them. As shown in fig. 6, the electronic device includes one or more processors 61 and a memory 62.

The processor 61 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in the electronic device to perform the desired functions.

Memory 62 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 61 to implement the website parsing method of the various embodiments of the present invention described above and/or other desired functions. In one example, the electronic device may further include an input device 63 and an output device 64, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).

In addition, the input device 63 may also include, for example, a keyboard, a mouse, and the like.

The output device 64 can output various information to the outside. The output device 64 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.

Of course, only some of the components of the electronic device that are relevant to the present invention are shown in fig. 6 for simplicity, components such as buses, input/output interfaces, etc. being omitted. In addition, the electronic device may include any other suitable components depending on the particular application.

Exemplary computer program product and computer readable storage Medium

In addition to the methods and apparatus described above, embodiments of the invention may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the website parsing method according to various embodiments of the invention described in the "exemplary methods" section of this specification.

Program code of the computer program product for performing the operations of embodiments of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present invention may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the website parsing method according to various embodiments of the present invention described in the "exemplary methods" section above in this specification.

The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The basic principles of the present invention have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present invention are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be considered as essential to the various embodiments of the present invention. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the invention is not necessarily limited to practice with the above described specific details.

In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.

The block diagrams of the devices, apparatuses, equipment, and systems referred to in the present invention are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."

The method and apparatus of the present invention may be implemented in a number of ways. For example, the methods and apparatus of the present invention may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present invention are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.

It is also noted that in the apparatus, devices and methods of the present invention, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent aspects of the present invention. The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the invention to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims

1. A website parsing method, the method comprising:

generating a list block structure of each link page step by step based on the label path, including:

constructing a label path set by the label path of each link page;

Based on matching the common part in the label paths, carrying out path aggregation on the label paths in the label path set, wherein the method comprises the following steps:

when all target labels in the label path are of the same level, carrying out common prefix matching on a common part in the label path, and carrying out prefix path aggregation on the label paths in the label path set;

when a plurality of peer nodes exist in a father node of a target label in the label path, carrying out common suffix matching on a common part in the label path, and carrying out suffix path aggregation on the label paths in the label path set;

when a plurality of lists exist in the label paths, combining the label paths, performing common prefix matching, and performing prefix path aggregation on the label paths in the label path set;

determining the minimum common parent node of the label paths subjected to path aggregation as a list block, and generating a list block structure of each link page;

generating a webpage data structure of the target website according to the list block structure of each link page, wherein the webpage data structure comprises the following steps:

forming a page tree structure by the main page and each link page based on the hierarchical relationship of the main page and each link page;

and generating a webpage data structure of the target website according to the list block structure and the page tree structure of each link page.

2. The method of claim 1, wherein parsing the document structure of the main page of the target website to obtain a set of page tags for the main page comprises:

acquiring a network address of a main page of the target website;

acquiring a webpage source code of a main page of the target website based on a network address of the main page of the target website;

constructing a document structure of the main page based on the webpage source code, wherein the document structure is a Document Object Model (DOM) tree;

and analyzing the document structure of the main page of the target website to obtain all tags of the main page, and determining a page tag set of the main page according to all the tags.

3. The method according to claim 1 or 2, wherein said determining at least one link page from the set of page tags of the main page comprises:

inquiring whether page link anchor points exist in a page tag set of the main page according to an anchor point rule which is obtained in advance and is used for determining the link page;

when a page link anchor is queried, at least one link page is determined based on the page link anchor.

4. The method of claim 3, wherein when a page link anchor is not queried, the method further comprises:

Acquiring various page link rules, wherein the various page link rules are determined based on statistical processing of massive page link anchor points;

at least one link page of the main page is determined based on a plurality of page link rules, wherein each link page is an accessible page and the hash value of each link page is different from the hash value of the main page.

5. The method of claim 1, wherein the determining the page link rule applicable to the target website based on the set of page tags for the at least one link page and the set of page tags for the main page comprises:

comparing the page tag set of the at least one link page with the page tag set of the main page respectively to determine the number of different dynamic parameters;

when the number of the dynamic parameters is 1, determining the dynamic parameters as page turning parameters;

when the number of the dynamic parameters is greater than 1, acquiring the dynamic parameters of the main page and the dynamic parameters of the link pages of at least two levels below the main page, performing enumeration verification on each dynamic parameter, and determining the dynamic parameters with changed parameter values as page turning parameters;

And determining a page link rule applicable to the target website based on the page turning parameter and the value of the page turning parameter.

6. The method of claim 1, wherein the obtaining, based on the page link rule applicable to the target website, all link pages associated with the main page of the target website and a tag path of each link page includes:

determining a page link anchor point of the target website based on a page link rule applicable to the target website;

accessing step by step based on the page link anchor points to acquire all link pages associated with a main page of the target website;

and acquiring the webpage source code of each link page, and determining the label path of each link page based on the webpage source code.

7. The method of claim 1, further comprising, after parsing a document structure of a main page of a target website to obtain a page tag set of the main page:

determining whether at least one link page exists in the main page according to a page tag set of the main page;

when the main page is determined to have no link page, determining a plurality of titles of the main page and a title level of each title according to the page tag set;

And generating a webpage data structure of the target website based on the titles and the title levels of the main page.

8. A website parsing apparatus, the apparatus comprising:

the generating unit is configured to generate a list block structure of each link page step by step based on the label path, and generate a web page data structure of the target website according to the list block structure of each link page, where the generating unit includes:

The aggregation subunit is configured to perform path aggregation on the label paths in the label path set based on matching the common portion in the label paths, and specifically is configured to:

when all page link anchor points in the label path are of the same level, common prefix matching is carried out on common parts in the label path, and prefix path aggregation is carried out on label paths in a label path set;

based on the hierarchical relationship between the main page and each link page, forming a page tree structure by the main page and each link page;

generating a webpage data structure of the target website based on the list block structure and the page tree structure according to each link page;

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for executing the method of any one of claims 1-7.

10. An electronic device, the electronic device comprising: a processor and a memory; wherein,

the memory is used for storing the processor executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.