CN107423391A

CN107423391A - Information extraction method for structured data in Web pages

Info

Publication number: CN107423391A
Application number: CN201710605031.3A
Authority: CN
Inventors: 陈星�; 张佳俊; 王洲; 王一洲
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2017-07-24
Filing date: 2017-07-24
Publication date: 2017-12-01
Anticipated expiration: 2037-07-24
Also published as: CN107423391B

Abstract

The invention discloses an information extraction method for structured data in Web pages. The web page code is first preprocessed to remove noise information. Page layout tags are then taken as nodes, and a DOM tree is constructed from the nesting and hierarchical relationships of those tags and stored in a List. The DOM tree is pruned by checking whether branches are identical, which yields a reconstructed DOM tree. Nodes are then labeled with their node paths, the reconstructed DOM trees of two web pages are compared to determine the feature path of the target object, and a corresponding wrapper is generated so that extraction is automatic. The invention can process large volumes of Web content quickly and automatically and extract the correct information.

Description

Information extraction method for structured data in Web pages

Technical field

The invention belongs to the field of network information processing, and in particular relates to an information extraction method for structured data in Web pages.

Background art

The rapid development of the Internet has brought explosive growth in information, and the Web has become a huge repository and an increasingly important, high-potential resource for global information dissemination and sharing. However, quickly and accurately finding the required information in this mass of resources and making it available to other programs has become a major problem. Information extraction techniques are therefore needed to extract structured, topic-relevant data from large amounts of semi-structured information. Because HTML pages are designed mainly for browsing rather than for manipulation and reuse, the data in them is difficult for application programs to use directly. Extracting data from web pages and passing it to application programs thus remains a complex and difficult but worthwhile task.

Summary of the invention

In view of this, the object of the invention is to provide an information extraction method for structured data in Web pages that processes large volumes of Web content quickly and automatically and extracts the correct information.

To achieve the above object, the invention adopts the following technical solution:

An information extraction method for structured data in Web pages, comprising the following steps:

a) Preprocess the HTML code of two given sample web pages that share the same structure, removing noise information;

b) For each web page obtained, take the page layout tags as nodes and, following the nesting and hierarchical relationships of the layout tags, store child nodes in order down to the innermost text nodes, which serve as leaf nodes; construct the DOM tree in this way and store it in a List;

c) Prune and reconstruct each of the two DOM trees: merge the content of the leaf nodes under identical branches into a single leaf node and delete the remaining identical branches;

d) Parse the JSON string to obtain the object information it contains, and store that information in a List dedicated to object information, with each key name stored together with its corresponding key value;

e) Label the feature paths of each of the two reconstructed DOM trees, traverse each reconstructed DOM tree in full, and check whether the content of each leaf node is identical to the object information obtained in step d); if it is, record the feature path of that leaf node;

f) Compare the feature paths found for the two sample web pages and extract those that are identical, which gives the feature path of the correct target object information; if several identical feature paths are extracted, add a new sample web page and repeat steps a) to f) until screening yields the feature path of the correct target object information;

g) Generate and reconstruct the DOM tree of a target web page that has the same structure as the sample web pages, label its feature paths, traverse the entire reconstructed target DOM tree, and compare each path with the feature path obtained for the target object information; if they are identical, output the object content at the corresponding position as the target object information.
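Steps a) and b) can be illustrated with a minimal sketch, assuming Python's standard `html.parser` as the parser. The patent does not name a parser or list the noise tags, so `NOISE_TAGS`, `VOID_TAGS`, `Node`, `DomBuilder` and `build_dom` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: which tags count as "noise"; the patent does not enumerate them.
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str                                  # tag name, or "#text" for a text leaf
    text: str = ""
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Builds a tag tree whose leaves are the innermost text nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]
        self.skip_depth = 0                   # > 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)  # children stored in document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.stack[-1].children.append(Node("#text", text))


def build_dom(html: str) -> Node:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Comments are dropped implicitly because `HTMLParser.handle_comment` does nothing unless overridden. A production system would likely also need attribute handling and more robust recovery from malformed markup, which this sketch leaves out.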

Further, step c) specifically comprises the following steps:

c1: For the DOM tree to be pruned and reconstructed, start from the root node and find the first node in the DOM tree that has more than one child node;

c2: Compare all child nodes pairwise; if the current node has 0 child nodes and the two child nodes being compared are of the same type, perform the pruning operation;

c3: If the current node's number of child nodes is not 0, recursively call the DOM tree reconstruction algorithm on its child subtree;

c4: Call a recursive function to determine whether the child subtrees are identical; if they are, call a recursive function to prune the child subtrees, finally obtaining the reconstructed DOM tree.
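One possible reading of the pruning in steps c1 to c4 is sketched below. It treats two sibling branches as "identical" when their tag structure matches regardless of text, keeps the first such branch, and folds the text of the others into it. That interpretation, and the names `Node`, `shape`, `merge_into` and `prune`, are assumptions rather than the patent's own definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    texts: list = field(default_factory=list)     # text content held by this node
    children: list = field(default_factory=list)


def shape(node: Node) -> tuple:
    """Structural signature of a subtree: tag names only, text ignored."""
    return (node.tag, tuple(shape(c) for c in node.children))


def merge_into(kept: Node, dropped: Node) -> None:
    """Fold the text of `dropped` into the same-shaped branch `kept`."""
    kept.texts.extend(dropped.texts)
    for k, d in zip(kept.children, dropped.children):
        merge_into(k, d)


def prune(node: Node) -> Node:
    """Collapse structurally identical sibling branches, bottom-up."""
    for child in node.children:
        prune(child)                               # reconstruct subtrees first (cf. c3)
    kept_by_shape = {}
    survivors = []
    for child in node.children:
        key = shape(child)
        if key in kept_by_shape:                   # identical branch already kept (cf. c2, c4)
            merge_into(kept_by_shape[key], child)
        else:
            kept_by_shape[key] = child
            survivors.append(child)
    node.children = survivors
    return node
```

For example, a `ul` with three `li` leaves holding "a", "b" and "c" is reduced to a `ul` with a single `li` whose `texts` is `["a", "b", "c"]`, which is why repeated list or table rows no longer inflate the tree.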

Further, the feature path labeling in step e) comprises feature tag path labeling and feature numeric path labeling.

Further, the feature numeric path labeling algorithm is as follows:

e1: For the reconstructed DOM tree whose feature numeric paths are to be obtained, if the current node M has a parent node (its parent count is not 0), obtain and store the feature numeric path of the parent node and append the digit corresponding to node M at the end;

e2: If the current node M has no parent node, store the digit corresponding to node M;

e3: Process all child nodes of node M in turn as follows: if the i-th child node N of node M has child nodes of its own, recursively call the feature numeric path labeling algorithm on node N; if the i-th child node N has no child nodes, obtain the feature numeric path of node M, store it in the feature numeric path of node N, and append the digit corresponding to node N at the end;

e4: Finally, the feature numeric paths of the reconstructed DOM tree are obtained.
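The labeling in steps e1 to e4 can be sketched as a single recursive pass. The patent does not say what "the digit corresponding to node M" is; this sketch assumes it is the node's zero-based index among its siblings and uses "/" as a separator. `Node`, `label_paths` and `leaves` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)
    tag_path: str = ""     # feature tag path, e.g. "html/body/div/a"
    num_path: str = ""     # feature numeric path, e.g. "0/0/0/1"


def label_paths(node: Node, parent: Optional[Node] = None, index: int = 0) -> None:
    """Give every node its parent's path plus its own tag and sibling index."""
    if parent is None:                         # e2: no parent, store own digit only
        node.tag_path = node.tag
        node.num_path = str(index)
    else:                                      # e1: parent's path, own digit appended
        node.tag_path = f"{parent.tag_path}/{node.tag}"
        node.num_path = f"{parent.num_path}/{index}"
    for i, child in enumerate(node.children):  # e3: handle each child in turn
        label_paths(child, node, i)


def leaves(node: Node) -> Iterator[Node]:
    """Yield leaf nodes in document order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)
```

With two `div` blocks under `body` that each contain an `a` leaf, both leaves receive the tag path `html/body/div/a` but different numeric paths, which illustrates why the numeric path is the stricter of the two labels.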

Further, in step g), the object is first looked up and extracted according to the feature numeric path obtained for the target object information; if no object content is extracted by the feature numeric path, the object is then looked up and extracted according to the feature tag path of the target object information.
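The two-stage lookup described here can be sketched as follows, assuming the target page has already been reduced to a list of labeled leaves. `Leaf` and `extract` are hypothetical names, and returning the first match is a simplification for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Leaf(NamedTuple):
    tag_path: str
    num_path: str
    text: str


def extract(leaves: Sequence[Leaf], num_path: str, tag_path: str) -> Optional[str]:
    """Try the stricter numeric path first, then fall back to the tag path."""
    for leaf in leaves:
        if leaf.num_path == num_path:
            return leaf.text
    for leaf in leaves:                # fallback: layout shifted, tag nesting unchanged
        if leaf.tag_path == tag_path:
            return leaf.text
    return None                        # neither path matched on this page
```

Trying the numeric path first gives a direct hit on pages whose layout matches the samples exactly, while the tag path acts as a looser fallback when sibling positions have shifted.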

Compared with the prior art, the invention has the following beneficial effects:

(1) When constructing the DOM tree, the invention uses tags as node labels, so construction takes little time, and the tree structure of the DOM tree represents the nesting and hierarchical relationships of the original page tags well;

(2) The invention prunes the DOM tree, which keeps the DOM tree as simple as possible and thereby reduces the storage resources used when reconstructing it;

(3) When locating the position of the target information, multiple sample web pages can be compared, so that the feature path of the target object information can be obtained accurately.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the information extraction method for structured data in Web pages according to the invention;

Fig. 2 is a flowchart of the DOM tree reconstruction algorithm of the invention;

Fig. 3 is a flowchart of the feature numeric path labeling algorithm of the invention;

Fig. 4 shows the feature path results for one sample web page in the embodiment of the invention;

Fig. 5 shows the feature path results for another sample web page in the embodiment of the invention;

Fig. 6 shows the feature path results for the target information in the embodiment of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and an embodiment.

As shown in Fig. 1, the invention provides an information extraction method for structured data in Web pages, comprising:

c) Prune and reconstruct each of the two DOM trees: merge the content of the leaf nodes under identical branches into a single leaf node and delete the remaining identical branches. As shown in Fig. 2, DOM tree reconstruction specifically comprises the following steps:

c4: Call a recursive function to determine whether the child subtrees are identical; if they are, call a recursive function to prune the child subtrees, finally obtaining the reconstructed DOM tree;

e) Perform feature tag path labeling and feature numeric path labeling on each of the two reconstructed DOM trees, traverse each reconstructed DOM tree in full, and check whether the content of each leaf node is identical to the object information obtained in step d); if it is, record the feature path of that leaf node;

As shown in Fig. 3, the feature numeric path labeling algorithm is as follows:

e4: Finally, the feature numeric paths of the reconstructed DOM tree are obtained.

f) Compare the feature paths found for the two sample web pages and extract those that are identical, which gives the feature path of the correct target object information; if several identical feature paths are extracted, add a new sample web page and repeat steps a) to f) until screening yields the feature path of the correct target object information;

g) Generate and reconstruct the DOM tree of a target web page that has the same structure as the sample web pages, label its feature paths, and traverse the entire reconstructed target DOM tree. The object is first looked up and extracted according to the feature numeric path obtained for the target object information; if no object content is output by the feature numeric path, the object is then looked up and extracted according to the feature tag path of the target object information.

Take extracting the "author" object content from Douban Books web pages as an example.

First, the Douban Books pages for "Dawn Blossoms Plucked at Dusk" and "Romance of the Three Kingdoms" are used as sample web pages. With the URL of the "Dawn Blossoms Plucked at Dusk" page as input and "Lu Xun" as the instance object input, the DOM tree of the page is constructed and pruned to form the reconstructed DOM tree. Querying the feature path of the "Lu Xun" object outputs two feature paths. This happens because the page text contains not only the "Lu Xun" at the "author" position that is to be obtained, but also a "Lu Xun" in a tag that Douban attaches to the book; when the entire reconstructed DOM tree is traversed, the latter also meets the matching condition and its path is recorded. The path results are shown in Fig. 4: the first and second rows are the two feature tag paths of the "Lu Xun" object, and the third and fourth rows are its feature numeric paths. The first and third rows give the location of the required target information, while the second and fourth rows give the position of interference information with identical text. In such a case a single sample web page is therefore not enough to locate the required target information correctly, and an additional sample web page with its own instance object input must be added for comparison before the correct object position can be determined.

A second sample web page, the Douban Books page for "Romance of the Three Kingdoms", is then added and the same operations are performed: the DOM tree of the page is constructed and reconstructed, and the feature path of the "Luo Guanzhong" object is queried. The path results are shown in Fig. 5: the first and second rows are the two feature tag paths of the "Luo Guanzhong" object, and the third and fourth rows are its feature numeric paths.

Because both pages are Douban Books introduction pages, their basic format is the same, so the two feature tag paths for the two instances are identical, and the correct object path cannot be obtained by comparing feature tag paths alone. The feature numeric path, however, places stricter requirements on page format, and the layout and format of the two pages are not completely identical. The feature numeric paths of the two objects therefore differ, yet one path is still identical in both pages: this is the location path of the "author" object being sought. The other path differs between the pages and can be discarded.

The feature path of the correct target object information obtained after the comparison is shown in Fig. 6. This path is the correct feature path of the required target "author" information (comprising both the feature tag path and the feature numeric path).

In this embodiment, only two sample web pages were needed to obtain the feature path of the correct target information. In other cases more than two sample web pages may be input: two sample web pages are compared first, the identical parts of their feature paths are intersected and the differing parts are united, the result is then compared with the next sample web page, and so on until the target feature path is obtained by comparison.

The DOM tree of the target web page is then generated and reconstructed. The position of the object is located quickly using the feature numeric path obtained, and the object content is extracted. If the feature numeric path does not yield any object content, the entire reconstructed target DOM tree is traversed and each path is compared with the known feature tag path; if they are identical, the object content at the corresponding position is output, and that content is the required object information.
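The sample-by-sample narrowing can be sketched as a running set intersection over the candidate paths found on each sample page. This covers only the intersection part of the comparison described above; how the differing parts are united is not specified precisely enough in the text to sketch. The function name `common_paths` and the path strings in the usage example are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A candidate is a (feature tag path, feature numeric path) pair.
Path = Tuple[str, str]


def common_paths(candidates_per_sample: Iterable[Iterable[Path]]) -> Set[Path]:
    """Intersect candidate paths sample by sample until at most one remains."""
    samples = iter(candidates_per_sample)
    common = set(next(samples))            # candidates from the first sample page
    for candidates in samples:
        common &= set(candidates)          # keep only paths shared with this sample
        if len(common) <= 1:               # unambiguous (or nothing left): stop early
            break
    return common


# Usage: two samples that share one numeric path and differ on the other.
sample_1 = {("html/body/div/a", "0/1/2"), ("html/body/div/a", "0/3/4")}
sample_2 = {("html/body/div/a", "0/1/2"), ("html/body/div/a", "0/3/5")}
target = common_paths([sample_1, sample_2])
```

If more than one path survives after all available samples, the method calls for adding a further sample page and repeating the comparison.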

Although the invention is disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may, without departing from the spirit and scope of the invention, use the methods and technical content disclosed above to make possible variations and modifications to the technical solution of the invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiment according to the technical essence of the invention, without departing from the content of the technical solution, falls within the scope of protection of the technical solution of the invention. The foregoing is only a preferred embodiment of the invention; all equivalent changes and modifications made within the scope of the claims of the invention shall fall within the scope covered by the invention.

Claims

1. An information extraction method for structured data in Web pages, characterized by comprising the following steps:

d) Parse the JSON string to obtain the object information it contains, and store that information in a List dedicated to object information, with each key name stored together with its corresponding key value;

2. The information extraction method for structured data in Web pages according to claim 1, characterized in that step c) specifically comprises the following steps:

3. The information extraction method for structured data in Web pages according to claim 1, characterized in that the feature path labeling in step e) comprises feature tag path labeling and feature numeric path labeling.

4. The information extraction method for structured data in Web pages according to claim 3, characterized in that the feature numeric path labeling algorithm is as follows:

e4: Finally, the feature numeric paths of the reconstructed DOM tree are obtained.

5. The information extraction method for structured data in Web pages according to claim 3, characterized in that, in step g), the object is first looked up and extracted according to the feature numeric path obtained for the target object information; if no object content is extracted by the feature numeric path, the object is then looked up and extracted according to the feature tag path of the target object information.