CN106326314A

CN106326314A - Web page information extraction method and device

Info

Publication number: CN106326314A
Application number: CN201510395013.8A
Authority: CN
Inventors: 马莘权
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Yayue Technology Co Ltd
Priority date: 2015-07-07
Filing date: 2015-07-07
Publication date: 2017-01-11
Anticipated expiration: 2035-07-07
Also published as: CN106326314B

Abstract

The invention discloses a web page information extraction method. The method comprises the following steps: building a node tree according to the HTML elements in the web page when an information extraction request is received; determining the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request; and extracting the information corresponding to the target position. The invention also discloses a web page information extraction device. The web page information extraction method and device reduce the operational difficulty of information extraction.

Description

Web page information extraction method and device

Technical field

The present invention relates to the field of network technology, and in particular to a web page information extraction method and device.

Background technology

In the prior art, web page information extraction generally relies on feature words to locate the information to be extracted. Locating by feature words requires the feature words to be refined repeatedly for each website. Moreover, feature words that are too general cause misjudgments on specific web pages, while feature words that are too specific are hard to apply to other web pages. Such extraction methods therefore need word segmentation and text semantic recognition to improve extraction accuracy, and the use of these techniques in turn makes information extraction difficult.

Summary of the invention

The main purpose of the embodiments of the present invention is to provide a web page information extraction method and device, aiming to reduce the operational difficulty of information extraction.

To achieve the above purpose, an embodiment of the present invention provides a web page information extraction method, which comprises the following steps:

when an information extraction request is received, building a node tree according to the HTML elements in the web page;

determining the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request;

extracting the information corresponding to the target position.

In addition, to achieve the above purpose, an embodiment of the present invention also provides a web page information extraction device, which includes:

a modeling module, used to build a node tree according to the HTML elements in the web page when an information extraction request is received;

a determining module, used to determine the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request;

an extraction module, used to extract the information corresponding to the target position.

In the embodiments of the present invention, when an information extraction request is received, a node tree is built according to the HTML elements in the web page; the target position of the information to be extracted is then determined in the node tree according to the preset configuration information in the information extraction request; finally, the information corresponding to the target position is extracted, which completes the information extraction operation on the web page. Because the embodiments extract information on a node tree generated from the HTML text, rather than by text word segmentation and semantic recognition as in the prior art, they reduce the operational difficulty of information extraction and thus lower the overall cost of extracting information from HTML web pages.
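The three steps summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `extract`, the dictionary form of the configuration and the sample page are hypothetical, and the standard-library `xml.etree.ElementTree` module stands in for the node tree builder (it accepts only well-formed markup and supports a limited subset of XPath).

```python
import xml.etree.ElementTree as ET


def extract(html_text, config):
    """Build a node tree, locate each configured position, return the text found there."""
    root = ET.fromstring(html_text)  # step 1: build the node tree
    result = {}
    for info_type, path in config.items():
        node = root.find(path)  # step 2: determine the target position
        # step 3: extract the information at the target position
        result[info_type] = None if node is None else "".join(node.itertext()).strip()
    return result


# Hypothetical sample page; real pages are rarely well-formed XML.
page = "<html><head><title>Demo</title></head><body><h1>Chapter One</h1></body></html>"
print(extract(page, {"title": ".//h1"}))
```

No word segmentation or semantic analysis is involved: the only page-specific input is the position expression in the configuration.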

Brief description of the drawings

Fig. 1 is a schematic diagram of the hardware framework of a first embodiment of the web page information extraction device of the present invention;

Fig. 2 is a schematic diagram of the functional modules of a second embodiment of the web page information extraction device of the present invention;

Fig. 3 is a schematic diagram of the functional modules of a third embodiment of the web page information extraction device of the present invention;

Fig. 4 is a schematic diagram of the detailed functional modules of a first embodiment of the modeling module in the web page information extraction device of the present invention;

Fig. 5 is an example of HTML text in the web page information extraction device of the present invention;

Fig. 6 is the node tree obtained by parsing Fig. 5;

Fig. 7 is a schematic diagram of the detailed functional modules of a second embodiment of the modeling module in the web page information extraction device of the present invention;

Fig. 8 is a flowchart of a first embodiment of the web page information extraction method of the present invention;

Fig. 9 is a flowchart of a second embodiment of the web page information extraction method of the present invention;

Fig. 10 is a flowchart of a third embodiment of the web page information extraction method of the present invention;

Fig. 11 is a detailed flowchart of a first embodiment of building the node tree in the web page information extraction method of the present invention;

Fig. 12 is a detailed flowchart of a second embodiment of building the node tree in the web page information extraction method of the present invention.

The realization of the purpose, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the invention

The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.

With reference to Fig. 1, a first embodiment of the web page information extraction device of the present invention is proposed. In this embodiment, the web page information extraction device includes a processor 111, a memory 112, a user interface 113, a network interface 114 and a communication bus 115. The communication bus 115 is used for communication among the components of the data server. The user interface 113 is used to receive information input by the user; it may be a wired interface or a wireless interface, for example a keyboard or a mouse. The network interface 114 is used for communication between the data server and the outside, and may likewise include a wired interface and a wireless interface. The memory 112 may include one or more computer-readable storage media, and includes not only internal memory but also external memory. The memory stores an operating system, a web page information extraction program and the like. The processor 111 is used to call the web page information extraction program in the memory 112 to perform the following operations:

extracting the information corresponding to the target position.

Further, the processor 111 is also used to call the web page information extraction program in the memory 112 to perform the following operations:

Before the node tree is built according to the HTML elements in the web page when the information extraction request is received, the operations further include:

generating configuration information according to a preset rule, based on the type of the information to be extracted and the position of the information to be extracted in the node tree;

generating the information extraction request according to the configuration information.

Determining the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request includes:

determining the target position by a tree traversal algorithm, according to the type of the information to be extracted and the position of the information to be extracted in the node tree.

Building the node tree according to the HTML elements in the web page when the information extraction request is received includes:

when the information extraction request is received, parsing the HTML text content;

when the start tag of an HTML element is parsed, setting the currently parsed HTML element as the target node and continuing to parse;

adding the parsed character string of non-HTML element content under the target node as a child node, and judging whether the start tag of an HTML element is parsed again;

if so, setting the currently parsed HTML element as a child node of the target node, then setting that child node as the target node and continuing to parse, and returning to the step of adding the parsed character string of non-HTML element content under the target node as a child node and judging whether the start tag of an HTML element is parsed again;

if not, then when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, setting the parent node of the target node as the target node and continuing to parse, and returning to the step of adding the parsed character string of non-HTML element content under the target node as a child node and judging whether the start tag of an HTML element is parsed again; and when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, ending the parsing of the HTML text content and forming the node tree according to the recursive relationship of the nodes.

Before the step of adding the parsed character string of non-HTML element content under the target node as a child node and judging whether the start tag of an HTML element is parsed again, the operations further include:

when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, setting the element attribute and attribute value as a child node of the target node.

With reference to Fig. 2, a second embodiment of the web page information extraction device of the present invention is provided. The web page information extraction device provided in this embodiment includes:

a modeling module 10, used to build a node tree according to the HTML elements in the web page when an information extraction request is received;

a determining module 20, used to determine the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request;

an extraction module 30, used to extract the information corresponding to the target position.

The web page information extraction method provided by this embodiment is mainly used for extracting web page information. The HTML elements mentioned above are web page elements, each including an element start tag and an element end tag; for example, the html element includes the element start tag <html> and the element end tag </html>.

When web page information is extracted, the user first describes the information to be extracted in a specific language; corresponding configuration information is then generated from that description, and an information extraction request is generated from the configuration information. When the information extraction request is received, a node tree is built according to the HTML elements (in this embodiment the node tree is a DOM tree, that is, the Document Object Model of the HTML). The target position of the information to be extracted is then determined according to the configuration information in the request, and finally the information is extracted at the determined target position, which completes the information extraction operation on the web page.

Further, with reference to Fig. 3, based on the above embodiment, in a third embodiment the web page information extraction device also includes:

a configuration generation module 40, used to generate configuration information according to a preset rule, based on the type of the information to be extracted and the position of the information to be extracted in the node tree;

a request generation module 50, used to generate the information extraction request according to the configuration information.

In this embodiment, the preset rule can be set according to actual needs; for example, regular expressions or XPath syntax can be used. The following explanation takes XPath syntax as an example. Specifically, when XPath syntax is used to describe the position of the information to be extracted in the node tree, the following configuration information is generated from the type of the information to be extracted input by the user and the position of that information in the node tree:

title://h1；

Here title is the type of the information to be extracted, and h1 is the position of the information to be extracted in the node tree. In this embodiment, h1 is an HTML element and is a node on the node tree.
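A configuration entry of this shape can be split into its two parts with ordinary string handling. The sketch below is illustrative only: the function name `parse_config_entry` is hypothetical, and it assumes that the first colon separates the type from the position expression and that a trailing semicolon (ASCII or full-width) is merely a terminator.

```python
def parse_config_entry(entry):
    """Split an entry such as 'title://h1;' into (information type, position expression)."""
    entry = entry.strip().rstrip(";；").strip()  # drop an optional trailing semicolon
    info_type, sep, position = entry.partition(":")  # split at the first colon only
    if not sep or not info_type.strip() or not position.strip():
        raise ValueError(f"malformed configuration entry: {entry!r}")
    return info_type.strip(), position.strip()


print(parse_config_entry("title://h1；"))
```

Splitting at the first colon only keeps any later colons inside the position expression intact.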

After the configuration information is generated, the corresponding information extraction request is generated according to it, so that the information extraction operation can be carried out.

Further, based on the above third embodiment, in a fourth embodiment the determining module 20 is specifically used to determine the target position by a tree traversal algorithm, according to the type of the information to be extracted and the position of the information to be extracted in the node tree.

In this embodiment, a mature tree traversal algorithm can be used to search for any position in the tree. When searching to confirm the target position of the information to be extracted, the node corresponding to the position of the information to be extracted in the node tree (h1 in the above embodiment) can be searched for first, and the specific position of the information to be extracted can then be determined according to its type (title in the above embodiment), which yields the target position. Specifically, in this embodiment the target position is the h1 node on the node tree, and its corresponding child node tree is the content of title in h1.
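The source does not name a particular traversal algorithm, so the following sketch simply assumes a pre-order depth-first search. The `Node` class, the `#text` naming convention for character-string nodes and the sample tree are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str  # tag name, or "#text" for a character string
    text: str = ""
    children: list = field(default_factory=list)


def find_all(node, tag):
    """Pre-order depth-first traversal yielding every node whose name equals tag."""
    if node.name == tag:
        yield node
    for child in node.children:
        yield from find_all(child, tag)


def text_of(node):
    """Concatenate the character-string nodes in the subtree under node."""
    if node.name == "#text":
        return node.text
    return "".join(text_of(child) for child in node.children)


tree = Node("html", children=[
    Node("head", children=[Node("title", children=[Node("#text", "Demo")])]),
    Node("body", children=[Node("h1", children=[Node("#text", "Chapter One")])]),
])

target = next(find_all(tree, "h1"))  # position from the configuration: h1
print({"title": text_of(target)})  # type from the configuration: title
```

The position part of the configuration drives the search, while the type part only labels the extracted value.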

Further, with reference to Fig. 4, based on any of the above embodiments, in a fourth embodiment the modeling module 10 includes:

a parsing unit 101, used to parse the HTML text content when an information extraction request is received, and, when the start tag of an HTML element is parsed, to set the currently parsed HTML element as the target node and continue parsing;

a judging unit 102, used to add the parsed character string of non-HTML element content under the target node as a child node, and to judge whether the start tag of an HTML element is parsed again;

a first processing unit 103, used, when the start tag of an HTML element is parsed again, to set the currently parsed HTML element as a child node of the target node, then set that child node as the target node and continue parsing, after which the judging unit again adds the parsed character string of non-HTML element content under the target node as a child node and judges whether the start tag of an HTML element is parsed again;

a second processing unit 104, used when the start tag of an HTML element is not parsed again: when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, it sets the parent node of the target node as the target node and continues parsing, after which the parsed character string of non-HTML element content is again added under the target node as a child node and it is again judged whether the start tag of an HTML element is parsed; when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, it ends the parsing of the HTML text content and forms the node tree according to the recursive relationship of the nodes.

It should be noted that the first HTML element in the HTML file of every web page is html, whose end tag is </html>. In this embodiment, the HTML text parser reads each byte of the HTML text in sequence; each time it reads the complete start tag of an HTML element, it creates an HTML element node, and it builds the DOM node tree according to the structural relationship of the HTML elements. The following explanation takes the HTML text shown in Fig. 5 as an example.

The start tag of the first HTML element parsed is <html>; the HTML element html is set as the target node, and parsing continues with the child nodes under the target node. The next start tag parsed is <head>; head is set as a child node of the html node and then becomes the target node, and parsing continues with the child nodes under it. The next start tag parsed is <title>; title is set as a child node of the head node and then becomes the target node, and parsing continues with the child nodes under it. The next item parsed is the character string of non-HTML element content "Big Dominate, complete edition reading without pop-ups", which is added under the target node title as a child node. Parsing then reaches the end tag </title>, so the parent node head of the child node title becomes the target node and parsing continues; this completes the recursion on title, and the recursion on head goes on. The recursion on each HTML element is completed in turn, forming the DOM node tree shown in Fig. 6.
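The walk-through above behaves like a stack whose top element is the current target node. A minimal sketch using the `HTMLParser` class from Python's standard library follows. The class name `NodeTreeBuilder`, the nested-dict node layout and the sample text are assumptions made for illustration (the content of Fig. 5 is not reproduced here), and void elements such as `<br>` written without a closing slash are not handled.

```python
from html.parser import HTMLParser


class NodeTreeBuilder(HTMLParser):
    """Builds a nested-dict node tree; the top of the stack is the target node."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if self.root is not None and not self.stack:
            return  # the first element has already been closed
        node = {"name": tag, "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)  # child of the target node
        else:
            self.root = node  # first element, normally html
        self.stack.append(node)  # the new element becomes the target node

    def handle_data(self, data):
        text = data.strip()
        if self.stack and text:
            self.stack[-1]["children"].append(
                {"name": "#text", "text": text, "children": []})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["name"] == tag:
            self.stack.pop()  # the parent becomes the target node again


def build_node_tree(html_text):
    builder = NodeTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return one indented line per node, for inspection."""
    lines = ["  " * depth + node.get("text", node["name"])]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


sample = ("<html><head><title>Sample title</title></head>"
          "<body><h1>Chapter One</h1></body></html>")
print("\n".join(outline(build_node_tree(sample))))
```

Popping the stack on an end tag corresponds to making the parent node the target node again, as happens when </title> is reached in the walk-through.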

Further, with reference to Fig. 7, based on the above fifth embodiment, in a sixth embodiment the information extraction device also includes:

a child node setting unit, used, when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, to set the element attribute and attribute value as a child node of the target node.

As shown in Fig. 6, take the div among the HTML elements as an example: class is an element attribute of div with the attribute value area, and id is an element attribute of div with the attribute value content. Each element attribute and its attribute value can be added under the corresponding div node as a child node, that is, class=area and id=content become child nodes of the corresponding div. In this embodiment, because element attributes and attribute values are added as child nodes of the corresponding HTML element, element attributes and attribute values can be included in the configuration information when the node tree is traversed, so that the position of an HTML element is located more precisely. For example, when XPath syntax is used to describe the position of the information to be extracted in the node tree, the following configuration information is generated from the type of the information to be extracted input by the user and the position of that information in the node tree: $type://div[@class="play_seat"]/a[2].
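One way to picture attributes stored as child nodes, and how a path such as the play_seat example could be resolved against them, is sketched below. The helper names, the `@`-prefixed naming of attribute nodes and the sample tree (including the href values) are hypothetical; the source does not specify how attribute nodes are represented internally.

```python
def element(name, attrs=None, children=()):
    """Create an element node whose attributes are stored as leading child nodes."""
    attr_nodes = [{"name": "@" + key, "value": value, "children": []}
                  for key, value in (attrs or {}).items()]
    return {"name": name, "children": attr_nodes + list(children)}


def has_attr(node, key, value):
    """True if node has an attribute child node key=value."""
    return any(child["name"] == "@" + key and child.get("value") == value
               for child in node["children"])


def find(node, name, key, value):
    """Depth-first search for elements called name carrying the attribute key=value."""
    if node["name"] == name and has_attr(node, key, value):
        yield node
    for child in node["children"]:
        yield from find(child, name, key, value)


def nth_child_element(node, name, n):
    """Return the n-th (1-based, as in XPath) child element called name, or None."""
    matches = [child for child in node["children"] if child["name"] == name]
    return matches[n - 1] if 0 < n <= len(matches) else None


tree = element("body", children=[
    element("div", {"class": "area", "id": "content"}),
    element("div", {"class": "play_seat"}, [
        element("a", {"href": "/p/1"}),
        element("a", {"href": "/p/2"}),
    ]),
])

# Resolve //div[@class="play_seat"]/a[2] by hand on the sample tree.
div = next(find(tree, "div", "class", "play_seat"))
link = nth_child_element(div, "a", 2)
print([child["value"] for child in link["children"] if child["name"] == "@href"])
```

Because the attribute nodes sit in the same tree as the elements, the same traversal routine can test them without a separate attribute table.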

With reference to Fig. 8, a first embodiment of the web page information extraction method of the present invention is proposed. In the first embodiment, the web page information extraction method comprises the following steps:

Step S10: when an information extraction request is received, building a node tree according to the HTML elements in the web page;

Step S20: determining the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request;

Step S30: extracting the information corresponding to the target position.

Further, with reference to Fig. 9, based on the above embodiment, in a second embodiment the following steps are also included before step S10:

Step S40: generating configuration information according to a preset rule, based on the type of the information to be extracted and the position of the information to be extracted in the node tree;

Step S50: generating the information extraction request according to the configuration information.

title://h1；

Further, with reference to Fig. 10, based on the above second embodiment, in a third embodiment step S20 includes:

Further, with reference to Fig. 11, based on any of the above embodiments, in a fourth embodiment step S10 includes:

Step S11: when an information extraction request is received, parsing the HTML text content;

Step S12: when the start tag of an HTML element is parsed, setting the currently parsed HTML element as the target node and continuing to parse;

Step S13: adding the parsed character string of non-HTML element content under the target node as a child node, and judging whether the start tag of an HTML element is parsed again; if so, performing step S14, otherwise performing step S15;

Step S14: setting the currently parsed HTML element as a child node of the target node, then setting that child node as the target node, continuing to parse, and returning to step S13;

Step S15: when the end tag of the HTML element corresponding to the target node is parsed, judging whether that end tag is the end tag of the first HTML element; if so, performing step S16, otherwise performing step S17;

Step S16: ending the parsing of the HTML text content, and forming the node tree according to the recursive relationship of the nodes;

Step S17: setting the parent node of the target node as the target node, continuing to parse, and returning to step S13.
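Steps S11 to S17 can also be written as one explicit loop that tracks the target node and its parent, stopping at the end tag of the first element. The sketch below is a simplified illustration rather than the patented parser: the regular-expression tokenizer, the function name `build_tree` and the dictionary node layout are assumptions, and attributes, comments, scripts and mismatched end tags are ignored.

```python
import re

# A start or end tag, or a run of text between tags (simplified tokenizer).
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def build_tree(html_text):
    root = target = None
    parent_of = {}  # id(node) -> parent node
    for close, tag, text in TOKEN.findall(html_text):  # S11: parse the HTML text
        if text:
            # S13: a character string becomes a child node of the target node
            if target is not None and text.strip():
                target["children"].append(
                    {"name": "#text", "text": text.strip(), "children": []})
        elif not close:
            node = {"name": tag.lower(), "children": []}
            if target is None:
                root = node  # S12: the first start tag gives the first target node
            else:
                target["children"].append(node)  # S14: child of the target node
                parent_of[id(node)] = target
            target = node  # S12/S14: the new element becomes the target node
        elif target is not None and tag.lower() == target["name"]:
            if target is root:  # S15: is this the end tag of the first element?
                break  # S16: parsing ends, the tree is complete
            target = parent_of[id(target)]  # S17: parent becomes the target node
    return root


doc = "<html><head><title>T</title></head><body><h1>Hi</h1></body></html><p>ignored</p>"
tree = build_tree(doc)
print([child["name"] for child in tree["children"]])
```

Stopping at the end tag of the first element means anything after </html> never reaches the tree, which matches the termination condition in step S16.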

Further, with reference to Fig. 12, based on the above fifth embodiment, in a sixth embodiment the following step is also included before step S13:

Step S18: when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, setting the element attribute and attribute value as a child node of the target node.

The above are only preferred embodiments of the present invention and do not thereby limit the scope of its patent. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. A web page information extraction method, characterized in that the web page information extraction method comprises the following steps:

extracting the information corresponding to the target position.

2. The web page information extraction method according to claim 1, characterized in that, before the node tree is built according to the HTML elements in the web page when the information extraction request is received, the method further comprises:

3. The web page information extraction method according to claim 2, characterized in that determining the target position of the information to be extracted in the node tree according to the preset configuration information in the information extraction request comprises:

4. The web page information extraction method according to any one of claims 1 to 3, characterized in that building the node tree according to the HTML elements in the web page when the information extraction request is received comprises:

when the information extraction request is received, parsing the HTML text content;

if not, then when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, setting the parent node of the target node as the target node and continuing to parse, and performing the step of adding the parsed character string of non-HTML element content under the target node as a child node and judging whether the start tag of an HTML element is parsed again; and when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, ending the parsing of the HTML text content and forming the node tree according to the recursive relationship of the nodes.

5. The web page information extraction method according to claim 4, characterized in that, before the step of adding the parsed character string of non-HTML element content under the target node as a child node and judging whether the start tag of an HTML element is parsed again, the method further comprises:

6. A web page information extraction device, characterized in that the web page information extraction device comprises:

7. The web page information extraction device according to claim 6, characterized in that the web page information extraction device further comprises:

a configuration generation module, used to generate configuration information according to a preset rule, based on the type of the information to be extracted and the position of the information to be extracted in the node tree;

a request generation module, used to generate the information extraction request according to the configuration information.

8. The web page information extraction device according to claim 7, characterized in that the determining module is specifically used to determine the target position by a tree traversal algorithm, according to the type of the information to be extracted and the position of the information to be extracted in the node tree.

9. The web page information extraction device according to any one of claims 6 to 8, characterized in that the modeling module comprises:

a parsing unit, used to parse the HTML text content when an information extraction request is received, and, when the start tag of an HTML element is parsed, to set the currently parsed HTML element as the target node and continue parsing;

a judging unit, used to add the parsed character string of non-HTML element content under the target node as a child node, and to judge whether the start tag of an HTML element is parsed again;

a first processing unit, used, when the start tag of an HTML element is parsed again, to set the currently parsed HTML element as a child node of the target node, then set that child node as the target node and continue parsing, after which the judging unit again adds the parsed character string of non-HTML element content under the target node as a child node and judges whether the start tag of an HTML element is parsed again;

a second processing unit, used when the start tag of an HTML element is not parsed again: when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, it sets the parent node of the target node as the target node and continues parsing, after which the judging unit again adds the parsed character string of non-HTML element content under the target node as a child node and judges whether the start tag of an HTML element is parsed again; when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, it ends the parsing of the HTML text content and forms the node tree according to the recursive relationship of the nodes.

10. The web page information extraction device according to claim 9, characterized in that the information extraction device further comprises: