CN106326314A - Web page information extraction method and device - Google Patents
- Publication number
- CN106326314A (application CN201510395013.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- information
- html element
- html
- resolved
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Abstract
The invention discloses a web page information extraction method. The method comprises the following steps: building a node tree from the HTML elements in the web page when an information extraction request is received; determining, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request; and extracting the information corresponding to the target position. The invention also discloses a web page information extraction device. The method and device reduce the operational difficulty of information extraction.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a web page information extraction method and device.
Background art
In the prior art, web page information extraction generally relies on feature words to locate the information to be extracted. Locating by feature words requires the feature words to be refined repeatedly for each website. Moreover, feature words that are too general cause misjudgments on specific web pages, while feature words that are too specific are difficult to apply to other web pages. Such extraction methods therefore need word segmentation and text semantic recognition to improve extraction accuracy, and the use of these techniques in turn makes information extraction difficult.
Summary of the invention
The main purpose of the embodiments of the present invention is to provide a web page information extraction method and device that reduce the operational difficulty of information extraction.
To achieve the above purpose, an embodiment of the present invention provides a web page information extraction method comprising the following steps:
when an information extraction request is received, building a node tree from the HTML elements in the web page;
determining, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request;
extracting the information corresponding to the target position.
In addition, to achieve the above purpose, an embodiment of the present invention further provides a web page information extraction device comprising:
a modeling module, configured to build a node tree from the HTML elements in the web page when an information extraction request is received;
a determining module, configured to determine, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request;
an extraction module, configured to extract the information corresponding to the target position.
In the embodiments of the present invention, a node tree is built from the HTML elements in the web page when an information extraction request is received; the target position of the information to be extracted is then determined in the node tree according to the preset configuration information in the request; finally, the information corresponding to the target position is extracted, which completes the extraction of information from the web page. Because extraction is performed on a node tree generated from the HTML text, rather than through text word segmentation and semantic recognition as in the prior art, the embodiments of the present invention reduce the operational difficulty of information extraction and thereby lower the overall cost of extracting information from HTML web pages.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture of a first embodiment of the web page information extraction device of the present invention;
Fig. 2 is a schematic diagram of the functional modules of a second embodiment of the web page information extraction device of the present invention;
Fig. 3 is a schematic diagram of the functional modules of a third embodiment of the web page information extraction device of the present invention;
Fig. 4 is a detailed schematic diagram of the functional modules of a first embodiment of the modeling module in the web page information extraction device of the present invention;
Fig. 5 is an example of HTML text used in the web page information extraction device of the present invention;
Fig. 6 is the node tree obtained by parsing the text of Fig. 5;
Fig. 7 is a detailed schematic diagram of the functional modules of a second embodiment of the modeling module in the web page information extraction device of the present invention;
Fig. 8 is a flowchart of a first embodiment of the web page information extraction method of the present invention;
Fig. 9 is a flowchart of a second embodiment of the web page information extraction method of the present invention;
Fig. 10 is a flowchart of a third embodiment of the web page information extraction method of the present invention;
Fig. 11 is a detailed flowchart of a first embodiment of building the node tree in the web page information extraction method of the present invention;
Fig. 12 is a detailed flowchart of a second embodiment of building the node tree in the web page information extraction method of the present invention.
The realization of the purpose, the functional features, and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the invention
The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
Referring to Fig. 1, a first embodiment of the web page information extraction device of the present invention is proposed. In this embodiment, the device includes a processor 111, a memory 112, a user interface 113, a network interface 114, and a communication bus 115. The communication bus 115 carries communication among the components of the data server. The user interface 113 receives information input by the user and may be a wired or wireless interface, for example to a keyboard or mouse. The network interface 114 is used for communication between the data server and the outside, and may likewise include wired and wireless interfaces. The memory 112 may include one or more computer-readable storage media, covering both internal and external memory. The memory stores an operating system, a web page information extraction program, and the like. The processor 111 is configured to call the web page information extraction program in the memory 112 to perform the following operations:
when an information extraction request is received, building a node tree from the HTML elements in the web page;
determining, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request;
extracting the information corresponding to the target position.
Further, the processor 111 is also configured to call the web page information extraction program in the memory 112 to perform the following operations:
before the node tree is built from the HTML elements in the web page upon receipt of the information extraction request:
generating configuration information according to a preset rule, based on the type of the information to be extracted and the position of that information in the node tree;
generating the information extraction request according to the configuration information.
Further, the processor 111 is also configured to call the web page information extraction program in the memory 112 to perform the following operations:
determining, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request includes:
determining the target position with a tree traversal algorithm, based on the type of the information to be extracted and the position of that information in the node tree.
Further, the processor 111 is also configured to call the web page information extraction program in the memory 112 to perform the following operations:
building a node tree from the HTML elements in the web page when an information extraction request is received includes:
when the information extraction request is received, parsing the HTML text content;
when the start tag of an HTML element is parsed, setting the currently parsed HTML element as the target node and continuing to parse;
adding any parsed character string that is not an HTML element (that is, text content) under the target node as a child node, and judging whether the start tag of another HTML element is parsed;
if so, setting the currently parsed HTML element as a child node of the target node, then setting that child node as the target node and continuing to parse, and performing again the step of adding parsed non-element character strings under the target node as child nodes and judging whether the start tag of another HTML element is parsed;
if not, when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, setting the parent node of the target node as the target node and continuing to parse, and performing again the step of adding parsed non-element character strings under the target node as child nodes and judging whether the start tag of another HTML element is parsed; when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, ending the parsing of the HTML text content and forming the node tree from the recursive relationships among the nodes.
Further, the processor 111 is also configured to call the web page information extraction program in the memory 112 to perform the following operations:
before the step of adding parsed non-element character strings under the target node as child nodes and judging whether the start tag of another HTML element is parsed:
when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, setting the element attribute and attribute value as a child node of the target node.
In this embodiment of the present invention, a node tree is built from the HTML elements in the web page when an information extraction request is received; the target position of the information to be extracted is then determined in the node tree according to the preset configuration information in the request; finally, the information corresponding to the target position is extracted, which completes the extraction of information from the web page. Because extraction is performed on a node tree generated from the HTML text, rather than through text word segmentation and semantic recognition as in the prior art, this embodiment reduces the operational difficulty of information extraction and thereby lowers the overall cost of extracting information from HTML web pages.
Referring to Fig. 2, a second embodiment of the web page information extraction device of the present invention is provided. The device in this embodiment includes:
a modeling module 10, configured to build a node tree from the HTML elements in the web page when an information extraction request is received;
a determining module 20, configured to determine, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request;
an extraction module 30, configured to extract the information corresponding to the target position.
The web page information extraction device provided in this embodiment is mainly used to extract web page information. The HTML elements are web page elements, each consisting of an element start tag and an element end tag; for example, the html element consists of the start tag <html> and the end tag </html>.
When web page information is to be extracted, the user first describes the information to be extracted in a specific language. Corresponding configuration information is then generated from that description, and an information extraction request is generated from the configuration information. When the request is received, a node tree is built from the HTML elements (in this embodiment the node tree is a DOM tree, that is, the Document Object Model of the HTML). The target position of the information to be extracted is then determined from the configuration information in the request, and finally the information is extracted at the determined target position, which completes the extraction of information from the web page.
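By way of illustration only, the cooperation of the three modules might be sketched in Python as follows. The class names, the dictionary layout of the request, and the sample page are hypothetical and are not taken from the disclosure; the standard-library XML parser stands in for the DOM builder, which assumes the page is well-formed.

```python
import xml.etree.ElementTree as ET


class ModelingModule:
    """Builds the node tree from the markup (assumes well-formed markup)."""

    def build(self, html_text):
        return ET.fromstring(html_text)


class DeterminingModule:
    """Resolves the configured position to a node of the tree, or None."""

    def locate(self, tree, config):
        return tree.find(config["path"])


class ExtractionModule:
    """Returns the text found under the target node."""

    def extract(self, node):
        if node is None:
            return None
        return "".join(node.itertext()).strip()


class Extractor:
    """Hypothetical wrapper that runs the three modules in order."""

    def __init__(self):
        self.modeling = ModelingModule()
        self.determining = DeterminingModule()
        self.extraction = ExtractionModule()

    def handle(self, request):
        tree = self.modeling.build(request["html"])
        node = self.determining.locate(tree, request["config"])
        return {request["config"]["type"]: self.extraction.extract(node)}


# Hypothetical request: the page text plus the preset configuration information.
request = {
    "html": "<html><head><title>Demo page</title></head>"
            "<body><h1>Chapter One</h1></body></html>",
    "config": {"type": "title", "path": ".//h1"},
}
result = Extractor().handle(request)
```

In this sketch the only site-specific input is the configuration entry; no word segmentation or semantic analysis is involved.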
In this embodiment of the present invention, a node tree is built from the HTML elements in the web page when an information extraction request is received; the target position of the information to be extracted is then determined in the node tree according to the preset configuration information in the request; finally, the information corresponding to the target position is extracted, which completes the extraction of information from the web page. Because extraction is performed on a node tree generated from the HTML text, rather than through text word segmentation and semantic recognition as in the prior art, this embodiment reduces the operational difficulty of information extraction and thereby lowers the overall cost of extracting information from HTML web pages.
Further, referring to Fig. 3, based on the above embodiment, in a third embodiment the web page information extraction device further includes:
a configuration generation module 40, configured to generate configuration information according to a preset rule, based on the type of the information to be extracted and the position of that information in the node tree;
a request generation module 50, configured to generate the information extraction request according to the configuration information.
In this embodiment, the preset rule can be set according to actual needs; for example, regular expressions or XPath syntax may be used. XPath syntax is taken as the example below. Specifically, when XPath syntax is used to describe the position of the information to be extracted in the node tree, the following configuration information is generated from the type of the information to be extracted, as input by the user, and the position of that information in the node tree:
title://h1;
Here title is the type of the information to be extracted, and h1 is the position of that information in the node tree. In this embodiment, h1 is an HTML element and is a node of the node tree.
After the configuration information is generated, the corresponding information extraction request is generated from it so that the information extraction operation can be carried out.
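As a minimal sketch, and assuming the "type:position;" layout of the example above, the generation of configuration information and of a request might look as follows in Python. The names ExtractionConfig, generate_config, parse_config, and generate_request, the request layout, and the URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    info_type: str  # type of the information to be extracted, e.g. "title"
    position: str   # XPath-style position in the node tree, e.g. "//h1"


def generate_config(info_type, position):
    """Render one configuration entry in the 'type:position;' form."""
    return f"{info_type}:{position};"


def parse_config(line):
    """Split a 'type:position;' entry at the first colon."""
    info_type, position = line.strip().rstrip(";").split(":", 1)
    return ExtractionConfig(info_type, position)


def generate_request(url, config_line):
    """Wrap the configuration in a request (the dict layout is an assumption)."""
    return {"url": url, "config": parse_config(config_line)}


line = generate_config("title", "//h1")
request = generate_request("http://example.com/page.html", line)
```

Splitting at the first colon only keeps any later colons inside the position expression intact.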
Further, based on the third embodiment above, in a fourth embodiment the determining module 20 is specifically configured to determine the target position with a tree traversal algorithm, based on the type of the information to be extracted and the position of that information in the node tree.
In this embodiment, any mature tree traversal algorithm can be used to search for a position in the tree. When searching for the target position of the information to be extracted, the node corresponding to the position of that information in the node tree (h1 in the embodiment above) is searched for first, and the specific location of the information to be extracted is then determined according to the type of the information (title in the embodiment above), which yields the target position. Specifically, in this embodiment the target position is the subtree under the h1 node of the node tree, that is, the title content contained in h1.
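The search described above might be sketched as a pre-order depth-first traversal. This is one possible traversal among many; the Node class, the "#text" naming for text nodes, and the sample tree are assumptions made for the illustration.

```python
class Node:
    def __init__(self, name, children=None, text=None):
        self.name = name                # tag name, or "#text" for a text node
        self.text = text                # only set on text nodes
        self.children = children or []


def find_first(node, tag):
    """Pre-order depth-first search for the first node named `tag`."""
    if node.name == tag:
        return node
    for child in node.children:
        hit = find_first(child, tag)
        if hit is not None:
            return hit
    return None


def subtree_text(node):
    """Concatenate every text node in the subtree rooted at `node`."""
    if node.name == "#text":
        return node.text
    return "".join(subtree_text(child) for child in node.children)


# Hypothetical node tree for a small page.
tree = Node("html", [
    Node("head", [Node("title", [Node("#text", text="Demo page")])]),
    Node("body", [Node("h1", [Node("#text", text="Chapter One")])]),
])

target = find_first(tree, "h1")                 # position from the configuration
extracted = {"title": subtree_text(target)}     # type from the configuration
```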
Further, referring to Fig. 4, based on any of the above embodiments, in a fifth embodiment the modeling module 10 includes:
a parsing unit 101, configured to parse the HTML text content when an information extraction request is received, and, when the start tag of an HTML element is parsed, to set the currently parsed HTML element as the target node and continue parsing;
a judging unit 102, configured to add any parsed character string that is not an HTML element (that is, text content) under the target node as a child node, and to judge whether the start tag of another HTML element is parsed;
a first processing unit 103, configured, when the start tag of another HTML element is parsed, to set the currently parsed HTML element as a child node of the target node, then to set that child node as the target node and continue parsing, after which the judging unit again adds parsed non-element character strings under the target node as child nodes and judges whether the start tag of another HTML element is parsed;
a second processing unit 104, configured, when no further start tag of an HTML element is parsed and the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, to set the parent node of the target node as the target node and continue parsing, and to perform again the operation of adding parsed non-element character strings under the target node as child nodes and judging whether the start tag of another HTML element is parsed; and, when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, to end the parsing of the HTML text content and form the node tree from the recursive relationships among the nodes.
It should be noted that the first HTML element in the HTML file of every web page is html, whose end tag is </html>. In this embodiment, the HTML text parser reads each byte of the HTML text in turn; each time it reads the complete start tag of an HTML element, it creates an HTML element node, and it builds the DOM node tree according to the structural relationships among the HTML elements. The process is explained in detail below, taking the content shown in Fig. 5 as the HTML text content.
The first start tag parsed is <html>, so the HTML element html is set as the target node, and parsing continues with the child nodes under the target node. The next start tag parsed is <head>; head is set as a child node of the html node and then becomes the target node, and parsing continues with the child nodes under it. The next start tag parsed is <title>; title is set as a child node of the head node and then becomes the target node, and parsing continues with the child nodes under it. Next, the non-element character string "Big Dominate: full edition reading without pop-ups" is parsed, and this string is added under the target node title as a child node. Parsing then reaches the end tag </title>, so the parent node head of the child node title becomes the target node and parsing continues; this completes the recursion for title, and the recursion for head continues. The recursion for each HTML element is completed in turn, forming the DOM node tree shown in Fig. 6.
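The walkthrough above can be approximated with Python's standard html.parser module, as in the following sketch. It is not the parser of the embodiment: the Node and TreeBuilder classes, the "#text" naming, and the sample markup are hypothetical, and the sketch assumes every start tag has a matching end tag (no void elements such as a bare br).

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, name, parent=None, text=None):
        self.name, self.parent, self.text = name, parent, text
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.target = None  # the current target node

    def handle_starttag(self, tag, attrs):
        # A start tag creates a child of the target node and becomes the new target.
        node = Node(tag, parent=self.target)
        if self.root is None:
            self.root = node
        self.target = node

    def handle_data(self, data):
        # Non-element text is added under the target node as a child node.
        if self.target is not None and data.strip():
            Node("#text", parent=self.target, text=data.strip())

    def handle_endtag(self, tag):
        # An end tag hands control back to the parent of the target node.
        if self.target is not None:
            self.target = self.target.parent


def outline(node, depth=0):
    """Render the tree as indented lines for inspection."""
    label = node.text if node.name == "#text" else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<html><head><title>Demo page</title></head>"
             "<body><h1>Chapter One</h1></body></html>")
builder.close()
tree_outline = outline(builder.root)
```

Keeping a single target pointer, rather than an explicit stack, mirrors the description: the parent link of each node is what allows the parser to step back when an end tag is read.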
Further, referring to Fig. 7, based on the fifth embodiment above, in a sixth embodiment the web page information extraction device further includes:
a child node setting unit, configured, when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, to set the element attribute and attribute value as a child node of the target node.
As shown in Fig. 6, taking the div HTML elements as an example, class is an element attribute of div with the attribute value area, and id is an element attribute of div with the attribute value content. Each element attribute and its attribute value can be added under the corresponding div node as a child node; that is, class=area and id=content become child nodes of the corresponding div. In this embodiment, because element attributes and attribute values are added as child nodes of the corresponding HTML element, element attributes and attribute values can be included in the configuration information when the node tree is traversed, so that the HTML element is located more precisely. For example, when XPath syntax is used to describe the position of the information to be extracted in the node tree, the following configuration information is generated from the type of the information to be extracted, as input by the user, and the position of that information in the node tree: $type://div[@class="play_seat"]/a[2].
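A minimal sketch of this idea follows: attributes are stored as child nodes (named with a leading "@", which is a convention assumed here), and a hand-written selector handles only the single pattern of the example, a tag with one attribute test followed by the n-th child of a given tag. The names element, has_attr, walk, and select, and the sample tree, are hypothetical; a full XPath engine would cover far more.

```python
class Node:
    def __init__(self, name, text=None):
        self.name, self.text, self.children = name, text, []

    def add(self, child):
        self.children.append(child)
        return child


def element(name, attrs=None):
    """Create an element node whose attributes are stored as '@name' child nodes."""
    node = Node(name)
    for key, value in (attrs or {}).items():
        node.add(Node("@" + key, text=value))
    return node


def has_attr(node, key, value):
    return any(c.name == "@" + key and c.text == value for c in node.children)


def walk(node):
    """Yield the node and all its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def select(root, tag, key, value, child_tag, index):
    """Rough equivalent of //tag[@key="value"]/child_tag[index], index 1-based."""
    results = []
    for node in walk(root):
        if node.name == tag and has_attr(node, key, value):
            matches = [c for c in node.children if c.name == child_tag]
            if len(matches) >= index:
                results.append(matches[index - 1])
    return results


# Hypothetical tree: one div with class "play_seat" holding three links.
root = element("body")
seats = root.add(element("div", {"class": "play_seat"}))
for label in ("first", "second", "third"):
    seats.add(element("a")).add(Node("#text", text=label))
root.add(element("div", {"class": "other"})).add(element("a"))

hits = select(root, "div", "class", "play_seat", "a", 2)
hit_text = [c.text for c in hits[0].children if c.name == "#text"]
```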
Referring to Fig. 8, a first embodiment of the web page information extraction method of the present invention is proposed. In the first embodiment, the method comprises the following steps:
Step S10: when an information extraction request is received, building a node tree from the HTML elements in the web page;
Step S20: determining, in the node tree, the target position of the information to be extracted according to the preset configuration information in the information extraction request;
Step S30: extracting the information corresponding to the target position.
The web page information extraction method provided in this embodiment is mainly used to extract web page information. The HTML elements are web page elements, each consisting of an element start tag and an element end tag; for example, the html element consists of the start tag <html> and the end tag </html>.
When web page information is to be extracted, the user first describes the information to be extracted in a specific language. Corresponding configuration information is then generated from that description, and an information extraction request is generated from the configuration information. When the request is received, a node tree is built from the HTML elements (in this embodiment the node tree is a DOM tree, that is, the Document Object Model of the HTML). The target position of the information to be extracted is then determined from the configuration information in the request, and finally the information is extracted at the determined target position, which completes the extraction of information from the web page.
In this embodiment of the present invention, a node tree is built from the HTML elements in the web page when an information extraction request is received; the target position of the information to be extracted is then determined in the node tree according to the preset configuration information in the request; finally, the information corresponding to the target position is extracted, which completes the extraction of information from the web page. Because extraction is performed on a node tree generated from the HTML text, rather than through text word segmentation and semantic recognition as in the prior art, this embodiment reduces the operational difficulty of information extraction and thereby lowers the overall cost of extracting information from HTML web pages.
Further, referring to Fig. 9, based on the above embodiment, in a second embodiment the following steps are also performed before step S10:
Step S40: generating configuration information according to a preset rule, based on the type of the information to be extracted and the position of that information in the node tree;
Step S50: generating the information extraction request according to the configuration information.
In this embodiment, the preset rule can be set according to actual needs; for example, regular expressions or XPath syntax may be used. XPath syntax is taken as the example below. Specifically, when XPath syntax is used to describe the position of the information to be extracted in the node tree, the following configuration information is generated from the type of the information to be extracted, as input by the user, and the position of that information in the node tree:
title://h1;
Here title is the type of the information to be extracted, and h1 is the position of that information in the node tree. In this embodiment, h1 is an HTML element and is a node of the node tree.
After the configuration information is generated, the corresponding information extraction request is generated from it so that the information extraction operation can be carried out.
Further, referring to Fig. 10, based on the second embodiment above, in a third embodiment step S20 includes:
determining the target position with a tree traversal algorithm, based on the type of the information to be extracted and the position of that information in the node tree.
In this embodiment, any mature tree traversal algorithm can be used to search for a position in the tree. When searching for the target position of the information to be extracted, the node corresponding to the position of that information in the node tree (h1 in the embodiment above) is searched for first, and the specific location of the information to be extracted is then determined according to the type of the information (title in the embodiment above), which yields the target position. Specifically, in this embodiment the target position is the subtree under the h1 node of the node tree, that is, the title content contained in h1.
Further, referring to Fig. 11, based on any of the above embodiments, in a fourth embodiment step S10 includes:
Step S11: when an information extraction request is received, parsing the HTML text content;
Step S12: when the start tag of an HTML element is parsed, setting the currently parsed HTML element as the target node and continuing to parse;
Step S13: adding any parsed character string that is not an HTML element (that is, text content) under the target node as a child node, and judging whether the start tag of another HTML element is parsed; if so, performing step S14, otherwise performing step S15;
Step S14: setting the currently parsed HTML element as a child node of the target node, then setting that child node as the target node and continuing to parse, and returning to step S13;
Step S15: when the end tag of the HTML element corresponding to the target node is parsed, judging whether that end tag is the end tag of the first HTML element; if so, performing step S16, otherwise performing step S17;
Step S16: ending the parsing of the HTML text content, and forming the node tree from the recursive relationships among the nodes;
Step S17: setting the parent node of the target node as the target node and continuing to parse, and returning to step S13.
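Steps S11 to S17 might be sketched as a single loop over a token stream, as below. The regular-expression tokenizer is a simplification assumed for the illustration (it ignores attributes, comments, and void elements and expects well-formed markup), and the dictionary-based node layout and the function names are hypothetical.

```python
import re

# Matches a start tag, an end tag, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def build_tree(html_text):
    root = None
    target = None  # the current target node
    for closing, tag, text in TOKEN.findall(html_text):  # S11: parse the text
        if text:
            # S13: non-element text becomes a child of the target node.
            if target is not None and text.strip():
                target["children"].append(
                    {"name": "#text", "text": text.strip(),
                     "children": [], "parent": target})
        elif not closing:
            # S12 / S14: a start tag creates a node that becomes the target.
            node = {"name": tag, "children": [], "parent": target}
            if target is None:
                root = node
            else:
                target["children"].append(node)
            target = node
        else:
            # S15: an end tag was parsed for the target node.
            if target is root:
                break                      # S16: the first element is closed
            target = target["parent"]      # S17: continue with the parent
    return root


def shape(node):
    """Summarize the tree as nested (name, children) tuples for inspection."""
    if node["name"] == "#text":
        return node["text"]
    return (node["name"], [shape(child) for child in node["children"]])


tree = build_tree("<html><head><title>Demo page</title></head>"
                  "<body><h1>Chapter One</h1></body></html>")
tree_shape = shape(tree)
```

The sketch does not check that an end tag names the same element as the target node; a fuller implementation would need that check to cope with malformed pages.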
It should be noted that, first HTML element is html in the html file of each webpage, should
The end-tag of html is</html>.In the present embodiment, html text resolver will be successively read HTML
Each byte in text, when often reading the beginning label of a complete HTML element, will create
One HTML element node, and create DOM node tree according to the structural relation of each HTML element.With
As a example by html text content is the content shown in Fig. 5, it is explained in detail.
The start tag of the first HTML element parsed is <html>, so the HTML element html is set as the target node and parsing continues with the child nodes under it. The next start tag parsed is <head>; head is set as a child node of the html node, head becomes the target node, and parsing continues with the child nodes under it. The next start tag parsed is <title>; title is set as a child node of the head node, title becomes the target node, and parsing continues with the child nodes under it. The parser then obtains a character string of non-HTML-element content, "big dominate without the reading of pop-up complete edition", and adds this string under the target node title as a child node. Next, the end tag </title> is parsed; the parent node head of the child node title becomes the target node and parsing continues, which completes the recursion for title and resumes the recursion for head. The recursion for each HTML element is completed in turn, forming the DOM node tree shown in Fig. 6.
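The movement of the target node in this walk-through can be made visible with a short trace. The sketch below is illustrative only: the function name trace_targets and the sample fragment are hypothetical (the fragment is not the content of Fig. 5), the tokenizer is a simple regular expression, and well-formed input with matching end tags is assumed.

```python
import re

# Matches a start tag, an end tag, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def trace_targets(html_text):
    """Return a log of how the target node changes while parsing."""
    stack = []   # open elements; stack[-1] is the current target node
    trace = []
    for closing, tag, text in TOKEN.findall(html_text):
        if text:
            # Non-element content is attached under the current target.
            if text.strip() and stack:
                trace.append(f"text under {stack[-1]}")
        elif closing:
            if stack:
                stack.pop()   # end tag: the parent becomes the target again
                trace.append(f"target={stack[-1]}" if stack else "done")
        else:
            stack.append(tag)  # start tag: the new element becomes the target
            trace.append(f"target={tag}")
    return trace


sample = "<html><head><title>Example title</title></head></html>"
for entry in trace_targets(sample):
    print(entry)
```

For the sample fragment the target moves from html to head to title, the text is attached under title, and the target then returns to head and html before parsing ends.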
Further, referring to Fig. 12, in a sixth embodiment based on the fifth embodiment above, the method further includes, before step S13:
Step S18: when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, set the element attribute and attribute value as a child node of the target node.
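Step S18 can be sketched as follows. The sketch is illustrative and rests on assumptions: the Node type, its kind field, and the helper names element_with_attributes and find_by_attribute are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                      # "element", "attribute" or "text"
    name: str
    value: Optional[str] = None    # attribute value or text content
    children: List["Node"] = field(default_factory=list)


def element_with_attributes(tag, attrs):
    """Create an element node whose attributes are stored as child nodes (S18)."""
    node = Node("element", tag)
    for name, value in attrs:
        node.children.append(Node("attribute", name, value))
    return node


def find_by_attribute(root, name, value):
    """Depth-first search for elements that have the given attribute child."""
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        if any(c.kind == "attribute" and c.name == name and c.value == value
               for c in node.children):
            matches.append(node)
        # Only element children are traversed further.
        stack.extend(c for c in reversed(node.children) if c.kind == "element")
    return matches


div = element_with_attributes("div", [("class", "area"), ("id", "content")])
body = Node("element", "body", children=[div])
print([n.name for n in find_by_attribute(body, "id", "content")])
```

Because attributes live in the same tree as elements, one traversal routine can match on tag names and on attribute values alike.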
As shown in Fig. 6, take the div HTML element as an example: class is an element attribute of div with the attribute value area, and id is an element attribute of div with the attribute value content. Each element attribute and its attribute value can be added under the corresponding div node as a child node; that is, class=area and id=content become child nodes of the corresponding div. In this embodiment, because element attributes and attribute values are added as child nodes of the corresponding HTML elements, they can be included in the configuration information above and used when traversing the node tree, so that the position of an HTML element is located more precisely. For example, when XPath syntax is used to describe the position in the node tree of the information to be extracted, the following configuration information is generated from the type of the information to be extracted and its position in the node tree, as input by the user: $type: //div[@class="play_seat"]/a[2].
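A configuration entry of this form can be evaluated as sketched below. This is a hedged illustration only: parse_config and extract are hypothetical names, the split of the entry into a type and a path at the first colon is an assumption about the configuration format, and the standard-library xml.etree.ElementTree module is used in place of the node tree of the embodiment. ElementTree supports only a subset of XPath and requires well-formed markup, so real web pages would need a tolerant parser.

```python
import xml.etree.ElementTree as ET


def parse_config(config):
    """Split '$type: xpath' into (type, xpath); the format is an assumption."""
    info_type, xpath = config.split(":", 1)
    return info_type.strip().lstrip("$"), xpath.strip()


def extract(page, config):
    """Return the information type and the text of every matching element."""
    info_type, xpath = parse_config(config)
    root = ET.fromstring(page)
    # ElementTree accepts only relative paths, so '//div' becomes './/div'.
    if xpath.startswith("//"):
        xpath = "." + xpath
    return info_type, [el.text for el in root.findall(xpath)]


# Hypothetical, well-formed sample page.
page = (
    "<html><body>"
    '<div class="play_seat"><a>first</a><a>second</a></div>'
    "</body></html>"
)
print(extract(page, '$type: //div[@class="play_seat"]/a[2]'))
```

The predicate [@class="play_seat"] selects the div by attribute value, and a[2] selects its second a child, which is why the sample returns the text of the second link.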
The foregoing describes only the preferred embodiments of the present invention and does not thereby limit the scope of its patent. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.
Claims (10)
1. A web page information extraction method, characterized in that the web page information extraction method comprises the following steps:
when an information extraction request is received, building a node tree according to the HTML elements in the web page;
determining, according to preset configuration information in the information extraction request, the target position in the node tree of the information to be extracted;
extracting the information corresponding to the target position.
2. The web page information extraction method of claim 1, characterized in that, before the building of a node tree according to the HTML elements in the web page when an information extraction request is received, the method further comprises:
generating configuration information according to a preset rule from the type of the information to be extracted and the position of the information to be extracted in the node tree;
generating the information extraction request according to the configuration information.
3. The web page information extraction method of claim 2, characterized in that the determining, according to preset configuration information in the information extraction request, of the target position in the node tree of the information to be extracted comprises:
determining the target position using a tree traversal algorithm, according to the type of the information to be extracted and the position of the information to be extracted in the node tree.
4. The web page information extraction method of any one of claims 1 to 3, characterized in that the building of a node tree according to the HTML elements in the web page when an information extraction request is received comprises:
when an information extraction request is received, parsing the HTML text content;
when the start tag of an HTML element is parsed, setting the currently parsed HTML element as the target node and continuing to parse;
adding the character string obtained by parsing non-HTML-element content under the target node as a child node, and determining whether the start tag of another HTML element has been parsed;
if so, setting the currently parsed HTML element as a child node of the target node, then setting the child node as the target node, continuing to parse, and performing again the step of adding the character string obtained by parsing non-HTML-element content under the target node as a child node and determining whether the start tag of another HTML element has been parsed;
if not, when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, setting the parent node of the target node as the target node, continuing to parse, and performing again the step of adding the character string obtained by parsing non-HTML-element content under the target node as a child node and determining whether the start tag of another HTML element has been parsed; and when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, ending the parsing of the HTML text content and forming the node tree according to the recursive relationships among the nodes.
5. The web page information extraction method of claim 4, characterized in that, before the adding of the character string obtained by parsing non-HTML-element content under the target node as a child node and the determining of whether the start tag of another HTML element has been parsed, the method further comprises:
when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, setting the element attribute and attribute value as a child node of the target node.
6. A web page information extraction device, characterized in that the web page information extraction device comprises:
a modeling module, configured to build a node tree according to the HTML elements in the web page when an information extraction request is received;
a determining module, configured to determine, according to preset configuration information in the information extraction request, the target position in the node tree of the information to be extracted;
an extraction module, configured to extract the information corresponding to the target position.
7. The web page information extraction device of claim 6, characterized in that the web page information extraction device further comprises:
a configuration generation module, configured to generate configuration information according to a preset rule from the type of the information to be extracted and the position of the information to be extracted in the node tree;
a request generation module, configured to generate the information extraction request according to the configuration information.
8. The web page information extraction device of claim 7, characterized in that the determining module is specifically configured to determine the target position using a tree traversal algorithm, according to the type of the information to be extracted and the position of the information to be extracted in the node tree.
9. The web page information extraction device of any one of claims 6 to 8, characterized in that the modeling module comprises:
a parsing unit, configured to parse the HTML text content when an information extraction request is received, and, when the start tag of an HTML element is parsed, to set the currently parsed HTML element as the target node and continue parsing;
a judging unit, configured to add the character string obtained by parsing non-HTML-element content under the target node as a child node, and to determine whether the start tag of another HTML element has been parsed;
a first processing unit, configured to, when the start tag of another HTML element has been parsed, set the currently parsed HTML element as a child node of the target node, then set the child node as the target node and continue parsing, whereupon the judging unit again performs the operation of adding the character string obtained by parsing non-HTML-element content under the target node as a child node and determining whether the start tag of another HTML element has been parsed;
a second processing unit, configured to, when the start tag of another HTML element has not been parsed: when the parsed end tag of the HTML element corresponding to the target node is not the end tag of the first HTML element, set the parent node of the target node as the target node and continue parsing, whereupon the judging unit again performs the operation of adding the character string obtained by parsing non-HTML-element content under the target node as a child node and determining whether the start tag of another HTML element has been parsed; and, when the parsed end tag of the HTML element corresponding to the target node is the end tag of the first HTML element, end the parsing of the HTML text content and form the node tree according to the recursive relationships among the nodes.
10. The web page information extraction device of claim 9, characterized in that the web page information extraction device further comprises:
a child node setting unit, configured to, when an element attribute and attribute value of the HTML element corresponding to the target node are parsed, set the element attribute and attribute value as a child node of the target node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510395013.8A CN106326314B (en) | 2015-07-07 | 2015-07-07 | Webpage information extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510395013.8A CN106326314B (en) | 2015-07-07 | 2015-07-07 | Webpage information extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106326314A (en) | 2017-01-11 |
CN106326314B (en) | 2020-09-29 |
Family
ID=57724841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510395013.8A (Active; CN106326314B (en)) | Webpage information extraction method and device | 2015-07-07 | 2015-07-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106326314B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108874870A (en) * | 2018-04-24 | 2018-11-23 | 北京中科闻歌科技股份有限公司 | A kind of data pick-up method, equipment and computer can storage mediums |
CN110309364A (en) * | 2018-03-02 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of information extraction method and device |
CN116126426A (en) * | 2023-04-10 | 2023-05-16 | 杭州城市大数据运营有限公司 | Automatic component decoupling method and system based on Web service system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7293018B2 (en) * | 2001-03-30 | 2007-11-06 | Kabushiki Kaisha Toshiba | Apparatus, method, and program for retrieving structured documents |
CN102650999A (en) * | 2011-02-28 | 2012-08-29 | 株式会社理光 | Method and system for extracting object attribution value information from webpage |
CN102890692A (en) * | 2011-07-22 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Webpage information extraction method and webpage information extraction system |
CN103246732A (en) * | 2013-05-10 | 2013-08-14 | 合肥工业大学 | Online Web news content extracting method and system |
CN103744987A (en) * | 2014-01-20 | 2014-04-23 | 深圳市佳创视讯技术股份有限公司 | Video website media asset integrating method and system based on DOM tree matching |
CN103853760A (en) * | 2012-12-03 | 2014-06-11 | 中国移动通信集团公司 | Method and device for extracting contents of bodies of web pages |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309364A (en) * | 2018-03-02 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of information extraction method and device |
CN110309364B (en) * | 2018-03-02 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Information extraction method and device |
CN108874870A (en) * | 2018-04-24 | 2018-11-23 | 北京中科闻歌科技股份有限公司 | A kind of data pick-up method, equipment and computer can storage mediums |
CN116126426A (en) * | 2023-04-10 | 2023-05-16 | 杭州城市大数据运营有限公司 | Automatic component decoupling method and system based on Web service system |
CN116126426B (en) * | 2023-04-10 | 2023-08-29 | 杭州城市大数据运营有限公司 | Automatic component decoupling method and system based on Web service system |
Also Published As
Publication number | Publication date |
---|---|
CN106326314B (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729480B (en) | Text information extraction method and device for limited area | |
US8281284B2 (en) | Method and software for editing web documents | |
US20190196811A1 (en) | Api specification generation | |
CN111079043A (en) | Key content positioning method | |
CN107766344B (en) | Template rendering method and device and browser | |
KR20150130476A (en) | Techniques for language translation localization for computer applications | |
CN103345532A (en) | Method and device for extracting webpage information | |
CN104063401A (en) | Webpage style address merging method and device | |
JP6693582B2 (en) | Document abstract generation method, device, electronic device, and computer-readable storage medium | |
CN106294885A (en) | A kind of data collection towards isomery webpage and mask method | |
CN106708885A (en) | Method and device for achieving searching | |
CN106326314A (en) | Web page information extraction method and device | |
US10303747B2 (en) | Method, apparatus and system for controlling address input | |
CN104572787B (en) | The recognition methods of pseudo- original website and device | |
CN105653669B (en) | Hypertext markup language generation method and device | |
CN111158973B (en) | Web application dynamic evolution monitoring method | |
US20150301994A1 (en) | Non-transitory computer readable medium, information processing apparatus, and information processing method | |
KR102095703B1 (en) | An apparatus, method and recording medium for Markup parsing | |
CN113448982A (en) | DDL statement analysis method and device, computer equipment and storage medium | |
JP6114090B2 (en) | Machine translation apparatus, machine translation method and program | |
CN105511642A (en) | Input method and input device | |
CN114003714B (en) | Intelligent knowledge pushing method for document context sensing | |
CN110618809B (en) | Front-end webpage input constraint extraction method and device | |
CN112989042B (en) | Hot topic extraction method and device, computer equipment and storage medium | |
KR100911620B1 (en) | Method and system for sampling web documents information |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | TR01 | Transfer of patent right | Effective date of registration: 20221205. Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518100. Patentee after: Shenzhen Yayue Technology Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District. Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |