CN104346405B - Method and device for extracting information from a webpage - Google Patents
Method and device for extracting information from a webpage
- Publication number
- CN104346405B (CN201310344292.6A)
- Authority
- CN
- China
- Prior art keywords
- label
- node
- region
- original point
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and device for extracting information from a webpage, comprising: for an input webpage, adding each label in a preset label set to each node in the document object model (DOM) tree corresponding to the webpage; obtaining an original score for each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label; attenuating the original score of each label on each node and passing it to the root node of the subtree in which the node is located; determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and taking the sum of the transfer scores of the labels as the score of the region represented by the subtree; and selecting one or more high-scoring regions and outputting the values of the labels in the selected regions. The application can improve the accuracy of extracting specific information from tree-shaped text structures such as webpages.
Description
Technical field
The present invention relates to the Internet field, and in particular to a method and device for extracting information from a webpage.
Background
Extracting structured data from raw data sources is a basic technique, and webpages are the most common raw data source. Extracting structured data from a webpage differs considerably from extracting it from plain-text information. On the one hand, webpages are not written in a standard text grammar, so methods that rely on a standard text grammar do not apply; webpage structures vary widely and contain a great deal of noise. On the other hand, the extraction targets are also diverse: some are individual nodes, some are chains of nodes (such as a navigation bar), and some are block regions (such as a form). At present, structured data is generally extracted from webpages with rule-based methods. The extraction objects are relatively regular forms, name-value pairs (such as an attribute name and an attribute value), and record lists. The extraction methods are rather scattered and independent, or use only information in the near vicinity of the target node.
One existing scheme provides a method for extracting object attribute-value information from a webpage, with the following steps: A) for a given webpage, obtain the DOM (Document Object Model) tree corresponding to the webpage, and compute the relevant information of each DOM node in the DOM tree; B) construct a labeled node graph according to the DOM tree and the relevant information of each DOM node, and compute the score of each labeled node; C) based on the scores of the labeled nodes, select a labeled node tree from the resulting labeled node graph; D) build an attribute-value tree based on the selected labeled node tree.
The shortcomings of the existing scheme are:
It is suitable only for extracting attribute-value pairs that are concentrated in one place;
It does not use different kinds of labels to determine a region and thereby improve extraction precision within that region;
Its use of the features of the node itself is too narrow, emphasizing literal features;
It does not systematically obtain context-dependent results through the transfer of influence.
Summary of the invention
The technical problem to be solved by the application is how to improve the accuracy of extracting specific information from tree-shaped text structures such as webpages.
To solve the above problem, the application provides a method for extracting information from a webpage, comprising:
for an input webpage, adding each label in a preset label set to each node in the document object model tree corresponding to the webpage;
obtaining the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label;
attenuating the original score of each label on each node and passing it to the root node of the subtree in which the node is located;
determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and taking the sum of the transfer scores of the labels as the score of the region represented by the subtree;
selecting one or more high-scoring regions, and outputting the values of the labels in the selected regions.
Optionally, before the step of selecting one or more high-scoring regions, the method further comprises:
for each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region;
summing the position values of the target nodes and dividing the sum by the total number of nodes in the webpage to obtain a mean; for each region, dividing the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; computing the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; and computing a weighted sum of the density and the distance of the region to obtain the structure score of the region;
computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region, and then performing the step of selecting one or more high-scoring regions.
Optionally, the step of obtaining the original score of each label on each node according to the score value that the value of each feature of the node corresponds to in that label comprises:
performing the following operations for each node:
obtaining the value of each feature of the node;
for each label on the node, looking up the score value that the value of each feature corresponds to in the label, multiplying each score value found by the weight of the corresponding feature in the label, summing the products, and taking the sum as the original score of the label on the node.
Optionally, the step of attenuating the original score of each label on each node comprises:
applying linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 - k_1) + k_1 × D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation index with value range (0, 1), and S is the original score;
the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node comprises:
for each label, selecting the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label at the root node.
Optionally, the step of attenuating the original score of each label on each node comprises:
applying exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation index with value range (0, 1), and S is the original score;
the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node comprises:
summing, label by label, the attenuated original scores of the different labels received by the root node of each subtree, and taking each sum as the transfer score of the corresponding label at the root node.
Optionally, the step of selecting one or more high-scoring regions and outputting the values of the labels in the selected regions comprises:
sorting all regions in the document object model tree by score and selecting the top X regions in descending order, taking the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
if a candidate node is an ancestor node of another candidate node, retaining only the candidate node that is the child node;
in the subtree rooted at each candidate node, ranking each label by its original score and selecting the label with the highest original score as the candidate label;
choosing the node where the candidate label is located as the final node;
outputting the value of the candidate label according to the webpage content corresponding to the final node.
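The selection steps above can be sketched roughly as follows. This is only an illustrative sketch, assuming that region scores and original scores have already been computed; the `Node` class, its field names, and `select_final_nodes` are hypothetical names that do not come from the patent, and picking the best node separately for each kind of label is one possible reading of the ranking step.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Hypothetical DOM node holding per-label original scores."""
    name: str
    original: dict = field(default_factory=dict)   # label -> original score
    children: list = field(default_factory=list)
    region_score: float = 0.0                      # score of the subtree rooted here


def subtree(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)


def select_final_nodes(root, x):
    """Return {candidate name: {label: (final node name, original score)}}."""
    # 1. Rank all regions (subtrees) by score and keep the roots of the top X.
    ranked = sorted(subtree(root), key=lambda n: n.region_score, reverse=True)
    candidates = ranked[:x]
    # 2. Drop any candidate that is an ancestor of another candidate.
    kept = [c for c in candidates
            if not any(o is not c and o in list(subtree(c)) for o in candidates)]
    # 3. In each remaining region, keep the best-scoring node for each label.
    result = {}
    for cand in kept:
        best = {}
        for node in subtree(cand):
            for label, score in node.original.items():
                if label not in best or score > best[label][1]:
                    best[label] = (node.name, score)
        result[cand.name] = best
    return result
```

The value of each candidate label would then be read from the webpage content of the node named in the result.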
The application also provides a device for extracting information from a webpage, comprising:
a labeling unit, configured to, for an input webpage, add each label in a preset label set to each node in the document object model tree corresponding to the webpage;
an original score computing unit, configured to obtain the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label;
a transfer unit, configured to attenuate the original score of each label on each node and pass it to the root node of the subtree in which the node is located;
a region score computing unit, configured to determine the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and to take the sum of the transfer scores of the labels as the score of the region represented by the subtree;
an output unit, configured to select one or more high-scoring regions and output the values of the labels in the selected regions.
Optionally, the device further comprises:
a region score correction unit, configured to obtain the score of each region from the region score computing unit; for each region, multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region; sum the position values of the target nodes and divide the sum by the total number of nodes in the webpage to obtain a mean; for each region, divide the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; compute the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; compute a weighted sum of the density and the distance of the region to obtain the structure score of the region; compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then send the final score of each region to the output unit.
Optionally, the original score computing unit obtaining the original score of each label on each node according to the score value that the value of each feature of the node corresponds to in that label means that:
the original score computing unit performs the following operations for each node: obtaining the value of each feature of the node; and, for each label on the node, looking up the score value that the value of each feature corresponds to in the label, multiplying each score value found by the weight of the corresponding feature in the label, summing the products, and taking the sum as the original score of the label on the node.
Optionally, the transfer unit attenuating the original score of each label on each node means that:
the transfer unit applies linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 - k_1) + k_1 × D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation index with value range (0, 1), and S is the original score;
the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means that:
for each label, the region score computing unit selects the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label at the root node.
Optionally, the transfer unit attenuating the original score of each label on each node means that:
the transfer unit applies exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation index with value range (0, 1), and S is the original score;
the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means that:
the region score computing unit sums, label by label, the attenuated original scores of the different labels received by the root node of each subtree, and takes each sum as the transfer score of the corresponding label at the root node.
Optionally, the output unit comprises:
a region sorting module, configured to sort all regions in the document object model tree by score and select the top X regions in descending order, taking the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
a screening module, configured to, when a candidate node is an ancestor node of another candidate node, retain only the candidate node that is the child node;
a label sorting module, configured to, in the subtree rooted at each candidate node, rank each label by its original score and select the label with the highest original score as the candidate label;
a selecting module, configured to choose the node where the candidate label is located as the final node;
an output module, configured to output the value of the candidate label according to the webpage content corresponding to the final node.
At least one embodiment of the application uses multiple features of a label together to determine its original score, and reflects the influence of context on the result through the transfer of that influence, so that accurate results can be obtained. It is suitable not only for extracting concentrated attribute-value pairs but also for extracting relatively scattered labels, and it can also be used to extract list items. In one optimized scheme of the application, multiple kinds of labels jointly select a region, and more accurate results are then selected from within the region. Another optimized scheme of the application introduces the co-occurrence score and the structure score of a region to correct the region score, yielding more accurate results. Of course, a product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
Description of the drawings
Fig. 1 is a flowchart of the method for extracting information from a webpage in Embodiment 1;
Fig. 2 is a schematic diagram of an extended extraction tree in Embodiment 1;
Fig. 3 is a schematic diagram of a node with labels in Embodiment 1.
Detailed description
The technical solution of the application is described in detail below with reference to the accompanying drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the application and the features in the embodiments may be combined with one another, all within the protection scope of the application. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one shown here.
Embodiment 1: a method for extracting information from a webpage, as shown in Fig. 1, comprising steps S101 to S105.
S101: for an input webpage, add each label in a preset label set to each node in the DOM tree corresponding to the webpage.
S102: obtain the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label.
S103: attenuate the original score of each label on each node and pass it to the root node of the subtree in which the node is located.
S104: determine the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and take the sum of the transfer scores of the labels as the score of the region represented by the subtree.
S105: select one or more high-scoring regions, and output the values of the labels in the selected regions.
In this embodiment, the scores of the labels in the nodes are passed to the root node of a subtree and, after being summed, serve as the score of the region corresponding to the subtree, which is used to select regions. A region in which the label values are scattered may therefore also obtain a high score because it contains many nodes. This embodiment can thus select not only regions where the label values are concentrated, but also regions where the label values are spread over multiple nodes.
In this embodiment, after a single webpage is input, step S101 may treat each subtree of the DOM tree as a region and add labels to each node (Node), thereby building an extended extraction tree. A label (Label) is a piece of marker information that carries the meaning of an extraction target, such as the product price or product title on a product information page. A node may have one or more labels.
An example of an extended extraction tree is shown in Fig. 2, whose root node is node 200. The subtree rooted at node 210 is region 11 and contains root node 210 and child node 211. The subtree rooted at node 220 is region 12 and contains root node 220, child node 221, child node 222, and child node 2221. The subtree rooted at node 222 can also serve as a region, containing root node 222 and child node 2221. Each region can be regarded as one block of the webpage. Each node, as shown in Fig. 3, includes one or more labels 31-36.
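As a rough illustration of how every subtree doubles as a region, the tree described for Fig. 2 can be written as a table of child lists. This is a sketch under assumptions: the `CHILDREN` table and the helper names are hypothetical, node 230 is included because the description names it later as a sibling of nodes 210 and 220, and depth is counted from 0 at the root as the description defines it for attenuation.

```python
# Child lists for the tree described for Fig. 2 (hypothetical encoding).
CHILDREN = {
    200: [210, 220, 230],
    210: [211],
    220: [221, 222],
    222: [2221],
}


def region(root):
    """Return every node in the subtree rooted at `root`, root first."""
    nodes = [root]
    for child in CHILDREN.get(root, []):
        nodes.extend(region(child))
    return nodes


def depths(root, depth=0, out=None):
    """Map each node to its depth, with the root at depth 0."""
    out = {} if out is None else out
    out[root] = depth
    for child in CHILDREN.get(root, []):
        depths(child, depth + 1, out)
    return out
```

With this encoding, `region(220)` lists the four nodes of region 12, and `region(222)` lists the smaller region nested inside it.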
In one implementation of this embodiment, the webpage is a product information page, and the preset label set may include, but is not limited to, labels for nesting or displaying any one or more of the following pieces of information: title (Title), product price (Price), product picture (Image), brand (Brand), and the attribute-value pairs corresponding to the attributes of the product (AttrPairs).
Because the score of a region is the sum of the transfer scores of its labels, a region of the webpage that contains multiple labels is more likely to be selected than a region with only a single isolated label. The selection result of this embodiment therefore includes not only "regions with a single very high-scoring label" but possibly also "regions with multiple labels", so regions with several kinds of labels are not overlooked during extraction, which improves extraction precision.
In one implementation of this embodiment, the predetermined features of a node may include, but are not limited to, any one or more of the following:
Node type (Type): the HTML tag type in the webpage;
Literal feature (Text): the visible characters in the webpage;
Attribute feature (Attribute): the list of attribute values of the HTML tag in the webpage;
Structure feature (Structure): the local structure formed by the node and its surrounding related nodes, or a specific text structure;
Visual feature (Vision): for example the font and color of the characters and their position in the overall page layout;
Other feature (Other): other user-defined features, such as event features.
Increasing the variety of predetermined features avoids the low extraction precision that results from relying on literal features alone.
The score value that a value of a predetermined feature of a node corresponds to in a label can represent the degree of correlation between that feature value and the label. The score value can be preset by the system; alternatively, the degree of correlation between the feature value and the label can be determined statistically and used as the score value.
The same value of the same feature may correspond to different score values in different labels, although identical score values are not excluded. For example, if the value of a node's literal feature is "low price", which is highly correlated with the price label, its score value in the "price" label may be 0.8; its correlation with the "picture" label is low, so its score value in the "picture" label may be -0.5.
Different values of the same feature may correspond to different score values in one label, although identical score values are not excluded. If none of the values of the predetermined features of a node has a corresponding score value in a label, that label is deleted from the node.
In one implementation of this embodiment, step S102 may specifically include:
performing the following operations for each node:
obtaining the value of each predetermined feature of the node;
for each label on the node, looking up the score value that the value of each feature corresponds to in the label, multiplying each score value found by the weight of the corresponding feature in the label, summing the products, and taking the sum as the original score of the label on the node.
The weight of a feature may differ between labels, although identical weights are not excluded. Suppose there are two labels, price and brand, and two predetermined features, node type and literal feature. In the price label, node type A corresponds to score value 5 and literal feature b corresponds to score value 9; in the brand label, node type A corresponds to score value -7 and literal feature b corresponds to score value -1. If the node type of a node is A and its literal feature is b, and in the price label the node type weight is R1 and the literal feature weight is R2, then the original score of the price label on that node is 5 × R1 + 9 × R2; if in the brand label the node type weight is R3 and the literal feature weight is R4, then the original score of the brand label is (-7) × R3 + (-1) × R4.
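The weighted lookup in this example can be sketched in a few lines. This is a minimal sketch, not the patent's implementation: the function name `original_score`, the dictionary layouts, and the concrete weights (0.5 and 1.0 standing in for R1 to R4) are assumptions made for illustration.

```python
def original_score(features, score_table, weights):
    """Weighted sum of the score values matched by a node's feature values.

    features:    {feature name: value on this node}
    score_table: {(feature name, value): score value} for one label
    weights:     {feature name: weight} for the same label
    Returns None when no feature value has a score value in the label,
    in which case the label would be deleted from the node.
    """
    matched = [(name, score_table[(name, value)])
               for name, value in features.items()
               if (name, value) in score_table]
    if not matched:
        return None
    return sum(weights[name] * score for name, score in matched)


# The node from the example: node type A, literal feature b.
node = {"type": "A", "text": "b"}
price_table = {("type", "A"): 5, ("text", "b"): 9}
brand_table = {("type", "A"): -7, ("text", "b"): -1}
# Hypothetical weights: R1 = R3 = 0.5, R2 = R4 = 1.0.
label_weights = {"type": 0.5, "text": 1.0}

price_score = original_score(node, price_table, label_weights)   # 5*R1 + 9*R2
brand_score = original_score(node, brand_table, label_weights)   # (-7)*R3 + (-1)*R4
```

Returning `None` for an unmatched label mirrors the rule that a label with no corresponding score value is removed from the node.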
In one alternative of this implementation, the weights of the literal feature and the attribute feature in each label are set higher than those of the other features.
In one alternative of this implementation, for the sake of generality and to obtain a result set with maximum recall, the correspondence between score values and feature values in a label uses rules that are as relaxed as possible. For example, values of the literal feature are chosen to be short characters or words and text with a larger semantic scope, such as associating the single-character value for "price" with a score value rather than the longer words "price" or "market price". To improve matching efficiency, the feature values that correspond to score values should use regular expressions as little as possible and use several text fragments instead, for example associating "ori" and "price" with score values rather than "originalprice".
In one alternative of this implementation, the score values in each label may use coarse discrete values, to avoid arbitrary and fragmented score settings. For example, the score values may be unified into the following six kinds:
strong-reward: strong reward score, such as 0.8;
strong-punish: strong penalty score, such as -0.8;
moderate-reward: moderate reward score, such as 0.5;
moderate-punish: moderate penalty score, such as -0.5;
weak-reward: weak reward score, such as 0.2;
weak-punish: weak penalty score, such as -0.2.
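The six levels above could be kept in a small lookup table. The sketch below uses the example values from the list; the helper `snap_to_level`, which maps a statistically derived correlation to the nearest preset level, is a hypothetical convenience and is not described in the patent.

```python
# Coarse discrete score values, using the example numbers from the list above.
SCORE_LEVELS = {
    "strong-reward": 0.8,
    "strong-punish": -0.8,
    "moderate-reward": 0.5,
    "moderate-punish": -0.5,
    "weak-reward": 0.2,
    "weak-punish": -0.2,
}


def snap_to_level(value):
    """Return the name of the preset level closest to a raw value."""
    return min(SCORE_LEVELS, key=lambda name: abs(SCORE_LEVELS[name] - value))
```

Keeping only a handful of levels makes score tables easier to review than free-form decimals.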
In step S103, the basic rule for attenuating the original score of each label on each node may be: the closer to the sender, the greater the influence. The influence therefore attenuates from the bottom up according to tree depth, and the attenuation function can be chosen according to the actual situation.
In one implementation of this embodiment, the step of attenuating the original score of each label on each node in step S103 may specifically include:
applying linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 - k_1) + k_1 × D_d / D_s);
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation index with value range (0, 1), and S is the original score. The depth of the root node is 0, and the depth of any other node is the depth of its parent node plus 1, and so on. For example, when an attenuated original score is passed from node 2221 in Fig. 2 to node 220, D_d is the depth of node 220, which is 1, and D_s is the depth of node 2221, which is 3.
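The linear attenuation can be checked numerically with a short function. This sketch reads the flattened formula as k_1 multiplied by D_d / D_s; the function name, the argument checks, and the sample values S = 0.8 and k_1 = 0.5 are assumptions made for illustration.

```python
def linear_attenuation(score, dest_depth, source_depth, k1):
    """S_L = S * ((1 - k1) + k1 * D_d / D_s), with 0 < k1 < 1."""
    if not 0 < k1 < 1:
        raise ValueError("k1 must lie in the open interval (0, 1)")
    if source_depth <= 0 or not 0 <= dest_depth <= source_depth:
        raise ValueError("expected 0 <= dest_depth <= source_depth and source_depth > 0")
    return score * ((1 - k1) + k1 * dest_depth / source_depth)


# Transfer from node 2221 (depth 3) to node 220 (depth 1), as in the example.
attenuated = linear_attenuation(0.8, dest_depth=1, source_depth=3, k1=0.5)
```

Under this reading the factor shrinks from 1 (destination at the same depth as the source) toward 1 - k_1 (destination at the root), so scores passed further up the tree count for less.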
In this alternative, the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node may specifically include:
for each label, selecting the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label at the root node.
This implementation is called maximum transfer: the transfer score of a label at an ancestor node is the maximum of all attenuated original scores of that label passed to the ancestor node, and the original score of each label is attenuated linearly as it is passed upward. The transfer score of a label at the ancestor node is S' = max(S_L0, S_L1, ..., S_L(n-1)), where S_Li (0 ≤ i ≤ n-1, with n being the number of attenuated original scores of the label passed to the ancestor node) is the original score of the label passed up from a child node after linear attenuation.
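Maximum transfer, followed by the region score from step S104, can be sketched as below. The function names and the input layout (a dictionary from label to the list of attenuated scores a subtree root has received) are hypothetical.

```python
def max_transfer(attenuated):
    """Keep, for each label, the largest attenuated original score received.

    attenuated: {label: [attenuated original scores received by the root]}
    """
    return {label: max(scores) for label, scores in attenuated.items() if scores}


def region_score(transfer):
    """Score of a region: the sum of the transfer scores of its labels."""
    return sum(transfer.values())


received = {"price": [0.25, 0.5], "title": [0.25]}
transfer = max_transfer(received)
```

Because only the best instance of each label counts, a region gains more from containing a new kind of label than from repeating one it already has.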
In one implementation of this embodiment, the step of attenuating the original score of each label on each node in step S103 may specifically include:
applying exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation index with value range (0, 1), and S is the original score. The depth of the root node is 0, and the depth of any other node is the depth of its parent node plus 1, and so on.
In this alternative, the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node may specifically include:
summing, label by label, the attenuated original scores of the different labels received by the root node of each subtree, and taking each sum as the transfer score of the corresponding label at the root node.
This implementation is called accumulative transfer: the score of a label at an ancestor node is the sum of all attenuated original scores of that label passed to the ancestor node. When the original score of each label is passed upward with exponential attenuation, the transfer score of a label at the ancestor node is S' = sum(S_Q0, S_Q1, ..., S_Q(n-1)), where S_Qi (0 ≤ i ≤ n-1, with n being the number of attenuated original scores of the label passed to the ancestor node) is the original score of the label passed up from a child node after exponential attenuation.
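A bottom-up pass with accumulative transfer might look like the following. This is a sketch under stated assumptions: the attenuation function is passed in as a parameter rather than fixed, the function name `accumulate` and the dictionary layouts are hypothetical, and counting a node's own score unattenuated in the region rooted at that node is an assumption the description does not state.

```python
def accumulate(children, original, root, attenuate):
    """Return {subtree root: {label: transfer score}} using summation.

    children:  {node: [child nodes]}
    original:  {node: {label: original score}}
    attenuate: callable (score, dest_depth, source_depth) -> attenuated score
    """
    transfer = {}

    def visit(node, depth, ancestors):
        transfer.setdefault(node, {})
        for label, score in original.get(node, {}).items():
            # Assumption: a node's own score counts unattenuated in its own region.
            transfer[node][label] = transfer[node].get(label, 0.0) + score
            for anc, anc_depth in ancestors:
                sent = attenuate(score, anc_depth, depth)
                transfer[anc][label] = transfer[anc].get(label, 0.0) + sent
        for child in children.get(node, []):
            visit(child, depth + 1, ancestors + [(node, depth)])

    visit(root, 0, [])
    return transfer


# Placeholder attenuation that halves every score, used only to show the data flow.
def halve(score, dest_depth, source_depth):
    return score * 0.5


tree = {200: [220], 220: [222], 222: [2221]}
scores = {2221: {"price": 1.0}, 222: {"price": 0.5}}
result = accumulate(tree, scores, 200, halve)
```

With summation, a region holding many weakly scored instances of a label can outrank a region holding one strong instance, which suits list-like regions such as attribute tables.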
In one implementation of this embodiment, the following may also be performed before step S105:
for each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region;
summing the position values of the target nodes and dividing the sum by the total number of nodes in the webpage to obtain a mean; for each region, dividing the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; computing the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; and computing a weighted sum of the density and the distance of the region to obtain the structure score of the region;
computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region, and then performing the step of selecting one or more high-scoring regions.
In this implementation, the weights used when computing the structure score and the final score can be set as needed.
Co-occurrence point is for evaluating and testing whether target labels appear at index in the region, more, the co-occurrence point of appearance
It is higher.The co-occurrence in one region divides CoocScore=Z × Nfound/Ntarget;Z is the score in the region in step S104;Nfound
For target labels number, N present in the regiontargetFor the sum of target labels.Target labels can be set in advance as needed,
Such as when to extract the information in relation to price, price tag is arranged to target labels;It can in addition contain according to different labels
The self-defined deduction situation for lacking certain label of significance level.
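As a minimal sketch of the CoocScore formula above, the following function counts how many target labels occur in a region and scales the region score accordingly. The function name, the label names, and the sample numbers are assumptions made for illustration only.

```python
def cooccurrence_score(region_score, labels_in_region, target_labels):
    """CoocScore = Z * N_found / N_target.

    region_score: Z, the score of the region from the earlier scoring step.
    labels_in_region: labels that appear somewhere in the region.
    target_labels: the preset target labels.
    """
    targets = set(target_labels)
    if not targets:
        raise ValueError("at least one target label is required")
    n_found = len(targets & set(labels_in_region))
    return region_score * n_found / len(targets)


# Hypothetical region: 2 of the 4 target labels are present.
score = cooccurrence_score(
    8.0,
    ["price", "title", "breadcrumb"],
    ["price", "title", "main_image", "seller"],
)
```

Labels that are not target labels (here "breadcrumb") do not affect the ratio. A per-label deduction for missing labels, which the text mentions as an option, is not modeled here.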
In general, there are two major classes of region. One class contains multiple kinds of labels, each with one final node; for example, a key region contains the price, title, and main image. The other class contains a single label but multiple nodes; for example, an attribute region contains only multiple nodes bearing the attribute-value-pair label. For a region containing multiple kinds of labels, the largest region reaches an extremum: the original score of a label attenuates as child nodes pass it upward, but an increase in the number of label kinds raises the score of the region. Therefore, when the region contains exactly all of the target labels, the score of the region reaches an extreme point, and that region is the region of the information to be extracted.
When the structure score is computed, each node has a pre-assigned position value. For example, if a webpage has 1000 nodes in total, a position value is assigned to each node in turn, and the density and distance are both computed from these values. In the DOM tree of Fig. 2, for instance, the position value of root node 200 is 1, the position value of nodes 210, 220, and 230 is 2, the position value of nodes 211, 221, and 222 is 3, and the position value of node 2221 is 4. The target nodes can be preset as needed.
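The density, distance, structure score, and final score defined earlier in this embodiment can be sketched as follows. This is a literal reading of the text, not a reference implementation: the weights, the choice of target nodes, and the sample region are hypothetical, and the mean is computed exactly as described (sum of target-node position values divided by the total node count of the page).

```python
def structure_score(region_positions, region_root_position, target_positions,
                    total_nodes, dom_root_position=1,
                    w_density=0.5, w_distance=0.5):
    """Weighted sum of a region's density and distance.

    region_positions: position values of all nodes in the region.
    region_root_position: position value of the region's subtree root.
    target_positions: position values of the preset target nodes.
    total_nodes: total number of nodes in the webpage.
    The weights are placeholders; the text leaves them configurable.
    """
    mean = sum(target_positions) / total_nodes
    density = sum(abs(p - mean) for p in region_positions) / len(region_positions)
    distance = abs(region_root_position - dom_root_position)
    return w_density * density + w_distance * distance


def final_score(cooc_score, struct_score, w_cooc=0.5, w_struct=0.5):
    """Weighted sum of co-occurrence score and structure score (weights assumed)."""
    return w_cooc * cooc_score + w_struct * struct_score


# Position values taken from the Fig. 2 example: 8 nodes in total.
# Assumed target nodes have position values 3 and 4; the assumed region
# holds nodes with position values 2, 3, 3, 4 and is rooted at value 2.
s = structure_score([2, 3, 3, 4], 2, [3, 4], total_nodes=8)
f = final_score(4.0, s)
```

With these inputs the mean is 0.875, the density is 2.125, and the distance is 1.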
In this implementation, a deduction can also be applied to the final score when isolated points exist. An isolated point is a point in the region whose position value differs from the above mean by more than a predetermined threshold; a region may or may not contain isolated points.
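A minimal sketch of isolated-point detection follows. The text does not say how large the deduction is, so the fixed per-point penalty used here is purely an assumption, as are the function names and sample values.

```python
def isolated_points(region_positions, mean, threshold):
    """Position values in the region that differ from the mean by more than threshold."""
    return [p for p in region_positions if abs(p - mean) > threshold]


def apply_isolation_penalty(final_score, region_positions, mean, threshold,
                            penalty_per_point):
    """Deduct a fixed amount per isolated point (hypothetical penalty rule)."""
    n_isolated = len(isolated_points(region_positions, mean, threshold))
    return final_score - penalty_per_point * n_isolated


# Hypothetical region in which one node lies far from the mean.
outliers = isolated_points([2, 3, 3, 9], mean=3.0, threshold=2.0)
penalized = apply_isolation_penalty(5.0, [2, 3, 3, 9], mean=3.0,
                                    threshold=2.0, penalty_per_point=0.5)
```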
In one implementation of this embodiment, step S106 may specifically include:
Sort all regions in the document object model tree by score and select the top X regions in descending order, taking the root node of the subtree corresponding to each selected region as a candidate node; X is a preset positive integer;
If a candidate node is an ancestor of other candidate nodes, keep only the candidate nodes that are the child nodes;
In the subtree rooted at each candidate node, sort the labels by their original scores and select the label with the highest original score as the candidate label; the candidate labels selected in the subtrees rooted at different candidate nodes are different;
Take the node where the candidate label is located as the final node; in other implementations, the final node may also be selected from the candidate node itself or from its child nodes, according to the requirements of the different labels;
Output the value of the candidate label according to the webpage content corresponding to the final node.
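The selection steps above can be sketched as follows, assuming each region is identified by the id of its subtree root and that a parent map is available. The node ids, scores, and function names are hypothetical, and the constraint that different candidate nodes yield different candidate labels is not modeled.

```python
def select_candidates(region_scores, parent, x):
    """Pick the top-x regions, then drop candidates that are ancestors of other candidates.

    region_scores: subtree-root node id -> region score.
    parent: node id -> parent node id (the DOM root maps to None).
    """
    top = sorted(region_scores, key=region_scores.get, reverse=True)[:x]
    ancestors_of_candidates = set()
    for node in top:
        p = parent.get(node)
        while p is not None:
            ancestors_of_candidates.add(p)
            p = parent.get(p)
    return [n for n in top if n not in ancestors_of_candidates]


def pick_candidate_label(original_scores):
    """Return the (label, node id) pair with the highest original score in one subtree.

    original_scores: (label, node id) -> original score.
    """
    return max(original_scores, key=original_scores.get)


# Hypothetical tree: 200 is the root; 222 is a child of 220.
parent = {"200": None, "210": "200", "220": "200", "222": "220"}
region_scores = {"200": 3.0, "210": 5.0, "220": 6.0, "222": 7.0}
candidates = select_candidates(region_scores, parent, x=3)
label, final_node = pick_candidate_label(
    {("price", "2221"): 0.9, ("title", "2221"): 0.4}
)
```

Here node 220 is among the top three but is dropped because its descendant 222 is also a candidate.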
For example, for the label "price" in a candidate node, the value "20" of the label is obtained from the webpage content corresponding to the candidate node and output. The output value may need to be normalized, or enriched with the values of the preceding and following nodes. Text normalization of the output value means normalizing it according to pre-specified rules (such as removing spaces, blacklisted keywords, and certain specified symbols). The value can also be enriched according to the label type and the values of the nodes it depends on; for example, for a price label whose selected value is 10, where the preceding node is a currency symbol and the following node is a unit, the values can be combined.
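Normalization and enrichment of an output value might look like the sketch below. The rule set (blacklist words, symbols to strip) and the simple concatenation used for enrichment are assumptions chosen for illustration; the text only names the kinds of rules involved.

```python
def normalize_value(text, blacklist=(), strip_symbols=""):
    """Remove blacklisted keywords, specified symbols, and all whitespace."""
    for word in blacklist:
        text = text.replace(word, "")
    # Delete every character listed in strip_symbols.
    text = text.translate({ord(c): None for c in strip_symbols})
    # Splitting on whitespace and re-joining removes all spaces.
    return "".join(text.split())


def enrich_price(value, previous_text="", next_text=""):
    """Combine a price value with an adjacent currency symbol and unit."""
    return f"{previous_text}{value}{next_text}"


# Hypothetical raw text and neighbouring node contents.
clean = normalize_value(" Sale: 1,0 * ", blacklist=("Sale:",), strip_symbols="*,")
combined = enrich_price(clean, previous_text="$", next_text="/kg")
```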
Embodiment two: a device for extracting information from a webpage, including:
A labeling unit, configured to, for an input webpage, add each label in a preset label set to each node in the document object model tree corresponding to the webpage;
An original score calculation unit, configured to obtain the original score of each label on each node according to the score values that the values of the predetermined features of each node correspond to in each label;
A transmission unit, configured to attenuate the original score of each label on each node and pass it to the root node of the subtree where the node is located;
A region score calculation unit, configured to determine the transmission score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and to take the sum of the transmission scores of the labels as the score of the region represented by the subtree;
An output unit, configured to select one or more high-scoring regions and to output the values of the labels in the selected regions.
In one implementation of this embodiment, the device may also include:
A region score correction unit, configured to obtain the score of each region from the region score calculation unit; for each region, multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, obtaining the co-occurrence score of the region; sum the position values of the target nodes and divide by the total number of nodes in the webpage to obtain a mean; for each region, divide the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, obtaining the density of the region; compute the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, obtaining the distance of the region; compute a weighted sum of the density and the distance of the region, obtaining the structure score of the region; compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then send the final score of each region to the output unit.
In one implementation of this embodiment, the original score calculation unit obtaining the original score of each label on each node, according to the score values that the values of the features of each node correspond to in each label, may mean:
The original score calculation unit performs the following operations for each node: obtain the value of each feature of the node; for each label on the node, look up the score value that the value of each feature corresponds to in the label; multiply each score value found by the weight of the corresponding feature in the label and add the products together, taking the sum as the original score of the label on the node.
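The lookup-and-weight procedure for one label on one node can be sketched as follows. The feature names, score table, and weights are invented for illustration, and treating an unknown feature value as contributing 0 is an assumption the text does not address.

```python
def original_score(feature_values, score_table, weights):
    """Original score of one label on one node.

    feature_values: feature name -> value observed on the node.
    score_table: feature name -> {feature value -> score value} for this label.
    weights: feature name -> weight of that feature for this label.
    Unknown features or values contribute 0 (assumed behaviour).
    """
    total = 0.0
    for feature, value in feature_values.items():
        score = score_table.get(feature, {}).get(value, 0.0)
        total += score * weights.get(feature, 0.0)
    return total


# Hypothetical features of a node and the "price" label's table and weights.
node_features = {"tag": "span", "has_currency_symbol": True}
price_table = {
    "tag": {"span": 1.0, "div": 0.5},
    "has_currency_symbol": {True: 4.0, False: 0.0},
}
price_weights = {"tag": 0.5, "has_currency_symbol": 0.25}
price_score = original_score(node_features, price_table, price_weights)
```

The same node would be scored once per label in the preset label set, each label using its own table and weights.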
In one implementation of this embodiment, the transmission unit attenuating the original score of each label on each node may mean:
The transmission unit applies linear attenuation to the original score of the label, the attenuation result S_L being:
S_L = S × ((1 - k_1) + k_1 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_1 is the linear transmission attenuation index with value range (0, 1), and S is the original score;
Correspondingly, the region score calculation unit determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, means:
Among the attenuated original scores of each label received by the root node of each subtree, the region score calculation unit selects for each label the largest attenuated original score as the transmission score of that label at the root node.
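A minimal sketch of linear attenuation followed by per-label maximum selection is given below. It reads the term k_1 D_d / D_s as the product k_1 × D_d / D_s and assumes depths start at 1 so that D_s is never zero; both are assumptions, as are the function names and sample values.

```python
def linear_attenuation(s, k1, d_dest, d_src):
    """S_L = S * ((1 - k1) + k1 * D_d / D_s), with k1 in (0, 1)."""
    if not 0 < k1 < 1:
        raise ValueError("k1 must lie in the open interval (0, 1)")
    return s * ((1 - k1) + k1 * d_dest / d_src)


def max_transmission(received):
    """For each label keep the largest attenuated original score received."""
    best = {}
    for label, score in received:
        if label not in best or score > best[label]:
            best[label] = score
    return best


# A source at depth 4 passing to an ancestor at depth 2 loses more score
# than a source already at depth 2.
far = linear_attenuation(1.0, k1=0.5, d_dest=2, d_src=4)
near = linear_attenuation(1.0, k1=0.5, d_dest=2, d_src=2)
transmission = max_transmission([("price", far), ("price", near), ("title", 0.5)])
```

Because D_d / D_s approaches 1 as the source gets closer to the destination, the factor approaches 1 and the attenuation shrinks, matching the rule that nodes nearer the subtree root are attenuated less.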
In one implementation of this embodiment, the transmission unit attenuating the original score of each label on each node may mean:
The transmission unit applies exponential attenuation to the original score of the label, the attenuation result S_Q being:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_2 is the exponential transmission attenuation index with value range (0, 1), and S is the original score;
Correspondingly, the region score calculation unit determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, means:
The region score calculation unit adds up, separately for each label, the attenuated original scores received by the root node of each subtree, and takes the result as the transmission score of the corresponding label at the root node.
In one implementation of this embodiment, the output unit may specifically include:
A region sorting module, configured to sort all regions in the document object model tree by score and select the top X regions in descending order, taking the root node of the subtree corresponding to each selected region as a candidate node; X is a preset positive integer;
A screening module, configured to keep only the candidate nodes that are the child nodes when a candidate node is an ancestor of other candidate nodes;
A label sorting module, configured to sort the labels by their original scores in the subtree rooted at each candidate node, and to select the label with the highest original score as the candidate label;
A selecting module, configured to take the node where the candidate label is located as the final node;
An output module, configured to output the value of the candidate label according to the webpage content corresponding to the final node.
A person of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The application is not limited to any particular combination of hardware and software.
Of course, the application may have various other embodiments. Without departing from the spirit and essence of the application, those skilled in the art can make corresponding changes and variations according to the application, but all such changes and variations shall fall within the protection scope of the claims of the application.
Claims (12)
1. A method for extracting information from a webpage, comprising:
for an input webpage, adding each label in a preset label set to each node in the document object model tree corresponding to the webpage;
obtaining the original score of each label on each node according to the score values that the values of the predetermined features of each node correspond to in each label;
attenuating the original score of each label on each node and passing it to the root node of the subtree where the node is located; wherein the rule for attenuating the original score of each label on each node is: the closer the node is to the root node of the subtree where it is located, the smaller the attenuation of the original score;
determining the transmission score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and taking the sum of the transmission scores of the labels as the score of the region represented by the subtree;
selecting one or more high-scoring regions, and outputting the values of the labels in the selected regions.
2. The method according to claim 1, characterized in that, before the step of selecting one or more high-scoring regions, the method further comprises:
for each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, obtaining the co-occurrence score of the region;
summing the position values of the target nodes and dividing by the total number of nodes in the webpage to obtain a mean; for each region, dividing the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, obtaining the density of the region; computing the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, obtaining the distance of the region; computing a weighted sum of the density and the distance of the region, obtaining the structure score of the region;
computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then performing the step of selecting one or more high-scoring regions.
3. The method according to claim 1, characterized in that the step of obtaining the original score of each label on each node, according to the score values that the values of the features of each node correspond to in each label, comprises:
performing the following operations for each node:
obtaining the value of each feature of the node;
for each label on the node, looking up the score value that the value of each feature corresponds to in the label, multiplying each score value found by the weight of the corresponding feature in the label, adding the products together, and taking the sum as the original score of the label on the node.
4. The method according to claim 1, characterized in that the step of attenuating the original score of each label on each node comprises:
applying linear attenuation to the original score of the label, the attenuation result S_L being:
S_L = S × ((1 - k_1) + k_1 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_1 is the linear transmission attenuation index with value range (0, 1), and S is the original score;
and the step of determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, comprises:
among the attenuated original scores of each label received by the root node of each subtree, selecting for each label the largest attenuated original score as the transmission score of that label at the root node.
5. The method according to claim 1, characterized in that the step of attenuating the original score of each label on each node comprises:
applying exponential attenuation to the original score of the label, the attenuation result S_Q being:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_2 is the exponential transmission attenuation index with value range (0, 1), and S is the original score;
and the step of determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, comprises:
adding up, separately for each label, the attenuated original scores received by the root node of each subtree, and taking the result as the transmission score of the corresponding label at the root node.
6. The method according to claim 1, characterized in that the step of selecting one or more high-scoring regions and outputting the values of the labels in the selected regions comprises:
sorting all regions in the document object model tree by score and selecting the top X regions in descending order, taking the root node of the subtree corresponding to each selected region as a candidate node; X is a preset positive integer;
if a candidate node is an ancestor of other candidate nodes, keeping only the candidate nodes that are the child nodes;
in the subtree rooted at each candidate node, sorting the labels by their original scores and selecting the label with the highest original score as the candidate label;
taking the node where the candidate label is located as the final node;
outputting the value of the candidate label according to the webpage content corresponding to the final node.
7. A device for extracting information from a webpage, characterized by comprising:
a labeling unit, configured to, for an input webpage, add each label in a preset label set to each node in the document object model tree corresponding to the webpage;
an original score calculation unit, configured to obtain the original score of each label on each node according to the score values that the values of the predetermined features of each node correspond to in each label;
a transmission unit, configured to attenuate the original score of each label on each node and pass it to the root node of the subtree where the node is located; wherein the rule for attenuating the original score of each label on each node is: the closer the node is to the root node of the subtree where it is located, the smaller the attenuation of the original score;
a region score calculation unit, configured to determine the transmission score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and to take the sum of the transmission scores of the labels as the score of the region represented by the subtree;
an output unit, configured to select one or more high-scoring regions and to output the values of the labels in the selected regions.
8. The device according to claim 7, characterized by further comprising:
a region score correction unit, configured to obtain the score of each region from the region score calculation unit; for each region, multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, obtaining the co-occurrence score of the region; sum the position values of the target nodes and divide by the total number of nodes in the webpage to obtain a mean; for each region, divide the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, obtaining the density of the region; compute the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, obtaining the distance of the region; compute a weighted sum of the density and the distance of the region, obtaining the structure score of the region; compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then send the final score of each region to the output unit.
9. The device according to claim 7, characterized in that the original score calculation unit obtaining the original score of each label on each node, according to the score values that the values of the features of each node correspond to in each label, means:
the original score calculation unit performs the following operations for each node: obtaining the value of each feature of the node; for each label on the node, looking up the score value that the value of each feature corresponds to in the label, multiplying each score value found by the weight of the corresponding feature in the label, adding the products together, and taking the sum as the original score of the label on the node.
10. The device according to claim 7, characterized in that the transmission unit attenuating the original score of each label on each node means:
the transmission unit applies linear attenuation to the original score of the label, the attenuation result S_L being:
S_L = S × ((1 - k_1) + k_1 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_1 is the linear transmission attenuation index with value range (0, 1), and S is the original score;
and the region score calculation unit determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, means:
among the attenuated original scores of each label received by the root node of each subtree, the region score calculation unit selects for each label the largest attenuated original score as the transmission score of that label at the root node.
11. The device according to claim 7, characterized in that the transmission unit attenuating the original score of each label on each node means:
the transmission unit applies exponential attenuation to the original score of the label, the attenuation result S_Q being:
S_Q = S × ((1 - k_2) + k_2 D_d / D_s)
where D_d is the depth of the destination node of the transmission, D_s is the depth of the source node of the transmission, k_2 is the exponential transmission attenuation index with value range (0, 1), and S is the original score;
and the region score calculation unit determining the transmission score of each label on the root node of each subtree, according to the attenuated original scores received by that root node, means:
the region score calculation unit adds up, separately for each label, the attenuated original scores received by the root node of each subtree, and takes the result as the transmission score of the corresponding label at the root node.
12. The device according to claim 7, characterized in that the output unit comprises:
a region sorting module, configured to sort all regions in the document object model tree by score and select the top X regions in descending order, taking the root node of the subtree corresponding to each selected region as a candidate node; X is a preset positive integer;
a screening module, configured to keep only the candidate nodes that are the child nodes when a candidate node is an ancestor of other candidate nodes;
a label sorting module, configured to sort the labels by their original scores in the subtree rooted at each candidate node, and to select the label with the highest original score as the candidate label;
a selecting module, configured to take the node where the candidate label is located as the final node;
an output module, configured to output the value of the candidate label according to the webpage content corresponding to the final node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310344292.6A CN104346405B (en) | 2013-08-08 | 2013-08-08 | A kind of method and device of the Extracting Information from webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104346405A CN104346405A (en) | 2015-02-11 |
CN104346405B true CN104346405B (en) | 2018-05-22 |
Family ID=52502018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310344292.6A Active CN104346405B (en) | 2013-08-08 | 2013-08-08 | A kind of method and device of the Extracting Information from webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104346405B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105630772B (en) * | 2016-01-26 | 2018-10-12 | 广东工业大学 | A kind of abstracting method of webpage comment content |
CN106095854B (en) * | 2016-06-02 | 2022-05-17 | 腾讯科技(深圳)有限公司 | Method and device for determining position information of information block |
WO2018103540A1 (en) | 2016-12-09 | 2018-06-14 | 腾讯科技(深圳)有限公司 | Webpage content extraction method, device, and data storage medium |
CN107741942B (en) * | 2016-12-09 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Webpage content extraction method and device |
CN109635219A (en) * | 2018-12-05 | 2019-04-16 | 云孚科技(北京)有限公司 | A kind of webpage content extracting method |
CN113626028B (en) * | 2020-05-07 | 2024-06-14 | 腾讯科技(深圳)有限公司 | Page element mapping method and device |
CN114528811B (en) * | 2022-01-21 | 2022-09-02 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073654A (en) * | 2009-11-20 | 2011-05-25 | 富士通株式会社 | Methods and equipment for generating and maintaining web content extraction template |
CN102467501A (en) * | 2010-10-29 | 2012-05-23 | 北大方正集团有限公司 | Method and system for extracting news record metadata from news list page |
CN102591931A (en) * | 2011-12-23 | 2012-07-18 | 浙江大学 | Recognition and extraction method for webpage data records based on tree weight |
CN102915361A (en) * | 2012-10-18 | 2013-02-06 | 北京理工大学 | Webpage text extracting method based on character distribution characteristic |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7552116B2 (en) * | 2004-08-06 | 2009-06-23 | The Board Of Trustees Of The University Of Illinois | Method and system for extracting web query interfaces |
US7814084B2 (en) * | 2007-03-21 | 2010-10-12 | Schmap Inc. | Contact information capture and link redirection |
JP2011003182A (en) * | 2009-05-19 | 2011-01-06 | Studio Ousia Inc | Keyword display method and system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN104346405A (en) | 2015-02-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||