CN104346405B - Method and device for extracting information from a web page - Google Patents

Info

Publication number
CN104346405B
Authority
CN
China
Prior art keywords
label
node
region
original point
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310344292.6A
Other languages
Chinese (zh)
Other versions
CN104346405A (en)
Inventor
谢宣松
耿小亮
孙健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310344292.6A
Publication of CN104346405A
Application granted
Publication of CN104346405B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for extracting information from a web page, comprising: for an input web page, adding each label in a preset label set to each node in the document object model (DOM) tree corresponding to the web page; obtaining an original score for each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label; attenuating the original score of each label on each node and passing it to the root node of the subtree in which the node is located; determining a transfer score for each label on the root node of each subtree from the attenuated original scores that the root node receives, and taking the sum of the transfer scores of the labels as the score of the region represented by the subtree; and selecting one or more high-scoring regions and outputting the values of the labels in the selected regions. The application can improve the accuracy of extracting specific information from tree-shaped text structures such as web pages.

Description

Method and device for extracting information from a web page
Technical field
The present invention relates to the Internet field, and in particular to a method and device for extracting information from a web page.
Background
Extracting structured data from raw data sources is a basic technology, and web pages are the most common raw data source. Extracting structured data from web pages differs greatly from extracting structured data from plain text. On the one hand, web pages do not use a standard text grammar, so standard text grammars do not apply; page structures vary widely and contain a great deal of noise. On the other hand, the extraction targets in web pages are also diverse: there are single nodes, chained strings of nodes (such as navigation bars), and block-shaped regions (such as tables). At present, structured data is generally extracted from web pages with rule-based methods. The extraction objects are relatively regular tables, name-value pairs (such as attribute names and attribute values), and record lists. The extraction methods are scattered and independent of one another, or use only information in the close vicinity of the target node.
One existing scheme provides a method for extracting object attribute-value information from a web page, with the following steps: a) for a given web page, obtain the DOM (Document Object Model) tree corresponding to the page and compute the relevant information of each DOM node in the tree; b) construct a labeled node graph from the DOM tree and the relevant information of each DOM node, and compute the score of each labeled node; c) select a labeled node tree from the labeled node graph based on the scores of the labeled nodes; d) build an attribute-value tree from the selected labeled node tree. The disadvantages of the existing scheme are:
It is suitable only for extracting concentrated attribute-value pairs;
It does not use different kinds of labels to determine a region and thereby improve extraction precision within that region;
Its use of the features of the node itself is too limited and emphasizes literal features;
It does not systematically use influence propagation to obtain results that reflect context.
Summary of the invention
The technical problem to be solved by this application is how to improve the accuracy of extracting specific information from tree-shaped text structures such as web pages.
To solve the above problem, this application provides a method for extracting information from a web page, comprising:
For an input web page, adding each label in a preset label set to each node in the document object model tree corresponding to the web page;
Obtaining the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label;
Attenuating the original score of each label on each node and passing it to the root node of the subtree in which the node is located;
Determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node, and taking the sum of the transfer scores of the labels as the score of the region represented by the subtree;
Selecting one or more high-scoring regions, and outputting the values of the labels in the selected regions.
Optionally, before the step of selecting one or more high-scoring regions, the method further comprises:
For each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region;
Summing the position values of the target nodes and dividing by the total number of nodes in the web page to obtain a mean; for each region, dividing the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; calculating the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; and computing a weighted sum of the density and the distance of the region to obtain the structure score of the region;
Computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region, and then performing the step of selecting one or more high-scoring regions.
Optionally, the step of obtaining the original score of each label on each node according to the score value that the value of each feature of the node corresponds to in that label comprises:
Performing the following operations for each node:
Obtaining the value of each feature of the node;
For each label on the node, looking up the score value that the value of each feature corresponds to in that label, multiplying each score value found by the weight of the corresponding feature in that label, adding the products together, and taking the sum as the original score of that label on the node.
Optionally, the step of attenuating the original score of each label on each node comprises:
Applying linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 − k_1) + k_1 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation coefficient with value range (0, 1), and S is the original score;
The step of determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node comprises:
For each label, selecting the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
Optionally, the step of attenuating the original score of each label on each node comprises:
Applying exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 − k_2) + k_2 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation coefficient with value range (0, 1), and S is the original score;
The step of determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node comprises:
Adding together, label by label, the attenuated original scores received by the root node of each subtree, and taking each sum as the transfer score of the corresponding label on the root node.
Optionally, the step of selecting one or more high-scoring regions and outputting the values of the labels in the selected regions comprises:
Sorting all regions in the document object model tree by score, selecting the top X regions in descending order, and taking the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
If a candidate node is an ancestor of other candidate nodes, retaining only the candidate nodes that are its descendants;
In the subtree rooted at each candidate node, sorting the instances of each label by original score and selecting the label with the highest original score as the candidate label;
Choosing the node where the candidate label is located as the final node;
Outputting the value of the candidate label according to the web page content corresponding to the final node.
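The selection steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary-based tree representation, the function names, and the sample scores are all assumptions, and "highest original score" is read here as the best-scoring node for each label within a candidate subtree.

```python
from typing import Dict, List, Optional, Tuple


def is_ancestor(a: str, b: str, parent: Dict[str, Optional[str]]) -> bool:
    """Return True if node a is a proper ancestor of node b."""
    node = parent.get(b)
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False


def select_candidates(region_scores: Dict[str, float],
                      parent: Dict[str, Optional[str]],
                      x: int) -> List[str]:
    """Take the top-x regions by score, then drop any candidate that is
    an ancestor of another candidate (only the descendants are kept)."""
    top = sorted(region_scores, key=region_scores.get, reverse=True)[:x]
    return [c for c in top
            if not any(is_ancestor(c, other, parent) for other in top if other != c)]


def best_label_nodes(candidate: str,
                     children: Dict[str, List[str]],
                     labels: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each label in the candidate's subtree, find the node that
    carries the highest original score for that label."""
    best: Dict[str, Tuple[str, float]] = {}
    stack = [candidate]
    while stack:
        node = stack.pop()
        for label, score in labels.get(node, {}).items():
            if label not in best or score > best[label][1]:
                best[label] = (node, score)
        stack.extend(children.get(node, []))
    return best


# Hypothetical tree and scores (node names borrowed from Fig. 2).
parent = {"200": None, "210": "200", "211": "210",
          "220": "200", "221": "220", "222": "220", "2221": "222"}
children = {"200": ["210", "220"], "210": ["211"],
            "220": ["221", "222"], "222": ["2221"]}
region_scores = {"200": 0.9, "210": 0.3, "220": 1.2, "222": 1.0}
labels = {"222": {"price": 0.2}, "2221": {"price": 0.8, "title": 0.5}}

candidates = select_candidates(region_scores, parent, x=2)
finals = best_label_nodes(candidates[0], children, labels)
```

With X = 2 the two best regions are rooted at nodes 220 and 222; because 220 is an ancestor of 222, only 222 is retained, and the final nodes are then picked inside that subtree.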
This application also provides a device for extracting information from a web page, comprising:
A labeling unit, configured to add, for an input web page, each label in a preset label set to each node in the document object model tree corresponding to the web page;
An original score calculation unit, configured to obtain the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label;
A transfer unit, configured to attenuate the original score of each label on each node and pass it to the root node of the subtree in which the node is located;
A region score calculation unit, configured to determine the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node, and to take the sum of the transfer scores of the labels as the score of the region represented by the subtree;
An output unit, configured to select one or more high-scoring regions and output the values of the labels in the selected regions.
Optionally, the device further comprises:
A region score correction unit, configured to obtain the score of each region from the region score calculation unit; for each region, to multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region; to sum the position values of the target nodes and divide by the total number of nodes in the web page to obtain a mean; for each region, to divide the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; to calculate the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; to compute a weighted sum of the density and the distance of the region to obtain the structure score of the region; to compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then to send the final score of each region to the output unit.
Optionally, the original score calculation unit obtaining the original score of each label on each node according to the score value that the value of each feature of the node corresponds to in that label means:
The original score calculation unit performs the following operations for each node: obtaining the value of each feature of the node; and, for each label on the node, looking up the score value that the value of each feature corresponds to in that label, multiplying each score value found by the weight of the corresponding feature in that label, adding the products together, and taking the sum as the original score of that label on the node.
Optionally, the transfer unit attenuating the original score of each label on each node means:
The transfer unit applies linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 − k_1) + k_1 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation coefficient with value range (0, 1), and S is the original score;
The region score calculation unit determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node means:
For each label, the region score calculation unit selects the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
Optionally, the transfer unit attenuating the original score of each label on each node means:
The transfer unit applies exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 − k_2) + k_2 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation coefficient with value range (0, 1), and S is the original score;
The region score calculation unit determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node means:
The region score calculation unit adds together, label by label, the attenuated original scores received by the root node of each subtree, and takes each sum as the transfer score of the corresponding label on the root node.
Optionally, the output unit comprises:
A region sorting module, configured to sort all regions in the document object model tree by score, select the top X regions in descending order, and take the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
A screening module, configured to retain only the descendant candidate nodes when a candidate node is an ancestor of other candidate nodes;
A label sorting module, configured to sort, in the subtree rooted at each candidate node, the instances of each label by original score, and to select the label with the highest original score as the candidate label;
A selection module, configured to choose the node where the candidate label is located as the final node;
An output module, configured to output the value of the candidate label according to the web page content corresponding to the final node.
At least one embodiment of this application determines the original score jointly from multiple features of a label and reflects context through influence propagation, so accurate results can be obtained. It is suitable not only for extracting concentrated attribute-value pairs but also for extracting relatively scattered labels, and it can also be used to extract list items. In one optimized scheme of this application, a region is first selected jointly using multiple kinds of labels, and a more accurate result is then selected from within that region. Another optimized scheme introduces the co-occurrence score and the structure score of a region to correct the region score and obtain a more accurate result. Of course, a product implementing this application need not achieve all of the above advantages at the same time.
Description of the drawings
Fig. 1 is a flowchart of the method for extracting information from a web page in Embodiment 1;
Fig. 2 is a schematic diagram of an extended extraction tree in Embodiment 1;
Fig. 3 is a schematic diagram of a node with labels in Embodiment 1.
Detailed description
The technical solution of this application is described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of this application and the features within them may be combined with one another, and all such combinations fall within the scope of protection of this application. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
Embodiment 1: a method for extracting information from a web page, as shown in Fig. 1, comprising steps S101 to S105.
S101: for an input web page, add each label in a preset label set to each node in the DOM tree corresponding to the web page.
S102: obtain the original score of each label on each node according to the score value that the value of each predetermined feature of the node corresponds to in that label.
S103: attenuate the original score of each label on each node and pass it to the root node of the subtree in which the node is located.
S104: determine the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node, and take the sum of the transfer scores of the labels as the score of the region represented by the subtree.
S105: select one or more high-scoring regions, and output the values of the labels in the selected regions.
In this embodiment, the scores of the labels in the nodes are passed to the root node of the subtree and, after being added together, serve as the score of the region corresponding to the subtree, which is used to select regions. For a region in which the values of the web page labels are scattered, the score may therefore still be high because the region contains many nodes. As a result, this embodiment selects not only regions where label values are concentrated, but may also select regions where label values are spread over multiple nodes.
In this embodiment, after a single web page is input, step S101 can treat each subtree of the DOM tree as a region and add labels to each node (Node), thereby building an extended extraction tree. A label (Label) is a piece of marker information that carries the meaning of an extraction target, such as the product price or product title in a product information page. A node can have one or more labels.
An example of an extended extraction tree is shown in Fig. 2, where the root node is node 200. The subtree rooted at node 210 is region 11, which contains root node 210 and child node 211. The subtree rooted at node 220 is region 12, which contains root node 220, child node 221, child node 222, and child node 2221. The subtree rooted at node 222 can also serve as a region, containing root node 222 and child node 2221. Each region can be regarded as one block of the web page. Each node, as shown in Fig. 3, includes one or more of the labels 31 to 36.
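To make the region concept concrete, the sketch below rebuilds the tree described for Fig. 2 and enumerates the region (subtree) rooted at each node. The `Node` class and the helper names are hypothetical; the text does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    """One node of the extended extraction tree (hypothetical structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    labels: Dict[str, float] = field(default_factory=dict)  # label -> original score


def subtree_nodes(root: Node) -> Iterator[Node]:
    """Yield the root and every descendant, i.e. all nodes of one region."""
    yield root
    for child in root.children:
        yield from subtree_nodes(child)


def regions(root: Node) -> Dict[str, List[str]]:
    """Map each subtree root to the names of the nodes in its region."""
    return {n.name: [m.name for m in subtree_nodes(n)] for n in subtree_nodes(root)}


# The tree of Fig. 2 as described in the text.
n2221 = Node("2221")
n222 = Node("222", [n2221])
n221 = Node("221")
n220 = Node("220", [n221, n222])
n211 = Node("211")
n210 = Node("210", [n211])
n200 = Node("200", [n210, n220])

all_regions = regions(n200)
print(all_regions["220"])  # ['220', '221', '222', '2221']
```

Here the region rooted at node 220 corresponds to region 12 in the text, and the region rooted at node 210 corresponds to region 11.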
In one implementation of this embodiment, the web page is a product information page, and the preset label set may include, but is not limited to, labels for nesting or displaying any one or more of the following: title (Title), product price (Price), product picture (Image), brand (Brand), and the attribute-value pairs corresponding to the product's attributes (AttrPairs). Because the score of a region is the sum of the transfer scores of its labels, a region containing multiple labels is more easily selected than a region with only one isolated label. The selection result of this embodiment therefore includes not only regions with a single very high-scoring label but may also include regions with multiple labels, so regions with multiple kinds of labels are not missed during extraction, which improves extraction precision.
In one implementation of this embodiment, the predetermined features of a node may include, but are not limited to, any one or more of the following:
Node type (Type): the HTML tag type in the web page;
Literal feature (Text): the visible characters in the web page;
Attribute feature (Attribute): the list of attribute values of the HTML tag in the web page;
Structure feature (Structure): the local structure formed by the node and the related nodes around it, or a specific text structure;
Visual feature (Vision): for example the font and color of the characters and their position in the overall page layout;
Other features (Other): other user-defined features, such as event features.
Increasing the variety of predetermined features avoids the low extraction precision that results from relying on literal features alone.
The score value that the value of a predetermined feature of a node corresponds to in each label can represent the degree of correlation between that feature value and the label. The score value can be preset by the system, or the degree of correlation between the feature value and the label can be determined statistically and used as the score value.
The same value of the same feature can correspond to different score values in different labels, although identical score values are not excluded. For example, if the value of a node's literal feature is "low price", its correlation with the price label is high, so its score value in the "price" label can be 0.8; its correlation with the "picture" label is low, so its score value in the "picture" label can be −0.5.
Different values of the same feature can correspond to different score values in one label, although identical score values are not excluded. If none of the values of the predetermined features of a node has a corresponding score value in a label, that label is deleted from the node.
In one implementation of this embodiment, step S102 can specifically comprise:
Performing the following operations for each node:
Obtaining the value of each predetermined feature of the node;
For each label on the node, looking up the score value that the value of each feature corresponds to in that label, multiplying each score value found by the weight of the corresponding feature in that label, adding the products together, and taking the sum as the original score of that label on the node.
The weight of a feature can differ between labels, although identical weights are not excluded. Suppose there are two labels, price and brand, and two predetermined features, node type and literal feature. In the price label, node type A corresponds to a score value of 5 and literal feature b corresponds to a score value of 9; in the brand label, node type A corresponds to a score value of −7 and literal feature b corresponds to a score value of −1. If the node type of some node is A and its literal feature is b, and in the price label the node type weight is R1 and the literal feature weight is R2, then the original score of the price label on that node is 5 × R1 + 9 × R2. If in the brand label the node type weight is R3 and the literal feature weight is R4, then the original score of the brand label on that node is (−7) × R3 + (−1) × R4.
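The weighted sum in step S102 can be sketched as below, using the score values from the example. The nested-dictionary layout, the function name, and the concrete weights chosen for R1 to R4 are assumptions made for illustration; the text leaves the weights open.

```python
from typing import Dict, Optional

# score_table[label][feature][feature value] -> score value
ScoreTable = Dict[str, Dict[str, Dict[str, float]]]
# weights[label][feature] -> weight of that feature in that label
Weights = Dict[str, Dict[str, float]]


def original_score(label: str,
                   features: Dict[str, str],
                   score_table: ScoreTable,
                   weights: Weights) -> Optional[float]:
    """Sum score * weight over the node's features for one label.

    Returns None when no feature value has a score in this label,
    in which case the label would be deleted from the node.
    """
    total = 0.0
    matched = False
    for feature, value in features.items():
        score = score_table.get(label, {}).get(feature, {}).get(value)
        if score is None:
            continue
        matched = True
        total += score * weights[label][feature]
    return total if matched else None


score_table: ScoreTable = {
    "price": {"node_type": {"A": 5.0}, "text": {"b": 9.0}},
    "brand": {"node_type": {"A": -7.0}, "text": {"b": -1.0}},
}
# Illustrative weights: R1 = 0.25, R2 = 0.75, R3 = 0.5, R4 = 0.5.
weights: Weights = {
    "price": {"node_type": 0.25, "text": 0.75},
    "brand": {"node_type": 0.5, "text": 0.5},
}
node_features = {"node_type": "A", "text": "b"}

price_score = original_score("price", node_features, score_table, weights)
brand_score = original_score("brand", node_features, score_table, weights)
print(price_score, brand_score)  # 8.0 -4.0
```

With these weights the price label scores 5 × 0.25 + 9 × 0.75 = 8.0 and the brand label scores (−7) × 0.5 + (−1) × 0.5 = −4.0.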
In one alternative of this implementation, the weights of the literal feature and the attribute feature in each label are set higher than those of the other features.
In one alternative of this implementation, to keep the method general and obtain the result set with maximum recall, the mapping between score values and feature values in a label should be as relaxed as possible. For instance, the values of literal features should be short characters or words with broad meaning: map the single character for "price" to a score value rather than the full word "price" or the phrase "market price". To improve matching efficiency, the feature values mapped to score values should use regular expressions as little as possible and use several text fragments instead; for example, map "ori" and "price" to score values rather than "originalprice".
In one alternative of this implementation, the score values in each label can use coarse discrete values to avoid arbitrary and fragmented score settings. For example, the score values can be limited to the following six:
strong-reward: strong reward value, e.g. 0.8;
strong-punish: strong penalty value, e.g. −0.8;
moderate-reward: moderate reward value, e.g. 0.5;
moderate-punish: moderate penalty value, e.g. −0.5;
weak-reward: weak reward value, e.g. 0.2;
weak-punish: weak penalty value, e.g. −0.2.
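As a small illustration of the coarse discrete levels, a score table can refer to the six named levels instead of free-form numbers. The table layout and the `level_score` helper are assumptions; the entries reuse the "low price" example from the text.

```python
# The six coarse score levels listed above, with the example values from the text.
COARSE_SCORES = {
    "strong-reward": 0.8,
    "strong-punish": -0.8,
    "moderate-reward": 0.5,
    "moderate-punish": -0.5,
    "weak-reward": 0.2,
    "weak-punish": -0.2,
}

# Hypothetical per-label table: literal feature value -> level name.
TEXT_LEVELS = {
    "price": {"low price": "strong-reward"},
    "picture": {"low price": "moderate-punish"},
}


def level_score(label: str, text_value: str) -> float:
    """Resolve a literal feature value to its coarse score in a label."""
    return COARSE_SCORES[TEXT_LEVELS[label][text_value]]


price_level = level_score("price", "low price")      # strong reward
picture_level = level_score("picture", "low price")  # moderate penalty
```

Keeping the numeric values in one place means a level can be retuned without touching the per-label tables.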
In step S103, the basic rule for attenuating the original score of each label on each node can be: the closer to the sender, the greater the influence. The influence therefore attenuates from the bottom up according to tree depth, and the attenuation function can be chosen according to the actual situation.
In one implementation of this embodiment, the step in S103 of attenuating the original score of each label on each node can specifically comprise:
Applying linear attenuation to the original score of the label to obtain the attenuation result S_L:
S_L = S × ((1 − k_1) + k_1 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_1 is the linear transfer attenuation coefficient with value range (0, 1), and S is the original score. The depth of the root node is 0, and the depth of any other node is the depth of its parent node plus 1. For example, when the attenuated original score is passed from node 2221 in Fig. 2 to node 220, D_d is the depth of node 220, namely 1, and D_s is the depth of node 2221, namely 3.
In this alternative, the step of determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node can specifically comprise:
For each label, selecting the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
This implementation is called maximum transfer: the transfer score of a label on an ancestor node is the maximum of all the attenuated original scores of that label passed to the ancestor node, and the original score of each label is attenuated linearly as it is passed upward. The transfer score of a label on an ancestor node is S′ = max(S_L0, S_L1, ..., S_L(n−1)), where S_Li (0 ≤ i ≤ n − 1, with n the number of attenuated original scores of that label passed to the ancestor node) is the original score of that label passed up from a child node after linear attenuation.
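The linear attenuation formula and the maximum transfer rule can be sketched as follows. The function names, the original scores 0.8 and 0.2, and k_1 = 0.5 are illustrative assumptions; the depths follow the Fig. 2 example (node 220 at depth 1, node 221 at depth 2, node 2221 at depth 3).

```python
from typing import Iterable


def linear_attenuation(s: float, d_dest: int, d_src: int, k1: float) -> float:
    """S_L = S * ((1 - k1) + k1 * D_d / D_s)."""
    if not 0.0 < k1 < 1.0:
        raise ValueError("k1 must lie in the open interval (0, 1)")
    if d_src <= 0 or not 0 <= d_dest <= d_src:
        raise ValueError("expected 0 <= d_dest <= d_src and d_src > 0")
    return s * ((1.0 - k1) + k1 * d_dest / d_src)


def max_transfer(attenuated_scores: Iterable[float]) -> float:
    """Transfer score of one label on an ancestor: the largest received score."""
    return max(attenuated_scores)


# Node 2221 (depth 3) and node 221 (depth 2) each pass a price score to node 220 (depth 1).
from_2221 = linear_attenuation(0.8, d_dest=1, d_src=3, k1=0.5)
from_221 = linear_attenuation(0.2, d_dest=1, d_src=2, k1=0.5)
price_at_220 = max_transfer([from_2221, from_221])
```

The attenuation factor equals 1 when source and destination are at the same depth and falls to 1 − k_1 when the score is passed all the way to the root at depth 0, so k_1 bounds how much a score can lose on its way up.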
In another implementation of this embodiment, the step in S103 of attenuating the original score of each label on each node can specifically comprise:
Applying exponential attenuation to the original score of the label to obtain the attenuation result S_Q:
S_Q = S × ((1 − k_2) + k_2 · D_d / D_s)
where D_d is the depth of the destination node of the transfer, D_s is the depth of the source node of the transfer, k_2 is the exponential transfer attenuation coefficient with value range (0, 1), and S is the original score. The depth of the root node is 0, and the depth of any other node is the depth of its parent node plus 1.
In this alternative, the step of determining the transfer score of each label on the root node of each subtree from the attenuated original scores received by that root node can specifically comprise:
Adding together, label by label, the attenuated original scores received by the root node of each subtree, and taking each sum as the transfer score of the corresponding label on the root node.
This implementation is called cumulative transfer: the transfer score of a label on an ancestor node is the sum of all the attenuated original scores of that label passed to the ancestor node. The original score of each label is generally attenuated exponentially as it is passed upward, and the transfer score of a label on an ancestor node is S′ = sum(S_Q0, S_Q1, ..., S_Q(n−1)), where S_Qi (0 ≤ i ≤ n − 1, with n the number of attenuated original scores of that label passed to the ancestor node) is the original score of that label passed up from a child node after exponential attenuation.
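The cumulative transfer rule, together with the region score of step S104, can be sketched as below. The sketch takes the attenuated scores as given and shows only the per-label summation; the function names and sample values are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cumulative_transfer(received: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the attenuated original scores received by a subtree root, per label."""
    totals: Dict[str, float] = defaultdict(float)
    for label, attenuated_score in received:
        totals[label] += attenuated_score
    return dict(totals)


def region_score(transfer_scores: Dict[str, float]) -> float:
    """Score of a region: the sum of the transfer scores of its labels (step S104)."""
    return sum(transfer_scores.values())


# Hypothetical attenuated scores arriving at one subtree root.
received = [("price", 0.5), ("price", 0.25), ("title", 0.25)]
transfer = cumulative_transfer(received)
score = region_score(transfer)
print(transfer, score)  # {'price': 0.75, 'title': 0.25} 1.0
```

Because scores of the same label accumulate, a region whose label values are spread over many nodes can still reach a high score, which matches the behavior described for scattered labels.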
In one implementation of this embodiment, the following can also be performed before step S105:
For each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region;
Summing the position values of the target nodes and dividing by the total number of nodes in the web page to obtain a mean; for each region, dividing the sum of the absolute differences between the position value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; calculating the absolute value of the difference between the position value of the root node of the subtree corresponding to the region and the position value of the root node of the document object model tree, to obtain the distance of the region; and computing a weighted sum of the density and the distance of the region to obtain the structure score of the region;
Computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region, and then performing the step of selecting one or more high-scoring regions.
In this implementation, the weights used when calculating the structure score and the final score can be set as needed.
The co-occurrence score measures whether the target labels all appear in the region: the more of them appear, the higher the co-occurrence score. The co-occurrence score of a region is CoocScore = Z × N_found / N_target, where Z is the score of the region from step S104, N_found is the number of target labels present in the region, and N_target is the total number of target labels. Target labels can be set in advance as needed; for example, when information about price is to be extracted, the price label is set as a target label. In addition, the deduction applied when a particular label is missing can be customized according to the importance of each label.
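The co-occurrence formula above can be sketched as a small function. The function name, the label names, and the handling of an empty target set are assumptions made for illustration only.

```python
def cooccurrence_score(region_score, labels_in_region, target_labels):
    """CoocScore = Z * N_found / N_target.

    region_score: Z, the score of the region from the previous step.
    labels_in_region: labels that actually occur in the region.
    target_labels: the preset set of labels to be extracted.
    """
    targets = set(target_labels)
    if not targets:
        # No target labels configured; treated as zero here (assumption).
        return 0.0
    n_found = len(targets & set(labels_in_region))
    return region_score * n_found / len(targets)


# Hypothetical example: the region contains two of the three target labels.
score = cooccurrence_score(
    region_score=2.0,
    labels_in_region={"price", "title", "breadcrumb"},
    target_labels={"price", "title", "main_image"},
)
```

Labels in the region that are not target labels (here the hypothetical "breadcrumb") do not affect the result.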
In general, there are two broad classes of regions. One class contains multiple kinds of labels, each with a single final node; for example, a key region contains the price, title, and main image. The other class contains a single label but multiple nodes; for example, an attribute region contains only multiple nodes carrying the attribute-value-pair label. For a region containing multiple kinds of labels, the maximal region reaches an extreme point of the score: original label scores are attenuated when passed upward from descendant nodes, but an increase in the number of label kinds increases the score of the region. Therefore, when a region just contains all the target labels, the score of the region reaches an extreme point, and that region is the region from which information is to be extracted.
When calculating the structure score, each node has a pre-assigned positional value; for example, if a webpage has 1000 nodes in total, a positional value is assigned to each node in turn, and both the density and the distance are calculated from these values. For example, in the DOM tree of Fig. 2, the positional value of root node 200 is 1, the positional value of nodes 210, 220, and 230 is 2, the positional value of nodes 211, 221, and 222 is 3, and the positional value of node 2221 is 4. The target nodes can be set in advance as needed.
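The density, distance, and structure score described above can be sketched as follows. The function and parameter names, the weights, and the choice of target nodes and region in the example are hypothetical; only the positional values are taken from the Fig. 2 example.

```python
def structure_score(region_positions, region_root_pos, dom_root_pos,
                    target_positions, total_page_nodes,
                    w_density, w_distance):
    """Weighted sum of the density and the distance of one region.

    mean     = sum of positional values of target nodes / total nodes in page
    density  = sum(|pos - mean| for nodes in region) / nodes in region
    distance = |pos(region root) - pos(DOM root)|
    """
    mean = sum(target_positions) / total_page_nodes
    density = sum(abs(p - mean) for p in region_positions) / len(region_positions)
    distance = abs(region_root_pos - dom_root_pos)
    return w_density * density + w_distance * distance


# Positional values from the Fig. 2 example: eight nodes in total, root = 1.
# Assumed for illustration: the region has positional values [2, 3, 3, 4]
# with its root at positional value 2, and the target nodes are the ones
# with positional values 3 and 4.
result = structure_score(
    region_positions=[2, 3, 3, 4],
    region_root_pos=2,
    dom_root_pos=1,
    target_positions=[3, 4],
    total_page_nodes=8,
    w_density=0.5,   # weights are set as needed; these values are arbitrary
    w_distance=0.5,
)
```

Whether a larger density or distance should raise or lower the final score is controlled by the sign and size of the weights, which the text leaves to be set as needed.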
In this implementation, a deduction may also be applied to the final score when isolated points exist. An isolated point is a point within a region whose positional value differs from the above mean by more than a predetermined threshold; a region may or may not contain isolated points.
In one implementation of this embodiment, step S106 may specifically include:
Sorting all regions in the document object model tree by score and selecting the top X regions in descending order, and taking the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
If a candidate node is an ancestor node of another candidate node, retaining only the candidate node that is the descendant node;
In the subtree rooted at each candidate node, sorting the labels by original score and selecting the label with the highest original score as the candidate label, where the candidate labels selected in subtrees rooted at different candidate nodes are different;
Selecting the node where the candidate label is located as the final node; in other implementations, the final node may also be selected from the candidate node itself or from its descendant nodes according to the requirements of different labels;
Outputting the value of the candidate label according to the webpage content corresponding to the final node.
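The selection steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Node` class and all function names are hypothetical, region scores and original scores are assumed to have been computed already, and the requirement that different candidate nodes select different labels is enforced here simply by skipping labels that an earlier candidate has already taken.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """Hypothetical DOM node carrying precomputed scores."""
    name: str
    region_score: float = 0.0
    original_scores: Dict[str, float] = field(default_factory=dict)
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def subtree(node: Node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)


def is_ancestor(a: Node, b: Node) -> bool:
    """True if a is a proper ancestor of b."""
    p = b.parent
    while p is not None:
        if p is a:
            return True
        p = p.parent
    return False


def select_final_nodes(root: Node, x: int) -> List[Tuple[str, Node]]:
    # Step 1: rank all regions (subtree roots) by score and keep the top X.
    ranked = sorted(subtree(root), key=lambda n: n.region_score, reverse=True)[:x]
    # Step 2: drop any candidate that is an ancestor of another candidate.
    candidates = [c for c in ranked
                  if not any(is_ancestor(c, o) for o in ranked if o is not c)]
    # Steps 3 and 4: per candidate, pick the label with the highest original
    # score and the node that carries it.
    results, used = [], set()
    for cand in candidates:
        best = None
        for node in subtree(cand):
            for label, score in node.original_scores.items():
                if label in used:
                    continue
                if best is None or score > best[1]:
                    best = (label, score, node)
        if best is not None:
            used.add(best[0])
            results.append((best[0], best[2]))
    return results
```

The value to output for each returned pair is then read from the webpage content of the final node, for example its `text` field in this sketch.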
For example, for the label "price" on a candidate node, the value "20" of the label is obtained from the webpage content corresponding to the candidate node and output. The output value may need to be normalized or enriched with the values of the preceding and following nodes. Text normalization of the output value means normalizing according to pre-specified rules (such as removing spaces, blacklisted keywords, and certain specified symbols). The value may also be enriched according to the label type and the values of neighboring nodes; for example, for a price label whose selected value is 10, where the preceding node is a currency symbol and the following node is a unit, the values can be combined.
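The normalization and enrichment of an output value might look like the following sketch. The function names, the blacklist, the symbols, and the currency and unit strings are hypothetical examples of the "pre-specified rules" mentioned above.

```python
import re


def normalize_value(text, blacklist=(), strip_symbols=""):
    """Normalize an extracted value with simple pre-specified rules.

    blacklist: keywords removed from the text.
    strip_symbols: individual characters removed from the text.
    All whitespace is removed at the end.
    """
    for word in blacklist:
        text = text.replace(word, "")
    for symbol in strip_symbols:
        text = text.replace(symbol, "")
    return re.sub(r"\s+", "", text)


def enrich_price(value, previous_text="", next_text=""):
    """Combine a price value with its neighbors (currency symbol and unit)."""
    return f"{previous_text}{value}{next_text}"


raw = " Price: 10 "
value = normalize_value(raw, blacklist=("Price",), strip_symbols=":")
combined = enrich_price(value, previous_text="¥", next_text="/piece")
```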
Embodiment two: a device for extracting information from a webpage, including:
A labeling unit, configured to, for an input webpage, add each label in a preset label set to each node in the document object model tree corresponding to the webpage;
An original score computing unit, configured to obtain the original score of each label on each node according to the score value corresponding, in each label, to the value of each predetermined feature of each node;
A transfer unit, configured to attenuate the original score of each label on each node and then pass it to the root node of the subtree in which the node is located;
A region score computing unit, configured to determine the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and to take the sum of the transfer scores of the labels as the score of the region represented by the subtree;
An output unit, configured to select one or more regions with high scores and output the values of the labels in the selected regions.
In one implementation of this embodiment, the device may further include:
A region score correction unit, configured to: obtain the score of each region from the region score computing unit; for each region, multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region; sum the positional values of the target nodes and divide by the total number of nodes in the webpage to obtain a mean; for each region, divide the sum of the absolute differences between the positional value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; calculate the absolute value of the difference between the positional value of the root node of the subtree corresponding to the region and the positional value of the root node of the document object model tree, to obtain the distance of the region; compute a weighted sum of the density and the distance of the region to obtain the structure score of the region; compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then send the final score of each region to the output unit.
In one implementation of this embodiment, the original score computing unit obtaining the original score of each label on each node according to the score value corresponding, in each label, to the value of each feature of each node may mean:
The original score computing unit performs the following operations for each node: obtaining the value of each feature of the node; for each label on the node, looking up the score value corresponding to the value of each feature in that label, multiplying each score value found by the weight of the corresponding feature in that label, summing the products, and taking the sum as the original score of that label on the node.
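The lookup-and-weight computation of the original score can be sketched as follows. The feature names, score tables, and weights are hypothetical; the text does not specify concrete features, and treating a missing table entry as zero is an assumption.

```python
def original_score(feature_values, score_table, weights):
    """Original score of ONE label on ONE node.

    feature_values: {feature name: value observed on the node}
    score_table:    {feature name: {feature value: score value}} for this label
    weights:        {feature name: weight of the feature} for this label
    Missing table entries or weights contribute 0 (assumption).
    """
    total = 0.0
    for feature, value in feature_values.items():
        looked_up = score_table.get(feature, {}).get(value, 0.0)
        total += looked_up * weights.get(feature, 0.0)
    return total


# Hypothetical configuration for a "price" label.
price_table = {
    "has_currency_symbol": {True: 1.0, False: 0.0},
    "font_size": {"large": 0.6, "small": 0.2},
}
price_weights = {"has_currency_symbol": 0.7, "font_size": 0.3}

node_features = {"has_currency_symbol": True, "font_size": "large"}
price_score = original_score(node_features, price_table, price_weights)
```

Running the same function once per label in the preset label set, each with its own table and weights, yields the original scores of all labels on the node.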
In one implementation of this embodiment, the transfer unit attenuating the original score of each label on each node may mean:
The transfer unit applies linear attenuation to the original score of the label to obtain the attenuated result S_L:
S_L = S × ((1 − k1) + k1 × D_d/D_s)
Wherein, D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k1 is the linear transfer attenuation index, with value range (0, 1); and S is the original score;
Correspondingly, the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means:
The region score computing unit selects, for each label, the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
In one implementation of this embodiment, the transfer unit attenuating the original score of each label on each node may mean:
The transfer unit applies exponential attenuation to the original score of the label to obtain the attenuated result S_Q:
S_Q = S × ((1 − k2) + k2 D_d/D_s)
Wherein, D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k2 is the exponential transfer attenuation index, with value range (0, 1); and S is the original score;
Correspondingly, the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means:
The region score computing unit sums, separately for each label, the attenuated original scores received by the root node of each subtree, and takes the result as the transfer score of the corresponding label on that root node.
In one implementation of this embodiment, the output unit may specifically include:
A region sorting module, configured to sort all regions in the document object model tree by score, select the top X regions in descending order, and take the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
A screening module, configured to, when a candidate node is an ancestor node of another candidate node, retain only the candidate node that is the descendant node;
A label sorting module, configured to, in the subtree rooted at each candidate node, sort the labels by original score and select the label with the highest original score as the candidate label;
A selecting module, configured to select the node where the candidate label is located as the final node;
An output module, configured to output the value of the candidate label according to the webpage content corresponding to the final node.
A person of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present application is not limited to any particular combination of hardware and software.
Of course, the present application may have various other embodiments. Those skilled in the art can make various corresponding changes and modifications according to the present application without departing from its spirit and essence, but all such changes and modifications shall fall within the protection scope of the claims of the present application.

Claims (12)

1. A method for extracting information from a webpage, comprising:
for an input webpage, adding each label in a preset label set to each node in the document object model tree corresponding to the webpage;
obtaining the original score of each label on each node according to the score value corresponding, in each label, to the value of each predetermined feature of each node;
attenuating the original score of each label on each node and then passing it to the root node of the subtree in which the node is located, wherein the rule for attenuating the original score of each label on each node is: the closer the node is to the root node of the subtree in which it is located, the smaller the attenuation of the original score;
determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and taking the sum of the transfer scores of the labels as the score of the region represented by the subtree;
selecting one or more regions with high scores, and outputting the values of the labels in the selected regions.
2. The method according to claim 1, characterized in that, before the step of selecting one or more regions with high scores, the method further comprises:
for each region, multiplying the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region;
summing the positional values of the target nodes and dividing by the total number of nodes in the webpage to obtain a mean; for each region, dividing the sum of the absolute differences between the positional value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; calculating the absolute value of the difference between the positional value of the root node of the subtree corresponding to the region and the positional value of the root node of the document object model tree, to obtain the distance of the region; and computing a weighted sum of the density and the distance of the region to obtain the structure score of the region;
computing a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region, and then performing the step of selecting one or more regions with high scores.
3. The method according to claim 1, characterized in that the step of obtaining the original score of each label on each node according to the score value corresponding, in each label, to the value of each feature of each node comprises:
performing the following operations for each node:
obtaining the value of each feature of the node;
for each label on the node, looking up the score value corresponding to the value of each feature in that label, multiplying each score value found by the weight of the corresponding feature in that label, summing the products, and taking the sum as the original score of that label on the node.
4. The method according to claim 1, characterized in that the step of attenuating the original score of each label on each node comprises:
applying linear attenuation to the original score of the label to obtain the attenuated result S_L:
S_L = S × ((1 − k1) + k1 × D_d/D_s)
wherein D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k1 is the linear transfer attenuation index, with value range (0, 1); and S is the original score;
the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node comprises:
selecting, for each label, the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
5. The method according to claim 1, characterized in that the step of attenuating the original score of each label on each node comprises:
applying exponential attenuation to the original score of the label to obtain the attenuated result S_Q:
S_Q = S × ((1 − k2) + k2 D_d/D_s)
wherein D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k2 is the exponential transfer attenuation index, with value range (0, 1); and S is the original score;
the step of determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node comprises:
summing, separately for each label, the attenuated original scores received by the root node of each subtree, and taking the result as the transfer score of the corresponding label on that root node.
6. The method according to claim 1, characterized in that the step of selecting one or more regions with high scores and outputting the values of the labels in the selected regions comprises:
sorting all regions in the document object model tree by score and selecting the top X regions in descending order, and taking the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
if a candidate node is an ancestor node of another candidate node, retaining only the candidate node that is the descendant node;
in the subtree rooted at each candidate node, sorting the labels by original score and selecting the label with the highest original score as the candidate label;
selecting the node where the candidate label is located as the final node;
outputting the value of the candidate label according to the webpage content corresponding to the final node.
7. A device for extracting information from a webpage, characterized by comprising:
a labeling unit, configured to, for an input webpage, add each label in a preset label set to each node in the document object model tree corresponding to the webpage;
an original score computing unit, configured to obtain the original score of each label on each node according to the score value corresponding, in each label, to the value of each predetermined feature of each node;
a transfer unit, configured to attenuate the original score of each label on each node and then pass it to the root node of the subtree in which the node is located, wherein the rule for attenuating the original score of each label on each node is: the closer the node is to the root node of the subtree in which it is located, the smaller the attenuation of the original score;
a region score computing unit, configured to determine the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node, and to take the sum of the transfer scores of the labels as the score of the region represented by the subtree;
an output unit, configured to select one or more regions with high scores and output the values of the labels in the selected regions.
8. The device according to claim 7, characterized by further comprising:
a region score correction unit, configured to: obtain the score of each region from the region score computing unit; for each region, multiply the score of the region by the ratio of the number of target labels present in the region to the total number of target labels in the document object model tree, to obtain the co-occurrence score of the region; sum the positional values of the target nodes and divide by the total number of nodes in the webpage to obtain a mean; for each region, divide the sum of the absolute differences between the positional value of each node in the region and the mean by the total number of nodes in the region, to obtain the density of the region; calculate the absolute value of the difference between the positional value of the root node of the subtree corresponding to the region and the positional value of the root node of the document object model tree, to obtain the distance of the region; compute a weighted sum of the density and the distance of the region to obtain the structure score of the region; compute a weighted sum of the co-occurrence score and the structure score of each region to obtain the final score of each region; and then send the final score of each region to the output unit.
9. The device according to claim 7, characterized in that the original score computing unit obtaining the original score of each label on each node according to the score value corresponding, in each label, to the value of each feature of each node means:
the original score computing unit performs the following operations for each node: obtaining the value of each feature of the node; for each label on the node, looking up the score value corresponding to the value of each feature in that label, multiplying each score value found by the weight of the corresponding feature in that label, summing the products, and taking the sum as the original score of that label on the node.
10. The device according to claim 7, characterized in that the transfer unit attenuating the original score of each label on each node means:
the transfer unit applies linear attenuation to the original score of the label to obtain the attenuated result S_L:
S_L = S × ((1 − k1) + k1 × D_d/D_s)
wherein D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k1 is the linear transfer attenuation index, with value range (0, 1); and S is the original score;
the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means:
the region score computing unit selects, for each label, the largest of the attenuated original scores of that label received by the root node of each subtree as the transfer score of that label on the root node.
11. The device according to claim 7, characterized in that the transfer unit attenuating the original score of each label on each node means:
the transfer unit applies exponential attenuation to the original score of the label to obtain the attenuated result S_Q:
S_Q = S × ((1 − k2) + k2 D_d/D_s)
wherein D_d is the depth of the destination node of the transfer and D_s is the depth of the source node of the transfer; k2 is the exponential transfer attenuation index, with value range (0, 1); and S is the original score;
the region score computing unit determining the transfer score of each label on the root node of each subtree according to the attenuated original scores received by that root node means:
the region score computing unit sums, separately for each label, the attenuated original scores received by the root node of each subtree, and takes the result as the transfer score of the corresponding label on that root node.
12. The device according to claim 7, characterized in that the output unit comprises:
a region sorting module, configured to sort all regions in the document object model tree by score, select the top X regions in descending order, and take the root nodes of the subtrees corresponding to the selected regions as candidate nodes, where X is a preset positive integer;
a screening module, configured to, when a candidate node is an ancestor node of another candidate node, retain only the candidate node that is the descendant node;
a label sorting module, configured to, in the subtree rooted at each candidate node, sort the labels by original score and select the label with the highest original score as the candidate label;
a selecting module, configured to select the node where the candidate label is located as the final node;
an output module, configured to output the value of the candidate label according to the webpage content corresponding to the final node.
CN201310344292.6A 2013-08-08 2013-08-08 A kind of method and device of the Extracting Information from webpage Active CN104346405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310344292.6A CN104346405B (en) 2013-08-08 2013-08-08 A kind of method and device of the Extracting Information from webpage

Publications (2)

Publication Number Publication Date
CN104346405A CN104346405A (en) 2015-02-11
CN104346405B true CN104346405B (en) 2018-05-22

Family

ID=52502018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310344292.6A Active CN104346405B (en) 2013-08-08 2013-08-08 A kind of method and device of the Extracting Information from webpage

Country Status (1)

Country Link
CN (1) CN104346405B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630772B (en) * 2016-01-26 2018-10-12 广东工业大学 A kind of abstracting method of webpage comment content
CN106095854B (en) * 2016-06-02 2022-05-17 腾讯科技(深圳)有限公司 Method and device for determining position information of information block
WO2018103540A1 (en) 2016-12-09 2018-06-14 腾讯科技(深圳)有限公司 Webpage content extraction method, device, and data storage medium
CN107741942B (en) * 2016-12-09 2020-06-02 腾讯科技(深圳)有限公司 Webpage content extraction method and device
CN109635219A (en) * 2018-12-05 2019-04-16 云孚科技(北京)有限公司 A kind of webpage content extracting method
CN113626028B (en) * 2020-05-07 2024-06-14 腾讯科技(深圳)有限公司 Page element mapping method and device
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073654A (en) * 2009-11-20 2011-05-25 富士通株式会社 Methods and equipment for generating and maintaining web content extraction template
CN102467501A (en) * 2010-10-29 2012-05-23 北大方正集团有限公司 Method and system for extracting news record metadata from news list page
CN102591931A (en) * 2011-12-23 2012-07-18 浙江大学 Recognition and extraction method for webpage data records based on tree weight
CN102915361A (en) * 2012-10-18 2013-02-06 北京理工大学 Webpage text extracting method based on character distribution characteristic

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7552116B2 (en) * 2004-08-06 2009-06-23 The Board Of Trustees Of The University Of Illinois Method and system for extracting web query interfaces
US7814084B2 (en) * 2007-03-21 2010-10-12 Schmap Inc. Contact information capture and link redirection
JP2011003182A (en) * 2009-05-19 2011-01-06 Studio Ousia Inc Keyword display method and system thereof

Also Published As

Publication number Publication date
CN104346405A (en) 2015-02-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant