CN102456050A - Method and device for extracting data from webpage - Google Patents

Method and device for extracting data from webpage

Info

Publication number
CN102456050A
Authority
CN
China
Prior art keywords
node
dimension
constraint condition
current
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105276359A
Other languages
Chinese (zh)
Other versions
CN102456050B (en)
Inventor
郑长松
肖巍
王全礼
杨俊拯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MIGU Music Co Ltd
Original Assignee
China Mobile Group Sichuan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Sichuan Co Ltd
Priority to CN201010527635.9A
Publication of CN102456050A
Application granted
Publication of CN102456050B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for extracting data from a webpage. The method comprises: step A, defining division rules for the nodes in a webpage, and obtaining a constraint rule set for extracting data from webpages according to the node division rules and the parsing of training sample webpages; and step B, extracting data from a webpage to be extracted by means of the constraint rule set. The method avoids extracting data by writing regular expressions and therefore saves human resources.

Description

Method and device for extracting data from a webpage
Technical field
The present invention relates to data service technology, and in particular to a method and a device for extracting data from a webpage.
Background technology
In data service technology, current web page information extraction is implemented by writing regular expressions. Specifically, a corresponding regular expression must be written for each website, and even for each column within a website. This approach mainly targets a website with a relatively small amount of information, or a single column within a website, and suits short-term study of a small number of websites.
However, the technologies and layouts used by websites change rapidly, and every such change requires the corresponding regular expressions to be rewritten. As a result, the massive data of the whole web cannot be processed in real time. Moreover, regular expressions are written manually, so rewriting them whenever the technology or layout changes consumes too many human resources.
Summary of the invention
The invention provides the method for extracted data from webpage, to avoid saving human resources through writing the mode extracted data of regular expression.
Technical scheme provided by the invention comprises:
A kind of from webpage the method for extracted data, this method comprises:
Steps A, the division rule of node in the definition webpage according to the node division rule with to the parsing of training sample webpage, obtains the constraint rule set that is used for extracting the webpage data;
Step B utilizes said constraint rule set extracted data from webpage to be extracted.
A kind of from webpage the device of extracted data, comprising:
Processing unit is used for defining the division rule of webpage node, according to the node division rule with to the parsing of training sample webpage, obtains the constraint rule set that is used for extracting the webpage data;
Extracting unit is used for utilizing said constraint rule set from webpage extracted data to be extracted.
Can find out by above technical scheme, among the present invention,,, obtain the constraint rule set that is used for extracting the webpage data according to the node division rule with to the parsing of training sample webpage through the division rule of node in the definition webpage; And utilize said constraint rule to gather extracted data from webpage to be extracted; Realized accomplishing data pick-up according to node self attributes and definite rule constrain; Avoided making the template or the mode of writing regular expression to measure, also eliminated that website revision or technology change that the template of bringing defines again simultaneously or regular expression such as writes again at the influence of factor according to the website and webpage structure;
Further; Among the present invention,, just can accurately carry out data pick-up in real time to representing " with category node but the webpage of different web sites or different structure " in case rule constrain is established; Improved the scope of application of method and ageing greatly; Reduced manual intervention to a great extent, improved the quality of Information Retrieve by Search Engineer and the promptness problem of information updating greatly simultaneously, made search engine needn't be limited by front end webpage representation technology or the changeable puzzlement of format again.
Description of drawings
Fig. 1 is the basic flowchart provided by an embodiment of the invention;
Fig. 2 is a flowchart of step 102 provided by an embodiment of the invention;
Fig. 3 is a detailed flowchart of step 204 provided by an embodiment of the invention;
Fig. 4 is a flowchart for determining the constraint sets of nodes in step 205, provided by an embodiment of the invention;
Fig. 5 is a flowchart of step 209 provided by an embodiment of the invention;
Fig. 6 is a flowchart for determining the extreme-value constraint condition set of dimensions on node attribute values, provided by an embodiment of the invention;
Fig. 7 is a detailed flowchart of step 103 provided by an embodiment of the invention;
Fig. 8 is a flowchart of step 701 provided by an embodiment of the invention;
Fig. 9 is a detailed flowchart of step 702 provided by an embodiment of the invention;
Fig. 10 is a flowchart of step 703 provided by an embodiment of the invention;
Fig. 11 is a flowchart of the equivalence-relation division constraint provided by an embodiment of the invention;
Fig. 12 is a flowchart of the ordering constraint of the division relation provided by an embodiment of the invention;
Fig. 13 is a flowchart of the association constraint between dimensions provided by an embodiment of the invention;
Fig. 14 is a flowchart of step 705 provided by an embodiment of the invention;
Fig. 15 is a structural diagram of the device provided by an embodiment of the invention.
Embodiment
The method provided by the invention extracts data from webpages. It first uses the attributes of the nodes themselves, or of the relations between nodes, to define node division rules. Then, from a supplied set of ordinary training sample webpages, a specified set of dimensions to extract, and the path (XPath) at which each dimension of that set occurs in the training sample webpages, it applies an analysis method based on rough-set equivalence partitioning and inter-dimension constraints to compute, for each dimension, the rough-set-based equivalent constraint condition set, the constraint condition set of dimensions on node attribute values, and so on. The validity and generality of the computed constraint condition sets can be checked by reverse verification, and the sets are finally applied to webpage data extraction. The constraint sets can also be refined continuously during later extraction; in other words, the method provided by the invention is essentially an open, automatic, continuously self-improving process.
To make the objects, technical schemes and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
Referring to Fig. 1, Fig. 1 is the basic flowchart provided by an embodiment of the invention. As shown in Fig. 1, the flow may comprise the following steps:
Step 101: define the division rules for the nodes in a webpage.
In step 101, the nodes of a webpage are obtained by parsing the webpage into a DOM tree and then converting the DOM tree into individual nodes, which form a node set.
The division rules in step 101 may be defined according to factors such as the information contained in each node and its position in the webpage, or according to actual conditions; the embodiment of the invention places no specific limit on them.
Preferably, as one embodiment of the invention, the node division rules may include at least one of the following:
1) The node position feature, i.e. the position of the node in the DOM tree;
2) The text type of the node, i.e. whether the node is text, a comment, a style, etc.;
3) The HTML tag of the node;
4) The node content;
5) The character length of the node content;
6) The number of child nodes that have content, among the child nodes the node contains;
7) The region of the HTML markup in which the node is located;
8) The number of sibling nodes of the node;
9) A regular expression over the node text;
10) Others: node division rules that can be defined according to different requirements.
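As an illustration of the attributes listed above, the following minimal Python sketch parses a page with the standard-library `html.parser` module and computes a few of them for each node. The names `Node`, `TreeBuilder`, `node_path`, `node_features` and `parse_nodes` are hypothetical and are not part of the patent, and the sketch assumes markup in which every tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One element of the parsed page (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def node_path(node):
    """XPath-like position of the node in the tree (rule 1)."""
    parts = []
    while node.parent is not None:
        same_tag = [c for c in node.parent.children if c.tag == node.tag]
        parts.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = node.parent
    return "/" + "/".join(reversed(parts))


def node_features(node):
    """A few of the listed attributes: rules 1, 3, 5, 6 and 8."""
    return {
        "path": node_path(node),
        "tag": node.tag,
        "content_length": len(node.text),
        "content_children": sum(1 for c in node.children if c.text),
        "siblings": len(node.parent.children) - 1 if node.parent else 0,
    }


def walk(node):
    """Yield every descendant of node in document order."""
    for child in node.children:
        yield child
        yield from walk(child)


def parse_nodes(html):
    """Parse the page and flatten the tree into a node list."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return list(walk(builder.root))
```

Calling `parse_nodes` on a page and mapping `node_features` over the result gives one attribute record per node, which is the kind of input the division rules operate on.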
Step 102: obtain, according to the node division rules and the parsing of the training sample webpages, the constraint rule set used to extract data from webpages.
In step 102 there is more than one training sample webpage. They may be kept together in one set, such as a training sample set (the description below takes a training sample set as the example). The samples are chosen according to actual conditions, need no special processing, and may be ordinary webpages.
How the obtaining operation in step 102 is implemented is described below with reference to Fig. 2.
Step 103: extract data from the webpage to be extracted by using said constraint rule set.
The above steps thus realize the data extraction method provided by the embodiment of the invention.
Referring to Fig. 2, Fig. 2 is a flowchart of step 102 provided by an embodiment of the invention. In this embodiment, step 102 mainly analyzes a training sample webpage according to the node division rules and identifies which constraint rules the nodes holding the specified dimensions satisfy. These constraint rules are then applied to the other sample webpages. If they identify the information of the specified dimension correctly and uniquely in all sample webpages, the constraint rules are kept and form the constraint rule set. If not, additional node division rules are added and the process is repeated until an effective constraint rule set is obtained. Step 102 is described in detail below.
As shown in Fig. 2, the flow may comprise the following steps:
Step 201: specify the dimensions I (D1, D2, D3 ... Dm) to be extracted from the training sample webpages (denoted dimension set I), and specify the set UI that holds the nodes corresponding to dimension set I.
In step 201, the dimensions in the specified dimension set I are the result objects to be extracted from the training sample webpages, for example the title, the content, and the publication time. A dimension may correspond one-to-one with the nodes in set UI, or a many-to-one relationship may exist. The node in set UI that corresponds to a dimension is essentially the position of that dimension in the training sample webpage, i.e. its path (XPath), such as the XPath of the title in a given training sample webpage. There are many ways to obtain these XPaths; using a Firefox browser plug-in is one very simple way.
Step 202: traverse the training sample webpages in the training sample set, taking the webpage reached as the current sample.
Step 203: parse the current sample into a DOM tree, then convert the DOM tree into node set U.
In step 203, set U may be identical to set UI or different from it; for example, set U is a subset of set UI, or set U and set UI share some nodes.
Step 204: divide set U into different division sets according to the different node division rules.
That is, step 204 yields as many division sets as there are node division rules defined in step 101; see the flow shown in Fig. 3 for details.
Referring to Fig. 3, Fig. 3 is a detailed flowchart of step 204 provided by an embodiment of the invention. As shown in Fig. 3, the flow may comprise the following steps:
Step 301: traverse the node division rules defined in step 101, taking the rule reached as the current division rule;
Step 302: traverse set U, taking the node reached as the current node;
Step 303: judge whether the current node satisfies the current division rule; if so, execute step 304; if not, execute step 305;
Step 304: add the current node to the division set corresponding to the current division rule, then execute step 305;
Step 305: judge whether set U still contains nodes that have not been traversed; if so, take one of them as the current node and return to step 303; if not, execute step 306;
Step 306: judge whether the defined node division rules still include rules that have not been traversed; if so, take one of them as the current division rule and return to step 302; otherwise, end the current flow.
Steps 301 to 306 above thus realize the flow of step 204.
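The nested traversal of steps 301 to 306 can be sketched in Python as follows. This is a minimal illustration, assuming that each node is a dictionary of attribute values and that each division rule is a named predicate; the names `divide_nodes`, `sample_nodes` and `sample_rules` are hypothetical.

```python
def divide_nodes(nodes, rules):
    """Return {rule name: [nodes satisfying the rule]} for node set U."""
    division_sets = {name: [] for name in rules}
    for name, satisfies in rules.items():         # steps 301 and 306: each rule
        for node in nodes:                        # steps 302 and 305: each node in U
            if satisfies(node):                   # step 303
                division_sets[name].append(node)  # step 304
    return division_sets


# Illustrative data only: three nodes and two division rules.
sample_nodes = [
    {"id": 1, "tag": "h1", "length": 12},
    {"id": 2, "tag": "p", "length": 10},
    {"id": 3, "tag": "p", "length": 480},
]
sample_rules = {
    "tag_is_h1": lambda n: n["tag"] == "h1",
    "short_text": lambda n: n["length"] < 20,
}
sample_division = divide_nodes(sample_nodes, sample_rules)
```

One division set is produced per rule, and a node may appear in several division sets or in none.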
Step 205: based on the division sets, determine the constraint set of each node in set UI, and take the constraint sets of all nodes in set UI as the first constraint condition set corresponding to this training sample webpage.
Step 205 essentially determines, for each node in set UI, which of the division sets obtained in step 204 the node appears in, which gives the constraint set of that node; the constraint sets of all nodes in set UI are then combined to obtain the first constraint condition set.
Referring to Fig. 4, Fig. 4 is a flowchart for determining the constraint sets of nodes in step 205, provided by an embodiment of the invention. As shown in Fig. 4, the flow may comprise the following steps:
Step 401: traverse set UI, taking the node reached as the current node;
Step 402: judge whether the division sets obtained in step 204 include a division set that contains the current node; if so, execute step 403; otherwise, execute step 404;
Step 403: store, in the constraint set of the current node, the node attribute value of the current node with respect to that division set, then execute step 404.
Because each division set corresponds to a node division rule, the node attribute value of the current node corresponds to the node division rule of that division set.
Step 404: judge whether set UI still contains nodes that have not been traversed; if so, take one of them as the current node and return to step 402; otherwise, end the current flow.
The above steps thus determine the constraint set of each node in set UI.
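The flow of steps 401 to 404 can be sketched as below. The sketch assumes that nodes are dictionaries with an `id` key and that each division rule has an accessor returning the attribute value it tests; the names `constraint_sets` and `attribute_of` are hypothetical.

```python
def constraint_sets(target_nodes, division_sets, attribute_of):
    """For each node in UI, keep its attribute value for every division set containing it."""
    result = {}
    for node in target_nodes:                                 # steps 401 and 404
        constraints = {}
        for rule, members in division_sets.items():           # step 402
            if any(member is node for member in members):
                constraints[rule] = attribute_of[rule](node)  # step 403
        result[node["id"]] = constraints
    return result


# Illustrative data only: two target nodes and two division sets.
title = {"id": "title", "tag": "h1", "length": 12}
date = {"id": "date", "tag": "span", "length": 10}
example_division = {"tag_is_h1": [title], "short_text": [title, date]}
example_accessors = {
    "tag_is_h1": lambda n: n["tag"],
    "short_text": lambda n: n["length"],
}
first_constraints = constraint_sets([title, date], example_division, example_accessors)
```

The result for one training sample webpage plays the role of the first constraint condition set: one constraint set per target node.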
Step 206: for any two nodes in set UI, compute the binary relations, on at least one attribute of the two nodes, between the two corresponding dimensions, and record the results in the second constraint condition set (also called the inter-dimension constraint set) corresponding to this training sample webpage.
Depending on the type of the node attribute values and on the association between nodes, the binary relations in the embodiment of the invention mainly include at least one of the following:
(1) Equivalence relation: a binary relation that every node attribute value has, used to judge whether two nodes are equal on the same attribute, for example whether both nodes are text nodes, or whether both have child nodes;
(2) Comparison relation: an optional binary relation used to compare the values of two nodes on the same attribute; the comparison result is one of three kinds: greater than, less than, or equal to;
(3) Distance relation: available for every numeric attribute; the result is the absolute value of the difference between the values of the two nodes on that numeric attribute;
(4) Others: further user-defined relations based on attribute values may be added.
Take as an example the case where the two dimensions corresponding to two nodes in set UI are the title T_p and the date D_p, and the chosen attribute is the position attribute. The position attribute supports both the comparison relation and the distance relation, so at least one of these two relations between the title and the date can be computed on the position attribute. Taking the computation of both relations as the example, suppose the two results are: comparison relation T_p > D_p; distance relation |T_p - D_p| < 5. These results are then recorded in the inter-dimension constraint set corresponding to the current sample.
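A minimal Python sketch of the three named binary relations, assuming nodes are dictionaries of attribute values; the function name `binary_relations` and the sample positions are illustrative assumptions only.

```python
def binary_relations(node_a, node_b, attribute):
    """Equivalence, comparison and distance relations of two nodes on one attribute."""
    a, b = node_a[attribute], node_b[attribute]
    relations = {"equivalent": a == b}
    # Comparison and distance are computed here only for numeric attributes.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        relations["comparison"] = ">" if a > b else "<" if a < b else "="
        relations["distance"] = abs(a - b)
    return relations


# Illustrative positions for a title node and a date node.
title_node = {"position": 7, "tag": "h1"}
date_node = {"position": 4, "tag": "span"}
position_relations = binary_relations(title_node, date_node, "position")
```

With these sample values the title is positioned after the date and three positions away from it, which is consistent with the recorded constraints T_p > D_p and |T_p - D_p| < 5 in the example above.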
Step 207: judge whether the training sample set still contains training sample webpages that have not been traversed; if so, execute step 208; otherwise, execute step 209.
Step 208: take one of the training sample webpages that has not been traversed as the current sample and return to step 203.
Step 209: determine the rough-set-based equivalent constraint condition set from the first constraint condition sets computed in step 205.
Referring to Fig. 5, Fig. 5 is a flowchart of step 209 provided by an embodiment of the invention. As shown in Fig. 5, the flow may comprise the following steps:
Step 501: traverse the constraint set of each node in the first constraint condition set, taking the node constraint set reached as the current constraint set.
Step 502: judge whether the node division rule corresponding to the current constraint set is of discrete type or continuous type, and compute, according to the result, the probability distribution of each node in the current constraint set.
In step 205 the constraint set of a node corresponds to division sets, and a division set depends on, and therefore corresponds to, a node division rule. Step 502 can therefore readily determine, from the current constraint set, the node division rule corresponding to it.
In this embodiment, if the corresponding node division rule is of discrete type, the computation in step 502 may follow Formula 1:
$P_{V_i} = \dfrac{Count_i}{\sum_{0}^{N} Count}$ (Formula 1)
where i is the identifier of a node, greater than 0 and less than or equal to N; $P_{V_i}$ is the distribution probability of node i on the dimension corresponding to the current constraint set (a constraint set corresponds to a node and a node corresponds to a dimension, so the dimension corresponding to the current constraint set is readily known); and $Count_i$ is the attribute value of node i.
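One possible reading of Formula 1, sketched in Python under a stated assumption: the probability of a discrete attribute value is the number of times it is observed divided by the total number of observations. The function name `discrete_distribution` and the sample values are hypothetical.

```python
from collections import Counter


def discrete_distribution(values):
    """Relative frequency of each observed discrete attribute value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Illustrative data only: the tag observed for the title node in four samples.
tag_distribution = discrete_distribution(["h1", "h1", "h1", "h2"])
```

Under this reading, a value seen in most training samples receives a high probability, and rarely seen values can be filtered out in the later removal step.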
If the corresponding node division rule is of continuous type and obeys the normal probability distribution below, the computation in step 502 may follow Formula 2:
$f(V_i) = \dfrac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{(V_i-\mu)^2}{2\delta^2}}$ (Formula 2);
where $f(V_i)$ expresses the distribution of the probability $P_{V_i}$ of node $V_i$; $\mu$ is the mathematical expectation, given by $\mu = V_1 P_{V_1} + V_2 P_{V_2} + \dots + V_N P_{V_N}$; and $\delta$ is the variance, given by the following formula:
Figure BSA00000327568200083
where i is greater than 0 and less than or equal to N,
Figure BSA00000327568200084
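Formula 2 and the expectation μ can be evaluated with the Python sketch below. The formula for δ is not reproduced in this text, so the sketch assumes the usual probability-weighted standard deviation; that choice, and the names `expectation`, `spread` and `normal_density`, are assumptions rather than the patent's definition.

```python
import math


def expectation(values, probabilities):
    """mu = V_1 * P_V1 + V_2 * P_V2 + ... + V_N * P_VN."""
    return sum(v * p for v, p in zip(values, probabilities))


def spread(values, probabilities):
    """Assumed form of delta: probability-weighted standard deviation."""
    mu = expectation(values, probabilities)
    return math.sqrt(sum(p * (v - mu) ** 2 for v, p in zip(values, probabilities)))


def normal_density(value, mu, delta):
    """Normal density f(V_i) with mean mu and spread delta (Formula 2)."""
    exponent = -((value - mu) ** 2) / (2 * delta ** 2)
    return math.exp(exponent) / (delta * math.sqrt(2 * math.pi))


# Illustrative data only: two equally likely attribute values.
sample_values = [2.0, 4.0]
sample_probabilities = [0.5, 0.5]
sample_mu = expectation(sample_values, sample_probabilities)
sample_delta = spread(sample_values, sample_probabilities)
```

The density is highest at the expectation and falls off symmetrically, so attribute values far from those seen in training receive a low score.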
Step 503: remove from the current constraint set every node whose probability differs from the probability of the specified target node;
Because a constraint set is a set of nodes and attribute values, removing a node also removes its attribute values.
Step 504: judge whether all node constraint sets in the first constraint condition set have been traversed; if so, execute step 505; otherwise, take one of the node constraint sets not yet traversed as the current constraint set and return to step 502.
Step 505: take all the sets obtained through step 503 as the rough-set-based equivalent constraint condition set.
Because a constraint set corresponds to a node and a node corresponds to a dimension, every rough-set-based equivalent constraint condition set determined in step 505 corresponds to a dimension; and, from the description of step 503, each such set is equivalent to the target node corresponding to its dimension.
Step 210: merge the second constraint condition sets corresponding to the different training sample webpages, and take the merged set as the constraint condition set of dimensions on node attribute values.
The merging in step 210 determines the relations between dimensions from the viewpoint of all training sample webpages, i.e. it finds the general or shared characteristics of these relations. From the description of step 206, the second constraint condition set corresponding to each training sample webpage records several binary relations. Taking the case where the recorded binary relations are the comparison relation and the distance relation as the example (other cases follow a similar principle), step 210 may specifically comprise:
(1) Merging of comparison relations: when the second constraint condition set of one training sample webpage contains a comparison relation between two nodes, judge whether that comparison relation also exists in the second constraint condition sets of the other training sample webpages; if so, keep it; if not, delete it. A concrete implementation may be: treat greater than, less than, and equal to as the enumerated values of a discrete attribute, obtain the probability of the comparison relation by Formula 1 above, merge the relations whose distribution is the same across the different sample pages, and remove the relations that differ.
(2) Merging of distance relations: when the distance relation between two dimensions has a first value in the second constraint condition set of one training sample webpage and a second value in that of another training sample webpage, the merged distance relation between the two dimensions is the range of values from the first value to the second value, i.e. simply the range covered by the distances. A concrete implementation is: take each distance value as a sample point, treat these as samples of a continuous numeric attribute, obtain the probability of the distance relation by Formula 2 above, compute the range covered by the distance values, and progressively determine the region over which this relation is distributed.
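The two merging rules can be sketched in Python as follows. The sketch implements only the simple description (keep a comparison relation if every sample agrees; widen a distance to the range it covers), not the probability-based variant using Formula 1 and Formula 2. The function names and the dictionary layout keyed by dimension pairs are hypothetical.

```python
def merge_comparisons(per_sample):
    """Keep a comparison relation only if every sample records the same one."""
    if not per_sample:
        return {}
    merged = dict(per_sample[0])
    for sample in per_sample[1:]:
        merged = {pair: rel for pair, rel in merged.items() if sample.get(pair) == rel}
    return merged


def merge_distances(per_sample):
    """Widen each distance relation to the (min, max) range seen across samples."""
    merged = {}
    for sample in per_sample:
        for pair, distance in sample.items():
            low, high = merged.get(pair, (distance, distance))
            merged[pair] = (min(low, distance), max(high, distance))
    return merged


# Illustrative data only: relations recorded for two training sample webpages.
comparisons = [
    {("title", "date"): ">", ("title", "content"): "<"},
    {("title", "date"): ">", ("title", "content"): ">"},
]
distances = [{("title", "date"): 3}, {("title", "date"): 5}]
merged_comparisons = merge_comparisons(comparisons)
merged_distances = merge_distances(distances)
```

In the sample data the title and date relation survives because both pages agree on it, while the title and content relation is dropped because the pages disagree.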
The above steps thus determine the rough-set-based equivalent constraint condition set and the constraint condition set of dimensions on node attribute values.
Step 211: verify the validity of the constraint condition sets obtained in steps 209 and 210; when they are found invalid, execute step 212.
In fact, step 211 mainly verifies whether the constraint condition sets obtained in steps 209 and 210 identify every dimension in the specified dimension set I. If so, the sets are determined to be valid and can be used directly for data extraction; that flow is fairly simple and is not repeated here. Otherwise, the sets are determined to be invalid and step 212 must also be executed. How the verification is performed is described below.
In this embodiment, the verification in step 211 specifically comprises:
For any two dimensions D1 and D2 in each training sample webpage, perform the following processing:
Let the node set corresponding to dimension D1 be $U_{D1}\{N_{x1}, \dots, N_{xm}\}$, the node set corresponding to dimension D2 be $U_{D2}\{N_{y1}, \dots, N_{yn}\}$, and the set of binary relations on dimensions D1 and D2 be $UR\{R_1, \dots, R_n\}$. Combine the node sets of D1 and D2 to obtain the node pairs $U_{D1D2}\{(N_{xi}, N_{yj}) \mid i \in \{1, \dots, m\},\ j \in \{1, \dots, n\}\}$. Traverse these node pairs and determine those that satisfy every binary relation in the binary relation set, obtaining the set $U_{D1}\{(N_x, N_y), \dots\}$;
After this processing has been completed for every pair of dimensions in said training sample webpage, judge whether the resulting set of node pairs satisfying all binary relations in the binary relation set contains exactly one pair. If so, the constraint condition sets obtained in steps 209 and 210 identify every dimension in the specified dimension set, i.e. they are valid. Otherwise, they cannot identify every dimension in the specified dimension set, i.e. they are invalid.
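The verification for one pair of dimensions can be sketched in Python as below, assuming that candidate nodes are represented by their position values and that each binary relation is a two-argument predicate. The names `satisfying_pairs` and `constraints_are_valid` and the sample data are hypothetical.

```python
from itertools import product


def satisfying_pairs(nodes_d1, nodes_d2, relations):
    """All (x, y) pairs from the two node sets that satisfy every binary relation."""
    return [
        (x, y)
        for x, y in product(nodes_d1, nodes_d2)
        if all(relation(x, y) for relation in relations)
    ]


def constraints_are_valid(nodes_d1, nodes_d2, relations):
    """The constraints identify the two dimensions only if exactly one pair remains."""
    return len(satisfying_pairs(nodes_d1, nodes_d2, relations)) == 1


# Illustrative data only: candidate positions for the title and date dimensions.
title_candidates = [7, 20]
date_candidates = [4]
learned_relations = [
    lambda x, y: x > y,           # comparison relation
    lambda x, y: abs(x - y) < 5,  # distance relation
]
```

With both relations only the pair (7, 4) survives, so the constraints are valid; with the comparison relation alone two pairs survive, which corresponds to the invalid case that leads to step 212.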
Step 212: determine the extreme-value constraint condition set of dimensions on node attribute values.
For the extreme-value ordering relation, the objects of comparison are relative nodes. Among these nodes only the nodes corresponding to the dimensions in the specified dimension set are fixed; the other nodes are not descriptive. The comparison order of the values can therefore be determined only by taking the maximum or the minimum.
To reduce the amount of computation in step 212, the embodiment of the invention provides the flow shown in Fig. 6.
Referring to Fig. 6, Fig. 6 is the operational flowchart of step 212 provided by an embodiment of the invention. As shown in Fig. 6, the flow may comprise:
Step 601: traverse the dimensions in the specified dimension set I, taking the dimension reached as the current dimension. If the node set $U_n$ corresponding to the current dimension contains more than one node, and the nodes in $U_n$ belong to the same training sample webpage, traverse each attribute shared by the specified target node and all nodes corresponding to the current dimension, taking the attribute reached as the current attribute.
Step 602: compare the specified target node on the current attribute with each of the nodes corresponding to the current dimension to obtain comparison results, and determine the value of the specified target node on the current attribute from those results.
The comparison results obtained in step 602 are of three kinds: greater than, equal to, and less than.
The value of the specified target node on the current attribute is determined from the comparison results as follows:
If the comparison results contain only less than (the counts of equal to and greater than are 0), the value is Top;
If the comparison results contain only less than and equal to (the count of greater than is 0), the value is Tops;
If the comparison results contain only greater than (the counts of less than and equal to are 0), the value is Bottom;
If the comparison results contain only greater than and equal to (the count of less than is 0), the value is Bottoms;
If the comparison results contain both greater than and less than (equal to may or may not occur), the value is Middle;
If the comparison results contain only equal to (the counts of greater than and less than are both 0), the value is Identical;
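The six cases can be sketched in Python as below. The sketch assumes the reading in which Top means the target is less than every other node, Bottom means it is greater than every other node, and the plural forms additionally allow ties; the function name `classify_extreme` is hypothetical.

```python
def classify_extreme(target, others):
    """Classify the target's attribute value relative to the other nodes' values."""
    less = sum(1 for other in others if target < other)
    equal = sum(1 for other in others if target == other)
    greater = sum(1 for other in others if target > other)
    if less and greater:
        return "Middle"                         # both directions occur
    if less:
        return "Tops" if equal else "Top"       # never greater than another node
    if greater:
        return "Bottoms" if equal else "Bottom" # never less than another node
    return "Identical"                          # only ties (or nothing to compare)
```

For example, a target value of 1 compared with the values 2 and 3 is classified as Top, while a target value of 3 compared with 1 and 5 is classified as Middle.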
Step 603: determine the output result corresponding to the current attribute from the value so determined.
Step 603 is specifically:
If the value is Top or Bottom, the output result is 1, used as the tail of the extreme-value sequence;
If the value is Tops or Bottoms, the output result is {2, m-1}, used as the beginning or middle part of the extreme-value sequence, where m is the number of dimensions in the specified dimension set;
If the value is Identical, the output result is m, used as the beginning or middle part of the extreme-value sequence;
If the value is Middle, the output result is 0, used as the tail of the sequence for subsequent processing.
Step 604: judge whether attributes remain that have not been traversed; if so, take one of them as the current attribute and return to step 602; if not, execute step 605;
Step 605: group the output results corresponding to the attributes into sets, and process these sets to obtain the extreme-value sequence set corresponding to the current dimension. Judge whether the preset merging condition is currently met; if so, execute step 606; otherwise, if dimensions remain that have not been traversed, take one of them as the current dimension and return to step 601.
Step 605 is specially: the arrangement of elements in first set that will obtain for m according to said output result obtains arranging set U p, all being divided into Top to all elements in first set, Bottom confirms to arrange set U p+ be that 1 second set that obtains is the corresponding qualified first extreme value sequence of this dimension according to said output result;
From said set of node U nIn determine and satisfy to arrange set U p+ according to said output result be 2, and m-1} obtain the 3rd the set subclass U Ns, transferring the bottoms in the 3rd set to bottom, tops transfers top to, is added to arrange set U p+ according to said output result be 2, in the 3rd set that m-1} obtains, form the secondary extremal sequence; Merge said first extreme value sequence and secondary extremal sequence, the extreme value sequence that obtains after merging is confirmed as the corresponding extreme value arrangement set of current dimension.
Here, merging the first extreme value sequence (denoted S1) and the second extreme value sequence (denoted S2) is specifically: if S1 ∈ S2, then S1 ∩ S2 = S2; if S2 ∈ S1, then S1 ∩ S2 = S1; if S1 = S2, then S1 ∩ S2 = S1; otherwise, φ is returned.
It should be noted that, in the embodiment of the invention, the preset merge condition is that the number of elements in the third set is 0.
Step 606: merge the extreme value sequence sets corresponding to the dimensions, and take the merged extreme value sequence set as the extreme value constraint condition set U0 of dimensions on node attribute values.
Specifically, the merging in step 606 can be carried out in the same way as the merging of the first and second extreme value sequences in step 605, and is not repeated here.
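The merge rule used in steps 605 and 606 can be sketched as follows. This is only an illustration under stated assumptions: an extreme value sequence is modeled as a Python frozenset, the symbol ∈ in the rule is read as "is contained in", and the function name merge_extreme_sequences is hypothetical.

```python
def merge_extreme_sequences(s1: frozenset, s2: frozenset) -> frozenset:
    """Merge two extreme value sequences S1 and S2 following the stated rule."""
    if s1 == s2:
        return s1            # identical sequences: keep S1
    if s1 < s2:
        return s2            # S1 is contained in S2: the rule keeps S2
    if s2 < s1:
        return s1            # S2 is contained in S1: the rule keeps S1
    return frozenset()       # no containment either way: return the empty set


merged = merge_extreme_sequences(frozenset({"top"}), frozenset({"top", "bottom"}))
```

Applying the same function pairwise over the per-dimension results would correspond to the merging described for step 606.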
So far, the extreme value constraint condition set of information dimensions on node attribute values has been obtained. It should be noted that if the set U0 obtained by calculating over all training sample webpages is an empty set, the specified extraction dimension set I is considered recognizable under the node division rules. If the set U0 is an empty set and the results drawn from the other two kinds of constraints are larger than the actual results, the specified extraction dimension set I is considered unrecognizable under the node division rules. In that case, node division rules need to be added and the calculation repeated.
So far, the operation in step 102 of obtaining the constraint rule set used to extract data from webpages has been realized through the above steps.
For the specific operations in step 103, see Fig. 7.
Referring to Fig. 7, Fig. 7 is a detailed flowchart of step 103 provided by the embodiment of the invention. This flow applies when the rough-set-based equivalence constraint condition set and the constraint condition set of dimensions on node attribute values cannot identify every dimension in the specified dimension set; suitable information nodes are screened according to the constraint rule set generated by training, thereby completing the information extraction. As shown in Fig. 7, the flow may comprise the following steps:
Step 701: generate the information set corresponding to the webpage to be extracted.
Here, the information set is a representation of the webpage and is the candidate set of the nodes where all data fields to be extracted are located. For the specific generation operation, see the flow shown in Fig. 8:
Referring to Fig. 8, Fig. 8 is a flowchart of step 701 provided by the embodiment of the invention. As shown in Fig. 8, the flow may comprise:
Step 801: parse the webpage to be extracted into a DOM tree.
Step 802: traverse all nodes of the DOM tree, take the traversed node as the current node, and execute step 803.
Step 803: judge whether this node is a comment node; if so, execute step 804; otherwise, execute step 805.
Comment nodes play only an auxiliary role in webpage design and are unrelated to the displayed information. Therefore, the data extracted by the invention is generally not located on comment nodes.
Step 804: judge whether the DOM tree still has untraversed nodes; if so, take one of the untraversed nodes as the current node and return to step 803; if not, return to step 801.
Step 805: add the current node to the information set, and return to step 804.
So far, the information set can be generated through the above steps.
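The flow of Fig. 8 can be sketched as follows. This is a minimal illustration, assuming the page is well-formed markup that the standard-library xml.dom.minidom parser accepts (real webpages would usually need a more tolerant HTML parser); the function name build_information_set is hypothetical.

```python
from xml.dom import Node, minidom


def build_information_set(markup: str) -> list:
    """Collect every DOM node except comment nodes, in document order."""
    document = minidom.parseString(markup)        # step 801: parse into a DOM tree
    information_set = []
    stack = [document.documentElement]
    while stack:                                  # steps 802/804: visit every node
        node = stack.pop()
        if node.nodeType != Node.COMMENT_NODE:    # step 803: skip comment nodes
            information_set.append(node)          # step 805: keep the node
        stack.extend(reversed(node.childNodes))
    return information_set


page = "<html><body><!-- layout note --><p>Title</p></body></html>"
names = [node.nodeName for node in build_information_set(page)]
```

In this sketch text nodes stay in the information set, because the flow only excludes comment nodes.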
Step 702: divide the information set generated in step 701 according to the defined node division rules to obtain the subsets of the information set.
Referring to Fig. 9, Fig. 9 is a detailed flowchart of step 702 provided by the embodiment of the invention. As shown in Fig. 9, the flow may comprise the following steps:
Step 901: judge whether the information set generated in step 701 is empty; if it is empty, execute step 904; otherwise, execute step 902.
Step 902: traverse the information set, take the traversed node of the information set as the current node, and execute step 903.
Step 903: for each defined node division rule, judge whether the current node satisfies the division rule; if so, add the current node to the subset corresponding to that node division rule and return to step 901; if not, return to step 901.
Step 904: take the subsets finally obtained as the subsets resulting from dividing the information set U0 generated in step 701 according to the defined node division rules.
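The division of Fig. 9 can be sketched as follows. The sketch assumes that a node is a small dictionary and that each node division rule is a predicate function; the rule names and the function partition_information_set are hypothetical and only illustrate the idea.

```python
from typing import Callable, Dict, List

NodeInfo = Dict[str, str]


def partition_information_set(
    information_set: List[NodeInfo],
    division_rules: Dict[str, Callable[[NodeInfo], bool]],
) -> Dict[str, List[NodeInfo]]:
    """Place each node into the subset of every division rule it satisfies."""
    subsets: Dict[str, List[NodeInfo]] = {name: [] for name in division_rules}
    for node in information_set:                   # step 902: traverse the nodes
        for name, rule in division_rules.items():  # step 903: test every rule
            if rule(node):
                subsets[name].append(node)
    return subsets                                 # step 904: the resulting subsets


nodes = [{"tag": "h1", "text": "Song title"}, {"tag": "span", "text": "2010"}]
rules = {
    "is_heading": lambda n: n["tag"] in {"h1", "h2"},
    "is_numeric": lambda n: n["text"].isdigit(),
    "has_text": lambda n: bool(n["text"]),
}
subsets = partition_information_set(nodes, rules)
```

A node can land in several subsets, which is why the later merging step is needed.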
Step 703: merge the subsets obtained in step 702 to obtain a feature value set.
Because different subsets may contain the same node, and subsequent data extraction is carried out node by node, the subsets obtained in step 702 need to be merged first to facilitate subsequent data extraction.
Merging the subsets obtained in step 702 mainly means traversing all subsets from step 702, merging the different feature values of the same node, and generating a new mapping table "with the element as key and the feature tuple as value". For details, see the flow shown in Fig. 10:
Referring to Fig. 10, Fig. 10 is a flowchart of step 703 provided by the embodiment of the invention. As shown in Fig. 10, the flow may comprise the following steps:
Step 1001: initialize a new feature value set.
The feature value set in step 1001 is essentially an empty mapping table constructed "with the element as key and the feature tuple as value".
Step 1002: traverse the subsets obtained in step 702, take the traversed subset as the current subset, and execute step 1003.
Step 1003: traverse the nodes in the current subset, take the traversed node as the current node, and execute step 1004.
Step 1004: judge whether the feature value set contains the current node; if so, execute step 1006; otherwise, execute step 1005.
The initialized feature value set is empty and therefore cannot contain the current node, so step 1005 needs to be executed.
Step 1005: add the current node and its corresponding feature value to the feature value set, then execute step 1007.
Usually, a node corresponds to more than one feature value.
Step 1006: add the feature value corresponding to the current node to the feature value set, then execute step 1007.
Step 1007: mark the current node as visited (Signed), and execute step 1008.
Step 1008: judge whether the current subset still has untraversed nodes; if so, take one of the untraversed nodes as the current node and return to step 1004; if not, execute step 1009.
Step 1009: mark the current subset as processed, then execute step 1010.
Step 1010: judge whether the subsets obtained in step 702 still include an untraversed subset; if so, take one of the untraversed subsets as the current subset and return to step 1003; if not, end the current flow.
So far, the subsets obtained in step 702 have been merged through the above operations to obtain the feature value set.
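The merging of Fig. 10 can be sketched as follows. The sketch assumes that each subset is a list of (node identifier, feature value) pairs keyed by the name of its division rule; these data shapes and the function name merge_subsets are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def merge_subsets(
    subsets: Dict[str, List[Tuple[str, str]]],
) -> Dict[str, Tuple[Tuple[str, str], ...]]:
    """Build a mapping 'node -> tuple of (rule name, feature value)'."""
    feature_values: Dict[str, List[Tuple[str, str]]] = {}   # step 1001: empty table
    for rule_name, members in subsets.items():              # step 1002: each subset
        for node_id, value in members:                      # step 1003: each node
            # steps 1004-1006: create the entry if missing, then append the value
            feature_values.setdefault(node_id, []).append((rule_name, value))
    return {node: tuple(values) for node, values in feature_values.items()}


subsets = {
    "tag": [("n1", "h1"), ("n2", "span")],
    "font_weight": [("n1", "bold")],
}
feature_value_set = merge_subsets(subsets)
```

The explicit "visited" and "processed" marks of steps 1007 and 1009 are not needed here because the for loops already visit each node and subset exactly once.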
Step 704: apply rule constraints to the feature value set according to the rough-set-based equivalence constraint condition set, the constraint condition set of dimensions on node attribute values, and the extreme value constraint condition set of dimensions on node attribute values, to obtain the association constraint mapping table used for data extraction.
Step 704 mainly reduces the size of the candidate set and, through progressive screening, finally finds the suitable target nodes. It is described in detail below:
1) Equivalence relation division constraint:
The equivalence relation division constraint adopts the idea of rough-set-based equivalence relation division: the feature value set from step 703, composed of nodes and their feature values, is equivalence-partitioned by dimension, generating a new mapping table "with the dimension as key and the corresponding candidate node set satisfying the division relation as value". In this generation process, nodes that satisfy no dimension are eliminated, and the dimensionality is correspondingly greatly reduced. For details, see the flow shown in Fig. 11:
Referring to Fig. 11, Fig. 11 is a flowchart of the equivalence relation division constraint provided by the embodiment of the invention. As shown in Fig. 11, the flow may comprise the following steps:
Step 1101: traverse the rough-set-based equivalence constraint condition sets corresponding to the dimensions determined in step 209, and take the traversed equivalence constraint condition set of the corresponding dimension as the current constraint condition set.
Step 1102: use the first constraint condition in the current constraint condition set to equivalence-partition the feature value set from step 703.
Because the rough-set-based equivalence constraint condition set contains combinations of attribute values and nodes, a constraint condition in this embodiment is a combination of a node and an attribute value; correspondingly, the first constraint condition in step 1102 is the first combination of node and attribute value in the current constraint condition set.
As for how to use the first constraint condition to equivalence-partition the feature value set from step 703, in a specific implementation the partition can be carried out according to the attribute value in the constraint condition or the attributes of the node itself. Step 1102 yields two kinds of results: one is the set satisfying the first constraint condition, composed of multiple nodes; the other is the set not satisfying the first constraint condition.
Step 1103: judge whether the current constraint condition set contains only the first constraint condition; if so, execute step 1106; otherwise, take the constraint condition after the first constraint condition as the current constraint condition and execute step 1104.
Step 1104: use the current constraint condition to equivalence-partition the set that satisfied the previous constraint condition, which was obtained by partitioning with that previous constraint condition.
Step 1104 yields two kinds of results similar to those of step 1102.
Step 1105: judge whether the current constraint condition is the last constraint condition in the current constraint condition set; if so, execute step 1106; if not, take the constraint condition following the current one as the current constraint condition and return to step 1104.
Step 1106: convert the set satisfying the last constraint condition into a mapping table "with the dimension as key and the node set satisfying the constraints as value";
Step 1107: add the mapping table obtained in step 1106 to the equivalence relation division mapping set;
Step 1108: judge whether the equivalence constraint condition sets of the dimensions determined in step 209 still include one that has not been traversed; if so, take the equivalence constraint condition set of one of the untraversed dimensions as the current constraint condition set and execute step 1102; otherwise, end the current flow.
It can be seen from the above steps that, when dividing with the dimension equivalence constraint condition sets, each division reduces the dimensionality of the set; the more constraint conditions are added, the fewer elements the set contains, which reduces the complexity of the processing algorithm and gives good efficiency.
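The successive partitioning of Fig. 11 can be sketched as follows. The sketch assumes that the feature value set maps each node identifier to a dictionary of attribute values and that each constraint condition is an (attribute, expected value) pair; these shapes and the name equivalence_partition are illustrative assumptions, not the representation fixed by the embodiment.

```python
from typing import Dict, List, Set, Tuple


def equivalence_partition(
    feature_value_set: Dict[str, Dict[str, object]],
    dimension_constraints: Dict[str, List[Tuple[str, object]]],
) -> Dict[str, Set[str]]:
    """Return 'dimension -> nodes satisfying all of its constraint conditions'."""
    mapping: Dict[str, Set[str]] = {}
    for dimension, constraints in dimension_constraints.items():    # steps 1101/1108
        candidates = set(feature_value_set)
        for attribute, expected in constraints:                     # steps 1102-1105
            # Keep only the part of the previous result that satisfies this condition.
            candidates = {
                node for node in candidates
                if feature_value_set[node].get(attribute) == expected
            }
        mapping[dimension] = candidates                             # steps 1106/1107
    return mapping


features = {
    "n1": {"tag": "h1", "bold": True},
    "n2": {"tag": "h1", "bold": False},
    "n3": {"tag": "span", "bold": True},
}
constraints = {"title": [("tag", "h1"), ("bold", True)], "artist": [("tag", "span")]}
division_mapping = equivalence_partition(features, constraints)
```

Each additional condition can only shrink the candidate set, which matches the observation that more constraint conditions leave fewer elements.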
2) Ordering constraint on nodes satisfying the division relation
The ordering constraint is applied after the equivalence partition constraint processing, to the candidate node set contained in each dimension. The nodes are first classified according to the constraint rules; the blocks obtained from the classification are then each sorted internally according to the specified ordering rule; and the TopN elements of each block are then taken as the candidate result set according to the configured conditions. The processing flow is shown in Fig. 12 and is as follows:
Step 1201: traverse the ordering constraint conditions corresponding to all dimensions, and take the traversed ordering constraint condition as the current constraint condition.
In this embodiment, the ordering constraint condition is the extreme value sequence corresponding to the dimension.
Step 1202: from the equivalence relation division mapping table "with the dimension as key and the node set satisfying the constraints as value" contained in the equivalence relation division mapping set, read the candidate node set corresponding to this dimension.
The candidate node set in step 1202 corresponds to the dimension of the current constraint condition.
Step 1203: classify the candidate node set according to the current constraint condition.
In effect, step 1203 classifies the candidate node set according to the extreme value sequence set.
Step 1204: sort each of the blocks obtained from the classification internally.
Because the type or value of a node is either a discrete character type or a continuous type, the ordering rule differs for each type:
A) For the discrete character type, the characters need to be converted into elements of a specified enumeration set (for the specific conversion method, see formula one of the embodiment of the invention) and then sorted by relevance order;
B) For the continuous type, such as the distribution of character lengths, the node attribute generally follows a normal distribution, and relevance ordering is carried out according to the interval into which the value falls; for the specific calculation method, see formula two provided by the embodiment of the invention.
Step 1205: select the specified TopN nodes from each block and put them, in block order, into the same node set.
Step 1206: store this node set in the ordering division constraint mapping table "with the dimension as key and the node set as value";
Step 1207: judge whether there is an ordering constraint condition that has not been traversed; if so, take one of the untraversed ordering constraint conditions as the current constraint condition and return to step 1202; otherwise, end the current flow.
It can be seen that the ordering constraint division further narrows the candidate node set of each dimension by dividing the candidate set from another angle. How the ordering rules for the different node types are defined in this step is critical, particularly the conversion of discrete value sets.
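Steps 1203 to 1205 can be sketched as follows. The classification function block_of and the scoring function relevance are hypothetical stand-ins: the embodiment classifies by the extreme value sequence and ranks by its formulas one and two, neither of which is reproduced here.

```python
from typing import Callable, Dict, Hashable, List, TypeVar

T = TypeVar("T")


def ordering_constraint(
    candidates: List[T],
    block_of: Callable[[T], Hashable],
    relevance: Callable[[T], float],
    top_n: int,
) -> List[T]:
    """Classify candidates into blocks, rank inside each block, keep the top N of each."""
    blocks: Dict[Hashable, List[T]] = {}
    for node in candidates:                               # step 1203: classify
        blocks.setdefault(block_of(node), []).append(node)
    selected: List[T] = []
    for block_id in sorted(blocks):                       # blocks are taken in order
        ranked = sorted(blocks[block_id], key=relevance, reverse=True)  # step 1204
        selected.extend(ranked[:top_n])                   # step 1205: TopN per block
    return selected


candidates = [("a", "top", 0.2), ("b", "top", 0.9), ("c", "bottom", 0.5), ("d", "top", 0.4)]
kept = ordering_constraint(candidates, lambda n: n[1], lambda n: n[2], top_n=2)
```

Sorting the block identifiers is just one way to fix a block order for the sketch; the embodiment only requires that the selected nodes be stored in block order.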
3) Association constraint between dimensions
The association constraint between dimensions divides and reduces the sets from the angle of the association relations between dimensions, further determining the exact range of the node set corresponding to each dimension and thereby completing the location of the information nodes. The association constraint is shown in Fig. 13, and the detailed process is as follows:
Step 1301: traverse the association constraint conditions corresponding to all dimensions, and take the traversed association constraint condition as the current association condition.
In step 1301, the association constraint condition is the constraint condition set of dimensions on node attribute values;
Step 1302: obtain from the ordering division constraint mapping table all candidate nodes of the dimensions related to the current association condition.
Because the association constraint condition is the constraint condition set of dimensions on node attribute values, and that set contains binary relations between two dimensions, the dimensions related to the current association condition are in effect the dimensions involved in the binary relations of the constraint condition set of dimensions on node attribute values.
Step 1303: perform a product mapping over all candidate nodes of the related dimensions that were read, obtaining a new set whose elements are node pairs.
Suppose the dimensions related to the current association condition are dimension a and dimension b, where dimension a has N set elements and dimension b has M set elements. A product mapping is first performed on the nodes of the two dimensions, giving a new set whose elements are node pairs; the number of elements in the set is N*M.
Step 1304: perform the constraint calculation on the elements, i.e. the node pairs, of the new set from step 1303; keep the elements that satisfy the constraint rules and delete the node pairs that do not, obtaining the association constraint mapping table of this dimension.
The constraint calculation in step 1304 can be carried out with existing operations and is not repeated here.
Step 1305: judge whether there is an association constraint condition that has not been traversed; if so, take one of the untraversed dimension association constraint conditions as the current association condition and return to step 1302; otherwise, end the current flow.
It can be seen that the association constraint mainly works from the angle of the association constraints between dimensions, continuing the dimension reduction of the candidate node set of each dimension and further reducing the number of elements in each set.
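Steps 1303 and 1304 can be sketched as follows. The predicate satisfies stands in for the binary relation between two dimensions (for example a comparison or distance relation); its concrete form, the node fields, and the function name are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

NodeInfo = Dict[str, int]


def association_constraint(
    nodes_a: List[NodeInfo],
    nodes_b: List[NodeInfo],
    satisfies: Callable[[NodeInfo, NodeInfo], bool],
) -> List[Tuple[NodeInfo, NodeInfo]]:
    """Form all N*M node pairs and keep only the pairs that satisfy the relation."""
    pairs = list(product(nodes_a, nodes_b))                   # step 1303: product mapping
    return [pair for pair in pairs if satisfies(*pair)]       # step 1304: filter pairs


titles = [{"id": 1, "y": 10}, {"id": 2, "y": 300}]
artists = [{"id": 3, "y": 30}]
# Hypothetical relation: the artist node lies at most 50 units below the title node.
kept_pairs = association_constraint(titles, artists, lambda a, b: 0 < b["y"] - a["y"] <= 50)
```

With N title candidates and M artist candidates the product has N*M pairs, so earlier constraints that shrink N and M directly reduce the cost of this step.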
Step 705: carry out data extraction using the association constraint mapping table obtained in step 704.
In effect, the data extraction of step 705 is a process of restoring the data to its original state. For the processing, see the flow shown in Fig. 14:
Step 1401: traverse the candidate node sets of the dimensions in the association constraint mapping table, and take the candidate node set of the traversed dimension as the current set.
Step 1402: judge whether the number of nodes in the current set is 1; if so, execute step 1403; otherwise, execute step 1406;
The final result of this embodiment requires exactly one node per dimension; more than one node indicates that the information extraction for this dimension has gone wrong.
Step 1403: extract the relevant content of this node as required, that is, remove the webpage markup and related style information.
Step 1404: save this information in the information body set "with the dimension as key and the node information content as value".
Step 1405: judge whether the association constraint mapping table still contains a dimension candidate node set that has not been traversed; if so, take one of the untraversed dimension candidate node sets as the current set and return to step 1402; otherwise, end the current flow;
Step 1406: write the page link of the webpage to be extracted, the dimension identifier, and its candidate node set to the error log.
So far, the data extraction flow provided by the embodiment of the invention has been realized through the above steps.
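The flow of Fig. 14 can be sketched as follows. The sketch assumes that each candidate node is available as a markup string and that removing markup can be approximated with a simple regular expression; both are simplifications, and the function name and logger name are hypothetical.

```python
import logging
import re
from typing import Dict, List

logger = logging.getLogger("extraction")


def extract_information(association_map: Dict[str, List[str]], page_url: str) -> Dict[str, str]:
    """Return 'dimension -> text content' for dimensions with exactly one candidate node."""
    information_body: Dict[str, str] = {}
    for dimension, candidates in association_map.items():     # steps 1401/1405
        if len(candidates) == 1:                              # step 1402
            # Step 1403: crude markup removal, for illustration only.
            text = re.sub(r"<[^>]+>", "", candidates[0]).strip()
            information_body[dimension] = text                # step 1404
        else:
            # Step 1406: record the page link, dimension and candidates in the log.
            logger.error("page=%s dimension=%s candidates=%r", page_url, dimension, candidates)
    return information_body


association_map = {
    "title": ["<h1 class='t'>Song</h1>"],
    "artist": ["<span>A</span>", "<span>B</span>"],
}
information = extract_information(association_map, "http://example.com/page")
```

Here the ambiguous "artist" dimension is logged rather than extracted, matching the requirement of one node per dimension.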
The method provided by the embodiment of the invention has been described above; the device provided by the embodiment of the invention is described below.
Referring to Fig. 15, Fig. 15 is a structural diagram of the device provided by the embodiment of the invention. The device corresponds to the method described above. As shown in Fig. 15, the device may comprise:
Processing unit 1501, configured to define the division rules of nodes in a webpage and, according to the node division rules and the parsing of training sample webpages, obtain the constraint rule set used to extract data from webpages;
Extraction unit 1502, configured to extract data from the webpage to be extracted using the constraint rule set.
The processing unit 1501 comprises:
Constraint condition set generation subunit 15011, configured to generate, according to the node division rules, the first constraint condition set and the second constraint condition set corresponding to each training sample webpage;
Constraint rule set generation subunit 15012, configured to generate the corresponding constraint rule sets from the first constraint condition set and the second constraint condition set respectively.
In this embodiment, the constraint condition set generation subunit generates the first constraint condition set and the second constraint condition set corresponding to each training sample webpage through the following operations:
For each training sample webpage, parse the training sample webpage into a DOM tree and convert the DOM tree into a node set U;
Divide the node set U into different division sets according to the different defined node division rules;
Based on the division sets, determine the constraint set of each node in the node storage set UI corresponding to the specified dimension set, and take all constraint sets of all nodes in the node storage set UI as the first constraint condition set corresponding to this training sample webpage;
For any two nodes in the node storage set UI, calculate the binary relation of the two corresponding dimensions on at least one attribute of these two nodes, and record the calculation result in the second constraint condition set corresponding to this training sample webpage.
In this embodiment, the constraint rule set comprises: the rough-set-based equivalence constraint condition set and the constraint condition set of dimensions on node attribute values; or, the constraint rule set comprises: the rough-set-based equivalence constraint condition set, the constraint condition set of dimensions on node attribute values, and the extreme value constraint condition set of dimensions on node attribute values,
wherein the extreme value constraint condition set of dimensions on node attribute values exists when the rough-set-based equivalence constraint condition set and the constraint condition set of dimensions on node attribute values cannot identify every dimension in the specified dimension set.
In this embodiment, when the rough-set-based equivalence constraint condition set and the constraint condition set of dimensions on node attribute values cannot identify every dimension in the specified dimension set, the extraction unit 1502 comprises:
Information set generation subunit 15021, configured to generate the information set corresponding to the webpage to be extracted;
Division subunit 15022, configured to divide the information set according to the defined node division rules to obtain the subsets of the information set;
Merging subunit 15023, configured to merge the subsets obtained to obtain a feature value set;
Rule constraint subunit 15024, configured to apply rule constraints to the feature value set according to the rough-set-based equivalence constraint condition set, the constraint condition set of dimensions on node attribute values, and the extreme value constraint condition set of dimensions on node attribute values, to obtain the association constraint mapping table used for data extraction;
Extraction subunit 15025, configured to carry out data extraction using the association constraint mapping table.
So far, the description of the device provided by the embodiment of the invention is complete.
It can be seen from the above technical scheme that, in the invention, the division rules of nodes in a webpage are defined, and the constraint rule set used to extract data from webpages is obtained according to the node division rules and the parsing of training sample webpages; data is then extracted from the webpage to be extracted using the constraint rule set. Data extraction is thus completed according to the attributes of the nodes themselves and the determined rule constraints. This avoids building templates or writing regular expressions tailored to the structure of each website and webpage, and also eliminates the impact of having to redefine templates or rewrite regular expressions after a website redesign or technology change;
Further, in the invention, once the rule constraints are established, data can be extracted accurately and in real time from webpages that present "the same kind of nodes but belong to different websites or have different structures". This greatly improves the scope of application and timeliness of the method and largely reduces manual intervention. It also greatly improves the quality of the information retrieved by a search engine and the timeliness of information updates, so that the search engine is no longer constrained by changes in front-end presentation technology or format.
The above are merely preferred embodiments of the invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (30)

1. A method for extracting data from a webpage, characterized in that the method comprises:
Step A: define the division rules of nodes in a webpage, and obtain, according to the node division rules and the parsing of training sample webpages, the constraint rule set used to extract data from webpages;
Step B: extract data from the webpage to be extracted using the constraint rule set.
2. The method according to claim 1, characterized in that the obtaining in step A comprises:
Step A1: according to the node division rules, generate the first constraint condition set and the second constraint condition set corresponding to each training sample webpage;
Step A2: generate the corresponding constraint rule sets from the first constraint condition set and the second constraint condition set respectively.
3. The method according to claim 2, characterized in that step A1 comprises:
Step A11: for each training sample webpage, parse the training sample webpage into a DOM tree and convert the DOM tree into a node set U;
Step A12: divide the node set U into different division sets according to the different defined node division rules;
Step A13: based on the division sets, determine the constraint set of each node in the node storage set UI corresponding to the specified extraction dimension set, and take all constraint sets of all nodes in the node storage set UI as the first constraint condition set corresponding to this training sample webpage;
Step A14: for any two nodes in the node storage set UI, calculate the binary relation of the two corresponding dimensions on at least one attribute of these two nodes, and record the calculation result in the second constraint condition set corresponding to this training sample webpage.
4. The method according to claim 3, characterized in that step A12 comprises:
Step A121: traverse the defined node division rules, and take the traversed node division rule as the current division rule;
Step A122: traverse the node set U, and take the traversed node as the current node;
Step A123: judge whether the current node satisfies the current division rule; if so, add the current node to the division set corresponding to the current division rule and execute step A124; if not, execute step A124;
Step A124: judge whether the node set U still contains a node that has not been traversed; if so, take one of the untraversed nodes as the current node and return to step A123; if not, execute step A125;
Step A125: judge whether the defined node division rules still include one that has not been traversed; if so, take one of the untraversed node division rules as the current division rule and return to step A122; otherwise, end the current flow.
5. The method according to claim 3, characterized in that step A13 comprises:
Step A131: traverse the node storage set UI, and take the traversed node as the current node;
Step A132: judge whether the division sets obtained in step A12 include a division set containing the current node; if so, store the node attribute value of the current node on that division set, together with the current node, in the constraint set of the current node, and then execute step A133; if not, execute step A133;
Step A133: judge whether the node storage set UI still contains a node that has not been traversed; if so, take one of the untraversed nodes as the current node and return to step A132; otherwise, end the current flow.
6. The method according to claim 2, characterized in that the constraint rule set in step A2 comprises: the rough-set-based equivalence constraint condition set and the constraint condition set of dimensions on node attribute values.
7. The method according to claim 6, characterized in that the rough-set-based equivalence constraint condition set is generated from the first constraint condition set, specifically comprising:
Step A21: traverse the constraint set of each node in the first constraint condition set, and take the traversed node constraint set as the current constraint set;
Step A22: judge whether the node division rule corresponding to the current constraint set is of discrete type or continuous type, and calculate, according to the result, the probability of the distribution of each node in the current constraint set;
Step A23: remove from the current constraint set the nodes whose probability differs from the probability of the specified target node;
Step A24: judge whether all node constraint sets in the first constraint condition set have been traversed; if so, execute step A25; otherwise, take one of the untraversed node constraint sets as the current constraint set and return to step A22;
Step A25: determine all sets obtained through step A23 as the rough-set-based equivalence constraint condition set.
8. The method according to claim 6, characterized in that the constraint condition set of dimensions on node attribute values is generated from the second constraint condition set, which specifically comprises:
Merging the second constraint condition sets corresponding to different training sample webpages;
Determining the set obtained after merging to be the constraint condition set of dimensions on node attribute values.
9. The method according to claim 8, characterized in that the binary relations in the second constraint condition set comprise at least a comparison relation and a distance relation;
Merging the second constraint condition sets corresponding to different training sample webpages comprises at least: merging comparison relations and merging distance relations;
Wherein merging comparison relations comprises: when a comparison relation between two nodes exists in the second constraint condition set corresponding to one training sample webpage, determining whether this comparison relation also exists in the second constraint condition sets corresponding to the other training sample webpages; if so, keeping this comparison relation; if not, deleting it;
Merging distance relations comprises: if the distance relation between two dimensions is a first value in the second constraint condition set corresponding to one training sample webpage and a second value in the second constraint condition set corresponding to another training sample webpage, then the merged distance relation between these two dimensions is the range of values from the first value to the second value.
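The two merge rules of claim 9 can be sketched as follows. This is an illustration only: the tuple encoding of a comparison relation, the dictionary keyed by dimension pair, and the function names are assumptions that do not appear in the patent, and widening the distance to the minimum and maximum over more than two pages is an extrapolation of the two-value rule.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encodings, not taken from the patent text.
Comparison = Tuple[str, str, str]   # (dimension_a, relation_name, dimension_b)
DimPair = Tuple[str, str]           # (dimension_a, dimension_b)


def merge_comparisons(per_page: List[Set[Comparison]]) -> Set[Comparison]:
    """Keep only the comparison relations that exist on every training page."""
    if not per_page:
        return set()
    return set.intersection(*per_page)


def merge_distances(
    per_page: List[Dict[DimPair, float]]
) -> Dict[DimPair, Tuple[float, float]]:
    """Turn the per-page distance values of a dimension pair into a range."""
    merged: Dict[DimPair, Tuple[float, float]] = {}
    for distances in per_page:
        for pair, value in distances.items():
            low, high = merged.get(pair, (value, value))
            merged[pair] = (min(low, value), max(high, value))
    return merged
```

A comparison relation seen on only one page is dropped, while a distance seen as 40 on one page and 25 on another becomes the range (25, 40).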
10. The method according to claim 6, characterized in that step A2 further comprises:
Verifying whether the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values identify each dimension in the specified dimension set; if so, executing step B; otherwise, executing step A3;
Step A3: determine the extreme value constraint condition set of dimensions on node attribute values.
11. The method according to claim 10, characterized in that verifying whether the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values identify each dimension in the specified dimension set comprises:
For any two dimensions D1 and D2 in each training sample webpage, performing the following processing: if the node set corresponding to dimension D1 is U_D1 = {N_x1, …, N_xm}, the node set corresponding to dimension D2 is U_D2 = {N_y1, …, N_yn}, and the set of binary relations on dimensions D1 and D2 is UR = {R1, …, Rn}, then combining the node sets corresponding to dimension D1 and dimension D2 to obtain the following node pairs: U_D1D2 = {(N_xi, N_yj) | i ∈ {1, …, m}, j ∈ {1, …, n}}; traversing the node pairs and determining the node pairs that satisfy all binary relations in the binary relation set, to obtain the set {(N_x, N_y), …};
After this processing has been completed for every two dimensions in the training sample webpage, determining whether the set formed by the node pairs that finally satisfy all binary relations in the binary relation set contains only one node pair; if so, determining that the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values identify each dimension in the specified dimension set; otherwise, determining that the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values cannot identify each dimension in the specified dimension set.
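The verification in claim 11 amounts to forming every node pair across two dimensions, keeping the pairs that satisfy all binary relations, and checking that exactly one pair remains. The sketch below assumes a node is a plain dictionary of attribute values and a binary relation is a predicate on a node pair; both representations and all names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Node = Dict[str, float]                   # hypothetical: attribute name -> value
Relation = Callable[[Node, Node], bool]   # hypothetical binary relation on a pair


def surviving_pairs(
    nodes_d1: Sequence[Node],
    nodes_d2: Sequence[Node],
    relations: Sequence[Relation],
) -> List[Tuple[Node, Node]]:
    """Return the node pairs (one node per dimension) that satisfy every relation."""
    return [
        (a, b)
        for a, b in product(nodes_d1, nodes_d2)
        if all(relation(a, b) for relation in relations)
    ]


def identifies_dimensions(
    nodes_d1: Sequence[Node],
    nodes_d2: Sequence[Node],
    relations: Sequence[Relation],
) -> bool:
    """The two dimensions count as identified when exactly one pair survives."""
    return len(surviving_pairs(nodes_d1, nodes_d2, relations)) == 1
```

When more than one pair survives, the claim falls back to step A3 and derives extreme value constraints.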
12. The method according to claim 10, characterized in that step A3 comprises:
Step A31: traverse the dimensions in the specified dimension set and take the traversed dimension as the current dimension; if the node set U_n corresponding to the current dimension contains more than one node and the nodes in U_n belong to the same training sample webpage, traverse each attribute possessed by both the specified target node and all the nodes corresponding to the current dimension, and take the traversed attribute as the current attribute;
Step A32: compare the specified target node with each of the nodes corresponding to the current dimension on the current attribute, to obtain comparison results;
Step A33: determine the value of the specified target node on the current attribute according to the comparison results;
Step A33: determine the output result corresponding to the current attribute based on the determined value;
Step A34: determine whether there is still an attribute that has not been traversed; if so, take one of the untraversed attributes as the current attribute and return to step A32; if not, execute step A35;
Step A35: form the output results corresponding to the attributes into sets, and process each set to obtain the extreme value sequence set corresponding to the current dimension; determine whether the preset merging condition is currently met; if so, execute step A36; otherwise, if there is still a dimension that has not been traversed, take one of the untraversed dimensions as the current dimension and return to step A31;
Step A36: merge the extreme value sequence sets corresponding to the dimensions, and take the extreme value sequence set obtained after merging as the extreme value constraint condition set of dimensions on node attribute values.
13. The method according to claim 12, characterized in that the comparison results comprise greater than, equal to, and less than; step A33 comprises:
If the comparison results contain only less than, with equal to and greater than both being 0, the value is Top;
If the comparison results contain only less than and equal to, with greater than being 0, the value is Tops;
If the comparison results contain only greater than, with less than and equal to both being 0, the value is Bottom;
If the comparison results contain only greater than and equal to, with less than being 0, the value is Bottoms;
If the comparison results contain both greater than and less than, with equal to being arbitrary, the value is Middle;
If the comparison results contain only equal to, with greater than and less than both being 0, the value is Identical.
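The value assignment of claim 13 can be written as a small classifier over the counts of the three comparison outcomes. The sketch assumes the reading in which "only less than" yields Top and "only greater than" yields Bottom, as the Tops and Bottoms rules suggest; the function name and the use of integer counts are illustrative choices, not part of the patent.

```python
def classify_extreme(greater: int, equal: int, less: int) -> str:
    """Map counts of greater-than, equal and less-than results to a value label."""
    if greater < 0 or equal < 0 or less < 0:
        raise ValueError("counts must be non-negative")
    if greater == 0 and equal == 0 and less == 0:
        raise ValueError("at least one comparison result is required")
    if greater > 0 and less > 0:
        return "Middle"                       # equal may take any count here
    if less > 0:
        return "Tops" if equal > 0 else "Top"
    if greater > 0:
        return "Bottoms" if equal > 0 else "Bottom"
    return "Identical"                        # only equal results remain
```

For example, three less-than results and nothing else give Top, while two greater-than results plus one equal result give Bottoms.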
14. The method according to claim 13, characterized in that step A34 comprises:
If the value is Top or Bottom, the output result is 1, placed at the tail of the extreme value sequence;
If the value is Tops or Bottoms, the output result is in {2, …, m-1}, placed at the beginning or in the middle of the extreme value sequence, where m is the number of dimensions in the specified dimension set;
If the value is Identical, the output result is m, placed at the beginning or in the middle of the extreme value sequence;
If the value is Middle, the output result is 0, placed at the tail of the sequence for subsequent processing.
15. The method according to claim 14, characterized in that step A35 comprises:
Arranging the elements of the first set, obtained for output result m, to obtain an arrangement set U_p;
Classifying all elements of the first set as Top or Bottom, and determining the arrangement set U_p together with the second set, obtained for output result 1, to be the qualifying first extreme value sequence corresponding to this dimension;
Determining, from the node set U_n, the subset U_ns that satisfies the arrangement set U_p together with the third set, obtained for output results {2, …, m-1}; converting the Bottoms in the third set to Bottom and the Tops to Top, and adding them to the arrangement set U_p together with the third set obtained for output results {2, …, m-1}, to form the second extreme value sequence;
Merging the first extreme value sequence and the second extreme value sequence, and determining the extreme value sequence obtained after merging to be the extreme value sequence set corresponding to the current dimension;
The preset merging condition is that the number of elements in the third set is 0.
16. The method according to claim 15, characterized in that merging the first extreme value sequence and the second extreme value sequence, or merging the extreme value sequence sets corresponding to the dimensions in step A36, specifically comprises:
For two extreme value sequences S1 and S2: if S1 ∈ S2, then S1 ∩ S2 = S2; if S2 ∈ S1, then S1 ∩ S2 = S1; if S1 = S2, then S1 ∩ S2 = S1; otherwise, φ is returned.
17. The method according to claim 10, characterized in that, when the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values cannot identify each dimension in the specified dimension set, step B comprises:
Step B1: generate the information set corresponding to the webpage to be extracted;
Step B2: divide the information set according to the defined node division rules, to obtain the subsets of the information set;
Step B3: merge the subsets obtained, to obtain a characteristic value set;
Step B4: apply rule constraints to the characteristic value set based on the equivalence constraint condition set based on rough set, the constraint condition set of dimensions on node attribute values, and the extreme value constraint condition set of dimensions on node attribute values, to obtain the interconnection constraint mapping table used for data extraction;
Step B5: extract data using the interconnection constraint mapping table.
18. The method according to claim 17, characterized in that step B1 comprises:
Step B11: parse the webpage to be extracted into a DOM tree;
Step B12: traverse all nodes on the DOM tree, take the traversed node as the current node, and execute step B13;
Step B13: determine whether this node is a comment node; if so, execute step B14; otherwise, execute step B15;
Step B14: determine whether the DOM tree still has a node that has not been traversed; if so, take one of the untraversed nodes as the current node and return to step B13; if not, return to step B11;
Step B15: add the current node to the information set and return to step B14.
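Steps B11 to B15 collect every node of the page except comment nodes. A minimal sketch with Python's standard `html.parser` follows; it walks the markup as an event stream rather than building a full DOM tree, which is a simplification, and the dictionary layout of a node and the names `InfoSetBuilder` and `build_information_set` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class InfoSetBuilder(HTMLParser):
    """Collects element and text nodes; comment nodes are deliberately skipped."""

    def __init__(self) -> None:
        super().__init__()
        self.info_set: List[Dict[str, object]] = []

    def handle_starttag(self, tag, attrs):
        # Every element becomes one entry of the information set.
        self.info_set.append({"type": "element", "tag": tag, "attrs": dict(attrs)})

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.info_set.append({"type": "text", "text": text})

    def handle_comment(self, data):
        # Step B13: a comment node is not added to the information set.
        pass


def build_information_set(html: str) -> List[Dict[str, object]]:
    builder = InfoSetBuilder()
    builder.feed(html)
    builder.close()
    return builder.info_set
```

Feeding `<div class="item"><!-- ad slot --><span>Song A</span></div>` yields the `div` element, the `span` element and the text "Song A", with the comment left out.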
19. The method according to claim 17, characterized in that step B2 comprises:
Step B21: determine whether the information set is empty; if it is empty, execute step B24; otherwise, execute step B22;
Step B22: traverse the information set, take the traversed node of the information set as the current node, and execute step B23;
Step B23: for each defined node division rule, determine whether the current node satisfies this division rule; if so, add the current node to the subset corresponding to this node division rule and return to step B21; if not, return to step B21;
Step B24: take the subsets finally obtained as the subsets resulting from dividing the information set according to the defined node division rules.
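The division of the information set into per-rule subsets can be sketched as below. Modelling a node division rule as a named predicate over a node dictionary is an assumption, as are the rule names in the usage note; the patent does not fix a representation.

```python
from typing import Callable, Dict, List

Node = Dict[str, object]            # hypothetical node representation
Rule = Callable[[Node], bool]       # hypothetical node division rule


def divide_information_set(
    info_set: List[Node], rules: Dict[str, Rule]
) -> Dict[str, List[Node]]:
    """Place each node in the subset of every division rule it satisfies."""
    subsets: Dict[str, List[Node]] = {name: [] for name in rules}
    for node in info_set:
        for name, rule in rules.items():
            if rule(node):
                subsets[name].append(node)
    return subsets
```

A node may land in several subsets, and a node that satisfies no rule lands in none; an empty information set simply produces empty subsets, matching the branch from step B21 to the final step.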
20. The method according to claim 17, characterized in that step B3 comprises:
Step B31: initialize a new characteristic value set;
Step B32: traverse the subsets obtained in step B2, take the traversed subset as the current subset, and execute step B33;
Step B33: traverse the nodes in the current subset, take the traversed node as the current node, and execute step B34;
Step B34: determine whether the characteristic value set already contains the current node; if so, execute step B36; otherwise, execute step B35;
Step B35: add the current node and its corresponding characteristic value to the characteristic value set, and then execute step B37;
Step B36: add the characteristic value corresponding to the current node to the characteristic value set, and then execute step B37;
Step B37: set the visited flag Signed on the current node, and execute step B38;
Step B38: determine whether the current subset still has a node that has not been traversed; if so, take one of the untraversed nodes as the current node and return to step B34; if not, execute step B39;
Step B39: mark the current subset as processed, and then execute step B30;
Step B30: determine whether the subsets obtained in step B2 still include a subset that has not been traversed; if so, take one of the untraversed subsets as the current subset and return to step B33; if not, end the current flow.
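The merge of step B3 gathers, for every node, the characteristic values contributed by each subset it appears in. In the sketch below the characteristic value of a node within a subset is taken to be the name of the division rule that produced the subset, and nodes are keyed by a hypothetical `id` field; both are assumptions. The Signed and processed flags of steps B37 and B39 are bookkeeping that a plain loop makes unnecessary.

```python
from typing import Dict, List, Set

Node = Dict[str, object]   # hypothetical node; must carry an "id" entry


def merge_subsets(subsets: Dict[str, List[Node]]) -> Dict[str, Set[str]]:
    """Build a map from node id to the set of characteristic values it holds."""
    characteristic_values: Dict[str, Set[str]] = {}
    for rule_name, nodes in subsets.items():
        for node in nodes:
            node_id = str(node["id"])
            # Steps B34 to B36: create the entry on first sight, then extend it.
            characteristic_values.setdefault(node_id, set()).add(rule_name)
    return characteristic_values
```

A node found in both the `is_heading` and `in_main` subsets ends up with both values under a single key.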
21. The method according to claim 17, characterized in that step B4 comprises:
Step B41: generate an equivalence relation division mapping table according to the equivalence constraint condition set based on rough set;
Step B42: generate a sorting division constraint mapping table according to the equivalence relation division mapping table and the extreme value constraint condition set of dimensions on node attribute values;
Step B43: generate the interconnection constraint mapping table used for data extraction according to the constraint condition set of dimensions on node attribute values and the sorting division constraint mapping table.
22. The method according to claim 21, characterized in that different dimensions correspond to different equivalence constraint condition sets based on rough set; step B41 comprises:
Step B411: traverse the equivalence constraint condition set based on rough set corresponding to each dimension, and take the traversed equivalence constraint condition set of the corresponding dimension as the current constraint condition set;
Step B412: use the first constraint condition in the current constraint condition set to perform equivalence partitioning on the characteristic value set;
Step B413: determine whether the current constraint condition set contains only the first constraint condition; if so, execute step B416; otherwise, take the constraint condition following the first constraint condition as the current constraint condition and execute step B414;
Step B414: use the current constraint condition to perform equivalence partitioning on the set that satisfies the previous constraint condition, obtained by equivalence partitioning with the previous constraint condition;
Step B415: determine whether the current constraint condition is the last constraint condition in the current constraint condition set; if so, execute step B416; if not, take the next constraint condition after the current one as the current constraint condition and return to step B414;
Step B416: convert the set that satisfies the last constraint condition into an equivalence relation division mapping table of "dimension as key, set of nodes satisfying the constraints as value" pairs;
Step B417: determine whether, among the equivalence constraint condition sets corresponding to the dimensions, there is still one that has not been traversed; if so, take the untraversed equivalence constraint condition set of one of the dimensions as the current constraint condition set and execute step B412; otherwise, end the current flow.
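Steps B411 to B417 narrow the characteristic value set one constraint at a time for each dimension and store the survivors under the dimension as key. The following sketch treats a constraint as a predicate over a node's characteristic values, which is an assumed simplification of the rough-set equivalence partition; all names are hypothetical.

```python
from typing import Callable, Dict, List, Set

Constraint = Callable[[Set[str]], bool]   # hypothetical equivalence constraint


def build_equivalence_table(
    characteristic_values: Dict[str, Set[str]],
    constraints_by_dimension: Dict[str, List[Constraint]],
) -> Dict[str, List[str]]:
    """Return a table mapping each dimension to the node ids that pass all constraints."""
    table: Dict[str, List[str]] = {}
    for dimension, constraints in constraints_by_dimension.items():
        survivors = list(characteristic_values)
        # Each constraint only sees the nodes that satisfied the previous one.
        for constraint in constraints:
            survivors = [
                node_id
                for node_id in survivors
                if constraint(characteristic_values[node_id])
            ]
        table[dimension] = survivors
    return table
```

Because each pass works on the output of the previous pass, the order of constraints affects only the amount of work, not the final set.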
23. The method according to claim 21, characterized in that step B42 comprises:
Step B421: traverse the sorting constraint conditions corresponding to all dimensions, and take the sorting constraint condition corresponding to the traversed dimension as the current constraint condition; the sorting constraint condition is the extreme value sequence corresponding to the dimension;
Step B422: read the candidate node set corresponding to this dimension from the equivalence relation division mapping table;
Step B423: classify the candidate node set according to the current constraint condition, and sort within each of the blocks obtained after classification;
Step B424: select the specified top N (TopN) nodes from each block and put them, in block order, into the same node set;
Step B425: store this node set in the sorting division constraint mapping table of "dimension as key, node set as value" pairs;
Step B426: determine whether there is a sorting constraint condition that has not been traversed; if so, take the untraversed sorting constraint condition of one of the dimensions as the current constraint condition and return to step B422; otherwise, end the current flow.
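Steps B423 and B424 classify the candidates into blocks, sort inside each block, and keep the top N of every block in block order. The sketch below assumes the classification and the in-block ordering are supplied as key functions, and that "block order" means ascending block key; the patent does not specify either, so these are illustrative choices.

```python
from typing import Callable, Dict, Hashable, List

Node = Dict[str, object]   # hypothetical node representation


def select_top_n(
    candidates: List[Node],
    block_key: Callable[[Node], Hashable],
    sort_key: Callable[[Node], float],
    top_n: int,
    descending: bool = False,
) -> List[Node]:
    """Group candidates into blocks, sort each block, keep the first top_n of each."""
    blocks: Dict[Hashable, List[Node]] = {}
    for node in candidates:
        blocks.setdefault(block_key(node), []).append(node)
    selected: List[Node] = []
    for key in sorted(blocks):              # assumed block order: ascending key
        ranked = sorted(blocks[key], key=sort_key, reverse=descending)
        selected.extend(ranked[:top_n])
    return selected
```

With blocks formed by tag name and ordering by font size, `top_n=1` and `descending=True` keep the largest-font node of each tag.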
24. The method according to claim 21, characterized in that step B43 comprises:
Step B431: traverse the interconnection constraint conditions corresponding to all dimensions, and take the traversed dimension interconnection constraint condition as the current correlation condition; the interconnection constraint condition is the constraint condition set of dimensions on node attribute values;
Step B432: obtain, from the sorting division constraint mapping table, all candidate nodes corresponding to the dimensions related to the current correlation condition;
Step B433: perform product mapping on all the obtained candidate nodes, to obtain a new set whose elements are node pairs;
Step B434: perform constraint calculation on the node pairs in the new set of step B433, keep the elements that satisfy the constraint rules, and delete the node pairs that do not, to obtain the interconnection constraint mapping table of this dimension;
Step B435: determine whether there is an interconnection constraint condition that has not been traversed; if so, take one of the untraversed interconnection constraint conditions as the current constraint condition and return to step B432; otherwise, end the current flow.
25. The method according to claim 17, characterized in that step B5 comprises:
Step B51: traverse the candidate node sets of the dimensions in the interconnection constraint mapping table, and take the candidate node set of the traversed dimension as the current set;
Step B52: determine whether the number of nodes in the current set is 1; if so, execute step B53; otherwise, execute step B56;
Step B53: extract the relevant content information of this node as required, that is, remove the webpage markup and related style information;
Step B54: save this information into the information body set of "dimension as key, node information content as value" pairs;
Step B55: determine whether the interconnection constraint mapping table still contains a dimension candidate node set that has not been traversed; if so, take one of the untraversed dimension candidate node sets as the current set and return to step B52; otherwise, end the current flow;
Step B56: write the page link of the webpage to be extracted, the dimension identifier, and its candidate node set into the error log for handling.
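The extraction loop of claim 25 can be sketched as follows. Candidate nodes are assumed to be stored as HTML fragments, markup is stripped with a simple regular expression (a crude stand-in for proper parsing), and the logger name and function names are hypothetical.

```python
import logging
import re
from typing import Dict, List

logger = logging.getLogger("extraction")   # hypothetical logger name
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(fragment: str) -> str:
    """Remove tags and collapse whitespace, leaving only the text content."""
    return " ".join(TAG_PATTERN.sub(" ", fragment).split())


def extract_data(table: Dict[str, List[str]], page_url: str) -> Dict[str, str]:
    """Keep dimensions with exactly one candidate; log every other dimension."""
    information_body: Dict[str, str] = {}
    for dimension, candidates in table.items():
        if len(candidates) == 1:
            information_body[dimension] = strip_markup(candidates[0])
        else:
            # Step B56: record the page, the dimension and its candidates.
            logger.error("cannot extract %s from %s: candidates=%r",
                         dimension, page_url, candidates)
    return information_body
```

A dimension with zero or several candidates is treated as unresolved and is reported instead of being guessed.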
26. A device for extracting data from a webpage, characterized in that the device comprises:
A processing unit, configured to define division rules for the nodes in a webpage, and to obtain, according to the node division rules and the parsing of training sample webpages, a constraint rule set used for extracting data from webpages;
An extracting unit, configured to extract data from a webpage to be extracted using the constraint rule set.
27. The device according to claim 26, characterized in that the processing unit comprises:
A constraint condition set generation subunit, configured to generate, according to the node division rules, the first constraint condition set and the second constraint condition set corresponding to each training sample webpage;
A constraint rule set generation subunit, configured to generate the corresponding constraint rule sets from the first constraint condition set and the second constraint condition set respectively.
28. The device according to claim 27, characterized in that the constraint condition set generation subunit generates the first constraint condition set and the second constraint condition set corresponding to each training sample webpage through the following operations:
For each training sample webpage, parsing this training sample webpage into a DOM tree, and converting this DOM tree into a node set U;
Dividing the node set U into different division sets according to the different defined node division rules;
Based on the division sets, determining the constraint set of each node in the node storage set UI corresponding to the specified dimension set, and taking all the constraint sets of all the nodes in the node storage set UI as the first constraint condition set corresponding to this training sample webpage;
For any two nodes in the node storage set UI, calculating the binary relation between the two corresponding dimensions on at least one attribute of these two nodes, and recording the calculation result in the second constraint condition set corresponding to this training sample webpage.
29. The device according to claim 27, characterized in that the constraint rule set comprises: an equivalence constraint condition set based on rough set, and a constraint condition set of dimensions on node attribute values; or, the constraint rule set comprises: an equivalence constraint condition set based on rough set, a constraint condition set of dimensions on node attribute values, and an extreme value constraint condition set of dimensions on node attribute values,
Wherein the extreme value constraint condition set of dimensions on node attribute values exists when the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values cannot identify each dimension in the specified dimension set.
30. The device according to claim 29, characterized in that, when the equivalence constraint condition set based on rough set and the constraint condition set of dimensions on node attribute values cannot identify each dimension in the specified dimension set, the extracting unit comprises:
An information set generation subunit, configured to generate the information set corresponding to the webpage to be extracted;
A division subunit, configured to divide the information set according to the defined node division rules, to obtain the subsets of the information set;
A merging subunit, configured to merge the subsets obtained, to obtain a characteristic value set;
A rule constraint subunit, configured to apply rule constraints to the characteristic value set based on the equivalence constraint condition set based on rough set, the constraint condition set of dimensions on node attribute values, and the extreme value constraint condition set of dimensions on node attribute values, to obtain the interconnection constraint mapping table used for data extraction;
An extraction subunit, configured to extract data using the interconnection constraint mapping table.
CN201010527635.9A 2010-10-27 2010-10-27 Method and device for extracting data from webpage Active CN102456050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010527635.9A CN102456050B (en) 2010-10-27 2010-10-27 Method and device for extracting data from webpage

Publications (2)

Publication Number Publication Date
CN102456050A true CN102456050A (en) 2012-05-16
CN102456050B CN102456050B (en) 2014-04-09

Family

ID=46039247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010527635.9A Active CN102456050B (en) 2010-10-27 2010-10-27 Method and device for extracting data from webpage

Country Status (1)

Country Link
CN (1) CN102456050B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1410918A (en) * 2002-05-31 2003-04-16 浙江大学 Searching engine based on information extraction technique
US20050216443A1 (en) * 2000-07-06 2005-09-29 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
JP2008123425A (en) * 2006-11-15 2008-05-29 Ntt Resonant Inc Web document data providing device, method, and system
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902723A (en) * 2012-09-06 2013-01-30 北京北森测评技术有限公司 Method and device for analyzing network data
CN102855324B (en) * 2012-09-11 2015-08-26 北京云泓道元信息技术有限公司 A kind of extraction method of the network information and device
CN102855324A (en) * 2012-09-11 2013-01-02 北京云泓道元信息技术有限公司 Automatic extracting method and device for network information
CN103870506A (en) * 2012-12-17 2014-06-18 中国科学院计算技术研究所 Webpage information extraction method and system
CN103870506B (en) * 2012-12-17 2017-02-08 中国科学院计算技术研究所 Webpage information extraction method and system
CN103634146A (en) * 2013-11-27 2014-03-12 华为技术有限公司 Network data processing method and device
CN106326316A (en) * 2015-07-08 2017-01-11 腾讯科技(深圳)有限公司 Web page advertisement filtering method and device
CN106326316B (en) * 2015-07-08 2022-11-29 腾讯科技(深圳)有限公司 Webpage advertisement filtering method and device
CN105306525A (en) * 2015-09-11 2016-02-03 浪潮集团有限公司 Data layout method, device and system
CN108153779A (en) * 2016-12-05 2018-06-12 阿里巴巴集团控股有限公司 Page data impression information processing method and processing device
CN108153779B (en) * 2016-12-05 2022-04-05 阿里巴巴集团控股有限公司 Page data delivery information processing method and device
CN108228676A (en) * 2016-12-22 2018-06-29 腾讯科技(深圳)有限公司 Information extraction method and system
CN108228676B (en) * 2016-12-22 2021-08-13 腾讯科技(深圳)有限公司 Information extraction method and system
US11093520B2 (en) 2016-12-22 2021-08-17 Tencent Technology (Shenzhen) Company Limited Information extraction method and system
CN106682205A (en) * 2016-12-29 2017-05-17 努比亚技术有限公司 Device and method for data processing
CN106951434B (en) * 2017-02-06 2020-03-10 广东神马搜索科技有限公司 Search method and device for search engine and programmable device
CN106951434A (en) * 2017-02-06 2017-07-14 广东神马搜索科技有限公司 A kind of searching method, device and programmable device for search engine
CN107463696A (en) * 2017-08-15 2017-12-12 中译语通科技(北京)有限公司 A kind of method of Webpage largest block extraction

Also Published As

Publication number Publication date
CN102456050B (en) 2014-04-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160311

Address after: No. 3, 12th Floor, Unit 1, Building 6, No. 399 West Section of Fucheng Road, High-tech Zone, Chengdu, Sichuan Province 610000

Patentee after: MIGU MUSIC CO., LTD.

Address before: No. 10 Peng Da Road, High-tech Zone, Chengdu, Sichuan 610041

Patentee before: China Mobile Communication Group Sichuan Co., Ltd.