Method and device for extracting the body text of a web page
Technical field
The present invention relates to the field of data processing, and in particular to a method and device for extracting the body text of a web page.
Background technology
Web page text, especially the body text of a web page, is the most important information in the page and is also an important data source for big data analysis.
In the prior art, the body text of a web page is mostly extracted using text density as the reference index, where text density is the ratio of the text length of an HTML (HyperText Markup Language) element to the code length of that element. However, an element with high text density is not necessarily body text: elements such as the source, the publication time, and the author may be extracted as body text by mistake. Conversely, an element with low text density may still be body text: some body text carries style information and hyperlinks, which lowers its text density. Extracting body text with text density as the sole reference index is therefore prone to extraction errors.
No effective solution has yet been proposed for the problem that body text extraction from web pages in the prior art is not accurate enough.
Summary of the invention
The main object of the present invention is to provide a method and device for extracting the body text of a web page, so as to solve the problem that body text extraction from web pages in the prior art is not accurate enough.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting the body text of a web page is provided. The method according to the present invention includes: obtaining the HyperText Markup Language (HTML) code of the web page to be extracted, and establishing a tree structure of the web page according to the HTML code; extracting the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; calculating the index values of the first element and the second element, where an index value represents the amount of information of an element; obtaining the element corresponding to the largest of the index values as the target element; and extracting the text contained in the target element as the body text of the web page to be extracted.
Further, calculating the index values of the first element and the second element includes: calculating the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first nodes; calculating the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$; calculating the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and calculating the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$.
Further, the first entropy Es1j of the first element Gj is calculated according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1)\,\log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
Further, the second entropy Es2i of the second element Ai is calculated according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2)\,\log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
Further, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node includes: judging whether the first element corresponding to the first node is a block element; and, when the first element corresponding to the first node is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for extracting the body text of a web page is provided. The device according to the present invention includes: a first acquisition unit, configured to obtain the HyperText Markup Language (HTML) code of the web page to be extracted and to establish a tree structure of the web page according to the HTML code; a first extraction unit, configured to extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; a computing unit, configured to calculate the index values of the first element and the second element, where an index value represents the amount of information of an element; a second acquisition unit, configured to obtain the element corresponding to the largest of the index values as the target element; and a second extraction unit, configured to extract the text contained in the target element as the body text of the web page to be extracted.
Further, the computing unit includes: a first computing module, configured to calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first nodes; a second computing module, configured to calculate the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$; a third computing module, configured to calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and a fourth computing module, configured to calculate the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$.
Further, the first computing module includes: a first calculating submodule, configured to calculate the first entropy Es1j of the first element Gj according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1)\,\log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
Further, the second computing module includes: a second calculating submodule, configured to calculate the second entropy Es2i of the second element Ai according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2)\,\log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
Further, the first extraction unit includes: a judging module, configured to judge whether the first element corresponding to the first node is a block element; and a processing module, configured to extract, when the first element corresponding to the first node is judged to be a block element, the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
According to the embodiments of the present invention, the HTML code of the web page to be extracted is obtained, and a tree structure of the web page is established according to the HTML code; the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node are extracted, where the first node is a leaf node of the tree structure; the index values of the first element and the second element are calculated, where an index value represents the amount of information of the first element or the second element; the element corresponding to the largest index value is obtained as the target element; and the text contained in the target element is extracted as the body text of the web page to be extracted. By establishing a tree structure from the HTML code of the page, the elements corresponding to the leaf nodes and to their parent nodes are identified and extracted, the amount of information of each extracted element is calculated, and the text contained in the element with the largest amount of information is extracted as the body text of the web page. Because the amount of information is used as the reference index, this approach considers not only the length of the text in an element but also the degree of disorder (entropy) of that text. Compared with the prior art, which extracts body text using text density alone, that is, only the ratio of text length to code length, it addresses the problem that body text extraction is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. In addition, because extraction is based on a tree structure built from the HTML code of the page, the body text of different web pages can be extracted without configuring each page's markup form separately, which reduces resource consumption and improves extraction speed.
Brief description of the drawings
The accompanying drawings, which form a part of the present application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for extracting the body text of a web page according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional method for extracting the body text of a web page according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of the device for extracting the body text of a web page according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The technical terms used in the embodiments of the present invention are explained as follows:
A block element, also called a block-level element, is the counterpart of an inline element; both are concepts from the HTML specification. When displayed by a browser, a block-level element usually begins (and ends) with a line break.
Embodiment 1
According to an embodiment of the present invention, a method embodiment that can be carried out by the device embodiment of the present application is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
According to an embodiment of the present invention, a method for extracting the body text of a web page is provided. Fig. 1 is a flowchart of the method for extracting the body text of a web page according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S102 to S110:
S102: Obtain the HyperText Markup Language (HTML) code of the web page to be extracted, and establish a tree structure of the web page according to the HTML code. Specifically, the tree structure is based on the Document Object Model (DOM), so it is also called a DOM tree.
S104: Extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure. That is, for each leaf node of the tree structure (the DOM tree), extract the first element corresponding to the leaf node and the second element corresponding to its parent node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of the leaf nodes equals the number of second elements.
S106: Calculate the index values of the first element and the second element, where an index value represents the amount of information of an element. That is, calculate the amount of information of each first element and of each second element. The larger the index value of an element, the larger its amount of information; conversely, the smaller the index value, the smaller its amount of information.
S108: Obtain the element corresponding to the largest index value as the target element. That is, among the index values of all first elements and all second elements, find the element with the largest index value; that element is the target element.
S110: Extract the text contained in the target element as the body text of the web page to be extracted. That is, the text content of the target element is the body text that needs to be extracted.
In the embodiment of the present invention, by establishing a tree structure from the HTML code of the page to be extracted, the elements corresponding to the leaf nodes and to their parent nodes are identified and extracted, the amount of information of each extracted element is calculated, and the text contained in the element with the largest amount of information is extracted as the body text of the web page. Because the amount of information is used as the reference index, this approach considers not only the length of the text in an element but also the degree of disorder (entropy) of that text. Compared with the prior art, which extracts body text using text density alone, that is, only the ratio of text length to code length, it addresses the problem that body text extraction is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. In addition, because extraction is based on a tree structure built from the HTML code of the page, the body text of different web pages can be extracted without configuring each page's markup form separately, which reduces resource consumption and improves extraction speed.
Specifically, the index value of each first element and each second element can be calculated through the following steps 1-1 to 1-4:
Step 1-1: Calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first nodes. That is, calculate the first entropy and the first text length of each first element. The first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of that text, which reflects its information density. For example, if the text contained in the first element G1 has 100 characters, the text length Ls11 of G1 is 100.
Step 1-2: Calculate the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$. That is, the index value of each first element is the product of its first entropy and the square of its first text length.
Step 1-3: Calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes. That is, calculate the second entropy and the second text length of each second element. The second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of that text, which reflects its information density. For example, if the text contained in the second element A1 has 300 characters, the text length Ls21 of A1 is 300.
Step 1-4: Calculate the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$. That is, the index value of each second element is the product of its second entropy and the square of its second text length.
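The index value formula in steps 1-2 and 1-4 can be illustrated with a short sketch. The function name `index_value` is hypothetical; the text lengths 100 and 300 are taken from the examples in steps 1-1 and 1-3, while the entropy value 4.0 is a made-up number used only to show the effect of the squared length.

```python
def index_value(entropy: float, text_length: int) -> float:
    """Index value of an element: entropy multiplied by the squared text length."""
    return entropy * text_length ** 2

# Assumed entropy of 4.0 for both elements (illustrative only).
leaf_score = index_value(4.0, 100)     # first element with 100 characters
parent_score = index_value(4.0, 300)   # second element with 300 characters
ratio = parent_score / leaf_score
```

With equal entropy, tripling the text length multiplies the index value by nine, so long passages dominate short ones unless their text is highly repetitive.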
Specifically, in the embodiment of the present invention, the first entropy Es1j of the first element Gj is calculated according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1)\,\log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j. In the embodiment of the present invention, all characters in the text contained in a first element form the first character string of that element. The frequency with which each character occurs in the first character string is counted to obtain the probability of that character in the string; each probability is multiplied by its own logarithm; all the products are summed; and the sum is negated to give the first entropy of the element. For example, if the text contained in the first element G1 is "今天天气晴。" ("The weather is fine today."), then the first character string S11 of G1 is "今天天气晴", which contains 5 characters in total. The probability of "今" in S11 is 1/5, of "天" is 2/5, of "气" is 1/5, and of "晴" is 1/5, so the first entropy Es11 of G1 can be calculated as follows:
$Es11 = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{2}{5}\log_2\frac{2}{5} + \frac{1}{5}\log_2\frac{1}{5} + \frac{1}{5}\log_2\frac{1}{5}\right)$.
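The entropy calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `char_entropy` is hypothetical, and dropping punctuation and whitespace with `str.isalnum()` is an assumption based on the example, where the full stop is not counted among the 5 characters. The sample string is a placeholder with the same frequency profile as the example (five characters, one of which occurs twice).

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (base 2) of the character distribution of `text`."""
    # Assumption: punctuation and whitespace are not part of the character string.
    chars = [c for c in text if c.isalnum()]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "abbcd."   # placeholder: 5 counted characters, "b" occurs twice
entropy = char_entropy(sample)
```

For this frequency profile the result is about 1.92 bits. A string made of a single repeated character has entropy 0, which is how the index value penalises repetitive text.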
Specifically, in the embodiment of the present invention, the second entropy Es2i of the second element Ai is calculated according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2)\,\log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i. In the embodiment of the present invention, all characters in the text contained in a second element form the second character string of that element. The frequency with which each character occurs in the second character string is counted to obtain the probability of that character in the string; each probability is multiplied by its own logarithm; all the products are summed; and the sum is negated to give the second entropy of the element. For example, if the text contained in the second element A1 is "今天天气晴，但是真很冷。" ("The weather is fine today, but it is really cold."), then the second character string S21 of A1 is "今天天气晴但是真很冷", which contains 10 characters in total. The probability of "天" in S21 is 2/10, and the probability of each of "今", "气", "晴", "但", "是", "真", "很", and "冷" in S21 is 1/10, so the second entropy Es21 of A1 can be calculated as follows:
$Es21 = -\left(\frac{1}{10}\log_2\frac{1}{10} + \frac{2}{10}\log_2\frac{2}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10}\right)$.
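As a worked evaluation of the two examples above, the sums simplify as follows (values rounded). The index values on the last two lines additionally assume text lengths of 5 and 10, which are the character counts of the two example strings; pairing G1 with A1 is done here purely for illustration.

```latex
\begin{align*}
Es11 &= \tfrac{3}{5}\log_2 5 + \tfrac{2}{5}\log_2\tfrac{5}{2} \approx 1.922 \\
Es21 &= \tfrac{8}{10}\log_2 10 + \tfrac{2}{10}\log_2 5 \approx 3.122 \\
I11 &= Es11 \cdot 5^2 \approx 48.05 \\
I21 &= Es21 \cdot 10^2 \approx 312.19
\end{align*}
```

The longer and more varied string scores roughly 6.5 times higher, which is the behaviour the squared length term is meant to produce.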
Preferably, in the embodiment of the present invention, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node includes: judging whether the first element corresponding to the first node is a block element; and, when the first element is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node. That is, the two elements are extracted only when the first element corresponding to the first node is a block element. It should be noted that the judgment may be made for the first element of one first node at a time, or for the elements of several first nodes at once.
In the embodiment of the present invention, judging whether the first element of a first node is a block element avoids extracting first elements that are not block elements and whose text content is, for example, the "source", "publication time", or "author" of the article, which further improves the accuracy of body text extraction from the web page.
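One simple way to make the block element judgment is to look up the tag name in a fixed set, as sketched below. The source does not say how the judgment is implemented, so this is an assumption: the set `BLOCK_TAGS` lists common elements that browsers render as blocks by default, it is not exhaustive, and it ignores the fact that style sheets can change how any element is displayed. The function name `is_block_element` is hypothetical.

```python
# Common elements rendered as blocks by default (not an exhaustive list).
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "hr", "li", "main", "nav", "ol", "p", "pre", "section",
    "table", "ul",
})

def is_block_element(tag: str) -> bool:
    """Return True if `tag` names an element that is a block element by default."""
    return tag.lower() in BLOCK_TAGS

checks = {tag: is_block_element(tag) for tag in ("p", "DIV", "span", "a")}
```

Inline elements such as `span` and `a`, which often wrap an author name or a publication time, are rejected by this check.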
In addition, the method for extracting the body text of a web page provided by the embodiment of the present invention can also be performed according to the specific flow shown in Fig. 2. Fig. 2 is a flowchart of an optional method for extracting the body text of a web page according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S202 to S214:
S202: Send a request for the web page address, where the web page address is the address of the web page to be extracted. Specifically, the address of the web page to be extracted is sent to the server.
S204: Obtain the returned HTML code. Specifically, obtain the HTML code of the web page corresponding to the address, as returned by the server. This is equivalent to obtaining the HTML code of the web page to be extracted in step S102 and is not repeated here.
S206: Establish the DOM tree, that is, the tree structure of the web page. This is equivalent to establishing the tree structure of the web page to be extracted according to the HTML code in step S102 and is not repeated here.
S207: Set the current leaf node to the first leaf node.
S208: Obtain the element corresponding to the current leaf node.
S210: Judge whether the element is a block element, that is, whether the element corresponding to the current leaf node obtained in step S208 is a block element. If it is a block element, perform step S212. If it is not, perform step S211: set the current leaf node to the next leaf node and return to step S208.
S212: Calculate the amount of information of the element corresponding to the leaf node and of the element corresponding to the parent node of the leaf node. This is equivalent to step S106: when the element corresponding to the leaf node is judged to be a block element, the amount of information of that element and of the element corresponding to its parent node is calculated. After both amounts have been calculated, perform step S213: judge whether all leaf nodes have been traversed. If so, perform step S214. If not, perform step S211: set the current leaf node to the next leaf node, and then return to step S208.
S214: Take the text contained in the element with the largest amount of information as the body text to be extracted. Specifically, all calculated amounts of information are sorted, and the text contained in the element with the largest amount of information is extracted as the body text. This step is equivalent to step S110 and is not repeated here.
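The traversal in steps S207 to S214 can be sketched as a single loop. This is an illustration under stated assumptions rather than the patented implementation: the `Element` record, the reduced `BLOCK_TAGS` set, and the function names `info_amount` and `extract_body` are hypothetical, the leaf elements are assumed to have been collected already, and the sketch keeps a running maximum instead of sorting all scores, which selects the same element.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional

BLOCK_TAGS = {"div", "p", "li", "blockquote", "section", "article"}  # partial list

@dataclass
class Element:
    tag: str
    text: str                       # all text contained in the element
    parent: Optional["Element"] = None

def info_amount(text: str) -> float:
    """Entropy of the character distribution times the squared text length."""
    chars = [c for c in text if c.isalnum()]   # assumption: drop punctuation
    if not chars:
        return 0.0
    n = len(chars)
    entropy = -sum((m / n) * math.log2(m / n) for m in Counter(chars).values())
    return entropy * n ** 2

def extract_body(leaves):
    best, best_score = None, -1.0
    for leaf in leaves:                      # S207, S211, S213: walk the leaf nodes
        if leaf.tag not in BLOCK_TAGS:       # S210: skip non-block elements
            continue
        for el in (leaf, leaf.parent):       # S212: score the leaf and its parent
            if el is None:
                continue
            score = info_amount(el.text)
            if score > best_score:
                best, best_score = el, score
    return best.text if best is not None else ""   # S214

article = Element("div", "First paragraph of the story. Second paragraph of the story.")
p1 = Element("p", "First paragraph of the story.", article)
p2 = Element("p", "Second paragraph of the story.", article)
byline = Element("span", "Author: Someone", Element("div", "Author: Someone"))
body = extract_body([p1, p2, byline])
```

In this toy input the byline is skipped because its leaf element is inline, and the parent `div` wins over either paragraph because it contains the text of both.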
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, a device for extracting the body text of a web page is also provided for implementing the above method. The device is mainly used to perform the body text extraction method provided above in the embodiments of the present invention. The device provided by the embodiment of the present invention is described in detail below:
Fig. 3 is a schematic diagram of the device for extracting the body text of a web page according to an embodiment of the present invention. As shown in Fig. 3, the device mainly includes a first acquisition unit 10, a first extraction unit 20, a computing unit 30, a second acquisition unit 40, and a second extraction unit 50, where:
The first acquisition unit 10 is configured to obtain the HyperText Markup Language (HTML) code of the web page to be extracted and to establish a tree structure of the web page according to the HTML code. Specifically, the tree structure is based on the Document Object Model (DOM), so it is also called a DOM tree.
The first extraction unit 20 is configured to extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure. That is, for each leaf node of the tree structure (the DOM tree), it extracts the first element corresponding to the leaf node and the second element corresponding to its parent node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of the leaf nodes equals the number of second elements.
The computing unit 30 is configured to calculate the index values of the first element and the second element, where an index value represents the amount of information of an element. That is, it calculates the amount of information of each first element and of each second element. The larger the index value of an element, the larger its amount of information; conversely, the smaller the index value, the smaller its amount of information.
The second acquisition unit 40 is configured to obtain the element corresponding to the largest index value as the target element. That is, among the index values of all first elements and all second elements, it finds the element with the largest index value; that element is the target element.
The second extraction unit 50 is configured to extract the text contained in the target element as the body text of the web page to be extracted. That is, the text content of the target element is the body text that needs to be extracted.
In the embodiment of the present invention, a tree structure is built from the HTML code of the web page
to be extracted, which makes it possible to determine and extract the elements corresponding to the leaf
nodes and the elements corresponding to the parent nodes of the leaf nodes. The information content of
each extracted element is then calculated, and the text contained in the element with the greatest
information content is extracted as the body text of the web page. This approach uses information
content as the reference index for body text extraction, so it considers not only the length of the text
in an element but also the degree of disorder of that text. Compared with the prior art, which uses text
density alone as the reference index, that is, considers only the ratio of text length to code length,
it solves the problem that body text extraction from web pages is not accurate enough, improves the
accuracy of body text extraction, and provides an accurate data basis for subsequent big data analysis.
Moreover, because the extraction is based on the tree structure built from the HTML code of the web
page, the body text of different web pages, which may be encoded in different forms, can be extracted
without configuring each page separately, which reduces resource consumption and increases extraction
speed.
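The overall flow described above can be sketched end to end with the Python standard library. This is a simplified illustration under stated assumptions, not the patented implementation: all class and function names are hypothetical, punctuation and whitespace are excluded from the character strings, `script` and `style` content is ignored, and the block-element check of the preferred embodiment is omitted.

```python
import math
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # assumption: their content is not page text


class Node:
    """One element of the tree; items holds child nodes and text in document order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []
        if parent is not None:
            parent.items.append(self)

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        return "".join(i.text() if isinstance(i, Node) else i for i in self.items)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.items.append(data)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def index_value(text):
    """Entropy of the characters times the square of the character count."""
    chars = [c for c in text if c.isalnum()]  # assumption: drop punctuation/whitespace
    n = len(chars)
    if n == 0:
        return 0.0
    entropy = -sum(k / n * math.log2(k / n) for k in Counter(chars).values())
    return entropy * n ** 2


def extract_body_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    # First elements: leaf nodes. Second elements: their distinct parent nodes.
    leaves = [n for n in walk(root)
              if n is not root and not n.children and n.tag not in SKIP_TAGS]
    parents = {id(n.parent): n.parent for n in leaves if n.parent is not root}
    candidates = leaves + list(parents.values())
    if not candidates:
        return ""
    target = max(candidates, key=lambda n: index_value(n.text()))
    return target.text().strip()


page = """<html><body>
<div class="meta"><span>Author: A. Writer</span><span>2015-01-01</span></div>
<div class="article"><p>The weather is fine today.</p>
<p>It is sunny, but it is really very cold outside.</p></div>
</body></html>"""
body = extract_body_text(page)
```

In this sample page the `div` holding both paragraphs has the largest index value, so its text is returned and the author and date line is left out.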
Specifically, the computing unit 30 includes a first computing module, a second computing module, a third
computing module and a fourth computing module, wherein:
The first computing module is configured to calculate the first entropy Es1j and the first text length
Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first
nodes; that is, the first entropy and the first text length are calculated for each first element. The
first text length is the number of characters in the text contained in the first element, and the first
entropy is the entropy of the text contained in the first element, which reflects the information
density of that text. For example, if the text contained in the first element G1 has 100 characters, the
text length Ls11 of the first element G1 is 100.
The second computing module is configured to calculate the index value I1j of the first element Gj
according to the formula $I_{1j} = Es_{1j} \times (Ls_{1j})^2$; that is, the index value of each first
element is determined as the product of its first entropy and the square of its first text length.
The third computing module is configured to calculate the second entropy Es2i and the second text length
Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent
nodes of the first nodes; that is, the second entropy and the second text length are calculated for each
second element. The second text length is the number of characters in the text contained in the second
element, and the second entropy is the entropy of the text contained in the second element, which
reflects the information density of that text. For example, if the text contained in the second element
A1 has 300 characters, the text length Ls21 of the second element A1 is 300.
The fourth computing module is configured to calculate the index value I2i of the second element Ai
according to the formula $I_{2i} = Es_{2i} \times (Ls_{2i})^2$; that is, the index value of each second
element is determined as the product of its second entropy and the square of its second text length.
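The two index value formulas have the same form, so a single helper covers both first and second elements. The sketch below is illustrative: the function name `index_value` and the entropy value 4.0 are assumptions, while the text lengths 100 and 300 are the ones used in the examples above. It shows that, at equal entropy, tripling the text length multiplies the index value by nine, so longer text is strongly favored.

```python
def index_value(entropy: float, text_length: int) -> float:
    """Index value I = Es * Ls^2, the common form of the formulas for I1j and I2i."""
    if entropy < 0 or text_length < 0:
        raise ValueError("entropy and text length must be non-negative")
    return entropy * text_length ** 2


# Assumed entropy of 4.0 for both elements; lengths follow the examples (100 and 300).
i11 = index_value(4.0, 100)
i21 = index_value(4.0, 300)
ratio = i21 / i11
```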
Specifically, the first computing module includes a first calculating submodule, which is configured to
calculate the first entropy Es1j of the first element Gj according to the formula
$Es_{1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where S1j is the first character string in the
first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in
turn, q is the number of characters in the first character string S1j (in the example below, each
distinct character contributes one term to the sum), and P(Ck1) is the probability that the character
Ck1 occurs in the first character string S1j. In the embodiment of the present invention, all the
characters in the text contained in a first element make up the first character string of that first
element. The frequency with which each character occurs in the first character string is calculated,
giving the probability of each character in the first character string; each probability is multiplied
by its logarithm, all the products are summed, and the sum is negated to give the first entropy of the
first element. For example, if the text contained in the first element G1 is "今天天气晴。" ("The weather
is fine today."), the first character string S11 of the first element G1 is "今天天气晴", which contains 5
characters in total. The probability that "今" occurs in S11 is 1/5, the probability that "天" occurs in
S11 is 2/5, the probability that "气" occurs in S11 is 1/5, and the probability that "晴" occurs in S11 is
1/5, so the first entropy of the first element G1 is
$Es_{11} = -(\tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5})$.
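The entropy calculation can be checked numerically. The following is a minimal sketch, assuming that the worked example refers to the five-character Chinese string 今天天气晴 (an assumption about the original text behind the translation) and that each distinct character contributes one term to the sum, as in the worked example. The function name `char_entropy` is hypothetical.

```python
import math
from collections import Counter


def char_entropy(s: str) -> float:
    """Shannon entropy (base 2) of the character distribution of s."""
    n = len(s)
    if n == 0:
        return 0.0
    # One term per distinct character: -P(c) * log2 P(c)
    return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(s).values())


s11 = "今天天气晴"  # assumed first character string S11, punctuation already removed
probs = {ch: cnt / len(s11) for ch, cnt in Counter(s11).items()}
es11 = char_entropy(s11)
```

For this string the result equals log2(5) - 2/5, about 1.92 bits.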
Specifically, the third computing module includes a second calculating submodule, which is configured to
calculate the second entropy Es2i of the second element Ai according to the formula
$Es_{2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where S2i is the second character string in the
second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in
turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability
that the character Ck2 occurs in the second character string S2i.
In the embodiment of the present invention, all the characters in the text contained in a second element
make up the second character string of that second element. The frequency with which each character
occurs in the second character string is calculated, giving the probability of each character in the
second character string; each probability is multiplied by its logarithm, all the products are summed,
and the sum is negated to give the second entropy of the second element. For example, if the text
contained in the second element A1 is "今天天气晴，但是真很冷。" ("The weather is fine today, but it is really
very cold."), the second character string S21 of the second element A1 is "今天天气晴但是真很冷", which
contains 10 characters in total. The probability that "今" occurs in S21 is 1/10, the probability that
"天" occurs in S21 is 2/10, and the characters "气", "晴", "但", "是", "真", "很" and "冷" each occur in S21
with probability 1/10, so the second entropy Es21 of the second element A1 can be calculated according to
the following formula:
$Es_{21} = -(\tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10})$.
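Grouping equal terms makes the value of this example easy to evaluate. The simplification below is a worked check, not part of the original description; it uses the fact that, among the ten characters, eight occur once and one occurs twice.

```latex
\[
\begin{aligned}
Es_{21} &= -\left(8 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10}\right) \\
        &= \tfrac{8}{10}\log_2 10 + \tfrac{2}{10}\left(\log_2 10 - 1\right) \\
        &= \log_2 10 - \tfrac{1}{5} \approx 3.12
\end{aligned}
\]
```

The same grouping applied to the first element's example gives log2(5) - 2/5, about 1.92, so the longer and more varied character string has the higher entropy.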
Preferably, in the embodiment of the present invention, the first extraction unit includes a judging
module and a processing module. The judging module is configured to judge whether the first element
corresponding to a first node is a block element. The processing module is configured to extract the
first element corresponding to the first node and the second element corresponding to the parent node of
the first node when the first element is judged to be a block element. In other words, the first element
corresponding to the first node and the second element corresponding to the parent node of that first
node are extracted only when the first element is judged to be a block element. It should be noted that,
when judging whether first elements are block elements, either the first element of a single first node
can be judged at a time, or the elements corresponding to multiple first nodes can each be judged in one
pass.
In the embodiment of the present invention, judging whether the first element of a first node is a block
element avoids extracting first elements that are not block elements and whose text content is, for
example, the "article source", "publication time" or "author", which further improves the accuracy of
body text extraction from the web page.
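The block-element check can be sketched as a filter applied before the first and second elements are extracted. The description does not say how a block element is recognized, so the sketch below assumes a fixed set of common HTML block-level tag names; `BLOCK_TAGS`, `LeafNode`, `is_block_element`, `extract_candidates`, and the sample data are all hypothetical.

```python
from typing import List, NamedTuple

# Assumption: a block element is recognized by its tag name alone.
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "li", "main", "nav", "ol", "p", "pre", "section", "table", "ul",
})


class LeafNode(NamedTuple):
    tag: str         # tag of the first element (leaf node)
    text: str        # text contained in the first element
    parent_tag: str  # tag of the second element (parent node)


def is_block_element(tag: str) -> bool:
    """Judging step: is this tag one of the assumed block-level tags?"""
    return tag.lower() in BLOCK_TAGS


def extract_candidates(leaves: List[LeafNode]) -> List[LeafNode]:
    """Processing step: keep only leaves whose first element is a block element."""
    return [leaf for leaf in leaves if is_block_element(leaf.tag)]


leaves = [
    LeafNode("span", "Source: Example News", "div"),
    LeafNode("span", "Author: A. Writer", "div"),
    LeafNode("p", "First paragraph of the article.", "div"),
    LeafNode("P", "Second paragraph of the article.", "div"),
]
kept = extract_candidates(leaves)
```

Here the two `span` leaves carrying the source and author are dropped, and only the paragraph leaves and their parents go on to the index value calculation.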
As can be seen from the above description, the present invention solves the problem in the prior art that
body text extraction from web pages is not accurate enough, and thereby improves the accuracy of body
text extraction from web pages.
The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own
emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the
other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed
client may be implemented in other ways. The device embodiments described above are merely illustrative.
For example, the division into units is only a division by logical function, and other divisions are
possible in actual implementation: multiple units or components may be combined or integrated into
another system, or some features may be omitted or not performed. In addition, the mutual coupling,
direct coupling or communication connection shown or discussed may be indirect coupling or communication
connection through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components
shown as units may or may not be physical units; that is, they may be located in one place or
distributed over multiple network units. Some or all of the units may be selected according to actual
needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one
processing unit, each unit may exist physically on its own, or two or more units may be integrated into
one unit. The integrated unit may be implemented in the form of hardware or in the form of a software
functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an
independent product, it may be stored in a computer-readable storage medium. Based on this understanding,
the technical solution of the present invention in essence, or the part that contributes to the prior
art, or all or part of the technical solution, may be embodied in the form of a software product. The
computer software product is stored in a storage medium and includes several instructions that cause a
computer device (which may be a personal computer, a server, a network device, or the like) to perform
all or some of the steps of the methods described in the embodiments of the present invention. The
aforementioned storage medium includes various media that can store program code, such as a USB flash
drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a
removable hard disk, a magnetic disk, or an optical disc.
The above is only a preferred embodiment of the present invention. It should be noted that those of
ordinary skill in the art may make several improvements and refinements without departing from the
principles of the present invention, and these improvements and refinements shall also be regarded as
falling within the protection scope of the present invention.