CN104484449B - Method and device for extracting the body text of a web page


Publication number
CN104484449B
Authority
CN
China
Prior art keywords
text
node
character string
webpage
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410827773.7A
Other languages
Chinese (zh)
Other versions
CN104484449A (en
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410827773.7A
Publication of CN104484449A
Application granted
Publication of CN104484449B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for extracting the body text of a web page. The method includes: obtaining the HyperText Markup Language (HTML) code of the web page to be extracted, and building a tree structure of the web page to be extracted from the HTML code; extracting the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; calculating index values of the first element and the second element, where an index value represents the information content of the first element or the second element; obtaining the element with the largest index value as the target element; and extracting the text contained in the target element as the body text of the web page to be extracted. The invention addresses the problem that body text extraction from web pages in the prior art is not accurate enough, and thereby improves extraction accuracy.

Description

Method and device for extracting the body text of a web page
Technical Field
The present invention relates to the field of data processing, and in particular to a method and a device for extracting the body text of a web page.
Background
Web page text, and the body text in particular, is the most important information in a web page and an important data source for big data analysis.
In the prior art, body text is mostly extracted using text density as the reference index, where text density is the ratio of the text length of an HTML (HyperText Markup Language) element to the code length of that element. An element with high text density is not necessarily body text: elements holding the source, time, or author may be wrongly extracted as body text. Conversely, an element with low text density is not necessarily something other than body text: some body text carries style information and hyperlinks, which lower its text density. Extracting body text with text density as the only reference index is therefore prone to extraction errors.
No effective solution has yet been proposed for the problem that body text extraction from web pages in the prior art is not accurate enough.
Summary of the Invention
The main object of the present invention is to provide a method and a device for extracting the body text of a web page, so as to solve the problem that body text extraction in the prior art is not accurate enough.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting the body text of a web page is provided. The method according to the present invention includes: obtaining the HyperText Markup Language (HTML) code of the web page to be extracted, and building a tree structure of the web page to be extracted from the HTML code; extracting the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; calculating index values of the first element and the second element, where an index value represents the information content of an element; obtaining the element with the largest index value as the target element; and extracting the text contained in the target element as the body text of the web page to be extracted.
Further, calculating the index values of the first element and the second element includes: calculating the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes 1 to n in turn and n is the number of first nodes; calculating the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$; calculating the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes 1 to w in turn and w is the number of parent nodes of the first nodes; and calculating the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$.
Further, the first entropy Es1j of the first element Gj is calculated according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1) \log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
Further, the second entropy Es2i of the second element Ai is calculated according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2) \log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
Further, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node includes: judging whether the first element corresponding to the first node is a block element; and, when the first element corresponding to the first node is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for extracting the body text of a web page is provided. The device according to the present invention includes: a first acquisition unit, configured to obtain the HyperText Markup Language (HTML) code of the web page to be extracted and to build a tree structure of the web page to be extracted from the HTML code; a first extraction unit, configured to extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; a computing unit, configured to calculate index values of the first element and the second element, where an index value represents the information content of an element; a second acquisition unit, configured to obtain the element with the largest index value as the target element; and a second extraction unit, configured to extract the text contained in the target element as the body text of the web page to be extracted.
Further, the computing unit includes: a first computing module, configured to calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes 1 to n in turn and n is the number of first nodes; a second computing module, configured to calculate the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$; a third computing module, configured to calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes 1 to w in turn and w is the number of parent nodes of the first nodes; and a fourth computing module, configured to calculate the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$.
Further, the first computing module includes: a first calculating submodule, configured to calculate the first entropy Es1j of the first element Gj according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1) \log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
Further, the second computing module includes: a second calculating submodule, configured to calculate the second entropy Es2i of the second element Ai according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2) \log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
Further, the first extraction unit includes: a judging module, configured to judge whether the first element corresponding to the first node is a block element; and a processing module, configured to extract, when the first element corresponding to the first node is judged to be a block element, the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
According to the embodiments of the invention, the HTML code of the web page to be extracted is obtained, and a tree structure of the web page to be extracted is built from the HTML code; the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node are extracted, where the first node is a leaf node of the tree structure; index values of the first element and the second element are calculated, where an index value represents the information content of the first element or the second element; the element with the largest index value is obtained as the target element; and the text contained in the target element is extracted as the body text of the web page to be extracted. Building a tree structure from the HTML code of the page makes it possible to determine and extract the elements corresponding to the leaf nodes and to the parent nodes of the leaf nodes. The information content of each extracted element is then calculated, and the text contained in the element with the largest information content is extracted as the body text of the web page. Because information content is the reference index, this extraction approach considers not only the length of the text in an element but also the degree of disorder of that text. Compared with prior-art approaches that use text density alone as the reference index, that is, that consider only the ratio of text length to code length, it solves the problem that body text extraction from web pages is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis.
In addition, because extraction is based on a tree structure built from the HTML code of the web page, the body text of different web pages can be extracted without separate configuration for their different encoding forms, which reduces resource consumption and improves extraction speed.
Brief Description of the Drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a body text extraction method for a web page according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional body text extraction method for a web page according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a body text extraction device for a web page according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The technical terms involved in the embodiments of the present invention are explained as follows:
A block element, also called a block-level element, is the counterpart of an inline element; both are concepts in the HTML specification. When a browser displays a block-level element, the element usually begins (and ends) with a line break.
Embodiment 1
According to an embodiment of the present invention, a method embodiment is provided that can be implemented by the device embodiment of this application. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
According to an embodiment of the present invention, a body text extraction method for a web page is provided. Fig. 1 is a flowchart of the body text extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S102 to S110:
S102: Obtain the HyperText Markup Language (HTML) code of the web page to be extracted, and build a tree structure of the web page to be extracted from the HTML code. Specifically, the tree structure is based on the Document Object Model (DOM), so it is also called a DOM tree.
S104: Extract the first element corresponding to each first node of the tree structure and the second element corresponding to the parent node of that first node, where a first node is a leaf node of the tree structure. That is, for each leaf node of the tree structure (the DOM tree), extract the first element corresponding to the leaf node and the second element corresponding to its parent node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of leaf nodes equals the number of second elements.
S106: Calculate the index values of the first elements and the second elements, where an index value represents the information content of an element. That is, calculate the information content of each first element and of each second element. The larger the index value of an element, the larger its information content; conversely, the smaller the index value, the smaller its information content.
S108: Obtain the element with the largest index value as the target element. That is, find the element with the largest index value among the index values of all first elements and all second elements; that element is the target element.
S110: Extract the text contained in the target element as the body text of the web page to be extracted. That is, the text content of the target element is the body text that needs to be extracted.
In this embodiment of the present invention, a tree structure is built from the HTML code of the page to be extracted, so that the elements corresponding to the leaf nodes and to the parent nodes of the leaf nodes can be determined and extracted. The information content of each extracted element is then calculated, and the text contained in the element with the largest information content is extracted as the body text of the web page. Because information content is the reference index, this extraction approach considers not only the length of the text in an element but also the degree of disorder of that text. Compared with prior-art approaches that use text density alone as the reference index, that is, that consider only the ratio of text length to code length, it solves the problem that body text extraction from web pages is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. In addition, because extraction is based on a tree structure built from the HTML code of the web page, the body text of different web pages can be extracted without separate configuration for their different encoding forms, which reduces resource consumption and improves extraction speed.
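Step S102 can be illustrated with a minimal sketch that builds a tree from HTML code using Python's standard `html.parser` module. The patent does not name a parser; the `Node` and `TreeBuilder` classes and the `build_tree` helper are hypothetical names introduced here, and the result is a simplification of a full DOM tree (no attributes, and no error recovery beyond ignoring stray end tags).

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # strings and child Nodes, in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def is_leaf(self):
        # A leaf node has no child elements (it may still hold text).
        return not self.children

    def text(self):
        # Text of this element and all of its descendants.
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.content)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.current.content.append(data)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Each `Node` keeps a link to its parent, which is what step S104 needs in order to pair a leaf node's element with the element of its parent node.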
Specifically, the index value of each first element and each second element can be calculated through steps 1-1 to 1-4, as follows:
Step 1-1: Calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes 1 to n in turn and n is the number of first nodes. That is, calculate the first entropy and the first text length of each first element. The first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of the text contained in the first element, which reflects the information density of that text. For example, if the text contained in the first element G1 has 100 characters, the text length Ls11 of the first element G1 is 100.
Step 1-2: Calculate the index value I1j of the first element Gj according to the formula $I1j = Es1j \cdot (Ls1j)^2$. That is, the index value of each first element is the product of its first entropy and the square of its first text length.
Step 1-3: Calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes 1 to w in turn and w is the number of parent nodes of the first nodes. That is, calculate the second entropy and the second text length of each second element. The second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of the text contained in the second element, which reflects the information density of that text. For example, if the text contained in the second element A1 has 300 characters, the text length Ls21 of the second element A1 is 300.
Step 1-4: Calculate the index value I2i of the second element Ai according to the formula $I2i = Es2i \cdot (Ls2i)^2$. That is, the index value of each second element is the product of its second entropy and the square of its second text length.
Specifically, in this embodiment of the present invention, the first entropy Es1j of the first element Gj is calculated according to the formula $Es1j = -\sum_{k=1}^{q} P(Ck1) \log_2 P(Ck1)$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j. All characters in the text contained in a first element make up the first character string of that element. The frequency with which each character occurs in the first character string gives the probability of that character in the string; the probability of each character is multiplied by the logarithm of that probability, all the products are summed, and the sum is negated to give the first entropy of the element. For example, if the text contained in the first element G1 is "今天天气晴。" ("The weather is fine today."), the first character string S11 of the first element G1 is "今天天气晴", which contains 5 characters in total. The probability of "今" in S11 is 1/5, the probability of "天" is 2/5, the probability of "气" is 1/5, and the probability of "晴" is 1/5. With one term per distinct character, the first entropy Es11 of the first element G1 is:
$Es11 = -\left(\tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right)$.
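The entropy calculation described above can be sketched in a few lines of Python. The function name `char_entropy` is hypothetical, and dropping punctuation with `str.isalnum` is an assumption made here to match the worked example, in which the full stop is not part of the character string; the patent does not say how punctuation is filtered.

```python
import math
from collections import Counter


def char_entropy(text):
    """Entropy (base 2) of the character distribution of `text`.

    Assumption: only alphanumeric characters count, so punctuation such as
    the full stop in the worked example is dropped from the character string.
    """
    chars = [ch for ch in text if ch.isalnum()]
    if not chars:
        return 0.0
    total = len(chars)
    # One term per distinct character: -P(c) * log2(P(c)).
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chars).values())


es11 = char_entropy("今天天气晴。")  # the first element G1 from the example
```

For the example string, `es11` evaluates to about 1.92, which is the value of the expression for Es11 given above.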
Specifically, in this embodiment of the present invention, the second entropy Es2i of the second element Ai is calculated according to the formula $Es2i = -\sum_{k=1}^{p} P(Ck2) \log_2 P(Ck2)$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i. All characters in the text contained in a second element make up the second character string of that element. The frequency with which each character occurs in the second character string gives the probability of that character in the string; the probability of each character is multiplied by the logarithm of that probability, all the products are summed, and the sum is negated to give the second entropy of the element. For example, if the text contained in the second element A1 is "今天天气晴，但是真很冷。" ("The weather is fine today, but it is really cold."), the second character string S21 of the second element A1 is "今天天气晴但是真很冷", which contains 10 characters in total. The probability of "天" in S21 is 2/10, and the probability of each of "今", "气", "晴", "但", "是", "真", "很", and "冷" is 1/10. With one term per distinct character, the second entropy Es21 of the second element A1 is:
$Es21 = -\left(\tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10}\right)$.
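To see how entropy and text length combine into the index value, the two worked examples can be run through the formula index = entropy × (text length)². This is a sketch only: `text_chars` and `index_value` are hypothetical helper names, and the punctuation filter based on `str.isalnum` is an assumption chosen to reproduce the character strings used in the examples.

```python
import math
from collections import Counter


def text_chars(text):
    # Assumption: punctuation is not part of the character string.
    return [ch for ch in text if ch.isalnum()]


def index_value(text):
    """Index value I = Es * Ls**2 for the text of one element."""
    chars = text_chars(text)
    if not chars:
        return 0.0
    length = len(chars)
    entropy = -sum((n / length) * math.log2(n / length)
                   for n in Counter(chars).values())
    return entropy * length ** 2


leaf_index = index_value("今天天气晴。")                  # first element G1
parent_index = index_value("今天天气晴，但是真很冷。")    # second element A1
target = "A1" if parent_index > leaf_index else "G1"
```

With these inputs the first element scores roughly 48.05 and the second element roughly 312.19, so the element with the longer and more varied text would be chosen as the target element. Squaring the text length makes length dominate the comparison, while the entropy factor penalizes repetitive text.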
Preferably, in this embodiment of the present invention, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node includes: judging whether the first element corresponding to the first node is a block element; and, when the first element corresponding to the first node is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node. That is, the first element and the second element corresponding to the parent node are extracted only when the first element is judged to be a block element. It should be noted that this judgment may be made for the first element of one first node at a time, or for the elements of multiple first nodes at once.
In this embodiment of the present invention, judging whether the first element of a first node is a block element prevents first elements that are not block elements, whose text content is for example the article source, the publication time, or the author, from being extracted, which further improves the accuracy of body text extraction.
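A minimal sketch of the block element check follows. The patent does not list which tags count as block elements; the `BLOCK_TAGS` set below is an assumption based on elements that browsers commonly display as block-level by default, and `is_block_element` is a hypothetical helper name. A check based on the tag name also ignores any CSS that changes an element's display type.

```python
# Assumed set of tags displayed as block-level by default (not from the patent).
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "dd", "details", "dialog",
    "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li",
    "main", "nav", "ol", "p", "pre", "section", "table", "ul",
})


def is_block_element(tag):
    """Return True if the tag name is in the assumed block-level set."""
    return tag.lower() in BLOCK_TAGS


# Leaf elements such as <span> or <a>, which often hold the source, the
# publication time, or the author, are filtered out by this check.
candidates = [t for t in ["p", "span", "div", "a", "em"] if is_block_element(t)]
```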
In addition, the body text extraction method provided by this embodiment of the present invention can also be performed according to the specific flow shown in Fig. 2. Fig. 2 is a flowchart of an optional body text extraction method for a web page according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S202 to S214:
S202: Send a request for the web page address, where the web page address is the address of the web page to be extracted. Specifically, the web page address of the web page to be extracted is sent to the server.
S204: Obtain the returned HTML code. Specifically, obtain the HTML code of the web page corresponding to the web page address, as returned by the server. This is equivalent to obtaining the HTML code of the web page to be extracted in step S102 and is not repeated here.
S206: Build the DOM tree, that is, build the tree structure of the web page. This is equivalent to building the tree structure of the web page to be extracted from the HTML code in step S102 and is not repeated here.
S207: Set the current leaf node to the first leaf node.
S208: Obtain the element corresponding to the current leaf node.
S210: Judge whether the element is a block element, that is, judge whether the element corresponding to the current leaf node obtained in step S208 is a block element. If the element is a block element, perform step S212. If the element is not a block element, perform step S211: set the current leaf node to the next leaf node, and return to step S208.
S212: Calculate the information content of the element corresponding to the leaf node and the information content of the element corresponding to the parent node of the leaf node. This is equivalent to step S106: when the element corresponding to the leaf node is judged to be a block element, the information content of that element and the information content of the element corresponding to the parent node of the leaf node are both calculated. After these two values have been calculated, perform step S213: judge whether all leaf nodes have been traversed. If so, perform step S214. If not, perform step S211: set the current leaf node to the next leaf node, and then return to step S208.
S214: Take the text contained in the element with the largest information content as the body text to be extracted. Specifically, sort all the calculated information contents, and extract the text contained in the element with the largest information content as the body text. This step is equivalent to step S110 and is not repeated here.
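The loop of steps S207 to S214 can be sketched as follows, under stated assumptions. The tree is built by hand rather than parsed, so steps S202 to S206 are out of scope. The `Element` class, the reduced `BLOCK_TAGS` set, and the punctuation filter based on `str.isalnum` are all hypothetical choices made for illustration. The sketch also keeps a running maximum instead of sorting all values as the text of S214 describes; both select the same element.

```python
import math
from collections import Counter

BLOCK_TAGS = {"div", "p", "li", "blockquote", "pre", "section", "article"}


class Element:
    """Hand-built stand-in for a DOM element (hypothetical helper)."""

    def __init__(self, tag, own_text=""):
        self.tag, self.own_text = tag, own_text
        self.parent, self.children = None, []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def text(self):
        return self.own_text + "".join(c.text() for c in self.children)


def leaves(node):
    # Depth-first traversal that yields every leaf node once (S207, S211).
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def index_value(text):
    chars = [ch for ch in text if ch.isalnum()]  # assumed punctuation filter
    if not chars:
        return 0.0
    n = len(chars)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(chars).values())
    return entropy * n ** 2


def extract_body_text(root):
    best, best_score = None, -1.0
    for leaf in leaves(root):                    # S208, S213
        if leaf.tag not in BLOCK_TAGS:           # S210: skip non-block leaves
            continue
        for candidate in (leaf, leaf.parent):    # S212: leaf and its parent
            if candidate is None:
                continue
            score = index_value(candidate.text())
            if score > best_score:
                best, best_score = candidate, score
    return best.text() if best is not None else ""   # S214


root = Element("body")
header = root.add(Element("div"))
header.add(Element("span", "Author: Li"))        # inline leaf, skipped at S210
article = root.add(Element("div"))
article.add(Element("p", "今天天气晴，"))
article.add(Element("p", "但是真很冷。"))
body_text = extract_body_text(root)
```

In this small tree the `span` leaf is skipped because it is not a block element, and the parent `div` of the two paragraphs has the largest index value, so its combined text is returned.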
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, a body text extraction device for web pages is also provided for implementing the above body text extraction method. The device is mainly used to perform the body text extraction method provided above, and is described in detail below:
Fig. 3 is a schematic diagram of the body text extraction device for web pages according to an embodiment of the present invention. As shown in Fig. 3, the device mainly includes a first acquisition unit 10, a first extraction unit 20, a computing unit 30, a second acquisition unit 40, and a second extraction unit 50, wherein:
The first acquisition unit 10 is used to obtain the HyperText Markup Language (HTML) code of the web page to be extracted and to build a tree structure of that web page from the HTML code. Specifically, the tree structure is based on the Document Object Model (DOM), so it is referred to as a DOM tree.
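As an illustration of what the first acquisition unit does, the sketch below builds a simple DOM-like tree with `html.parser` from the Python standard library. The `DomNode` and `DomBuilder` classes and the `build_dom` function are hypothetical names chosen for this example; the patent does not specify a parser, and a production system would more likely rely on a full HTML5 parser.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class DomNode:
    """One element of the tree (hypothetical helper class)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""            # text directly inside this element

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = DomNode("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node   # descend into the newly opened element

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Note that this sketch also records the contents of `script` and `style` elements as text; a real extractor would presumably discard those before scoring.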
The first extraction unit 20 is used to extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure. That is, for each leaf node of the tree structure (the DOM tree), it extracts the first element corresponding to the leaf node and the second element corresponding to the leaf node's parent node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of leaf nodes equals the number of second elements.
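The extraction of first elements (leaves) and second elements (their parents) can be sketched as a tree walk. The `Node` class and both function names below are hypothetical. Parents shared by several leaves are collected once, which is one reading of the statement that the number of second elements equals the number of parent nodes.

```python
class Node:
    """Minimal tree node (hypothetical helper class)."""
    def __init__(self, tag, text="", children=()):
        self.tag = tag
        self.text = text
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def leaf_nodes(root):
    """Return all leaf nodes in document order using an explicit stack."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Push children reversed so the leftmost child is visited first.
            stack.extend(reversed(node.children))
        else:
            leaves.append(node)
    return leaves

def first_and_second_elements(root):
    """First elements are the leaves; second elements are their distinct parents."""
    firsts = leaf_nodes(root)
    seconds, seen = [], set()
    for leaf in firsts:
        parent = leaf.parent
        if parent is not None and id(parent) not in seen:
            seen.add(id(parent))
            seconds.append(parent)
    return firsts, seconds
```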
The computing unit 30 is used to calculate the index value of each first element and each second element, where the index value represents the information content of an element. The larger an element's index value, the greater its information content; conversely, the smaller the index value, the smaller the information content.
The second acquisition unit 40 is used to obtain the element corresponding to the maximum index value, which is the target element. That is, it finds the element with the largest index value among the index values of all first elements and all second elements; that element is the target element.
The second extraction unit 50 is used to extract the text contained in the target element as the body text of the web page to be extracted; that is, the text in the target element is the body text that needs to be extracted.
In this embodiment of the present invention, a tree structure is built from the HTML code of the web page to be extracted, which makes it possible to determine and extract the elements corresponding to the leaf nodes and to their parent nodes. The information content of each extracted element is then calculated, and the text contained in the element with the largest information content is extracted as the body text of the web page. Using information content as the reference index takes into account not only the length of the text in an element but also the degree of disorder (entropy) of that text. The prior art, by contrast, uses only text density as the reference index, that is, it considers only the ratio of text length to code length. The approach therefore solves the problem that body text extraction from web pages in the prior art is not accurate enough, improves extraction accuracy, and provides an accurate data basis for subsequent big data analysis. Moreover, because extraction is based on a tree structure built from the page's HTML code, the body text of different web pages, which may be coded in different forms, can be extracted without configuring each page separately, which reduces resource consumption and increases extraction speed.
Specifically, the computing unit 30 includes a first computing module, a second computing module, a third computing module, and a fourth computing module, wherein:
The first computing module is used to calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first nodes. That is, it calculates the first entropy and the first text length of every first element. The first text length is the number of characters in the text contained in the first element; the first entropy is the entropy of that text and reflects its information density. For example, if the text contained in the first element G1 has 100 characters, the text length Ls11 of G1 is 100.
The second computing module is used to calculate the index value I1j of the first element Gj according to the formula $I_{1j} = Es_{1j} \cdot (Ls_{1j})^2$; that is, the index value of each first element is the product of its first entropy and the square of its first text length.
The third computing module is used to calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes. That is, it calculates the second entropy and the second text length of every second element. The second text length is the number of characters in the text contained in the second element; the second entropy is the entropy of that text and reflects its information density. For example, if the text contained in the second element A1 has 300 characters, the text length Ls21 of A1 is 300.
The fourth computing module is used to calculate the index value I2i of the second element Ai according to the formula $I_{2i} = Es_{2i} \cdot (Ls_{2i})^2$; that is, the index value of each second element is the product of its second entropy and the square of its second text length.
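The index value computed by the second and fourth computing modules is a one-line formula. The sketch below uses the text lengths from the examples above (100 and 300); the function name `index_value` and the entropy of 4.0 bits are assumptions made purely for illustration.

```python
def index_value(entropy, text_length):
    """Index value I = E * L**2: entropy times the squared text length."""
    return entropy * text_length ** 2

# Assumed entropy of 4.0 bits for both elements, so that only length differs.
leaf_score = index_value(4.0, 100)     # first element G1, Ls11 = 100
parent_score = index_value(4.0, 300)   # second element A1, Ls21 = 300
```

With equal entropy, tripling the text length multiplies the index value by nine, so length dominates unless the longer text is highly repetitive.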
Specifically, the first computing module includes a first calculating submodule, which is used to calculate the first entropy Es1j of the first element Gj according to the formula $Es_{1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in S1j, and P(Ck1) is the probability that character Ck1 occurs in S1j. In this embodiment, all characters in the text contained in a first element make up the first character string of that element. The frequency with which each character occurs in the string gives the probability of that character. Each probability is multiplied by its logarithm, the products are summed, and the sum is negated to give the first entropy of the element; as the example shows, each distinct character contributes one term. For example, if the text contained in the first element G1 is "今天天气晴。" ("The weather is fine today."), then the first character string S11 of G1 is "今天天气晴", which contains 5 characters in total. The probability that "今" occurs in S11 is 1/5, that "天" occurs is 2/5, that "气" occurs is 1/5, and that "晴" occurs is 1/5, so the first entropy of G1 is $Es_{11} = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{2}{5}\log_2\frac{2}{5} + \frac{1}{5}\log_2\frac{1}{5} + \frac{1}{5}\log_2\frac{1}{5}\right)$.
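The entropy calculation described above can be sketched with `collections.Counter`. The function name `char_entropy` is hypothetical, and the five-character string 今天天气晴 is an assumed reconstruction of the example sentence, chosen because it reproduces the stated probabilities 1/5, 2/5, 1/5, and 1/5. Punctuation is assumed to have been removed before the string is scored, as in the example.

```python
from collections import Counter
from math import log2

def char_entropy(s):
    """Shannon entropy (base 2) of the character distribution of s."""
    n = len(s)
    if n == 0:
        return 0.0
    # One term per distinct character: -P(c) * log2(P(c)).
    return -sum((count / n) * log2(count / n) for count in Counter(s).values())

s11 = "今天天气晴"          # assumed example string: 5 characters, "天" appears twice
es11 = char_entropy(s11)   # about 1.92 bits
```

A string made of one repeated character has entropy 0, which is how the method penalizes repetitive text regardless of its length.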
Specifically, the second computing module includes a second calculating submodule, which is used to calculate the second entropy Es2i of the second element Ai according to the formula $Es_{2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in S2i, and P(Ck2) is the probability that character Ck2 occurs in S2i.
In this embodiment, all characters in the text contained in a second element make up the second character string of that element. The frequency with which each character occurs in the string gives the probability of that character. Each probability is multiplied by its logarithm, the products are summed, and the sum is negated to give the second entropy of the element. For example, if the text contained in the second element A1 is "今天天气晴，但是真很冷。" ("The weather is fine today, but it is really very cold."), then the second character string S21 of A1 is "今天天气晴但是真很冷", which contains 10 characters in total. The probability that "今" occurs in S21 is 1/10, that "天" occurs is 2/10, and that each of "气", "晴", "但", "是", "真", "很", and "冷" occurs is 1/10. The second entropy Es21 of the second element A1 can then be calculated as follows:
$Es_{21} = -\left(\frac{1}{10}\log_2\frac{1}{10} + \frac{2}{10}\log_2\frac{2}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{1}{10}\log_2\frac{1}{10}\right)$.
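Evaluating the two example entropies numerically shows how the index value separates a short text from a longer one. The derivation below assumes, for illustration only, that the five-character and ten-character example strings are the texts of a first element G1 and a second element A1, so that Ls11 = 5 and Ls21 = 10; the patent's own length examples (100 and 300) are separate.

```latex
\begin{aligned}
Es_{11} &= -\left(3 \cdot \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5}\right)
         = \log_2 5 - \tfrac{2}{5} \approx 1.922 \\
I_{11}  &= Es_{11} \cdot (Ls_{11})^2 \approx 1.922 \times 5^2 \approx 48.0 \\
Es_{21} &= -\left(8 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10}\right)
         = \log_2 10 - \tfrac{1}{5} \approx 3.122 \\
I_{21}  &= Es_{21} \cdot (Ls_{21})^2 \approx 3.122 \times 10^2 \approx 312.2
\end{aligned}
```

Under these assumptions the second element scores roughly 6.5 times higher than the first, so it would be chosen as the target element.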
Preferably, in this embodiment, the first extraction unit includes a judging module and a processing module. The judging module is used to judge whether the first element corresponding to a first node is a block element. The processing module is used to extract the first element corresponding to the first node and the second element corresponding to the parent node of the first node when the first element is judged to be a block element. That is, the first element and the second element are extracted only if the first element corresponding to the first node is a block element. Note that the judgment may be made for one first node at a time, or for several first nodes at once, with each judged separately.
In this embodiment, judging whether the first element of a first node is a block element avoids extracting non-block first elements whose text is content such as "text source", "publication time", or "author", which further improves the accuracy of body text extraction from web pages.
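The block-element judgment can be sketched as a membership test on the tag name. The patent does not list which tags count as block elements, so the `BLOCK_TAGS` set below is an assumption based on tags that are conventionally block-level in HTML, and the function names are hypothetical.

```python
# Assumption: tags conventionally rendered as block-level in HTML.
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "li", "main", "nav", "ol", "p", "pre", "section", "table", "ul",
})

def is_block_element(tag):
    """Judging module: True if the tag name is treated as a block element."""
    return tag.lower() in BLOCK_TAGS

def block_leaves(leaves):
    """Processing module: keep only (tag, text) leaves that are block elements."""
    return [(tag, text) for tag, text in leaves if is_block_element(tag)]

leaves = [
    ("span", "Author: staff writer"),
    ("p", "The body paragraph of the article."),
    ("a", "Source"),
]
kept = block_leaves(leaves)   # only the <p> leaf remains
```

Inline elements such as `span` and `a`, which typically carry bylines and source links, are dropped before any entropy is computed.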
As can be seen from the above description, the present invention solves the problem that body text extraction from web pages in the prior art is not accurate enough, and thereby improves the accuracy of body text extraction.
The numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in an actual implementation there may be other ways of dividing, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (8)

  1. A body text extraction method for web pages, characterized by comprising:
    obtaining the HyperText Markup Language (HTML) code of a web page to be extracted, and building a tree structure of the web page to be extracted from the HTML code;
    extracting a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, wherein the first node is a leaf node of the tree structure;
    calculating index values of the first element and the second element, wherein an index value represents the information content of an element, and the larger the index value, the greater the information content of the element;
    obtaining the element corresponding to the maximum index value among the index values as a target element; and
    extracting the text contained in the target element as the body text of the web page to be extracted;
    wherein calculating the index values of the first element and the second element comprises: calculating a first entropy Es1j and a first text length Ls1j of a first element Gj, wherein the first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of the text contained in the first element and reflects the information density of that text, wherein j takes the values 1 to n in turn and n is the number of first nodes; calculating the index value I1j of the first element Gj according to the formula $I_{1j} = Es_{1j} \cdot (Ls_{1j})^2$; calculating a second entropy Es2i and a second text length Ls2i of a second element Ai, wherein the second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of the text contained in the second element and reflects the information density of that text, wherein i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and calculating the index value I2i of the second element Ai according to the formula $I_{2i} = Es_{2i} \cdot (Ls_{2i})^2$.
  2. The body text extraction method according to claim 1, characterized in that the first entropy Es1j of the first element Gj is calculated according to the formula $Es_{1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, wherein S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that character Ck1 occurs in the first character string S1j.
  3. The body text extraction method according to claim 1, characterized in that the second entropy Es2i of the second element Ai is calculated according to the formula $Es_{2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, wherein S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that character Ck2 occurs in the second character string S2i.
  4. The body text extraction method according to claim 1, characterized in that extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node comprises:
    judging whether the first element corresponding to the first node is a block element; and
    when the first element corresponding to the first node is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
  5. A body text extraction device for web pages, characterized by comprising:
    a first acquisition unit, for obtaining the HyperText Markup Language (HTML) code of a web page to be extracted, and building a tree structure of the web page to be extracted from the HTML code;
    a first extraction unit, for extracting a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, wherein the first node is a leaf node of the tree structure;
    a computing unit, for calculating index values of the first element and the second element, wherein an index value represents the information content of an element, and the larger the index value, the greater the information content of the element;
    a second acquisition unit, for obtaining the element corresponding to the maximum index value among the index values as a target element; and
    a second extraction unit, for extracting the text contained in the target element as the body text of the web page to be extracted;
    wherein the computing unit comprises: a first computing module, for calculating a first entropy Es1j and a first text length Ls1j of a first element Gj, wherein the first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of the text contained in the first element and reflects the information density of that text, wherein j takes the values 1 to n in turn and n is the number of first nodes; a second computing module, for calculating the index value I1j of the first element Gj according to the formula $I_{1j} = Es_{1j} \cdot (Ls_{1j})^2$; a third computing module, for calculating a second entropy Es2i and a second text length Ls2i of a second element Ai, wherein the second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of the text contained in the second element and reflects the information density of that text, wherein i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and a fourth computing module, for calculating the index value I2i of the second element Ai according to the formula $I_{2i} = Es_{2i} \cdot (Ls_{2i})^2$.
  6. The body text extraction device according to claim 5, characterized in that the first computing module comprises:
    a first calculating submodule, for calculating the first entropy Es1j of the first element Gj according to the formula $Es_{1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, wherein S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that character Ck1 occurs in the first character string S1j.
  7. The body text extraction device according to claim 5, characterized in that the second computing module comprises:
    a second calculating submodule, for calculating the second entropy Es2i of the second element Ai according to the formula $Es_{2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, wherein S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that character Ck2 occurs in the second character string S2i.
  8. The body text extraction device according to claim 5, characterized in that the first extraction unit comprises:
    a judging module, for judging whether the first element corresponding to the first node is a block element; and
    a processing module, for extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node when the first element corresponding to the first node is judged to be a block element.
CN201410827773.7A 2014-12-25 2014-12-25 The context extraction method and device of Webpage Active CN104484449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410827773.7A CN104484449B (en) 2014-12-25 2014-12-25 The context extraction method and device of Webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410827773.7A CN104484449B (en) 2014-12-25 2014-12-25 The context extraction method and device of Webpage

Publications (2)

Publication Number Publication Date
CN104484449A CN104484449A (en) 2015-04-01
CN104484449B true CN104484449B (en) 2018-02-23

Family

ID=52758990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410827773.7A Active CN104484449B (en) 2014-12-25 2014-12-25 The context extraction method and device of Webpage

Country Status (1)

Country Link
CN (1) CN104484449B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354292A (en) * 2015-10-30 2016-02-24 东莞酷派软件技术有限公司 Page output method and apparatus
CN106354749B (en) * 2016-08-15 2020-06-02 北京小米移动软件有限公司 Information display method and device
CN108874934B (en) * 2018-06-01 2021-11-30 百度在线网络技术(北京)有限公司 Page text extraction method and device
CN110377796B (en) * 2019-07-25 2021-11-02 中南民族大学 Text extraction method, device and equipment based on DOM tree and storage medium
CN117909201B (en) * 2024-03-20 2024-06-11 暗物智能科技(广州)有限公司 Method and device for determining first screen time of page, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN102880707A (en) * 2012-09-27 2013-01-16 广州市动景计算机科技有限公司 Method and device for webpage body content recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages

Also Published As

Publication number Publication date
CN104484449A (en) 2015-04-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Web page text extraction method and web page text extraction device

Effective date of registration: 20190531

Granted publication date: 20180223

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20180223