CN104484449A - Web page text extraction method and web page text extraction device - Google Patents

Publication number
CN104484449A
Authority
CN
China
Prior art keywords
node
text
character string
word
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410827773.7A
Other languages
Chinese (zh)
Other versions
CN104484449B (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201410827773.7A priority Critical patent/CN104484449B/en
Publication of CN104484449A publication Critical patent/CN104484449A/en
Application granted granted Critical
Publication of CN104484449B publication Critical patent/CN104484449B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a web page text extraction method and a web page text extraction device. The method comprises the following steps: acquiring the HTML (HyperText Markup Language) code of a web page to be extracted, and building a tree structure of the web page from the HTML code; extracting a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, wherein the first node is a leaf node of the tree structure; calculating the index values of the first element and the second element, wherein an index value represents the amount of information of the first element or the second element; taking the element with the largest index value as the target element; and extracting the text of the target element as the body text of the web page. The method and the device address the problem that body text extraction from web pages in the prior art is not accurate enough, and thereby improve the accuracy of body text extraction.

Description

Method and device for extracting the body text of a web page
Technical Field
The present invention relates to the field of data processing, and in particular to a method and a device for extracting the body text of a web page.
Background
Web page text, and especially the body text of a web page, is the most important information on the page and is also an important data source for big data analysis.
In the prior art, body text is mostly extracted using text density as the reference index, where text density is the ratio of the text length of an HTML (HyperText Markup Language) element to the length of that element's code. This is unreliable in both directions. An element with high text density is not necessarily body text: elements such as the source, publication time or author may be wrongly extracted as body text. Conversely, an element with low text density is not necessarily non-body text: some body text carries style information and hyperlinks, which lower its text density. Using text density as the only reference index therefore easily leads to extraction errors.
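The weakness of text density can be seen in a minimal Python sketch. The function name `text_density`, the regex-based tag stripping and both sample strings are assumptions made for illustration; they are not taken from the patent or from any particular prior-art system.

```python
import re


def text_density(element_html: str) -> float:
    """Text length divided by markup length for one element's HTML source."""
    if not element_html:
        return 0.0
    # Crude tag stripping; adequate for an illustration, not for real pages.
    text = re.sub(r"<[^>]*>", "", element_html)
    return len(text) / len(element_html)


# Made-up sample markup (not from the patent).
byline = "<p>By Jane Doe, 2014-12-25</p>"
body = '<p>Rain is <a href="/weather/forecast/today">expected</a> <b>tonight</b>.</p>'

print(f"byline density: {text_density(byline):.2f}")
print(f"body density:   {text_density(body):.2f}")
```

In this sample the byline scores a higher density than the body sentence, because the hyperlink and emphasis markup inflate the body's code length. That is the failure mode described above.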
No effective solution has yet been proposed for the problem that body text extraction from web pages in the prior art is not accurate enough.
Summary of the Invention
The main object of the present invention is to provide a method and a device for extracting the body text of a web page, so as to solve the problem that body text extraction in the prior art is not accurate enough.
To achieve this object, according to one aspect of the embodiments of the invention, a method for extracting the body text of a web page is provided. The method comprises: obtaining the HyperText Markup Language (HTML) code of the web page to be extracted, and building a tree structure of the web page from the HTML code; extracting a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; calculating the index values of the first element and the second element, where an index value represents the amount of information of an element; taking the element with the largest index value as the target element; and extracting the text contained in the target element as the body text of the web page.
Further, calculating the index values of the first element and the second element comprises: calculating a first entropy $E_{s1j}$ and a first text length $L_{s1j}$ of the first element $G_j$, where j takes the values 1 to n in turn and n is the number of first nodes; calculating the index value $I_{1j}$ of the first element $G_j$ according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$; calculating a second entropy $E_{s2i}$ and a second text length $L_{s2i}$ of the second element $A_i$, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and calculating the index value $I_{2i}$ of the second element $A_i$ according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$.
Further, the first entropy $E_{s1j}$ of the first element $G_j$ is calculated according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where $S_{1j}$ is the first character string in the first element $G_j$, $C_{k1}$ is a character in the first character string $S_{1j}$, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string $S_{1j}$, and $P(C_{k1})$ is the probability that the character $C_{k1}$ occurs in the first character string $S_{1j}$.
Further, the second entropy $E_{s2i}$ of the second element $A_i$ is calculated according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where $S_{2i}$ is the second character string in the second element $A_i$, $C_{k2}$ is a character in the second character string $S_{2i}$, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string $S_{2i}$, and $P(C_{k2})$ is the probability that the character $C_{k2}$ occurs in the second character string $S_{2i}$.
Further, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node comprises: judging whether the first element corresponding to the first node is a block element; and, when the first element is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
To achieve the above object, according to another aspect of the embodiments of the invention, a device for extracting the body text of a web page is provided. The device comprises: a first acquisition unit, configured to obtain the HyperText Markup Language (HTML) code of the web page to be extracted and to build a tree structure of the web page from the HTML code; a first extraction unit, configured to extract a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure; a computing unit, configured to calculate the index values of the first element and the second element, where an index value represents the amount of information of an element; a second acquisition unit, configured to take the element with the largest index value as the target element; and a second extraction unit, configured to extract the text contained in the target element as the body text of the web page.
Further, the computing unit comprises: a first computing module, configured to calculate a first entropy $E_{s1j}$ and a first text length $L_{s1j}$ of the first element $G_j$, where j takes the values 1 to n in turn and n is the number of first nodes; a second computing module, configured to calculate the index value $I_{1j}$ of the first element $G_j$ according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$; a third computing module, configured to calculate a second entropy $E_{s2i}$ and a second text length $L_{s2i}$ of the second element $A_i$, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and a fourth computing module, configured to calculate the index value $I_{2i}$ of the second element $A_i$ according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$.
Further, the first computing module comprises a first computing submodule, configured to calculate the first entropy $E_{s1j}$ of the first element $G_j$ according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where $S_{1j}$ is the first character string in the first element $G_j$, $C_{k1}$ is a character in the first character string $S_{1j}$, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string $S_{1j}$, and $P(C_{k1})$ is the probability that the character $C_{k1}$ occurs in the first character string $S_{1j}$.
Further, the second computing module comprises a second computing submodule, configured to calculate the second entropy $E_{s2i}$ of the second element $A_i$ according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where $S_{2i}$ is the second character string in the second element $A_i$, $C_{k2}$ is a character in the second character string $S_{2i}$, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string $S_{2i}$, and $P(C_{k2})$ is the probability that the character $C_{k2}$ occurs in the second character string $S_{2i}$.
Further, the first extraction unit comprises: a judging module, configured to judge whether the first element corresponding to the first node is a block element; and a processing module, configured to extract, when the first element is judged to be a block element, the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
According to the embodiments of the invention, the HTML code of the web page to be extracted is obtained and a tree structure of the page is built from that code; the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node are extracted, where the first node is a leaf node of the tree structure; the index values of the first element and the second element are calculated, where an index value represents the amount of information of the first element or the second element; the element with the largest index value is taken as the target element; and the text contained in the target element is extracted as the body text of the web page. Building a tree structure from the page's HTML code makes it possible to identify and extract the elements corresponding to the leaf nodes and to their parent nodes, to calculate the amount of information of each extracted element, and to extract the text of the element with the largest amount of information as the body text. Because the amount of information is the reference index, this approach considers not only the length of the text in an element but also how disordered that text is. Compared with the prior art, which uses text density alone, that is, only the ratio of text length to code length, it solves the problem that body text extraction is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. Moreover, because extraction is based on a tree structure built from the page's HTML code, pages whose text is encoded in different forms need no separate per-page configuration, which reduces resource consumption and speeds up extraction.
Brief Description of the Drawings
The accompanying drawings, which form part of this application, are provided for a further understanding of the invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for extracting the body text of a web page according to an embodiment of the invention;
Fig. 2 is a flowchart of an optional method for extracting the body text of a web page according to an embodiment of the invention; and
Fig. 3 is a schematic diagram of a device for extracting the body text of a web page according to an embodiment of the invention.
Detailed Description
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
It should be noted that the terms "first", "second" and so on in the description, the claims and the drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
The technical terms used in the embodiments of the invention are explained as follows:
A block element, also called a block-level element, is the counterpart of an inline element; both are concepts from the HTML specification. When displayed by a browser, a block-level element usually begins (and ends) with a line break.
Embodiment 1
According to an embodiment of the invention, a method embodiment is provided that can be used to implement the device embodiment of this application. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
According to an embodiment of the invention, a method for extracting the body text of a web page is provided. Fig. 1 is a flowchart of the method. As shown in Fig. 1, the method comprises the following steps S102 to S110:
S102: obtain the HyperText Markup Language (HTML) code of the web page to be extracted, and build a tree structure of the web page from the HTML code. Specifically, this tree structure is based on the Document Object Model (DOM), so it can also be called a DOM tree.
S104: extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure. That is, extract the first element corresponding to each leaf node of the tree structure (the DOM tree) and the second element corresponding to the parent node of that leaf node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of leaf nodes equals the number of second elements.
S106: calculate the index values of the first element and the second element, where an index value represents the amount of information of an element. That is, calculate the amount of information of each first element and of each second element. The larger the index value of an element, the larger its amount of information; the smaller the index value, the smaller its amount of information.
S108: take the element with the largest index value as the target element. That is, find the element with the largest index value among the index values of all first elements and all second elements; this element is the target element.
S110: extract the text contained in the target element as the body text of the web page. That is, the text content of the target element is the body text to be extracted.
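Steps S102 to S110 can be sketched end to end in Python using only the standard library. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` and `TreeBuilder` classes, the function names, the `VOID_TAGS` set and the sample page are all hypothetical, the tree is a simplified stand-in for a real DOM tree, and the index value uses the entropy-times-squared-length formula $I = E \cdot L^2$ given in the summary.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Tags without closing tags; assumed list, used only to keep the tree sane.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # child Nodes and text strings, in document order

    def text(self):
        return "".join(i if isinstance(i, str) else i.text() for i in self.items)

    def is_leaf(self):
        return not any(isinstance(i, Node) for i in self.items)


class TreeBuilder(HTMLParser):
    """S102: build a simplified DOM-like tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip() and self.current.tag not in SKIP_TEXT:
            self.current.items.append(data)


def leaves(node):
    if node.is_leaf():
        yield node
    else:
        for item in node.items:
            if isinstance(item, Node):
                yield from leaves(item)


def index_value(node):
    """S106: entropy of the element's text times the square of its length."""
    s = node.text()
    if not s:
        return 0.0
    n = len(s)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(s).values())
    return entropy * n ** 2


def extract_body_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = []
    for leaf in leaves(builder.root):        # S104: each leaf element ...
        candidates.append(leaf)
        if leaf.parent is not None:
            candidates.append(leaf.parent)   # ... and its parent element
    best = max(candidates, key=index_value)  # S108: largest index value
    return best.text()                       # S110: its text is the body text


# Made-up sample page (not from the patent).
page = """<html><body>
<div><a href="/">Home</a><a href="/news">News</a></div>
<div><p>The weather is fine today, but it is really very cold outside.</p><p>Bring a coat.</p></div>
<div><span>Copyright</span></div>
</body></html>"""
print(extract_body_text(page))
```

In this sample the parent `div` of the two paragraphs wins because it holds the longest text, so both paragraphs are returned together while the navigation links and the footer are left out.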
In the embodiments of the invention, building a tree structure from the HTML code of the page to be extracted makes it possible to identify and extract the elements corresponding to the leaf nodes and to their parent nodes, to calculate the amount of information of each extracted element, and to extract the text of the element with the largest amount of information as the body text of the web page. Because the amount of information is the reference index, this approach considers not only the length of the text in an element but also how disordered that text is. Compared with the prior art, which uses text density alone, that is, only the ratio of text length to code length, it solves the problem that body text extraction is not accurate enough, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. Moreover, because extraction is based on a tree structure built from the page's HTML code, pages whose text is encoded in different forms need no separate per-page configuration, which reduces resource consumption and speeds up extraction.
Specifically, the index values of each first element and each second element can be calculated through steps 1-1 to 1-4, as follows:
Step 1-1: calculate the first entropy $E_{s1j}$ and the first text length $L_{s1j}$ of the first element $G_j$, where j takes the values 1 to n in turn and n is the number of first nodes. That is, calculate the first entropy and the first text length of each first element. The first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of that text, which reflects its information density. For example, if the text in the first element $G_1$ has 100 characters, the text length $L_{s11}$ of $G_1$ is 100.
Step 1-2: calculate the index value $I_{1j}$ of the first element $G_j$ according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$. That is, the index value of each first element is the product of its first entropy and the square of its first text length.
Step 1-3: calculate the second entropy $E_{s2i}$ and the second text length $L_{s2i}$ of the second element $A_i$, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes. That is, calculate the second entropy and the second text length of each second element. The second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of that text, which reflects its information density. For example, if the text in the second element $A_1$ has 300 characters, the text length $L_{s21}$ of $A_1$ is 300.
Step 1-4: calculate the index value $I_{2i}$ of the second element $A_i$ according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$. That is, the index value of each second element is the product of its second entropy and the square of its second text length.
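The effect of squaring the text length in steps 1-2 and 1-4 can be seen in a small Python sketch. The text lengths 100 and 300 are the ones used in the examples above; the function name `index_value` and both entropy values are assumptions chosen only for illustration.

```python
def index_value(entropy_bits: float, text_length: int) -> float:
    """Index value I = E * L**2, as in steps 1-2 and 1-4."""
    return entropy_bits * text_length ** 2


# Lengths 100 and 300 come from the examples above; the entropy values
# are assumed purely for illustration.
leaf_index = index_value(entropy_bits=5.0, text_length=100)
parent_index = index_value(entropy_bits=3.0, text_length=300)
print(leaf_index, parent_index)
```

Even with a noticeably lower assumed entropy, the longer element ends up with the larger index value, because length enters the formula squared.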
Specifically, in the embodiments of the invention, the first entropy $E_{s1j}$ of the first element $G_j$ is calculated according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where $S_{1j}$ is the first character string in the first element $G_j$, $C_{k1}$ is a character in the first character string $S_{1j}$, k takes the values 1 to q in turn, q is the number of distinct characters in the first character string $S_{1j}$, and $P(C_{k1})$ is the probability that the character $C_{k1}$ occurs in $S_{1j}$. All the characters in the text contained in a first element make up the first character string of that element. The frequency with which each character occurs in the string gives the probability of that character. Each probability is multiplied by its logarithm, the products are summed, and the sum is negated; the result is the first entropy of the element. For example, suppose the text in the first element $G_1$ is a Chinese sentence meaning "The weather is fine today." Without its punctuation, the first character string $S_{11}$ has 5 characters. Writing each character as an English gloss, "now" occurs in $S_{11}$ with probability 1/5, "day" with probability 2/5, "air" with probability 1/5 and "fine" with probability 1/5, so the first entropy $E_{s11}$ of $G_1$ is:
$E_{s11} = -\left(\tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right)$.
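The entropy calculation described above can be checked with a short Python sketch. The function name `char_entropy` is hypothetical, and the ASCII string `"abbcd"` is an assumed stand-in with the same frequency profile as the five-character example (one symbol occurring twice, three occurring once).

```python
import math
from collections import Counter


def char_entropy(s: str) -> float:
    """Shannon entropy in bits of the character distribution of s."""
    n = len(s)
    if n == 0:
        return 0.0
    # One term per distinct character: -P(c) * log2(P(c)).
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


sample = "abbcd"  # stand-in: probabilities 1/5, 2/5, 1/5, 1/5
entropy = char_entropy(sample)
print(round(entropy, 3), round(entropy * len(sample) ** 2, 3))
```

The second printed number is the corresponding index value, that is, the entropy multiplied by the square of the string length.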
Specifically, in the embodiments of the invention, the second entropy $E_{s2i}$ of the second element $A_i$ is calculated according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where $S_{2i}$ is the second character string in the second element $A_i$, $C_{k2}$ is a character in the second character string $S_{2i}$, k takes the values 1 to p in turn, p is the number of distinct characters in the second character string $S_{2i}$, and $P(C_{k2})$ is the probability that the character $C_{k2}$ occurs in $S_{2i}$. All the characters in the text contained in a second element make up the second character string of that element. The frequency with which each character occurs in the string gives the probability of that character. Each probability is multiplied by its logarithm, the products are summed, and the sum is negated; the result is the second entropy of the element. For example, suppose the text in the second element $A_1$ is a Chinese sentence meaning "The weather is fine today, but it is really very cold." Without its punctuation, the second character string $S_{21}$ has 10 characters. Writing each character as an English gloss, "day" occurs in $S_{21}$ with probability 2/10, and "now", "air", "fine", "but", "is", "really", "very" and "cold" each occur with probability 1/10, so the second entropy $E_{s21}$ of $A_1$ is:
$E_{s21} = -\left(\tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10} + 7 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10}\right)$, where the first term is for "now", the second for "day", and the remaining seven equal terms are for "air", "fine", "but", "is", "really", "very" and "cold".
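As a worked check, the two example entropies can be simplified and evaluated, and the corresponding index values follow from $I = E \cdot L^2$. The text lengths $L_{s11} = 5$ and $L_{s21} = 10$ are the character counts of the two example strings; the decimal values are rounded.

```latex
\begin{align*}
E_{s11} &= -\left(3 \cdot \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5}\right)
         = \log_2 5 - \tfrac{2}{5} \approx 1.922 \\
E_{s21} &= -\left(8 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10}\right)
         = \log_2 10 - \tfrac{1}{5} \approx 3.122 \\
I_{11} &= E_{s11} \cdot (L_{s11})^2 \approx 1.922 \times 5^2 \approx 48.05 \\
I_{21} &= E_{s21} \cdot (L_{s21})^2 \approx 3.122 \times 10^2 \approx 312.19
\end{align*}
```

With these numbers the longer string has both the higher entropy and the much higher index value, so it would be preferred if the two elements were compared as candidates.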
Preferably, in the embodiments of the invention, extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node comprises: judging whether the first element corresponding to the first node is a block element; and, when it is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node. In other words, the first element and the second element corresponding to the parent node are extracted only when the first element is a block element. It should be noted that this judgment can be made for one first node at a time, or for the elements of several first nodes at once.
In the embodiments of the invention, judging whether the first element of a first node is a block element avoids extracting as body text a first element that contains text but is not a block element, such as the "source", "publication time" or "author" of an article, which further improves the accuracy of body text extraction.
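One simple way to make the block element judgment is a tag-name lookup, sketched below in Python. The patent does not say how the judgment is made, so the `BLOCK_TAGS` set and the function name `is_block_element` are assumptions; the set lists tags that browsers commonly render as block-level by default, and a page's CSS can override that default.

```python
# Assumed list of tags that are block-level by default; CSS can change this.
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "dd", "div", "dl", "dt",
    "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5",
    "h6", "header", "hr", "li", "main", "nav", "ol", "p", "pre", "section",
    "table", "ul",
})


def is_block_element(tag: str) -> bool:
    """Return True if the tag name is in the assumed block-level set."""
    return tag.lower() in BLOCK_TAGS


for tag in ("p", "DIV", "span", "a"):
    print(tag, is_block_element(tag))
```

A byline wrapped in a `span` or `a` element would be filtered out by such a check, while paragraphs and `div` containers would pass.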
In addition, the method for extracting the body text of a web page provided by the embodiments of the invention can also be performed according to the flow shown in Fig. 2. Fig. 2 is a flowchart of an optional method for extracting the body text of a web page according to an embodiment of the invention. As shown in Fig. 2, the method comprises the following steps S202 to S214:
S202: send a request for a web page address, where the web page address is the address of the web page to be extracted. Specifically, the address of the web page to be extracted is sent to a server.
S204: obtain the returned HTML code. Specifically, obtain the HTML code of the web page at that address as returned by the server. This corresponds to obtaining the HTML code of the web page to be extracted in step S102 and is not described again.
S206: build the DOM tree, that is, build the tree structure of the web page. This corresponds to building the tree structure of the web page from the HTML code in step S102 and is not described again.
S207: set the current leaf node to the first leaf node.
S208: obtain the element corresponding to the current leaf node.
S210: judge whether this element is a block element, that is, whether the element obtained in step S208 for the current leaf node is a block element. If it is, perform step S212; if it is not, perform step S211.
S211: set the current leaf node to the next leaf node, then return to step S208.
S212: calculate the amount of information of the element corresponding to this leaf node and the amount of information of the element corresponding to its parent node. This corresponds to step S106: when the element of the leaf node is judged to be a block element, the amount of information is calculated both for that element and for the element of the leaf node's parent node. Then perform step S213.
S213: judge whether all leaf nodes have been traversed. If so, perform step S214; if not, perform step S211.
S214: take the text contained in the element with the largest amount of information as the body text to be extracted. Specifically, all calculated amounts of information are sorted, and the text contained in the element with the largest amount of information is extracted as the body text. This step corresponds to step S110 and is not described again.
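The loop of Fig. 2 (steps S207 to S214) can be sketched in Python over leaf nodes that have already been extracted from the DOM tree. The `Leaf` record, the reduced `BLOCK_TAGS` set, the function names and the sample data are all assumptions made for illustration; fetching the page and building the tree (S202 to S206) are left out.

```python
import math
from collections import Counter
from dataclasses import dataclass

# Reduced, assumed set of block-level tag names.
BLOCK_TAGS = {"div", "p", "li", "blockquote", "pre", "h1", "h2", "h3"}


@dataclass
class Leaf:
    tag: str          # tag name of the leaf element
    text: str         # text contained in the leaf element
    parent_text: str  # text contained in the parent element


def information(text: str) -> float:
    """Entropy of the text times the square of its length."""
    n = len(text)
    if n == 0:
        return 0.0
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(text).values())
    return entropy * n ** 2


def extract(leaves) -> str:
    scored = []
    for leaf in leaves:                     # S207, S211, S213: visit each leaf
        if leaf.tag not in BLOCK_TAGS:      # S210: skip non-block elements
            continue
        scored.append((information(leaf.text), leaf.text))                # S212
        scored.append((information(leaf.parent_text), leaf.parent_text))  # S212
    if not scored:
        return ""
    return max(scored, key=lambda pair: pair[0])[1]   # S214: largest amount


# Made-up sample data (not from the patent).
sample = [
    Leaf("span", "2014-12-25", "Source: example, 2014-12-25"),
    Leaf("p", "The weather is fine today.", "The weather is fine today. Bring a coat."),
    Leaf("p", "Bring a coat.", "The weather is fine today. Bring a coat."),
]
print(extract(sample))
```

Taking the maximum directly gives the same result as sorting all amounts of information and picking the largest, as S214 describes.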
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the order of actions described, because according to the invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the invention.
Embodiment 2
According to an embodiment of the present invention, a web page text extraction device for implementing the above web page text extraction method is also provided. The device is mainly used to perform the text extraction method provided above, and is described in detail below:
Fig. 3 is a schematic diagram of the web page text extraction device according to an embodiment of the present invention. As shown in Fig. 3, the device mainly comprises a first acquiring unit 10, a first extraction unit 20, a computing unit 30, a second acquiring unit 40 and a second extraction unit 50, wherein:
The first acquiring unit 10 is used to obtain the Hypertext Markup Language (HTML) code of the web page to be extracted and to build a tree structure of that web page from the HTML code. Specifically, the tree structure is based on the Document Object Model (DOM), so it may also be called a DOM tree.
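As an illustration of how a tree structure might be built from HTML code, the following is a minimal sketch using the HTMLParser class from Python's standard library. The patent does not name a parser or a language, so this choice is an assumption, and the names Node, TreeBuilder and build_tree are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node: tag name, parent, children and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree while the HTML is being parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag and step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    """Parse an HTML string and return the root of the simplified tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A production system would more likely rely on a full DOM implementation; the sketch only shows the parent and child relationships that the later units depend on.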
The first extraction unit 20 is used to extract the first element corresponding to a first node of the tree structure and the second element corresponding to the parent node of the first node, where the first node is a leaf node of the tree structure. That is, it extracts the first element corresponding to each leaf node of the tree structure (the DOM tree) and the second element corresponding to each leaf node's parent node. The number of leaf nodes equals the number of first elements, and the number of parent nodes of leaf nodes equals the number of second elements.
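The collection of leaf nodes and their parent nodes can be sketched as a simple tree walk. The Node class and the function leaves_and_parents below are hypothetical names introduced for illustration; each parent is recorded once even when it has several leaf children, which is one way of reading the statement that the number of parent nodes equals the number of second elements.

```python
class Node:
    """Hypothetical tree node used only for this illustration."""

    def __init__(self, tag, text="", children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves_and_parents(root):
    """Return (leaf nodes, distinct parents of leaf nodes) in document order."""
    leaves, parents, seen = [], [], set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Push children in reverse so they are visited in document order.
            stack.extend(reversed(node.children))
        else:
            leaves.append(node)
            parent = node.parent
            if parent is not None and id(parent) not in seen:
                seen.add(id(parent))
                parents.append(parent)
    return leaves, parents
```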
The computing unit 30 is used to calculate the index values of the first elements and the second elements, where an index value represents the information content of an element. That is, it calculates the information content of each first element and of each second element: the larger an element's index value, the larger its information content, and vice versa.
The second acquiring unit 40 is used to obtain the element with the largest index value, which is the target element. That is, the element with the largest index value is found among the index values of the first elements and the second elements, and this element is the target element.
The second extraction unit 50 is used to extract the text contained in the target element as the body text of the web page to be extracted; that is, the text content of the target element is the body text to be extracted from the web page.
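The selection performed by the second acquiring unit and the second extraction unit amounts to taking a maximum over the candidate elements. A minimal sketch follows, assuming the index values have already been computed; the function name select_target and the representation of candidates as (text, index value) pairs are hypothetical.

```python
def select_target(candidates):
    """Return the text of the candidate with the largest index value.

    `candidates` is a list of (text, index_value) pairs covering both the
    first elements (leaves) and the second elements (their parents).
    """
    if not candidates:
        return ""
    text, _ = max(candidates, key=lambda pair: pair[1])
    return text
```

The patent does not say how ties are resolved; with Python's max, the first candidate carrying the largest value is returned.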
In this embodiment of the present invention, a tree structure is built from the HTML code of the page to be extracted, so that the elements corresponding to the leaf nodes and to their parent nodes can be determined and extracted. The information content of each extracted element is then calculated, and the text contained in the element with the largest information content is extracted as the body text of the web page. Using information content as the reference index takes into account not only the length of the text in an element but also its degree of disorder. By contrast, the prior art uses text density alone as the reference index, that is, it considers only the ratio of text length to code length. The approach therefore addresses the insufficient accuracy of web page text extraction in the prior art, improves extraction accuracy, and provides a more accurate data basis for subsequent big data analysis. Moreover, because extraction is based on the tree structure built from the page's HTML code, web pages whose text is coded in different ways can be handled without configuring extraction separately for each page, which reduces resource consumption and increases extraction speed.
Specifically, the computing unit 30 comprises a first computing module, a second computing module, a third computing module and a fourth computing module, wherein:
The first computing module is used to calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, where j takes the values 1 to n in turn and n is the number of first nodes. That is, the first entropy and the first text length are calculated for each first element. The first text length is the number of characters in the text contained in the first element, and the first entropy is the entropy of that text, reflecting its information density. For example, if the text contained in the first element G1 has 100 characters, the text length Ls11 of the first element G1 is 100.
The second computing module is used to calculate the index value I1j of the first element Gj according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$; that is, the index value of each first element is the product of its first entropy and the square of its first text length.
The third computing module is used to calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, where i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes. That is, the second entropy and the second text length are calculated for each second element. The second text length is the number of characters in the text contained in the second element, and the second entropy is the entropy of that text, reflecting its information density. For example, if the text contained in the second element A1 has 300 characters, the text length Ls21 of the second element A1 is 300.
The fourth computing module is used to calculate the index value I2i of the second element Ai according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$; that is, the index value of each second element is the product of its second entropy and the square of its second text length.
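The index value described above, entropy multiplied by the square of the text length, can be sketched in a few lines. The function name index_value is hypothetical, and the entropy value 2.0 used in the example is an arbitrary illustrative number, not a figure from the patent; the text lengths 100 and 300 are the ones used in the examples above.

```python
def index_value(entropy, text_length):
    """Index value of one element: entropy times the square of the text length."""
    return entropy * text_length ** 2


# Illustrative values only: a leaf with 100 characters and a parent with
# 300 characters, both assumed to have an entropy of 2.0.
leaf_index = index_value(2.0, 100)
parent_index = index_value(2.0, 300)
print(leaf_index, parent_index)
```

Because the length is squared, an element with three times as much text and the same entropy receives nine times the index value, so length dominates unless the text is highly repetitive.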
Specifically, the first computing module comprises a first calculation submodule, which is used to calculate the first entropy Es1j of the first element Gj according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, where S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j. In this embodiment, all the characters of the text contained in a first element make up that element's first character string. The frequency with which each character occurs in the string gives the probability of that character; each probability is multiplied by its logarithm, the products are summed, and the sum is negated to give the first entropy of the element. For example, suppose the text contained in the first element G1 is the sentence "The weather is fine today." Its first character string S11 is that sentence without the closing punctuation and contains 5 characters in total. The character glossed "present" occurs in S11 with probability 1/5, the character glossed "day" with probability 2/5, the character glossed "air" with probability 1/5, and the character glossed "fine" with probability 1/5. Summing once per distinct character, the first entropy of the first element G1 is $E_{s11} = -\left(\tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right)$.
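The entropy calculation described above can be sketched as follows. The function name char_entropy is hypothetical, and the string "abbcd" is a stand-in chosen only because it has the same character frequencies as the five-character example (1/5, 2/5, 1/5, 1/5); it is not the text from the patent. The sketch assumes punctuation has already been removed, as in the example.

```python
import math
from collections import Counter


def char_entropy(text):
    """Base-2 entropy of the character distribution of `text`."""
    if not text:
        return 0.0
    counts = Counter(text)  # occurrences of each distinct character
    total = len(text)
    # Multiply each probability by its base-2 logarithm, sum, then negate.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Stand-in string: one character occurs twice, three characters occur once.
sample = "abbcd"
print(round(char_entropy(sample), 4))  # prints 1.9219
```

A string that repeats a single character has an entropy of zero, which is how the measure penalises repetitive content regardless of its length.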
Specifically, the second computing module comprises a second calculation submodule, which is used to calculate the second entropy Es2i of the second element Ai according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, where S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
In this embodiment, all the characters of the text contained in a second element make up that element's second character string. The frequency with which each character occurs in the string gives the probability of that character; each probability is multiplied by its logarithm, the products are summed, and the sum is negated to give the second entropy of the element. For example, suppose the text contained in the second element A1 is the sentence "The weather is fine today, but it is really very cold." Its second character string S21 is that sentence without punctuation and contains 10 characters in total. The character glossed "day" occurs in S21 with probability 2/10, and the characters glossed "present", "air", "fine", "but", "is", "really", "very" and "cold" each occur with probability 1/10. The second entropy Es21 of the second element A1 can therefore be calculated according to the following formula:
$E_{s21} = -\left(\tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10}\right)$.
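For reference, the nine terms of this sum can be collected and evaluated numerically. Eight characters have probability 1/10 and one has probability 2/10; the numeric value below is computed here and is not stated in the patent:

$$E_{s21} = -\left(8 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10} + \tfrac{2}{10}\log_2\tfrac{2}{10}\right) = 0.8\log_2 10 + 0.2\log_2 5 \approx 3.12$$

With a text length of 10, the corresponding index value would be approximately $3.12 \cdot 10^2 = 312$.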
Preferably, in this embodiment, the first extraction unit comprises a judging module and a processing module. The judging module is used to judge whether the first element corresponding to the first node is a block element. The processing module is used to extract the first element corresponding to the first node and the second element corresponding to the parent node of the first node when the first element is judged to be a block element. In other words, the first element and the second element are extracted only when the first element corresponding to the first node is judged to be a block element. It should be noted that the judgment may be made for one first node at a time, or for the elements corresponding to several first nodes at once.
In this embodiment, judging whether the first element of a first node is a block element avoids extracting first elements that contain text content but are non-block elements, such as "text source", "publication time" or "author", which further improves the accuracy of web page text extraction.
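The block-element check can be sketched as a lookup on the tag name. The patent does not list which tags count as block elements, so the set BLOCK_TAGS below is an assumption based on common HTML block-level tags, and the function names are hypothetical. The check also ignores CSS, which can change how an element is displayed.

```python
# Assumed set of block-level tag names; not taken from the patent.
BLOCK_TAGS = frozenset({
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "li", "main", "nav", "ol", "p", "pre", "section", "table", "ul",
})


def is_block_element(tag):
    """Return True if the tag name is in the assumed block-level set."""
    return tag.lower() in BLOCK_TAGS


def filter_block_leaves(leaf_tags):
    """Keep only the leaf tags that pass the block-element check."""
    return [tag for tag in leaf_tags if is_block_element(tag)]
```

Leaves that hold short metadata such as an author name are often inline elements like span or a, which this filter would discard before any index value is computed.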
As can be seen from the above description, the present invention addresses the insufficient accuracy of web page text extraction in the prior art and thereby improves the accuracy of web page text extraction.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between units or modules may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also fall within the scope of protection of the present invention.

Claims (10)

1. A web page text extraction method, characterized by comprising:
obtaining the Hypertext Markup Language (HTML) code of a web page to be extracted, and building a tree structure of the web page to be extracted from the HTML code;
extracting a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, wherein the first node is a leaf node of the tree structure;
calculating index values of the first element and the second element, wherein an index value represents the information content of an element;
obtaining the element corresponding to the largest of the index values as a target element; and
extracting the text contained in the target element as the body text of the web page to be extracted.
2. The text extraction method according to claim 1, characterized in that calculating the index values of the first element and the second element comprises:
calculating the first entropy Es1j and the first text length Ls1j of the first element Gj, wherein j takes the values 1 to n in turn and n is the number of first nodes;
calculating the index value I1j of the first element Gj according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$;
calculating the second entropy Es2i and the second text length Ls2i of the second element Ai, wherein i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and
calculating the index value I2i of the second element Ai according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$.
3. The text extraction method according to claim 2, characterized in that the first entropy Es1j of the first element Gj is calculated according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, wherein S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
4. The text extraction method according to claim 2, characterized in that the second entropy Es2i of the second element Ai is calculated according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, wherein S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
5. The text extraction method according to claim 1, characterized in that extracting the first element corresponding to the first node of the tree structure and the second element corresponding to the parent node of the first node comprises:
judging whether the first element corresponding to the first node is a block element; and
when the first element corresponding to the first node is judged to be a block element, extracting the first element corresponding to the first node and the second element corresponding to the parent node of the first node.
6. A web page text extraction device, characterized by comprising:
a first acquiring unit, used to obtain the Hypertext Markup Language (HTML) code of a web page to be extracted and to build a tree structure of the web page to be extracted from the HTML code;
a first extraction unit, used to extract a first element corresponding to a first node of the tree structure and a second element corresponding to the parent node of the first node, wherein the first node is a leaf node of the tree structure;
a computing unit, used to calculate index values of the first element and the second element, wherein an index value represents the information content of an element;
a second acquiring unit, used to obtain the element corresponding to the largest of the index values as a target element; and
a second extraction unit, used to extract the text contained in the target element as the body text of the web page to be extracted.
7. The text extraction device according to claim 6, characterized in that the computing unit comprises:
a first computing module, used to calculate the first entropy Es1j and the first text length Ls1j of the first element Gj, wherein j takes the values 1 to n in turn and n is the number of first nodes;
a second computing module, used to calculate the index value I1j of the first element Gj according to the formula $I_{1j} = E_{s1j} \cdot (L_{s1j})^2$;
a third computing module, used to calculate the second entropy Es2i and the second text length Ls2i of the second element Ai, wherein i takes the values 1 to w in turn and w is the number of parent nodes of the first nodes; and
a fourth computing module, used to calculate the index value I2i of the second element Ai according to the formula $I_{2i} = E_{s2i} \cdot (L_{s2i})^2$.
8. The text extraction device according to claim 7, characterized in that the first computing module comprises:
a first calculation submodule, used to calculate the first entropy Es1j of the first element Gj according to the formula $E_{s1j} = -\sum_{k=1}^{q} P(C_{k1}) \log_2 P(C_{k1})$, wherein S1j is the first character string in the first element Gj, Ck1 is a character in the first character string S1j, k takes the values 1 to q in turn, q is the number of characters in the first character string S1j, and P(Ck1) is the probability that the character Ck1 occurs in the first character string S1j.
9. The text extraction device according to claim 7, characterized in that the second computing module comprises:
a second calculation submodule, used to calculate the second entropy Es2i of the second element Ai according to the formula $E_{s2i} = -\sum_{k=1}^{p} P(C_{k2}) \log_2 P(C_{k2})$, wherein S2i is the second character string in the second element Ai, Ck2 is a character in the second character string S2i, k takes the values 1 to p in turn, p is the number of characters in the second character string S2i, and P(Ck2) is the probability that the character Ck2 occurs in the second character string S2i.
10. The text extraction device according to claim 6, characterized in that the first extraction unit comprises:
a judging module, used to judge whether the first element corresponding to the first node is a block element; and
a processing module, used to extract the first element corresponding to the first node and the second element corresponding to the parent node of the first node when the first element corresponding to the first node is judged to be a block element.
CN201410827773.7A 2014-12-25 2014-12-25 The context extraction method and device of Webpage Active CN104484449B (en)

Publications (2)

Publication Number Publication Date
CN104484449A true CN104484449A (en) 2015-04-01
CN104484449B CN104484449B (en) 2018-02-23

Family

ID=52758990

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Web page text extraction method and web page text extraction device

Effective date of registration: 20190531

Granted publication date: 20180223

Pledgee: Shenzhen Black Horse World Investment Consulting Co., Ltd.

Pledgor: Beijing Guoshuang Technology Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.