CN103491116A - Method and device for processing text-related structural data - Google Patents

Method and device for processing text-related structural data

Info

Publication number
CN103491116A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210192678.5A
Other languages
Chinese (zh)
Inventor
蔡兵
徐羽
彭默
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN201210192678.5A
Publication of CN103491116A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for processing text-related structural data, and belongs to the field of Internet technologies. The method includes: step 1, performing block processing on the nodes in the document object model (DOM) tree of a webpage according to preset candidate block node types to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage; step 2, filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold, to obtain several block nodes; step 3, extracting the text-related structural data of the webpage from the several block nodes; and step 4, displaying the text-related structural data of the webpage. The text-related structural data of the webpage includes at least a title, text information and a body text. With this technical scheme, text-related structural data can be extracted and displayed efficiently.

Description

Method and device for processing text-related structural data
Technical field
Embodiments of the present invention relate to the field of Internet technologies, and in particular to a method and a device for processing text-related structural data.
Background
In the prior art, webpages such as World Wide Web (WWW) webpages are mainly designed for browsers on personal computers (PCs). With the development of technology and the drive of business, webpages have become increasingly complex in recent years and contain more and more content, such as navigation, body text, links, advertisements and JavaScript (JS).
With the rapid development of the mobile Internet and the widespread use of mobile devices such as mobile phones, users can go online with a mobile device anytime and anywhere. As a result, the demand for browsing webpages directly on mobile devices such as mobile phones keeps growing.
In the course of implementing the present invention, the inventors found at least the following problem in the prior art: the pages of complex webpages usually cannot be directly supported by the browser of a mobile device, and objective conditions such as the mobile network and the limited screen of the mobile device add further constraints. This makes displaying webpages on mobile devices difficult, so that users often cannot see the text-related information of a webpage in the browser of a mobile device. A processing scheme for text-related structural data is therefore urgently needed, one that can extract the text-related structural data from a webpage and display it, so that the text-related structural data of the webpage can be displayed in the browser of a mobile device.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a method and a device for processing text-related structural data, and a mobile device. The technical scheme is as follows:
In one aspect, an embodiment of the present invention provides a method for processing text-related structural data, the method comprising:
performing block processing on the nodes in the document object model tree of a webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage;
filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold, to obtain several block nodes;
extracting the text-related structural data of the webpage from the several block nodes, where the text-related structural data of the webpage includes at least a title, text information and a body text; and
displaying the text-related structural data of the webpage.
Optionally, in the method described above, after performing block processing on the nodes in the document object model tree of the webpage according to the preset candidate block node types to obtain several candidate block nodes, and before filtering out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, the method further comprises:
when an adjacent node of a candidate block node in the document object model tree of the webpage is a non-candidate block node, merging the adjacent node into the candidate block node as a child node of the candidate block node; and/or
when the document object model tree of the webpage also contains a non-candidate block node that is not adjacent to any candidate block node, packaging the non-adjacent non-candidate block node as a candidate block node.
Optionally, in the method described above, filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold, to obtain several block nodes, comprises:
for each candidate block node among the several candidate block nodes, determining whether the ratio of (a) the sum of the text length of the candidate block node and the text length of its adjacent nodes to (b) the text length of the parent node of the candidate block node is greater than or equal to a first preset threshold; if so, taking the candidate block node as a block node, the several block nodes being obtained in this way; otherwise, filtering out the candidate block node.
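The ratio test above can be illustrated with a minimal sketch. It follows one reading of the claim, namely that the numerator is the candidate's own text length plus the text lengths of its adjacent nodes and the denominator is the parent's text length. The function name `keep_candidate` and the threshold value 0.6 are assumptions made for illustration; the source does not give a concrete value for the first preset threshold.

```python
from typing import Iterable


def keep_candidate(own_len: int,
                   adjacent_lens: Iterable[int],
                   parent_len: int,
                   first_threshold: float = 0.6) -> bool:
    """Return True if the candidate block node should be kept as a block node.

    own_len       -- text length of the candidate block node
    adjacent_lens -- text lengths of its adjacent (sibling) nodes
    parent_len    -- text length of its parent node
    The default threshold is a placeholder, not a value from the source.
    """
    if parent_len <= 0:
        # A parent without text cannot support the ratio test; filter out.
        return False
    ratio = (own_len + sum(adjacent_lens)) / parent_len
    return ratio >= first_threshold


# A text-heavy candidate is kept, a sparse one is filtered out.
print(keep_candidate(300, [100], 500))
print(keep_candidate(20, [10], 500))
```

Passing plain lengths keeps the sketch independent of any particular DOM library; in practice the lengths would be measured on the nodes of the DOM tree.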
Optionally, in the method described above, after filtering out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, and before extracting the text-related structural data of the webpage from the several block nodes, the method further comprises:
for each block node among the several block nodes, deleting the child nodes of the block node that are unrelated to the text-related structural data of the webpage.
Optionally, in the method described above, after filtering out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, and before extracting the text-related structural data of the webpage from the several block nodes, the method further comprises:
identifying the parent-child relationships of the several block nodes according to their positions in the document object model tree of the webpage;
in which case extracting the text-related structural data of the webpage from the several block nodes comprises: extracting the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes.
Optionally, in the method described above, extracting the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes comprises:
traversing the several block nodes and extracting a title block from them;
extracting a text information block from the several block nodes in combination with their parent-child relationships; and
extracting a body text block from the several block nodes in combination with their parent-child relationships.
Optionally, in the method described above, traversing the several block nodes and extracting a title block from them comprises:
traversing the several block nodes and extracting the blocks that contain an Hn tag; and
determining whether a block that contains an Hn tag contains the page title of the webpage, and, if it does, taking that block as the title block.
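The title-block rule above can be sketched as follows. The `Block` class is a hypothetical stand-in for a block node that records the tags it contains and its text, and reading "Hn tag" as `h1` to `h6` is an assumption; neither is prescribed by the source.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# "Hn tag" is read here as the HTML heading tags h1..h6 (an assumption).
HN_RE = re.compile(r"^h[1-6]$", re.IGNORECASE)


@dataclass
class Block:
    """Hypothetical stand-in for a block node: contained tags plus text."""
    tags: List[str]
    text: str


def find_title_block(blocks: List[Block], page_title: str) -> Optional[Block]:
    """Return the first block that has an Hn tag and contains the page title."""
    title = page_title.strip()
    if not title:
        return None
    for block in blocks:
        has_hn = any(HN_RE.match(tag) for tag in block.tags)
        if has_hn and title in block.text:
            return block
    return None


blocks = [
    Block(["div", "a"], "Home > News"),
    Block(["div", "h1"], "Big Story Today"),
    Block(["div", "h2"], "Related reading"),
]
title_block = find_title_block(blocks, "Big Story Today")
print(title_block.text if title_block else None)
```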
Optionally, in the method described above, extracting a text information block from the several block nodes in combination with their parent-child relationships comprises:
in combination with the parent-child relationships of the several block nodes, obtaining, from the descendant nodes within a preset distance range after the title block among the several block nodes, the text information block that contains text information parameters, where the text information parameters include the publication time, the source and the author;
and extracting a body text block from the several block nodes in combination with their parent-child relationships comprises:
in combination with the parent-child relationships of the several block nodes, obtaining the body text block from the descendant nodes after the title block and the text information block among the several block nodes.
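A rough sketch of the two lookups above, under several simplifying assumptions: the block nodes are flattened into a list of text strings in document order, the preset distance range is counted in blocks, text information is recognised by a date pattern or the English keywords "source" and "author", and the body text block is taken to be the longest remaining block. None of these concrete choices comes from the source.

```python
import re
from typing import List, Optional, Tuple

# Assumed detection patterns for publication time, source and author.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")
INFO_KEYWORDS = ("source", "author")


def looks_like_info(text: str) -> bool:
    """True if the text appears to carry text information parameters."""
    lowered = text.lower()
    return bool(DATE_RE.search(text)) or any(k in lowered for k in INFO_KEYWORDS)


def find_info_and_body(blocks: List[str],
                       title_idx: int,
                       max_distance: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (index of text information block, index of body text block)."""
    info_idx = None
    # Search only within the assumed distance range after the title block.
    stop = min(len(blocks), title_idx + 1 + max_distance)
    for i in range(title_idx + 1, stop):
        if looks_like_info(blocks[i]):
            info_idx = i
            break
    # The body text is looked for after the title and text information blocks.
    start = (info_idx if info_idx is not None else title_idx) + 1
    body_idx = max(range(start, len(blocks)),
                   key=lambda i: len(blocks[i]),
                   default=None)
    return info_idx, body_idx


blocks = [
    "Home > News",
    "Big Story Today",
    "2012-06-12 Source: Example Daily Author: A. Writer",
    "Share",
    "Long body text " * 20,
]
print(find_info_and_body(blocks, title_idx=1))
```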
Optionally, in the method described above, the text-related structural data of the webpage further includes at least one of a secondary navigation block, a font selection block, a page-turning block, a related article block, a microblog sharing block, a copyright statement block and a reply block;
and extracting the text-related structural data of the webpage from the several block nodes further comprises at least one of the following:
in combination with the parent-child relationships of the several block nodes, obtaining, from the ancestor nodes before the title block among the several block nodes, the secondary navigation block, which contains the specific symbol ">" and does not contain a sentence;
in combination with the parent-child relationships of the several block nodes, obtaining, from the descendant nodes located after the text information block among the several block nodes, the font selection block, which contains font selection information;
extracting from the several block nodes the page-turning block, which contains page indication information, where the page indication information includes at least one of previous page, next page and consecutive digits;
extracting from the several block nodes a block that contains a link title and a link uniform resource locator (URL), and, when the similarity between the link title and the page title of the webpage is greater than or equal to a second preset threshold and the similarity between the link URL and the URL of the webpage is greater than or equal to a third preset threshold, determining that the block is the related article block;
extracting from the several block nodes the microblog sharing block, which contains microblog sharing feature information;
extracting from the several block nodes the copyright statement block, which contains copyright statement feature information; and
extracting from the several block nodes the reply block, which contains reply feature information.
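Three of the rules listed above lend themselves to a short sketch: the secondary navigation block, the page-turning block and the related article block. The sentence test (terminal punctuation), the English page keywords, the minimum run of three consecutive numbers, the use of `difflib.SequenceMatcher` as the similarity measure and the threshold values are all assumptions for illustration; the source does not specify how these features are measured.

```python
import re
from difflib import SequenceMatcher

SENTENCE_END = (".", "!", "?", "\u3002")   # assumed markers of a sentence
PAGE_WORDS = ("previous page", "next page")  # assumed page indication words


def is_secondary_nav(text: str) -> bool:
    """Contains the symbol '>' and does not contain a sentence."""
    return ">" in text and not any(mark in text for mark in SENTENCE_END)


def has_consecutive_numbers(text: str, min_run: int = 3) -> bool:
    """True if the text holds a run of numbers such as 1 2 3."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    run = 1
    for prev, cur in zip(nums, nums[1:]):
        run = run + 1 if cur == prev + 1 else 1
        if run >= min_run:
            return True
    return False


def is_pager(text: str) -> bool:
    """Contains previous page, next page or consecutive digits."""
    lowered = text.lower()
    return any(w in lowered for w in PAGE_WORDS) or has_consecutive_numbers(text)


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the source does not name one.
    return SequenceMatcher(None, a, b).ratio()


def is_related_article(link_title: str, link_url: str,
                       page_title: str, page_url: str,
                       second_threshold: float = 0.6,
                       third_threshold: float = 0.6) -> bool:
    """Both the title similarity and the URL similarity must reach their thresholds."""
    return (similarity(link_title, page_title) >= second_threshold
            and similarity(link_url, page_url) >= third_threshold)


print(is_secondary_nav("Home > News > Tech"))
print(is_pager("1 2 3 4 5"))
```

Requiring both similarities to pass mirrors the "and" in the related article rule, so a link with a similar URL but an unrelated title is not treated as a related article.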
In another aspect, an embodiment of the present invention provides a device for processing text-related structural data, the device comprising:
a block processing module, configured to perform block processing on the nodes in the document object model tree of a webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage;
a filtering module, configured to filter out, from the several candidate block nodes, the candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold, to obtain several block nodes;
a data extraction module, configured to extract the text-related structural data of the webpage from the several block nodes, where the text-related structural data of the webpage includes at least a title, text information and a body text; and
a display module, configured to display the text-related structural data of the webpage.
Optionally, the device described above further comprises: an integration module, configured to, after the block processing module performs block processing on the nodes in the document object model tree of the webpage according to the preset candidate block node types to obtain several candidate block nodes, and before the filtering module filters out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, merge an adjacent node of a candidate block node into the candidate block node as a child node when that adjacent node in the document object model tree of the webpage is a non-candidate block node; and/or a packaging module, configured to package a non-candidate block node that is not adjacent to any candidate block node as a candidate block node when the document object model tree of the webpage also contains such a node.
Optionally, in the device described above, the filtering module is specifically configured to, for each candidate block node among the several candidate block nodes, determine whether the ratio of (a) the sum of the text length of the candidate block node and the text length of its adjacent nodes to (b) the text length of the parent node of the candidate block node is greater than or equal to a first preset threshold; if so, take the candidate block node as a block node, the several block nodes being obtained in this way; otherwise, filter out the candidate block node.
Optionally, the device described above further comprises: a deletion module, configured to, after the filtering module filters out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, and before the data extraction module extracts the text-related structural data of the webpage from the several block nodes, delete, for each block node among the several block nodes, the child nodes of the block node that are unrelated to the text-related structural data of the webpage.
Optionally, the device described above further comprises an identification module.
The identification module is configured to, after the filtering module filters out the candidate block nodes whose probability of storing the body text of the webpage is smaller than the preset probability threshold to obtain several block nodes, and before the data extraction module extracts the text-related structural data of the webpage from the several block nodes, identify the parent-child relationships of the several block nodes according to their positions in the document object model tree of the webpage.
The data extraction module is specifically configured to extract the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes.
Optionally, in the device described above, the data extraction module comprises:
a title block extraction unit, configured to traverse the several block nodes and extract a title block from them;
a text information block extraction unit, configured to extract a text information block from the several block nodes in combination with their parent-child relationships; and
a body text block extraction unit, configured to extract a body text block from the several block nodes in combination with their parent-child relationships.
Optionally, in the device described above, the title block extraction unit is specifically configured to traverse the several block nodes and extract the blocks that contain an Hn tag; to determine whether a block that contains an Hn tag contains the page title of the webpage; and, if it does, to take that block as the title block.
Optionally, in the device described above, the text information block extraction unit is specifically configured to, in combination with the parent-child relationships of the several block nodes, obtain, from the descendant nodes within a preset distance range after the title block among the several block nodes, the text information block that contains text information parameters, where the text information parameters include the publication time, the source and the author;
and the body text block extraction unit is specifically configured to, in combination with the parent-child relationships of the several block nodes, obtain the body text block from the descendant nodes after the title block and the text information block among the several block nodes.
Optionally, in the device described above, the text-related structural data of the webpage further includes at least one of a secondary navigation block, a font selection block, a page-turning block, a related article block, a microblog sharing block, a copyright statement block and a reply block;
and the data extraction module further comprises at least one of the following units:
a secondary navigation block extraction unit, configured to, in combination with the parent-child relationships of the several block nodes, obtain, from the ancestor nodes before the title block among the several block nodes, the secondary navigation block, which contains the specific symbol ">" and does not contain a sentence;
a font selection block extraction unit, configured to, in combination with the parent-child relationships of the several block nodes, obtain, from the descendant nodes located after the text information block among the several block nodes, the font selection block, which contains font selection information;
a page-turning block extraction unit, configured to extract from the several block nodes the page-turning block, which contains page indication information, where the page indication information includes at least one of previous page, next page and consecutive digits;
a related article block extraction unit, configured to extract from the several block nodes a block that contains a link title and a link uniform resource locator (URL), and, when the similarity between the link title and the page title of the webpage is greater than or equal to a second preset threshold and the similarity between the link URL and the URL of the webpage is greater than or equal to a third preset threshold, to determine that the block is the related article block;
a microblog sharing block extraction unit, configured to extract from the several block nodes the microblog sharing block, which contains microblog sharing feature information;
a copyright statement block extraction unit, configured to extract from the several block nodes the copyright statement block, which contains copyright statement feature information; and
a reply block extraction unit, configured to extract from the several block nodes the reply block, which contains reply feature information.
The technical scheme provided by the embodiments of the present invention brings the following beneficial effects:
Block processing is performed on the nodes in the document object model tree of a webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage. The candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold are filtered out, to obtain several block nodes. The text-related structural data of the webpage, which in the embodiments of the present invention includes at least a title, text information and a body text, is extracted from the several block nodes and then displayed. By adopting this technical scheme, the embodiments of the present invention remedy the deficiencies of the prior art and provide an efficient way to extract text-related structural data from a webpage and display it. The technical scheme is applicable to extracting and displaying the text-related structural data of any webpage; while effectively extracting the text-related structural data it avoids extracting advertisement blocks, and thereby filters out part of the advertisements in the body text. Furthermore, because the text-related structural data is displayed after it is extracted, the scheme offers the user a clean reading experience that meets the needs of mobile device users.
Brief description of the drawings
To illustrate the technical schemes in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for processing text-related structural data provided by Embodiment 1 of the present invention;
Fig. 2 shows a WWW webpage in the prior art;
Figs. 3A to 3C each show the webpage obtained after the text-related structural data of the WWW webpage shown in Fig. 2 has been processed;
Fig. 4 is a schematic structural diagram of the device for processing text-related structural data provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of the device for processing text-related structural data provided by Embodiment 3 of the present invention.
Detailed description
To make the purpose, technical schemes and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1
Fig. 1 is a flowchart of the method for processing text-related structural data provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method of this embodiment may be executed by a device for processing text-related structural data. The method may include the following steps:
100. Perform block processing on the nodes in the Document Object Model (DOM) tree of the webpage according to preset candidate block node types, to obtain several candidate block nodes.
In this embodiment, the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage. For example, in the prior art the tags that store the body text of a webpage may be DIV tags or TABLE tags, in which case the corresponding node types may be called DIV nodes or TABLE nodes. As technology develops, other node types may also be used to store the body text of webpages in the future, so in the embodiments of the present invention the node types corresponding to the tags used for storing the body text include, but are not limited to, DIV nodes and TABLE nodes.
101. Filter out, from the several candidate block nodes, the candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold, to obtain several block nodes.
102. Extract the text-related structural data of the webpage from the several block nodes.
103. Display the text-related structural data of the webpage.
In this embodiment, the text-related structural data of the webpage includes at least a title, text information and a body text.
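Step 100 can be pictured with a small sketch that walks a DOM-like tree and collects the nodes whose type is in the preset set of candidate types. It uses Python's `xml.etree.ElementTree` on a well-formed snippet purely for illustration; real webpages are often not well-formed XML and would need an HTML parser, and the function name `candidate_block_nodes` is an assumption.

```python
import xml.etree.ElementTree as ET

# Preset candidate block node types; the embodiment names DIV and TABLE
# as examples and does not restrict the set to them.
CANDIDATE_TAGS = {"div", "table"}


def candidate_block_nodes(root: ET.Element) -> list:
    """Collect, in document order, every node whose type is a candidate type."""
    return [el for el in root.iter() if el.tag in CANDIDATE_TAGS]


page = ET.fromstring(
    "<html><body>"
    "<div id='main'><p>Body text</p></div>"
    "<table id='side'><tr><td>Cell</td></tr></table>"
    "<span>Footer</span>"
    "</body></html>"
)
print([el.get("id") for el in candidate_block_nodes(page)])
```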
In the method for processing text-related structural data of this embodiment, block processing is performed on the nodes in the DOM tree of the webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used for storing the body text of the webpage. The candidate block nodes whose probability of storing the body text of the webpage is smaller than a preset probability threshold are filtered out, to obtain several block nodes. The text-related structural data of the webpage, which in this embodiment includes at least a title, text information and a body text, is extracted from the several block nodes and then displayed. By adopting this technical scheme, this embodiment remedies the deficiencies of the prior art and provides an efficient way to extract text-related structural data from a webpage and display it. The technical scheme is applicable to extracting and displaying the text-related structural data of any webpage; while effectively extracting the text-related structural data it avoids extracting advertisement blocks, and thereby filters out part of the advertisements in the body text. Furthermore, because the text-related structural data is displayed after it is extracted, the scheme offers the user a clean reading experience that meets the needs of mobile device users.
It should be noted that the prior art also provides a way of extracting content of interest from a webpage with a wrapper. A wrapper extracts the relevant content from a specific information source according to certain information pattern recognition knowledge and represents it in a particular form. However, because webpage structures are complex and non-standardized, a wrapper can generally be implemented for only one webpage of one information source, and acquiring the information pattern recognition knowledge is itself a time-consuming, manual process. Compared with this prior art, the technical scheme of this embodiment is applicable to all webpages and does not require a different wrapper for each webpage, which effectively saves the cost of using and maintaining wrappers.
Optionally, on the basis of the embodiment shown in Fig. 1, after step 100 ("performing block processing on the nodes in the DOM tree of the webpage to obtain several candidate segmentation nodes") and before step 101 ("filtering out, from the several candidate segmentation nodes, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes"), the method may further include the following steps (1) and/or (2):
(1) When an adjacent node of a candidate segmentation node in the DOM tree of the webpage is a non-candidate segmentation node, the adjacent node is integrated into the candidate segmentation node as its child node. In the present embodiment, the type of a non-candidate segmentation node is any node type other than the node types corresponding to the tags used for storing the text of the webpage.
In the present embodiment, the DOM of the webpage corresponds to the HyperText Markup Language (HTML) tags of the webpage: each node in the DOM tree corresponds to one HTML tag, and the DOM and the HTML tags represent the content of the webpage in different ways. For ease of description, the following embodiments all take the case where the candidate segmentation node is a DIV node or a TABLE node as an example.
An adjacent node in the embodiments of the present invention is a sibling node that belongs to the same parent node. The adjacent nodes of a candidate segmentation node are therefore the nodes that share a parent node with that candidate segmentation node, that is, its sibling nodes.
Still taking the case where the candidate segmentation node is a DIV node or a TABLE node as an example, the block processing separates out the DIV nodes and TABLE nodes in the DOM tree of the webpage. However, the DOM tree may also contain nodes other than DIV nodes and TABLE nodes; for example, an adjacent node of a DIV node or TABLE node may be a P node, an OBJECT node, a SCRIPT node, or a node of another type. In that case, the adjacent node can be integrated into the DIV node or TABLE node as a child node of that DIV node or TABLE node.
(2) When the DOM tree of the webpage also includes a non-candidate segmentation node that is not adjacent to any candidate segmentation node, the non-adjacent non-candidate segmentation node is packaged as a candidate segmentation node.
Step (2) addresses the case where the DOM tree of the webpage also contains nodes of other types, such as P nodes, OBJECT nodes, or SCRIPT nodes, that are not adjacent to any DIV node or TABLE node. Such nodes can be packaged as a candidate segmentation node type, that is, a node type corresponding to a tag used for storing the text of the webpage, such as a DIV node or a TABLE node.
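Steps (1) and (2) can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the `Node` class, the `normalize_children` helper, the choice to attach sibling nodes to the first candidate sibling, and the use of a single DIV wrapper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Candidate types taken from the DIV/TABLE example in the description.
CANDIDATE_TAGS = {"DIV", "TABLE"}


@dataclass
class Node:
    """Hypothetical, simplified DOM node used only for this sketch."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_candidate(node: Node) -> bool:
    return node.tag in CANDIDATE_TAGS


def normalize_children(parent: Node) -> None:
    """Apply steps (1) and (2) to the direct children of `parent`."""
    candidates = [c for c in parent.children if is_candidate(c)]
    others = [c for c in parent.children if not is_candidate(c)]
    if not others:
        return
    if candidates:
        # Step (1): non-candidate siblings become child nodes of a candidate.
        # Assumption: they are attached to the first candidate sibling.
        candidates[0].children.extend(others)
        parent.children = candidates
    else:
        # Step (2): no candidate sibling exists, so the nodes are packaged
        # into a new candidate node (assumed here to be one DIV wrapper).
        parent.children = [Node("DIV", children=others)]
```

After `normalize_children` runs on a parent, every direct child of that parent is a DIV or TABLE node, which matches the state the description expects before filtering.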
After the above block processing, the DOM tree contains only DIV nodes and/or TABLE nodes. When the processed DOM tree contains two node types corresponding to tags used for storing the text of the webpage, for example both DIV nodes and TABLE nodes, one of the two types can be defined as the node type corresponding to the main tag, and the identification of the parent-child relationships of the several blocking nodes in a subsequent step can be carried out with reference to the nodes corresponding to the main tag. For example, the type with the larger number of nodes can be taken as the node type corresponding to the main tag.
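The choice of the main tag by node count can be sketched as follows. The function name `main_label` and the representation of nodes as a list of tag names are assumptions for illustration; the description does not say how ties are resolved, so this sketch simply keeps whichever type `Counter` reports first.

```python
from collections import Counter
from typing import Iterable


def main_label(candidate_tags: Iterable[str]) -> str:
    """Return the candidate node type that occurs most often.

    Raises IndexError when no tags are supplied.
    """
    counts = Counter(candidate_tags)
    return counts.most_common(1)[0][0]
```

For a page with three DIV nodes and one TABLE node, `main_label(["DIV", "TABLE", "DIV", "DIV"])` selects DIV as the main tag.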
Further optionally, on the basis of the embodiment shown in Fig. 1, step 101 ("filtering out, from the several candidate segmentation nodes, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes") may specifically include: for each candidate segmentation node among the several candidate segmentation nodes, determining whether the ratio of the sum of the text length corresponding to the candidate segmentation node and the text lengths corresponding to its adjacent nodes to the text length corresponding to its parent node is greater than or equal to a first preset threshold; when the ratio is greater than or equal to the first preset threshold, taking the candidate segmentation node as a blocking node, so that several blocking nodes are obtained in total; otherwise, when the ratio is less than the first preset threshold, filtering out the candidate segmentation node.
In the present embodiment, when the ratio of the sum of the text length corresponding to the candidate segmentation node and the text lengths corresponding to its adjacent nodes to the text length corresponding to its parent node is less than the first preset threshold, the probability that the candidate segmentation node stores the text of the webpage can be considered less than the preset probability threshold, and the candidate segmentation node can be filtered out. Conversely, when this ratio is greater than or equal to the first preset threshold, the probability that the candidate segmentation node stores the text of the webpage can be considered greater than or equal to the preset probability threshold, and the candidate segmentation node can be retained as a blocking node. The preset probability threshold in the present embodiment thus corresponds to the case where this ratio equals the first preset threshold. The first preset threshold can be chosen from the range 0 to 1 according to actual conditions; for example, it can be 0.65.
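The ratio test described above can be sketched in Python as follows. The `Node` class and the helper names are hypothetical, and the sketch assumes that the "text length corresponding to a node" counts the node's own text plus the text of all its descendants; the description does not define this precisely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical, simplified DOM node used only for this sketch."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def text_len(node: Node) -> int:
    # Assumption: a node's text length includes the text of its descendants.
    return len(node.text) + sum(text_len(c) for c in node.children)


def is_blocking_node(node: Node, parent: Node, threshold: float = 0.65) -> bool:
    """Return True when the candidate should be kept as a blocking node."""
    parent_len = text_len(parent)
    if parent_len == 0:
        return False  # nothing to compare against; treated as filtered out
    siblings = [c for c in parent.children if c is not node]
    ratio = (text_len(node) + sum(text_len(s) for s in siblings)) / parent_len
    return ratio >= threshold
```

The default of 0.65 is the example value given in the description; a candidate for which `is_blocking_node` returns `False` is the one that step 101 filters out.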
For example, step 101 can be implemented by first processing the candidate segmentation nodes located at the bottom layer of the DOM tree: the bottom-layer candidate segmentation nodes are taken out and placed in a queue, and the filtering described above is applied to them. Once that layer has been processed, the candidate segmentation nodes one layer closer to the top of the DOM tree are taken out and filtered. Processing proceeds layer by layer from the bottom of the DOM tree to the top, and from left to right within the same layer, until all candidate segmentation nodes have been processed. In this way the filtering operation is applied to the several candidate segmentation nodes and the several blocking nodes are obtained.
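The bottom-to-top, left-to-right processing order can be sketched with a breadth-first traversal whose layers are then visited in reverse. The `Node` class and the `bottom_up_order` function are illustrative names, not part of the embodiment; the sketch returns every node, so a caller would still skip nodes that are not candidate segmentation nodes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical, simplified DOM node used only for this sketch."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def bottom_up_order(root: Node) -> List[Node]:
    """List nodes from the deepest layer to the root, left to right per layer."""
    levels: List[List[Node]] = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])  # first node seen at this depth
        levels[depth].append(node)
        for child in node.children:
            queue.append((child, depth + 1))
    ordered: List[Node] = []
    for level in reversed(levels):
        ordered.extend(level)
    return ordered
```

Applying the filter of step 101 to the nodes in this order processes each layer completely before moving one layer toward the top of the DOM tree.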
Optionally, on the basis of the embodiment shown in Fig. 1, after step 101 ("filtering out, from the several candidate segmentation nodes, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes") and before step 102 ("extracting the text-related structural data of the webpage from the several blocking nodes"), the processing method of the present embodiment may further include: for each blocking node among the several blocking nodes, deleting the child nodes in the blocking node that are irrelevant to the text-related structural data of the webpage. For example, a DIV node may also contain child nodes such as SCRIPT nodes and OBJECT nodes. It is known from the prior art that these nodes are irrelevant to the text-related structural data of the webpage, so these child nodes can be deleted directly. After the processing of steps 100 and 101, these child nodes are located inside blocking nodes; deleting them directly reduces the content of the blocking nodes and improves the efficiency of the subsequent extraction of the text-related structural data of the webpage.
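A minimal sketch of this deletion step follows. The `Node` class, the `prune_irrelevant` function, and the exact contents of `IRRELEVANT_TAGS` are assumptions; the description names SCRIPT and OBJECT only as examples.

```python
from dataclasses import dataclass, field
from typing import List

# Example node types named in the description; a real list may be longer.
IRRELEVANT_TAGS = {"SCRIPT", "OBJECT"}


@dataclass
class Node:
    """Hypothetical, simplified DOM node used only for this sketch."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def prune_irrelevant(block: Node) -> None:
    """Remove irrelevant child nodes from a blocking node, recursively."""
    block.children = [c for c in block.children if c.tag not in IRRELEVANT_TAGS]
    for child in block.children:
        prune_irrelevant(child)
```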
Further optionally, on the basis of the embodiment shown in Fig. 1, after step 101 ("filtering out, from the several candidate segmentation nodes, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes") and before step 102 ("extracting the text-related structural data of the webpage from the several blocking nodes"), the method further includes: identifying the parent-child relationships of the several blocking nodes according to the positions of the several blocking nodes in the document object model tree of the webpage. For example, according to the depths of the several blocking nodes in the DOM tree of the webpage and the relationships between the blocking nodes (such as parent-child or sibling relationships), the several blocking nodes can be sorted in top-to-bottom order in the DOM tree and their parent-child relationships can be identified. For example, for a blocking node A and a blocking node B among the several blocking nodes, if blocking node B is a descendant node of blocking node A in the DOM tree, and no other blocking node is both a descendant node of blocking node A and an ancestor node of blocking node B, then blocking node B is identified as a child node of blocking node A among the several blocking nodes.
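The rule for blocking nodes A and B amounts to "the parent of a blocking node is its nearest ancestor that is also a blocking node". A minimal sketch, assuming the DOM tree is given as a hypothetical mapping from each node name to its DOM parent name:

```python
from typing import Dict, Optional, Set


def block_parents(
    dom_parent: Dict[str, Optional[str]], blocks: Set[str]
) -> Dict[str, Optional[str]]:
    """Map each blocking node to its nearest blocking-node ancestor, if any."""
    result: Dict[str, Optional[str]] = {}
    for block in blocks:
        ancestor = dom_parent.get(block)
        # Walk upward, skipping DOM nodes that are not blocking nodes.
        while ancestor is not None and ancestor not in blocks:
            ancestor = dom_parent.get(ancestor)
        result[block] = ancestor
    return result
```

With a DOM in which B sits below a non-blocking node that sits below A, the function reports A as the parent of B, because no other blocking node lies between them.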
Correspondingly, step 102 ("extracting the text-related structural data of the webpage from the several blocking nodes") may specifically include: "extracting the text-related structural data of the webpage from the several blocking nodes in combination with the parent-child relationships of the several blocking nodes". This step may specifically include:
(a) traversing the several blocking nodes and extracting the title block from the several blocking nodes;
For example, this may specifically include: traversing the several blocking nodes and extracting, from the several blocking nodes, the blocks that contain an Hn tag; determining whether a block that contains an Hn tag contains the page title of the webpage; and, when a block that contains an Hn tag contains the page title of the webpage, taking that block as the title block.
(b) extracting the text information block from the several blocking nodes in combination with the parent-child relationships of the several blocking nodes;
For example, this may specifically include: in combination with the parent-child relationships of the several blocking nodes, obtaining, from the descendant nodes within a preset distance range after the title block among the several blocking nodes, a text information block that contains text information parameters; the text information parameters include the publication time, the source, and the author.
(c) extracting the text block from the several blocking nodes in combination with the parent-child relationships of the several blocking nodes.
For example, this may specifically include: in combination with the parent-child relationships of the several blocking nodes, obtaining the text block from the descendant nodes located after the title block and the text information block among the several blocking nodes.
When the text is long, there may be multiple text blocks. Multiple text blocks may be located in one blocking node or in several blocking nodes.
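Steps (a) to (c) can be sketched as follows. This is a simplified illustration under several assumptions that are not in the description: the blocking nodes are flattened into a list in document order instead of being walked through their parent-child relationships, the "preset distance range" is counted in blocks (`max_distance`), and the regular expression `INFO_CUES` is a hypothetical set of cues for publication time, source, and author.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

HN_TAG = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)
# Hypothetical cues for the text information parameters.
INFO_CUES = re.compile(r"\d{4}-\d{2}-\d{2}|source\s*:|author\s*:", re.IGNORECASE)


@dataclass
class Block:
    """Hypothetical blocking node: raw HTML plus its visible text."""
    html: str
    text: str


def extract_core_blocks(
    blocks: List[Block], page_title: str, max_distance: int = 2
) -> Tuple[Optional[Block], Optional[Block], List[Block]]:
    """Return (title block, text information block, text blocks)."""
    # (a) a block with an Hn tag that contains the page title
    title_idx = next(
        (i for i, b in enumerate(blocks)
         if HN_TAG.search(b.html) and page_title in b.text),
        None,
    )
    if title_idx is None:
        return None, None, []
    # (b) a block with information cues within the distance range after the title
    window = range(title_idx + 1, min(len(blocks), title_idx + 1 + max_distance))
    info_idx = next((i for i in window if INFO_CUES.search(blocks[i].text)), None)
    # (c) blocks after the title block and the text information block
    start = (info_idx if info_idx is not None else title_idx) + 1
    info = blocks[info_idx] if info_idx is not None else None
    return blocks[title_idx], info, blocks[start:]
```

The third element of the result is a list because, as noted above, a long text may be spread over several text blocks.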
It should be noted that, according to the different needs of mobile device users, the text-related structural data of the webpage in the embodiments of the present invention may further include at least one of a secondary navigation block, a font selection block, a page turning block, a related article block, a microblog sharing block, a copyright statement block, and a reply block. Specifically, each of these blocks is located in one or more blocking nodes.
Correspondingly, step 102 ("extracting the text-related structural data of the webpage from the several blocking nodes") further includes at least one of the following:
(i) in combination with the parent-child relationships of the several blocking nodes, obtaining, from the ancestor nodes located before the title block among the several blocking nodes, a secondary navigation block that contains the specific symbol ">" (written "&gt;" in HTML) and does not contain a sentence;
(ii) in combination with the parent-child relationships of the several blocking nodes, obtaining, from the descendant nodes located after the text information block among the several blocking nodes, a font selection block that contains font selection information;
(iii) extracting, from the several blocking nodes, a page turning block that contains page indication information, where the page indication information includes at least one of "previous page", "next page", and consecutive numbers;
(iv) extracting, from the several blocking nodes, a block that contains a link title and a link uniform resource locator (URL); when the similarity between the link title and the page title of the webpage is greater than or equal to a second preset threshold, and the similarity between the link URL and the URL of the webpage is greater than or equal to a third preset threshold, determining that the block that contains the link title and the link URL is a related article block;
(v) extracting, from the several blocking nodes, a microblog sharing block that contains microblog sharing characteristic information;
(vi) extracting, from the several blocking nodes, a copyright statement block that contains copyright statement characteristic information; and
(vii) extracting, from the several blocking nodes, a reply block that contains reply characteristic information.
Each of these blocks has certain characteristic information; for details, refer to the related art, which is not repeated here.
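Some of the checks in items (i), (iii), and (iv) can be sketched in Python. The description does not specify how "sentence", "consecutive numbers", or "similarity" are computed, so everything below is an assumption: sentences are detected by terminal punctuation, similarity is measured with `difflib.SequenceMatcher`, and the threshold values are placeholders for the second and third preset thresholds.

```python
import re
from difflib import SequenceMatcher

SENTENCE_END = re.compile(r"[.!?。！？]")


def is_secondary_navigation(text: str) -> bool:
    """Item (i): contains '>' and no sentence-ending punctuation."""
    return ">" in text and not SENTENCE_END.search(text)


def has_consecutive_numbers(text: str) -> bool:
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    return any(b == a + 1 for a, b in zip(numbers, numbers[1:]))


def is_page_turning(text: str) -> bool:
    """Item (iii): 'previous page', 'next page', or consecutive numbers."""
    lowered = text.lower()
    return (
        "previous page" in lowered
        or "next page" in lowered
        or has_consecutive_numbers(text)
    )


def similarity(a: str, b: str) -> float:
    # Assumed similarity measure; the description does not name one.
    return SequenceMatcher(None, a, b).ratio()


def is_related_article(
    link_title: str,
    link_url: str,
    page_title: str,
    page_url: str,
    title_threshold: float = 0.3,  # placeholder for the second preset threshold
    url_threshold: float = 0.6,    # placeholder for the third preset threshold
) -> bool:
    """Item (iv): both the title and the URL must be similar enough."""
    return (
        similarity(link_title, page_title) >= title_threshold
        and similarity(link_url, page_url) >= url_threshold
    )
```

For instance, `is_secondary_navigation("Home > News > Tech")` holds, while a line of running text that happens to contain ">" and ends with a full stop is rejected.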
In addition, each of the above blocks has certain position information. The position information of each block can be used to verify the extracted blocks; when verification determines that an extracted block is wrong, the block can be extracted again in the manner described above.
All of the above optional technical schemes can be combined in any feasible manner to form optional embodiments of the present invention, which are not described one by one here.
The method for processing text-related structural data of the above embodiment can make up for the deficiencies of the prior art and provides an efficient technical scheme for extracting text-related structural data from a webpage and displaying it. Moreover, the technical scheme of the present embodiment is applicable to extracting and displaying the text-related structural data of any webpage; while effectively extracting the text-related structural data, it avoids extracting advertisement blocks, thereby filtering out some of the advertisements in the text. Furthermore, after extracting the text-related structural data, the technical scheme of the present embodiment displays it, providing the user with a clean reading experience that meets the needs of mobile device users.
Steps 100 to 102 of the above embodiment extract the text-related structural data, and step 103 displays the text-related structural data. All or some of the method steps of the above embodiment can be implemented by a software program, or by hardware under the instruction of a software program.
For example, Fig. 2 shows a World Wide Web webpage of the prior art. Processing the text-related structural data of the webpage shown in Fig. 2 with the processing method of the above embodiment of the present invention yields the displayed webpage shown in Figs. 3A to 3C.
As can be seen from Fig. 2 and Figs. 3A, 3B, and 3C, after extracting the text-related structural data, the technical scheme of the above embodiment can re-lay out and display the webpage, providing the user with a clean reading experience that meets the needs of mobile device users.
According to the above embodiment, each step can be implemented on the browser of a mobile device: the steps that extract the text-related structural data can be implemented by a plug-in or a tool carried on the browser, and the function of displaying the text-related structural data can be implemented on the browser.
Embodiment 2
Fig. 4 is a schematic structural diagram of the device for processing text-related structural data provided by Embodiment 2 of the present invention. As shown in Fig. 4, the processing device of the present embodiment may specifically include: a block processing module 10, a filtering module 11, a data extraction module 12, and a display module 13.
The block processing module 10 is used to perform block processing on the nodes in the DOM tree of the webpage according to the preset candidate segmentation node types to obtain several candidate segmentation nodes; the type of a candidate segmentation node is a node type corresponding to a tag used for storing the text of the webpage. The filtering module 11 is connected with the block processing module 10 and is used to filter out, from the several candidate segmentation nodes obtained by the block processing module 10, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes. The data extraction module 12 is connected with the filtering module 11 and is used to extract the text-related structural data of the webpage from the several blocking nodes obtained by the filtering module 11; the text-related structural data of the webpage includes at least a title, text information, and the text. The display module 13 is connected with the data extraction module 12 and is used to display the text-related structural data extracted by the data extraction module 12.
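The way the four modules are chained can be sketched as a small pipeline. The class name `ProcessingDevice`, the callable-based wiring, and the loose `Any` types are illustrative assumptions; the embodiment itself does not prescribe how the modules are implemented.

```python
from typing import Any, Callable


class ProcessingDevice:
    """Chains modules 10 to 13 in the order in which they are connected."""

    def __init__(
        self,
        block_processing: Callable[[Any], Any],  # module 10
        filtering: Callable[[Any], Any],         # module 11
        data_extraction: Callable[[Any], Any],   # module 12
        display: Callable[[Any], Any],           # module 13
    ) -> None:
        self.block_processing = block_processing
        self.filtering = filtering
        self.data_extraction = data_extraction
        self.display = display

    def run(self, dom_tree: Any) -> Any:
        candidates = self.block_processing(dom_tree)  # candidate segmentation nodes
        blocks = self.filtering(candidates)           # blocking nodes
        data = self.data_extraction(blocks)           # title, text information, text
        return self.display(data)
```

Because each stage only consumes the output of the previous one, optional modules can be added by wrapping or replacing a single callable.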
The processing device of the present embodiment uses the above modules to process text-related structural data through the same mechanism as the related method embodiment above; for details, refer to the description of that method embodiment, which is not repeated here.
By using the above modules, the processing device of the present embodiment performs block processing on the nodes in the DOM tree of the webpage according to the preset candidate segmentation node types to obtain several candidate segmentation nodes, where the type of a candidate segmentation node is a node type corresponding to a tag used for storing the text of the webpage; filters out, from the several candidate segmentation nodes, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes; extracts the text-related structural data of the webpage from the several blocking nodes, where the text-related structural data in the present embodiment includes at least a title, text information, and the text; and displays the text-related structural data of the webpage. By adopting this technical scheme, the present embodiment can make up for the deficiencies of the prior art and provides an efficient technical scheme for extracting text-related structural data from a webpage and displaying it. Moreover, the technical scheme of the present embodiment is applicable to extracting and displaying the text-related structural data of any webpage; while effectively extracting the text-related structural data, it avoids extracting advertisement blocks, thereby filtering out some of the advertisements in the text. Furthermore, after extracting the text-related structural data, the technical scheme of the present embodiment displays it, providing the user with a clean reading experience that meets the needs of mobile device users.
Embodiment 3
Fig. 5 is a schematic structural diagram of the device for processing text-related structural data provided by Embodiment 3 of the present invention. On the basis of the embodiment shown in Fig. 4, the processing device of the embodiment shown in Fig. 5 may further include the following technical schemes.
As shown in Fig. 5, the processing device of the present embodiment further includes an integration module 14 and/or a packaging module 15; the embodiment shown in Fig. 5 takes the case where both the integration module 14 and the packaging module 15 are included as an example.
The integration module 14 can be connected with the block processing module 10 and the filtering module 11. After the block processing module 10 performs block processing on the nodes in the document object model tree of the webpage according to the preset candidate segmentation node types and obtains several candidate segmentation nodes, and before the filtering module 11 filters out the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain several blocking nodes, the integration module 14 is used to integrate an adjacent node of a candidate segmentation node in the document object model tree of the webpage into that candidate segmentation node as its child node when the adjacent node is a non-candidate segmentation node. The filtering module 11 is then used to filter out, from the several candidate segmentation nodes obtained by the integration module 14, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes.
The packaging module 15 can be connected with the integration module 14. After the integration module 14 has finished processing, the packaging module 15 is used to package, as candidate segmentation nodes, any non-candidate segmentation nodes in the DOM tree of the webpage that are not adjacent to a candidate segmentation node.
In practical application, when the integration module 14 is not included, the packaging module 15 can be connected with the block processing module 10 and the filtering module 11 respectively. In that case, the filtering module 11 is used to filter out, from the several candidate segmentation nodes obtained by the packaging module 15, the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain several blocking nodes.
Optionally, the filtering module 11 in the processing device of the present embodiment is specifically used to determine, for each candidate segmentation node among the several candidate segmentation nodes, whether the ratio of the sum of the text length corresponding to the candidate segmentation node and the text lengths corresponding to its adjacent nodes to the text length corresponding to its parent node is greater than or equal to the first preset threshold; when the ratio is greater than or equal to the first preset threshold, the candidate segmentation node is taken as a blocking node, so that several blocking nodes are obtained in total; otherwise, when the ratio is less than the first preset threshold, the candidate segmentation node is filtered out.
Further optionally, the processing device of the present embodiment may also include a deletion module 16. The deletion module 16 is connected with the filtering module 11. After the filtering module 11 filters out the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold and obtains several blocking nodes, and before the data extraction module 12 extracts the text-related structural data of the webpage from the several blocking nodes, the deletion module 16 is used to delete, for each of the several blocking nodes obtained by the filtering module 11, the child nodes in the blocking node that are irrelevant to the text-related structural data of the webpage. Correspondingly, the data extraction module 12 can be connected with the deletion module 16 and extracts the text-related structural data from the several blocking nodes processed by the deletion module 16.
Further optionally, the processing device of the present embodiment may also include an identification module 17. The identification module 17 is connected with the filtering module 11. After the filtering module 11 filters out the candidate segmentation nodes whose probability of storing the text of the webpage is less than the preset probability threshold and obtains several blocking nodes, and before the data extraction module 12 extracts the text-related structural data of the webpage from the several blocking nodes, the identification module 17 is used to identify the parent-child relationships of the several blocking nodes according to the positions, in the DOM tree of the webpage, of the several blocking nodes obtained by the filtering module 11. As shown in Fig. 5, when the processing device of the present embodiment includes the deletion module 16, the identification module 17 can be connected with the deletion module 16 and is used to identify the parent-child relationships of the several blocking nodes according to the positions, in the DOM tree of the webpage, of the several blocking nodes obtained after the deletion processing of the deletion module 16.
The data extraction module 12 can specifically be connected with the identification module 17 and is used to extract the text-related structural data of the webpage from the several blocking nodes in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17. As shown in Fig. 5, the data extraction module 12 may specifically include a title block extraction unit 121, a text information block extraction unit 122, and a text block extraction unit 123.
The title block extraction unit 121 is connected with the deletion module 16 and is used to traverse the several blocking nodes obtained by the deletion module 16 and extract the title block from them. The text information block extraction unit 122 is connected with the deletion module 16 and the identification module 17 respectively and is used to extract the text information block from the several blocking nodes obtained by the deletion module 16, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17. The text block extraction unit 123 is connected with the deletion module 16 and the identification module 17 respectively and is used to extract the text block from the several blocking nodes obtained by the deletion module 16, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17.
Further optionally, the title block extraction unit 121 is specifically used to traverse the several blocking nodes obtained by the deletion module 16 and extract from them the blocks that contain an Hn tag; to determine whether a block that contains an Hn tag contains the page title of the webpage; and, when a block that contains an Hn tag contains the page title of the webpage, to take that block as the title block.
Further optionally, the text information block extraction unit 122 can also be connected with the title block extraction unit 121. The text information block extraction unit 122 is specifically used to obtain, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17, a text information block that contains text information parameters from the descendant nodes that, among the several blocking nodes obtained by the deletion module 16, lie within a preset distance range after the title block extracted by the title block extraction unit 121; the text information parameters include the publication time, the source, and the author.
Further optionally, the text block extraction unit 123 can also be connected with the title block extraction unit 121 and the text information block extraction unit 122. The text block extraction unit 123 is specifically used to obtain, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17, the text block from the descendant nodes that, among the several blocking nodes obtained by the deletion module 16, lie after the title block extracted by the title block extraction unit 121 and the text information block extracted by the text information block extraction unit 122.
Correspondingly, the title block extraction unit 121, the text information block extraction unit 122, and the text block extraction unit 123 are each also connected with the display module 13. The display module 13 is used to display the title block extracted by the title block extraction unit 121, the text information block extracted by the text information block extraction unit 122, and the text block extracted by the text block extraction unit 123.
Further optionally, the text-related structural data of the webpage in the present embodiment also includes at least one of a secondary navigation block, a font selection block, a page turning block, a related article block, a microblog sharing block, a copyright statement block, and a reply block.
Correspondingly, the data extraction module 12 also includes at least one of the following units (not shown in Fig. 5):
a secondary navigation block extraction unit, used to obtain, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17, from the ancestor nodes located before the title block among the several blocking nodes obtained by the deletion module 16, a secondary navigation block that contains the specific symbol ">" (written "&gt;" in HTML) and does not contain a sentence;
a font selection block extraction unit, used to obtain, in combination with the parent-child relationships of the several blocking nodes identified by the identification module 17, from the descendant nodes located after the text information block among the several blocking nodes obtained by the deletion module 16, a font selection block that contains font selection information;
a page turning block extraction unit, used to extract, from the several blocking nodes obtained by the deletion module 16, a page turning block that contains page indication information, where the page indication information includes at least one of "previous page", "next page", and consecutive numbers;
a related article block extraction unit, used to extract, from the several blocking nodes obtained by the deletion module 16, a block that contains a link title and a link URL, and, when the similarity between the link title and the page title of the webpage is greater than or equal to the second preset threshold and the similarity between the link URL and the URL of the webpage is greater than or equal to the third preset threshold, to determine that the block that contains the link title and the link URL is a related article block;
a microblog sharing block extraction unit, used to extract, from the several blocking nodes obtained by the deletion module 16, a microblog sharing block that contains microblog sharing characteristic information;
a copyright statement block extraction unit, used to extract, from the several blocking nodes obtained by the deletion module 16, a copyright statement block that contains copyright statement characteristic information; and
a reply block extraction unit, used to extract, from the several blocking nodes obtained by the deletion module 16, a reply block that contains reply characteristic information.
The device for processing text-related structural data of this embodiment uses the above modules to process text-related structural data by the same mechanism as the related method embodiments above. For details, refer to the description of those method embodiments, which is not repeated here.
The device for processing text-related structural data of this embodiment is described by taking as an example a configuration that includes all of the optional technical solutions above. In practice, the optional technical solutions above may be combined in any feasible manner to form an optional embodiment of the present invention, and the combinations are not enumerated here one by one.
By implementing the technical solution with the above modules, the device for processing text-related structural data of this embodiment can remedy the deficiencies of the prior art and provides an efficient technical solution for extracting text-related structural data from a webpage and displaying it. Moreover, the technical solution of this embodiment is applicable to extracting and displaying the text-related structural data of any webpage. While effectively extracting the text-related structural data, it avoids extracting advertisement blocks, and thereby filters out some of the advertisements within the text. Furthermore, after extracting the text-related structural data, the technical solution of this embodiment displays that data, which can give the user a clean reading experience and meet the needs of mobile device users.
An embodiment of the present invention may further provide a mobile device on which the device for processing text-related structural data of the embodiment shown in Fig. 4 or Fig. 5 above is arranged. For details, refer to the description of the above embodiments, which is not repeated here.
It should be noted that, when the device for processing text-related structural data provided in the above embodiments extracts data, the division into the functional modules described above is used only as an example. In practice, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the device for processing text-related structural data and the method for processing text-related structural data provided in the above embodiments belong to the same concept. For the specific implementation process, refer to the method embodiments, which is not repeated here.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A method for processing text-related structural data, characterized in that the method comprises:
performing block processing on the nodes in the document object model tree of a webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used to store the text of the webpage;
filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than a preset probability threshold, to obtain several block nodes;
extracting the text-related structural data of the webpage from the several block nodes, where the text-related structural data of the webpage includes at least a title, text information, and a text body; and
displaying the text-related structural data of the webpage.
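By way of non-limiting illustration, the block processing step of claim 1 can be sketched as a traversal that collects every node whose tag belongs to a preset set of candidate types. The `Node` class, the tag set `CANDIDATE_TAGS`, and the function name are assumptions made for this sketch; the claim does not enumerate the candidate types or prescribe a traversal order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal stand-in for a document object model node."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Assumed candidate types; the claim only says they are tags used to store text.
CANDIDATE_TAGS = frozenset({"div", "p", "table", "td", "ul"})


def collect_candidate_blocks(root, candidate_tags=CANDIDATE_TAGS):
    """Return the nodes with a candidate tag, in document (pre-order) order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag in candidate_tags:
            found.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found
```

An iterative traversal is used here only to avoid recursion limits on deeply nested pages; a recursive walk would serve equally well.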
2. The method according to claim 1, characterized in that, after performing block processing on the nodes in the document object model tree of the webpage according to the preset candidate block node types to obtain the several candidate block nodes, and before filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, the method further comprises:
when a node adjacent to a candidate block node in the document object model tree of the webpage is a non-candidate block node, merging the adjacent node into the candidate block node as a child node of the candidate block node; and/or
when the document object model tree of the webpage further includes a non-candidate block node that is not adjacent to any candidate block node, packaging the non-adjacent non-candidate block node as a candidate block node.
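By way of non-limiting illustration, the two operations of claim 2 can be sketched over one list of sibling nodes. The sketch assumes that "adjacent" means the immediately preceding or following sibling and that packaging means wrapping the node in a new node of a candidate type; the `Node` class, `WRAPPER_TAG`, and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal stand-in for a document object model node."""
    tag: str
    children: List["Node"] = field(default_factory=list)


WRAPPER_TAG = "div"  # hypothetical wrapper type; assumed to be a candidate type


def normalize_siblings(siblings, candidate_tags):
    """Merge or wrap non-candidate siblings so only candidate nodes remain."""
    is_candidate = [s.tag in candidate_tags for s in siblings]
    result = []
    for i, node in enumerate(siblings):
        if is_candidate[i]:
            result.append(node)
        elif i > 0 and is_candidate[i - 1]:
            # Adjacent to a preceding candidate: becomes its last child.
            siblings[i - 1].children.append(node)
        elif i + 1 < len(siblings) and is_candidate[i + 1]:
            # Adjacent to a following candidate: becomes its first child.
            siblings[i + 1].children.insert(0, node)
        else:
            # Not adjacent to any candidate: package it as a new candidate node.
            result.append(Node(WRAPPER_TAG, [node]))
    return result
```

Note that the function modifies the candidate nodes in place when it merges neighbours into them.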
3. The method according to claim 1, characterized in that filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold, to obtain the several block nodes, comprises:
for each candidate block node among the several candidate block nodes, determining whether the ratio of the sum of the text length of the candidate block node and the text length of the nodes adjacent to the candidate block node to the text length of the parent node of the candidate block node is greater than or equal to a first preset threshold; when the ratio is greater than or equal to the first preset threshold, taking the candidate block node as a block node, so as to obtain the several block nodes in total; otherwise, when the ratio is less than the first preset threshold, filtering out the candidate block node.
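By way of non-limiting illustration, the ratio test of claim 3 can be sketched as follows. The function name, the parameter names, and the default threshold of 0.5 are assumptions; the claim specifies only that the ratio is compared against a first preset threshold.

```python
def keep_candidate(own_len, adjacent_lens, parent_len, threshold=0.5):
    """Return True if the candidate block node should be kept as a block node.

    own_len:       text length of the candidate block node
    adjacent_lens: text lengths of the nodes adjacent to the candidate
    parent_len:    text length of the candidate's parent node
    """
    if parent_len == 0:
        # A parent without text gives no basis for the ratio; filter out.
        return False
    ratio = (own_len + sum(adjacent_lens)) / parent_len
    return ratio >= threshold
```

For example, a candidate of length 300 with one adjacent node of length 100 under a parent of length 500 gives a ratio of 0.8 and is kept, whereas a candidate of length 20 with an adjacent node of length 10 under the same parent gives 0.06 and is filtered out.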
4. The method according to claim 1, characterized in that, after filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, and before extracting the text-related structural data of the webpage from the several block nodes, the method further comprises:
for each block node among the several block nodes, deleting from the block node the child nodes that are irrelevant to the text-related structural data of the webpage.
5. The method according to any one of claims 1 to 4, characterized in that, after filtering out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, and before extracting the text-related structural data of the webpage from the several block nodes, the method further comprises:
identifying the parent-child relationships of the several block nodes according to the positions of the several block nodes in the document object model tree of the webpage;
and in that extracting the text-related structural data of the webpage from the several block nodes comprises: extracting the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes.
6. The method according to claim 5, characterized in that extracting the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes comprises:
traversing the several block nodes and extracting a title block from the several block nodes;
extracting a text information block from the several block nodes in combination with the parent-child relationships of the several block nodes; and
extracting a text body block from the several block nodes in combination with the parent-child relationships of the several block nodes.
7. The method according to claim 6, characterized in that traversing the several block nodes and extracting the title block from the several block nodes comprises:
traversing the several block nodes and extracting, from the several block nodes, the blocks that contain an Hn tag;
determining whether a block that contains an Hn tag contains the page title of the webpage; and when the block that contains an Hn tag contains the page title of the webpage, taking that block as the title block.
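By way of non-limiting illustration, the title block selection of claim 7 can be sketched as follows. The sketch assumes each block is summarized by the tag names it contains and by its text, that "Hn" denotes the HTML heading tags h1 to h6, and that the first matching block is taken; the `Block` class and the function name are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumes "Hn" denotes the HTML heading tags h1 to h6.
HN_TAG = re.compile(r"h[1-6]", re.IGNORECASE)


@dataclass
class Block:
    """Hypothetical summary of a block node: contained tag names and text."""
    tags: List[str]
    text: str


def find_title_block(blocks, page_title):
    """Return the first block that has an Hn tag and contains the page title."""
    wanted = page_title.strip()
    if not wanted:
        return None
    for block in blocks:
        has_hn = any(HN_TAG.fullmatch(tag) for tag in block.tags)
        if has_hn and wanted in block.text:
            return block
    return None
```

The containment test follows the wording of the claim literally; a practical implementation might need a looser comparison when the page title carries a site name suffix.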
8. The method according to claim 5, characterized in that extracting the text information block from the several block nodes in combination with the parent-child relationships of the several block nodes comprises:
obtaining, in combination with the parent-child relationships of the several block nodes, a text information block containing text information parameters from the descendant nodes within a preset distance after the title block among the several block nodes, where the text information parameters include a publication time, a source, and an author;
and in that extracting the text body block from the several block nodes in combination with the parent-child relationships of the several block nodes comprises:
obtaining, in combination with the parent-child relationships of the several block nodes, the text body block from the descendant nodes located after the title block and the text information block among the several block nodes.
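By way of non-limiting illustration, the search for the text information block in claim 8 can be sketched as follows. The sketch assumes the block nodes have been flattened into a list of texts in document order, that the preset distance is counted in blocks, and that a block qualifies when it shows at least one of the parameters; the date pattern, the keyword list, and the function name are hypothetical.

```python
import re

# Hypothetical patterns for publication time, source, and author.
DATE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}")
KEYWORDS = ("来源", "作者", "source:", "author:")


def find_info_block(block_texts, title_index, max_distance=3):
    """Return the index of the first block after the title, within max_distance
    blocks, whose text shows a publication time, source, or author; else None."""
    start = title_index + 1
    stop = min(start + max_distance, len(block_texts))
    for i in range(start, stop):
        text = block_texts[i].lower()
        if DATE.search(text) or any(key in text for key in KEYWORDS):
            return i
    return None
```

Limiting the search window reflects the preset distance in the claim, so that a date appearing far below the title, for instance in a reply, is not mistaken for the text information.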
9. The method according to claim 5, characterized in that the text-related structural data of the webpage further includes at least one of a secondary navigation block, a font selection block, a page turning block, a related article block, a microblog sharing block, a copyright statement block, and a reply block;
and in that extracting the text-related structural data of the webpage from the several block nodes further comprises at least one of the following:
obtaining, in combination with the parent-child relationships of the several block nodes, the secondary navigation block from the ancestor nodes located before the title block among the several block nodes, where the secondary navigation block contains the specific symbol "&gt;" and does not contain a sentence;
obtaining, in combination with the parent-child relationships of the several block nodes, the font selection block containing font selection information from the descendant nodes located after the text information block among the several block nodes;
extracting, from the several block nodes, the page turning block containing page indication information, where the page indication information includes at least one of previous page, next page, and consecutive digits;
extracting, from the several block nodes, a block containing a link title and a link uniform resource locator (URL), and determining that the block containing the link title and the link URL is the related article block when the similarity between the link title and the page title of the webpage is greater than or equal to a second preset threshold and the similarity between the link URL and the URL of the webpage is greater than or equal to a third preset threshold;
extracting, from the several block nodes, the microblog sharing block containing microblog sharing feature information;
extracting, from the several block nodes, the copyright statement block containing copyright statement feature information; and
extracting, from the several block nodes, the reply block containing reply feature information.
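By way of non-limiting illustration, the related article test of claim 9 can be sketched as follows. The claim does not define the similarity measure; this sketch substitutes the ratio computed by `difflib.SequenceMatcher` from the Python standard library, and the two default thresholds and the function names are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the claim does not fix a measure."""
    return SequenceMatcher(None, a, b).ratio()


def is_related_link(link_title, link_url, page_title, page_url,
                    title_threshold=0.3, url_threshold=0.6):
    """Return True when both similarities reach their preset thresholds."""
    title_ok = similarity(link_title, page_title) >= title_threshold
    url_ok = similarity(link_url, page_url) >= url_threshold
    return title_ok and url_ok
```

Requiring both conditions matches the wording of the claim: a link that shares the URL structure of the current page but has an unrelated title, such as an advertisement served from the same site, is not treated as a related article.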
10. A device for processing text-related structural data, characterized in that the device comprises:
a block processing module, configured to perform block processing on the nodes in the document object model tree of a webpage according to preset candidate block node types, to obtain several candidate block nodes, where the candidate block node types are the node types corresponding to the tags used to store the text of the webpage;
a filtering module, configured to filter out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than a preset probability threshold, to obtain several block nodes;
a data extraction module, configured to extract the text-related structural data of the webpage from the several block nodes, where the text-related structural data of the webpage includes at least a title, text information, and a text body; and
a display module, configured to display the text-related structural data of the webpage.
11. The device according to claim 10, characterized in that the device further comprises:
an integration module, configured to, after the block processing module performs block processing on the nodes in the document object model tree of the webpage according to the preset candidate block node types to obtain the several candidate block nodes, and before the filtering module filters out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, merge a node adjacent to a candidate block node in the document object model tree of the webpage into the candidate block node as a child node of the candidate block node when the adjacent node is a non-candidate block node; and/or a packaging module, configured to, when the document object model tree of the webpage further includes a non-candidate block node that is not adjacent to any candidate block node, package the non-adjacent non-candidate block node as a candidate block node.
12. The device according to claim 10, characterized in that the filtering module is specifically configured to: for each candidate block node among the several candidate block nodes, determine whether the ratio of the sum of the text length of the candidate block node and the text length of the nodes adjacent to the candidate block node to the text length of the parent node of the candidate block node is greater than or equal to a first preset threshold; when the ratio is greater than or equal to the first preset threshold, take the candidate block node as a block node, so as to obtain the several block nodes in total; otherwise, when the ratio is less than the first preset threshold, filter out the candidate block node.
13. The device according to claim 10, characterized in that the device further comprises:
a deletion module, configured to, after the filtering module filters out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, and before the data extraction module extracts the text-related structural data of the webpage from the several block nodes, delete, for each block node among the several block nodes, the child nodes of the block node that are irrelevant to the text-related structural data of the webpage.
14. The device according to any one of claims 10 to 13, characterized in that the device further comprises an identification module;
the identification module is configured to, after the filtering module filters out, from the several candidate block nodes, the candidate block nodes whose probability of storing the text of the webpage is less than the preset probability threshold to obtain the several block nodes, and before the data extraction module extracts the text-related structural data of the webpage from the several block nodes, identify the parent-child relationships of the several block nodes according to the positions of the several block nodes in the document object model tree of the webpage; and
the data extraction module is specifically configured to extract the text-related structural data of the webpage from the several block nodes in combination with the parent-child relationships of the several block nodes.
15. The device according to claim 14, characterized in that the data extraction module comprises:
a title block extraction unit, configured to traverse the several block nodes and extract a title block from the several block nodes;
a text information block extraction unit, configured to extract a text information block from the several block nodes in combination with the parent-child relationships of the several block nodes; and
a text body block extraction unit, configured to extract a text body block from the several block nodes in combination with the parent-child relationships of the several block nodes.
16. The device according to claim 15, characterized in that the title block extraction unit is specifically configured to: traverse the several block nodes and extract, from the several block nodes, the blocks that contain an Hn tag; determine whether a block that contains an Hn tag contains the page title of the webpage; and when the block that contains an Hn tag contains the page title of the webpage, take that block as the title block.
17. The device according to claim 14, characterized in that the text information block extraction unit is specifically configured to obtain, in combination with the parent-child relationships of the several block nodes, a text information block containing text information parameters from the descendant nodes within a preset distance after the title block among the several block nodes, where the text information parameters include a publication time, a source, and an author; and
the text body block extraction unit is specifically configured to obtain, in combination with the parent-child relationships of the several block nodes, the text body block from the descendant nodes located after the title block and the text information block among the several block nodes.
18. The device according to claim 14, characterized in that the text-related structural data of the webpage further includes at least one of a secondary navigation block, a font selection block, a page turning block, a related article block, a microblog sharing block, a copyright statement block, and a reply block;
and in that the data extraction module further comprises at least one of the following units:
a secondary navigation block extraction unit, configured to obtain, in combination with the parent-child relationships of the several block nodes, the secondary navigation block from the ancestor nodes located before the title block among the several block nodes, where the secondary navigation block contains the specific symbol "&gt;" and does not contain a sentence;
a font selection block extraction unit, configured to obtain, in combination with the parent-child relationships of the several block nodes, the font selection block containing font selection information from the descendant nodes located after the text information block among the several block nodes;
a page turning block extraction unit, configured to extract, from the several block nodes, the page turning block containing page indication information, where the page indication information includes at least one of previous page, next page, and consecutive digits;
a related article block extraction unit, configured to extract, from the several block nodes, a block containing a link title and a link uniform resource locator (URL), and to determine that the block containing the link title and the link URL is the related article block when the similarity between the link title and the page title of the webpage is greater than or equal to a second preset threshold and the similarity between the link URL and the URL of the webpage is greater than or equal to a third preset threshold;
a microblog sharing block extraction unit, configured to extract, from the several block nodes, the microblog sharing block containing microblog sharing feature information;
a copyright statement block extraction unit, configured to extract, from the several block nodes, the copyright statement block containing copyright statement feature information; and
a reply block extraction unit, configured to extract, from the several block nodes, the reply block containing reply feature information.
CN201210192678.5A 2012-06-12 2012-06-12 Method and device for processing text-related structural data Pending CN103491116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210192678.5A CN103491116A (en) 2012-06-12 2012-06-12 Method and device for processing text-related structural data

Publications (1)

Publication Number Publication Date
CN103491116A true CN103491116A (en) 2014-01-01

Family

ID=49831073

Country Status (1)

Country Link
CN (1) CN103491116A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140101