CN102314497A - Method and equipment for identifying body contents of markup language files

Info

Publication number: CN102314497A
Application number: CN201110249348A
Authority: CN
Inventors: 李伟刚; 秦玄铮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-08-26
Filing date: 2011-08-26
Publication date: 2012-01-11
Anticipated expiration: 2031-08-26
Also published as: CN102314497B

Abstract

The invention aims to provide a method and equipment for identifying body contents of markup language files. The method comprises the following steps of: acquiring a plurality of markup language files to be processed by using template providing equipment; obtaining one or more groups of markup language files according to relevant information of the markup language files; comparing and analyzing contents of corresponding nodes in each DOM (Document Object Model) tree which corresponds to each markup language file in each group of at least one group of markup language files to obtain a body content node comprising the body contents of the group of markup language files; and obtaining a content marking template for identifying the body contents of the group of markup language files according to the obtained body content node. Compared with the prior art, the invention has the advantages that: body contents are obtained according to structural information of markup language files independent of specific contents in the markup language files, so that the body content identifying accuracy of different types of webpage is ensured.

Description

A method and equipment for identifying the body content of markup language files

Technical field

The present invention relates to Internet technology, and in particular to technology for identifying the body content of markup language files.

Background technology

With the development and widespread use of mobile Internet technology, more and more users access Internet web pages through mobile terminals such as smartphones. Because of the limited screen size of a mobile terminal, however, an HTML web page designed for browsing on a computer must have its content filtered before it is displayed on the mobile terminal's screen, so that only the body content of the web page is kept for the user to browse. In the prior art, the body content of an HTML web page is usually identified by matching keywords against the web page content. Here, body content means the content carried in a web page that distinguishes it from other similar web pages. A news web page, for example, contains a news headline, the news content, links to other news, friendly links, advertisements and so on, but its body content is the news headline and the news content. The shortcoming of this method is that it is not general: its regular expressions must be customized for each specific type of web page, otherwise recognition accuracy drops.

Therefore, how to identify the body content of markup language files such as HTML with a general method has become a problem in urgent need of a solution.

Summary of the invention

The object of the present invention is to provide a method and equipment for identifying the body content of markup language files.

According to one aspect of the present invention, a computer-implemented method for identifying the body content of markup language files is provided, wherein the method comprises the following steps:

a. obtaining a plurality of markup language files to be processed;

b. obtaining one or more groups of markup language files according to relevant information of the plurality of markup language files;

c. for each group of at least one group of markup language files, comparing and analyzing the contents of corresponding nodes in the DOM trees that correspond to the markup language files in the group, to obtain a body content node comprising the body content of that group of markup language files;

d. obtaining, according to the obtained body content node, a content identification template for identifying the body content of that group of markup language files.

According to another aspect of the present invention, equipment for identifying the body content of markup language files is also provided, wherein the equipment comprises:

a file acquisition device, configured to obtain a plurality of markup language files to be processed;

a first acquisition device, configured to obtain one or more groups of markup language files according to relevant information of the plurality of markup language files;

a comparative analysis device, configured to compare and analyze, for each group of at least one group of markup language files, the contents of corresponding nodes in the DOM trees that correspond to the markup language files in the group, to obtain a body content node comprising the body content of that group of markup language files;

a template acquisition device, configured to obtain, according to the obtained body content node, a content identification template for identifying the body content of that group of markup language files.

As described above, compared with the prior art, the present invention provides a general method for obtaining a content identification template that identifies the body content of a given class of markup language files. The method does not depend on the specific content of a markup language file; it obtains the body content from the structural information of the file. The content identification template is then applied to extract the body content of markup language files of that class, thereby ensuring the accuracy of body content identification across different types of web pages.

Description of drawings

Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:

Fig. 1 is a schematic diagram of equipment for identifying the body content of markup language files according to one aspect of the present invention;

Fig. 2 is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3 is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3A is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3B is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 4 is a schematic diagram of equipment for identifying the body content of markup language files according to a preferred embodiment of the present invention;

Fig. 5 is a flow chart of a method for identifying the body content of markup language files according to another aspect of the present invention;

Fig. 6 is a flow chart of a method for identifying the body content of markup language files according to a preferred embodiment of the present invention.

The same or similar reference numerals in the drawings denote the same or similar components.

Embodiment

The present invention is described in further detail below with reference to the drawings.

Fig. 1 is a schematic diagram of equipment for identifying the body content of markup language files according to one aspect of the present invention. Template providing equipment 1 comprises file acquisition device 11, first acquisition device 12, comparative analysis device 13 and template acquisition device 14. Here, template providing equipment 1 includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. A cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.

As shown in Fig. 1, file acquisition device 11 obtains a plurality of markup language files to be processed.

Specifically, file acquisition device 11 obtains, according to a predetermined file acquisition rule, a plurality of markup language files corresponding to Internet web pages from the web page library of template providing equipment 1, where the predetermined file acquisition rule includes but is not limited to:

1) obtaining the markup language files corresponding to web pages whose historical click volume exceeds a certain click threshold;

2) obtaining the markup language files corresponding to web pages whose cumulative number of visits made through mobile terminals exceeds a predetermined number.

Here, the web page library stores the markup language files corresponding to web pages together with the historical access information of those web pages. The web page library includes but is not limited to a relational database, a memory store, a disk store and the like.
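By way of illustration only, the two acquisition rules above can be sketched as a filter over web page library records. The record layout (`PageRecord`), the function name `select_pages` and the threshold parameters are assumptions made for this sketch and are not specified by the present description.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical web page library entry: the file plus its access history."""
    url: str
    html: str
    historical_clicks: int  # total historical click volume
    mobile_visits: int      # cumulative visits made through mobile terminals


def select_pages(library, click_threshold=None, mobile_visit_threshold=None):
    """Return the records that satisfy at least one enabled acquisition rule."""
    selected = []
    for rec in library:
        by_clicks = (click_threshold is not None
                     and rec.historical_clicks > click_threshold)
        by_mobile = (mobile_visit_threshold is not None
                     and rec.mobile_visits > mobile_visit_threshold)
        if by_clicks or by_mobile:
            selected.append(rec)
    return selected
```

A rule is disabled by leaving its threshold as `None`, so either rule can be used alone or the two can be combined.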

Alternatively, file acquisition device 11 reads the plurality of markup language files from third-party equipment through an agreed communication mode, either periodically, when triggered by a predetermined condition or event, or directly.

Here, a markup language is a computer text encoding that combines text with other information related to the text and expresses details about document structure and data processing. The markup language files include but are not limited to:

- HyperText Markup Language (HTML) files;

- Extensible HyperText Markup Language (XHTML) files;

- Extensible Markup Language (XML) files.

In one example, file acquisition device 11 performs statistical analysis on the web page information in the web page library of template providing equipment 1 to obtain the number of times each web page has been accessed by users through mobile terminals, and accordingly obtains the HTML files of the web pages whose count exceeds a predetermined visit quantity. The predetermined visit quantity varies with actual demand and the specific application. For example, in an application with fewer users it may be tens of thousands to hundreds of thousands, and in an application with more users it may be hundreds of thousands to millions. Those skilled in the art can determine it according to actual demand and the specific application.

In another example, file acquisition device 11 periodically calls a preset application programming interface (API) to send third-party equipment a request for markup language files, and receives the plurality of markup language files that the third-party equipment returns in response to the request.

Those skilled in the art will understand that the above ways of obtaining a plurality of markup language files are only examples. Other existing or future ways of obtaining a plurality of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Subsequently, first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files obtained by file acquisition device 11.

Specifically, first acquisition device 12 takes the plurality of markup language files obtained by file acquisition device 11 and, for example, obtains the relevant information of those files and clusters them accordingly to obtain one or more groups of markup language files. Alternatively, it obtains the relevant information of only some of the markup language files and clusters that subset to obtain one or more groups of markup language files. The relevant information of the plurality of markup language files includes but is not limited to:

1) relevant information of the Document Object Model (DOM) tree of a markup language file. Here, a DOM tree is the tree-structured data obtained by parsing a markup language file; each node in the tree corresponds to a tag and its content in the markup language file, and the data in the markup language file can be manipulated through the DOM tree. The relevant information of the plurality of markup language files includes but is not limited to:

a) relevant information of the DOM trees corresponding to the plurality of markup language files. Specifically, when the relevant information of the plurality of markup language files comprises the relevant information of their corresponding DOM trees, first acquisition device 12 can cluster the files according to the relevant information of the DOM trees to obtain one or more groups of markup language files. The relevant information of a DOM tree includes but is not limited to:

i) the number of nodes of the DOM tree. Specifically, when the relevant information of a DOM tree comprises its number of nodes, first acquisition device 12 can cluster the plurality of markup language files according to the number of nodes, for example by clustering files that have the same number of nodes, or whose number of nodes falls within the same predetermined interval, into the same group of markup language files;

ii) the topology information of the DOM tree. Specifically, when the relevant information of a DOM tree comprises its topology information, where the topology information includes but is not limited to the distribution of the tree nodes in the DOM tree, first acquisition device 12 clusters markup language files that have the same tree node distribution into the same group.

Those skilled in the art will understand that each item of DOM tree relevant information above can be used alone by first acquisition device 12 to obtain one or more groups of markup language files, or several items can be combined for that purpose.

Those skilled in the art will also understand that the above relevant information of a DOM tree is only an example. Other existing or future relevant information of DOM trees, if applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference.
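As a minimal sketch of clustering by number of nodes, the following counts the elements of each HTML file with Python's standard `html.parser` and groups files whose counts fall in the same interval. The fixed-width interval (`interval`) and the names `ElementCounter`, `node_count` and `cluster_by_node_count` are assumptions of this sketch; the description only requires that counts be equal or lie in the same predetermined interval.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Counts element nodes; self-closing tags are counted once."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def node_count(html):
    counter = ElementCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def cluster_by_node_count(files, interval=10):
    """files: dict of name -> HTML text. Returns {interval index: [names]}."""
    groups = defaultdict(list)
    for name, html in files.items():
        groups[node_count(html) // interval].append(name)
    return dict(groups)
```

Text and comment nodes are ignored here for simplicity; a fuller treatment could count them as DOM nodes as well.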

b) resource information in the plurality of markup language files. Specifically, the relevant information of the plurality of markup language files may comprise the resource information in those files, where the resource information includes but is not limited to:

i) link information in the markup language files, including but not limited to the number of links in the files and the similarity of their anchor texts;

ii) picture information in the markup language files, including but not limited to the number of pictures in the files and the similarity of their picture names and descriptive information.

In that case, first acquisition device 12 can cluster the plurality of markup language files according to this resource information to obtain one or more groups of markup language files.

Those skilled in the art will understand that each item of markup language file relevant information above can be used alone by first acquisition device 12 to obtain one or more groups of markup language files, or several items can be combined for that purpose.

Those skilled in the art will also understand that the above relevant information of markup language files is only an example. Other existing or future relevant information of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference.

In one example, first acquisition device 12 parses a plurality of HTML files, generates the DOM tree corresponding to each of them, and then clusters the HTML files according to the topology information of each DOM tree. The topology information of a DOM tree includes but is not limited to the distribution of its tree nodes.

Taking Fig. 2 and Fig. 3 as an example, the DOM trees of some of the HTML files clustered by first acquisition device 12 have the topology shown in Fig. 2, and the DOM trees of the other HTML files have the topology shown in Fig. 3. First acquisition device 12 thus obtains two groups of HTML files, group G1 and group G2, where the HTML files in G1 have the topology shown in Fig. 2 and the HTML files in G2 have the topology shown in Fig. 3. Preferably, the DOM tree topologies of the HTML files clustered into one group need not be fully identical; it suffices that the distribution of the backbone nodes of their DOM trees is the same. For example, DOM tree T1 corresponding to HTML file F1 is shown in Fig. 3A and DOM tree T2 corresponding to HTML file F2 is shown in Fig. 3B. As the figures show, T1 and T2 both have the DOM tree topology shown in Fig. 3, so F1 and F2 are clustered into group G2.
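The backbone-based grouping described above can be sketched as follows. This is an illustrative reading only: the description does not define "backbone nodes", so the sketch assumes they are the elements down to a fixed depth (`max_depth`), and it groups files whose sets of shallow tag paths are identical. All names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore never have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BackboneParser(HTMLParser):
    """Collects the tag paths of all elements down to max_depth."""

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        path = self.stack + [tag]
        if len(path) <= self.max_depth:
            self.paths.add("/" + "/".join(path))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def backbone_signature(html, max_depth=3):
    parser = BackboneParser(max_depth)
    parser.feed(html)
    parser.close()
    return frozenset(parser.paths)


def cluster_by_backbone(files, max_depth=3):
    """files: dict of name -> HTML text. Returns a list of groups of names."""
    groups = defaultdict(list)
    for name, html in files.items():
        groups[backbone_signature(html, max_depth)].append(name)
    return list(groups.values())
```

Two files whose trees differ only below `max_depth` receive the same signature and land in the same group, which mirrors F1 and F2 both being clustered into G2.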

In another example, first acquisition device 12 counts the <a> tags in each of a plurality of HTML files to obtain the number of hypertext links in each HTML file, and clusters the HTML files accordingly. Preferably, the similarity of the anchor text of the hypertext links can also be taken into account when clustering the HTML files, to obtain several groups of HTML files in which the files of each group have the same number of hypertext links and an anchor text similarity that exceeds a predetermined similarity threshold.
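A minimal sketch of this link-based clustering is given below. The description does not say how anchor text similarity is measured, so Jaccard similarity over the sets of anchor texts is used here as an assumed stand-in, and a greedy comparison against the first member of each group replaces a full clustering algorithm. All names and the default threshold are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the text inside each <a> element."""

    def __init__(self):
        super().__init__()
        self.in_anchor = 0
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor += 1
            self.anchors.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.in_anchor -= 1

    def handle_data(self, data):
        if self.in_anchor:
            self.anchors[-1] += data


def anchor_texts(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.anchors]


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_by_links(files, similarity_threshold=0.5):
    """Group files with equal link counts and sufficiently similar anchor text."""
    groups = []  # list of (representative anchor texts, [file names])
    for name, html in files.items():
        anchors = anchor_texts(html)
        for representative, members in groups:
            if (len(representative) == len(anchors)
                    and jaccard(representative, anchors) > similarity_threshold):
                members.append(name)
                break
        else:
            groups.append((anchors, [name]))
    return [members for _, members in groups]
```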

Those skilled in the art will understand that the above ways of obtaining groups of markup language files are only examples. Other existing or future ways of obtaining groups of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Then, for each group of at least one group of markup language files, comparative analysis device 13 compares and analyzes the contents of corresponding nodes in the DOM trees that correspond to the markup language files in the group, to obtain the body content node comprising the body content of that group of markup language files.

Specifically, comparative analysis device 13 takes at least one group from the one or more groups of markup language files obtained by first acquisition device 12, for example takes the markup language files of each group in turn, parses those files to obtain their corresponding DOM trees, and compares and analyzes the contents of corresponding nodes and their subtree nodes across the DOM trees to obtain the body content node comprising the body content of that group. The methods of comparative analysis include but are not limited to:

1) according to the number of characters of non-link text in the content of the corresponding node and its subtree nodes in each DOM tree: if, in more than a preset proportion of the DOM trees, the number of characters of non-link text in the content of the corresponding node and its subtree nodes exceeds a certain character count threshold, comparative analysis device 13 judges that the node is a body content node comprising body content;

2) according to the proportion of the total display space occupied by the content of the corresponding node in each DOM tree when displayed: if, in more than a preset proportion of the DOM trees, the proportion of display space occupied by the content of the corresponding node exceeds a certain proportion threshold, comparative analysis device 13 judges that the node is a body content node comprising body content;

3) according to the similarity of the content of the corresponding node and its subtree nodes across the DOM trees: if the contents of the corresponding node and its subtree nodes in the DOM trees have a mutual similarity below a certain similarity threshold, comparative analysis device 13 judges that the node is a body content node comprising body content.

In one example, comparative analysis device 13 obtains a group of HTML files and parses two HTML files in the group to obtain two DOM trees T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B.

Then, comparative analysis device 13 traverses the two DOM trees and compares and analyzes the contents of corresponding nodes and their subtree nodes. For instance, it obtains the number of characters in the content of node N4 and its subtree nodes N6 and N7 in T3, say 2500, and the number of characters in the content of the corresponding node N4' and its subtree node N6' in T4, say 2000. Both character counts exceed the predetermined character count threshold of 1500, so comparative analysis device 13 takes this node as a body content node comprising the body content of this group of HTML files.
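The character-count test in this example can be sketched as below. The `Node` class is a deliberately minimal stand-in for a DOM node, and the names `non_link_chars` and `is_body_node` and the preset tree proportion of 0.8 are assumptions of the sketch; only the threshold of 1500 and the counts 2500 and 2000 come from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a DOM node: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def non_link_chars(node):
    """Characters of text in the node and its subtree, excluding <a> subtrees."""
    if node.tag == "a":
        return 0
    return len(node.text) + sum(non_link_chars(c) for c in node.children)


def is_body_node(corresponding_nodes, char_threshold=1500, min_tree_ratio=0.8):
    """True if the node exceeds the threshold in more than min_tree_ratio of trees."""
    hits = sum(1 for n in corresponding_nodes
               if non_link_chars(n) > char_threshold)
    return hits / len(corresponding_nodes) > min_tree_ratio
```

With N4 holding 2500 non-link characters in T3 and N4' holding 2000 in T4, both trees pass the 1500-character threshold and the node is accepted as a body content node.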

In another example, comparative analysis device 13 obtains a group of HTML files and parses two HTML files in the group to obtain two DOM trees T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B. Then, comparative analysis device 13 traverses the two DOM trees and compares and analyzes the contents of corresponding nodes and their subtree nodes. For instance, it obtains the height and width set for displaying the content of node N3 in T3, together with the height and width at which the web page corresponding to the HTML file is displayed, and from these determines that the content of this node occupies 30% of the display space of the web page. In the same way, it determines that the content of the corresponding node N3' in T4 occupies 35% of the display space. Both proportions exceed the predetermined proportion threshold of 20%, so comparative analysis device 13 takes this node as a body content node comprising the body content of this group of HTML files.
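The display-space test can be illustrated with a short sketch. It assumes that the occupied proportion is the node's displayed area divided by the page's displayed area, which is one plausible reading of the height and width comparison above; the function names and the preset tree proportion of 0.8 are likewise assumptions.

```python
def display_ratio(node_width, node_height, page_width, page_height):
    """Fraction of the page's display area occupied by the node's content."""
    return (node_width * node_height) / (page_width * page_height)


def is_body_by_area(ratios, ratio_threshold=0.20, min_tree_ratio=0.8):
    """True if the ratio exceeds the threshold in more than min_tree_ratio of trees."""
    hits = sum(1 for r in ratios if r > ratio_threshold)
    return hits / len(ratios) > min_tree_ratio
```

For instance, a 600 x 500 block on a 1000 x 1000 page gives a ratio of 0.30, and ratios of 0.30 and 0.35 both exceed the 20% threshold of the example.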

Those skilled in the art will understand that the above ways of comparative analysis are only examples. Other existing or future ways of comparative analysis, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

It should be noted that the numerical values in the above examples are illustrative only. They are given to help the reader understand the present invention, are not real data from a practical application, and should not be regarded as limiting the scope of protection claimed by the present invention. Unless otherwise stated, numerical values appearing elsewhere herein serve the same purpose, which for brevity is not repeated.

It should also be noted that the specific DOM trees corresponding to markup language files in the above examples are illustrative only. They are given to aid understanding of the present invention, are not real DOM trees from a practical application, and should not be regarded as limiting the scope of protection claimed by the present invention. Unless otherwise stated, DOM trees appearing elsewhere herein serve the same purpose, which for brevity is not repeated.

Subsequently, template acquisition device 14 obtains, according to the obtained body content node, the content identification template for identifying the body content of that group of markup language files.

Specifically, template acquisition device 14 takes each body content node, obtained by comparative analysis device 13, that comprises the body content of the group of markup language files, and writes into the content identification template corresponding to that group either, for example, the number assigned to the body content node in the DOM tree under an agreed numbering rule, or the path information of the body content node in the DOM tree. The path information may be, for example, an XPath, where XPath is a path expression through which the corresponding tree node can be located in a DOM tree. Here, the content identification template describes the information of each body content node comprising body content. It can be stored as a template file in a file system, or as a data table in a relational database.

In one example, as shown in Fig. 3A, comparative analysis device 13 determines that the body content nodes comprising the body content of a certain group of markup language files are N1, N4 and N5, and the numbering rule for body content nodes numbers the tree nodes in the DOM tree from top to bottom and from left to right. Template acquisition device 14 therefore determines, according to this numbering rule, that the numbers corresponding to N1, N4 and N5 are 1, 4 and 5 respectively, and writes them into the content identification template file.
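The top-to-bottom, left-to-right numbering rule corresponds to a breadth-first traversal, sketched below. Fig. 3A is not reproduced here, so the tree shape used in the usage note is an assumption chosen to agree with the node names and paths mentioned in the text; the `Node` class and `number_nodes` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a DOM tree node."""
    name: str
    children: list = field(default_factory=list)


def number_nodes(root):
    """Number nodes level by level, left to right, starting at 0 for the root."""
    numbers, queue = {}, deque([root])
    while queue:
        node = queue.popleft()
        numbers[node.name] = len(numbers)
        queue.extend(node.children)
    return numbers
```

On an assumed tree R0 -> (N1 -> N3), (N2 -> (N4 -> N6, N7), N5), this numbering assigns 1, 4 and 5 to N1, N4 and N5, matching the numbers written to the template file in the example.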

In another example, as shown in Fig. 3A, comparative analysis device 13 determines that the body content nodes comprising the body content of a certain group of markup language files are N3 and N4. Template acquisition device 14 therefore obtains the XPath of each of those body content nodes in the DOM tree: the XPath of N3 is "/R0/N1/N3" and the XPath of N4 is "/R0/N2/N4". It writes those XPaths into the relational database that holds the content identification template corresponding to this group of markup language files.
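A minimal sketch of this path-based template follows, using Python's standard `sqlite3` module as the relational database. The table name `content_template`, the group identifier and all function names are assumptions; the paths are simple name paths of the form used in the example, not full XPath expressions, and the small resolver only follows such name paths.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a DOM tree node."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def paths_of(root, targets, prefix=""):
    """Return {name: path} for every node whose name is in targets."""
    path = f"{prefix}/{root.name}"
    found = {root.name: path} if root.name in targets else {}
    for child in root.children:
        found.update(paths_of(child, targets, path))
    return found


def resolve(root, path):
    """Follow a /a/b/c name path from the root; return the node or None."""
    parts = path.strip("/").split("/")
    if parts[0] != root.name:
        return None
    node = root
    for part in parts[1:]:
        node = next((c for c in node.children if c.name == part), None)
        if node is None:
            return None
    return node


def save_template(conn, group_id, paths):
    """Write the body content node paths of one group into the template table."""
    conn.execute("CREATE TABLE IF NOT EXISTS content_template "
                 "(group_id TEXT, path TEXT)")
    conn.executemany("INSERT INTO content_template VALUES (?, ?)",
                     [(group_id, p) for p in paths])


def load_template(conn, group_id):
    rows = conn.execute("SELECT path FROM content_template "
                        "WHERE group_id = ? ORDER BY path", (group_id,))
    return [row[0] for row in rows]
```

Once stored, the paths of a group can be loaded and resolved against the DOM tree of any other file in the same group to extract its body content, which is how the template is intended to be applied.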

Those skilled in the art will understand that the above ways of obtaining a content identification template are only examples. Other existing or future ways of obtaining a content identification template, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, file acquisition device 11, first acquisition device 12, comparative analysis device 13 and template acquisition device 14 operate continuously. Specifically, file acquisition device 11 continuously obtains a plurality of markup language files to be processed. First acquisition device 12 then continuously obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files. Comparative analysis device 13 in turn continuously compares and analyzes, for each group of at least one group of markup language files, the contents of corresponding nodes in the DOM trees that correspond to the markup language files in the group, to obtain the body content node comprising the body content of that group. Template acquisition device 14 then continuously obtains, according to the obtained body content node, the content identification template for identifying the body content of that group. Those skilled in the art will understand that "continuously" means that the devices keep carrying out, respectively, the obtaining of markup language files, the obtaining of groups of markup language files, the comparative analysis of each group, and the obtaining of the content identification template for identifying the body content of markup language files, until a predetermined stop condition is met, for example until file acquisition device 11 has obtained no markup language files for a long period.

In a preferred embodiment (with reference to Fig. 1), comparative analysis device 13 comprises a similarity acquisition unit (not shown) and a node acquisition unit (not shown). The similarity acquisition unit compares and analyzes the contents of corresponding nodes in the DOM trees that correspond to the markup language files in each group, to obtain the similarity of those contents. The node acquisition unit then determines the body content node according to the similarity.

The preferred embodiment is described in detail below with reference to Fig. 1, in which file acquisition device 11 obtains a plurality of markup language files to be processed; first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files; and template acquisition device 14 obtains, according to the obtained body content node, the content identification template for identifying the body content of that group of markup language files. These processes are the same as those performed by file acquisition device 11, first acquisition device 12 and template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity they are incorporated herein by reference and not repeated.

Particularly; The content of respective nodes and subtree node thereof compares analysis in pairing each dom tree of making language document in every group at least one group echo language file that the similarity acquiring unit obtains first deriving means 12; To obtain the similarity of said content; Wherein, the method that obtains said similar content degree includes but not limited to:

1) respective nodes of each dom tree and the word content of subtree node thereof are carried out character string relatively, confirm the similarity of this content, wherein, the degree of string matching is high more, and then the similarity of content is high more, otherwise then the similarity of this content is low more;

2) respective nodes of each dom tree and the word content of subtree node thereof are carried out participle; And, confirm the similarity of this content, wherein through identical participle quantity in each respective nodes word content is added up; The quantity of identical participle is few more; Then the similarity of content is low more, otherwise then the similarity of this content is high more; At this, word algorithm included but not limited to the forward maximum match in said minute, reverse maximum match, two-way maximum match, language model method, shortest path first or the like; Subsequently; The node that the node acquiring unit obtains according to the similarity acquiring unit and the similarity of subtree node content thereof; For example be lower than preset similarity threshold according to similarity, then this content is a body matter, otherwise; This content is the rule of non-body matter, confirm this node whether literary composition comprise the body matter node of body matter.

In one example, the similarity acquisition unit obtains the text content of the corresponding nodes and their subtree nodes in the DOM trees corresponding to a group of HTML files, applies the forward maximum matching algorithm to segment each text content, and obtains 3000 distinct segmented words; by statistically analyzing the distribution of the segmented words across the text contents, it determines that more than a preset number, e.g. 1500, of the segmented words appear in every text content, and accordingly obtains the similarity of the text contents, e.g. 0.7. Subsequently, the node acquisition unit determines, from the similarity of the content of the node and its subtree nodes obtained by the similarity acquisition unit, that because the similarity exceeds the preset similarity threshold of 0.4, the node does not contain the body content of this group of HTML files.
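The two similarity methods and the threshold rule can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names are hypothetical, `difflib.SequenceMatcher` stands in for an unspecified string-matching measure, the token lists are assumed to come from some word segmenter, and the ratio of shared to distinct tokens is only one assumed way of turning the token statistics into a similarity value.

```python
from difflib import SequenceMatcher


def string_similarity(texts):
    """Method 1 (assumed measure): lowest pairwise string-match ratio."""
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(texts)
        for b in texts[i + 1:]
    ]
    return min(ratios) if ratios else 1.0


def token_similarity(token_lists):
    """Method 2 (assumed measure): share of distinct tokens found in every text."""
    token_sets = [set(tokens) for tokens in token_lists]
    all_tokens = set().union(*token_sets)
    if not all_tokens:
        return 1.0
    shared = set.intersection(*token_sets)
    return len(shared) / len(all_tokens)


def is_body_content(similarity, threshold=0.4):
    """Content that differs enough across the files of a group is body content."""
    return similarity < threshold


# Navigation text repeats across pages, article text does not.
nav_similarity = string_similarity(["Home News Sports", "Home News Sports"])
article_similarity = token_similarity([["rain", "floods", "city"],
                                       ["team", "wins", "final"]])
```

With the value 0.7 from the example above, `is_body_content(0.7)` is false under the 0.4 threshold, which agrees with the conclusion that the node does not contain body content.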

Those skilled in the art will understand that the above ways of obtaining node content similarity and of obtaining the body content node containing body content are merely examples; other existing or future ways of obtaining node content similarity or of obtaining the body content node containing body content, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

In a further preferred embodiment (with reference to Fig. 1), the template acquisition device 14 comprises a path information acquisition unit (not shown) and a template generation unit (not shown), wherein the path information acquisition unit obtains, according to the body content node, path information corresponding to the body content node; subsequently, the template generation unit adds the path information to the content identification template, to obtain the content identification template.

The preferred embodiment is described in detail below with reference to Fig. 1, wherein the file acquisition device 11 obtains a plurality of markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the related information of the plurality of markup language files; and the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of the group. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12 and the comparative analysis device 13 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and are not described again.

Specifically, the path information acquisition unit obtains, according to the body content node containing the body content of a group of markup language files obtained by the comparative analysis device 13, the path information of that node from the DOM tree in which the node is located, wherein ways of expressing the path information include but are not limited to:

- XPath;

- XPath combined with a regular expression, where a regular expression is a single character string used to describe or match a series of character strings conforming to a certain syntactic rule.

Subsequently, the template generation unit writes the path information obtained by the path information acquisition unit into the content identification template used to identify the body content of the group of markup language files, to obtain the content identification template.

In one example, as shown in Fig. 3A, the body content nodes containing the body content of a group of markup language files obtained by the comparative analysis device 13 are N6 and N7; the path information acquisition unit obtains, according to these body content nodes, the corresponding path information "/R0/N2/N4/N[6-7]{1}"; subsequently, the template generation unit writes this path information into a content identification template file, to obtain the template for identifying the body content of the group of markup language files.
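How a path can be read off a DOM tree may be sketched as follows, using Python's standard `xml.etree.ElementTree`. The tree literal is hypothetical: it borrows the node names of the example as tag names, and the placement of N1 and N3 is assumed. The sketch emits one plain path per body content node and does not attempt the combined "N[6-7]{1}" notation, whose exact syntax the text does not define.

```python
import xml.etree.ElementTree as ET


def node_path(root, target):
    """Return the slash-separated tag path from the root down to `target`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    tags = []
    node = target
    while node is not None:
        tags.append(node.tag)
        node = parent_of.get(node)
    return "/" + "/".join(reversed(tags))


# Hypothetical DOM tree; only R0 -> N2 -> N4 -> N6/N7 is taken from the example.
tree = ET.fromstring("<R0><N1/><N2><N3/><N4><N6/><N7/></N4></N2></R0>")
body_nodes = [tree.find("./N2/N4/N6"), tree.find("./N2/N4/N7")]

# The "template" here is simply the list of paths of the body content nodes.
template = [node_path(tree, node) for node in body_nodes]
```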

Those skilled in the art will understand that the above ways of obtaining path information and of obtaining the content identification template are merely examples; other existing or future ways of obtaining path information or of obtaining the content identification template, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

In another preferred embodiment (with reference to Fig. 1), the template providing equipment 1 further comprises a second acquisition device (not shown), wherein the second acquisition device obtains, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files; the comparative analysis device 13 then compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of the at least one group so obtained, to obtain the body content node.

The preferred embodiment is described in detail below with reference to Fig. 1, wherein the file acquisition device 11 obtains a plurality of markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the related information of the plurality of markup language files; and the template acquisition device 14 obtains, according to the obtained body content node, a content identification template for identifying the body content of the group of markup language files. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12 and the template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and are not described again.

Specifically, the second acquisition device obtains the groups of markup language files according to the predetermined rule, for example obtaining all groups provided by the first acquisition device 12, or only those groups whose number of markup language files exceeds a predetermined quantity; then the comparative analysis device 13 performs the comparative analysis on each group obtained by the second acquisition device, to obtain, for each group, the body content node containing the body content of that group; wherein the predetermined rule comprises obtaining the groups of markup language files based on at least any one of the following:

1) the number of markup language files in the group;

Specifically, when the predetermined rule is based on the number of markup language files in the group: only when the group contains many markup language files, e.g. more than a file count threshold, can comparing and analyzing the node content of the markup language files yield the body content node of the group with good accuracy; otherwise the body content node obtained will be inaccurate. The second acquisition device therefore obtains only those groups whose number of markup language files exceeds the file count threshold;

2) the number of nodes of the DOM trees corresponding to the markup language files, etc.;

Specifically, when the predetermined rule is based on the number of nodes of the DOM trees corresponding to the markup language files in the group: if every DOM tree has very few nodes, e.g. fewer than a node count threshold, the corresponding markup language files contain little content and their body content need not be extracted. The second acquisition device therefore obtains only those groups in which the number of nodes of each DOM tree exceeds the node count threshold.

Those skilled in the art will understand that each of the above items may be used alone by the second acquisition device to obtain the groups of markup language files, or several of them may be combined for that purpose.

Those skilled in the art will also understand that the above predetermined rules are merely examples; other existing or future predetermined rules, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

In one example, the first acquisition device 12 obtains 3 groups of HTML files, and the second acquisition device directly extracts these 3 groups. In another example, the first acquisition device 12 obtains 4 groups of markup language files, G3, G4, G5 and G6, containing 120, 50, 5 and 150 markup language files respectively; the second acquisition device then extracts the 2 groups whose number of markup language files exceeds the predetermined quantity, namely G3 and G6, where the predetermined quantity may, for example, be set to 100.
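The group selection rules can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the dictionary layout and the `"nodes"` key are assumptions, and the two thresholds correspond to the file count rule and the DOM node count rule described above.

```python
def select_groups(groups, min_files=None, min_nodes=None):
    """Keep the groups that pass the file-count and/or DOM-node-count rule."""
    selected = {}
    for name, files in groups.items():
        # Rule 1: the group must contain more files than the threshold.
        if min_files is not None and len(files) <= min_files:
            continue
        # Rule 2: every DOM tree in the group must exceed the node threshold.
        if min_nodes is not None and any(f["nodes"] <= min_nodes for f in files):
            continue
        selected[name] = files
    return selected


# Group sizes from the example; the per-file node count of 200 is made up.
sizes = {"G3": 120, "G4": 50, "G5": 5, "G6": 150}
groups = {name: [{"nodes": 200}] * count for name, count in sizes.items()}
kept = sorted(select_groups(groups, min_files=100))
```

With the predetermined quantity set to 100, `kept` contains only G3 and G6, as in the example.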

In another preferred embodiment (with reference to Fig. 1), the template providing equipment 1 further comprises a template annotation device (not shown), wherein the template annotation device marks, in the content identification template and according to the body content contained in the body content node, body content related information corresponding to the body content node, wherein the body content related information comprises at least any one of the following:

- type information of the body content;

- display priority of the body content.

The preferred embodiment is described in detail below with reference to Fig. 1, wherein the file acquisition device 11 obtains a plurality of markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the related information of the plurality of markup language files; the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of the group; and the template acquisition device 14 obtains, according to the obtained body content node, a content identification template for identifying the body content of the group. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12, the comparative analysis device 13 and the template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and are not described again.

Specifically, the template annotation device marks, in the content identification template in which the body content node is located, the body content related information corresponding to that node, based on the body content contained in the body content node obtained by the comparative analysis device 13 and in its subtree nodes, for example according to a predetermined annotation rule; wherein the body content related information comprises at least any one of the following:

1) type information of the body content, where the type information includes but is not limited to title content block, body content block, navigation content block, etc.;

2) display priority of the body content; for example, body content with a higher display priority is displayed preferentially, nearer the top of the web page.

In one example, the plain text content in the body content contained in a body content node has more than 5000 characters, and this plain text content occupies 85% of the display area when the body content is displayed; the template annotation device determines from this information that the type of the body content is a body content block and, according to that type information, that the body content has high display priority; the template annotation device then writes the related information of the body content into the content identification template file in which the body content node is located, as shown in Table 1 below.

Table 1

Content node information	Content type information	Display priority
/R0/N1/N3	T1	High
/R0/N1/N9/N20	T3	Low
/R0/N1/N[6-7]{1}	T6	Medium
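One possible shape for such an annotation rule is sketched below in Python. Everything in it is assumed for illustration: the record layout mirrors the three columns of Table 1, the descriptive type strings replace the codes T1, T3 and T6 (whose meanings the table does not give), and the thresholds of 5000 characters and 0.8 display ratio are hypothetical rule parameters chosen so that the example above (more than 5000 characters, 85% of the display) qualifies.

```python
from dataclasses import dataclass


@dataclass
class TemplateEntry:
    path: str       # content node information, e.g. "/R0/N1/N3"
    type_info: str  # content type information
    priority: str   # display priority


def annotate(path, text_chars, display_ratio, min_chars=5000, min_ratio=0.8):
    """Label a node as a body content block when its text is long and dominant."""
    if text_chars > min_chars and display_ratio >= min_ratio:
        return TemplateEntry(path, "body content block", "high")
    # Anything else is treated as low-priority content in this sketch.
    return TemplateEntry(path, "other", "low")


main_entry = annotate("/R0/N1/N3", text_chars=5200, display_ratio=0.85)
side_entry = annotate("/R0/N1/N9/N20", text_chars=300, display_ratio=0.05)
```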

Preferably, the template file may also mark non-body content node information, together with the content type information, display priority, etc. of the non-body content corresponding to that node information.

Those skilled in the art will also understand that the above content related information and the ways of marking it are merely examples; other existing or future content related information or ways of marking content related information, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

Fig. 4 is a schematic diagram of equipment for identifying the body content of markup language files according to a preferred embodiment of the present invention, wherein the first acquisition device 12' further comprises a screening unit 121' and a clustering unit 122'. Here, devices 11' and 13' shown in Fig. 4 are identical to devices 11 and 13 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and are not described again.

Specifically, the screening unit 121' screens the plurality of markup language files according to a predetermined screening condition, to obtain at least one markup language file satisfying the predetermined screening condition; then the clustering unit 122' clusters the at least one markup language file according to the related information of the DOM trees corresponding to the at least one markup language file, to obtain the one or more groups of markup language files; finally, the template acquisition device 14' obtains, according to the obtained body content node, the content identification template corresponding to the predetermined screening condition.

More specifically, the screening unit 121' screens, based on the predetermined screening condition, the plurality of markup language files obtained by the file acquisition device 11', to obtain at least one markup language file satisfying the predetermined screening condition. Preferably, the predetermined screening condition includes but is not limited to at least any one of the following:

1) the network address of the markup language file;

Specifically, if the predetermined screening condition is based on the network address of a markup language file, where the network address includes but is not limited to a URL address, an IP address, etc., the screening unit 121' may screen the markup language files according to their network addresses or a regular expression over the network addresses;

2) the website to which the markup language file belongs;

Specifically, if the predetermined screening condition is based on the website to which a markup language file belongs, e.g. whether the markup language files come from the same website or from websites of the same type, the screening unit 121' may, for example, screen the HTML files according to whether they come from news websites.

Those skilled in the art will understand that each of the above predetermined screening conditions may be used alone by the screening unit 121' to screen the plurality of markup language files, or several of them may be combined for that purpose.

Those skilled in the art will also understand that the above screening conditions are merely examples; other existing or future screening conditions, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

Then, the clustering unit 122' clusters the markup language files obtained by the screening unit 121' according to the related information of their corresponding DOM trees, to obtain the one or more groups of markup language files corresponding to the predetermined screening condition.

Finally, the template acquisition device 14' obtains, according to the body content node obtained by the comparative analysis device 13' for each of the one or more groups of markup language files, one or more content identification templates in one-to-one correspondence with the groups, and uses these one or more content identification templates as the content identification templates corresponding to the predetermined screening condition.

In one example, a predetermined screening condition C1 is that the uniform resource locator (URL) address of an HTML file satisfies the regular expression http://www.abc.com/news*.*html; the screening unit 121' screens the 150 HTML files obtained by the file acquisition device 11' according to this condition and obtains the 70 HTML files whose URL addresses satisfy the regular expression; the clustering unit 122' then clusters these 70 HTML files according to the related information of their DOM trees, to obtain 3 groups of HTML files corresponding to the predetermined screening condition C1; the template acquisition device 14' obtains, according to the body content node obtained by the comparative analysis device 13' for each of the 3 groups, 3 content identification template files corresponding to the 3 groups, and uses these 3 content identification template files as the content identification templates corresponding to the predetermined screening condition C1.
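The screening step of this example can be sketched with Python's standard `re` module. The pattern in the example reads like a wildcard expression, so the regular expression below is an assumed interpretation of it rather than a literal transcription; the function name and the sample URLs are likewise hypothetical.

```python
import re

# Assumed regex reading of the wildcard-style pattern in condition C1.
CONDITION_C1 = re.compile(r"http://www\.abc\.com/news.*\.html")


def screen(urls, condition=CONDITION_C1):
    """Return the URLs that match the predetermined screening condition in full."""
    return [url for url in urls if condition.fullmatch(url)]


candidates = [
    "http://www.abc.com/news/2011/a.html",
    "http://www.abc.com/sports/b.html",
    "http://www.abc.com/news1.html",
]
news_pages = screen(candidates)
```

The files that pass the screen would then be handed to the clustering step, and the resulting templates stored under the condition that selected them.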

Those skilled in the art will also understand that the above ways of screening and clustering markup language files are merely examples; other existing or future ways of screening or clustering markup language files, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

Preferably, the template providing equipment 1 further comprises a screening condition acquisition device (not shown), a template selection device (not shown) and a body content identification device (not shown), wherein the screening condition acquisition device obtains the predetermined screening condition satisfied by other markup language files whose body content is to be identified; then the template selection device selects the content identification template corresponding to the predetermined screening condition satisfied by the other markup language files; and then the body content identification device identifies the body content of the other markup language files according to the selected content identification template.

Specifically, the screening condition acquisition device obtains, for example from a third-party device, triggered by a predetermined condition or event or periodically, other markup language files whose body content is to be identified, and matches them against each predetermined screening condition to obtain the screening condition that the markup language file satisfies; then the template selection device obtains from the template acquisition device 14', according to the screening condition obtained by the screening condition acquisition device, the one or more content identification templates corresponding to it, extracts the body content node information, such as XPath, from each content identification template, and matches this node information against the DOM tree corresponding to the other markup language file according to a predetermined matching rule, to obtain the content identification template corresponding to the other markup language file; wherein the matching rule includes but is not limited to:

1) if a corresponding tree node can be found in the DOM tree of the other markup language file for every piece of body content node information in the content identification template, the other markup language file matches the content identification template;

2) if a corresponding tree node can be found in the DOM tree of the other markup language file for every piece of body content node information marked as required in the content identification template, the other markup language file matches the content identification template.

Then, the body content identification device extracts each piece of body content node information from the content identification template obtained by the template selection device, looks up the body content nodes in the DOM tree of the other markup language file according to that node information, and obtains the body content from those nodes and their subtree nodes.
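The two matching rules and the subsequent extraction can be sketched as follows with `xml.etree.ElementTree`. This is a simplified illustration under several assumptions: the template is modelled as a list of dictionaries with hypothetical `"path"` and `"required"` keys, paths use the limited XPath subset that ElementTree supports, and the input is a small well-formed tree rather than real HTML.

```python
import xml.etree.ElementTree as ET


def matches(template, root, required_only=False):
    """Rule 1 checks every entry; rule 2 checks only entries marked required."""
    entries = [e for e in template if e["required"]] if required_only else template
    return all(root.find(e["path"]) is not None for e in entries)


def extract_body(template, root):
    """Collect the text of each body content node together with its subtree."""
    parts = []
    for entry in template:
        for node in root.findall(entry["path"]):
            parts.append("".join(node.itertext()).strip())
    return parts


# Hypothetical template and pages; node names follow the earlier path example.
template = [
    {"path": "./N2/N4/N6", "required": True},
    {"path": "./N2/N4/N7", "required": False},
]
full_page = ET.fromstring(
    "<R0><N1>menu</N1><N2><N4><N6>Headline</N6><N7>Story text</N7></N4></N2></R0>"
)
short_page = ET.fromstring("<R0><N2><N4><N6>Headline only</N6></N4></N2></R0>")

body = extract_body(template, full_page) if matches(template, full_page) else []
```

In this sketch `short_page` fails rule 1 because N7 is missing, but passes rule 2 because only N6 is marked as required.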

Those skilled in the art will also understand that the above ways of obtaining screening conditions, selecting a template and obtaining body content are merely examples; other existing or future ways of obtaining screening conditions, selecting a template or obtaining body content, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.

Those skilled in the art will also understand that the above first acquisition device and second acquisition device are merely examples; in practice, they may be two independent modules or may be integrated into one module.

Fig. 5 is a flow chart of a method for identifying the body content of markup language files according to one aspect of the present invention. The template providing equipment 1 includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.

As shown in Fig. 5, in step S1, the template providing equipment 1 obtains a plurality of markup language files to be processed.

Specifically, in step S1, the template providing equipment 1 obtains, according to a predetermined file acquisition rule, a plurality of markup language files corresponding to internet web pages from the web page library of the template providing equipment 1.

Alternatively, in step S1, the template providing equipment 1 reads the plurality of markup language files directly from a third-party device through an agreed communication mode, triggered by a predetermined condition or event or periodically.

The markup language files include but are not limited to:

- Hypertext Markup Language (HTML) files;

- Extensible Hypertext Markup Language (XHTML) files;

- Extensible Markup Language (XML) files.

In one example, in step S1, the template providing equipment 1 statistically analyzes the web page related information in its web page library to obtain the number of times each web page has been accessed by users through mobile terminals, and accordingly obtains the HTML files corresponding to the web pages whose access count exceeds a predetermined access count. This predetermined access count varies with actual requirements and the specific application: for example, in an application with fewer users it may be tens of thousands to hundreds of thousands, and in an application with more users it may be hundreds of thousands to millions, which those skilled in the art can determine according to actual requirements and the specific application.

In another example, in step S1, the template providing equipment 1 periodically calls an agreed application programming interface (API) to send a request for markup language files to a third-party device, and receives the plurality of markup language files that the third-party device returns in response to the request.

Subsequently, in step S2, the template providing equipment 1 obtains one or more groups of markup language files according to the related information of the plurality of markup language files it obtained in step S1.

Specifically, in step S2, based on the plurality of markup language files obtained in step S1, the template providing equipment 1, for example, obtains the related information of the plurality of markup language files and clusters the markup language files accordingly, to obtain one or more groups of markup language files; or it obtains the related information of some of the markup language files and clusters those files, to obtain one or more groups of markup language files. The related information of the plurality of markup language files includes but is not limited to:

a) related information of the DOM trees corresponding to the plurality of markup language files; specifically, when the related information of the plurality of markup language files comprises the related information of their corresponding DOM trees, then in step S2 the template providing equipment 1 clusters the plurality of markup language files according to the related information of the DOM trees, to obtain one or more groups of markup language files; wherein the related information of the DOM trees includes but is not limited to:

i) the number of nodes of the DOM tree; specifically, when the related information of the DOM tree comprises the number of nodes of the DOM tree, then in step S2 the template providing equipment 1 clusters the plurality of markup language files according to the number of nodes, for example clustering markup language files that have the same number of nodes, or whose number of nodes falls within a certain predetermined interval, into the same group;

ii) the topology information of the DOM tree; specifically, when the related information of the DOM tree comprises the topology information of the DOM tree, where the topology information includes but is not limited to the distribution of the tree nodes in the DOM tree, then in step S2 the template providing equipment 1 clusters markup language files having the same tree node distribution into the same group.

Those skilled in the art will understand that each of the above items of DOM tree related information may be used alone by the template providing equipment 1 to obtain one or more groups of markup language files, or several of them may be combined for that purpose.

If the related information of the plurality of markup language files comprises resource information, then in step S2 the template providing equipment 1 clusters the plurality of markup language files according to that resource information, to obtain one or more groups of markup language files.

Those skilled in the art will understand that each of the above items of related information of the markup language files may be used alone by the template providing equipment 1 to obtain one or more groups of markup language files, or several of them may be combined for that purpose.

In one example, in step S2, the template providing equipment 1 parses a plurality of HTML files and generates a corresponding DOM tree for each, and then clusters the HTML files according to the topology information of each DOM tree, where the topology information of a DOM tree includes but is not limited to the distribution of the tree nodes of the DOM tree.
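Clustering by DOM topology can be sketched with Python's standard `html.parser`. The sketch rests on assumptions not stated in the text: the topology of a file is summarized as the set of tag paths down to a fixed depth (a stand-in for the distribution of tree nodes), files are grouped only when these signatures are identical, and the class name, function names and the `max_depth` parameter are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Records the tag path of every element down to `max_depth` levels."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if len(self.stack) < self.max_depth:
            self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the most recent matching open tag.
            while self.stack.pop() != tag:
                pass


def topology_signature(html, max_depth=3):
    collector = PathCollector(max_depth)
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def cluster_by_topology(files, max_depth=3):
    """Group file names whose DOM trees share the same shallow signature."""
    groups = defaultdict(list)
    for name, html in files.items():
        groups[topology_signature(html, max_depth)].append(name)
    return list(groups.values())


files = {
    "f1": "<html><body><div><p>a</p></div></body></html>",
    "f2": "<html><body><div><p>b</p><p>c</p></div></body></html>",
    "f3": "<html><body><table><tr><td>x</td></tr></table></body></html>",
}
clusters = cluster_by_topology(files)
```

Limiting the signature to the upper levels of the tree means that files differing only in their deeper nodes, such as f1 and f2, still fall into the same group.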

Taking Fig. 2 and Fig. 3 as examples, in step S2, the DOM trees corresponding to one part of the HTML files clustered by the template providing equipment 1 have the topology shown in Fig. 2, and the DOM trees corresponding to the other HTML files have the topology shown in Fig. 3; the template providing equipment 1 thus obtains 2 groups of HTML files, group G1 and group G2, where the HTML files in group G1 have the topology shown in Fig. 2 and those in group G2 have the topology shown in Fig. 3. Preferably, the topologies of the DOM trees of HTML files clustered into one group need not be fully identical; it suffices that the distribution of the backbone nodes of their DOM trees is consistent. For example, the DOM tree T1 corresponding to HTML file F1 is shown in Fig. 3A and the DOM tree T2 corresponding to HTML file F2 is shown in Fig. 3B; as the figures show, T1 and T2 both have the DOM tree topology shown in Fig. 3, so F1 and F2 are clustered into group G2.

In another example, in step S2, the template providing equipment 1 counts the <a> tags in each of a plurality of HTML files to obtain the number of hypertext links in each HTML file, and clusters the HTML files accordingly. Preferably, the similarity of the anchor text of the hypertext links may also be taken into account when clustering the HTML files, to obtain several groups of HTML files, wherein the HTML files in each group have the same number of hypertext links and the content similarity of their anchor text exceeds a predetermined similarity threshold.
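Counting <a> tags and grouping by the count can be sketched as follows with Python's standard `html.parser`. The class and function names are hypothetical, and the optional refinement by anchor text similarity is left out of this sketch.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts the <a> start tags seen while parsing an HTML file."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def cluster_by_link_count(files):
    """Group file names by their number of hypertext links."""
    groups = defaultdict(list)
    for name, html in files.items():
        counter = LinkCounter()
        counter.feed(html)
        counter.close()
        groups[counter.count].append(name)
    return dict(groups)


files = {
    "a": '<p><a href="x">1</a><a href="y">2</a></p>',
    "b": '<a href="z">3</a> <a href="w">4</a>',
    "c": "<p>no links</p>",
}
by_links = cluster_by_link_count(files)
```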

Then, in step S3, the template providing equipment 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of the group.

Particularly; In step S3; Template provides equipment 1 in step S2, to obtain at least one group echo language file in one or more groups making language document according to it, for example obtains the making language document in every group respectively, and those making language documents are resolved; To obtain its corresponding dom tree; And the content in corresponding node in each dom tree and the subtree node thereof compared analysis, and obtaining to comprise the body matter node of this group echo language file body matter, the method for wherein said comparative analysis includes but not limited to:

1) The number of characters of non-link text in the content of the corresponding node and its subtree nodes in each DOM tree. If, in more than a preset proportion of the DOM trees, the number of non-link text characters in the content of the corresponding node and its subtree nodes exceeds a certain character count threshold, then in step S3 the template providing equipment 1 determines that this node is a body content node containing body content;

2) The proportion of the total display space that the content of the corresponding node occupies in each DOM tree when displayed. If, in more than a preset proportion of the DOM trees, the display space proportion occupied by the content of the corresponding node exceeds a certain proportion threshold, then in step S3 the template providing equipment 1 determines that this node is a body content node containing body content;

3) The similarity of the content of the corresponding node and its subtree nodes across the DOM trees. If the contents of the corresponding node and its subtree nodes in the DOM trees all have a mutual similarity below a certain similarity threshold, then in step S3 the template providing equipment 1 determines that this node is a body content node containing body content.
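Method 1) above can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: it assumes that "corresponding nodes" are located by the same relative path in every DOM tree, that link text is any text inside an <a> element, and that the input is well-formed markup that `xml.etree.ElementTree` can parse. The function names, thresholds and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def non_link_chars(node):
    """Count text characters in node and its subtree, skipping <a> elements."""
    if node.tag == "a":
        return 0
    total = len((node.text or "").strip())
    for child in node:
        total += non_link_chars(child)
        # Text following a child element belongs to the current node.
        total += len((child.tail or "").strip())
    return total


def is_body_node(trees, path, char_threshold, min_tree_ratio):
    """True if the node at `path` is text-heavy in enough of the trees."""
    hits = 0
    for root in trees:
        node = root.find(path)
        if node is not None and non_link_chars(node) > char_threshold:
            hits += 1
    return hits / len(trees) > min_tree_ratio


doc1 = "<html><body><div><a href='/x'>nav link</a></div><div>" + "x" * 40 + "</div></body></html>"
doc2 = "<html><body><div><a href='/y'>other nav</a></div><div>" + "y" * 30 + "</div></body></html>"
trees = [ET.fromstring(doc) for doc in (doc1, doc2)]

second_div_is_body = is_body_node(trees, "body/div[2]", char_threshold=20, min_tree_ratio=0.5)
first_div_is_body = is_body_node(trees, "body/div[1]", char_threshold=20, min_tree_ratio=0.5)
```

In the sample, the first `div` holds only link text and is rejected, while the second `div` exceeds the character threshold in both trees and is accepted.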

In one example, in step S3, the template providing equipment 1 obtains a group of HTML files and parses two HTML files of the group to obtain two DOM trees T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B.

Then, in step S3, the template providing equipment 1 traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For example, it obtains the number of characters in the content of node N4 and its subtree nodes N6 and N7 in T3, say 2500, and the number of characters in the content of the corresponding node N4' and its subtree node N6' in T4, say 2000. Both character counts exceed the predetermined character count threshold of 1500. Therefore, in step S3, the template providing equipment 1 takes this node as a body content node containing the body content of this group of HTML files.

In another example, in step S3, the template providing equipment 1 obtains a group of HTML files and parses two HTML files of the group to obtain two DOM trees T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B. Then, in step S3, the template providing equipment 1 traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For example, it obtains the height and width set for displaying the content of node N3 in T3, together with the height and width of the web page display corresponding to that HTML file, and from these determines that the content of this node occupies 30% of the display space of the web page. Similarly, it determines that the content of the corresponding node N3' in T4 occupies 35% of the display space. Both proportions exceed the predetermined proportion threshold of 20%. Therefore, in step S3, the template providing equipment 1 takes this node as a body content node containing the body content of this group of HTML files.
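The display space comparison in this example reduces to simple area arithmetic, sketched below in Python. The pixel dimensions are hypothetical values chosen only so that the ratios come out at the 30% and 35% given in the example, and the function names and the `min_tree_ratio` parameter are assumptions.

```python
def display_ratio(node_width, node_height, page_width, page_height):
    """Fraction of the page display area occupied by one node's content."""
    return (node_width * node_height) / (page_width * page_height)


def is_body_by_display(ratios, ratio_threshold, min_tree_ratio=0.5):
    """True if the node exceeds the threshold in more than min_tree_ratio of trees."""
    hits = sum(1 for ratio in ratios if ratio > ratio_threshold)
    return hits / len(ratios) > min_tree_ratio


# Hypothetical sizes: N3 in T3 and N3' in T4 on an 800 x 1000 page.
ratio_t3 = display_ratio(480, 500, 800, 1000)
ratio_t4 = display_ratio(560, 500, 800, 1000)
n3_is_body = is_body_by_display([ratio_t3, ratio_t4], ratio_threshold=0.20)
```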

Subsequently, in step S4, the template providing equipment 1 obtains, according to the body content nodes obtained, a content identification template for identifying the body content of this group of markup language files.

Specifically, in step S4, for each body content node it obtained in step S3 as containing the body content of the group of markup language files, the template providing equipment 1 writes, for example, the number assigned to that body content node in the DOM tree under an agreed numbering rule, or the path information of that body content node in the DOM tree, into the content identification template corresponding to the group. Here, the path information may for example be an XPath, where XPath is a path expression with which the corresponding tree node can be located in the DOM tree. The content identification template describes the information of each body content node containing body content; it may be stored in a file system as a template file, or stored in a relational database as a data table.

In one example, as shown in Fig. 3A, in step S3 the template providing equipment 1 determines that the body content nodes containing the body content of a certain group of markup language files are N1, N4 and N5, and the numbering rule for body content nodes is to number the tree nodes of the DOM tree from top to bottom and from left to right. Thus, in step S4, the template providing equipment 1 determines according to this numbering rule that the numbers corresponding to N1, N4 and N5 are 1, 4 and 5 respectively, and writes them into the content identification template file.
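A top-to-bottom, left-to-right numbering corresponds to a breadth-first traversal, which the following Python sketch illustrates. Fig. 3A is not reproduced here, so the tree shape is an assumption pieced together from the node paths quoted in this description (the position of N5 under N2 in particular is a guess), the root is assumed to receive number 0, and tags are assumed unique so that they can serve as dictionary keys.

```python
import xml.etree.ElementTree as ET
from collections import deque


def number_nodes(root):
    """Number nodes level by level, left to right, starting at 0 for the root."""
    numbers = {}
    queue = deque([root])
    index = 0
    while queue:
        node = queue.popleft()
        numbers[node.tag] = index
        index += 1
        queue.extend(node)  # iterating an Element yields its children in order
    return numbers


# Assumed shape of the tree in Fig. 3A.
tree = ET.fromstring("<R0><N1><N3/></N1><N2><N4><N6/><N7/></N4><N5/></N2></R0>")
numbers = number_nodes(tree)
template_numbers = [numbers[tag] for tag in ("N1", "N4", "N5")]
```

Under these assumptions the numbers written to the template are 1, 4 and 5, matching the example.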

In another example, as shown in Fig. 3A, in step S3 the template providing equipment 1 determines that the body content nodes containing the body content of a certain group of markup language files are N3 and N4. Thus, in step S4, the template providing equipment 1 obtains the XPaths corresponding to those body content nodes in the DOM tree: the XPath of N3 is "/R0/N1/N3" and the XPath of N4 is "/R0/N2/N4". It then writes those XPaths into the relational database holding the content identification template corresponding to this group of markup language files.
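Storing the XPaths as a data table could look like the Python sketch below, which uses an in-memory SQLite database. The table name `content_template`, its columns and the group identifier "G2" are assumptions; the description only states that the template may be kept in a relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per body content node of a group.
conn.execute("CREATE TABLE content_template (group_id TEXT, xpath TEXT)")
conn.executemany(
    "INSERT INTO content_template (group_id, xpath) VALUES (?, ?)",
    [("G2", "/R0/N1/N3"), ("G2", "/R0/N2/N4")],
)
conn.commit()

rows = conn.execute(
    "SELECT xpath FROM content_template WHERE group_id = ? ORDER BY xpath",
    ("G2",),
).fetchall()
stored_paths = [row[0] for row in rows]
conn.close()
```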

Preferably, the above steps operate continuously. Specifically, in step S1, the template providing equipment 1 continuously obtains a plurality of markup language files to be processed; subsequently, in step S2, it continuously obtains one or more groups of markup language files according to the relevant information of those markup language files; then, in step S3, it continuously compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of at least one group of markup language files, to obtain the body content nodes containing the body content of that group; then, in step S4, it continuously obtains, according to the body content nodes obtained, the content identification template for identifying the body content of that group. Those skilled in the art will understand that "continuously" here means that the steps respectively keep obtaining markup language files, obtaining groups of markup language files, comparing and analyzing each group of markup language files, and obtaining content identification templates for identifying the body content of markup language files, until a predetermined stop condition is met, for example that the template providing equipment 1 has not obtained any markup language file for a long period.

In a preferred embodiment (with reference to Fig. 5), step S3 comprises step S31 (not shown) and step S32 (not shown). In step S31, the template providing equipment 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain the similarity of that content; subsequently, in step S32, the template providing equipment 1 determines the body content nodes according to the similarity.

The preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains a plurality of markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files; in step S4, it obtains, according to the body content nodes obtained, the content identification template for identifying the body content of the group of markup language files. The detailed processes are the same as those of steps S1, S2 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S31, the template providing equipment 1 compares and analyzes the content of corresponding nodes and their subtree nodes in the DOM trees corresponding to the markup language files in each group of the at least one group of markup language files it obtained in step S2, to obtain the similarity of that content. The methods of obtaining the content similarity include, but are not limited to:

2) Performing word segmentation on the text content of the corresponding node and its subtree nodes in each DOM tree, and determining the similarity of the content by counting the segmented words that the text contents of the corresponding nodes have in common: the fewer the identical segmented words, the lower the similarity of the content, and the more, the higher the similarity. Here, the word segmentation algorithms include, but are not limited to, forward maximum matching, reverse maximum matching, bidirectional maximum matching, language model methods, shortest path algorithms, and so on.

Subsequently, in step S32, the template providing equipment 1 determines, according to the similarity of the content of the node and its subtree nodes obtained in step S31, whether the node is a body content node containing body content, for example under the rule that content whose similarity is below a preset similarity threshold is body content and content whose similarity is not below the threshold is non-body content.

In one example, in step S31, the template providing equipment 1 obtains the text content of a corresponding node and its subtree nodes in each DOM tree corresponding to a certain group of HTML files, and applies the forward maximum matching algorithm to segment each text content, obtaining 3000 distinct segmented words. By statistically analyzing the distribution of the segmented words across the text contents, it determines that more than a certain preset number of them, say 1500, appear in every one of the text contents; in step S31, the template providing equipment 1 accordingly obtains a similarity for these text contents, say 0.7. Subsequently, in step S32, since the similarity of the content of the node and its subtree nodes obtained in step S31 is higher than the preset similarity threshold of 0.4, the template providing equipment 1 determines that this node does not contain the body content of this group of HTML files.
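The segmentation and comparison in steps S31 and S32 can be sketched in Python as below. Forward maximum matching is implemented in its usual dictionary-based form. The similarity formula is an assumption: the description only says that more shared segmented words means higher similarity, so the sketch uses the share of distinct words that occur in every text. The dictionary, sample strings and function names are hypothetical; only the 0.4 threshold comes from the example.

```python
def forward_max_match(text, dictionary):
    """Segment text greedily, always taking the longest dictionary word first."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def content_similarity(token_lists):
    """Share of distinct words that occur in every text (assumed measure)."""
    sets = [set(tokens) for tokens in token_lists]
    shared = set.intersection(*sets)
    distinct = set.union(*sets)
    return len(shared) / len(distinct) if distinct else 0.0


dictionary = {"home", "news", "sports", "contact"}
tokens_a = forward_max_match("homenewscontact", dictionary)
tokens_b = forward_max_match("homesportscontact", dictionary)
similarity = content_similarity([tokens_a, tokens_b])

SIMILARITY_THRESHOLD = 0.4
is_body_content = similarity < SIMILARITY_THRESHOLD
```

Here the two texts share most of their words, so the similarity exceeds the threshold and the node is treated as non-body content, in the same way as the 0.7 versus 0.4 case in the example.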

In another preferred embodiment (with reference to Fig. 5), step S4 comprises step S41 (not shown) and step S42 (not shown). In step S41, the template providing equipment 1 obtains, according to the body content node, the path information corresponding to that body content node; subsequently, in step S42, the template providing equipment 1 adds the path information to the content identification template, to obtain the content identification template.

The preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains a plurality of markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files; in step S3, it compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of at least one group of markup language files, to obtain the body content nodes containing the body content of that group. The detailed processes are the same as those of steps S1, S2 and S3 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S41, the template providing equipment 1 obtains, for each body content node it obtained in step S3 as containing the body content of a certain group of markup language files, the path information of that node from the DOM tree in which the node is located. The ways of expressing this path information include, but are not limited to:

- XPath;

Subsequently, in step S42, the template providing equipment 1 writes the path information it obtained in step S41 into the content identification template used to identify the body content of this group of markup language files, to obtain the content identification template.

In one example, as shown in Fig. 3A, in step S3 the template providing equipment 1 determines that the body content nodes containing the body content of a certain group of markup language files are N6 and N7. In step S41, the template providing equipment 1 obtains the path information corresponding to those body content nodes as "/R0/N2/N4/N[6-7]{1}". Subsequently, in step S42, the template providing equipment 1 writes this path information into a content identification template file, to obtain the template for identifying the body content of this group of markup language files.

In another preferred embodiment (with reference to Fig. 5), the process further comprises step S5 (not shown). In step S5, the template providing equipment 1 obtains, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files; then, in step S3, the template providing equipment 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of the at least one group obtained, to obtain the body content nodes.

The preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains a plurality of markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files; in step S4, it obtains, according to the body content nodes obtained, the content identification template for identifying the body content of the group of markup language files. The detailed processes are the same as those of steps S1, S2 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S5, the template providing equipment 1 obtains the groups of markup language files according to a predetermined rule, for example obtaining all the groups of markup language files that the template providing equipment 1 produced in step S2, or obtaining only those groups whose number of markup language files exceeds a predetermined number. Then, in step S3, the template providing equipment 1 performs the comparative analysis on each group of markup language files obtained in step S5, and obtains for each group the body content nodes containing the body content of that group. The predetermined rule includes obtaining the groups of markup language files based on at least any one of the following:

1) the number of markup language files in the group;

Specifically, when the predetermined rule is based on the number of markup language files in the group: only when the group contains many markup language files, for example more than a certain file count threshold, can the body content nodes containing the body content of the group be obtained accurately by comparing and analyzing the node content of the markup language files; otherwise the body content nodes obtained will be inaccurate. Therefore, in step S5, the template providing equipment 1 obtains only the groups whose number of markup language files exceeds this file count threshold;

2) the number of nodes of the DOM trees corresponding to the markup language files, etc.;

Specifically, when the predetermined rule is based on the number of nodes of the DOM trees corresponding to the markup language files in the group: if the number of nodes of each DOM tree is small, for example below a certain node count threshold, the corresponding markup language files also contain little content and their body content need not be extracted. Therefore, in step S5, the template providing equipment 1 obtains only the groups in which the number of nodes of each DOM tree exceeds this node count threshold.

Those skilled in the art will understand that each of the above items may be used on its own by the template providing equipment 1 to obtain the groups of markup language files, and that several of them may also be used in combination for that purpose.

In one example, in step S2, the template providing equipment 1 obtains 3 groups of HTML files, and in step S5 the template providing equipment 1 directly takes all 3 groups. In another example, in step S2, the template providing equipment 1 obtains 4 groups of markup language files, G3, G4, G5 and G6, containing 120, 50, 5 and 150 markup language files respectively. In step S5, the template providing equipment 1 then takes the 2 groups whose number of markup language files exceeds the predetermined number, namely G3 and G6, where the predetermined number may for example be set to 100.
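The group selection of step S5 can be sketched in Python as a simple filter. The sketch combines both rules (file count and per-tree node count); the function name, the representation of a group as a list of per-file node counts, and the node count of 200 are assumptions, while the group sizes and the file count threshold of 100 come from the example.

```python
def select_groups(groups, min_files=0, min_nodes=0):
    """Keep groups with more than min_files files whose trees all exceed min_nodes nodes."""
    selected = []
    for name, node_counts in groups.items():
        enough_files = len(node_counts) > min_files
        enough_nodes = all(count > min_nodes for count in node_counts)
        if enough_files and enough_nodes:
            selected.append(name)
    return selected


# Each group is a list holding the DOM tree node count of every file in it.
groups = {
    "G3": [200] * 120,
    "G4": [200] * 50,
    "G5": [200] * 5,
    "G6": [200] * 150,
}
selected = select_groups(groups, min_files=100)
```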

In another preferred embodiment (with reference to Fig. 5), the process further comprises step S6 (not shown). In step S6, the template providing equipment 1 marks, in the content identification template, body content related information corresponding to the body content node according to the body content contained in that body content node, where the body content related information includes at least any one of the following:

- the type information of the body content;

- the display priority of the body content.

The preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains a plurality of markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files; in step S3, it compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of at least one group of markup language files, to obtain the body content nodes containing the body content of that group; in step S4, it obtains, according to the body content nodes obtained, the content identification template for identifying the body content of the group. The detailed processes are the same as those of steps S1, S2, S3 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S6, the template providing equipment 1 marks, according to the body content contained in the body content node and its subtree nodes obtained in step S3, and for example under a predetermined marking rule, the body content related information corresponding to that body content node in the content identification template containing the node, where the body content related information includes at least any one of the following:

In one example, the plain text content in the body content contained in a certain body content node has more than 5000 characters, and the display of this plain text content occupies 85% of the display of the body content. In step S6, the template providing equipment 1 determines from this information that the type of the body content is a body content block and, according to this type information, that the body content has a high display priority. Then, in step S6, the template providing equipment 1 writes the relevant information of the body content into the content identification template file containing this body content node, as shown in Table 2 below.

Table 2

Fig. 6 is a flowchart of a method for identifying the body content of markup language files according to a preferred embodiment of the present invention, in which step S2' further comprises step S21' and step S22'. Steps S1' and S3' shown in Fig. 6 are the same as steps S1 and S3 described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S21', the template providing equipment 1 screens the plurality of markup language files according to a predetermined filtering condition, to obtain at least one markup language file satisfying the predetermined filtering condition; then, in step S22', the template providing equipment 1 clusters the at least one markup language file according to the relevant information of its corresponding DOM tree, to obtain the one or more groups of markup language files; finally, in step S4', the template providing equipment 1 obtains, according to the body content nodes obtained, the content identification template corresponding to this predetermined filtering condition.

More specifically, in step S21', the template providing equipment 1 screens the plurality of markup language files it obtained in step S1' according to the predetermined filtering condition, to obtain at least one markup language file satisfying this predetermined filtering condition. Preferably, the predetermined filtering condition includes, but is not limited to, at least any one of the following:

1) the network address of the markup language file;

Specifically, if the predetermined filtering condition is based on the network address of the markup language file, where the network address includes but is not limited to a URL address, an IP address and the like, then in step S21' the template providing equipment 1 screens those markup language files according to the network address of the markup language file or a regular expression over the network address;

2) the website to which the markup language file belongs;

Specifically, if the predetermined filtering condition is based on the website to which the markup language file belongs, for example whether the markup language files come from the same website or from websites of the same type, then in step S21' the template providing equipment 1 may, for example, screen those HTML files according to whether each HTML file comes from a news website.

Those skilled in the art will understand that each of the above predetermined filtering conditions may be used on its own in step S21' by the template providing equipment 1 to screen the plurality of markup language files, and that several of them may also be used in combination in step S21' for that purpose.

Then, in step S22', the template providing equipment 1 clusters the markup language files it obtained in step S21' according to the relevant information of their corresponding DOM trees, to obtain the one or more groups of markup language files corresponding to this predetermined filtering condition.

Finally, in step S4', the template providing equipment 1 obtains, according to the body content nodes it obtained in step S3' for each of the one or more groups of markup language files, one or more content identification templates in one-to-one correspondence with those groups, and takes these one or more content identification templates as the content identification templates corresponding to this predetermined filtering condition.

In one example, a predetermined filtering condition C1 is that the URL address of the HTML file matches the regular expression http://www.abc.com/news*.*html. In step S21', the template providing equipment 1 screens the 150 HTML files it obtained according to this predetermined filtering condition, and obtains the 70 HTML files whose URL addresses match the regular expression. Then, in step S22', the template providing equipment 1 clusters these 70 HTML files according to their DOM tree related information, to obtain 3 groups of HTML files corresponding to the predetermined filtering condition C1. In step S4', the template providing equipment 1 obtains, according to the body content nodes it obtained in step S3' for each of these 3 groups of markup language files, 3 content identification template files corresponding to the 3 groups, and takes these 3 content identification template files as the content identification templates corresponding to the predetermined filtering condition C1.
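The URL screening of step S21' can be sketched in Python with the standard `re` module. The pattern is copied verbatim from the example; treating it as a Python regular expression that must match the whole URL is an assumption, and the sample URLs and the function name `screen` are hypothetical.

```python
import re

# Pattern copied verbatim from the example; its unescaped dots match any character.
CONDITION_C1 = re.compile(r"http://www.abc.com/news*.*html")


def screen(urls, pattern=CONDITION_C1):
    """Keep only the URLs that the filtering pattern matches in full."""
    return [url for url in urls if pattern.fullmatch(url)]


urls = [
    "http://www.abc.com/news/2011/0826.html",
    "http://www.abc.com/sports/1.html",
    "http://www.abc.com/news/index.htm",
]
kept = screen(urls)
```

Read as a regular expression, `news*` means "new" followed by zero or more "s", so a stricter deployment might escape the dots and spell out the intended wildcard; the sketch keeps the pattern unchanged to stay close to the example.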

Preferably, the process further comprises step S7' (not shown), step S8' (not shown) and step S9' (not shown). In step S7', the template providing equipment 1 obtains the predetermined filtering condition satisfied by other markup language files whose body content is to be identified; then, in step S8', the template providing equipment 1 selects the content identification template corresponding to the predetermined filtering condition satisfied by those other markup language files; then, in step S9', the template providing equipment 1 identifies the body content of those other markup language files according to the selected content identification template.

Specifically, in step S7', the template providing equipment 1 obtains other markup language files whose body content is to be identified from a third-party device, for example when triggered by a predetermined condition or event, or periodically, and matches them against each predetermined filtering condition to obtain the filtering condition that the markup language file satisfies. Then, in step S8', according to the filtering condition obtained in step S7', the template providing equipment 1 obtains the corresponding one or more content identification templates from those it obtained in step S4', extracts the body content node information, such as the XPath, from each content identification template, and matches this node information in the DOM tree corresponding to the other markup language file according to a predetermined matching rule, to obtain the content identification template corresponding to the other markup language file. The matching rule includes, but is not limited to:

Then, in step S9', the template providing equipment 1 extracts each piece of body content node information from the content identification template it obtained in step S8', locates the body content nodes in the DOM tree of the other markup language file according to that body content node information, and obtains the body content from those nodes and their subtree nodes.
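Steps S7' to S9' can be sketched end to end in Python as follows. This is an illustration under several assumptions: templates are kept in a dictionary keyed by the URL pattern of their filtering condition, each template is a list of absolute node paths, the matching rule is a plain path lookup, and the new document is well-formed markup. The names `TEMPLATES`, `select_template`, `find_node` and `extract_body` and the sample page are hypothetical. Because `xml.etree.ElementTree` does not accept absolute paths on an element, the sketch strips the root step before calling `find`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical store: filtering condition (URL pattern) -> body content node paths.
TEMPLATES = {
    r"http://www.abc.com/news*.*html": ["/R0/N1/N3", "/R0/N2/N4"],
}


def select_template(url):
    """Return the node paths of the first template whose condition the URL satisfies."""
    for pattern, paths in TEMPLATES.items():
        if re.fullmatch(pattern, url):
            return paths
    return None


def find_node(root, xpath):
    """Resolve an absolute path such as /R0/N1/N3 against the root element."""
    parts = xpath.strip("/").split("/")
    if parts[0] != root.tag:
        return None
    return root if len(parts) == 1 else root.find("/".join(parts[1:]))


def extract_body(root, paths):
    """Collect the text of each body content node together with its subtree."""
    chunks = []
    for xpath in paths:
        node = find_node(root, xpath)
        if node is not None:
            chunks.append("".join(node.itertext()).strip())
    return chunks


page = ET.fromstring(
    "<R0><N1><N3>Headline</N3></N1>"
    "<N2><N4><N6>First paragraph. </N6><N7>Second paragraph.</N7></N4>"
    "<N5>Advertisement</N5></N2></R0>"
)
paths = select_template("http://www.abc.com/news/2011/0826.html")
body = extract_body(page, paths)
```

In the sample, the text under N5 is left out because N5 is not listed in the template, while N4 contributes the text of both of its subtree nodes.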

It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced by the invention. No reference sign in a claim should be regarded as limiting the claim concerned. In addition, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a system claim may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims

1. A computer-implemented method for identifying the body content of markup language files, wherein the method comprises the following steps:

a. obtaining a plurality of markup language files to be processed;

2. The method according to claim 1, wherein the relevant information of the plurality of markup language files comprises at least any one of the following:

- the relevant information of the DOM trees corresponding to the plurality of markup language files;

- the resource information in the plurality of markup language files.

3. The method according to claim 2, wherein the relevant information of the DOM tree comprises at least any one of the following:

- the number of nodes of the DOM tree;

- the topology information of the DOM tree.

4. The method according to any one of claims 1 to 3, wherein step c comprises:

- comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain the similarity of the content;

- determining the body content node according to the similarity.

5. The method according to any one of claims 1 to 4, wherein step d comprises:

- obtaining the path information of the body content node in the DOM tree;

- adding the path information to the content identification template, to obtain the content identification template.

6. The method according to any one of claims 1 to 5, wherein the method further comprises:

- obtaining, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files;

wherein step c comprises:

- comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of the at least one group of markup language files obtained, to obtain the body content node.

7. The method according to claim 6, wherein the predetermined rule comprises obtaining the at least one group of markup language files based on at least any one of the following:

- the number of markup language files in the group of markup language files;

- the number of nodes of the DOM trees corresponding to the markup language files in the group of markup language files.

8. The method according to any one of claims 1 to 7, wherein the method further comprises:

- marking, in the content identification template, body content related information corresponding to the body content node according to the body content contained in the body content node;

wherein the body content related information comprises at least any one of the following:

- the type information of the body content;

- the display priority of the body content.

9. The method according to any one of claims 1 to 8, wherein step b comprises:

- filtering the plurality of markup language files according to a predetermined filter condition, to obtain at least one markup language file that satisfies the predetermined filter condition;

- clustering the at least one markup language file according to relevant information of the DOM tree corresponding to the at least one markup language file, to obtain the one or more groups of markup language files;

wherein step d comprises:

- obtaining, according to the obtained body content node, the content identification template corresponding to the predetermined filter condition.
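
The filtering and clustering of claim 9 can be sketched as follows, under stated assumptions: the filter condition is the website host of the file's network address, and "clustering" is simplified to grouping files whose DOM trees have an identical tag structure. A real implementation would likely use a tolerant similarity measure rather than exact equality. The names `signature` and `filter_and_group` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse
import xml.etree.ElementTree as ET


def signature(node):
    """Structural fingerprint of a DOM subtree: its tag and its children's fingerprints."""
    return (node.tag, tuple(signature(child) for child in node))


def filter_and_group(files, host):
    """Group the files hosted on `host` by DOM structure.

    `files` maps a URL to well-formed markup. Files on other hosts fail
    the (assumed) filter condition and are skipped.
    """
    groups = defaultdict(list)
    for url, markup in files.items():
        if urlparse(url).hostname != host:
            continue
        groups[signature(ET.fromstring(markup))].append(url)
    return list(groups.values())
```

Each resulting group can then be compared node by node, and the template derived from it is stored under the same filter condition (here, the host).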

10. The method according to claim 9, wherein the predetermined filter condition filters the plurality of markup language files based on at least any one of the following:

- the network address of the markup language file;

- the website to which the markup language file belongs.

11. The method according to claim 10, wherein the method further comprises:

- obtaining the predetermined filter condition satisfied by another markup language file whose body content is to be identified;

- selecting the content identification template corresponding to the predetermined filter condition satisfied by the other markup language file;

- identifying the body content of the other markup language file according to the selected content identification template.
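
The three steps of claim 11 might look like the sketch below. It assumes that templates are keyed by website host and that each template is a list of paths in the syntax accepted by `xml.etree.ElementTree`; the `templates` mapping and the function name `identify_body` are hypothetical.

```python
from urllib.parse import urlparse
import xml.etree.ElementTree as ET


def identify_body(url, markup, templates):
    """Return the body text of a new file, or None if no template applies.

    `templates` maps a website host (the assumed filter condition) to a
    list of paths that locate body content nodes.
    """
    # Step 1: obtain the filter condition the new file satisfies.
    host = urlparse(url).hostname
    # Step 2: select the template stored for that condition.
    paths = templates.get(host)
    if paths is None:
        return None
    # Step 3: apply the template paths to the new file's DOM tree.
    root = ET.fromstring(markup)
    found = []
    for path in paths:
        node = root.find(path)
        if node is not None:
            pieces = (text.strip() for text in node.itertext())
            found.append(" ".join(piece for piece in pieces if piece))
    return found
```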

12. The method according to any one of claims 1 to 11, wherein the markup language file includes at least any one of the following:

- an HTML file;

- an XHTML file;

- an XML file.

13. Equipment for identifying body content of a markup language file, wherein the equipment comprises:

14. The equipment according to claim 13, wherein the relevant information of the plurality of markup language files includes at least any one of the following:

- resource information in the plurality of markup language files.

15. The equipment according to claim 14, wherein the relevant information of the DOM tree includes at least any one of the following:

- the number of nodes of the DOM tree;

- topology information of the DOM tree.

16. The equipment according to any one of claims 13 to 15, wherein the comparative analysis apparatus comprises:

a similarity obtaining unit, configured to compare and analyze the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain a similarity of the content;

a node obtaining unit, configured to determine the body content node according to the similarity.

17. The equipment according to any one of claims 13 to 16, wherein the template obtaining apparatus comprises:

a path information obtaining unit, configured to obtain path information of the body content node in the DOM tree;

a template generation unit, configured to add the path information to the content identification template, to obtain the content identification template.

18. The equipment according to any one of claims 13 to 17, wherein the equipment further comprises:

a second obtaining apparatus, configured to obtain, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files;

wherein the comparative analysis apparatus is configured to compare and analyze the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group of the obtained at least one group of markup language files, to obtain the body content node.

19. The equipment according to claim 18, wherein the predetermined rule comprises obtaining the at least one group of markup language files based on at least any one of the following:

- the number of markup language files in the group of markup language files;

20. The equipment according to any one of claims 13 to 19, wherein the equipment further comprises:

a template annotation apparatus, configured to mark, in the content identification template, body-content-related information corresponding to the body content node, according to the body content contained in the body content node;

wherein the body-content-related information includes at least any one of the following:

- type information of the body content;

- a display priority of the body content.

21. The equipment according to any one of claims 13 to 20, wherein the first obtaining apparatus comprises:

a filtering unit, configured to filter the plurality of markup language files according to a predetermined filter condition, to obtain at least one markup language file that satisfies the predetermined filter condition;

a clustering unit, configured to cluster the at least one markup language file according to relevant information of the DOM tree corresponding to the at least one markup language file, to obtain the one or more groups of markup language files;

wherein the template obtaining apparatus is configured to obtain, according to the obtained body content node, the content identification template corresponding to the predetermined filter condition.

22. The equipment according to claim 21, wherein the predetermined filter condition filters the plurality of markup language files based on at least any one of the following:

- the network address of the markup language file;

- the website to which the markup language file belongs.

23. The equipment according to claim 22, wherein the equipment further comprises:

a filter condition obtaining apparatus, configured to obtain the predetermined filter condition satisfied by another markup language file whose body content is to be identified;

a template selection apparatus, configured to select the content identification template corresponding to the predetermined filter condition satisfied by the other markup language file;

a body content identification apparatus, configured to identify the body content of the other markup language file according to the selected content identification template.

24. The equipment according to any one of claims 13 to 23, wherein the markup language file includes at least any one of the following:

- an HTML file;

- an XHTML file;

- an XML file.