CN102819595A

CN102819595A - Web page classification method, web page classification device and network equipment

Info

Publication number: CN102819595A
Application number: CN2012102851019A
Authority: CN
Inventors: 王祖海
Original assignee: Beijing Star Net Ruijie Networks Co Ltd
Current assignee: Beijing Star Net Ruijie Networks Co Ltd
Priority date: 2012-08-10
Filing date: 2012-08-10
Publication date: 2012-12-12

Abstract

The invention provides a web page classification method, a web page classification device and network equipment. The method comprises the following steps: extracting information of different classification weight levels from the source file of a web page; performing word segmentation on the information of each classification weight level to obtain the segmented words of each classification weight level; and classifying the web page using the segmented words of each classification weight level, in order of classification weight level from high to low. The technical scheme exploits the fact that the more important a piece of information in a web page is, the more it influences the classification result: the web page is classified by preferentially using the information with a higher classification weight level, which helps reduce the influence of invalid information in the web page on classification and thereby helps improve classification accuracy.

Description

Web page classification method, device and network equipment

Technical field

The present invention relates to network communication technology, and in particular to a web page classification method, a web page classification device and network equipment.

Background art

The Internet is developing rapidly and the volume of web data is growing sharply; people have entered an age of abundant information. Faced with disorganized web information resources, people need to classify and organize massive amounts of web information so that the useful information they want can be found quickly. Automatic web page classification is a key technology for processing and organizing web pages at large scale, and an important way to organize information resources reasonably and effectively. The accuracy of web page classification depends to a large extent on how web page information is extracted.

An existing web page classification process comprises: extracting web page information from the source file of the web page (also called denoising the source file), performing Chinese word segmentation on the extracted information, and classifying the web page according to the segmented words obtained. Commonly used extraction methods, for example methods based on the Document Object Model (DOM) tree, all suffer from low extraction accuracy. Commonly used segmentation methods, for example string-matching segmentation, understanding-based segmentation and statistical segmentation, all suffer from inaccurate segmentation. As a result, the accuracy of web page classification is low.

Summary of the invention

The present invention provides a web page classification method, a web page classification device and network equipment, in order to improve the accuracy of web page classification.

One aspect of the present invention provides a web page classification method, comprising:

extracting information of different classification weight levels from the source file of a web page;

performing word segmentation on the information of each classification weight level to obtain the segmented words of each classification weight level;

classifying the web page using the segmented words of each classification weight level, in order of classification weight level from high to low.

Another aspect of the present invention provides a web page classification device, comprising:

an information extraction module, configured to extract information of different classification weight levels from the source file of a web page;

a segmented word acquisition module, configured to perform word segmentation on the information of each classification weight level to obtain the segmented words of each classification weight level;

a classification processing module, configured to classify the web page using the segmented words of each classification weight level, in order of classification weight level from high to low.

A further aspect of the present invention provides network equipment comprising any web page classification device provided by the present invention.

With the web page classification method, device and network equipment provided by the present invention, information of different classification weight levels is extracted from the source file of a web page, word segmentation is performed on the information of each classification weight level to obtain segmented words of different classification weight levels, and the web page is then classified using the segmented words of each level in order of classification weight level from high to low. Unlike the prior art, the technical scheme of the present invention does not classify the web page using all of the extracted information at once. Instead, it exploits the fact that the more important a piece of information in a web page is, the more it influences the classification result, and preferentially uses the information with a higher classification weight level. This helps reduce the influence of invalid information in the web page on classification and thereby helps improve classification accuracy.

Description of drawings

Fig. 1 is a flowchart of the web page classification method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the form in which the context relationships between segmented words are stored, provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the web page classification device provided by an embodiment of the present invention.

Detailed description of the embodiments

Fig. 1 is a flowchart of the web page classification method provided by an embodiment of the present invention. The method of this embodiment is executed by a web page classification device. As shown in Fig. 1, the method of this embodiment comprises:

Step 101: extract information of different classification weight levels from the source file of a web page.

A web page is a file stored on a computer somewhere in the world that is connected to the Internet. Different web pages are identified and accessed through a network address, for example a Uniform Resource Locator (URL). For example, after a user enters a network address in the browser of a terminal device, the web page corresponding to that address is sent to the terminal device, and the user can view the page through the browser. Web pages usually use the HyperText Markup Language (HTML) format, in which case the file extension of the web page is .html or .htm. HTML is currently the most widely used language on the network and the main language of web documents. An HTML file is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The structure of an HTML file comprises two major parts, the head (Head) and the body (Body): the head mainly describes the information the browser needs, and the body contains the specific content the web page is intended to present. If a web page uses the HTML format, its source file is the HTML file that constitutes the page. If a web page uses another format, its source file is the file of that format that constitutes the page.

In this embodiment, when the web page classification device extracts information from the source file of a web page, it assigns each piece of information a classification weight level and thereby extracts information of different classification weight levels. Information of different classification weight levels differs in its importance within the web page and in its degree of influence on the classification result: the higher the classification weight level of a piece of information, the more important it is within the web page and the greater its influence on the classification result. In other words, the more important information in a web page mainly determines the category to which the page belongs. Note that this embodiment does not limit how many classification weight levels of information the device extracts from the source file. Those skilled in the art will readily understand, however, that the more classification weight levels are extracted from the source file, the finer the division of information by importance, and if the category of the web page can be determined using the information of the highest or a higher classification weight level, the classification result is more accurate. In addition, the more classification weight levels are extracted, the less information each level contains, which makes classifying the web page according to that information easier. For example, if the information of the highest classification weight level is enough to determine the category of the web page, the information of the other levels need not be used during classification, which helps improve classification efficiency. The trade-off is that the up-front burden on the device of extracting the information of these levels is heavier.

In addition, besides the information that is useful for classification, a web page contains a large amount of information that is irrelevant to classification, for example advertising information, script information and punctuation mark information. If all of this information is used to classify the web page at once, it not only increases the amount of information processed during classification and lowers classification efficiency, but also interferes with classification and reduces its accuracy. In this embodiment, while extracting information of different classification weight levels, the web page classification device removes the useless information that it can identify and treats useless information that it cannot identify as information of a lower classification weight level. Because information of a higher classification weight level is used preferentially in the subsequent classification, this helps reduce the amount of information used during classification and improves classification efficiency. It also reduces the influence of useless information on classification and improves classification accuracy.

In an optional implementation of this embodiment, the web page classification device analyzes the characteristics of the web page itself and divides the information in the page into three levels: first-level information, second-level information and third-level information. The classification weight level decreases in that order. One way in which the device extracts information of different classification weight levels from the source file of the web page is as follows:

In the source file of a web page, the header information mainly comprises the title of the page (title), the keywords used by the page (keyword) and a summary of the page (description). These three pieces of information essentially summarize the whole page and reflect, to a large extent, the category to which the page belongs. The web page classification device of this embodiment extracts the header information from the source file as first-level information.

In an optional implementation of this embodiment, if the source file of the web page contains no header information, the web page classification device may set the first-level information to empty, but is not limited to this. For example, the device may instead set the first-level information to preset information.
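The extraction of first-level information described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: it assumes the header fields appear as a `<title>` element and as `<meta name="keywords">` / `<meta name="description">` tags, and the names `HeadInfoParser` and `extract_level_one` are hypothetical.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the title text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Assumption: standard meta names are used for these fields.
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_level_one(source):
    """Return the header fields found, or an empty list if there are none."""
    parser = HeadInfoParser()
    parser.feed(source)
    parts = [
        parser.title.strip(),
        parser.meta.get("keywords", ""),
        parser.meta.get("description", ""),
    ]
    return [part for part in parts if part]
```

A page without any of these fields yields an empty list, which corresponds to the option of setting the first-level information to empty.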

Once the first-level information is removed from the source file of the web page, what remains is the body. The body mainly consists of the main text, a series of section headings or block headings, script information, copyright information, punctuation mark information, and so on. Different information in the body influences the classification result to different degrees. In this embodiment, the web page classification device first extracts second-level information from the body; second-level information here mainly refers to the main text and the section headings or block headings.

Finally, the web page classification device extracts third-level information from the information in the source file other than the first-level and second-level information. Third-level information has essentially no influence on classification; that is, it is basically not used when classifying the web page.

In an optional implementation of this embodiment, the web page classification device extracts second-level information from the body as follows. The source file of a web page usually uses tables (table) to store information at different positions, and the main text is usually the longest content in the page (it has the largest amount of information). The device can therefore obtain the table with the largest amount of information from the source file and treat the information in that table as the main text. Further optionally, to make the extraction of the main text as accurate as possible, the device compares the two tables with the largest amounts of information and judges whether the amount of information in one is at least twice that in the other. If so, the content of the table whose amount of information is at least twice that of the other is taken as second-level information. The device may also take section headings in the body as second-level information. Optionally, because the order in which section headings appear indicates, to some extent, their importance in the page, the device preferably takes the first and second section headings that appear in the source file as second-level information.
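The table-selection rule above can be sketched as follows. This is an illustration under stated assumptions: the text of each table has already been extracted into a string, "amount of information" is approximated by string length, and the behavior when only one table exists or when the factor-of-two test fails is not specified in the text, so the choices made here (return the single table; return `None`) are assumptions. The function names are hypothetical.

```python
def pick_main_table(table_texts):
    """Pick the table treated as the main text, or None if no clear winner."""
    if not table_texts:
        return None
    # Rank tables by amount of information (approximated by text length).
    ranked = sorted(table_texts, key=len, reverse=True)
    if len(ranked) == 1:
        return ranked[0]  # assumption: a lone table is accepted as the main text
    largest, runner_up = ranked[0], ranked[1]
    # Accept the largest table only if it holds at least twice as much
    # information as the second largest.
    if len(largest) >= 2 * len(runner_up):
        return largest
    return None  # assumption: the source does not say what happens otherwise


def pick_section_headings(headings, count=2):
    """Keep the first `count` headings in order of appearance."""
    return headings[:count]
```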

In an optional implementation of this embodiment, the web page classification device extracts third-level information from the information in the source file other than the first-level and second-level information as follows: the device removes the invalid information from that other information and takes what remains as third-level information.

The invalid information mentioned here comprises any one of the following kinds of information, or a combination of them:

Script information, which may appear in the source file as a <script>...</script> section, a <noscript>...</noscript> section or a <style>...</style> section.

Comment information, which may appear in the source file as a <!-- ... --> section, the rest of a line after //, or a /* ... */ section.

Page-bottom information, which may appear in the source file as <div id="footer, <div class="footer, &copy, <p class="copyright">, <div class="copyright">, or the Chinese phrases for "copyright reserved" or "all rights reserved" (in English, All Rights Reserved). Page-bottom information mainly comprises copyright information, site feedback information and the like, and this content varies from page to page.

Hidden-content information, which may appear in the source file as style="display:none" or visibility:hidden.

Punctuation mark information, which may appear in the source file as >, &raquo, &nbsp, &amp, &ldquo or &rdquo.
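Part of the invalid-information removal listed above can be sketched with regular expressions. This is a rough illustration, not a robust HTML cleaner: the pattern list and the name `strip_invalid` are assumptions, the `gt` entity is included on the assumption that the bare ">" in the list stands for it, and the `//` comments, footer blocks and hidden elements are deliberately left out because handling them safely needs a real parser (for example, stripping `//` naively would damage URLs).

```python
import re

INVALID_PATTERNS = [
    # <script>, <noscript> and <style> sections, including their content.
    re.compile(r"<(script|noscript|style)\b.*?</\1\s*>", re.I | re.S),
    # HTML comments.
    re.compile(r"<!--.*?-->", re.S),
    # Block comments of the /* ... */ form.
    re.compile(r"/\*.*?\*/", re.S),
    # Punctuation entities named in the text (semicolon optional).
    re.compile(r"&(?:gt|raquo|nbsp|amp|ldquo|rdquo);?", re.I),
]


def strip_invalid(source):
    """Replace each invalid span with a space, then collapse whitespace."""
    for pattern in INVALID_PATTERNS:
        source = pattern.sub(" ", source)
    return re.sub(r"\s+", " ", source).strip()
```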

Step 102: perform word segmentation on the information of each classification weight level to obtain the segmented words of each classification weight level.

Once information of different classification weight levels has been extracted from the source file of the web page by the operations above, the denoising of the web page is complete. The next step is to perform word segmentation on the extracted information of each classification weight level.

In this embodiment, the web page classification device may use any of various segmentation methods to segment the extracted information of each classification weight level.

In an optional implementation of this embodiment, the web page classification device segments the information of each classification weight level by combining string-matching segmentation with statistical segmentation. This improves segmentation accuracy and in turn helps improve the accuracy of classifying the web page based on the resulting segmented words of different classification weight levels. Specifically, for each information fragment in the information of each classification weight level, the device performs forward segmentation and reverse segmentation. If the forward and reverse segmentation results for the fragment are identical, that result is taken as the segmented words of the fragment. If they differ, statistical segmentation is applied to the forward result and to the reverse result, and the result with the larger word-combination probability is taken as the segmented words of the fragment. The segmented words of all the information fragments in a classification weight level together constitute the segmented words of that level. The word-combination probability here is the probability with which the word combinations in a segmentation result occur. Note that the segmented words obtained for an information fragment have the same classification weight level as the information to which the fragment belongs. For example, the segmented words obtained from the fragments of the first-level information constitute the first-level segmented words, the segmented words obtained from the fragments of the second-level information constitute the second-level segmented words, and so on. An information fragment may be a single word, a single character, a combination of several words, or a complete sentence.

Performing forward and reverse segmentation on an information fragment has two possible outcomes. In the first, the forward and reverse results agree; in that case the forward result (equivalently, the reverse result) is taken as the final segmentation result, and its reliability is relatively high. In the second, the forward and reverse results disagree. For example, performing forward and reverse segmentation on the sentence rendered literally as "he say really reason" gives the reverse result: "he", "says", (an untranslated particle), "certainly", "reasonable", and the forward result: "he", "says", "really", "really", "reason". The two results clearly disagree. In this case the statistical segmentation method is introduced to decide whether the forward or the reverse result is used as the final result, which helps improve segmentation accuracy and in turn the accuracy of classifying the web page based on the segmented words of each level. When the forward and reverse results disagree, instead of using statistical segmentation to decide, the device may also directly take either the reverse result or the forward result as the final segmentation result.
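The forward and reverse passes can be illustrated with maximum matching against a small dictionary. Several things here are assumptions rather than statements from the text: that string-matching segmentation means greedy maximum matching, that the example is the Chinese sentence 他说的确实在理 (inferred from the glosses above), and the toy dictionary and function names.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching; unknown characters become single words."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left matching, returned in reading order."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


# Assumed example sentence and a toy dictionary (both hypothetical).
SENTENCE = "他说的确实在理"
DICTIONARY = {"他", "说", "的", "的确", "确实", "实在", "在理", "理"}

forward = forward_max_match(SENTENCE, DICTIONARY)
reverse = reverse_max_match(SENTENCE, DICTIONARY)
```

With this dictionary the two passes split the middle of the sentence differently, which is the situation that the statistical step is meant to resolve.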

The statistical segmentation described above relies mainly on a statistical word-relation dictionary, which records the number of times each word is followed by each other word. The statistical segmentation process mainly comprises the following.

First, the web page classification device obtains the statistical word-relation dictionary. It is obtained as follows. A large number of articles are collected as a training set. The articles should cover as many industries and age groups as possible so that as many aspects as possible are represented; they may be, for example, Chinese-language textbooks or general-interest newspapers. Increasing the number of articles and the breadth of their distribution makes the trained dictionary more representative. Reverse segmentation is then performed on these articles (because reverse segmentation is more accurate than forward segmentation), and the number of times each pair of adjacent segmented words occurs is recorded in the statistical word-relation dictionary. For example, if a sentence is segmented into ABCCBDEDADBDBEC, the number of times each pair of segmented words occurs together is as shown in Table 1.

Table 1

	A	B	C	D	E
A	-	1	-	1	-
B	-	-	1	2	1
C	-	-	1	-	-
D	1	2	-	-	1
E	-	-	1	1	-

Each letter in ABCCBDEDADBDBEC above represents one word produced by reverse segmentation. Taking letter A as an example, A is followed once by letter B and once by letter D, so in Table 1 the cells in row A under columns B and D are each filled with 1. Taking letter B as another example, B is followed by letter D twice, that is, the combination BD occurs twice, so the cell in row B under column D is filled with 2. As Table 1 shows, whenever a word is followed by another word, the pair is added to the statistical word-relation dictionary and its occurrence count is incremented by 1. Table 1 is only a convenient view. In memory, the dictionary that Table 1 represents is actually stored as linked lists, in the form shown in Fig. 2, which saves a great deal of memory. The entries A to E are stored in order of their hash values, which makes lookups more efficient. Note that in Fig. 2 the number in parentheses after each letter is the number of times that letter occurs in combination with the letter at the head node of its linked list.
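The counting that produces Table 1 can be sketched as follows. This is a minimal illustration: a nested dictionary stands in for the linked-list storage of Fig. 2, and the name `count_following_pairs` is hypothetical.

```python
from collections import defaultdict


def count_following_pairs(words):
    """Map each word to the words that directly follow it, with counts."""
    table = defaultdict(lambda: defaultdict(int))
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


# Each letter stands for one segmented word, as in the example sentence.
table = count_following_pairs(list("ABCCBDEDADBDBEC"))
```

Reading `table["B"]["D"]` gives the count for the combination BD. Counting directly from the sequence also yields one C-followed-by-B pair (from the adjacent "CB"), which the table as reproduced above does not list.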

If a new word is added, the web page classification device searches all of the training sets above, finds the sentences that contain the word, segments them, examines the context, and then updates the counts of adjacent-word combinations in the statistical word-relation dictionary.

Note that the process by which the web page classification device obtains the statistical word-relation dictionary can be carried out in advance, and the dictionary can be updated continually.

Second, after obtaining the statistical word-relation dictionary, the web page classification device calculates the probabilities with which the two word combinations "certainly, reasonable" and "really, really" occur, and selects the one with the larger probability as the final segmentation result. For example, for the combination ABD recorded in Table 1, the occurrence count of the combination AB is 1 and that of the combination BD is 2, which indicates that the combination BD is more likely to occur.

As can be seen from the above, if the forward and reverse segmentation results for an information fragment differ, the web page classification device applies statistical segmentation to each of them and takes the result with the larger word-combination probability as the segmented words of the fragment. Specifically, the device looks up the forward segmentation result in the previously obtained statistical word-relation dictionary, which contains the number of times each pair of segmented words occurs together, and obtains the word-combination probability of the forward result. The device likewise looks up the reverse segmentation result in the dictionary and obtains its word-combination probability. The device then compares the two probabilities. If the word-combination probability of the forward result is greater than that of the reverse result, the forward result is taken as the segmented words of the fragment. If the word-combination probability of the forward result is less than or equal to that of the reverse result, the reverse result is taken as the segmented words of the fragment.
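The comparison rule can be sketched as follows. The text does not give a formula for the word-combination probability of a whole segmentation result, so summing the raw counts of adjacent pairs is an assumption made for illustration, as are the names `pair_score` and `choose_segmentation` and the flat `(first, second) -> count` mapping.

```python
def pair_score(words, pair_counts):
    """Assumed score: total count of every adjacent pair in the result."""
    return sum(pair_counts.get((a, b), 0) for a, b in zip(words, words[1:]))


def choose_segmentation(forward, reverse, pair_counts):
    """Forward wins only on a strictly larger score; ties go to reverse."""
    if forward == reverse:
        return forward
    if pair_score(forward, pair_counts) > pair_score(reverse, pair_counts):
        return forward
    return reverse


# Counts taken from Table 1: AB occurs once, BD occurs twice.
counts = {("A", "B"): 1, ("B", "D"): 2}
```

The tie case falls through to the reverse result, matching the "less than or equal to" branch described above.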

In the optional implementation above, the statistical segmentation method may well produce a word-combination probability of 0; for example, the combinations AC and AE in Table 1 have a count of 0. For this situation, the web page classification device can instead count how many times each of the two words has been combined with any following word. For example, for the combination AB: A has been combined once with B and once with D, so A has been combined 2 times in total; B has been combined once with C, twice with D and once with E, so B has been combined 4 times in total. For the combination BD: B has been combined 4 times in total and D has also been combined 4 times in total. The combination BD is therefore preferred. If the two candidates are still tied on these totals, the reverse segmentation result is used, because reverse segmentation is usually more accurate than forward segmentation.
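The fallback count can be reproduced from Table 1 as follows. The text gives the per-word totals but not how the two totals of a pair are combined, so adding them is an assumption; the table literal copies Table 1 as given, and the function names are hypothetical.

```python
# Rows of Table 1: first word -> {following word: count}.
TABLE_1 = {
    "A": {"B": 1, "D": 1},
    "B": {"C": 1, "D": 2, "E": 1},
    "C": {"C": 1},
    "D": {"A": 1, "B": 2, "E": 1},
    "E": {"C": 1, "D": 1},
}


def times_combined(word, table):
    """Total number of times `word` is followed by any other word."""
    return sum(table.get(word, {}).values())


def fallback_score(pair, table):
    """Assumed combination rule: add the totals of the two words."""
    first, second = pair
    return times_combined(first, table) + times_combined(second, table)
```

Under this assumption the pair BD scores higher than the pair AB, which agrees with the preference for BD stated above.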

In the optional implementation above, a word that does not appear in the segmentation dictionary cannot be segmented out during word segmentation, so the dictionary needs to be supplemented and updated in time. In the word segmentation process of this embodiment, if a combination of two or more characters is not recognized and contains no invalid characters, the web page classification device adds it to a quasi-phrase queue. If the number of times the combination occurs exceeds a preset threshold, the combination is added to the statistical word-relation dictionary as a new word, and the counts of adjacent-word combinations are updated again. The quasi-phrase queue records each unrecognized combination of two or more characters together with the number of times that combination has occurred so far.
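The quasi-phrase queue can be sketched as a counter with a promotion threshold. This is an illustration only: the class name, the `observe` method and the use of a plain set as a stand-in for the statistical word-relation dictionary are assumptions, and the check for invalid characters and the recount of pair frequencies after promotion are omitted.

```python
from collections import Counter


class QuasiPhraseQueue:
    """Tracks unrecognized combinations until they occur often enough."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, combination, known_words):
        """Record one occurrence; return True if the combination is promoted."""
        if combination in known_words:
            return False
        self.counts[combination] += 1
        # Promote only when the count exceeds the preset threshold.
        if self.counts[combination] > self.threshold:
            known_words.add(combination)
            del self.counts[combination]
            return True
        return False
```

With a threshold of 2, a combination is promoted on its third occurrence and ignored afterwards because it is then a known word.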

Step 103: classify the web page using the segmented words of each classification weight level, in order of classification weight level from high to low.

After completing the denoising and word segmentation of the web page through the operations above, the web page classification device classifies the web page according to the segmented words obtained. In this embodiment, the device no longer uses all of the obtained segmented words at once. Instead, it uses the segmented words of each classification weight level in order from high to low, so that when the category of the web page can be determined using only the segmented words of some higher classification weight levels (for example, only those of the highest level), the segmented words of the remaining levels need not be used.

Take as an example the case where information of three classification weight levels is extracted and segmented words of three classification weight levels are obtained. One optional process by which the web page classification device classifies the web page using the segmented words of each classification weight level, in order of classification weight level from high to low, is as follows. The web page classification device classifies the web page using the level-one segmented words; if the category of the web page is determined using the level-one segmented words, the operation ends. If the category is not determined using the level-one segmented words, the device classifies the web page using the level-one and level-two segmented words together; if the category is determined in this way, the operation ends. If the category is still not determined, the device classifies the web page using the level-one, level-two and level-three segmented words together. The process by which the web page classification device classifies the web page according to the segmented words of at least one classification weight level may follow a certain rule: if the probability that the web page is assigned to a certain category is determined to be greater than the preset probability threshold corresponding to that category, the web page is considered to belong to that category; otherwise, the web page is considered not to belong to that category and the subsequent operations continue. When using the segmented words of two or more classification weight levels, the web page classification device may use the segmented words of the earlier classification weight levels directly, or may use them after processing, for example by multiplying the number of segmented words of the earlier classification weight levels by a factor.
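The level-by-level process described above can be sketched in Python as follows. This is only an illustrative sketch, not the classifier of the embodiment: the names `classify_cascade`, `keyword_scorer`, `thresholds` and `boost` are hypothetical, and the toy scorer (fraction of words that are category keywords) merely stands in for whatever probability model is actually used.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set

# Hypothetical scorer type: maps segmented words to {category: probability}.
Scorer = Callable[[List[str]], Dict[str, float]]


def classify_cascade(levels: Sequence[List[str]],
                     score: Scorer,
                     thresholds: Dict[str, float],
                     boost: int = 1) -> Optional[str]:
    """Try level one alone, then levels one and two, then all three.

    Stops at the first stage where some category's probability exceeds
    that category's preset threshold; returns None if no stage succeeds.
    """
    pool: List[str] = []
    for words in levels:
        # Reuse the words gathered so far (repeated `boost` times, the
        # assumed form of "multiplying their number") plus the next level.
        pool = pool * boost + list(words)
        probs = score(pool)
        # A category without a threshold defaults to 1.0 and never matches.
        hits = {c: p for c, p in probs.items() if p > thresholds.get(c, 1.0)}
        if hits:
            return max(hits, key=hits.get)
    return None


def keyword_scorer(keywords: Dict[str, Set[str]]) -> Scorer:
    """Toy scorer (assumption): share of words that are category keywords."""
    def score(words: List[str]) -> Dict[str, float]:
        total = len(words) or 1
        return {c: sum(w in ks for w in words) / total
                for c, ks in keywords.items()}
    return score
```

With `boost=1` the earlier words are reused unchanged; a larger value corresponds to the variant in which the earlier levels' segmented words are amplified before the next level is added.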

As can be seen from the above, in the web page classification method provided by this embodiment, the web page classification device extracts information of different classification weight levels from the source file of the web page, performs word segmentation on the information of each classification weight level to obtain segmented words of each classification weight level, and then classifies the web page using the segmented words of each classification weight level in order of classification weight level from high to low. The device no longer classifies the web page directly with all of the obtained segmented words at once. Because the levels are used in order from high to low, when the category of the web page can be determined from the segmented words of the higher classification weight levels alone, the segmented words of the subsequent classification weight levels need not be used. By exploiting the characteristic that more important information in a web page has a greater influence on the classification result, and by preferentially using information of a higher classification weight level, the method reduces the influence of invalid information on classification accuracy and improves the accuracy of web page classification. In addition, preferentially using information of a higher classification weight level helps reduce the amount of information used in classifying the web page and improves the efficiency of classification.

The inventor of the technical solution of the present invention pre-processed a batch of web pages using the method provided by the above embodiment and then classified them. Statistics show that the classification results improved by more than 4 percentage points compared with the prior art. The gain may appear modest, but for a classification system whose accuracy is already above 85%, an improvement of more than 4 percentage points is considerable. In addition, classification speed improved by about 12%.

Fig. 3 is a schematic structural diagram of the web page classification device provided by an embodiment of the present invention. As shown in Fig. 3, the web page classification device of this embodiment comprises an information extraction module 31, a segmented word acquisition module 32 and a classification processing module 33.

The information extraction module 31 is configured to extract information of different classification weight levels from the source file of the web page. The segmented word acquisition module 32 is connected to the information extraction module 31 and is configured to perform word segmentation on the information of each classification weight level extracted by the information extraction module 31, obtaining segmented words of each classification weight level. The classification processing module 33 is connected to the segmented word acquisition module 32 and is configured to classify the web page using the segmented words of each classification weight level obtained by the segmented word acquisition module 32, in order of classification weight level from high to low.
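The connection between the three modules can be pictured as a simple pipeline. The following Python sketch is illustrative only; the class name `WebPageClassifier` and its three callable fields are hypothetical stand-ins for modules 31, 32 and 33, whose internals are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class WebPageClassifier:
    # Stand-in for module 31: source file -> information per level, highest level first.
    extract: Callable[[str], List[str]]
    # Stand-in for module 32: information of one level -> segmented words.
    segment: Callable[[str], List[str]]
    # Stand-in for module 33: segmented words per level -> category, or None.
    classify: Callable[[List[List[str]]], Optional[str]]

    def run(self, source: str) -> Optional[str]:
        infos = self.extract(source)
        words_per_level = [self.segment(info) for info in infos]
        return self.classify(words_per_level)
```

Any concrete extractor, segmenter and classifier with these shapes can be plugged in; the data flows from module 31 to module 32 to module 33, matching the connections described above.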

In an optional implementation of this embodiment, the information extraction module 31 comprises a first information extraction unit 311, a second information extraction unit 312 and a third information extraction unit 313.

The first information extraction unit 311 is configured to extract the header information of the source file as level-one information. The second information extraction unit 312 is configured to extract level-two information from the body part of the source file. The third information extraction unit 313 is connected to the first information extraction unit 311 and the second information extraction unit 312, and is configured to extract level-three information from the other information in the source file, that is, the information other than the level-one information extracted by the first information extraction unit 311 and the level-two information extracted by the second information extraction unit 312. The classification weight levels of the level-one, level-two and level-three information decrease in that order.

Based on the above, the segmented word acquisition module 32 is specifically configured to perform word segmentation on the level-one, level-two and level-three information respectively, obtaining level-one, level-two and level-three segmented words. The segmented word acquisition module 32 is connected to the first information extraction unit 311, the second information extraction unit 312 and the third information extraction unit 313 respectively.

In an optional implementation of this embodiment, the second information extraction unit 312 is specifically configured as follows. If, of the two information tables with the largest amount of information in the body part of the source file, the amount of information of one information table is twice or more than twice that of the other, the unit extracts that larger information table from the body part of the source file, extracts the first-occurring and second-occurring section headings from the body part of the source file, and uses the extracted information table and section headings as the level-two information.
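The table-selection rule above can be sketched in Python as follows. This is a rough sketch under stated assumptions: the "amount of information" of a table is taken to be the length of its text, and the first two section headings are taken whether or not a dominant table is found; neither detail is fixed by the description. The function names `pick_dominant_table` and `level2_info` are hypothetical.

```python
from typing import List, Optional


def pick_dominant_table(tables: List[str]) -> Optional[str]:
    """Return the largest table if it is at least twice the size of the runner-up.

    `tables` holds the text of each information table in the body part.
    Text length is an assumed proxy for "amount of information".
    """
    if len(tables) < 2:
        return None  # the description only covers the two-table comparison
    first, second = sorted(tables, key=len, reverse=True)[:2]
    return first if len(first) >= 2 * len(second) else None


def level2_info(tables: List[str], headings: List[str]) -> List[str]:
    """Collect level-two information: the dominant table (if any) plus the
    first-occurring and second-occurring section headings."""
    dominant = pick_dominant_table(tables)
    picked = [dominant] if dominant is not None else []
    return picked + headings[:2]
```

When no table dominates, this sketch falls back to the headings alone, which is one possible reading of the rule rather than the only one.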

In an optional implementation of this embodiment, the third information extraction unit 313 is specifically configured to remove invalid information from the other information in the source file, that is, the information other than the level-one information extracted by the first information extraction unit 311 and the level-two information extracted by the second information extraction unit 312, and to use the information remaining in the other information as the level-three information. The invalid information here comprises any one of the following or a combination thereof: script information, comment information, page-bottom information, hidden content information and punctuation mark information.
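A minimal Python sketch of this filtering step, using the standard library `html.parser`, is given below. It assumes the input is a reasonably well-formed HTML fragment from which the level-one and level-two information has already been removed. It also assumes that page-bottom information is marked by a `<footer>` element and that hidden content is marked by a `hidden` attribute or an inline `display:none` style; real pages may mark these differently. The names `Level3TextExtractor` and `extract_level3_text` are hypothetical.

```python
import re
from html.parser import HTMLParser


class Level3TextExtractor(HTMLParser):
    SKIP_TAGS = {"script", "footer"}                       # script and page-bottom information
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside an element that must be dropped
        self.chunks = []

    @staticmethod
    def _is_hidden(attrs):
        d = dict(attrs)
        style = (d.get("style") or "").replace(" ", "").lower()
        return "hidden" in d or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.skip_depth or tag in self.SKIP_TAGS or self._is_hidden(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

    # Comments are dropped because handle_comment is left as the default no-op.


def extract_level3_text(fragment: str) -> str:
    parser = Level3TextExtractor()
    parser.feed(fragment)
    parser.close()
    text = " ".join(parser.chunks)
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation marks
    return " ".join(text.split())
```

Tracking a skip depth rather than matching tags with regular expressions keeps nested elements inside a dropped region from leaking into the result, at the cost of relying on balanced tags.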

In an optional implementation of this embodiment, the segmented word acquisition module 32 is specifically configured to perform forward word segmentation and reverse word segmentation on each information segment in the information of each classification weight level. If the forward and reverse word segmentation results for an information segment are identical, that result (the forward result or, equivalently, the reverse result) is used as the segmented words corresponding to the information segment. The segmented words corresponding to all information segments in a classification weight level constitute the segmented words of that classification weight level.
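The description does not say which forward and reverse segmentation algorithms are used. The Python sketch below assumes dictionary-based forward and backward maximum matching, a common choice for Chinese text; the function names, the `vocab` dictionary and the `max_len` window are all hypothetical.

```python
from typing import List, Optional, Set


def forward_mm(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: scan left to right, longest dictionary word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:   # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_mm(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Backward maximum matching: scan right to left, longest dictionary word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def agreed_segmentation(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Return the shared result when both directions agree, otherwise None."""
    fwd, bwd = forward_mm(text, vocab), backward_mm(text, vocab)
    return fwd if fwd == bwd else None
```

For example, with the dictionary `{"研究", "研究生", "生命", "起源"}` the segment "研究生命起源" is split differently in the two directions, so `agreed_segmentation` returns `None` and a further disambiguation step is needed.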

In an optional implementation of this embodiment, the segmented word acquisition module 32 is further configured, if the forward and reverse word segmentation results for an information segment differ, to perform statistical word segmentation processing on the forward and reverse results respectively, and to take the result with the greater word combination probability as the segmented words corresponding to the information segment. Preferably, the segmented word acquisition module 32 is more specifically configured to look up the forward word segmentation result in a statistical word association library obtained in advance, obtaining the word combination probability corresponding to the forward result, and to look up the reverse word segmentation result in the same library, obtaining the word combination probability corresponding to the reverse result. If the word combination probability of the forward result is greater than that of the reverse result, the forward result is used as the segmented words corresponding to the information segment; if the word combination probability of the forward result is less than or equal to that of the reverse result, the reverse result is used. The statistical word association library here contains the number of times each combination of segmented words occurs together.
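The selection rule can be sketched in Python as follows. The description states only that the library stores how often word combinations occur; how a word combination probability is derived from those counts is not given. This sketch assumes a product of add-one-smoothed relative frequencies over adjacent word pairs, and the names `PairCounts`, `combination_probability` and `choose_segmentation` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed shape of the statistical word association library:
# (left word, right word) -> number of times the pair occurs together.
PairCounts = Dict[Tuple[str, str], int]


def combination_probability(words: List[str], pair_counts: PairCounts) -> float:
    """Product of smoothed relative frequencies of adjacent word pairs (assumption)."""
    total = sum(pair_counts.values())
    prob = 1.0
    for left, right in zip(words, words[1:]):
        prob *= (pair_counts.get((left, right), 0) + 1) / (total + 1)
    return prob


def choose_segmentation(fwd: List[str], bwd: List[str],
                        pair_counts: PairCounts) -> List[str]:
    """Keep the forward result only if its probability is strictly greater;
    on a tie or a lower value, keep the reverse result, as the text specifies."""
    p_fwd = combination_probability(fwd, pair_counts)
    p_bwd = combination_probability(bwd, pair_counts)
    return fwd if p_fwd > p_bwd else bwd
```

A product over pairs tends to favor results with fewer words, so a practical implementation might normalize by length; that refinement is left out of this sketch.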

In an optional implementation of this embodiment, the classification processing module 33 is specifically configured to classify the web page using the level-one segmented words; if the category of the web page is not determined using the level-one segmented words, to classify the web page using the level-one and level-two segmented words together; and if the category is still not determined, to classify the web page using the level-one, level-two and level-three segmented words together.

The web page classification device of this embodiment may be any device with computing capability, for example a computer, a router or a server.

The functional modules and units of the web page classification device provided by this embodiment can be used to carry out the corresponding flows of the above method embodiments. Their specific working principles are not repeated here; see the description of the method embodiments for details.

The web page classification device provided by the present invention extracts information of different classification weight levels from the source file of a web page, performs word segmentation on the information of each classification weight level to obtain segmented words of each classification weight level, and then classifies the web page using the segmented words of each classification weight level in order of classification weight level from high to low. Unlike the prior art, it does not classify the web page using all of the extracted information at once. By exploiting the characteristic that more important information in a web page has a greater influence on the classification result, and by preferentially using information of a higher classification weight level, it helps reduce the influence of invalid information in the web page on classification and improves the accuracy of web page classification. In addition, because the web page classification device of this embodiment preferentially uses information of a higher classification weight level, it helps reduce the amount of information used in classifying the web page and improves the efficiency of classification.

An embodiment of the present invention provides network equipment comprising a web page classification device. The web page classification device of this embodiment may be the web page classification device provided by the embodiment shown in Fig. 3; for its specific working principle and structure, see the description of the above embodiment.

The network equipment of this embodiment comprises the web page classification device provided by the embodiments of the present invention and can likewise carry out the web page classification method provided by the embodiments of the present invention, so it can likewise improve the accuracy and efficiency of web page classification.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A web page classification method, characterized by comprising:

extracting information of different classification weight levels from a source file of a web page;

2. The web page classification method according to claim 1, characterized in that said extracting information of different classification weight levels from the source file of the web page comprises:

extracting header information in said source file as level-one information;

extracting level-two information from the body part of said source file;

extracting level-three information from the other information in said source file other than said level-one information and said level-two information;

wherein the classification weight levels of said level-one information, said level-two information and said level-three information decrease in that order;

said performing word segmentation on the information of each classification weight level to obtain segmented words of each classification weight level comprises:

performing word segmentation on said level-one information, said level-two information and said level-three information respectively, to obtain level-one segmented words, level-two segmented words and level-three segmented words.

3. The web page classification method according to claim 2, characterized in that said extracting level-two information from the body part of said source file comprises:

if, of the two information tables with the largest amount of information in the body part of said source file, the amount of information of one information table is twice or more than twice that of the other, extracting from the body part of said source file the information table whose amount of information is twice or more than twice that of the other information table, extracting from the body part of said source file the first-occurring and second-occurring section headings, and using said extracted information table and section headings as said level-two information.

4. The web page classification method according to claim 2 or 3, characterized in that said extracting level-three information from the other information in said source file other than said level-one information and said level-two information comprises:

removing invalid information from said other information, and using the information remaining in said other information as said level-three information; said invalid information comprises any one of the following or a combination thereof:

script information, comment information, page-bottom information, hidden content information and punctuation mark information.

5. The web page classification method according to claim 1, 2 or 3, characterized in that performing word segmentation on the information of each classification weight level to obtain segmented words of each classification weight level comprises:

performing forward word segmentation and reverse word segmentation respectively on each information segment in the information of each classification weight level; if the forward word segmentation result and the reverse word segmentation result for said information segment are identical, using said identical word segmentation result as the segmented words corresponding to said information segment, wherein the segmented words corresponding to all information segments in each classification weight level constitute the segmented words of that classification weight level;

said web page classification method further comprises:

if the forward word segmentation result and the reverse word segmentation result for said information segment differ, performing statistical word segmentation processing on said forward word segmentation result and said reverse word segmentation result respectively, and taking the word segmentation result with the greater word combination probability as the segmented words corresponding to said information segment.

6. The web page classification method according to claim 5, characterized in that said performing statistical word segmentation processing on said forward word segmentation result and said reverse word segmentation result respectively, and taking the word segmentation result with the greater word combination probability as the segmented words corresponding to said information segment, comprises:

looking up said forward word segmentation result in a statistical word association library obtained in advance, to obtain the word combination probability corresponding to said forward word segmentation result; said statistical word association library contains the number of times each combination of segmented words occurs together;

looking up said reverse word segmentation result in said statistical word association library, to obtain the word combination probability corresponding to said reverse word segmentation result;

if the word combination probability corresponding to said forward word segmentation result is greater than the word combination probability corresponding to said reverse word segmentation result, using said forward word segmentation result as the segmented words corresponding to said information segment;

if the word combination probability corresponding to said forward word segmentation result is less than or equal to the word combination probability corresponding to said reverse word segmentation result, using said reverse word segmentation result as the segmented words corresponding to said information segment.

7. The web page classification method according to claim 2 or 3, characterized in that said classifying said web page using the segmented words of each classification weight level, in order of classification weight level from high to low, comprises:

classifying said web page using the level-one segmented words;

if the category of said web page is not determined using said level-one segmented words, classifying said web page using said level-one segmented words and said level-two segmented words together;

if the category of said web page is not determined using said level-one segmented words and said level-two segmented words together, classifying said web page using said level-one segmented words, said level-two segmented words and said level-three segmented words together.

8. A web page classification device, characterized by comprising:

9. The web page classification device according to claim 8, characterized in that said information extraction module comprises:

a first information extraction unit, configured to extract header information in said source file as level-one information;

a second information extraction unit, configured to extract level-two information from the body part of said source file;

a third information extraction unit, configured to extract level-three information from the other information in said source file other than said level-one information and said level-two information;

said segmented word acquisition module is specifically configured to perform word segmentation on said level-one information, said level-two information and said level-three information respectively, to obtain level-one segmented words, level-two segmented words and level-three segmented words.

10. The web page classification device according to claim 9, characterized in that said second information extraction unit is specifically configured to: if, of the two information tables with the largest amount of information in the body part of said source file, the amount of information of one information table is twice or more than twice that of the other, extract from the body part of said source file the information table whose amount of information is twice or more than twice that of the other information table, extract from the body part of said source file the first-occurring and second-occurring section headings, and use said extracted information table and section headings as said level-two information.

11. The web page classification device according to claim 9 or 10, characterized in that said third information extraction unit is specifically configured to remove invalid information from said other information, and to use the information remaining in said other information as said level-three information; said invalid information comprises any one of the following or a combination thereof:

12. The web page classification device according to claim 8, 9 or 10, characterized in that said segmented word acquisition module is specifically configured to perform forward word segmentation and reverse word segmentation respectively on each information segment in the information of each classification weight level, and, if the forward word segmentation result and the reverse word segmentation result for said information segment are identical, to use said identical word segmentation result as the segmented words corresponding to said information segment; wherein the segmented words corresponding to all information segments in each classification weight level constitute the segmented words of that classification weight level; said segmented word acquisition module is further configured, if the forward word segmentation result and the reverse word segmentation result for said information segment differ, to perform statistical word segmentation processing on said forward word segmentation result and said reverse word segmentation result respectively, and to take the word segmentation result with the greater word combination probability as the segmented words corresponding to said information segment.

13. The web page classification device according to claim 12, characterized in that said segmented word acquisition module is more specifically configured to look up said forward word segmentation result in a statistical word association library obtained in advance, to obtain the word combination probability corresponding to said forward word segmentation result, and to look up said reverse word segmentation result in said statistical word association library, to obtain the word combination probability corresponding to said reverse word segmentation result; if the word combination probability corresponding to said forward word segmentation result is greater than the word combination probability corresponding to said reverse word segmentation result, to use said forward word segmentation result as the segmented words corresponding to said information segment; and if the word combination probability corresponding to said forward word segmentation result is less than or equal to the word combination probability corresponding to said reverse word segmentation result, to use said reverse word segmentation result as the segmented words corresponding to said information segment; said statistical word association library contains the number of times each combination of segmented words occurs together.

14. The web page classification device according to claim 9 or 10, characterized in that said classification processing module is specifically configured to classify said web page using the level-one segmented words; if the category of said web page is not determined using said level-one segmented words, to classify said web page using said level-one segmented words and said level-two segmented words together; and if the category of said web page is not determined using said level-one segmented words and said level-two segmented words together, to classify said web page using said level-one segmented words, said level-two segmented words and said level-three segmented words together.

15. Network equipment, characterized by comprising the web page classification device according to any one of claims 8 to 14.