CN102629282A - Website classification method, device and system - Google Patents

Publication number
CN102629282A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101344981A
Other languages
Chinese (zh)
Inventor
贺泰华
杨建华
张广兴
文吉刚
袁小坊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUNAN CNSUNET TECHNOLOGY Co Ltd
Original Assignee
HUNAN CNSUNET TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUNAN CNSUNET TECHNOLOGY Co Ltd
Priority to CN2012101344981A
Publication of CN102629282A

Abstract

The invention provides a website classification method, device and system. The method comprises the following steps: parsing the website data information contained in the website address currently to be classified; extracting from the website data information at least one feature item of that website address and the weight of each feature item, and forming the feature items and their weights into a space vector corresponding to the website address; and inputting the space vector into a preset vector machine to obtain the website category corresponding to the website address. With this method, a large volume of web page content behind the website address to be classified does not affect the parsing of its website data information, so the system load is reduced and website classification efficiency is improved.

Description

A website address (URL) classification method, device and system
Technical field
The present invention relates to the technical field of local area network security management, and in particular to a website address (URL) classification method, device and system.
Background art
With the continuous development and growing popularity of Internet technology, URLs to be classified are generally assigned to preset URL categories by a URL classification method, so that the information resources on the Internet can be organized and used effectively.
When classifying a URL, existing URL classification methods need to parse the web page content corresponding to the URL to be classified, generate a space vector corresponding to the URL from the parsing result by means of the TFIDF (Term Frequency-Inverse Document Frequency) feature weighting method, and then classify the URL with a vector machine set up in advance to obtain its URL category.
Here, classifying the URL with the vector machine set up in advance means that the space vector corresponding to the URL to be classified is input into the vector machine. The vector machine is built on the VC dimension theory of statistical learning theory and on the structural risk minimization principle: based on limited sample information, it seeks the best compromise between model complexity (that is, the learning accuracy on the specific training samples) and learning ability (that is, the ability to identify arbitrary samples without error). It classifies the space vector in this way and thereby obtains the URL category of the URL to be classified.
As can be seen from the above, when the web page content corresponding to a URL to be classified is large in volume, existing URL classification methods impose a heavy system load, which makes URL classification inefficient.
Summary of the invention
The technical problem to be solved by the present invention is to provide a URL classification method, device and system that address a problem of prior-art URL classification methods: when the web page content corresponding to a URL to be classified is large in volume, the system load is heavy and URL classification is inefficient.
The application provides a URL classification method, comprising:
Parsing the URL data information contained in the current URL to be classified;
Extracting, from the URL data information, at least one feature item of the current URL to be classified and the weight of each feature item, and forming the feature items and their weights into a space vector corresponding to the current URL to be classified;
Inputting the space vector into a preset vector machine to obtain the URL category corresponding to the current URL to be classified.
Preferably, before the parsing of the URL data information of the current URL to be classified, the method further comprises:
Obtaining URL classification data from the Internet;
Classifying the URLs to be classified that are contained in the URL classification data according to a preset preliminary classification rule, and generating a set of URLs to be classified;
Obtaining a URL to be classified from the set of URLs to be classified.
Preferably, before the parsing of the URL data information of the current URL to be classified, the method further comprises:
Computing a result for the current URL to be classified using a preset hash algorithm;
Querying whether hash data corresponding to the computed result exists in a preset hash data set; if so, discarding the current URL to be classified and ending the current URL classification; otherwise, inserting the computed result into the hash data set.
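By way of illustration only, the hash-based deduplication described above might be sketched as follows. The application does not name the hash algorithm or the structure of the hash data set; SHA-256 and an in-memory Python set are assumptions made for this sketch, and the function name `is_duplicate` is hypothetical.

```python
import hashlib

def is_duplicate(url: str, seen_hashes: set) -> bool:
    """Return True if the URL was seen before; otherwise record its hash.

    SHA-256 stands in for the unspecified "preset hash algorithm" (assumption).
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True          # hash already present: discard this URL
    seen_hashes.add(digest)  # hash absent: insert it into the hash data set
    return False

# Usage: only URLs that are not duplicates continue to classification.
seen = set()
urls = [
    "http://www.example.com/",
    "http://www.example.org/",
    "http://www.example.com/",
]
fresh = [u for u in urls if not is_duplicate(u, seen)]
```

A set gives constant-time membership checks on average, which suits a step that runs once per crawled URL.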
Preferably, after the parsing of the URL data information of the current URL to be classified and before the extracting of the at least one feature item and its weight from the URL data information, the method further comprises:
Parsing the URL character string contained in the URL data information;
Judging whether the URL character string satisfies a preset pre-classification rule; when it does, obtaining the URL category corresponding to the current URL to be classified according to the pre-classification rule and ending the current URL classification.
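As a hedged illustration of the pre-classification step, the sketch below matches the URL character string against a small rule list and returns a category when a rule applies. The application does not disclose the actual pre-classification rules; the regular expressions, the category names and the function name `preclassify` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical pre-classification rules: (pattern on the URL string, category).
PRECLASS_RULES = [
    (re.compile(r"\.edu(\.[a-z]{2})?(/|$)"), "education"),
    (re.compile(r"\.gov(\.[a-z]{2})?(/|$)"), "government"),
]

def preclassify(url: str) -> Optional[str]:
    """Return a category if the URL string satisfies a rule, else None."""
    for pattern, category in PRECLASS_RULES:
        if pattern.search(url):
            return category  # rule satisfied: classification can end here
    return None              # no rule satisfied: continue to feature extraction

# Usage
category = preclassify("http://www.tsinghua.edu.cn/")
```

A URL that matches no rule returns `None` and would proceed to feature extraction and the vector machine.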
Preferably, the obtaining of a URL to be classified from the set of URLs to be classified comprises:
Determining the URL weight of each URL to be classified in the set, and the transfer weights between the URLs to be classified;
According to the URL weights and the transfer weights, obtaining from the set the URLs to be classified that satisfy a preset URL acquisition rule.
Preferably, the vector machine uses a preset classification algorithm to obtain an optimal classification model from the space vector, parses the category number carried by the optimal classification model, and uses the category number as the URL category corresponding to the current URL to be classified.
Preferably, the extracting of at least one feature item of the current URL to be classified and the weight of the feature item from the URL data information comprises:
Parsing at least one HTML tag and its content in the URL data information;
Generating the corresponding feature items and the weights of the feature items from the HTML tags and their content.
The application also provides a URL classification device, comprising a data parsing unit, a data extraction unit and a category acquisition unit, wherein:
The data parsing unit is configured to parse the URL data information contained in the current URL to be classified;
The data extraction unit is configured to extract, from the URL data information, at least one feature item of the current URL to be classified and the weight of each feature item, and to form the feature items and their weights into a space vector corresponding to the current URL to be classified;
The category acquisition unit is configured to input the space vector into a preset vector machine and obtain the URL category corresponding to the current URL to be classified.
Preferably, the device further comprises a URL acquisition unit;
The URL acquisition unit is configured to obtain URL classification data from the Internet, classify the URLs to be classified contained in the URL classification data according to a preset preliminary classification rule to generate a set of URLs to be classified, obtain a URL to be classified from the set, and trigger the data parsing unit.
Preferably, the device further comprises a URL deduplication unit;
The URL deduplication unit is configured to compute a result for the current URL to be classified using a preset hash algorithm, and to query whether hash data corresponding to the computed result exists in a preset hash data set; if so, it discards the current URL to be classified and ends the current URL classification; otherwise, it inserts the computed result into the hash data set and triggers the data extraction unit.
Preferably, the device further comprises a pre-classification unit;
The pre-classification unit is triggered by the data parsing unit and is configured to parse the URL character string contained in the URL data information and judge whether the URL character string satisfies a preset pre-classification rule; when it does, the unit obtains the URL category corresponding to the current URL to be classified according to the pre-classification rule and ends the current URL classification; otherwise, it triggers the data extraction unit.
Preferably, the URL acquisition unit comprises a URL collection subunit, a preliminary classification subunit and a URL acquisition subunit, wherein:
The URL collection subunit is configured to obtain URL classification data from the Internet;
The preliminary classification subunit is configured to classify the URLs to be classified that are contained in the URL classification data according to a preset preliminary classification rule, and to generate a set of URLs to be classified;
The URL acquisition subunit is configured to determine the URL weight of each URL to be classified in the set and the transfer weights between the URLs to be classified, and to obtain from the set, according to the URL weights and the transfer weights, the URLs to be classified that satisfy a preset URL acquisition rule.
Preferably, the data extraction unit comprises a feature item extraction subunit and a vector generation subunit, wherein:
The feature item extraction subunit is configured to parse at least one HTML tag and its content in the URL data information, and to generate the corresponding feature items and their weights from the HTML tags and their content;
The vector generation subunit is configured to form the feature items and their weights into a space vector corresponding to the current URL to be classified.
The application also provides a URL classification system, comprising any of the URL classification devices described above.
As can be seen from the above scheme, prior-art URL classification methods impose a heavy system load and reduce classification efficiency when the web page content corresponding to a URL to be classified is large in volume. In contrast, the URL classification method, device and system provided by the application extract at least one feature item and its weight from the parsed URL data information of the URL to be classified, form the feature items and their weights into a space vector corresponding to the URL, and use a preset vector machine to obtain the corresponding URL category. A large volume of web page content therefore does not affect the parsing of the URL data information, which reduces system load and improves URL classification efficiency.
In addition, by continually adjusting the feature-item parameters of the URLs to be classified, the vector machine parameters and so on, the URL classification method, device and system provided by the application can implement different classification schemes; that is, the application provides a URL classification method, device and system whose URL classification rules can be changed dynamically.
Further, the URL classification method, device and system provided by the application perform preliminary classification on URLs on the Internet to obtain a set of URLs to be classified, and then classify the URLs in that set. The classified URLs thus have wider coverage and higher quality, so that the URL database formed from the classified URLs offers better query capability.
Further, the URL classification method, device and system provided by the application parse the URL character string contained in the URL data information of the URL to be classified and pre-classify the URL according to the parsing result, which speeds up URL classification.
Further, the URL classification method, device and system provided by the application parse at least one HTML tag and its content in the URL data information and use the chi-square method to generate the corresponding feature items and their weights from the HTML tags and their content. Compared with the TFIDF feature weighting method used alone in the prior art, this improves the accuracy of URL classification.
Of course, a product implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a URL classification method provided by embodiment one of the application;
Fig. 2 is a partial flow chart of a URL classification method provided by embodiment two of the application;
Fig. 3 is a partial flow chart of a URL classification method provided by embodiment three of the application;
Fig. 4 is another partial flow chart of a URL classification method provided by embodiment three of the application;
Fig. 5 is a partial flow chart of a URL classification method provided by embodiment four of the application;
Fig. 6 is a partial flow chart of a URL classification method provided by embodiment five of the application;
Fig. 7 is a schematic structural diagram of a URL classification device provided by embodiment six of the application;
Fig. 8 is a schematic structural diagram of a URL classification device provided by embodiment seven of the application;
Fig. 9 is a schematic structural diagram of a URL classification device provided by embodiment eight of the application;
Fig. 10 is another schematic structural diagram of a URL classification device provided by embodiment eight of the application;
Fig. 11 is a flow chart of the functions of the URL collector of a URL classification system provided by embodiment nine of the application;
Fig. 12 is a flow chart of how a URL classification system provided by embodiment nine of the application obtains the feature items of a URL to be classified and their weights.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, which shows a flow chart of a URL classification method provided by embodiment one of the application, the method may comprise the following steps:
Step 101: parse the URL data information contained in the current URL to be classified.
Here, the current URL to be classified links to certain web page content, and its URL data information may include the following: the URL head, for example www or home; the URL tail, for example com, cn, org or net; and data about the URL character string, for example the length of the URL character string, the number of "/" characters in it, and the number of digits it contains.
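The URL data information listed for step 101 can be sketched with Python's standard `urllib.parse` module. This is a minimal illustration, not the application's parser: taking the first and last host labels as the URL head and tail is an assumption about what those terms mean, and the function name and dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit

def parse_url_info(url: str) -> dict:
    """Collect the URL head, URL tail and simple string statistics."""
    # urlsplit only fills in the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    labels = parts.hostname.split(".") if parts.hostname else []
    return {
        "head": labels[0] if labels else "",     # e.g. "www", "home"
        "tail": labels[-1] if labels else "",    # e.g. "com", "cn", "org"
        "length": len(url),                      # length of the URL string
        "slash_count": url.count("/"),           # number of "/" characters
        "digit_count": sum(ch.isdigit() for ch in url),
    }

# Usage
info = parse_url_info("http://www.example123.com/news/2012/")
```

Only the URL string itself is inspected, which is consistent with the stated aim of keeping the parsing cost independent of the size of the page content.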
Step 102: extract, from the URL data information, at least one feature item of the current URL to be classified and the weight of each feature item, and form the feature items and their weights into a space vector corresponding to the current URL to be classified.
After the URL data information of the current URL to be classified has been parsed in step 101, at least one feature item and the weight of each feature item are extracted from that information. The feature items and their weights can be extracted from data such as the URL head, the URL tail or the URL character string in the URL data information. The extracted feature items and their weights form a high-dimensional space vector corresponding to the current URL to be classified: each dimension of the space vector represents one feature item, and the value of each dimension represents the weight of that feature in the web document corresponding to the URL to be classified. Any URL to be classified can be expressed as U=(t1:w1, t2:w2, ..., tn:wn), where t1, t2, ..., tn denote the coordinate axes of an n-dimensional space, w1, w2, ..., wn denote the coordinate values on those axes, and U=(t1:w1, t2:w2, ..., tn:wn) is a vector in that space.
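To make the representation U=(t1:w1, t2:w2, ..., tn:wn) concrete, the following sketch maps named feature items onto numbered dimensions and keeps only the non-zero weights. The fixed vocabulary, the 1-based numbering and the function name `build_space_vector` are assumptions made for illustration; the application does not say how dimensions are numbered.

```python
def build_space_vector(weights: dict, vocabulary: list) -> dict:
    """Map {feature item: weight} onto {dimension index: weight}.

    Dimensions are numbered from 1 in vocabulary order (assumption);
    feature items outside the vocabulary and zero weights are dropped.
    """
    index_of = {term: i for i, term in enumerate(vocabulary, start=1)}
    return {
        index_of[term]: weight
        for term, weight in weights.items()
        if term in index_of and weight != 0
    }

# Usage: three known feature items, one of which is absent from this URL.
vocabulary = ["www", "com", "news"]
vector = build_space_vector({"com": 0.5, "news": 1.2, "unknown": 3.0}, vocabulary)
```

Storing only non-zero entries keeps the vector compact when the feature space has many dimensions but each URL exhibits few features.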
Step 103: input the space vector into a preset vector machine to obtain the URL category corresponding to the current URL to be classified.
The data format required by the vector machine is as follows:
[class label] [index1]:[value1] [index2]:[value2] [index3]:[value3] ... [indexn]:[valuen]
When the space vector U=(t1:w1, t2:w2, ..., tn:wn) is input into the vector machine, t1 corresponds to index1 and w1 to value1, t2 corresponds to index2 and w2 to value2, and so on, up to tn corresponding to indexn and wn to valuen. The class label is the URL category of the current URL to be classified that is obtained after the space vector has been input into the vector machine.
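The data format above can be produced with a few lines of Python. This sketch assumes the sparse text layout used by the LIBSVM package, with indices written in ascending order; the function name is hypothetical, and the label passed for a URL that has not yet been classified is only a placeholder.

```python
def to_vector_machine_line(label: int, vector: dict) -> str:
    """Format a sparse vector as '<label> <index>:<value> ...'.

    Indices are emitted in ascending order, as the LIBSVM text format expects.
    """
    pairs = " ".join(f"{index}:{value:g}" for index, value in sorted(vector.items()))
    return f"{label} {pairs}".rstrip()

# Usage: t2 -> index 2 with w2 = 0.5, t1 -> index 1 with w1 = 1.0
line = to_vector_machine_line(3, {2: 0.5, 1: 1.0})
```

Each ti of the space vector supplies an index and each wi the matching value, so one URL becomes one line of input.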
After the URL category corresponding to the current URL to be classified has been obtained in step 103, the next URL to be classified can be fetched and classified according to the URL classification method provided by embodiment one to obtain its URL category. URL classification thus runs in a loop, which improves its efficiency.
The URL classification method provided by embodiment one of the application further comprises:
Dynamically adjusting the feature-item parameters and/or the data parameters of the vector machine.
The feature-item parameters include the form of the feature items and so on; the data parameters of the vector machine include the number of data entries in the data format, that is, the dimension of the space vector, and so on.
It should be noted that the URL classification method provided by embodiment one can run URL classification in multiple parallel threads, classifying several URLs to be classified at the same time and obtaining their category identifiers. This speeds up URL classification and thus improves its efficiency.
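The multi-threaded form of classification mentioned above might look like the sketch below, which uses Python's standard `concurrent.futures` thread pool. The `classify_url` function is a stand-in for steps 101 to 103 and its rule is purely hypothetical; only the parallel dispatch pattern is being illustrated.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_url(url: str) -> str:
    """Placeholder for steps 101 to 103 (hypothetical rule, for illustration)."""
    return "organization" if url.endswith(".org") else "other"

def classify_all(urls: list, max_workers: int = 4) -> dict:
    """Classify several URLs in parallel threads and map each URL to its category."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(classify_url, urls)))

# Usage
results = classify_all(["http://www.example.org", "http://www.example.com"])
```

Threads help most when the per-URL work waits on I/O, such as fetching page data; whether that holds for a given deployment is outside what the application states.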
The URL classification method provided by embodiment one of the application further comprises:
Placing the URLs whose URL category has been obtained into a preset URL store.
The URL store is specifically a URL database.
As can be seen from the above scheme, prior-art URL classification methods impose a heavy system load and reduce classification efficiency when the web page content corresponding to a URL to be classified is large in volume. In contrast, the URL classification method provided by embodiment one of the application extracts at least one feature item and its weight from the parsed URL data information of the URL to be classified, forms the feature items and their weights into a space vector corresponding to the URL, and uses a preset vector machine to obtain the corresponding URL category. A large volume of web page content therefore does not affect the parsing of the URL data information, which reduces system load and improves URL classification efficiency.
In addition, by continually adjusting the feature-item parameters of the URLs to be classified, the vector machine parameters and so on, the URL classification method provided by embodiment one can implement different classification schemes; that is, embodiment one provides a URL classification method whose URL classification rules can be changed dynamically.
Based on the above embodiment of the application, preferably, the vector machine uses a preset classification algorithm to obtain an optimal classification model from the space vector, parses the category number carried by the optimal classification model, and uses the category number as the URL category corresponding to the current URL to be classified.
Specifically, the preset classification algorithm includes the algorithms of the LIBSVM open-source software package.
Referring to Fig. 2, which shows a partial flow chart of a URL classification method provided by embodiment two of the application: based on embodiment one, the extracting in step 102 of at least one feature item of the current URL to be classified and the weight of the feature item from the URL data information may comprise the following steps:
Step 201: parse at least one HTML tag and its content in the URL data information.
Specifically, in step 201 a DOM tree structure is built from the parsed URL data information of the current URL to be classified, and each HTML tag and its content in the URL data information is obtained from it.
The DOM (Document Object Model) tree structure is the HTML tree structure, together with the corresponding access methods, that is generated by parsing an HTML page through the DOM. With the DOM tree structure, the content of each tag on the HTML page can be operated on directly and conveniently.
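As a rough illustration of collecting each HTML tag together with its content, the sketch below uses the event-based `html.parser` module from the Python standard library. Note that this is a swapped-in technique: it tracks open tags on a stack instead of building a full DOM tree as the embodiment describes. The class name, the function name and the short list of void elements are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no content

class TagTextCollector(HTMLParser):
    """Record the text found directly inside each HTML tag."""

    def __init__(self):
        super().__init__()
        self._stack = []     # currently open tags, innermost last
        self.contents = {}   # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.contents.setdefault(self._stack[-1], []).append(text)

def extract_tag_contents(html: str) -> dict:
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return collector.contents

# Usage
page = "<html><head><title>News Site</title></head><body><h1>Top</h1></body></html>"
tags = extract_tag_contents(page)
```

Keeping the tag name with each text fragment preserves the information that step 202 needs when it weights feature items by the HTML tag they came from.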
Step 202: generate the corresponding feature items and the weights of the feature items from the HTML tags and their content.
Specifically, the HTML tags and their content parsed in step 201 are segmented into words using the ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences; stop words and entries carrying little information are removed; and the chi-square test (CHI-SQUARE TEST) method is used to extract the feature items from the HTML tags and content that have undergone word segmentation and the related processing. The feature items are then weighted with the TFIDF method, taking into account the expressive power of the HTML tag each feature item comes from, to obtain the weights of the feature items.
The chi-square test is a hypothesis test of whether the population from which a sample's frequency distribution is drawn follows a certain theoretical or hypothesized distribution. Here, the overall distribution is inferred from the frequency distribution of the HTML tags, and the feature items in the HTML tags and their content are obtained accordingly.
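The application does not give the chi-square formula it uses. A form commonly used for feature selection in text classification scores a term against a category from a 2x2 table of document counts; the sketch below assumes that form, and the parameter names a, b, c and d are introduced here for illustration only.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator

# Usage: a term spread evenly across categories scores zero, while a term
# found only inside the category scores high.
independent = chi_square(10, 10, 10, 10)
dependent = chi_square(20, 0, 0, 20)
```

Terms with the highest scores for a category would be kept as feature items, and low-scoring terms discarded.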
The TFIDF (Term Frequency-Inverse Document Frequency) feature weighting method works as follows: the TFIDF weight equals TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF represents how often a term occurs in a document d; IDF reflects how rare the term is across the whole document set, so the fewer documents contain the term, the larger its IDF. The TFIDF algorithm rests on the assumption that the words that best distinguish a document are those that occur frequently in that document but rarely in the other documents of the set. If the coordinates of the feature space take the TF term frequency as the measure, they can therefore reflect the characteristics of texts of the same class.
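A minimal sketch of the TF*IDF weight follows, assuming the common definitions TF = (occurrences of the term in the document) / (document length) and IDF = log(N / number of documents containing the term). The application does not state which variant it uses, and the additional tag-dependent weighting from step 202 is not modeled here.

```python
import math
from collections import Counter

def tfidf(docs: list) -> list:
    """Return one {term: TF*IDF weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted

# Usage: "news" appears in every document, so its IDF (and weight) is zero.
docs = [["news", "sports", "news"], ["news", "music"]]
weights = tfidf(docs)
```

A term that occurs in every document gets an IDF of log(1) = 0, which matches the assumption that such words do not help distinguish documents.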
Can know by such scheme; A kind of network address sorting technique that the application embodiment two provides; Through resolving at least one html tag and the content thereof in the said website data information; Utilize the weights of Chi-square method,, improved the accuracy rate of network address classification with respect to the single TFIDF Feature Weighting Method of available technology adopting according to said html tag and content its characteristic of correspondence item of generation and said characteristic item.
With reference to figure 3, it shows the part process flow diagram of a kind of network address sorting technique that the application embodiment three provides, and based on above-mentioned the application embodiment one, before said step 101, said method can also may further comprise the steps:
Step 301: obtain URL classification data from the Internet.
Here, Embodiment 3 may obtain the URL classification data from the Internet through a network collection device such as a web crawler. The URL classification data refers to URLs with high traffic, for example commonly used URL directory sites and web navigation sites on the Internet, such as Yahoo and hao123. Embodiment 3 crawls this URL classification data with a web crawler.
A web crawler, also known as a web spider or web robot, is a program or script that automatically fetches information from the Internet according to certain rules.
Step 302: classify the URLs to be classified that are contained in the URL classification data according to a preset preliminary classification rule, generating a set of URLs to be classified.
Here, after the URL classification data has been crawled from the Internet in step 301, step 302 filters and integrates the URLs that carry category marks according to the preset preliminary classification rule. Using a relation mapping table established in advance, it obtains a preliminarily classified set of URLs to be classified, each with an initial category label. The relation mapping table is shown in Table 1.
Table 1. Relation mapping table
[Image: Figure BDA0000160062180000111]
As shown in Table 1, the class label is the result of classifying a URL to be classified according to the preliminary classification rule.
Step 303: obtain a URL to be classified from the set of URLs to be classified.
Here, when a URL needs further classification, it is obtained from the set of URLs to be classified and the URL classification method provided by Embodiment 1 is carried out on it to obtain its URL category. As shown in Table 1, the predefined classification category is the URL category of a URL that needs such further classification.
As can be seen from the above scheme, the URL classification method provided by Embodiment 3 performs a preliminary classification of URLs on the Internet to obtain a set of URLs to be classified and then classifies those URLs. The classified URLs therefore have wider coverage and higher quality, so the URL database formed from them has better query capability.
Based on Embodiment 3 and referring to Figure 4, which shows another partial flowchart of Embodiment 3, step 303 may include the following steps:
Step 401: determine the URL weight of each URL to be classified in the set of URLs to be classified, and the transfer weights between the URLs to be classified.
Here, when the URL collector described above obtains URLs to be classified, the influence of the URL weight of each URL is determined first, and an algorithm similar to SiteRank is used to weight the current URL to be classified. The URL weight has two parts: the URL's own weight and the transfer weight between linked URLs. Specifically, when obtaining URLs to be classified, this embodiment first determines the URL weight of each URL to be classified and the transfer weights between the URLs to be classified.
Step 402: according to the URL weights and the transfer weights, obtain from the set of URLs to be classified the URLs that satisfy a preset URL acquisition rule.
Here, the URL acquisition rule is a rule requiring relatively high URL weights and transfer weights. Obtaining the URLs that satisfy the preset URL acquisition rule therefore means obtaining, from the set of URLs to be classified, the URLs whose URL weights and transfer weights are relatively high.
As can be seen from the above scheme, the URL classification method provided by Embodiment 3 performs a preliminary classification of URLs on the Internet to obtain a set of URLs to be classified and then classifies the URLs whose URL weights and transfer weights are relatively high. The classified URLs therefore have wider coverage and higher quality, so the URL database formed from them has better query capability.
Referring to Figure 5, which shows a partial flowchart of the URL classification method provided by Embodiment 4 of this application: based on Embodiment 1 or Embodiment 3, the method may further include the following steps before step 101:
Step 501: apply a preset hash algorithm to the current URL to be classified to obtain a calculation result.
Here, before the website data information of the current URL to be classified is parsed, the URL must be deduplicated, that is, it must be determined whether the current URL has already been classified. This prevents repeated classification and thereby improves the efficiency of URL classification. Specifically, the deduplication method presets a hash algorithm and applies it to the current URL to be classified, obtaining the hash result corresponding to that URL.
Step 502: query whether hash data corresponding to the calculation result exists in a preset hash data set; if so, perform step 503; otherwise, perform step 504.
Here, the hash data set stores the hash results of the URLs to be classified that have already been processed. After the hash result of the current URL to be classified is obtained, the hash data set is queried for hash data corresponding to that result. If such data exists, the current URL has already been classified and step 503 is performed; otherwise, the current URL has not been classified and step 504 is performed.
Step 503: discard the current URL to be classified and end the current URL classification.
Here, when the current URL to be classified has already been classified, execution of the classification method on it must stop; that is, the current URL is discarded and the current URL classification ends.
It should be noted that, after the current URL classification ends, the method further includes:
Obtaining another URL to be classified and parsing the website data information of that other URL.
Here, both the other URL and the current URL are URLs that need to be classified. The terms "other" and "current" are used only to distinguish a URL that has not yet been processed by the URL classification method of Embodiment 4 from the URL that is being processed by it.
Step 504: place the calculation result in the hash data set.
Here, when the URL to be classified is found not to have been classified, its hash result must be placed in the hash data set to support the later classification of other URLs to be classified.
As can be seen from the above scheme, the URL classification method provided by Embodiment 4 applies a hash algorithm to each URL to be classified in order to deduplicate the URLs, that is, to filter them a second time, and thereby improves the efficiency of URL classification.
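Steps 501 to 504 can be sketched in Python as follows. SHA-256 and an in-memory `set` are assumptions chosen for illustration; the source only requires a "preset hash algorithm" and a "hash data set". The names `url_digest` and `should_classify` are hypothetical.

```python
import hashlib

def url_digest(url):
    # Step 501: hash the URL string. SHA-256 is an assumed choice.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def should_classify(url, seen_hashes):
    """Return True if the URL has not been processed yet, recording its hash."""
    digest = url_digest(url)
    if digest in seen_hashes:      # Step 502: look up the hash data set.
        return False               # Step 503: discard the duplicate URL.
    seen_hashes.add(digest)        # Step 504: remember the new hash.
    return True

# Usage: the second occurrence of the same URL is rejected.
seen = set()
first = should_classify("http://www.example.com/", seen)
second = should_classify("http://www.example.com/", seen)
```

Storing fixed-length digests rather than the URL strings keeps the lookup structure compact regardless of URL length.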
Referring to Figure 6, which shows a partial flowchart of the URL classification method provided by Embodiment 5 of this application: based on Embodiment 1 or Embodiment 3, the method may further include the following steps after step 101 and before step 102:
Step 601: parse the URL string contained in the website data information.
Here, the website data information is the website data information of the current URL to be classified that was parsed in step 101. Specifically, it includes: the URL prefix of the URL to be classified, for example www or home; the URL suffix of the URL to be classified, for example com, cn, org, or net; and data about the URL string of the URL to be classified, for example the length of the URL string, the number of "/" characters in it, and the number of digits in it.
Step 602: determine whether the URL string satisfies a preset pre-classification rule; if it does, perform step 603; otherwise, perform step 102.
Here, the pre-classification rule comprises rules, set according to preferred values of the URL string, for classifying the current URL to be classified. Specifically, the pre-classification rule checks whether the URL string contains a strongly characteristic character or substring, for example an English word or abbreviation with obvious characteristic meaning, such as news or edu. Step 602 is therefore: determine whether the URL string contains such an English word or abbreviation; if it does, perform step 603; otherwise, the current URL to be classified continues through the URL classification method provided by Embodiment 1 or Embodiment 3.
Step 603: obtain the URL category corresponding to the current URL to be classified according to the pre-classification rule, and end the current URL classification.
Here, when the URL string satisfies the preset pre-classification rule, the current URL to be classified is classified directly according to that rule to obtain its URL category, and the current URL classification ends.
It should be noted that, in this embodiment, after the current URL classification ends, the method may further include:
Obtaining another URL to be classified and parsing the website data information of that other URL.
Here, both the other URL and the current URL are URLs that need to be classified. The terms "other" and "current" are used only to distinguish a URL that has not yet been processed by the URL classification method of Embodiment 5 from the URL that is being processed by it.
As can be seen from the above scheme, the URL classification method provided by Embodiment 5 parses the URL string contained in the website data information of the URL to be classified and pre-classifies the URL according to the parsing result, which speeds up URL classification and improves its efficiency.
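The pre-classification of steps 601 to 603 can be sketched as a keyword lookup on the URL string. The token-to-category mapping `STRONG_TOKENS` and the function name `preclassify` are hypothetical; the source names only news and edu as examples of strong features and does not say how matching is performed.

```python
import re
from urllib.parse import urlsplit

# Hypothetical mapping from strong URL tokens to category names.
STRONG_TOKENS = {"news": "news", "edu": "education"}

def preclassify(url):
    """Return a category if the URL string contains a strong token, else None."""
    # Accept URLs written without a scheme, such as "www.example.com".
    parts = urlsplit(url if "://" in url else "//" + url)
    # Step 601: split host and path into lower-case alphanumeric tokens.
    tokens = re.split(r"[^a-z0-9]+", (parts.netloc + parts.path).lower())
    # Step 602: test each token against the pre-classification rule.
    for token in tokens:
        if token in STRONG_TOKENS:
            return STRONG_TOKENS[token]   # Step 603: category found.
    return None                           # Fall through to step 102.
```

Matching whole tokens rather than substrings is a design choice in this sketch; it avoids classifying a host such as "reducer.example.com" as education merely because it contains the letters "edu".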
Referring to Figure 7, which shows a schematic structural diagram of the URL classification device provided by Embodiment 6 of this application and used to implement Embodiment 1: the device comprises a data parsing unit 701, a data extraction unit 702, and a category acquisition unit 703, wherein:
The data parsing unit 701 is configured to parse the website data information contained in the current URL to be classified.
Here, the current URL to be classified links to certain web page content, and its website data information may include the following: the URL prefix of the URL to be classified, for example www or home; the URL suffix of the URL to be classified, for example com, cn, org, or net; and data about the URL string of the URL to be classified, for example the length of the URL string, the number of "/" characters in it, and the number of digits in it.
The data extraction unit 702 is configured to extract, from the website data information, at least one feature item of the current URL to be classified and the weight of each feature item, and to form from the feature items and their weights the space vector corresponding to the current URL to be classified.
Here, after the data parsing unit 701 has parsed the website data information of the current URL to be classified, the data extraction unit 702 extracts at least one feature item and its weight from that information. The feature items and their weights can be extracted from data such as the URL prefix, URL suffix, or URL string in the website data information. The extracted feature items and weights form a high-dimensional space vector corresponding to the current URL to be classified. Each dimension of this space vector represents a feature item, and the value in each dimension represents the weight of that feature item in the web document corresponding to the URL to be classified. Any URL to be classified can be expressed as U=(t1:w1, t2:w2, ..., tn:wn), where t1, t2, ..., tn are the coordinate axes of the n-dimensional space, w1, w2, ..., wn are the coordinate values on those axes, and U=(t1:w1, t2:w2, ..., tn:wn) is a vector in this space.
The category acquisition unit 703 is configured to input the space vector into a preset vector machine and obtain the URL category corresponding to the current URL to be classified.
Here, the vector machine requires the following data format:
[label] [index1]:[value1] [index2]:[value2] [index3]:[value3] ... [indexn]:[valuen]
Here, when the space vector U=(t1:w1, t2:w2, ..., tn:wn) is input into the vector machine, t1 corresponds to index1 and w1 to value1, t2 corresponds to index2 and w2 to value2, and so on, up to tn corresponding to indexn and wn to valuen. The label is the URL category of the current URL to be classified, obtained after the space vector has been input into the vector machine.
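As an illustration of the mapping just described, the following Python sketch serializes a sparse space vector into one line of the LIBSVM text format. The function name `to_libsvm_line` and the sample values are hypothetical. The sketch writes indices in ascending order, which LIBSVM expects, and omits zero-valued features.

```python
def to_libsvm_line(label, vector):
    """Format one sample as 'label index1:value1 index2:value2 ...'.

    vector maps a positive integer feature index (ti) to its weight (wi).
    """
    # Drop zero weights and sort by index, as LIBSVM expects ascending indices.
    pairs = sorted((i, w) for i, w in vector.items() if w != 0)
    return " ".join([str(label)] + [f"{i}:{w:g}" for i, w in pairs])

# Hypothetical URL vector with class label 3.
line = to_libsvm_line(3, {2: 0.5, 1: 0.25, 7: 0.0})
print(line)  # prints: 3 1:0.25 2:0.5
```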
In Embodiment 6, the category acquisition unit 703 is also configured to dynamically adjust the feature item parameters and/or the data parameters of the vector machine.
Here, the feature item parameters include the format of the feature items and the like, and the data parameters of the vector machine include the number of entries in the data format, that is, the dimension of the space vector, and the like.
It should be noted that the URL classification device provided by Embodiment 6 can classify URLs with multiple threads in parallel, classifying several URLs to be classified at the same time and obtaining their category labels. This increases classification speed and thereby improves the efficiency of URL classification.
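The multithreaded form mentioned above might look like the following Python sketch, which uses a standard-library thread pool. The function `classify_all`, its `classify_one` callback, and the worker count are assumptions; the source does not describe how threads are organized.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_all(urls, classify_one, max_workers=4):
    """Classify several URLs in parallel and return {url: category label}.

    classify_one is a caller-supplied function (hypothetical here) that
    takes one URL and returns its category label.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so labels line up with urls.
        labels = list(pool.map(classify_one, urls))
    return dict(zip(urls, labels))

# Usage with a stand-in classifier that labels by URL suffix.
def stub_classifier(url):
    return "education" if url.endswith(".edu") else "other"

result = classify_all(["a.example.edu", "b.example.com"], stub_classifier)
```

Threads suit this workload when most of the time is spent waiting on page downloads; a CPU-bound classifier would call for a different arrangement.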
The URL classification device provided by Embodiment 6 further comprises a URL storage unit.
The URL storage unit is configured to place each URL whose category has been obtained by the category acquisition unit 703 in a preset URL store.
Here, the URL store is specifically a URL database.
As can be seen from the above scheme, with URL classification devices of the prior art, a large volume of web page content behind a URL to be classified leads to heavy system load and reduces classification efficiency. In contrast, the URL classification device provided by Embodiment 6 extracts at least one feature item and its weight from the parsed website data information of the URL to be classified, forms the corresponding space vector from the feature items and their weights, and uses a preset vector machine to obtain the URL category. Even when the volume of web page content behind the URL is large, parsing of the website data information is not affected, which reduces system load and improves the efficiency of URL classification.
In addition, by continually adjusting parameters such as those of the feature items of the URL to be classified and those of the vector machine, the URL classification device provided by Embodiment 6 can realize URL classification methods, devices, and systems that use different classification approaches. In other words, Embodiment 6 provides a URL classification device whose classification rules can be changed dynamically.
Based on the above embodiments, the vector machine preferably uses a preset classification algorithm to obtain an optimal classification model from the space vector, parses the class number carried by that model, and uses the class number as the URL category corresponding to the current URL to be classified.
Specifically, the preset classification algorithm includes the algorithm of the open-source LIBSVM software package.
Referring to Figure 8, which shows a schematic structural diagram of the URL classification device provided by Embodiment 7 of this application: based on Embodiment 6 and used to implement Embodiment 2, the data extraction unit 702 comprises a feature item extraction subunit 721 and a vector generation subunit 722, wherein:
The feature item extraction subunit 721 is configured to parse at least one HTML tag and its content in the website data information, and to generate the corresponding feature items and their weights from the HTML tags and their content;
The vector generation subunit 722 is configured to form, from the feature items and their weights, the space vector corresponding to the current URL to be classified.
As can be seen from the above scheme, the URL classification device provided by Embodiment 7 parses at least one HTML tag and its content in the website data information and uses the chi-square method to generate the corresponding feature items and their weights from the HTML tags and their content. Compared with the single TFIDF feature weighting method used in the prior art, this improves the accuracy of URL classification.
Referring to Figure 9, which shows a schematic structural diagram of the URL classification device provided by Embodiment 8 of this application: based on Embodiment 6 or Embodiment 7 and used to implement Embodiments 3, 4, and 5, the device further comprises a URL acquisition unit 704, a URL deduplication unit 705, and a pre-classification unit 706;
The URL acquisition unit 704 is configured to obtain URL classification data from the Internet, classify the URLs to be classified contained in that data according to a preset preliminary classification rule to generate a set of URLs to be classified, obtain a URL to be classified from that set, and trigger the data parsing unit 701.
Here, the URL acquisition unit 704 is specifically a network collection device such as a web crawler.
Referring to Figure 10, which shows another schematic structural diagram of Embodiment 8, the URL acquisition unit 704 comprises a URL collection subunit 741, a preliminary classification subunit 742, and a URL acquisition subunit 743, wherein:
The URL collection subunit 741 is configured to obtain URL classification data from the Internet;
The preliminary classification subunit 742 is configured to classify the URLs to be classified contained in the URL classification data according to a preset preliminary classification rule, generating a set of URLs to be classified;
The URL acquisition subunit 743 is configured to determine the URL weight of each URL to be classified in the set and the transfer weights between the URLs to be classified, and to obtain from the set, according to the URL weights and the transfer weights, the URLs to be classified that satisfy a preset URL acquisition rule.
The URL deduplication unit 705 is configured to apply a preset hash algorithm to the current URL to be classified to obtain a calculation result, and to query whether hash data corresponding to the calculation result exists in a preset hash data set. If it does, the unit discards the current URL to be classified and ends the current URL classification; otherwise, it places the calculation result in the hash data set and triggers the data extraction unit 702.
After ending the current URL classification, the URL deduplication unit 705 may also trigger the URL acquisition unit 704 to obtain another URL to be classified, so that the website data information of that URL is parsed and URL classification continues.
The pre-classification unit 706 is triggered by the data parsing unit 701 and is configured to parse the URL string contained in the website data information and determine whether the URL string satisfies a preset pre-classification rule. If it does, the unit obtains the URL category corresponding to the current URL to be classified according to the pre-classification rule and ends the current URL classification; otherwise, it triggers the data extraction unit 702.
As can be seen from the above scheme, with URL classification devices of the prior art, a large volume of web page content behind a URL to be classified leads to heavy system load and reduces classification efficiency. In contrast, the URL classification device provided by Embodiment 8 extracts at least one feature item and its weight from the parsed website data information of the URL to be classified, forms the corresponding space vector from the feature items and their weights, and uses a preset vector machine to obtain the URL category. Even when the volume of web page content behind the URL is large, parsing of the website data information is not affected, which reduces system load and improves the efficiency of URL classification.
In addition, by continually adjusting parameters such as those of the feature items of the URL to be classified and those of the vector machine, the URL classification device provided by Embodiment 8 can realize URL classification methods, devices, and systems that use different classification approaches. In other words, Embodiment 8 provides a URL classification device whose classification rules can be changed dynamically.
Further, the URL classification device provided by Embodiment 8 performs a preliminary classification of URLs on the Internet to obtain a set of URLs to be classified and then classifies those URLs. The classified URLs therefore have wider coverage and higher quality, so the URL database formed from them has better query capability.
Further, the URL classification device provided by Embodiment 8 parses the URL string contained in the website data information of the URL to be classified and pre-classifies the URL according to the parsing result, which speeds up URL classification.
Embodiment 9 of this application provides a URL classification system comprising any of the URL classification devices described above, wherein the system implements the following functions:
The URL classification system uses a URL crawler to crawl the URL classification information of commonly used URL directory sites and web navigation sites on the Internet (for example Yahoo and hao123). It filters and integrates the URLs that carry category marks and, by establishing a relation mapping table of class labels as shown in Table 1, obtains the preset class label corresponding to each URL from those directory and navigation sites, finally producing a library of classified URLs;
The URL classification system implements a URL collector that takes the set of addresses in the initial URL category library as its initial queue. To ensure that the collector gathers a high-quality, wide-coverage set of URLs from the Internet, a weighting scheme for URL importance is formulated in advance. The system takes into account the influence of the URL string itself on the URL weight and uses an algorithm similar to SiteRank to weight URLs. The URL weight has two parts: the URL's own weight and the transfer weight between linked URLs. For the own weight, the following aspects are considered: (1) the URL prefix, commonly www, home, etc.; (2) the URL suffix, commonly com, cn, org, net, etc.; (3) the number of "." characters in the URL string; (4) the total length of the URL string; (5) the number of digits in the URL string. For the transfer weight between linked URLs, the weight of the parent URL is divided equally, according to certain rules, among the child URLs on the parent URL's page. During collection, a string hash method is used to determine whether a URL has already been collected, and multithreaded parallel collection is used to increase collection speed. The flow of the URL collector is shown in Figure 11.
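The two-part weighting described above can be sketched in Python as follows. The source lists the five factors for the own weight and the equal division of the parent weight among child URLs, but gives no formula, so every coefficient below, as well as the names `own_weight` and `transfer_weights`, is an assumption made purely for illustration.

```python
def own_weight(host):
    """Score a URL string from the five factors named in the text.

    All coefficients are illustrative assumptions, not values from the source.
    """
    labels = host.lower().split(".")
    score = 1.0
    if labels[0] in ("www", "home"):                 # (1) URL prefix
        score += 0.5
    if labels[-1] in ("com", "cn", "org", "net"):    # (2) URL suffix
        score += 0.5
    score -= 0.1 * host.count(".")                   # (3) number of "."
    score -= 0.01 * len(host)                        # (4) total length
    score -= 0.05 * sum(ch.isdigit() for ch in host) # (5) number of digits
    return max(score, 0.0)

def transfer_weights(parent_weight, child_urls):
    """Divide the parent URL's weight equally among its child URLs."""
    if not child_urls:
        return {}
    share = parent_weight / len(child_urls)
    return {child: share for child in child_urls}

# Usage with hypothetical hosts.
w_short = own_weight("www.example.com")
w_long = own_weight("a1.b2.c3.example123.info")
shares = transfer_weights(1.0, ["a.example.com", "b.example.com"])
```

A collector could then prioritize its queue by the sum of a URL's own weight and the shares it receives from parent pages, which is one plausible reading of "relatively high" weights in step 402.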
The URL classification system builds a DOM tree from each fetched web page to obtain every HTML tag and its content, segments this content with the ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences, removes stop words and terms that carry little information and have little effect, and uses the chi-square test to extract feature items. These feature items are then weighted with the TFIDF method, combined with the expressive power of the HTML tags in which they occur. Each URL must be converted into a vector in a high-dimensional space in the mathematical model: each dimension of that space represents a feature item, and the value in each dimension represents the weight of that feature item in the web document corresponding to the URL. Any URL U can be expressed as U=(t1:w1, t2:w2, ..., tn:wn), where (t1, t2, ..., tn) are the coordinate axes of the n-dimensional space, (w1, w2, ..., wn) are the coordinate values on those axes, and U=(t1:w1, t2:w2, ..., tn:wn) is a vector in this space. This process is shown in Figure 12.
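The chi-square test used for feature extraction can be illustrated with the standard term-versus-class statistic computed from a 2x2 contingency table. The source does not give the formula it uses, so the function below is a sketch of the common textbook form, and the name `chi_square` and the sample counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term with respect to a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # Degenerate table: the term or class never varies.
    return n * (a * d - c * b) ** 2 / denom

# A term found only inside the class scores high; an evenly spread term scores 0.
strong = chi_square(10, 0, 0, 10)
neutral = chi_square(5, 5, 5, 5)
```

Terms can then be ranked by this score per class, keeping the highest-scoring ones as feature items before the TFIDF weighting step.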
In the model training stage, the URL classification system draws its training set and test set from the classified URL library described above. Each URL is represented as a high-dimensional vector, and the open-source LIBSVM software package is used for training and classification. LIBSVM requires training data and test data in the following format:
[class label] [index1]:[value1] [index2]:[value2] [index3]:[value3] ... [indexn]:[valuen]
Both the training set and the test set require a class label. Class labels identify the different classes and need not be consecutive values. Feature items whose value is 0 may be omitted. The optimal LIBSVM model is sought by repeatedly adjusting variables such as the feature weighting method used for classification, the dimensionality of the feature items, the parameters of the LIBSVM support vector machine, and the sizes of the training set and test set.
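To make the data format concrete, the following Python sketch turns one labeled feature vector into a line of LIBSVM input. The function name and the term-to-index mapping are assumptions for illustration; the sketch omits zero-valued features and writes indices in ascending order, as LIBSVM's sparse format expects.

```python
def to_libsvm_line(label: int, vector: dict, index_of: dict) -> str:
    """Format one sample as '<label> <index>:<value> ...' with ascending indices.

    `vector` maps feature terms to weights; `index_of` maps feature terms to
    1-based LIBSVM feature indices (a hypothetical mapping built beforehand).
    """
    pairs = sorted(
        (index_of[term], weight)
        for term, weight in vector.items()
        if term in index_of and weight != 0  # zero-valued features may be omitted
    )
    return " ".join([str(label)] + [f"{index}:{weight:g}" for index, weight in pairs])


# Example: class label 2 with three known features, one of them zero.
line = to_libsvm_line(
    2,
    {"news": 0.5, "sports": 0.0, "edu": 1.25},
    {"edu": 1, "news": 2, "sports": 3},
)
print(line)  # 2 1:1.25 2:0.5
```

Lines produced this way could be written to a file and passed to LIBSVM's training and prediction tools; the parameter search mentioned above is outside the scope of this sketch.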
The URLs obtained by the automatic URL collector are classified automatically with the LIBSVM classification algorithm to form the classified URL database. The URL classification system handles a URL classification query in the following steps:
First, check whether the URL already exists in the classified URL database. If a record exists, return the result directly. Otherwise, determine whether the URL string can be classified by preprocessing. Preprocessing classification handles in advance those URLs whose strings contain strong features, in order to improve the efficiency of the classification system; a strong feature mainly means that the URL string contains an English word, or an abbreviation of an English word, with clear discriminative value, such as news or edu. If the URL can be classified this way, return the preprocessing result directly. Otherwise, fetch the web page corresponding to the URL, analyze the page content, extract the URL's feature vector, classify it automatically with the trained LIBSVM model, and return the classification result, i.e., the category of the URL to be classified.
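The three-stage query handling described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the keyword table, the category names, and the `model_classify` callback (which stands in for page fetching, feature extraction and the trained LIBSVM model) are all hypothetical.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

# Hypothetical strong features in the URL string and the categories they imply.
PRECLASS_KEYWORDS = {"news": "news", "edu": "education"}


def preclassify(url: str) -> Optional[str]:
    """Return a category if a host label is a known strong feature, else None."""
    host = urlsplit(url).hostname or ""
    for label in host.split("."):
        if label in PRECLASS_KEYWORDS:
            return PRECLASS_KEYWORDS[label]
    return None


def classify_url(
    url: str,
    classified_db: Dict[str, str],
    model_classify: Callable[[str], str],
) -> str:
    """Look up the database, then try pre-classification, then fall back to the model."""
    if url in classified_db:
        return classified_db[url]  # stage 1: already classified
    category = preclassify(url)
    if category is not None:
        return category  # stage 2: strong feature in the URL string
    return model_classify(url)  # stage 3: content-based classification
```

The ordering reflects the cost of each stage: a database lookup and a string check are far cheaper than fetching and analyzing a page, so the expensive model is consulted only when the first two stages fail.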
As the above scheme shows, URL classification methods in the prior art impose a heavy system load and reduce classification efficiency when the web page content corresponding to the URL to be classified is large. By contrast, the URL classification system provided by Embodiment 9 of the present application extracts at least one feature item and its weight from the parsed website data information of the URL to be classified, forms from them a space vector corresponding to that URL, and uses a preset vector machine to obtain the corresponding URL category. A large volume of page content therefore does not affect the analysis of the website data information of the URL to be classified, which reduces system load and improves URL classification efficiency.
In addition, by continually adjusting the parameters related to the feature items of the URL to be classified, the parameters related to the vector machine, and so on, the URL classification system provided by Embodiment 9 can realize URL classification methods, devices and systems with different classification schemes; that is, Embodiment 9 provides a URL classification system whose URL classification rules can be changed dynamically.
Furthermore, the URL classification system provided by Embodiment 9 performs a preliminary classification of the URLs on the Internet to obtain the set of URLs to be classified, and then classifies those URLs. The classified URLs thus have wider coverage and higher quality, so the URL database formed from them offers better query capability.
Furthermore, the URL classification system provided by Embodiment 9 parses the URL string contained in the website data information of the URL to be classified and pre-classifies the URL according to the parsing result, which speeds up URL classification.
Furthermore, the URL classification system provided by Embodiment 9 parses at least one HTML tag and its content in the website data information and uses the chi-square method to generate the corresponding feature items and their weights from the HTML tags and their content. Compared with the single TF-IDF feature weighting method used in the prior art, this improves the accuracy of URL classification.
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiments are essentially similar to the method embodiments, they are described more briefly; for relevant details, refer to the corresponding description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to be non-exclusive, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.
For convenience of description, the above device is described in terms of units divided by function. Of course, when the present application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The URL classification method, device and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present application.

Claims (14)

1. A URL classification method, characterized by comprising:
parsing the website data information contained in the current URL to be classified;
extracting, from the website data information, at least one feature item of the current URL to be classified and the weight of the feature item, and forming from the feature item and its weight a space vector corresponding to the current URL to be classified;
inputting the space vector into a preset vector machine to obtain the URL category corresponding to the current URL to be classified.
2. The method according to claim 1, characterized in that, before parsing the website data information of the current URL to be classified, the method further comprises:
obtaining URL classification data from the Internet;
classifying the URLs to be classified contained in the URL classification data according to a preset preliminary classification rule, to generate a set of URLs to be classified;
obtaining a URL to be classified from the set of URLs to be classified.
3. The method according to claim 1 or 2, characterized in that, before parsing the website data information of the current URL to be classified, the method further comprises:
applying a preset hash algorithm to the current URL to be classified to obtain a computation result;
querying whether a preset hash data set contains hash data corresponding to the computation result; if so, discarding the current URL to be classified and ending the current URL classification; otherwise, inserting the computation result into the hash data set.
4. The method according to claim 1 or 2, characterized in that, after parsing the website data information of the current URL to be classified and before extracting, from the website data information, at least one feature item of the current URL to be classified and the weight of the feature item, the method further comprises:
parsing the URL string contained in the website data information;
determining whether the URL string satisfies a preset pre-classification rule and, if it does, obtaining the URL category corresponding to the current URL to be classified according to the pre-classification rule and ending the current URL classification.
5. The method according to claim 2, characterized in that obtaining a URL to be classified from the set of URLs to be classified comprises:
determining the URL weight of each URL to be classified in the set of URLs to be classified, and the transfer weights between the URLs to be classified;
obtaining, according to the URL weights and the transfer weights, the URLs to be classified in the set that satisfy a preset URL acquisition rule.
6. The method according to claim 1, characterized in that the vector machine uses a preset classification algorithm to obtain an optimal classification model according to the space vector, parses the class number carried by the optimal classification model, and takes the class number as the URL category corresponding to the current URL to be classified.
7. The method according to claim 1, characterized in that extracting, from the website data information, at least one feature item of the current URL to be classified and the weight of the feature item comprises:
parsing at least one HTML tag and its content in the website data information;
generating the corresponding feature item and the weight of the feature item according to the HTML tag and its content.
8. A URL classification device, characterized by comprising a data parsing unit, a data extraction unit and a category acquisition unit, wherein:
the data parsing unit is configured to parse the website data information contained in the current URL to be classified;
the data extraction unit is configured to extract, from the website data information, at least one feature item of the current URL to be classified and the weight of the feature item, and to form from the feature item and its weight a space vector corresponding to the current URL to be classified;
the category acquisition unit is configured to input the space vector into a preset vector machine to obtain the URL category corresponding to the current URL to be classified.
9. The device according to claim 8, characterized by further comprising a URL acquisition unit;
the URL acquisition unit is configured to obtain URL classification data from the Internet, classify the URLs to be classified contained in the URL classification data according to a preset preliminary classification rule to generate a set of URLs to be classified, obtain a URL to be classified from the set of URLs to be classified, and trigger the data parsing unit.
10. The device according to claim 8 or 9, characterized by further comprising a URL deduplication unit;
the URL deduplication unit is configured to apply a preset hash algorithm to the current URL to be classified to obtain a computation result, and to query whether a preset hash data set contains hash data corresponding to the computation result; if so, to discard the current URL to be classified and end the current URL classification; otherwise, to insert the computation result into the hash data set and trigger the data extraction unit.
11. The device according to claim 8 or 9, characterized by further comprising a pre-classification unit;
the pre-classification unit is triggered by the data parsing unit and is configured to parse the URL string contained in the website data information and determine whether the URL string satisfies a preset pre-classification rule; if it does, to obtain the URL category corresponding to the current URL to be classified according to the pre-classification rule and end the current URL classification; otherwise, to trigger the data extraction unit.
12. The device according to claim 9, characterized in that the URL acquisition unit comprises a URL collection subunit, a preliminary classification subunit and a URL acquisition subunit, wherein:
the URL collection subunit is configured to obtain URL classification data from the Internet;
the preliminary classification subunit is configured to classify the URLs to be classified contained in the URL classification data according to a preset preliminary classification rule, to generate a set of URLs to be classified;
the URL acquisition subunit is configured to determine the URL weight of each URL to be classified in the set of URLs to be classified and the transfer weights between the URLs to be classified, and to obtain, according to the URL weights and the transfer weights, the URLs to be classified in the set that satisfy a preset URL acquisition rule.
13. The device according to claim 8, characterized in that the data extraction unit comprises a feature item extraction subunit and a vector generation subunit, wherein:
the feature item extraction subunit is configured to parse at least one HTML tag and its content in the website data information, and to generate the corresponding feature item and the weight of the feature item according to the HTML tag and its content;
the vector generation subunit is configured to form from the feature item and its weight a space vector corresponding to the current URL to be classified.
14. A URL classification system, characterized by comprising the URL classification device according to any one of claims 8 to 13.
CN2012101344981A 2012-05-03 2012-05-03 Website classification method, device and system Pending CN102629282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101344981A CN102629282A (en) 2012-05-03 2012-05-03 Website classification method, device and system

Publications (1)

Publication Number Publication Date
CN102629282A true CN102629282A (en) 2012-08-08

Family

ID=46587542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101344981A Pending CN102629282A (en) 2012-05-03 2012-05-03 Website classification method, device and system

Country Status (1)

Country Link
CN (1) CN102629282A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819595A (en) * 2012-08-10 2012-12-12 北京星网锐捷网络技术有限公司 Web page classification method, web page classification device and network equipment
CN103577492A (en) * 2012-08-09 2014-02-12 腾讯科技(深圳)有限公司 Webpage homepage generating method and device
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server
CN105468683A (en) * 2015-11-16 2016-04-06 孙宝文 Method and device for carrying out duplicate checking to network address
CN105512143A (en) * 2014-09-26 2016-04-20 中兴通讯股份有限公司 Method and device for web page classification
CN108038242A (en) * 2017-12-28 2018-05-15 中译语通科技(青岛)有限公司 A kind of data source distribution management system based on collecting webpage data
CN108881138A (en) * 2017-10-26 2018-11-23 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN109284465A (en) * 2018-09-04 2019-01-29 暨南大学 A kind of Web page classifying device construction method and its classification method based on URL
CN102929963B (en) * 2012-10-11 2019-03-29 北京百度网讯科技有限公司 A kind of setting method and system of website type
CN109614509A (en) * 2018-10-29 2019-04-12 山东中创软件工程股份有限公司 Ship portrait construction method, device, equipment and storage medium
CN114417216A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Data acquisition method and device, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101692639A (en) * 2009-09-15 2010-04-07 西安交通大学 Bad webpage recognition method based on URL
JP2010123000A (en) * 2008-11-20 2010-06-03 Nippon Telegr & Teleph Corp <Ntt> Web page group extraction method, device and program

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577492B (en) * 2012-08-09 2018-07-06 腾讯科技(深圳)有限公司 WEB home page generation method and device
CN103577492A (en) * 2012-08-09 2014-02-12 腾讯科技(深圳)有限公司 Webpage homepage generating method and device
CN102819595A (en) * 2012-08-10 2012-12-12 北京星网锐捷网络技术有限公司 Web page classification method, web page classification device and network equipment
CN102929963B (en) * 2012-10-11 2019-03-29 北京百度网讯科技有限公司 A kind of setting method and system of website type
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server
WO2015196740A1 (en) * 2014-06-25 2015-12-30 华南理工大学 Information forecast and acquisition method based on webpage link parameter analysis
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
CN105512143A (en) * 2014-09-26 2016-04-20 中兴通讯股份有限公司 Method and device for web page classification
CN105468683A (en) * 2015-11-16 2016-04-06 孙宝文 Method and device for carrying out duplicate checking to network address
CN108881138A (en) * 2017-10-26 2018-11-23 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN108881138B (en) * 2017-10-26 2020-06-26 新华三信息安全技术有限公司 Webpage request identification method and device
CN108038242A (en) * 2017-12-28 2018-05-15 中译语通科技(青岛)有限公司 A kind of data source distribution management system based on collecting webpage data
CN109284465A (en) * 2018-09-04 2019-01-29 暨南大学 A kind of Web page classifying device construction method and its classification method based on URL
CN109284465B (en) * 2018-09-04 2021-03-19 暨南大学 URL-based web page classifier construction method and classification method thereof
CN109614509A (en) * 2018-10-29 2019-04-12 山东中创软件工程股份有限公司 Ship portrait construction method, device, equipment and storage medium
CN114417216A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Data acquisition method and device, electronic equipment and readable storage medium
CN114417216B (en) * 2022-01-04 2022-11-29 马上消费金融股份有限公司 Data acquisition method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN102629282A (en) Website classification method, device and system
Vega-Oliveros et al. A multi-centrality index for graph-based keyword extraction
CN103226578B (en) Towards the website identification of medical domain and the method for webpage disaggregated classification
CN101794311A (en) Fuzzy data mining based automatic classification method of Chinese web pages
CN104077377A (en) Method and device for finding network public opinion hotspots based on network article attributes
CN103678418A (en) Information processing method and equipment
Rajalakshmi et al. Web page classification using n-gram based URL features
CN102117339A (en) Filter supervision method specific to unsecure web page texts
CN107180075A (en) The label automatic generation method of text classification integrated level clustering
CN110245289A (en) A kind of information search method and relevant device
Kao et al. Entropy-based link analysis for mining web informative structures
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN110427404A (en) A kind of across chain data retrieval system of block chain
CN103049557A (en) Website resource management method and website resource management device
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
Patel et al. A review on web pages clustering techniques
Kaur et al. SIMHAR-smart distributed web crawler for the hidden web using SIM+ hash and redis server
Peng et al. Focused crawling enhanced by CBP–SLC
CN108694192B (en) Webpage type judging method and device
JP2016218512A (en) Information processing device and information processing program
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
Ma et al. Advanced deep web crawler based on Dom
Liu et al. Clustering-based topical Web crawling using CFu-tree guided by link-context
CN108897736B (en) Document sorting method and device based on Paper Rank algorithm
Li et al. Research on the feature selection techniques used in text classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120808