Summary of the invention
The technical problem to be solved by the present application is to provide a website classification method and device, so as to solve the problem that existing website classification methods must process a large volume of data and are therefore inefficient.
To solve the above problem, the present application discloses a website classification method, comprising: dividing all websites in a website database into a plurality of different website sets according to a network access behavior feature obtained in advance, wherein the network access behavior feature is obtained by performing message feature extraction on communication network messages; and, if a website set contains a labeled website, determining the category of the labeled website as the category of the other websites in the website set to which the labeled website belongs, wherein the labeled website is a website that has been extracted from the website database and assigned a category label in advance.
Preferably, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, the method further comprises: extracting, from the obtained communication network messages and based on correspondence information between query words and websites, a relationship feature of the websites as the network access behavior feature.
Preferably, the website classification method further comprises: if none of the websites in a website set is a labeled website, automatically classifying all websites in that website set.
Preferably, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, the method further comprises: randomly extracting a set number of websites from the website database and labeling them with categories.
To solve the above problem, the present application also discloses a website classification device, comprising: a division module, configured to divide all websites in a website database into a plurality of different website sets according to a network access behavior feature obtained in advance, wherein the network access behavior feature is obtained by performing message feature extraction on communication network messages; and a classification module, configured to, if a website set contains a labeled website, determine the category of the labeled website as the category of the other websites in the website set to which the labeled website belongs, wherein the labeled website is a website that has been extracted from the website database and assigned a category label in advance.
Preferably, the website classification device further comprises: a feature extraction module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, extract, from the obtained communication network messages and based on correspondence information between query words and websites, a relationship feature of the websites as the network access behavior feature.
Preferably, the classification module is further configured to, if none of the websites in a website set is a labeled website, automatically classify all websites in that website set.
Preferably, the website classification device further comprises: a labeling module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, randomly extract a set number of websites from the website database and label them with categories.
Compared with the prior art, the present application has the following advantages:
In the present application, the network access behavior feature obtained by performing message feature extraction on communication network messages serves as the basis for dividing websites, so there is no need to analyze and extract features from the massive data of webpages or websites. This solves the problem that conventional website classification methods must process massive amounts of data. Website sets that contain labeled websites are then classified precisely according to the category labels of those websites, which improves the accuracy of website classification. The present application therefore both reduces the volume of data that must be processed for website classification and improves classification accuracy, thereby effectively improving classification efficiency and solving the problem that existing website classification methods must process a large volume of data and are inefficient. In addition, the network access behavior feature is information about users' network access behavior obtained by analyzing the users' communication network messages; it reflects the true category of a website well, which improves the accuracy of website classification and, in turn, its efficiency.
Embodiment
To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of a website classification method according to Embodiment one of the present application is shown.
The website classification method of this embodiment comprises the following steps:
Step S102: according to a network access behavior feature obtained in advance, divide all websites in the website database into a plurality of different website sets.
Here, the network access behavior feature is obtained by performing message feature extraction on communication network messages.
The network access behavior feature is a behavioral feature extracted by analyzing how users use the network. In this embodiment, users' network usage is analyzed through the obtained communication network messages, and message features are extracted to obtain the network access behavior feature. For example, the data in the communication network messages is analyzed, and message features are extracted from the analysis results. This embodiment focuses on analyzing the query words that users enter and the URLs of the webpages that users click, from which the message features are extracted and the network access behavior feature is obtained.
The website database contains one or more webpages of each of a plurality of websites; by analyzing a webpage, information about the website to which it belongs can be obtained. In this embodiment, the webpages of the plurality of websites are analyzed according to the network access behavior feature, and the websites to which those webpages belong are then divided into different sets.
Step S104: if a website set contains a labeled website, determine the category of the labeled website as the category of the other websites in the website set to which the labeled website belongs.
Here, a labeled website is a website that has been extracted from the website database and assigned a category label in advance. That is, a certain number of websites are selected from the website database, and each selected website is then labeled, either manually or by a machine following set rules, with the category to which it belongs; for example, website A is labeled as a military website and website B as a finance and economics website.
In general, a website set may contain one labeled website or several. Because websites with similar network access behavior features generally belong to the same category, once the website sets have been divided on the basis of the network access behavior feature, the labeled websites found in any one website set will usually share the same category.
After all websites have been divided according to the network access behavior feature, the category of the websites in each website set can be determined from the relationship between the website set and the labeled websites. For example, if set S1 contains four websites A, F, G, and H, and website A has been labeled in advance as a military website, then websites F, G, and H in set S1 are also determined to be military websites. A website set that contains no labeled website may be classified by another suitable method, such as automatic classification, or each website in it may be assigned directly to an "other" or "miscellaneous" category.
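As a non-limiting illustration, step S104 might be sketched as follows. The function name `propagate_labels`, the `fallback` category name, and the majority-vote rule for a set whose labeled websites disagree are assumptions made for this sketch and are not specified by the embodiment.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def propagate_labels(site_sets: Iterable[Set[str]],
                     labeled: Dict[str, str],
                     fallback: str = "other") -> Dict[str, str]:
    """Give every website the category of the labeled website(s) in its set."""
    result: Dict[str, str] = {}
    for group in site_sets:
        # Categories of the pre-labeled websites found in this set.
        found = [labeled[site] for site in sorted(group) if site in labeled]
        if found:
            # Assumption: if labels disagree, take the most frequent one.
            category = Counter(found).most_common(1)[0][0]
        else:
            # Assumption: a set with no labeled website goes to a catch-all class.
            category = fallback
        for site in group:
            result[site] = category
    return result


# The S1 example from the text: A is labeled in advance, F, G, H inherit its category.
s1 = {"A", "F", "G", "H"}
s2 = {"X", "Y"}  # hypothetical set with no labeled website
categories = propagate_labels([s1, s2], {"A": "military"})
```

In this sketch the catch-all category stands in for whichever fallback (automatic classification or an "other" class) an implementation chooses.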
In this embodiment, the network access behavior feature obtained by performing message feature extraction on communication network messages serves as the basis for dividing websites, so there is no need to analyze and extract features from the massive data of webpages or websites. This solves the problem that conventional website classification methods must process massive amounts of data. Website sets that contain labeled websites are then classified precisely according to the category labels of those websites, which improves the accuracy of website classification. This embodiment both reduces the volume of data that must be processed for website classification and improves classification accuracy, thereby effectively improving classification efficiency and solving the problem that existing website classification methods must process a large volume of data and are inefficient. In addition, the network access behavior feature is information about users' network access behavior obtained by analyzing the users' communication network messages; it reflects the true category of a website well, which improves the accuracy of website classification and, in turn, its efficiency.
Embodiment two
Referring to Fig. 2, a flowchart of the steps of a website classification method according to Embodiment two of the present application is shown.
The website classification method of this embodiment comprises the following steps:
Step S202: from the obtained communication network messages, and based on correspondence information between query words and websites, extract a relationship feature of the websites as the network access behavior feature.
In this embodiment, gateways are deployed at a plurality of geographic locations, and the communication network messages of all users are obtained through the gateways. The original communication network messages thus obtained are processed and analyzed to form correspondence information between query words (also called webpage query words or website query words) and the websites clicked, referred to in this application as "query word website relation information". From the query word website relation information it can be determined that the websites clicked for a query word bear a certain similarity to one another. It should be noted that a query word reflects the user's query intent and the website clicked is generally the result the user wants, so the query word website relation information is one of the better text features for website classification. In practice, the query word website relation information may be used directly as the network access behavior feature, or the relationship feature among the websites determined from it may be used as the network access behavior feature.
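By way of illustration only, the formation of query word website relation information might be sketched as below. The sketch assumes that the gateway messages have already been parsed into (query word, clicked URL) pairs; that record format, the function name `build_query_site_relation`, the use of the URL host name to identify a website, and the example URLs are all hypothetical and are not prescribed by the embodiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit


def build_query_site_relation(
        click_records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each query word to the set of websites clicked for it."""
    relation: Dict[str, Set[str]] = defaultdict(set)
    for query, url in click_records:
        # Assumption: a website is identified by the host name of the clicked URL.
        host = urlsplit(url).hostname
        if host:  # skip records whose URL has no recognizable host
            relation[query.strip().lower()].add(host)
    return dict(relation)


# Hypothetical parsed records: (query word, URL of the clicked webpage).
records = [
    ("patent", "http://a.example/search?q=1"),
    ("Patent", "http://b.example/index.html"),
    ("patent", "http://a.example/other"),
    ("stocks", "http://c.example/quotes"),
    ("stocks", "not-a-url"),
]
relation = build_query_site_relation(records)
```

Normalizing the query word (trimming and lower-casing) is a design choice of the sketch so that trivially different spellings map to the same entry.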
By using the extracted relationship feature of the websites as the network access behavior feature, related websites in the website database can be divided into one set conveniently and effectively, which reduces the volume of data processed for website classification and improves classification efficiency.
Of course, the network access behavior feature is not limited to this; any other feature that effectively identifies users' network access behavior may also be used as the network access behavior feature.
Step S204: randomly extract a certain number of websites from the website database and label them with categories.
The number of websites extracted may be preset or set at random, and the websites to be labeled may of course also be extracted in a given proportion. In theory, the more websites are labeled, the more comprehensive and fine-grained the website categories and the more accurate the website classification.
Randomly extracting some of the websites from the website database and labeling them manually allows the labeled websites to serve as the classification basis, so that the categories of related websites can be determined accurately.
It should be noted that steps S204 and S202 may be performed in either order.
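The random extraction of step S204 might be sketched as follows, purely as an example. The function name `sample_for_labeling` and the optional `seed` parameter (added only to make the draw reproducible) are assumptions of this sketch.

```python
import random
from typing import Iterable, List, Optional


def sample_for_labeling(sites: Iterable[str],
                        count: int,
                        seed: Optional[int] = None) -> List[str]:
    """Randomly draw up to `count` distinct websites to be labeled."""
    pool = sorted(set(sites))   # de-duplicate and fix the order
    rng = random.Random(seed)   # private generator, so global state is untouched
    # Never ask for more websites than the database holds.
    return rng.sample(pool, min(count, len(pool)))


# Hypothetical database of eight websites; draw three of them for labeling.
database = ["A", "B", "C", "D", "E", "F", "G", "H"]
to_label = sample_for_labeling(database, 3, seed=0)
```

To extract in a given proportion instead, `count` could be computed from the size of the database before the call.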
Step S206: according to the network access behavior feature, divide all websites in the website database into a plurality of different website sets.
To aid understanding of the scheme of this embodiment, a simplified example is given below for illustration. Suppose the query word entered by users is "patent" and the websites clicked in response to this query word are A, B, and C. Through a series of data analyses it can be inferred that these three websites are likely to be quite similar, and the relationship that websites A, B, and C are highly similar is used as the network access behavior feature. From this network access behavior feature, combined with certain data analysis, it can be inferred that websites A, B, and C in the website database belong to the same website set. In addition, by analyzing the communication network messages in combination with certain data analysis, it can also be inferred that websites strongly connected to these three websites belong to the same website set as A, B, and C. For example, if analysis of the communication network messages shows that most users who query "patent" visit websites D and E in addition to one or more of websites A, B, and C, or reach D and E via one or more of websites A, B, and C, then A, B, C, D, and E can be divided into one website set.
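The division of step S206 can be illustrated with a deliberately simplified sketch. It assumes that the query word website relation information is available as a mapping from each query word to the websites clicked for it, and it replaces the unspecified "data analysis" of the embodiment with a single rule: websites clicked for the same query word fall into the same set. That rule, the union-find structure, and the function name `group_sites` are assumptions of the sketch, not the method of the application.

```python
from typing import Dict, List, Set


def group_sites(relation: Dict[str, Set[str]]) -> List[Set[str]]:
    """Divide websites into sets; websites sharing a query word are merged."""
    parent: Dict[str, str] = {}

    def find(site: str) -> str:
        # Return the representative of the set containing `site`.
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]  # path halving
            site = parent[site]
        return site

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for sites in relation.values():
        ordered = sorted(sites)
        for site in ordered:
            find(site)                 # register every website, even a lone one
        for other in ordered[1:]:
            union(ordered[0], other)   # same query word, same set

    groups: Dict[str, Set[str]] = {}
    for site in list(parent):
        groups.setdefault(find(site), set()).add(site)
    return list(groups.values())


# The example from the text: D and E are linked to A, B, C through a shared website.
relation = {
    "patent": {"A", "B", "C"},
    "patent search": {"C", "D", "E"},   # hypothetical second query word
    "weather": {"X"},                   # hypothetical unrelated query word
}
website_sets = group_sites(relation)
```

A practical system would likely weight the links, for example by click counts, before merging; the sketch merges on any shared query word.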
Step S208: according to the labeled websites in the website database, determine the category of the websites in each of the plurality of different website sets.
In this embodiment, suppose any one of websites A, B, C, D, and E has been labeled as a patent website; then A, B, C, D, and E are all classified as patent websites.
Preferably, when determining website categories, it is first judged whether one or more of the plurality of different website sets contain at least one labeled website; if so, the websites in such a website set are determined to have the same category as the labeled website contained in that set.
For a website set in which none of the websites is labeled, the websites may be classified by an existing classification method, such as automatic classification, or assigned directly to an "other" category. Automatic classification is simple to implement, convenient, and flexible, and reduces the implementation cost of website classification.
This embodiment classifies websites on the basis of users' network access behavior information and the results of manually labeling some of the websites. During website classification, the relationship feature of the websites, extracted from the communication network messages on the basis of the correspondence information between query words and websites, is used as the network access behavior feature. This feature is well representative, and on its basis the websites in the website database can be divided into sets fairly accurately. Compared with the prior art, which extracts webpage features by analyzing the massive data of webpages or websites, the scheme of this embodiment only needs to analyze communication network messages, whose data volume is far smaller than that of websites or webpages; this reduces the volume of data to be processed for website classification and improves classification efficiency. Moreover, when determining the categories of the website sets, sets that contain no manually labeled website are classified automatically or assigned directly to the "other" category, which reduces the implementation cost of website classification.
Embodiment three
Referring to Fig. 3, a structural block diagram of a website classification device according to Embodiment three of the present application is shown.
The website classification device of this embodiment comprises: a division module 302, configured to divide all websites in the website database into a plurality of different website sets according to a network access behavior feature obtained in advance, wherein the network access behavior feature is obtained by performing message feature extraction on communication network messages; and a classification module 304, which may be connected to the division module 302 and is configured to, if a website set contains a labeled website, determine the category of the labeled website as the category of the other websites in the website set to which the labeled website belongs, wherein the labeled website is a website that has been extracted from the website database and assigned a category label in advance.
Preferably, the website classification device of this embodiment further comprises: a feature extraction module 306, which may be connected to the division module 302 and to the labeling module 308 respectively, and is configured to, before the division module 302 divides all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, extract, from the obtained communication network messages and based on correspondence information between query words and websites, a relationship feature of the websites as the network access behavior feature.
Preferably, the classification module 304 is further configured to, if none of the websites in a website set is a labeled website, automatically classify all websites in that website set or assign them directly to the "other" category.
Preferably, the website classification device of this embodiment further comprises: a labeling module 308, which may be connected to the division module 302 and to the feature extraction module 306 respectively, and is configured to, before the division module 302 divides all websites in the website database into a plurality of different website sets according to the network access behavior feature obtained in advance, randomly extract a set number of websites from the website database and label them with categories.
The feature extraction module 306 and the labeling module 308 may operate in either order.
The website classification device of this embodiment is used to implement the corresponding website classification methods of the foregoing method embodiments and has the beneficial effects of those method embodiments, which are not repeated here.
Embodiment four
The website classification device of this embodiment mainly comprises two modules, namely a communication network message preprocessing module and a website classification module.
Specifically:
The communication network message preprocessing module (equivalent to the feature extraction module in Embodiment three) is mainly responsible for preprocessing the original communication network messages to form correspondence information between query words and the websites clicked, referred to as "query word website relation information", which is used as the network access behavior feature. It should be noted that a query word reflects the user's query intent and the website clicked is generally the result the user wants, so the query word website relation information is one of the better text features for website classification.
The website classification module (equivalent to the labeling module, division module, and classification module in Embodiment three) is mainly responsible for randomly extracting some of the websites for manual category labeling and, based on the query word website relation information and the manual labeling results, classifying the websites using a machine learning classification model. Websites that cannot be classified according to the manual labeling results are classified automatically. Together these form the website classification result.
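The embodiment does not name a particular classification model. Purely as a stand-in, the sketch below represents each website by the query words for which it was clicked and gives an unlabeled website the category of the labeled website with which it shares the most query words, falling back to an "other" category when nothing is shared. This nearest-neighbor rule and the names `site_features` and `classify_sites` are assumptions of the sketch and should not be read as the model used by the application.

```python
from collections import Counter
from typing import Dict, Set


def site_features(relation: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """Invert query word -> websites into website -> bag of query words."""
    features: Dict[str, Counter] = {}
    for query, sites in relation.items():
        for site in sites:
            features.setdefault(site, Counter())[query] += 1
    return features


def classify_sites(features: Dict[str, Counter],
                   labeled: Dict[str, str],
                   fallback: str = "other") -> Dict[str, str]:
    """Label each website by its most similar manually labeled website."""
    result: Dict[str, str] = {}
    for site, bag in features.items():
        if site in labeled:
            result[site] = labeled[site]   # keep the manual label as is
            continue
        best_category, best_score = fallback, 0
        for known in sorted(labeled):
            known_bag = features.get(known)
            if not known_bag:
                continue
            # Similarity: number of query words the two websites share.
            score = sum((bag & known_bag).values())
            if score > best_score:
                best_category, best_score = labeled[known], score
        result[site] = best_category
    return result


# Hypothetical relation information and manual labels.
relation = {"patent": {"A", "B"}, "stocks": {"C", "D"}, "weather": {"E"}}
manual_labels = {"A": "patent", "C": "finance"}
result = classify_sites(site_features(relation), manual_labels)
```

Any trained classifier that takes the same query word features could be substituted for `classify_sites` without changing the rest of the flow.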
The process of classifying websites with the above website classification device of this embodiment is shown in Fig. 4. In Fig. 4, the communication network messages are processed by the communication network message preprocessing module to generate the query word website relation information; the website classification module then classifies all websites in the website database according to this query word website relation information and the websites labeled in advance, forming the website classification result. The labeling of the websites may precede or follow the generation of the query word website relation information. Through processing by the above two modules, a website classification system is built from the original communication network messages.
The website classification device of this embodiment is used to implement the corresponding website classification methods of the foregoing method embodiments and has the beneficial effects of those method embodiments, which are not repeated here.
The present application classifies websites on the basis of users' network access behavior information and the results of manually labeling some of the websites. The present application achieves the following:
(1) The present application proposes a scheme for classifying websites based on communication network message information. It uses a better text feature, namely the query word website relation information, to classify websites, and solves the problems that conventional schemes must process massive amounts of data and have low accuracy.
(2) The present application can classify most websites currently on the Internet with relatively high accuracy, and its more accurate classification results can be used in many network applications. For example, they can be used to build a better website navigation portal than www.hao123.com, or to identify sensitive websites for public opinion monitoring, and so on.
(3) Based on the website classification results of the present application, users' interests can be mined further and accurate user profiles can be built to guide precisely targeted advertising, and so on.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
The website classification method and device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is intended only to aid understanding of the method of the application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.