CN102567494A - Website classification method and device - Google Patents

Website classification method and device

Info

Publication number
CN102567494A
Authority
CN
China
Prior art keywords
websites
website
mark
classification
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104361679A
Other languages
Chinese (zh)
Other versions
CN102567494B (en)
Inventor
罗峰
黄苏支
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IZP (BEIJING) TECHNOLOGIES CO LTD
Izp China Network Technology Co ltd
Original Assignee
BEIJING IZP TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING IZP TECHNOLOGIES Co Ltd
Priority to CN201110436167.9A
Publication of CN102567494A
Application granted
Publication of CN102567494B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a website classification method and device. The method comprises: dividing all websites in a website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages; and, if a website set contains a labeled website, determining the category of the other websites in the website set to which the labeled website belongs to be the category of the labeled website, wherein a labeled website is a website extracted from the website database and labeled with a category in advance. The application effectively improves network classification efficiency and solves the problem that existing website classification techniques must process a large amount of data and are inefficient.

Description

Website classification method and device
Technical field
The present application relates to the field of network technology, and in particular to a website classification method and device.
Background art
With the rapid development of information technology, the number of websites has reached the millions. Website classification technology emerged so that a needed website can be found quickly among so many. Website classification uses the features of websites to divide them into categories. As the saying goes, "birds of a feather flock together", and sorting websites into categories is of great value. For example, www.hao123.com classifies a number of high-traffic websites, placing each in its most suitable category to provide website navigation and make lookup convenient for users. Website classification can also identify sensitive websites for public opinion monitoring.
At present, crawler technology is commonly used when classifying websites. A crawler is a program or script that automatically fetches website information according to certain rules. Crawler technology collects site information to obtain the text features of a website, and then classifies the website with a specific text classification method, such as a machine learning or rule-based method.
However, because websites are very numerous and the data on each website is usually massive, the data collected by a crawler is also massive. Collecting and analyzing information on this scale hinders the formation of web page text features after data collection, which makes website classification inefficient.
In addition, in some cases the text features of a website may not truly reflect its real category. Part of the reason is that some websites, in order to increase their traffic, introduce keyword information unrelated to the site on their pages, especially in the titles. This greatly reduces the accuracy of website classification and also lowers its efficiency.
In short, the technical problem that those skilled in the art urgently need to solve is how to reduce the amount of data that network classification has to process and improve the efficiency of website classification.
Summary of the invention
The technical problem to be solved by the present application is to provide a website classification method and device, so as to solve the problem that existing website classification techniques must process a large amount of data and are inefficient.
To solve the above problem, the present application discloses a website classification method, comprising: dividing all websites in a website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages; and, if a website set contains a labeled website, determining the category of the other websites in the website set to which the labeled website belongs to be the category of the labeled website, wherein a labeled website is a website extracted from the website database and labeled with a category in advance.
Preferably, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, the method further comprises: extracting, from the obtained communication network messages and based on correspondence information between query words and websites, relationship features of the websites as the network access behavior features.
Preferably, the website classification method further comprises: if none of the websites in a website set is a labeled website, automatically classifying all websites in that website set.
Preferably, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, the method further comprises: randomly extracting a set number of websites from the website database and labeling them with categories.
To solve the above problem, the present application also discloses a website classification device, comprising: a division module, configured to divide all websites in a website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages; and a classification module, configured to, if a website set contains a labeled website, determine the category of the other websites in the website set to which the labeled website belongs to be the category of the labeled website, wherein a labeled website is a website extracted from the website database and labeled with a category in advance.
Preferably, the website classification device further comprises: a feature extraction module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, extract, from the obtained communication network messages and based on correspondence information between query words and websites, relationship features of the websites as the network access behavior features.
Preferably, the classification module is further configured to automatically classify all websites in a website set if none of the websites in that set is a labeled website.
Preferably, the website classification device further comprises: a labeling module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, randomly extract a set number of websites from the website database and label them with categories.
Compared with the prior art, the present application has the following advantages:
Through the present application, the network access behavior features obtained by performing message feature extraction on communication network messages are used as the basis for dividing websites, so there is no need to analyze and extract features from the massive data of web pages or websites. This solves the problem that traditional network classification techniques must process massive data. Then, according to the categories of the labeled websites, the website sets that contain labeled websites are classified precisely, which improves the accuracy of website classification. The present application therefore both reduces the amount of data to be processed for network classification and improves the accuracy of network classification, thereby effectively improving the efficiency of network classification and solving the problem that existing website classification techniques must process a large amount of data and are inefficient. In addition, the network access behavior features are information about users' network access behavior obtained by analyzing users' communication network messages. They reflect the true category of a website well, which improves the accuracy of website classification and in turn its efficiency.
Description of drawings
Fig. 1 is a flowchart of the steps of a website classification method according to Embodiment one of the present application;
Fig. 2 is a flowchart of the steps of a website classification method according to Embodiment two of the present application;
Fig. 3 is a structural block diagram of a website classification device according to Embodiment three of the present application;
Fig. 4 is a schematic diagram of network classification performed using a website classification device of Embodiment four of the present application.
Embodiments
To make the above objects, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of a website classification method according to Embodiment one of the present application is shown.
The website classification method of this embodiment comprises the following steps:
Step S102: According to network access behavior features obtained in advance, divide all websites in the website database into a plurality of different website sets.
The network access behavior features are obtained by performing message feature extraction on communication network messages.
Network access behavior features are the behavioral features extracted after analyzing how users use the network. In this embodiment, users' network usage is analyzed through the obtained communication network messages, and message features are extracted to obtain the network access behavior features. For example, the data in the communication network messages is analyzed and message features are extracted from the analysis results. This embodiment focuses on analyzing the web page query words entered by users and the URLs of the web pages that users clicked, thereby extracting message features and obtaining the network access behavior features.
The website database contains one or more web pages of each of a plurality of websites; by analyzing a web page, information about the website it belongs to can be obtained. In this embodiment, the web pages of the websites are analyzed according to the network access behavior features, and the websites to which those pages belong are then divided into different sets.
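As a rough illustration of the extraction described for step S102, the sketch below turns records of a user's query word and clicked URL into (query word, website) pairs. The record layout, the function names `site_of` and `extract_query_site_pairs`, and the use of the URL host name as the website key are assumptions made for this example; the application does not specify them.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Return the host name of a URL, used here (by assumption) as the website key."""
    return urlsplit(url).hostname  # None when the URL has no host part


def extract_query_site_pairs(records):
    """Turn (query word, clicked URL) records into (query word, website) pairs."""
    pairs = []
    for query, clicked_url in records:
        query = query.strip().lower()
        site = site_of(clicked_url)
        # Skip records that lack either a query word or a recognizable website.
        if query and site:
            pairs.append((query, site))
    return pairs


# Hypothetical records, as they might look after parsing communication network messages.
records = [
    ("patent", "http://www.site-a.example/search?q=1"),
    ("Patent ", "http://www.site-b.example/index.html"),
    ("", "http://www.site-c.example/"),
]
pairs = extract_query_site_pairs(records)
```

The third record is dropped because it carries no query word, so only two pairs remain.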
Step S104: If a website set contains a labeled website, determine the category of the other websites in the website set to which the labeled website belongs to be the category of that labeled website.
A labeled website is a website extracted from the website database and labeled with a category in advance. That is, a certain number of websites are selected from the website database, and each selected website is then labeled, manually or by machine according to set rules, with the category it belongs to; for example, website A is a military website, website B is a finance and economics website, and so on.
In general, a website set may contain one labeled website or several. Because websites with similar network access behavior features have essentially the same category, after the website sets have been divided on the basis of network access behavior features, if a website set contains several labeled websites, the categories of those labeled websites are generally also the same.
After all websites have been divided according to the network access behavior features, the category of the websites in a website set can be determined from the relationship between the set and its labeled websites. For example, if set S1 contains four websites A, F, G, and H, and website A has been labeled in advance as a military website, then websites F, G, and H in set S1 can also be determined to be military websites. For a set that contains no labeled website, another suitable classification method can be used, such as automatic classification, or every website in the set can be assigned directly to an "other" or "miscellaneous" category.
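The label propagation of step S104 can be sketched as follows, using the S1 example above. The function name `classify_sets` and the data layout are hypothetical. The text says labels within one set are generally identical; picking the most common label when they differ is an assumption added here, not something the application states.

```python
from collections import Counter


def classify_sets(site_sets, labeled, default="other"):
    """Assign each website the category of the labeled website(s) in its set.

    site_sets: iterable of sets of website names
    labeled:   dict mapping a labeled website to its category
    default:   category for sets that contain no labeled website
    """
    categories = {}
    for sites in site_sets:
        votes = Counter(labeled[s] for s in sites if s in labeled)
        # Assumption: if labels disagree, the most common one wins.
        category = votes.most_common(1)[0][0] if votes else default
        for s in sites:
            # A labeled website keeps its own label.
            categories[s] = labeled.get(s, category)
    return categories


site_sets = [{"A", "F", "G", "H"}, {"X", "Y"}]
labeled = {"A": "military"}
result = classify_sets(site_sets, labeled)
```

Here F, G, and H inherit "military" from A, while X and Y fall into the "other" category because their set contains no labeled website.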
Through this embodiment, the network access behavior features obtained by performing message feature extraction on communication network messages are used as the basis for dividing websites, so there is no need to analyze and extract features from the massive data of web pages or websites. This solves the problem that traditional network classification techniques must process massive data. Then, according to the categories of the labeled websites, the website sets that contain labeled websites are classified precisely, which improves the accuracy of website classification. This embodiment both reduces the amount of data to be processed for network classification and improves the accuracy of network classification, thereby effectively improving the efficiency of network classification and solving the problem that existing website classification techniques must process a large amount of data and are inefficient. In addition, the network access behavior features are information about users' network access behavior obtained by analyzing users' communication network messages. They reflect the true category of a website well, which improves the accuracy of website classification and in turn its efficiency.
Embodiment two
Referring to Fig. 2, a flowchart of the steps of a website classification method according to Embodiment two of the present application is shown.
The website classification method of this embodiment comprises the following steps:
Step S202: From the obtained communication network messages, and based on correspondence information between query words and websites, extract relationship features of the websites as the network access behavior features.
In this embodiment, gateways are deployed at a plurality of geographic locations, and all users' communication network messages are obtained through the gateways. The raw communication network messages are processed and analyzed to form correspondence information between query words (also called web page query words or website query words) and the websites clicked, referred to in this application as "query word website relation information". From the "query word website relation information" it can be determined that the websites clicked have a certain similarity to one another. Note that a query word reflects the user's query intent, and the website clicked is generally the result the user wanted, so "query word website relation information" is one of the better text features for website classification. In practice, the "query word website relation information" can be used directly as the network access behavior feature, or the relationship features between websites determined from it can be used as the network access behavior feature.
Using the extracted relationship features of websites as the network access behavior features makes it possible to divide the related websites in the website database into one set conveniently and effectively, which reduces the amount of data processed for website classification and improves its efficiency.
Of course, the method is not limited to this; other features that effectively identify users' network access behavior can also be used as network access behavior features.
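One possible way to derive a relationship feature between websites from "query word website relation information" is to count how many query words two websites share. The sketch below does this; the shared-query count and the name `co_click_counts` are assumptions chosen for illustration, since the application leaves the exact relationship feature open.

```python
from collections import defaultdict
from itertools import combinations


def co_click_counts(pairs):
    """Count, for each pair of websites, the number of query words that led to both.

    pairs: iterable of (query word, website) tuples
    Returns a dict keyed by (site_a, site_b) with site_a < site_b.
    """
    sites_by_query = defaultdict(set)
    for query, site in pairs:
        sites_by_query[query].add(site)

    counts = defaultdict(int)
    for sites in sites_by_query.values():
        # Every two websites clicked for the same query word are treated as related.
        for a, b in combinations(sorted(sites), 2):
            counts[(a, b)] += 1
    return dict(counts)


pairs = [
    ("patent", "A"), ("patent", "B"), ("patent", "C"),
    ("invention", "A"), ("invention", "B"),
    ("weather", "W"),
]
relations = co_click_counts(pairs)
```

In this toy input, A and B share two query words, while W is clicked for a query word that no other website shares and so has no relation entry.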
Step S204: Randomly extract a certain number of websites from the website database and label them with categories.
The number of websites extracted can be preset or set at random, and the websites to be labeled can of course also be extracted in proportion. In theory, the more websites are labeled, the more comprehensive and fine-grained the website categories and the more accurate the website classification.
Randomly extracting some websites from the website database for manual labeling makes it possible to determine the categories of related websites accurately, using the labeled websites as the basis for classification.
Note that steps S204 and S202 may be performed in either order.
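The sampling in step S204 might look like the sketch below, which supports either a fixed count or a proportion, as the text allows. The function name `sample_sites_to_label`, its parameters, and the optional seed for reproducibility are assumptions for this example.

```python
import random


def sample_sites_to_label(sites, count=None, ratio=None, seed=None):
    """Randomly pick websites for labeling, by fixed count or by proportion."""
    if (count is None) == (ratio is None):
        raise ValueError("give exactly one of count or ratio")
    sites = sorted(sites)  # fixed order so that a seed gives a reproducible sample
    if ratio is not None:
        count = max(1, round(len(sites) * ratio))
    count = min(count, len(sites))  # never ask for more websites than exist
    return random.Random(seed).sample(sites, count)


database = {f"site{i}.example" for i in range(10)}
by_count = sample_sites_to_label(database, count=3, seed=42)
by_ratio = sample_sites_to_label(database, ratio=0.5, seed=42)
```

The selected websites would then be handed to a human annotator or a rule-based labeler, which is outside the scope of this sketch.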
Step S206: According to the network access behavior features, divide all websites in the website database into a plurality of different website sets.
To make the scheme of this embodiment easier to understand, a simplified example is given below as an illustration. Suppose the query word entered by users is "patent" and the websites clicked for this query word are three websites A, B, and C. Through a series of data analyses it can be inferred that these three websites are likely to be quite similar, and the relation that A, B, and C are highly similar is used as the network access behavior feature. From this network access behavior feature, combined with some data analysis, it can be inferred that websites A, B, and C in the website database belong to the same website set. In addition, by analyzing the communication network messages, combined with some data analysis, it can also be inferred that websites strongly connected to these three websites belong to the same website set as A, B, and C. For example, if analysis of the communication network messages shows that most users querying "patent" also visit websites D and E while visiting one or more of A, B, and C, or reach D and E via one or more of A, B, and C, then A, B, C, D, and E can be divided into one website set.
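The division in the example above can be sketched by treating strongly related websites as connected and collecting connected websites into one set. The application only speaks of "a series of data analyses", so the union-find grouping, the `min_count` threshold, and the relation counts used here are assumptions for illustration.

```python
def group_sites(all_sites, relation_counts, min_count=1):
    """Divide websites into sets by merging pairs whose relation is strong enough.

    relation_counts: dict mapping (site_a, site_b) to a relation strength
    min_count:       minimum strength for two websites to share a set
    """
    parent = {s: s for s in all_sites}

    def find(s):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), n in relation_counts.items():
        if a in parent and b in parent and n >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for s in all_sites:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())


all_sites = ["A", "B", "C", "D", "E", "X", "Y"]
# Hypothetical strengths: A, B, C share the query "patent"; D and E are co-visited with them.
relation_counts = {("A", "B"): 5, ("B", "C"): 4, ("C", "D"): 3, ("A", "E"): 3, ("X", "Y"): 1}
website_sets = group_sites(all_sites, relation_counts, min_count=2)
```

With a threshold of 2, websites A through E end up in one set, while X and Y stay in separate single-website sets because their relation is too weak.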
Step S208: According to the labeled websites in the website database, determine the categories of the websites in each of the plurality of different website sets.
In this embodiment, suppose any one of websites A, B, C, D, and E has been labeled as a patent website; then A, B, C, D, and E are all classified as patent websites.
Preferably, when determining website categories, it is first judged whether a given website set among the plurality of different website sets contains at least one labeled website; if so, the websites in that set are determined to have the same category as the labeled website contained in the set.
For website sets in which no website has been labeled, the websites can be classified by an existing classification method, for example automatic classification, or they can be assigned directly to an "other" category. Automatic classification is simple to implement, convenient, and flexible, and can save on the implementation cost of website classification.
This embodiment classifies websites based on users' network access behavior information and the results of manually labeling some websites. In the classification process, the relationship features of websites, extracted from the communication network messages based on correspondence information between query words and websites, are used as the network access behavior features. These features are well representative, so the websites in the website database can be divided into sets fairly accurately on this basis. Compared with the prior art, which extracts web page features by analyzing the massive data of web pages or websites, the scheme of this embodiment only needs to analyze communication network messages, and the data volume of communication network messages is much smaller than that of websites or web pages. This reduces the amount of data to be processed for website classification and improves its efficiency. Moreover, when determining the categories of website sets, those sets that contain no manually labeled website are classified automatically or assigned directly to the "other" category, which saves on the implementation cost of website classification.
Embodiment three
Referring to Fig. 3, a structural block diagram of a website classification device according to Embodiment three of the present application is shown.
The website classification device of this embodiment comprises: a division module 302, configured to divide all websites in the website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages; and a classification module 304, which can be connected to the division module 302 and is configured to, if a website set contains a labeled website, determine the category of the other websites in the website set to which the labeled website belongs to be the category of the labeled website, wherein a labeled website is a website extracted from the website database and labeled with a category in advance.
Preferably, the website classification device of this embodiment further comprises: a feature extraction module 306, which can be connected to the division module 302 and the labeling module 308 respectively, and is configured to, before the division module 302 divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, extract, from the obtained communication network messages and based on correspondence information between query words and websites, relationship features of the websites as the network access behavior features.
Preferably, the classification module 304 is further configured to, if none of the websites in a website set is a labeled website, automatically classify all websites in that set or assign them directly to the "other" category.
Preferably, the website classification device of this embodiment further comprises: a labeling module 308, which can be connected to the division module 302 and the feature extraction module 306 respectively, and is configured to, before the division module 302 divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, randomly extract a set number of websites from the website database and label them with categories.
The feature extraction module 306 and the labeling module 308 may execute in either order.
The website classification device of this embodiment is used to implement the corresponding website classification methods of the foregoing method embodiments and has the beneficial effects of those method embodiments, which are not repeated here.
Embodiment four
The website classification device of this embodiment mainly comprises two modules: a communication network message pre-processing module and a website classification module.
Specifically:
The communication network message pre-processing module (equivalent to the feature extraction module in Embodiment three) is mainly responsible for pre-processing the raw communication network messages to form correspondence information between query words and the websites clicked, referred to as "query word website relation information", which is used as the network access behavior feature. Note that a query word reflects the user's query intent, and the website clicked is generally the result the user wanted, so query word website relation information is one of the better text features for website classification.
The website classification module (equivalent to the labeling module, division module, and classification module in Embodiment three) is mainly responsible for randomly extracting some websites for manual labeling; completing the classification of websites with a machine learning classification model, based on the query word website relation information and the manual labeling results; and automatically classifying the websites that cannot be classified from the manual labeling results, to form the website classification result.
The process of network classification using the above website classification device of this embodiment is shown in Fig. 4. In Fig. 4, the communication network messages are processed by the communication network message pre-processing module to generate the query word website relation information; the website classification module then classifies all websites in the website database according to this information and the websites labeled in advance, forming the website classification result. The labeling of websites may take place before or after the generation of the query word website relation information. Through the processing of these two modules, a website classification system is built from the raw communication network messages.
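The two-module flow of Fig. 4 can be sketched end to end as below. This is a simplified stand-in under stated assumptions: messages are taken to be already parsed into (query word, clicked URL) tuples, websites sharing any query word are grouped together, and label propagation replaces the machine learning classification model mentioned above. The names `preprocess` and `classify` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def preprocess(messages):
    """Pre-processing module: build query word -> set of clicked websites."""
    relation = defaultdict(set)
    for query, url in messages:
        host = urlsplit(url).hostname
        if query and host:
            relation[query].add(host)
    return relation


def classify(relation, all_sites, labeled, other="other"):
    """Classification module: group websites sharing a query word, then propagate labels."""
    groups = []
    for sites in relation.values():
        merged = set(sites)
        rest = []
        for g in groups:
            # Merge every existing group that overlaps the websites of this query word.
            if g & merged:
                merged |= g
            else:
                rest.append(g)
        groups = rest + [merged]
    # Websites never seen in the messages form single-website sets.
    seen = set().union(*groups)
    groups += [{s} for s in all_sites if s not in seen]

    result = {}
    for g in groups:
        labels = sorted({labeled[s] for s in g if s in labeled})
        category = labels[0] if labels else other  # assumption: first label alphabetically
        for s in g:
            result[s] = labeled.get(s, category)
    return result


messages = [
    ("patent", "http://a.example/x"), ("patent", "http://b.example/"),
    ("invention", "http://b.example/p"), ("invention", "http://c.example/"),
    ("weather", "http://w.example/"),
]
all_sites = {"a.example", "b.example", "c.example", "w.example", "z.example"}
result = classify(preprocess(messages), all_sites, {"a.example": "patent"})
```

In this toy run, b.example and c.example inherit the "patent" label from a.example through shared query words, while w.example and z.example are placed in the "other" category.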
The website classification device of this embodiment is used to implement the corresponding website classification methods of the foregoing method embodiments and has the beneficial effects of those method embodiments, which are not repeated here.
The present application classifies websites based on users' network access behavior information and the results of manually labeling some websites. The application achieves the following:
(1) The application proposes a scheme for classifying websites based on communication network message information. It uses a better text feature, namely query word website relation information, for website classification, which solves the problems that traditional schemes must process massive data and have low accuracy.
(2) The application can classify most websites currently online with higher accuracy, and its more accurate classification results can be used in many network applications. For example, a website navigation portal better than www.hao123.com can be built, or sensitive websites can be identified for public opinion monitoring, and so on.
(3) Based on the website classification results of the application, users' interests can be mined further and accurate user profiles built to guide targeted advertising, and so on.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments can be referred to one another. The device embodiments are described relatively briefly because they are basically similar to the method embodiments; for relevant details, see the corresponding parts of the method embodiments.
The website classification method and device provided by the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is only intended to help in understanding the method of the application and its core idea. At the same time, for a person of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the application. In summary, the content of this description should not be construed as limiting the application.

Claims (8)

1. A website classification method, characterized by comprising:
dividing all websites in a website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages;
if a website set contains a website that has been labeled and classified, determining the category of the other websites in the website set to which the labeled and classified website belongs to be the category of the labeled and classified website, wherein the labeled and classified website is a website that has been extracted from the website database and labeled and classified in advance.
2. The method according to claim 1, characterized in that, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, the method further comprises:
extracting, from the obtained communication network messages and based on correspondence information between query words and websites, relationship features of the websites as the network access behavior features.
3. The method according to claim 1, characterized by further comprising:
if none of the websites in a website set is a labeled and classified website, automatically classifying all websites in that website set.
4. The method according to claim 3, characterized in that, before the step of dividing all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, the method further comprises:
randomly extracting a set number of websites from the website database and labeling and classifying them.
5. A website classification device, characterized by comprising:
a division module, configured to divide all websites in a website database into a plurality of different website sets according to network access behavior features obtained in advance, wherein the network access behavior features are obtained by performing message feature extraction on communication network messages;
a classification module, configured to, if a website set contains a website that has been labeled and classified, determine the category of the other websites in the website set to which the labeled and classified website belongs to be the category of the labeled and classified website, wherein the labeled and classified website is a website that has been extracted from the website database and labeled and classified in advance.
6. The device according to claim 5, characterized by further comprising:
a feature extraction module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, extract, from the obtained communication network messages and based on correspondence information between query words and websites, relationship features of the websites as the network access behavior features.
7. The device according to claim 5, characterized in that the classification module is further configured to automatically classify all websites in a website set if none of the websites in that website set is a labeled and classified website.
8. The device according to claim 7, characterized by further comprising:
a labeling module, configured to, before the division module divides all websites in the website database into a plurality of different website sets according to the network access behavior features obtained in advance, randomly extract a set number of websites from the website database and label and classify them.
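As a non-authoritative sketch, the labeling and category assignment steps recited in the claims above might look as follows in Python. The function names are hypothetical. The claims do not say how to proceed when one website set contains labeled websites of different categories, so the majority vote used here is an assumption; likewise, website sets without any labeled website are only returned for separate handling, because the claims do not describe how such sets are classified automatically.

```python
import random
from collections import Counter


def sample_for_labeling(sites, k, seed=None):
    """Randomly draw k websites from the database for manual labeling."""
    rng = random.Random(seed)
    return rng.sample(sorted(sites), k)


def propagate_labels(site_sets, labeled):
    """Give every website in a set the category of its labeled member(s).

    site_sets: iterable of website sets (lists or sets of site names).
    labeled:   dict mapping manually labeled sites to their category.
    Returns (categories, unlabeled_sets), where unlabeled_sets holds the
    website sets that contain no labeled website at all.
    """
    categories = {}
    unlabeled_sets = []
    for site_set in site_sets:
        found = [labeled[s] for s in site_set if s in labeled]
        if not found:
            # No labeled website in this set: left for separate handling.
            unlabeled_sets.append(site_set)
            continue
        # Assumption: resolve conflicting labels by majority vote.
        category = Counter(found).most_common(1)[0][0]
        for s in site_set:
            # Manually labeled websites keep their own label.
            categories[s] = labeled.get(s, category)
    return categories, unlabeled_sets
```

In this sketch a single labeled website is enough to categorize its whole website set, which is why only a set number of randomly drawn websites needs to be labeled by hand.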
CN201110436167.9A 2011-12-22 2011-12-22 Website classification method and device Expired - Fee Related CN102567494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110436167.9A CN102567494B (en) 2011-12-22 2011-12-22 Website classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110436167.9A CN102567494B (en) 2011-12-22 2011-12-22 Website classification method and device

Publications (2)

Publication Number Publication Date
CN102567494A true CN102567494A (en) 2012-07-11
CN102567494B CN102567494B (en) 2014-07-02

Family

ID=46412896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110436167.9A Expired - Fee Related CN102567494B (en) 2011-12-22 2011-12-22 Website classification method and device

Country Status (1)

Country Link
CN (1) CN102567494B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server
CN105335449A (en) * 2014-08-15 2016-02-17 北京奇虎科技有限公司 Search engine database based automatic sample mining method and apparatus
CN105447077A (en) * 2015-11-04 2016-03-30 清华大学 Query word extraction method and system based on OpenFlow
WO2016115319A1 (en) * 2015-01-15 2016-07-21 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for generating and using a web page classification model
CN106294443A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 Knowledge-base-based URL classification and recognition method and system
CN106294442A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 URL-based Internet information classification and recognition method and system
CN106649384A (en) * 2015-11-03 2017-05-10 中国电信股份有限公司 Method and device for classifying URL (Uniform Resource Locator)
CN106708843A (en) * 2015-11-12 2017-05-24 北京国双科技有限公司 Pushing method and device for website search term
CN108073667A (en) * 2016-11-11 2018-05-25 财团法人工业技术研究院 Method for generating user browsing attributes, and non-transitory computer readable medium
CN111966948A (en) * 2020-09-25 2020-11-20 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040243676A1 (en) * 2003-05-24 2004-12-02 Blankenship Mark H. Message manager for tracking customer attributes
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Mengqing et al.: "Research on website classification based on user HTTP behavior analysis", Computer Engineering and Design *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server
CN105335449A (en) * 2014-08-15 2016-02-17 北京奇虎科技有限公司 Search engine database based automatic sample mining method and apparatus
CN105335449B (en) * 2014-08-15 2019-03-01 北京奇虎科技有限公司 Sample automatic mining method and device based on search engine database
WO2016115319A1 (en) * 2015-01-15 2016-07-21 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for generating and using a web page classification model
US10530671B2 (en) 2015-01-15 2020-01-07 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for generating and using a web page classification model
CN106294443A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 Knowledge-base-based URL classification and recognition method and system
CN106294442A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 URL-based Internet information classification and recognition method and system
CN106649384A (en) * 2015-11-03 2017-05-10 中国电信股份有限公司 Method and device for classifying URL (Uniform Resource Locator)
CN106649384B (en) * 2015-11-03 2019-07-09 中国电信股份有限公司 Method and device for classifying URLs
CN105447077A (en) * 2015-11-04 2016-03-30 清华大学 Query word extraction method and system based on OpenFlow
CN106708843A (en) * 2015-11-12 2017-05-24 北京国双科技有限公司 Pushing method and device for website search term
CN108073667A (en) * 2016-11-11 2018-05-25 财团法人工业技术研究院 Method for generating user browsing attributes, and non-transitory computer readable medium
CN108073667B (en) * 2016-11-11 2021-08-27 财团法人工业技术研究院 Method for generating user browsing attributes, and non-transitory computer readable medium
CN111966948A (en) * 2020-09-25 2020-11-20 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium
CN111966948B (en) * 2020-09-25 2023-08-01 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102567494B (en) 2014-07-02

Similar Documents

Publication Publication Date Title
CN102567494B (en) Website classification method and device
CN103164427B (en) News aggregation method and device
CN101794311B (en) Fuzzy data mining based automatic classification method of Chinese web pages
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
CN103226578B (en) Website identification and fine-grained web page classification method for the medical domain
Chakrabarti et al. Page-level template detection via isotonic smoothing
CN103365924B (en) A kind of method of internet information search, device and terminal
CN102542061B (en) Intelligent product classification method
CN104794242B (en) Searching method
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN105404699A (en) Method, device and server for searching articles of finance and economics
CN101393555A (en) Rubbish blog detecting method
CN103914478A (en) Webpage training method and system and webpage prediction method and system
CN104462611A (en) Modeling method, ranking method, modeling device and ranking device for information ranking model
CN103116635B (en) Field-oriented method and system for collecting invisible web resources
CN104268148A (en) Forum page information auto-extraction method and system based on time strings
CN103577478A (en) Web page pushing method and system
CN104778208A (en) Method and system for optimally grasping search engine SEO (search engine optimization) website data
CN102681994A (en) Webpage information extracting method and system
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN103838754A (en) Information searching device and method
CN102811207A (en) Network information pushing method and system
CN103942268A (en) Method and device for combining search and application and application interface
CN108959580A (en) A kind of optimization method and system of label data
CN105117434A (en) Webpage classification method and webpage classification system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: 100081, building 2, building 18, 1607 South Main Street, Beijing, Haidian District, Zhongguancun, China

Patentee after: Izp (China) Network Technology Co.,Ltd.

Address before: 100081, building 2, building 18, 1607 South Main Street, Beijing, Haidian District, Zhongguancun, China

Patentee before: BEIJING IZP NETWORK TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20151230

Address after: 100190, Haidian District, Beijing South Street, northeast flourishing, Beijing Zhongguancun software incubator, building 1, block C, three, 1322-D

Patentee after: IZP (BEIJING) TECHNOLOGIES Co.,Ltd.

Address before: 100081, building 2, building 18, 1607 South Main Street, Beijing, Haidian District, Zhongguancun, China

Patentee before: Izp (China) Network Technology Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

Termination date: 20181222