CN101211339A - Intelligent web page classifier based on user behaviors - Google Patents
Intelligent web page classifier based on user behaviors
- Publication number
- CN101211339A CN101211339A CNA2006101483419A CN200610148341A CN101211339A CN 101211339 A CN101211339 A CN 101211339A CN A2006101483419 A CNA2006101483419 A CN A2006101483419A CN 200610148341 A CN200610148341 A CN 200610148341A CN 101211339 A CN101211339 A CN 101211339A
- Authority
- CN
- China
- Prior art keywords
- user
- classification
- web page
- sample set
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An intelligent web page classifier based on user behavior operates as follows. (1) An initial classification sample set is input in the background and used for training, yielding the cluster center of each class in the feature space. (2) A URL entered by the user is received; the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing. A feature set is then extracted from the user's input and the page content, the feature space of the initial classification sample set is revised through feedback, and the feature weights of the vector space are adjusted. (3) The classifier selected by the user automatically classifies the text generated in the previous step and outputs the result. After the user runs a search, the classifier automatically determines the class of each result and is adjusted step by step; the more searches the user runs, the more accurate the classifier becomes, which helps different users narrow the set of search results and locate the information they need more precisely.
Description
Technical field
The present invention relates to a technique for intelligently classifying web pages, and in particular to training a classifier by combining user behavior features with web page content.
Background technology
Existing web page classifiers fall into two broad categories. The first is manual classification. Directory search services such as YAHOO rely on people to classify the pages stored in a local database. Classification precision is high, but efficiency is very low, updates are slow, and the maintenance workload is large. The second is automatic classification, in which a computer system replaces manual classification of web pages. There are two main approaches: classifiers based on knowledge engineering and classifiers based on statistics. The former rely mainly on linguistic knowledge and require a large number of inference rules to be written as classification knowledge; the results are very accurate, but the implementation is quite complex and development is expensive. The latter ignore the linguistic structure of the text, treat the text as a set of feature terms, represent it as a vector of weighted feature terms, and weight each feature by the frequency with which the word occurs. They are fairly simple to implement and reasonably accurate, and they satisfy the requirements of general use. Their shortcomings are that they do not interact with the user's search behavior and that they are not suited to queries demanding very high accuracy, such as scientific literature searches.
Summary of the invention
To overcome the above shortcomings of existing classifiers, the invention provides a new intelligent web page classifier built on the traditional statistics-based vector space model classifier. The classifier not only meets the requirement of accurately classifying a very large store of web pages, but also dynamically adjusts the initial classification sample set according to user behavior, supplying the search engine front end with classification results that meet the user's needs.
The intelligent web page classifier based on user behavior operates as follows:
(1) An initial classification sample set is input in the background and used for training, yielding the cluster center of each class in the feature space.
(2) A URL entered by the user is received; the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing. A feature set is then extracted from the user's input and the page content, the feature space of the initial classification sample set is revised through feedback, and the feature weights of the vector space are adjusted.
(3) The classifier selected by the user automatically classifies the text generated in the previous step and outputs the result.
After the user has run a search, the classifier automatically determines the class of each result and is adjusted step by step. The more searches the user runs, the more accurate the classifier becomes, which helps different users narrow the set of search results and find the required information more precisely. The method is applicable to text in any language; Asian languages only require a word segmentation method different from the one used for Romance languages.
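The training in step (1) can be illustrated with a minimal Python sketch. The text does not say how the cluster center of a class is computed; the sketch assumes it is the mean of the weight vectors of that class's samples. The function name `class_centroids` and the sample data are hypothetical.

```python
from collections import defaultdict


def class_centroids(samples):
    """Return the mean weight vector of each class (assumed cluster center).

    `samples` is a list of (class_label, weight_vector) pairs; all vectors
    must have the same length.
    """
    sums = {}
    counts = defaultdict(int)
    for label, vector in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        for k, weight in enumerate(vector):
            sums[label][k] += weight
        counts[label] += 1
    # Divide each accumulated component by the number of samples in the class.
    return {
        label: [component / counts[label] for component in total]
        for label, total in sums.items()
    }


# Hypothetical initial classification sample set with three feature terms.
initial_samples = [
    ("sports", [1.0, 0.0, 2.0]),
    ("sports", [3.0, 0.0, 0.0]),
    ("finance", [0.0, 4.0, 0.0]),
]
centroids = class_centroids(initial_samples)
```

Other definitions of the cluster center (for example a medoid) would fit the same interface; the mean is used here only because it is the simplest choice.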
Specific embodiment:
In the vector space model, a text, denoted D (Document), refers broadly to any machine-readable record. A feature term T (Term) is a basic language unit that appears in document D and can represent its content; it is usually a word or phrase. A text can be represented by its set of feature terms as $D(T_1, T_2, \ldots, T_n)$, where $T_k$ is a feature term and $k \in \{1, 2, \ldots, n\}$. For a text containing $n$ feature terms, each feature term is usually assigned a weight that expresses its importance, giving $D = D(T_1, W_1; T_2, W_2; \ldots; T_n, W_n)$, abbreviated $D = D(W_1, W_2, \ldots, W_n)$. This is called the vector representation of the text, where $W_k$ is the weight of $T_k$. In the vector space model, the content relevance $\mathrm{Sim}(D_i, D_j)$ between any two texts $D_i$ and $D_j$ is commonly expressed as the cosine of the angle between their vectors:

$$\mathrm{Sim}(D_i, D_j) = \cos\theta = \frac{\sum_{k=1}^{n} W_{ik} W_{jk}}{\sqrt{\sum_{k=1}^{n} W_{ik}^{2}}\,\sqrt{\sum_{k=1}^{n} W_{jk}^{2}}}$$

where $W_{ik}$ and $W_{jk}$ are the weights of the $k$-th feature term of texts $D_i$ and $D_j$ respectively, $k \in \{1, 2, \ldots, n\}$. Let Th[i] denote the threshold on the number of times users enter a given keyword in searches.
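As a minimal sketch, the cosine measure of content relevance between two weight vectors can be written in Python as follows. The function name `cosine_similarity` is illustrative, and returning 0.0 for an all-zero vector is an assumption; the text does not say how that case is handled.

```python
import math


def cosine_similarity(w_i, w_j):
    """Cosine of the angle between two weight vectors of equal length."""
    if len(w_i) != len(w_j):
        raise ValueError("weight vectors must have the same length")
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: an empty text is unrelated to everything
    return dot / (norm_i * norm_j)
```

Because all weights are non-negative, the value lies between 0 (no shared feature terms) and 1 (identical direction).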
The specific method steps are as follows:
(1) Select a suitable standard vocabulary. Choose a classification thesaurus that fits the classifier's field of application; for a library application, for example, choose the Dewey Decimal Classification table. Each word in the vocabulary is assigned a distinct serial number and hash function value, and a threshold Th[i] on the number of user inputs is specified for each word.
(2) Build the training sample set. Using the crawl results of a new web crawler, perform a preliminary analysis of the indexed web pages and build the training sample set as the matrix O[i, j] = |W[i, j]|, i ∈ {1, 2, …, M}, j ∈ {1, 2, …, N}. The feature values in each row comprise the class number, the initial weights of a number of keywords, and the initial weights of their synonyms. When the entry at a given row and column has no counterpart in the standard vocabulary or cannot be assigned a weight, its weight in the training sample set is 0. No weight in the matrix may be negative.
(3) Extract and weight web page features. Index the web pages automatically, assigning each word or phrase in a page a weight according to its frequency and the position at which it appears in the page, to obtain the page's original feature set D[i, j]. If D[i, j] = 0, exit; otherwise go to step (4).
(4) Classify the web pages. Compute the content relevance Sim(Di, Oj) between each web page's feature matrix and each row of the training sample set matrix, compare the values to obtain the smallest match value Min(Sim(Di, Oj)), and assign the i-th web page to the j-th class of the initial training set. If all web pages have been classified, exit.
(6) If the user runs a search and the number of searches for the entered keyword is less than the preset threshold Th[i], go to step (7). Otherwise, using the standard vocabulary, compute the weighted mean of the keywords entered by all users, replace the weight of the keyword at the corresponding position in the initial training sample set matrix with this value to obtain a new training sample set matrix O[i, j], and go to step (2).
(7) If the user enters no further keyword searches, the background classifier no longer responds to this user's search requests; the user has obtained satisfactory search results, and training of the sample set stops.
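The steps above can be sketched end to end in Python. This is an illustration under stated assumptions, not the patented implementation: the vocabulary, thresholds, class names, title boost, and all function names are hypothetical; a plain mean stands in for the weighted mean of step (6), whose weights are not specified; and the row updated by feedback is chosen by the caller. The text selects Min(Sim(Di, Oj)), but since a larger cosine means closer vectors, the sketch defaults to the largest value and exposes the choice through the `pick` parameter.

```python
import math
from collections import Counter

# Step (1): hypothetical vocabulary. Each word has a serial number (column
# index) and a threshold Th[i] on the number of user searches.
VOCABULARY = {"football": 0, "goal": 1, "stock": 2, "market": 3}
THRESHOLD = {"football": 2, "goal": 2, "stock": 2, "market": 2}

# Step (2): training sample set O, one row per class, non-negative weights.
CLASSES = ["sports", "finance"]
O = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]


def extract_features(title, body, title_boost=2.0):
    """Step (3): weight words by frequency; title occurrences count extra."""
    vector = [0.0] * len(VOCABULARY)
    for text, boost in ((title, title_boost), (body, 1.0)):
        for word, freq in Counter(text.lower().split()).items():
            if word in VOCABULARY:
                vector[VOCABULARY[word]] += boost * freq
    return vector


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(page_vector, matrix, pick=max):
    """Step (4): index of the class row chosen by `pick` over Sim(D, O_j)."""
    if not any(page_vector):
        return None  # step (3): an all-zero feature set is not classified
    scores = [cosine(page_vector, row) for row in matrix]
    return scores.index(pick(scores))


def apply_feedback(matrix, class_index, keyword, user_weights, search_count):
    """Step (6): replace a keyword weight once Th[i] searches are reached."""
    if search_count < THRESHOLD[keyword] or not user_weights:
        return False  # below the threshold: leave the training set unchanged
    mean = sum(user_weights) / len(user_weights)  # assumption: plain mean
    matrix[class_index][VOCABULARY[keyword]] = max(mean, 0.0)  # never negative
    return True


page = extract_features("Football final", "late goal decides football match")
label = CLASSES[classify(page, O)]
updated = apply_feedback(O, 0, "goal", [2.0, 4.0], search_count=3)
```

Passing `pick=min` to `classify` reproduces the literal wording of step (4). After `apply_feedback` returns `True`, the pages would be classified again against the updated matrix, which corresponds to the return to step (2).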
Claims (2)
1. An intelligent web page classifier based on user behavior, characterized in that:
(1) an initial classification sample set is input in the background and used for training, yielding the cluster center of each class in the feature space;
(2) a URL entered by the user is received; the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing; a feature set is extracted from the user's input and the page content, the feature space of the initial classification sample set is revised through feedback, and the feature weights of the vector space are adjusted;
(3) the classifier selected by the user automatically classifies the text generated in the previous step and outputs the result.
2. The intelligent web page classifier based on user behavior according to claim 1, characterized in that
the specific method steps are as follows:
(1) select a suitable standard vocabulary; choose a classification thesaurus that fits the classifier's field of application, for example the Dewey Decimal Classification table for a library application; each word in the vocabulary is assigned a distinct serial number and hash function value; a threshold Th[i] on the number of user inputs is specified for each word;
(2) build the training sample set; using the crawl results of a new web crawler, perform a preliminary analysis of the indexed web pages and build the training sample set as the matrix O[i, j] = |W[i, j]|, i ∈ {1, 2, …, M}, j ∈ {1, 2, …, N}; the feature values in each row comprise the class number, the initial weights of a number of keywords, and the initial weights of their synonyms; when the entry at a given row and column has no counterpart in the standard vocabulary or cannot be assigned a weight, its weight in the training sample set is 0; no weight in the matrix may be negative;
(3) extract and weight web page features; index the web pages automatically, assigning each word or phrase in a page a weight according to its frequency and the position at which it appears in the page, to obtain the page's original feature set D[i, j]; if D[i, j] = 0, exit; otherwise go to step (4);
(4) classify the web pages; compute the content relevance Sim(Di, Oj) between each web page's feature matrix and each row of the training sample set matrix, compare the values to obtain the smallest match value Min(Sim(Di, Oj)), and assign the i-th web page to the j-th class of the initial training set; if all web pages have been classified, exit;
(6) if the user runs a search and the number of searches for the entered keyword is less than the preset threshold Th[i], go to step (7); otherwise, using the standard vocabulary, compute the weighted mean of the keywords entered by all users, replace the weight of the keyword at the corresponding position in the initial training sample set matrix with this value to obtain a new training sample set matrix O[i, j], and go to step (2);
(7) if the user enters no further keyword searches, the background classifier no longer responds to this user's search requests; the user has obtained satisfactory search results, and training of the sample set stops.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101483419A CN101211339A (en) | 2006-12-29 | 2006-12-29 | Intelligent web page classifier based on user behaviors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2006101483419A CN101211339A (en) | 2006-12-29 | 2006-12-29 | Intelligent web page classifier based on user behaviors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101211339A true CN101211339A (en) | 2008-07-02 |
Family
ID=39611371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2006101483419A Pending CN101211339A (en) | 2006-12-29 | 2006-12-29 | Intelligent web page classifier based on user behaviors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101211339A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102236652A (en) * | 2010-04-27 | 2011-11-09 | 腾讯科技(深圳)有限公司 | Method and device for classifying information |
CN102253937A (en) * | 2010-05-18 | 2011-11-23 | 阿里巴巴集团控股有限公司 | Method and related device for acquiring information of interest in webpages |
CN102411583A (en) * | 2010-09-20 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Text matching method and device |
CN103136360A (en) * | 2013-03-07 | 2013-06-05 | 北京宽连十方数字技术有限公司 | Internet behavior markup engine and behavior markup method corresponding to same |
CN103678422A (en) * | 2012-09-25 | 2014-03-26 | 北京亿赞普网络技术有限公司 | Web page classification method and device and training method and device of web page classifier |
CN103870567A (en) * | 2014-03-11 | 2014-06-18 | 浪潮集团有限公司 | Automatic identifying method for webpage collecting template of vertical search engine in cloud computing |
CN104239351A (en) * | 2013-06-20 | 2014-12-24 | 阿里巴巴集团控股有限公司 | User behavior machine learning model training method and device |
CN104331429A (en) * | 2014-10-21 | 2015-02-04 | 北京奇虎科技有限公司 | Method and device for performing multi-characteristic dimension quantization on network object |
CN104573021A (en) * | 2015-01-12 | 2015-04-29 | 浪潮软件集团有限公司 | Method for analyzing internet behaviors |
CN104615767A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Searching-ranking model training method and device and search processing method |
CN105378699A (en) * | 2013-11-27 | 2016-03-02 | Ntt都科摩公司 | Automatic task classification based upon machine learning |
CN107193814A (en) * | 2016-03-14 | 2017-09-22 | 北京京东尚科信息技术有限公司 | The method and apparatus that the automatic taxonomic revision of books is realized in digital reading |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
2006
- 2006-12-29 CN CNA2006101483419A patent/CN101211339A/en active Pending
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102236652A (en) * | 2010-04-27 | 2011-11-09 | 腾讯科技(深圳)有限公司 | Method and device for classifying information |
CN102253937A (en) * | 2010-05-18 | 2011-11-23 | 阿里巴巴集团控股有限公司 | Method and related device for acquiring information of interest in webpages |
CN102411583A (en) * | 2010-09-20 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Text matching method and device |
CN102411583B (en) * | 2010-09-20 | 2013-09-18 | 阿里巴巴集团控股有限公司 | Method and device for matching texts |
CN103678422A (en) * | 2012-09-25 | 2014-03-26 | 北京亿赞普网络技术有限公司 | Web page classification method and device and training method and device of web page classifier |
CN103136360A (en) * | 2013-03-07 | 2013-06-05 | 北京宽连十方数字技术有限公司 | Internet behavior markup engine and behavior markup method corresponding to same |
CN103136360B (en) * | 2013-03-07 | 2016-09-07 | 北京宽连十方数字技术有限公司 | A kind of internet behavior markup engine and to should the behavior mask method of engine |
CN104239351B (en) * | 2013-06-20 | 2017-12-19 | 阿里巴巴集团控股有限公司 | A kind of training method and device of the machine learning model of user behavior |
CN104239351A (en) * | 2013-06-20 | 2014-12-24 | 阿里巴巴集团控股有限公司 | User behavior machine learning model training method and device |
CN105378699A (en) * | 2013-11-27 | 2016-03-02 | Ntt都科摩公司 | Automatic task classification based upon machine learning |
CN105378699B (en) * | 2013-11-27 | 2018-12-18 | Ntt都科摩公司 | Autotask classification based on machine learning |
CN103870567A (en) * | 2014-03-11 | 2014-06-18 | 浪潮集团有限公司 | Automatic identifying method for webpage collecting template of vertical search engine in cloud computing |
CN104331429A (en) * | 2014-10-21 | 2015-02-04 | 北京奇虎科技有限公司 | Method and device for performing multi-characteristic dimension quantization on network object |
CN104331429B (en) * | 2014-10-21 | 2018-04-27 | 北京奇虎科技有限公司 | The method and device of multiple features dimension quantization is carried out to network object |
CN104573021A (en) * | 2015-01-12 | 2015-04-29 | 浪潮软件集团有限公司 | Method for analyzing internet behaviors |
CN104615767A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Searching-ranking model training method and device and search processing method |
CN104615767B (en) * | 2015-02-15 | 2017-12-29 | 百度在线网络技术(北京)有限公司 | Training method, search processing method and the device of searching order model |
CN107193814A (en) * | 2016-03-14 | 2017-09-22 | 北京京东尚科信息技术有限公司 | The method and apparatus that the automatic taxonomic revision of books is realized in digital reading |
CN107193814B (en) * | 2016-03-14 | 2020-07-31 | 北京京东尚科信息技术有限公司 | Method and device for realizing automatic book sorting in digital reading |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
WO2018196798A1 (en) * | 2017-04-28 | 2018-11-01 | 腾讯科技(深圳)有限公司 | User group classification method and device |
CN108304427B (en) * | 2017-04-28 | 2020-03-17 | 腾讯科技(深圳)有限公司 | User passenger group classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101211339A (en) | Intelligent web page classifier based on user behaviors | |
CN101510221B (en) | Query statement analysis method and system for information retrieval | |
CN110334178B (en) | Data retrieval method, device, equipment and readable storage medium | |
Ahmed et al. | Language identification from text using n-gram based cumulative frequency addition | |
CN101794311A (en) | Fuzzy data mining based automatic classification method of Chinese web pages | |
CN101609450A (en) | Web page classification method based on training set | |
CN108038099B (en) | Low-frequency keyword identification method based on word clustering | |
Bin et al. | Web mining research | |
CN103678422A (en) | Web page classification method and device and training method and device of web page classifier | |
CN112307182A (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN110532450A (en) | A kind of Theme Crawler of Content method based on improvement shark search | |
CN113918702A (en) | Semantic matching-based online legal automatic question-answering method and system | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
Irfan et al. | Implementation of Fuzzy C-Means algorithm and TF-IDF on English journal summary | |
CN103377224B (en) | Identify the method and device of problem types, set up the method and device identifying model | |
Murray et al. | Learning to rank images using semantic and aesthetic labels. | |
CN115618014A (en) | Standard document analysis management system and method applying big data technology | |
CN110728135A (en) | Text theme indexing method and device, electronic equipment and computer storage medium | |
CN107818078B (en) | Semantic association and matching method for Chinese natural language dialogue | |
Zheng et al. | Using high-level semantic features in video retrieval | |
CN109543049A (en) | Method and system for automatically pushing materials according to writing characteristics | |
TW575813B (en) | System and method using external search engine as foundation for segmentation of word | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
CN112948544A (en) | Book retrieval method based on deep learning and quality influence | |
Lin et al. | Support vector machines for text categorization in Chinese question classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20080702 |