CN101211339A - Intelligent web page classifier based on user behaviors - Google Patents


Info

Publication number
CN101211339A
Authority
CN
China
Prior art keywords
user
classification
web page
sample set
webpage
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101483419A
Other languages
Chinese (zh)
Inventor
蔡阳波
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Application filed by SHANGHAI XINSHENG ELECTRONIC TECHNOLOGY Co Ltd
Priority to CNA2006101483419A
Publication of CN101211339A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An intelligent web page classifier based on user behavior operates as follows. (1) An initial classification sample set is input in the background and used for training, yielding the cluster center of each category in the feature space. (2) A URL entered by the user is received; the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing. A feature set is then extracted from the user's input and the page content and is used to apply feedback corrections to the feature space of the initial classification sample set, adjusting the feature weights in the vector space. (3) The classifier selected by the user automatically classifies the text produced in the previous step and outputs the result. After the user runs a search, the classifier automatically determines the category of each result and is adjusted incrementally; the more searches the user runs, the more accurate the classification becomes, which helps different users narrow the set of search results and locate the required information more accurately.

Description

Intelligent web page classifier based on user behavior
Technical field
The present invention relates to a technique for intelligently classifying web pages, and in particular to a classifier that learns from a combination of user behavior features and web page content.
Background art
Existing web page classifiers fall into two main categories. The first is manual classification. Directory-based search services such as YAHOO classify the pages stored in a local database by hand. Accuracy is high, but efficiency is very low, updates are slow, and the maintenance workload is heavy. The second is automatic classification, in which a computer system replaces human classifiers. There are two main approaches: knowledge-engineering-based (KBE) classifiers and statistics-based classifiers. The former rely mainly on linguistic knowledge and need a large number of hand-written inference rules as classification knowledge; their results are very accurate, but they are quite complex to implement and expensive to develop. The latter ignore the linguistic structure of the text and treat it as a set of feature items, representing the text as a vector of weighted feature items whose weights are derived from word frequency. They are relatively simple to implement and fairly accurate, which is sufficient for general applications. Their shortcoming is that they do not interact with the user's search behavior, and they are not suited to queries that demand very high accuracy, such as scientific literature searches.
Summary of the invention
To overcome the above shortcomings of existing classifiers, the invention builds on the traditional statistics-based vector space model web page classifier and provides a new intelligent web page classifier. This classifier not only meets the requirement of accurately classifying a very large store of web pages, but also dynamically adjusts the initial classification sample set according to user behavior, so that the search engine front end receives classification results that meet users' needs.
The intelligent web page classifier based on user behavior works as follows:
(1) An initial classification sample set is input in the background and used for training, yielding the cluster center of each category in the feature space.
(2) A URL entered by the user is received; the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing. A feature set is then extracted from the user's input and the page content and is used to apply feedback corrections to the feature space of the initial classification sample set, adjusting the feature weights of the vector space.
(3) The classifier selected by the user automatically classifies the text produced in the previous step and outputs the result.
After the user runs a search, the classifier automatically determines the category of each result and is adjusted incrementally. The more searches the user runs, the more accurate the classification becomes, which helps different users narrow the set of search results and find the required information more accurately. The method applies to text in any language; the only difference is that Asian languages require a word segmentation method different from that used for Romance languages.
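The text does not say how the cluster center of each category is computed. As a minimal sketch, assuming each training sample is a fixed-length weight vector with a category label and that the cluster center is the arithmetic mean of the vectors in that category, step (1) might look like the following. The function name `class_centroids` and the sample data are hypothetical.

```python
from collections import defaultdict


def class_centroids(samples):
    """Average the weight vectors of each category.

    `samples` is a list of (label, weight_vector) pairs whose vectors all
    have the same length. Returns {label: centroid_vector}.
    Assumption: the cluster center is the arithmetic mean of the vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for label, vector in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        for k, weight in enumerate(vector):
            sums[label][k] += weight
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Hypothetical initial sample set: three feature items, two categories.
training = [
    ("sports", [2.0, 0.0, 1.0]),
    ("sports", [4.0, 0.0, 3.0]),
    ("finance", [0.0, 5.0, 1.0]),
]
centroids = class_centroids(training)
```

Other definitions of a cluster center (for example a medoid) would fit the description equally well; the mean is used here only because it is the simplest choice.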
Specific embodiment:
In the vector space model, a text, denoted D (Document), refers broadly to any machine-readable record. A feature item T (Term) is a basic linguistic unit that appears in document D and represents its content; feature items are mainly words or phrases. A text can be represented by its set of feature items as D(T_1, T_2, ..., T_n), where T_k is a feature item and k ∈ {1, 2, ..., n}. For a text with n feature items, each feature item is usually given a weight that indicates its importance, so that D = D(T_1, W_1; T_2, W_2; ...; T_n, W_n), abbreviated D = D(W_1, W_2, ..., W_n). This is called the vector representation of the text, where W_k is the weight of T_k, k ∈ {1, 2, ..., n}. In the vector space model, the content relevance Sim(D_i, D_j) between any two texts D_i and D_j is commonly expressed as the cosine of the angle between their vectors:
$$\mathrm{Sim}(D_i, D_j) = \cos\theta = \frac{\sum_{k=1}^{n} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{n} W_{ik}^{2}\right)\left(\sum_{k=1}^{n} W_{jk}^{2}\right)}}$$
where W_ik and W_jk are the weights of the k-th feature item of texts D_i and D_j respectively, k ∈ {1, 2, ..., n}. Let Th[i] denote the threshold on the number of times a given keyword is entered in user searches.
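The cosine measure above can be written directly from its definition. The following is a minimal sketch; the function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector is all zeros is an assumption, since the text does not cover that case.

```python
import math


def cosine_similarity(w_i, w_j):
    """Sim(D_i, D_j): cosine of the angle between two weight vectors."""
    if len(w_i) != len(w_j):
        raise ValueError("weight vectors must have the same length")
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm = math.sqrt(sum(a * a for a in w_i) * sum(b * b for b in w_j))
    if norm == 0:
        # Assumption: an all-zero vector is treated as unrelated to everything.
        return 0.0
    return dot / norm
```

Because the weights are non-negative, the value lies between 0 (no shared feature items) and 1 (weight vectors pointing in the same direction).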
The specific method steps are as follows:
(1) Select a suitable standard vocabulary. Choose a classification thesaurus appropriate to the classifier's field of application; for a library application, for example, choose the Dewey Decimal Classification table. Each word in the vocabulary is assigned a distinct serial number and hash function value, and a threshold Th[i] on the number of user inputs is specified for each word.
(2) Build the training sample set. Using the crawl results of the new web crawler, perform a preliminary analysis of the indexed web pages and build the training sample set as a matrix O[i, j] = |W[i, j]|, i ∈ {1, 2, ..., M}, j ∈ {1, 2, ..., N}. The feature values of each row comprise a category number, the initial weights of a number of keywords, and the initial weights of their synonyms. When a given row and column has no counterpart in the standard vocabulary or cannot be assigned a weight, the weight at that position of the training sample set is 0. No weight in the matrix may be negative.
(3) Extract and weight web page features. Automatically index the indexed web pages, assigning weights to the words and phrases in each page according to their word frequency and the positions at which they occur in the page, to obtain the original feature set D[i, j] of the page. If D[i, j] = 0, exit; otherwise go to step (4).
(4) Classify the web pages. Compute the content relevance Sim(Di, Oj) between each web page feature matrix and each row of the training sample set matrix, compare to obtain the minimum match value Min(Sim(Di, Oj)), and assign the i-th web page to the j-th class of the initial training set. If all web pages have been classified, exit.
(6) When the user searches, if the number of times the keyword has been entered is less than the preset threshold Th[i], go to step (7). Otherwise, using the standard vocabulary, compute the weighted average of all keywords entered by users, replace the keyword weight at the corresponding position of the initial training sample set matrix with this value to obtain a new training sample set matrix O[i, j], and go to step (2).
(7) If the user enters no further keyword searches, the background classifier does not respond to this user's search requests; the user has obtained satisfactory search results, and training of the sample set stops.
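Steps (4) and (6) can be illustrated with a short sketch under several assumptions that are not in the text: the training matrix is a plain list of rows, one per class; the weighted average in step (6) is approximated by a simple arithmetic mean; and the helper names `classify` and `apply_feedback` are hypothetical. Step (4) as written selects the minimum match value, whereas nearest-class schemes built on cosine similarity conventionally select the maximum, so the selection rule is exposed as a parameter rather than fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros (assumption)."""
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def classify(page_vector, class_rows, pick=max):
    """Step (4): return the index j of the row selected by `pick`.

    `pick=max` is the conventional reading (most similar class);
    `pick=min` follows the wording Min(Sim(Di, Oj)) literally.
    """
    scores = [cosine(page_vector, row) for row in class_rows]
    return scores.index(pick(scores))


def apply_feedback(class_rows, j, k, search_count, threshold, user_weights):
    """Step (6): once keyword k has been entered at least `threshold` times,
    overwrite O[j][k] with the mean of the user-derived weights.

    Returns True if the matrix was updated, False otherwise.
    """
    if search_count < threshold or not user_weights:
        return False
    class_rows[j][k] = sum(user_weights) / len(user_weights)
    return True


# Hypothetical training matrix O: two classes, three feature items.
O = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
page = [0.0, 2.0, 1.0]

best = classify(page, O)               # most similar class
literal = classify(page, O, pick=min)  # literal "minimum match value" reading

updated = apply_feedback(O, j=1, k=2, search_count=5, threshold=3,
                         user_weights=[0.25, 0.75, 0.5])
skipped = apply_feedback(O, j=0, k=0, search_count=1, threshold=3,
                         user_weights=[0.2])
```

After an update, the description returns to step (2), so in a fuller implementation the pages would be re-weighted and re-classified against the new matrix.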

Claims (2)

1. An intelligent web page classifier based on user behavior, characterized in that:
(1) an initial classification sample set is input in the background and used for training, yielding the cluster center of each category in the feature space;
(2) a URL entered by the user is received, and the background process fetches and analyzes the corresponding page and outputs the text in the page that is worth indexing; a feature set is extracted from the user's input and the page content and is used to apply feedback corrections to the feature space of the initial classification sample set, adjusting the feature weights of the vector space;
(3) the classifier selected by the user automatically classifies the text produced in the previous step and outputs the result.
2. The intelligent web page classifier based on user behavior according to claim 1, characterized in that the specific method steps are as follows:
(1) select a suitable standard vocabulary; choose a classification thesaurus appropriate to the classifier's field of application, for example the Dewey Decimal Classification table for a library application; each word in the vocabulary is assigned a distinct serial number and hash function value, and a threshold Th[i] on the number of user inputs is specified for each word;
(2) build the training sample set; using the crawl results of the new web crawler, perform a preliminary analysis of the indexed web pages and build the training sample set as a matrix O[i, j] = |W[i, j]|, i ∈ {1, 2, ..., M}, j ∈ {1, 2, ..., N}; the feature values of each row comprise a category number, the initial weights of a number of keywords, and the initial weights of their synonyms; when a given row and column has no counterpart in the standard vocabulary or cannot be assigned a weight, the weight at that position of the training sample set is 0, and no weight in the matrix may be negative;
(3) extract and weight web page features; automatically index the indexed web pages, assigning weights to the words and phrases in each page according to their word frequency and the positions at which they occur in the page, to obtain the original feature set D[i, j] of the page; if D[i, j] = 0, exit; otherwise go to step (4);
(4) classify the web pages; compute the content relevance Sim(Di, Oj) between each web page feature matrix and each row of the training sample set matrix, compare to obtain the minimum match value Min(Sim(Di, Oj)), and assign the i-th web page to the j-th class of the initial training set; if all web pages have been classified, exit;
(6) when the user searches, if the number of times the keyword has been entered is less than the preset threshold Th[i], go to step (7); otherwise, using the standard vocabulary, compute the weighted average of all keywords entered by users, replace the keyword weight at the corresponding position of the initial training sample set matrix with this value to obtain a new training sample set matrix O[i, j], and go to step (2);
(7) if the user enters no further keyword searches, the background classifier does not respond to this user's search requests; the user has obtained satisfactory search results, and training of the sample set stops.
CNA2006101483419A 2006-12-29 2006-12-29 Intelligent web page classifier based on user behaviors Pending CN101211339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101483419A CN101211339A (en) 2006-12-29 2006-12-29 Intelligent web page classifier based on user behaviors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2006101483419A CN101211339A (en) 2006-12-29 2006-12-29 Intelligent web page classifier based on user behaviors

Publications (1)

Publication Number Publication Date
CN101211339A true CN101211339A (en) 2008-07-02

Family

ID=39611371

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101483419A Pending CN101211339A (en) 2006-12-29 2006-12-29 Intelligent web page classifier based on user behaviors

Country Status (1)

Country Link
CN (1) CN101211339A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236652A (en) * 2010-04-27 2011-11-09 腾讯科技(深圳)有限公司 Method and device for classifying information
CN102253937A (en) * 2010-05-18 2011-11-23 阿里巴巴集团控股有限公司 Method and related device for acquiring information of interest in webpages
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Text matching method and device
CN102411583B (en) * 2010-09-20 2013-09-18 阿里巴巴集团控股有限公司 Method and device for matching texts
CN103678422A (en) * 2012-09-25 2014-03-26 北京亿赞普网络技术有限公司 Web page classification method and device and training method and device of web page classifier
CN103136360A (en) * 2013-03-07 2013-06-05 北京宽连十方数字技术有限公司 Internet behavior markup engine and behavior markup method corresponding to same
CN103136360B (en) * 2013-03-07 2016-09-07 北京宽连十方数字技术有限公司 A kind of internet behavior markup engine and to should the behavior mask method of engine
CN104239351B (en) * 2013-06-20 2017-12-19 阿里巴巴集团控股有限公司 A kind of training method and device of the machine learning model of user behavior
CN104239351A (en) * 2013-06-20 2014-12-24 阿里巴巴集团控股有限公司 User behavior machine learning model training method and device
CN105378699A (en) * 2013-11-27 2016-03-02 Ntt都科摩公司 Automatic task classification based upon machine learning
CN105378699B (en) * 2013-11-27 2018-12-18 Ntt都科摩公司 Autotask classification based on machine learning
CN103870567A (en) * 2014-03-11 2014-06-18 浪潮集团有限公司 Automatic identifying method for webpage collecting template of vertical search engine in cloud computing
CN104331429A (en) * 2014-10-21 2015-02-04 北京奇虎科技有限公司 Method and device for performing multi-characteristic dimension quantization on network object
CN104331429B (en) * 2014-10-21 2018-04-27 北京奇虎科技有限公司 The method and device of multiple features dimension quantization is carried out to network object
CN104573021A (en) * 2015-01-12 2015-04-29 浪潮软件集团有限公司 Method for analyzing internet behaviors
CN104615767A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Searching-ranking model training method and device and search processing method
CN104615767B (en) * 2015-02-15 2017-12-29 百度在线网络技术(北京)有限公司 Training method, search processing method and the device of searching order model
CN107193814A (en) * 2016-03-14 2017-09-22 北京京东尚科信息技术有限公司 The method and apparatus that the automatic taxonomic revision of books is realized in digital reading
CN107193814B (en) * 2016-03-14 2020-07-31 北京京东尚科信息技术有限公司 Method and device for realizing automatic book sorting in digital reading
CN108304427A (en) * 2017-04-28 2018-07-20 腾讯科技(深圳)有限公司 A kind of user visitor's heap sort method and apparatus
WO2018196798A1 (en) * 2017-04-28 2018-11-01 腾讯科技(深圳)有限公司 User group classification method and device
CN108304427B (en) * 2017-04-28 2020-03-17 腾讯科技(深圳)有限公司 User passenger group classification method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080702