CN103744964A - Webpage classification method based on locality sensitive Hash function - Google Patents

Webpage classification method based on locality sensitive Hash function

Info

Publication number
CN103744964A
Authority
CN
China
Prior art keywords
webpage
hash function
class
fingerprint
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410005868.0A
Other languages
Chinese (zh)
Inventor
蒋昌俊
陈闳中
闫春钢
丁志军
王鹏伟
孙海春
邓晓栋
刘俊俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201410005868.0A
Publication of CN103744964A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for webpage classification based on a locality sensitive hash function. The method first trains a classifier whose input is the training set of all classes and whose output is, for each class, a 64-bit fingerprint representing that class. The classifier is then tested with a test set. Once it passes the test, it is used to classify unclassified webpages: the input to classification is the body text of the webpage to be classified, which after processing is mapped to a 64-bit fingerprint; this fingerprint is compared with the fingerprints of all classes by computing Hamming distances, and the webpage is assigned to the class whose fingerprint has the smallest Hamming distance to it. By using a locality sensitive hash function for text classification, the method greatly improves classification efficiency while maintaining accuracy.

Description

A webpage classification method based on a locality sensitive hash function
Technical field
The invention belongs to the technical field of Internet information retrieval.
Background
Faced with the massive scale of information resources on the Internet, search engines have become an indispensable tool for obtaining information in people's life and work. On the one hand, people want to obtain more and more information; on the other hand, it is increasingly difficult to retrieve the required content quickly and effectively from such a large volume of information. Large-scale text therefore urgently needs to be processed effectively, and text classification arose to meet this need. Commonly used Chinese text classification algorithms include K-nearest-neighbor classification, decision tree classification, and the naive Bayes classification algorithm.
Summary of the invention
The invention discloses a method that uses a locality sensitive hash function to classify text.
The locality sensitive hashing algorithm (Locality Sensitive Hashing, LSH for short) was proposed by Piotr Indyk and Rajeev Motwani in 1999 to solve the nearest neighbor search problem in main memory. The invention applies locality sensitive hashing to text classification in order to obtain good classification performance at low time complexity.
The technical solution provided by the invention is as follows:
In summary, a method for webpage classification based on a locality sensitive hash function is characterized as follows. First, a classifier is trained; its input is the training set of all classes, and its output is a 64-bit fingerprint representing each class. The classifier is then tested with a test set. After it passes the test, the classifier is used to classify unclassified webpages: the input to classification is the body text of the webpage to be classified, which after processing is mapped to a 64-bit fingerprint. This fingerprint is compared with the fingerprints of all classes by computing Hamming distances, and the webpage is assigned to the class with the smallest Hamming distance.
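The final comparison step described above can be sketched as follows. This is a minimal illustration, not the patent's own code: the function names and the two class fingerprints are hypothetical, and ties are broken by dictionary order, which the patent does not specify.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep only the low 64 bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two 64-bit fingerprints differ."""
    return bin((a ^ b) & MASK64).count("1")


def nearest_class(page_fp: int, class_fps: dict) -> str:
    """Return the name of the class whose fingerprint is closest to page_fp."""
    return min(class_fps, key=lambda name: hamming_distance(page_fp, class_fps[name]))


# Hypothetical fingerprints, chosen only to make the example easy to follow.
class_fingerprints = {
    "sports": 0xFFFF0000FFFF0000,
    "finance": 0x0000FFFF0000FFFF,
}
page_fingerprint = 0xFFFF0000FFFF00FF

print(nearest_class(page_fingerprint, class_fingerprints))
```

Each comparison is one XOR and one bit count, so classifying a page costs time proportional to the number of classes, independent of the page's length once its fingerprint is known.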
Specifically, the method for webpage classification based on a locality sensitive hash function is characterized as follows:
In preprocessing, the class names are obtained according to the specific project, and a web crawler then uses each class name with a search engine to collect the sample set for that class.
First step: remove noise such as webpage tags and advertisements from the sample sets of all classes and extract the body text; segment the text with a word segmenter and remove stop words.
Second step: use the classifier training method of the naive Bayes classification algorithm to compute the weight of each feature word of each class.
Third step: use the locality sensitive hash function to map the feature vector of each class to a 64-bit fingerprint. The mapping proceeds as follows:
The input is an N-dimensional text feature vector V in which each feature has a weight. The output is a 64-bit binary signature f.
(1) Initialize a 64-dimensional vector F to 0 and the 64-bit binary signature f to 0.
(2) For each feature in vector V, compute a 64-bit hash value H with a conventional hash algorithm. For 1<=i<=64:
if the i-th bit of H is 1, add the weight of the feature to the i-th element of F;
otherwise, subtract the weight of the feature from the i-th element of F.
(3) If the i-th element of F is greater than 0, set the i-th bit of f to 1; otherwise set it to 0.
(4) Return the signature f.
Fourth step: extract the feature words of the webpage to be classified and their weights, and map them to the webpage's binary signature s with the same locality sensitive hash function.
Fifth step: compare s with the fingerprints of all classes by Hamming distance; the webpage corresponding to s belongs to the class with the smallest Hamming distance.
Sixth step: after the classifier passes testing, use it to classify webpages.
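The mapping in sub-steps (1) to (4) of the third step can be sketched as below. The patent only calls for a "conventional hash algorithm"; using the first 8 bytes of an MD5 digest is an assumption made here for illustration, as are the names `hash64` and `fingerprint`. Bits are indexed from 0 rather than from 1.

```python
import hashlib

BITS = 64


def hash64(feature: str) -> int:
    """64-bit hash of a feature word (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(weights: dict) -> int:
    """Map a {feature word: weight} vector V to a 64-bit signature f."""
    acc = [0.0] * BITS                  # step (1): F starts as all zeros
    for feature, weight in weights.items():
        h = hash64(feature)             # step (2): hash value H of the feature
        for i in range(BITS):
            if (h >> i) & 1:
                acc[i] += weight        # bit i of H is 1: add the weight
            else:
                acc[i] -= weight        # bit i of H is 0: subtract the weight
    sig = 0
    for i in range(BITS):               # step (3): keep the sign of each element
        if acc[i] > 0:
            sig |= 1 << i
    return sig                          # step (4): return the signature f


# Hypothetical feature weights for one class.
print(hex(fingerprint({"goal": 0.4, "match": 0.3, "league": 0.1})))
```

Because each bit of the signature is a weighted vote over all features, two feature vectors that share most of their heavily weighted words tend to agree in most bit positions, which is what makes the Hamming distance between signatures usable as a similarity measure.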
Compared with the prior art, the invention is the first to apply the locality sensitive hash function, previously used in the field of text deduplication, to text classification, so that good classification performance is obtained at low time complexity. The effect is particularly pronounced when the webpage categories are fine-grained and the number of classes is large. The invention can better serve webpage classification and lays a foundation for network information services. By using a locality sensitive hash function to classify text, the invention can greatly improve classification efficiency while maintaining accuracy.
Brief description of the drawings
The invention is described in further detail below with reference to the drawings and embodiments:
Fig. 1 is the detailed workflow of classification.
Fig. 2 is the workflow of text classification based on the locality sensitive hash function.
Embodiment
First, classes are extracted from the manually curated category directory of DMOZ. Each class name is then used as a search engine keyword, and a web crawler fetches the content of the top 200 webpages in the search results as the sample set for that class. On the basis of this sample set, the classifier is trained with the locality sensitive hash function, and crawled webpages are then classified according to the extracted classes. The detailed workflow of webpage classification is shown in Fig. 1:
1. Use the class name as a keyword for an existing search engine, and use a web crawler to fetch the information and webpage content of the top 200 search results.
2. Remove noise such as webpage tags and advertisements, and extract the webpage body text as the sample set for the class.
3. Extract the feature words of the class and their weights.
4. Use the locality sensitive hash function to map the feature words and weights of the class to a 64-bit fingerprint.
5. Test the accuracy of the classifier.
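Step 2 of the workflow above, together with stop-word removal, can be sketched with the Python standard library as follows. This is a simplified illustration under stated assumptions: only tags, scripts, and styles are stripped (advertisement removal is not attempted), the stop-word list is a hypothetical placeholder, and a regular expression stands in for the word segmenter, which would not segment Chinese text correctly.

```python
import re
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


STOP_WORDS = {"the", "a", "of", "and", "is"}  # hypothetical stop-word list


def extract_tokens(html: str) -> list:
    """Strip markup, split the remaining text into words, and drop stop words."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"\w+", text.lower())  # stand-in for a real word segmenter
    return [w for w in words if w not in STOP_WORDS]


page = ("<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
        "<body><p>The price of gold is rising</p></body></html>")
print(extract_tokens(page))
```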
The detailed workflow of text classification based on the locality sensitive hash function is shown in Fig. 2:
1. Process the training set with a word segmenter and remove stop words; use the classifier training method of naive Bayes classification to obtain the feature vector of each class and the weights of its feature words.
2. Take the feature vector of a class and its weights as input, and use the locality sensitive hash function to obtain the 64-bit fingerprint of that class.
3. After the classifier has been trained successfully, apply the same processing to the text to be classified to generate a 64-bit fingerprint representing the webpage.
4. Compare the fingerprint of the webpage to be classified with the fingerprints of all classes, and assign the webpage to the class with the smallest Hamming distance.
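The text does not state which naive Bayes quantity serves as the feature word weight in step 1. One plausible reading, shown here purely as an assumption, is the smoothed conditional probability P(word | class) estimated from word counts with add-one (Laplace) smoothing. The function name and the toy training data are hypothetical.

```python
from collections import Counter


def class_feature_weights(docs_by_class: dict) -> dict:
    """Estimate P(word | class) with add-one smoothing for every class.

    docs_by_class maps a class name to a list of tokenized documents.
    """
    vocabulary = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    weights = {}
    for name, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        denom = sum(counts.values()) + len(vocabulary)
        weights[name] = {w: (counts[w] + 1) / denom for w in vocabulary}
    return weights


# Toy training set, already segmented and stripped of stop words.
training = {
    "sports": [["goal", "match"], ["goal"]],
    "finance": [["stock", "bond"]],
}
w = class_feature_weights(training)
print(w["sports"]["goal"], w["finance"]["goal"])
```

The resulting per-class dictionaries have the {feature word: weight} shape that the fingerprint mapping in step 2 takes as input.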

Claims (2)

1. A method for webpage classification based on a locality sensitive hash function, characterized in that: first, a classifier is trained, its input being the training set of all classes and its output being a 64-bit fingerprint representing each class; the classifier is then tested with a test set; after it passes the test, the classifier is used to classify unclassified webpages; the input to classification is the body text of the webpage to be classified, which after processing is mapped to a 64-bit fingerprint and compared with the fingerprints of all classes by computing Hamming distances; the webpage is assigned to the class with the smallest Hamming distance.
2. The method for webpage classification based on a locality sensitive hash function according to claim 1, characterized in that:
in preprocessing, the class names are obtained according to the specific project, and a web crawler then uses each class name with a search engine to collect the sample set for that class;
first step: remove noise such as webpage tags and advertisements from the sample sets of all classes and extract the body text; segment the text with a word segmenter and remove stop words;
second step: use the classifier training method of the naive Bayes classification algorithm to compute the weight of each feature word of each class;
third step: use the locality sensitive hash function to map the feature vector of each class to a 64-bit fingerprint, the mapping proceeding as follows:
the input is an N-dimensional text feature vector V in which each feature has a weight; the output is a 64-bit binary signature f;
(1) initialize a 64-dimensional vector F to 0 and the 64-bit binary signature f to 0;
(2) for each feature in vector V, compute a 64-bit hash value H with a conventional hash algorithm; for 1<=i<=64,
if the i-th bit of H is 1, add the weight of the feature to the i-th element of F;
otherwise, subtract the weight of the feature from the i-th element of F;
(3) if the i-th element of F is greater than 0, set the i-th bit of f to 1; otherwise set it to 0;
(4) return the signature f;
fourth step: extract the feature words of the webpage to be classified and their weights, and map them to the webpage's binary signature s with the same locality sensitive hash function;
fifth step: compare s with the fingerprints of all classes by Hamming distance; the webpage corresponding to s belongs to the class with the smallest Hamming distance;
sixth step: after the classifier passes testing, use it to classify webpages.
CN201410005868.0A 2014-01-06 2014-01-06 Webpage classification method based on locality sensitive Hash function Pending CN103744964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410005868.0A CN103744964A (en) 2014-01-06 2014-01-06 Webpage classification method based on locality sensitive Hash function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410005868.0A CN103744964A (en) 2014-01-06 2014-01-06 Webpage classification method based on locality sensitive Hash function

Publications (1)

Publication Number Publication Date
CN103744964A true CN103744964A (en) 2014-04-23

Family

ID=50501982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410005868.0A Pending CN103744964A (en) 2014-01-06 2014-01-06 Webpage classification method based on locality sensitive Hash function

Country Status (1)

Country Link
CN (1) CN103744964A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615681A (en) * 2015-01-21 2015-05-13 广州神马移动信息科技有限公司 Text selecting method and device
CN105183792A (en) * 2015-08-21 2015-12-23 东南大学 Distributed fast text classification method based on locality sensitive hashing
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregate method and device
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
US10778707B1 (en) 2016-05-12 2020-09-15 Amazon Technologies, Inc. Outlier detection for streaming data using locality sensitive hashing
WO2021121279A1 (en) * 2019-12-19 2021-06-24 Beijing Didi Infinity Technology And Development Co., Ltd. Text document categorization using rules and document fingerprints

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521366A (en) * 2011-12-16 2012-06-27 华中科技大学 Image retrieval method integrating classification with hash partitioning and image retrieval system utilizing same

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521366A (en) * 2011-12-16 2012-06-27 华中科技大学 Image retrieval method integrating classification with hash partitioning and image retrieval system utilizing same

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
严澜: ""海量数据相似度计算之simhash和海明距离"", 《HTTP://WWW.LANCEYAN.COM/TECH/ARCH/SIMHASH_HAMMING_DISTANCE _SIMILARITY.HTML》 *
何学文: ""基于LSH的语音文档主题分类研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
史世泽: ""局部敏感哈希算法的研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
李晋松: ""基于朴素贝叶斯的网页自动分类技术研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615681A (en) * 2015-01-21 2015-05-13 广州神马移动信息科技有限公司 Text selecting method and device
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregate method and device
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
CN106302202B (en) * 2015-05-15 2020-07-28 阿里巴巴集团控股有限公司 Data current limiting method and device
CN105183792A (en) * 2015-08-21 2015-12-23 东南大学 Distributed fast text classification method based on locality sensitive hashing
US10778707B1 (en) 2016-05-12 2020-09-15 Amazon Technologies, Inc. Outlier detection for streaming data using locality sensitive hashing
WO2021121279A1 (en) * 2019-12-19 2021-06-24 Beijing Didi Infinity Technology And Development Co., Ltd. Text document categorization using rules and document fingerprints
US11557141B2 (en) 2019-12-19 2023-01-17 Beijing Didi Infinity Technology And Development Co., Ltd. Text document categorization using rules and document fingerprints

Similar Documents

Publication Publication Date Title
CN103744964A (en) Webpage classification method based on locality sensitive Hash function
Gharge et al. An integrated approach for malicious tweets detection using NLP
Ning et al. Spam message classification based on the Naïve Bayes classification algorithm
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN103886077B (en) Short text clustering method and system
CN110929025A (en) Junk text recognition method and device, computing equipment and readable storage medium
JP5012078B2 (en) Category creation method, category creation device, and program
CN105183792B (en) Distributed fast text classification method based on locality sensitive hashing
CN103218405A (en) Method for integrating migration text classifications based on dimensionality reduction
CN109710825A (en) Webpage harmful information identification method based on machine learning
CN102663435A (en) Junk image filtering method based on semi-supervision
Geng et al. Evaluating web content quality via multi-scale features
CN104462229A (en) Event classification method and device
Patil et al. Web spam detection using SVM classifier
CN107544961A (en) A kind of sentiment analysis method, equipment and its storage device of social media comment
Dong et al. An adult image detection algorithm based on Bag-of-Visual-Words and text information
Khan et al. Text mining approach to detect spam in emails
Ramraj et al. Topic categorization of tamil news articles using pretrained word2vec embeddings with convolutional neural network
Tabone et al. Pornographic content classification using deep-learning
Chinavle et al. Ensembles in adversarial classification for spam
Mussa et al. Relevant SMS spam feature selection using wrapper approach and XGBoost algorithm
Pandya Spam detection using clustering-based SVM
Silva et al. Towards web spam filtering using a classifier based on the minimum description length principle
CN102103700A (en) Land mobile distance-based image spam similarity-detection method
Ghosh et al. Semi-supervised granular classification framework for resource constrained short-texts: Towards retrieving situational information during disaster events

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140423
