CN101609450A - Web page classification method based on training set - Google Patents

Web page classification method based on training set

Info

Publication number
CN101609450A
Authority
CN
China
Prior art keywords
text
webpage
classification
web page
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2009100307095A
Other languages
Chinese (zh)
Inventor
王攀
张顺颐
汤琛
于伟涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CNA2009100307095A
Publication of CN101609450A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic web page classification method based on a training set, in which the classification process combines feature selection, feature weight determination, and text vector comparison. Automatic classification under a classification hierarchy takes category models built in advance as the training set and assigns each document to be classified to the corresponding category. With the development of multimedia technology, web content has become rich and varied: it contains not only text but also a great deal of structural information, as well as sound, graphics, images, and other forms. However, because text-based web pages still account for a large proportion, classification based on web page text remains dominant. The method has reliable theoretical support, good extensibility and accuracy, and is easy to connect to operator-related application interfaces.

Description

Web page classification method based on training set
Technical field
The present invention concerns automatic content-based classification of arbitrary Chinese web pages. It mainly addresses how to build a training set and use vector comparison to classify unknown web pages accurately, designs an automatic web page classification model and algorithms, and involves technical fields such as document feature extraction and feature weight calculation.
Background art
With the rapid development and spread of Internet technology, the amount of information on the Web has grown sharply, and people have entered an information-rich era. Faced with such abundant Web information, people often feel at a loss, and how to find the resources they need effectively has become a matter of concern. Keyword search engines (such as Baidu and Google), the network information retrieval tools most commonly used, suffer from shortcomings such as low precision and high information redundancy. Because automatic classification technology for Chinese web pages is immature, most directory search engines, such as Yahoo, rely on manual classification. Although this improves precision, it has drawbacks such as poor timeliness, inconsistent classification results, and small database scale; relying on manual classification alone is costly and impractical. Automatic classification of Chinese web pages has therefore become an important technology for organizing massive amounts of online information quickly and effectively.
Automatically classifying unknown Chinese web pages is difficult for the following reasons:
First, Chinese web pages are written in Chinese. Unlike English, where words are naturally separated by spaces, Chinese text must be segmented into words, and the quality of segmentation significantly affects the classification result.
Second, web page formats are diverse. Multiple formats coexist, and pages of the same format may follow several standards. Because the writing style and content of web pages also vary greatly, parsing pages of different formats and styles is a difficult point in web page preprocessing.
Third, the classification scheme is fuzzy. Knowledge on the Internet evolves very quickly and new knowledge structures keep emerging; if the training corpus is not updated in time, web pages cannot be classified or classification accuracy drops sharply.
Fourth, web page denoising. Web pages contain a large amount of noise unrelated to the page topic, and how to improve the performance of the denoising algorithm is a problem that requires study.
Fifth, web page structural information. Web pages contain rich structural information; besides plain text, other content also contributes to classification. For example, the Head and Title tags mark the page title and paragraph subtitles, the name and content attribute values in meta tags describe the page topic, and the content that hyperlinks point to may be related to the page topic or may be noise. How to distinguish and extract these is the difficulty.
The design and implementation of an automatic classification system for Chinese web pages therefore involve many problems and considerable difficulty, which is why we have studied this subject.
Summary of the invention
Technical problem: the object of the present invention is to establish a web page classification method based on a training set, that is, a method that compares a web page of unknown category with the training set to obtain its category, and to design the vector representation model of web pages together with the feature extraction algorithm and the vector-distance comparison algorithm. Once web pages are classified, users' Internet access behavior can be analyzed in greater depth.
Technical solution: the web page classification method based on a training set of the present invention comprises three parts: web page content processing, web page vector representation, and web page vector comparison.
Web page content processing part:
A1.) automatically obtain the source code of the web page from its URL,
A2.) use regular expressions to filter out noise in the page content, such as pictures and hyperlinks, and extract the effective text,
A3.) perform word segmentation on the filtered web page text,
A4.) filter the segmented text, removing function words, auxiliary words, and similar terms, and keep the keywords that summarize the text content;
Web page vector representation part:
This part is further divided into two processes: dimensionality reduction of the vector's feature words, and determination of the feature values of the feature words.
Feature word dimensionality reduction:
B1.) collect all segmented words in the training set: after the preceding steps, the training set is delivered as word-segmented texts stored in different folders according to category, and all texts are merged by batch processing as required; this yields the keyword terms of all categories,
B2.) screen terms by length: the length of every term is restricted to between 2 and 5; terms outside this range are considered to contribute little to classification or even to interfere with it, and are removed,
B3.) enforce term uniqueness: every term in the overall vocabulary text is limited to a single occurrence, to increase computation speed and reduce calculation errors,
B4.) calculate the frequency with which each term appears in each category, then sum all frequencies,
B5.) calculate the four relationship frequencies between each term and each category, then obtain the dimensionality-reduction weight of each term-category pair according to the χ² method,
B6.) sort the weights in descending order and take the first 1000 terms as feature items, which completes the determination of the feature items and ends the feature item selection algorithm;
Determination of feature values of the feature words:
B7.) obtain the feature items,
B8.) dynamically create data tables according to the number of feature items,
B9.) count, in the training set, the number of documents that contain each feature item,
B10.) count the total number of texts, the total number of categories, and the number of texts in each category,
B11.) calculate the frequency of each feature item in each text, and process the result in matrix form,
B12.) for each text, calculate the feature values of the feature items, completing the vector representation of the text,
B13.) the vector representation algorithm ends;
Web page vector comparison part:
C1.) obtain the feature vector of the test text X,
C2.) take one text feature vector Ti from the training set,
C3.) calculate the similarity sim(X, Ti) of the two feature vectors,
C4.) check whether all vectors in the training set have been processed; if so, go to step C5), otherwise return to step C2) and continue,
C5.) quicksort the computed similarities and take the K texts with the highest similarity,
C6.) accumulate the similarities of these K texts by category,
C7.) take the maximum accumulated similarity Si and the corresponding category Ci,
C8.) mark the text as likely belonging to category Ci,
C9.) the classification algorithm ends.
Beneficial effects: the invention provides a web page classification method based on a training set, that is, a method that compares a web page of unknown category with the training set to obtain its category, together with the vector representation model of web pages, the feature extraction algorithm, and the vector-distance comparison algorithm. Once web pages are classified, users' Internet access behavior can be analyzed in greater depth.
Users are the direct consumers of the network and also the final judges of network service quality. Traditional network services bring users great convenience but also have shortcomings, such as being unable to provide the information a user needs proactively. User behavior analysis summarizes the information hidden behind user behavior, such as a user's interests, field, and access frequency. By learning from user behavior, network services can be targeted more precisely at specific users and can return the information they need preferentially or proactively.
The web page classification method based on a training set can be used to classify and analyze a user's browsing records, revealing which kinds of information the user often follows online, what kind of work the user does, and on which websites the user likes to make purchases. This is very important for improving both network service quality and network management.
Traditional web page classification relies on manual processing. Its accuracy can be well guaranteed, but its inefficiency becomes apparent when the number of web pages is very large, and it cannot work in real time. The web page classification method based on a training set automates the classification process, and the method adopted gives a certain guarantee of accuracy.
Brief description of the drawings
Fig. 1 is a functional diagram of web page classification, showing each processing stage of classification.
Fig. 2 is a flow chart of the method for determining the feature items of the web page vector, showing its specific processing steps.
Fig. 3 is a flow chart of the method for determining the feature values of the feature items of the web page vector, showing its specific processing steps.
Fig. 4 is a flow chart of the vector comparison method, showing its specific processing steps.
Detailed description
The present invention proposes an effective technical framework for automatic web page classification and designs the classification algorithm in detail, as shown in Fig. 1. As the figure shows, the system is divided into three parts: web page content processing, web page vector representation, and web page vector comparison.
Two terms need explanation here. The training set is a large collection of web page source code of known categories; the source code is stored as text in different folders according to category, and all of these texts are ultimately converted into corresponding vectors. Feature extraction is the process of determining each element of the web page vector: each element is a keyword term that reflects the page content, and its value is the computed weight of that term's importance to classification. Every web page has its own vector representation.
The key methods of the invention lie in the vector representation part and the vector comparison part. The vector representation part mainly comprises two methods: the method for determining the feature items of the web page vector and the method for determining the feature values of the feature items. The main method of the vector comparison part is the comparison of the web page vector under test with the training set vectors.
Method for determining the feature items of the web page vector: the basic criterion for feature selection is how strongly a feature affects the classification result, measured with a statistic. Feature selection must also preserve the character of the original feature space while reducing its dimensionality to a desirable range. Based on these principles, we chose the χ² statistical method, assuming that Chinese keywords in web pages and categories follow a χ² distribution. The higher the χ² value, the lower the independence between a keyword and a category and the stronger their correlation, that is, the more the keyword contributes to that category. All keywords in the processed training set are collected into one text, and four relationship frequencies are calculated for each keyword and category: (1) $n_{11}$, the frequency with which keyword i appears in category j; (2) $n_{12}$, the frequency with which keyword i appears in categories other than j; (3) $n_{21}$, the frequency with which all terms other than keyword i appear in category j; (4) $n_{22}$, the frequency with which all terms other than keyword i appear in categories other than j. The statistic is then calculated with the formula:
$$\chi^2 = \frac{n \times (n_{11} \times n_{22} - n_{12} \times n_{21})^2}{(n_{11} + n_{12}) \times (n_{21} + n_{22}) \times (n_{11} + n_{21}) \times (n_{12} + n_{22})}$$
where n is the sum of the frequencies of all keywords. The χ² value is calculated for every pair of keyword i and category j, and the 1000 words with the largest values are taken as features, which completes the determination of the vector features.
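The χ² selection described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the standard 2×2 form of the statistic with marginal sums in the denominator, and, because the text does not say how a word's scores across categories are combined, it assumes each word is ranked by its largest χ² over all categories. The names `chi_square` and `select_features` and the `counts` layout are hypothetical.

```python
def chi_square(n11: int, n12: int, n21: int, n22: int) -> float:
    """Chi-square statistic for one (keyword, category) pair from its four frequencies."""
    n = n11 + n12 + n21 + n22  # sum of the frequencies of all keywords
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        return 0.0  # degenerate table: treat as no association
    return n * (n11 * n22 - n12 * n21) ** 2 / denom


def select_features(counts, top_k=1000):
    """counts maps term -> {category: frequency}; returns the top_k terms.

    Assumption: a term's score is its maximum chi-square over all categories.
    """
    categories = sorted({c for per_cat in counts.values() for c in per_cat})
    cat_total = {c: sum(per_cat.get(c, 0) for per_cat in counts.values())
                 for c in categories}
    grand_total = sum(cat_total.values())
    scores = {}
    for term, per_cat in counts.items():
        term_total = sum(per_cat.values())
        best = 0.0
        for c in categories:
            n11 = per_cat.get(c, 0)              # term i inside category j
            n12 = term_total - n11               # term i outside category j
            n21 = cat_total[c] - n11             # other terms inside category j
            n22 = grand_total - n11 - n12 - n21  # other terms outside category j
            best = max(best, chi_square(n11, n12, n21, n22))
        scores[term] = best
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A word spread evenly over all categories scores close to zero, while a word concentrated in one category scores high, which matches the intent that high χ² means a strong keyword-category correlation.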
Method for determining the feature values of the feature items: once the feature items have been selected, each selected feature item is assigned a weight that describes the content of the document and the importance of the feature within the text. Because web pages are a special kind of document, with their own structure and with category information attached to their features, we calculate a more accurate statistic on the basis of the TF*IDF weighting method to describe the importance of a feature item to the page content. The method uses the frequency $TF_{ij}$ with which feature item $t_j$ appears in document $d_i$ and the inverse document frequency of feature item $t_j$, and calculates the feature weight with the formula:
$$w_{ij} = \frac{TF_{ij} \times \log\left(\frac{N}{n_j} + 0.01\right)}{\sqrt{\sum_{j=1}^{n} \left(TF_{ij} \times \log\left(\frac{N}{n_j} + 0.01\right)\right)^2}}$$
where N is the total number of texts in the training set, $n_j$ is the number of training set texts in which feature item $t_j$ appears, and n is the number of feature items.
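The weighting step can be illustrated with a short sketch. It assumes the denominator is the usual Euclidean normalization (the square root of the sum of squared TF*IDF products over all feature items), so that every document vector has unit length; the function name `tfidf_vector` and its argument layout are hypothetical.

```python
import math


def tfidf_vector(tf_row, doc_freq, n_docs):
    """Normalized TF*IDF weights for one document.

    tf_row[j]   : frequency of feature item j in this document (TF_ij)
    doc_freq[j] : number of training texts that contain feature item j (n_j)
    n_docs      : total number of training texts (N)
    """
    raw = [tf * math.log(n_docs / df + 0.01) if df > 0 else 0.0
           for tf, df in zip(tf_row, doc_freq)]
    norm = math.sqrt(sum(w * w for w in raw))
    # A document that contains none of the feature items keeps an all-zero vector.
    return [w / norm for w in raw] if norm > 0 else raw
```

The 0.01 term keeps the logarithm positive even for a feature item that appears in every training text, so such an item still receives a small nonzero weight.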
Vector comparison method: using the K-nearest-neighbor algorithm, the vector under test is compared with every text in the training set, their similarities are calculated, and the K most similar training texts are found. On this basis each text category is given a score equal to the sum of the similarities between the test text and those of the K training texts that belong to the category. The categories are then sorted by score, and the one with the highest score is taken as the comparison result. The formula is:
$$y(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN} sim(\vec{x}, \vec{d}_i) \, y(\vec{d}_i, c_j) - b_j$$
where $\vec{x}$ is the web page vector under test, $\vec{d}_i$ is a web page vector in the training set, $c_j$ is an element of the category set,
$$sim(\vec{x}, \vec{d}_i) = \frac{\vec{x} \cdot \vec{d}_i}{|\vec{x}| \, |\vec{d}_i|},$$
and $y(\vec{d}_i, c_j)$ takes a value in {0, 1}: it is 1 when $\vec{d}_i$ belongs to $c_j$ and 0 otherwise.
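A minimal sketch of this scoring rule is given below. It is an illustration under stated assumptions, not the system's actual code: vectors are plain Python lists, the training set is a non-empty list of `(vector, category)` pairs, and the term b_j, which the text does not define, is treated as an optional per-category offset that defaults to 0. The names `cosine`, `knn_classify`, and `thresholds` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(x, d):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, d))
    nx = math.sqrt(sum(a * a for a in x))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nx * nd) if nx and nd else 0.0


def knn_classify(x, training, k=3, thresholds=None):
    """Score categories by summed similarity over the k nearest training vectors.

    training   : non-empty list of (vector, category) pairs
    thresholds : optional {category: b_j}; assumed to be 0 when absent
    Returns (best_category, scores).
    """
    thresholds = thresholds or {}
    nearest = sorted(((cosine(x, vec), cat) for vec, cat in training),
                     key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim  # y(d_i, c_j) is 1 only for the neighbor's own category
    for cat in scores:
        scores[cat] -= thresholds.get(cat, 0.0)
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Summing similarities, rather than counting votes, lets one very close neighbor outweigh several distant ones from another category.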
The implementation of each functional part of this design is described in detail below.
1. Web page content processing module
Function: this part first obtains the source code of the specified web page and uses regular expressions to extract the Chinese portion of the source code; the extracted Chinese text is then segmented into words and stored as text.
Interface: this part provides the segmented web page text to the next functional module.
2. Web page vector representation module
Function: this part first derives the web page vector representation from the training set. Each submitted web page text is then converted into a vector by calculation and stored in the database.
Interface: this part provides the web page vector comparison module with the data to be compared. Each record in the database represents one vector: rows represent different web page texts, columns represent the features of the vector, and each stored value is the weight calculated for the corresponding feature word in that web page text.
This layer mainly comprises two methods: the method for determining the feature items of the web page vector and the method for determining the feature values of the feature items. The elements of the vector, that is, the keywords in the page content, are determined first; the importance of each feature word to the page content is then calculated and assigned as its weight, completing the vector representation of the web page.
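The content processing stage might look like the sketch below. It is only an illustration under assumptions: the HTML is already downloaded into a string, hyperlinks are treated as noise together with their anchor text, and Chinese text is approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF. Word segmentation is left to an external segmenter, since the text does not name one; `extract_chinese_text`, `filter_tokens`, and the stop word set are hypothetical.

```python
import re


def extract_chinese_text(html: str) -> str:
    """Strip scripts, styles, links, and tags, then keep only runs of Chinese characters."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)  # code and CSS
    html = re.sub(r"(?is)<a\b.*?</a\s*>", " ", html)                # hyperlinks with anchor text
    html = re.sub(r"(?s)<[^>]+>", " ", html)                        # remaining tags, e.g. <img>
    runs = re.findall(r"[\u4e00-\u9fff]+", html)
    return " ".join(runs)


def filter_tokens(tokens, stopwords):
    """Drop function words and auxiliary words from an already segmented token list."""
    return [t for t in tokens if t not in stopwords]
```

In use, the string returned by `extract_chinese_text` would be passed to the segmenter, and the resulting tokens to `filter_tokens` before being handed to the vector representation module.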
◆ Method for determining the feature items of the web page vector. The processing steps are shown in Fig. 2.
(1) Collect all segmented words in the training set. After the preceding steps, the training set is delivered as word-segmented texts stored in different folders according to category, and all texts are merged by batch processing as required. This yields the keyword terms of all categories, but using so many keywords as feature items would require too much computation, and a larger number of feature items does not make the result more accurate. The feature items therefore undergo dimensionality reduction, which reduces the feature words to a manageable number and increases computation speed.
(2) Screen terms by length. The length of every term is restricted to between 2 and 5; terms outside this range are considered to contribute little to classification or even to interfere with it, and are removed.
(3) Enforce term uniqueness. Because the amount of text is huge, the same term is very likely to appear many times, but each term needs to be included in the calculation only once. Every term in the overall vocabulary text is therefore limited to a single occurrence, to increase computation speed and reduce calculation errors.
(4) Calculate the frequency with which each term appears in each category, then sum all frequencies.
(5) Calculate the four relationship frequencies between each term and each category, then obtain the weight of each term-category pair according to the χ² method.
(6) Sort the weights in descending order and take the first 1000 terms as feature items, which completes the determination of the feature items.
(7) The feature item selection algorithm ends.
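Steps (1) to (4) above can be sketched as a single pass over the segmented training texts. This is a hedged illustration: it assumes term length is measured in characters and that the training data is available as a mapping from category to a flat list of segmented terms; `build_vocabulary` and its parameters are hypothetical names.

```python
def build_vocabulary(terms_by_category, min_len=2, max_len=5):
    """Merge segmented terms, screen by length, deduplicate, and count per category.

    terms_by_category : {category: [segmented terms]}
    Returns (vocabulary, counts), where counts maps term -> {category: frequency}.
    """
    counts = {}
    for category, terms in terms_by_category.items():
        for term in terms:
            if not (min_len <= len(term) <= max_len):
                continue  # step (2): drop terms outside the 2 to 5 length range
            per_cat = counts.setdefault(term, {})
            per_cat[category] = per_cat.get(category, 0) + 1  # step (4)
    vocabulary = list(counts)  # step (3): each term is listed exactly once
    return vocabulary, counts
```

The per-category counts produced here are exactly the inputs needed to derive the four relationship frequencies used in step (5).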
◆ Method for determining the feature values of the feature items. The method flow is shown in Fig. 3.
(1) Obtain the feature items.
(2) Dynamically create data tables according to the number of feature items.
(3) Count, in the training set, the number of documents that contain each feature item.
(4) Count the total number of texts, the total number of categories, and the number of texts in each category.
(5) Calculate the frequency of each feature item in each text, and process the result in matrix form.
(6) For each text, calculate the feature values of the feature items, completing the vector representation of the text.
(7) The vector representation algorithm ends.
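The counting in steps (3) and (5) can be sketched in memory as follows. The text stores these values in dynamically created database tables; this illustration instead uses nested Python lists, which is an assumption made for brevity, and `term_frequency_matrix` is a hypothetical name.

```python
def term_frequency_matrix(documents, features):
    """Build the document-by-feature frequency matrix and per-feature document counts.

    documents : list of token lists, one per training text
    features  : ordered list of feature items
    Returns (matrix, doc_freq): matrix[i][j] is the frequency of feature j in
    document i, and doc_freq[j] is the number of documents containing feature j.
    """
    index = {term: j for j, term in enumerate(features)}
    matrix = []
    doc_freq = [0] * len(features)
    for tokens in documents:
        row = [0] * len(features)
        for token in tokens:
            j = index.get(token)
            if j is not None:
                row[j] += 1
        for j, tf in enumerate(row):
            if tf > 0:
                doc_freq[j] += 1
        matrix.append(row)
    return matrix, doc_freq
```

Each row of the matrix, together with `doc_freq` and the total number of documents, is what the TF*IDF weighting in step (6) consumes.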
3. Web page vector comparison module
Function: this part compares the web page vector under test, produced by the previous module, with all vectors in the training set, using a specific algorithm to calculate the similarity between the vector under test and every web page vector in the training set. It finds the most similar vectors in the training set; the corresponding category is the category of the document under test.
Interface: the final classification result is stored in the database.
This part is the core of the classification system and contains the vector comparison method.
◆ Vector comparison method. The method flow is shown in Fig. 4.
(1) Obtain the feature vector of the test text X.
(2) Take one text feature vector Ti from the training set.
(3) Calculate the similarity sim(X, Ti) of the two feature vectors.
(4) Check whether all vectors have been processed; if so, go to step (5), otherwise return to step (2) and continue.
(5) Quicksort the computed similarities and take the K texts with the highest similarity.
(6) Accumulate the similarities of these K texts by category.
(7) Take the maximum accumulated similarity Si and the corresponding category Ci.
(8) Mark the text as likely belonging to category Ci.
(9) The classification algorithm ends.
4. Applications of the automatic web page classification system
Automatic web page classification has broad significance and practical value. Its main applications include:
◆ research on automatic classification and clustering of Chinese web pages;
◆ research on the characteristics of Chinese web pages;
◆ research on information retrieval technology;
◆ laying the groundwork for specialized topical search engines;
◆ analysis of how Internet information is obtained and used.
This method has been applied in the design of the reverse-engine part of the automatic web page classification system that we developed and built. The system uses a B/S (browser/server) architecture supported by forward and reverse engines, so that related URLs can be looked up by category and the category of a given URL can be determined. In sampling tests based on the global URL ranking and the Chinese URL ranking of the Alexa ranking site, the coverage of the system reached 50% and 97% respectively. Classification accuracy still needs to be improved, the training set needs to be defined more strictly, and category customization should be extended toward three or four levels so that coverage becomes more complete and extensive.
The operating environment of the system is simple to set up: the system runs under Windows with the .NET Framework 2.0 and Oracle 9i or later installed and an Internet connection available. The system is easy to use: with its simple B/S architecture, users can carry out search operations by following the prompts. For real-time operation, the URL database can be updated promptly or on a schedule according to the user's own requirements.

Claims (1)

1. A web page classification method based on a training set, characterized in that the method comprises three parts: web page content processing, web page vector representation, and web page vector comparison.
Web page content processing part:
A1.) automatically obtain the source code of the web page from its URL,
A2.) use regular expressions to filter out noise in the page content, such as pictures and hyperlinks, and extract the effective text,
A3.) perform word segmentation on the filtered web page text,
A4.) filter the segmented text, removing function words, auxiliary words, and similar terms, and keep the keywords that summarize the text content;
Web page vector representation part:
This part is further divided into two processes: dimensionality reduction of the vector's feature words, and determination of the feature values of the feature words.
Feature word dimensionality reduction:
B1.) collect all segmented words in the training set: after the preceding steps, the training set is delivered as word-segmented texts stored in different folders according to category, and all texts are merged by batch processing as required; this yields the keyword terms of all categories,
B2.) screen terms by length: the length of every term is restricted to between 2 and 5; terms outside this range are considered to contribute little to classification or even to interfere with it, and are removed,
B3.) enforce term uniqueness: every term in the overall vocabulary text is limited to a single occurrence, to increase computation speed and reduce calculation errors,
B4.) calculate the frequency with which each term appears in each category, then sum all frequencies,
B5.) calculate the four relationship frequencies between each term and each category, then obtain the dimensionality-reduction weight of each term-category pair according to the χ² method,
B6.) sort the weights in descending order and take the first 1000 terms as feature items, which completes the determination of the feature items and ends the feature item selection algorithm;
Determination of feature values of the feature words:
B7.) obtain the feature items,
B8.) dynamically create data tables according to the number of feature items,
B9.) count, in the training set, the number of documents that contain each feature item,
B10.) count the total number of texts, the total number of categories, and the number of texts in each category,
B11.) calculate the frequency of each feature item in each text, and process the result in matrix form,
B12.) for each text, calculate the feature values of the feature items, completing the vector representation of the text,
B13.) the vector representation algorithm ends;
Web page vector comparison part:
C1.) obtain the feature vector of the test text X,
C2.) take one text feature vector Ti from the training set,
C3.) calculate the similarity sim(X, Ti) of the two feature vectors,
C4.) check whether all vectors in the training set have been processed; if so, go to step C5), otherwise return to step C2) and continue,
C5.) quicksort the computed similarities and take the K texts with the highest similarity,
C6.) accumulate the similarities of these K texts by category,
C7.) take the maximum accumulated similarity Si and the corresponding category Ci,
C8.) mark the text as likely belonging to category Ci,
C9.) the classification algorithm ends.
CNA2009100307095A 2009-04-10 2009-04-10 Web page classification method based on training set Pending CN101609450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100307095A CN101609450A (en) 2009-04-10 2009-04-10 Web page classification method based on training set


Publications (1)

Publication Number Publication Date
CN101609450A true CN101609450A (en) 2009-12-23

Family

ID=41483207

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100307095A Pending CN101609450A (en) 2009-04-10 2009-04-10 Web page classification method based on training set

Country Status (1)

Country Link
CN (1) CN101609450A (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102004796A (en) * 2010-12-24 2011-04-06 钱钢 Non-retardant hierarchical classification method and device of webpage texts
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102207961A (en) * 2011-05-25 2011-10-05 盛乐信息技术(上海)有限公司 Automatic web page classification method and device
CN102436512A (en) * 2012-01-17 2012-05-02 电子科技大学 Preference-based web page text content control method
CN103309857A (en) * 2012-03-06 2013-09-18 腾讯科技(深圳)有限公司 Method and equipment for determining classified linguistic data
CN103309862A (en) * 2012-03-07 2013-09-18 腾讯科技(深圳)有限公司 Webpage type recognition method and system
CN103544310A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Shopping guide webpage information classifying method achieved based on classifier
CN103729466A (en) * 2014-01-16 2014-04-16 重庆邮电大学 Name country identification method based on WEB and GBBoosting algorithms
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content
CN103761477A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Method and equipment for acquiring virus program samples
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection
CN103984700A (en) * 2014-04-15 2014-08-13 厦门产业技术研究院 Heterogeneous data analysis method for vertical search of scientific information
WO2014203264A1 (en) * 2013-06-21 2014-12-24 Hewlett-Packard Development Company, L.P. Topic based classification of documents
CN104408148A (en) * 2014-12-03 2015-03-11 复旦大学 Field encyclopedia establishment system based on general encyclopedia websites
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN104504086A (en) * 2014-12-25 2015-04-08 北京国双科技有限公司 Clustering method and device for webpage
CN104573021A (en) * 2015-01-12 2015-04-29 浪潮软件集团有限公司 Method for analyzing internet behaviors
CN104615608A (en) * 2014-04-28 2015-05-13 腾讯科技(深圳)有限公司 Data mining processing system and method
CN105468731A (en) * 2015-11-20 2016-04-06 成都科来软件有限公司 Preprocessing method of text sentiment analysis characteristic verification
CN105912695A (en) * 2016-04-25 2016-08-31 全球电气资源(深圳)有限公司 Data information transmission and acquisition system based on Internet
CN106203073A (en) * 2016-07-13 2016-12-07 浪潮电子信息产业股份有限公司 Windows white list control method based on file system filter driver
CN106484919A (en) * 2016-11-15 2017-03-08 任子行网络技术股份有限公司 A kind of industrial sustainability sorting technique based on webpage autonomous word and system
WO2017118427A1 (en) * 2016-01-07 2017-07-13 腾讯科技(深圳)有限公司 Webpage training method and device, and search intention identification method and device
CN108062377A (en) * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 The foundation of label picture collection, definite method, apparatus, equipment and the medium of label
WO2018196561A1 (en) * 2017-04-25 2018-11-01 腾讯科技(深圳)有限公司 Label information generating method and device for application and storage medium
CN108921184A (en) * 2018-04-18 2018-11-30 中国科学院信息工程研究所 General webpage type determination method
CN109299346A (en) * 2018-10-31 2019-02-01 长春理工大学 Method and system for identifying invalid address webpage
CN109446327A (en) * 2018-11-01 2019-03-08 合肥工业大学 Diagnostic method and system for customer mobile communication complaints
WO2019085691A1 (en) * 2017-10-30 2019-05-09 清华大学 Unsupervised exceptional access detection method and apparatus based on one-hot encoding mechanism

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984435B (en) * 2010-11-17 2012-10-10 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102004796A (en) * 2010-12-24 2011-04-06 钱钢 Non-retardant hierarchical classification method and device of webpage texts
CN102004796B (en) * 2010-12-24 2012-04-25 钱钢 Non-retardant hierarchical classification method and device of webpage texts
CN102207961B (en) * 2011-05-25 2013-10-23 盛乐信息技术(上海)有限公司 Automatic web page classification method and device
CN102207961A (en) * 2011-05-25 2011-10-05 盛乐信息技术(上海)有限公司 Automatic web page classification method and device
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102436512A (en) * 2012-01-17 2012-05-02 电子科技大学 Preference-based web page text content control method
CN102436512B (en) * 2012-01-17 2013-05-08 电子科技大学 Preference-based web page text content control method
CN103309857A (en) * 2012-03-06 2013-09-18 腾讯科技(深圳)有限公司 Method and equipment for determining classified linguistic data
CN103309862A (en) * 2012-03-07 2013-09-18 腾讯科技(深圳)有限公司 Webpage type recognition method and system
CN103309862B (en) * 2012-03-07 2017-05-17 腾讯科技(深圳)有限公司 Webpage type recognition method and system
WO2014203264A1 (en) * 2013-06-21 2014-12-24 Hewlett-Packard Development Company, L.P. Topic based classification of documents
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN103544310A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Shopping guide webpage information classifying method achieved based on classifier
CN103544310B (en) * 2013-11-04 2017-08-08 北京中搜云商网络技术有限公司 Shopping guide webpage information classifying method achieved based on classifier
CN103761477A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Method and equipment for acquiring virus program samples
CN103744981B (en) * 2014-01-14 2017-02-15 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content
CN103729466A (en) * 2014-01-16 2014-04-16 重庆邮电大学 Name country identification method based on WEB and GBBoosting algorithms
CN103810264B (en) * 2014-01-27 2017-06-06 西安理工大学 Webpage text classification method based on feature selection
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection
CN103984700B (en) * 2014-04-15 2017-09-26 厦门产业技术研究院 Heterogeneous data analysis method for vertical search of scientific information
CN103984700A (en) * 2014-04-15 2014-08-13 厦门产业技术研究院 Heterogeneous data analysis method for vertical search of scientific information
CN104615608A (en) * 2014-04-28 2015-05-13 腾讯科技(深圳)有限公司 Data mining processing system and method
CN104615608B (en) * 2014-04-28 2018-05-15 腾讯科技(深圳)有限公司 Data mining processing system and method
CN104408148B (en) * 2014-12-03 2017-12-01 复旦大学 Field encyclopedia establishment system based on general encyclopedia websites
CN104408148A (en) * 2014-12-03 2015-03-11 复旦大学 Field encyclopedia establishment system based on general encyclopedia websites
CN104504086A (en) * 2014-12-25 2015-04-08 北京国双科技有限公司 Clustering method and device for webpage
CN104504086B (en) * 2014-12-25 2017-11-21 北京国双科技有限公司 Clustering method and device for webpage
CN104573021A (en) * 2015-01-12 2015-04-29 浪潮软件集团有限公司 Method for analyzing internet behaviors
CN105468731B (en) * 2015-11-20 2018-10-02 成都科来软件有限公司 Preprocessing method of text sentiment analysis characteristic verification
CN105468731A (en) * 2015-11-20 2016-04-06 成都科来软件有限公司 Preprocessing method of text sentiment analysis characteristic verification
WO2017118427A1 (en) * 2016-01-07 2017-07-13 腾讯科技(深圳)有限公司 Webpage training method and device, and search intention identification method and device
CN105912695A (en) * 2016-04-25 2016-08-31 全球电气资源(深圳)有限公司 Data information transmission and acquisition system based on Internet
CN106203073A (en) * 2016-07-13 2016-12-07 浪潮电子信息产业股份有限公司 Windows white list control method based on file system filter driver
CN106484919A (en) * 2016-11-15 2017-03-08 任子行网络技术股份有限公司 A kind of industrial sustainability sorting technique based on webpage autonomous word and system
WO2018196561A1 (en) * 2017-04-25 2018-11-01 腾讯科技(深圳)有限公司 Label information generating method and device for application and storage medium
WO2019085691A1 (en) * 2017-10-30 2019-05-09 清华大学 Unsupervised exceptional access detection method and apparatus based on one-hot encoding mechanism
CN108062377A (en) * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 The foundation of label picture collection, definite method, apparatus, equipment and the medium of label
CN108921184A (en) * 2018-04-18 2018-11-30 中国科学院信息工程研究所 General webpage type determination method
CN109299346A (en) * 2018-10-31 2019-02-01 长春理工大学 Method and system for identifying invalid address webpage
CN109299346B (en) * 2018-10-31 2020-12-08 长春理工大学 Method and system for identifying invalid address webpage
CN109446327A (en) * 2018-11-01 2019-03-08 合肥工业大学 Diagnostic method and system for customer mobile communication complaints

Similar Documents

Publication Publication Date Title
CN101609450A (en) Web page classification method based on training set
CN103177090B (en) Topic detection method and device based on big data
CN101320375B (en) Digital book search method based on user click action
CN100465954C (en) Reinforced clustering of multi-type data objects for search term suggestion
CN103914478A (en) Webpage training method and system and webpage prediction method and system
CN107784092A (en) Method, server and computer-readable medium for recommending hot words
CN103309862B (en) Webpage type recognition method and system
Yin et al. Facto: a fact lookup engine based on web tables
CN109271477A (en) Method and system for building a taxonomy library via the internet
CN106156372B (en) Classification method and device for internet sites
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN102184262A (en) Web-based text classification mining system and web-based text classification mining method
CN1963816A (en) Automatization processing method of rating of merit of search engine
CN103226578A (en) Method for identifying websites and finely classifying web pages in medical field
CN103064945A (en) Situation searching method based on ontology
CN103177024A (en) Method and device of topic information show
CN104199822A (en) Method and system for identifying demand classification corresponding to searching
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN103678422A (en) Web page classification method and device and training method and device of web page classifier
KR101059557B1 (en) Computer-readable recording media containing information retrieval methods and programs capable of performing the information
CN101751439A (en) Image retrieval method based on hierarchical clustering
Shani et al. Mining recommendations from the web
JP4569380B2 (en) Vector generation method and apparatus, category classification method and apparatus, program, and computer-readable recording medium storing program
Khalid et al. Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20091223