CN101923561A - Automatic document classifying method - Google Patents
- Publication number
- CN101923561A
- Authority
- CN
- China
- Prior art keywords
- classifying
- information
- pretreated
- carried out
- automatic document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an automatic document classification method that belongs to the field of data mining and is applicable to automatic resource classification, network content monitoring, spam filtering, digital libraries and the like. The method first extracts the text, image, video and audio information in a document, then classifies each of the four kinds of information with its own classification method, then collects the four classification results and processes them together with a decision-level fusion algorithm to obtain a final classification result. The invention can obtain document classification results with higher accuracy.
Description
Technical field
The present invention relates to an automatic document classification method. It belongs to the field of data mining and is applicable to automatic resource classification, web content monitoring, spam filtering, digital libraries and the like.
Background art
Automatic document classification is an active research topic in the field of data mining. Its purpose is to train a classification function or classifier that maps each file to be classified into one of a given set of categories. The goal is to find methods that classify faster and manage text information more accurately.
At present, a large body of research concentrates on text classification. For example, Zhang Xiaodan et al., in the document "Decision level text automatic classified fusion method" (Chinese patent, application number 2009100878443), disclose a decision-level fusion method for automatic text classification; its classification model is shown in Fig. 1. Taking information fusion as its theoretical basis and relatively high-precision classification algorithms such as SVM, KNN and Bayes as its objects of study, that method uses a multilayer fusion structure that mixes serial and parallel connections to build a decision-level fusion model for automatic document classification. Its shortcoming is that it processes only the text information in a file and ignores the image, video and audio information in the file to be classified, so its classification accuracy is unsatisfactory. The main reason is that network data today contains a large amount of multimedia data such as video, images and audio, so text-based classification methods alone cannot meet users' needs.
Among the published documents, no file classification method that handles multiple media simultaneously has yet been found.
Summary of the invention
To address the low accuracy of existing automatic document classification methods, the present invention builds on the existing decision-level fusion method for automatic text classification and proposes an automatic document classification method based on multiple media (image, audio, video and text information) that obtains classification results with higher accuracy.
The present invention is achieved by the following technical solution.
An automatic document classification method, with the following specific operation steps:
Step 1: Extract the text information, image information, video information and audio information from the file to be classified.
Step 2: On the basis of step 1, preprocess the extracted text, image, video and audio information separately. Text preprocessing includes word segmentation, feature extraction, weight calculation, etc. Image preprocessing includes image transformation, enhancement, edge detection, restoration, segmentation, etc. Video preprocessing includes feature extraction, building a video library, multidimensional analysis of the video data, etc. Audio preprocessing includes front-end preprocessing, feature extraction, recognition, etc.
Step 3: On the basis of step 2, classify the preprocessed text information. The classification methods used include but are not limited to: KNN, SVM, Bayes.
Step 4: On the basis of step 2, classify the preprocessed image information. The classification methods used include but are not limited to: SVM, Bayesian network, BP neural network.
Step 5: On the basis of step 2, classify the preprocessed video information. The classification methods used include but are not limited to: KNN, SVM, Boosting algorithm.
Step 6: On the basis of step 2, classify the preprocessed audio information. The classification methods used include but are not limited to: SVM, GMM algorithm.
Step 7: Collect the classification results of steps 3 to 6 and apply a decision-level fusion algorithm to the collected results to infer the final classification result. The decision-level fusion algorithm includes but is not limited to: Bayesian network algorithm, D-S evidence theory algorithm, voting algorithm.
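The seven steps above can be sketched as a small pipeline. The following Python is only an illustrative outline, not the patented implementation: the names `classify_file`, `majority_vote` and the `classifiers` dictionary are hypothetical, the per-medium classifiers are stand-ins for whatever KNN, SVM, GMM or other models are actually trained, and simple majority voting is chosen here only because the voting algorithm is one of the fusion options the text lists.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical type: a per-medium classifier maps preprocessed content to a class label.
Classifier = Callable[[object], str]


def majority_vote(labels: List[str]) -> Optional[str]:
    """Decision-level fusion by voting: return the most frequent label.

    Ties are broken in favour of the label that appears first in `labels`.
    Returns None when there are no labels to fuse.
    """
    if not labels:
        return None
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def classify_file(parts: Dict[str, object],
                  classifiers: Dict[str, Classifier]) -> Optional[str]:
    """Classify one file from its extracted, preprocessed parts.

    `parts` maps a medium name ("text", "image", "video", "audio") to the
    content produced by steps 1 and 2; media absent from the file are omitted.
    """
    # Steps 3 to 6: classify each medium with its own classifier.
    labels = [classifiers[medium](content)
              for medium, content in parts.items()
              if medium in classifiers]
    # Step 7: fuse the per-medium decisions into one final label.
    return majority_vote(labels)


# Toy stand-in classifiers (assumptions for illustration only).
classifiers = {
    "text": lambda content: "sports",
    "image": lambda content: "sports",
    "video": lambda content: "news",
    "audio": lambda content: "sports",
}
parts = {"text": "t", "image": "i", "video": "v", "audio": "a"}
print(classify_file(parts, classifiers))
```

With these toy classifiers three of the four media vote for "sports", so the fused decision is "sports". Keeping the fusion step as a separate function makes it straightforward to swap voting for a Bayesian network or D-S evidence theory without touching the per-medium classifiers.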
Beneficial effect
The method of the invention classifies the text, image, video and audio information in a file separately and then applies a decision-level fusion algorithm to the classification results as a whole, and can thereby obtain document classification results with higher accuracy.
Description of drawings
Fig. 1 is a schematic diagram of the prior-art decision-level fusion model for automatic text classification.
Embodiment
The present invention is described in detail below with reference to an embodiment, in accordance with the technical solution above.
In this embodiment, a file classification system is built using the method of the invention; the system uses the Java development platform and an Oracle database. The system is trained on separate text, image, video and audio corpora (the source gives corpus sizes of 6000, 3000, 3000 and 5000 items, but the translation does not make clear which size belongs to which medium). After training, 4000 test items are used for testing. The specific steps are as follows:
Step 1: Extract the text information, image information, video information and audio information from the 4000 files to be classified.
Step 2: Preprocess the text information, including word segmentation, feature extraction and weight calculation. Preprocess the image information, including image transformation, enhancement, edge detection, restoration and segmentation. Preprocess the video information, including feature extraction, building a video library and multidimensional analysis of the video data. Preprocess the audio information, including front-end preprocessing, feature extraction and recognition.
Step 3: Classify the preprocessed text information using the KNN method.
Step 4: Classify the preprocessed image information using the SVM method.
Step 5: Classify the preprocessed video information using the SVM method.
Step 6: Classify the preprocessed audio information using the GMM algorithm.
Step 7: Collect the classification results of steps 3 to 6 and apply the D-S evidence theory algorithm to the collected results to infer the final classification result.
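The fusion in the last step can be illustrated with Dempster's rule of combination, the core operation of D-S evidence theory. The sketch below is a minimal illustration under stated assumptions, not the embodiment's actual code: the patent does not say how each classifier's output is turned into a mass function, so the masses here are made-up toy values, and the function names `dempster_combine`, `fuse` and `decide` are hypothetical.

```python
from functools import reduce
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels (a hypothesis)
    to a mass in [0, 1]; the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the hypotheses.
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contribute to the conflict mass K.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict and cannot be combined")
    # Normalise by 1 - K so the combined masses sum to 1 again.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def fuse(mass_functions):
    """Fold Dempster's rule over the per-medium mass functions."""
    return reduce(dempster_combine, mass_functions)


def decide(mass):
    """Pick the single class with the largest combined mass.

    Raises ValueError if no single-class hypothesis carries any mass.
    """
    singletons = {next(iter(s)): w for s, w in mass.items() if len(s) == 1}
    return max(singletons, key=singletons.get)


# Toy masses for two media over the classes {"news", "sports"} (assumed values).
NEWS = frozenset({"news"})
SPORTS = frozenset({"sports"})
EITHER = frozenset({"news", "sports"})  # mass left on "don't know"
text_mass = {NEWS: 0.6, EITHER: 0.4}
image_mass = {NEWS: 0.5, SPORTS: 0.3, EITHER: 0.2}
print(decide(fuse([text_mass, image_mass])))
```

In this toy case both sources lean towards "news", so the fused decision is "news". Mass assigned to the whole frame (`EITHER`) lets a classifier express uncertainty, which is the main practical difference between this rule and plain voting.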
Through the above steps, the test results shown in Table 1 are obtained.
To illustrate the classification performance of the invention, classification was also carried out under the same conditions (the same training corpus, test corpus and category system) using KNN, SVM and the decision-level fusion method for automatic text classification disclosed in the document "Decision level text automatic classified fusion method" (Chinese patent, application number 2009100878443). The classification results are shown in Table 1:
Table 1. Comparison of the classification performance of the three algorithms
Conclusion: The automatic document classification method proposed by the invention uses multiple media and exploits the strengths of multiple classifiers. It achieves a higher accuracy rate and recall rate than the method of the cited document and the other single classifiers, which verifies its effectiveness.
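For readers who want to reproduce such a comparison, the measures named in the conclusion can be computed as in the sketch below. It is an illustration only: the source does not define its "accuracy rate", so the code shows both overall accuracy and per-class precision (the measure usually paired with recall), and the label lists are toy data, not the patent's results. The function names are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of files whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels) if true_labels else 0.0


def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class `target`.

    precision = TP / (TP + FP): of the files assigned to the class, how many belong to it.
    recall    = TP / (TP + FN): of the files belonging to the class, how many were found.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (assumed for illustration; not taken from Table 1).
truth = ["news", "news", "sports", "sports"]
predicted = ["news", "sports", "sports", "sports"]
print(accuracy(truth, predicted))
print(precision_recall(truth, predicted, "news"))
```

On this toy data three of four predictions are correct, and the class "news" has perfect precision but only half of its files are recovered, which shows why the two measures are reported together.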
It should be emphasized that those skilled in the art can make improvements without departing from the principle of the invention, and such improvements should also be regarded as falling within the scope of protection of the invention.
Claims (6)
1. An automatic document classification method, characterized in that its specific operation steps are as follows:
Step 1: Extract the text information, image information, video information and audio information from the file to be classified.
Step 2: On the basis of step 1, preprocess the extracted text, image, video and audio information separately. Text preprocessing includes word segmentation, feature extraction, weight calculation, etc. Image preprocessing includes image transformation, enhancement, edge detection, restoration, segmentation, etc. Video preprocessing includes feature extraction, building a video library, multidimensional analysis of the video data, etc. Audio preprocessing includes front-end preprocessing, feature extraction, recognition, etc.
Step 3: On the basis of step 2, classify the preprocessed text information.
Step 4: On the basis of step 2, classify the preprocessed image information.
Step 5: On the basis of step 2, classify the preprocessed video information.
Step 6: On the basis of step 2, classify the preprocessed audio information.
Step 7: Collect the classification results of steps 3 to 6 and apply a decision-level fusion algorithm to the collected results to infer the final classification result.
2. The automatic document classification method according to claim 1, characterized in that the classification methods used in step 3 to classify the preprocessed text information include but are not limited to: KNN, SVM, Bayes.
3. The automatic document classification method according to claim 1 or 2, characterized in that the classification methods used in step 4 to classify the preprocessed image information include but are not limited to: SVM, Bayesian network, BP neural network.
4. The automatic document classification method according to claim 1 or 2, characterized in that the classification methods used in step 5 to classify the preprocessed video information include but are not limited to: KNN, SVM, Boosting algorithm.
5. The automatic document classification method according to claim 1 or 2, characterized in that the classification methods used in step 6 to classify the preprocessed audio information include but are not limited to: SVM, GMM algorithm.
6. The automatic document classification method according to claim 1 or 2, characterized in that the decision-level fusion algorithm in step 7 includes but is not limited to: Bayesian network algorithm, D-S evidence theory algorithm, voting algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010179678 CN101923561A (en) | 2010-05-24 | 2010-05-24 | Automatic document classifying method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010179678 CN101923561A (en) | 2010-05-24 | 2010-05-24 | Automatic document classifying method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101923561A (en) | 2010-12-22 |
Family
ID=43338497
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010179678 Pending CN101923561A (en) | 2010-05-24 | 2010-05-24 | Automatic document classifying method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101923561A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102426585A (en) * | 2011-08-09 | 2012-04-25 | 中国科学技术信息研究所 | Webpage automatic classification method based on Bayesian network |
CN103136247A (en) * | 2011-11-29 | 2013-06-05 | 阿里巴巴集团控股有限公司 | Attribute data interval partition method and attribute data interval partition device |
CN104731979A (en) * | 2015-04-16 | 2015-06-24 | 广东欧珀移动通信有限公司 | Method and device for storing all exclusive information resources of specific user |
CN105260398A (en) * | 2015-09-17 | 2016-01-20 | 中国科学院自动化研究所 | Quick sorting method for movie types based on poster and plot summary |
CN105512131A (en) * | 2014-09-25 | 2016-04-20 | 中国科学技术信息研究所 | Method and device for classification method category mapping based on category similarity calculation |
CN106372182A (en) * | 2016-08-30 | 2017-02-01 | 浪潮(北京)电子信息产业有限公司 | File management method and system and cloud platform |
CN106897454A (en) * | 2017-02-15 | 2017-06-27 | 北京时间股份有限公司 | A kind of file classifying method and device |
CN107391751A (en) * | 2017-08-15 | 2017-11-24 | 郑州云海信息技术有限公司 | A kind of file classifying method and device |
CN107958289A (en) * | 2016-10-18 | 2018-04-24 | 深圳光启合众科技有限公司 | Data processing method and device, robot for robot |
CN109460467A (en) * | 2018-09-28 | 2019-03-12 | 中国科学院电子学研究所苏州研究院 | A kind of network information classification system construction method |
CN117668333A (en) * | 2024-02-01 | 2024-03-08 | 北京宽客进化科技有限公司 | File classification method, system, equipment and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588879A (en) * | 2004-08-12 | 2005-03-02 | 复旦大学 | Internet content filtering system and method |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
CN1945581A (en) * | 2005-09-30 | 2007-04-11 | 通用电气公司 | Computer assisted domain specific entity mapping method and system |
2010
- 2010-05-24 CN CN 201010179678 patent/CN101923561A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588879A (en) * | 2004-08-12 | 2005-03-02 | 复旦大学 | Internet content filtering system and method |
CN1945581A (en) * | 2005-09-30 | 2007-04-11 | 通用电气公司 | Computer assisted domain specific entity mapping method and system |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102426585A (en) * | 2011-08-09 | 2012-04-25 | 中国科学技术信息研究所 | Webpage automatic classification method based on Bayesian network |
CN103136247A (en) * | 2011-11-29 | 2013-06-05 | 阿里巴巴集团控股有限公司 | Attribute data interval partition method and attribute data interval partition device |
CN103136247B (en) * | 2011-11-29 | 2015-12-02 | 阿里巴巴集团控股有限公司 | Attribute data interval division method and device |
CN105512131A (en) * | 2014-09-25 | 2016-04-20 | 中国科学技术信息研究所 | Method and device for classification method category mapping based on category similarity calculation |
CN104731979A (en) * | 2015-04-16 | 2015-06-24 | 广东欧珀移动通信有限公司 | Method and device for storing all exclusive information resources of specific user |
CN105260398A (en) * | 2015-09-17 | 2016-01-20 | 中国科学院自动化研究所 | Quick sorting method for movie types based on poster and plot summary |
CN106372182A (en) * | 2016-08-30 | 2017-02-01 | 浪潮(北京)电子信息产业有限公司 | File management method and system and cloud platform |
CN107958289A (en) * | 2016-10-18 | 2018-04-24 | 深圳光启合众科技有限公司 | Data processing method and device, robot for robot |
CN107958289B (en) * | 2016-10-18 | 2022-02-01 | 深圳市中吉电气科技有限公司 | Data processing method and device for robot and robot |
CN106897454A (en) * | 2017-02-15 | 2017-06-27 | 北京时间股份有限公司 | A kind of file classifying method and device |
CN106897454B (en) * | 2017-02-15 | 2020-07-03 | 北京时间股份有限公司 | File classification method and device |
CN107391751A (en) * | 2017-08-15 | 2017-11-24 | 郑州云海信息技术有限公司 | A kind of file classifying method and device |
CN109460467A (en) * | 2018-09-28 | 2019-03-12 | 中国科学院电子学研究所苏州研究院 | A kind of network information classification system construction method |
CN109460467B (en) * | 2018-09-28 | 2020-02-14 | 中国科学院电子学研究所苏州研究院 | Method for constructing network information classification system |
CN117668333A (en) * | 2024-02-01 | 2024-03-08 | 北京宽客进化科技有限公司 | File classification method, system, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101923561A (en) | Automatic document classifying method | |
CN101937445B (en) | Automatic file classification system | |
KR101612423B1 (en) | Disaster detecting system using social media | |
CN101604322B (en) | Decision level text automatic classified fusion method | |
CN102346847B (en) | License plate character recognizing method of support vector machine | |
CN103186845B (en) | A kind of rubbish mail filtering method | |
CN106203492A (en) | The system and method that a kind of image latent writing is analyzed | |
CN103310179A (en) | Method and system for optimal attitude detection based on face recognition technology | |
CN102420723A (en) | Anomaly detection method for various kinds of intrusion | |
CN107121436B (en) | The Intelligent detecting method and identification device of a kind of silicon material quality | |
CN105141455B (en) | A kind of net flow assorted modeling method of making an uproar based on statistical nature | |
CN109857862A (en) | File classification method, device, server and medium based on intelligent decision | |
CN101251896B (en) | Object detecting system and method based on multiple classifiers | |
CN103592587A (en) | Partial discharge diagnosis method based on data mining | |
CN109598303A (en) | A kind of rubbish detection method based on City scenarios | |
CN101540017A (en) | Feature extraction method based on byte level n-gram and junk mail filter | |
CN110334602B (en) | People flow statistical method based on convolutional neural network | |
CN101330476A (en) | Method for dynamically detecting junk mail | |
CN113362299B (en) | X-ray security inspection image detection method based on improved YOLOv4 | |
CN109190657A (en) | Sample homogeneous assays method based on data slicer and image hash combination | |
CN101546557A (en) | Method for updating classifier parameters for identifying audio content | |
CN105516941A (en) | Interception method and device of spam messages | |
CN107123076A (en) | A kind of waste management system and waste management method | |
CN104484651B (en) | Portrait dynamic contrast method and system | |
CN201796362U (en) | Automatic file classifying system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20101222 |