CN101923561A - Automatic document classifying method - Google Patents

Info

Publication number
CN101923561A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010179678
Other languages
Chinese (zh)
Inventor
张晓丹
乔晓东
姚长青
朱礼军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2010-05-24
Filing date
2010-05-24
Publication date
2010-12-22
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN 201010179678
Publication of CN101923561A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an automatic document classifying method which belongs to the field of data mining and is applicable to automatic resource classification, network content monitoring, spam filtering, digital libraries and the like. The method comprises the following steps: first, extracting the text information, image information, video information and audio information in a document; then classifying the four kinds of information with different classification methods; next, gathering the classification results for the four kinds of information; and finally combining them with a decision-level fusion algorithm to obtain the final classification result. The invention can obtain a document classification result with higher accuracy.

Description

An automatic document classifying method
Technical field
The present invention relates to an automatic document classifying method. It belongs to the field of data mining and is applicable to automatic resource classification, network content monitoring, spam filtering, digital libraries, and the like.
Background technology
Automatic document classification is an active research topic in the field of data mining. Its purpose is to train a classification function or classifier that maps a document to be classified into one of a given set of categories. The goal is to find methods that classify faster and manage text information more accurately.
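To make the idea of a trained classifier that maps a document to a category concrete, here is a minimal sketch of a k-nearest-neighbour (KNN) text classifier. It is only an illustration: the whitespace tokenizer, the raw term-frequency weights, the cosine similarity, and the names `bag_of_words`, `cosine`, and `train_knn` are assumptions made for this example and are not taken from the patent.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_knn(samples, k=3):
    """'Training' for KNN just stores the labelled vectors.

    samples: list of (text, category) pairs.
    Returns a function that maps a new text to a category.
    """
    vectors = [(bag_of_words(text), label) for text, label in samples]

    def classify(text):
        query = bag_of_words(text)
        # Take the k stored documents most similar to the query.
        nearest = sorted(vectors, key=lambda pair: cosine(query, pair[0]),
                         reverse=True)[:k]
        # The most frequent category among the neighbours wins.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    return classify
```

Calling `train_knn(samples)` returns the classification function itself, which mirrors the description above: training produces a function, and that function maps each document to one of the given categories.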
At present, most research concentrates on text classification. For example, Zhang Xiaodan et al., in the document "A decision-level text automatic classification fusion method" (national patent, application No. 2009100878443), disclose a decision-level fusion method for automatic text classification, whose classification model is shown in Fig. 1. That method takes information fusion as its theoretical basis and higher-precision automatic document classification algorithms such as SVM, KNN, and Bayes as its research objects. It adopts a multi-layer fusion structure that mixes serial and parallel connections to build a decision-level fusion model for automatic document classification. Its shortcoming is that it processes only the text information in a document and does not process the image, video, and audio information in the document to be classified, so its classification accuracy is unsatisfactory. The main reason is that network data now contains a large amount of multimedia data, such as video, images, and audio, so text-based classification methods cannot meet users' needs.
In the published literature, no document classification method that processes multiple media simultaneously has yet been found.
Summary of the invention
To address the low accuracy of existing automatic document classification methods, the present invention builds on the existing decision-level text automatic classification fusion method and proposes an automatic document classifying method based on multiple media (image, audio, video, and text information), which obtains classification results with higher accuracy.
The present invention is achieved by the following technical solution.
An automatic document classifying method, whose specific operating steps are as follows:
Step 1: Extract the text information, image information, video information, and audio information from the document to be classified;
Step 2: Preprocess the text, image, video, and audio information extracted in Step 1, each separately. Text preprocessing includes word segmentation, feature extraction, weight calculation, and the like; image preprocessing includes image transformation, enhancement, edge detection, restoration, segmentation, and the like; video preprocessing includes feature extraction, building a video library, multidimensional analysis of the video data, and the like; audio preprocessing includes front-end processing, feature extraction, recognition, and the like;
Step 3: Classify the text information preprocessed in Step 2; the classification methods used include, but are not limited to, KNN, SVM, and Bayes;
Step 4: Classify the image information preprocessed in Step 2; the classification methods used include, but are not limited to, SVM, Bayesian networks, and BP neural networks;
Step 5: Classify the video information preprocessed in Step 2; the classification methods used include, but are not limited to, KNN, SVM, and the Boosting algorithm;
Step 6: Classify the audio information preprocessed in Step 2; the classification methods used include, but are not limited to, SVM and the GMM algorithm;
Step 7: Collect the classification results of Steps 3 to 6 and apply a decision-level fusion algorithm to them for reasoning and calculation, obtaining the final classification result; the decision-level fusion algorithm includes, but is not limited to, the Bayesian network algorithm, the D-S evidence theory algorithm, and the voting algorithm.
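The seven steps above can be sketched as a small pipeline in which each modality is classified independently and the decisions are then fused. This is a hedged illustration only: the voting variant of decision-level fusion is used because it is the simplest of the options listed, and the names `classify_document`, `parts`, and `classifiers` are hypothetical. Extraction and preprocessing (Steps 1 and 2) are assumed to have happened before this function is called.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# A modality classifier maps already-preprocessed data to a category label.
Classifier = Callable[[object], str]


def classify_document(parts: Dict[str, object],
                      classifiers: Dict[str, Classifier]) -> Optional[str]:
    """Classify each available modality, then fuse the decisions by voting.

    parts: preprocessed data per modality, e.g. keys "text", "image",
           "video", "audio"; a value of None means the modality is absent.
    classifiers: one trained classifier per modality (assumed to exist).
    """
    # Steps 3 to 6: one independent decision per modality that is present.
    decisions = {
        modality: classifiers[modality](data)
        for modality, data in parts.items()
        if data is not None and modality in classifiers
    }
    if not decisions:
        return None  # nothing could be classified

    # Step 7: decision-level fusion by simple majority vote.
    # On a tie, Counter.most_common keeps the label that was counted first.
    votes = Counter(decisions.values())
    return votes.most_common(1)[0][0]
```

A plain vote treats all four classifiers as equally reliable. Weighted voting or the D-S evidence theory option named in Step 7 would let a more trustworthy modality count for more.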
Beneficial effect
The method of the invention classifies the text information, image information, video information, and audio information in a document separately and then applies a decision-level fusion algorithm to combine the classification results, obtaining a document classification result with higher accuracy.
Description of drawings
Fig. 1 is a schematic diagram of the prior-art decision-level text automatic classification fusion model.
Embodiment
The present invention is described in detail below with reference to an embodiment, in accordance with the above technical solution.
In this embodiment, the method of the invention is used to build a document classification system on the Java development platform with an Oracle database. The system is trained with an audio corpus of 6000 items, a video corpus of 3000 items, an image corpus of 3000 items, and a text corpus of 5000 items. After training, 4000 test documents are used for testing. The specific steps are as follows:
Step 1: Extract the text information, image information, video information, and audio information from the 4000 documents to be classified;
Step 2: Preprocess the text information, including word segmentation, feature extraction, and weight calculation; preprocess the image information, including image transformation, enhancement, edge detection, restoration, and segmentation; preprocess the video information, including feature extraction, building a video library, and multidimensional analysis of the video data; preprocess the audio information, including front-end processing, feature extraction, and recognition;
Step 3: Classify the preprocessed text information with the KNN method;
Step 4: Classify the preprocessed image information with the SVM method;
Step 5: Classify the preprocessed video information with the SVM method;
Step 6: Classify the preprocessed audio information with the GMM algorithm;
Step 7: Collect the classification results of Steps 3 to 6 and apply the D-S evidence theory algorithm to them for reasoning and calculation, obtaining the final classification result.
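The D-S evidence theory fusion used in Step 7 of the embodiment can be illustrated with Dempster's rule of combination. The sketch below is an assumption-laden example, not the patent's implementation: the patent does not say how each classifier's output is turned into a mass function, so the function names (`dempster_combine`, `fuse_all`, `decide`) and the dictionary representation are hypothetical.

```python
from functools import reduce
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    A mass function is a dict mapping a frozenset of category labels to a
    mass; the masses of one source are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass assigned to the empty set
    if not combined:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Renormalize so that the remaining masses sum to 1 again.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


def fuse_all(mass_functions):
    """Fuse any number of sources, e.g. text, image, video, and audio."""
    return reduce(dempster_combine, mass_functions)


def decide(mass):
    """Pick the single category with the largest fused mass, if any."""
    singletons = {s: m for s, m in mass.items() if len(s) == 1}
    if not singletons:
        return None
    return next(iter(max(singletons, key=singletons.get)))
```

In this representation, a classifier that is unsure can put part of its mass on the whole set of categories rather than on a single label, which is the main practical difference from plain voting.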
Following the above steps, the test results shown in Table 1 are obtained.
Meanwhile, to illustrate the classification performance of the present invention, classification was also carried out under the same conditions (the same training corpus, test corpus, and classification scheme) using KNN, SVM, and the decision-level text automatic classification fusion method disclosed in the document "A decision-level text automatic classification fusion method" (national patent, application No. 2009100878443). The classification results are shown in Table 1:
Table 1. Comparison of the classification results of the three algorithms
Figure GSA00000133804700031
Conclusion: The automatic document classifying method proposed by the present invention uses multiple media and exploits the strengths of multiple classifiers. It achieves higher accuracy and recall than the cited literature method and the other single classifiers, which verifies its effectiveness.
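As a hedged illustration of how per-category figures of this kind are commonly computed, the following sketch assumes that the "accuracy rate" reported alongside recall means precision in the information-retrieval sense. The function name and the label lists are hypothetical and do not reproduce any data from Table 1.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall of one category over paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    # Correctly assigned to the category.
    tp = sum(1 for t, p in pairs if t == category and p == category)
    # Assigned to the category but actually belonging elsewhere.
    fp = sum(1 for t, p in pairs if t != category and p == category)
    # Belonging to the category but assigned elsewhere.
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers how many of the documents assigned to a category really belong to it, while recall answers how many of the documents belonging to a category were found.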
It should be emphasized that those skilled in the art may make further improvements without departing from the principle of the invention, and such improvements shall also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An automatic document classifying method, characterized in that its specific operating steps are as follows:
Step 1: Extract the text information, image information, video information, and audio information from the document to be classified;
Step 2: Preprocess the text, image, video, and audio information extracted in Step 1, each separately. Text preprocessing includes word segmentation, feature extraction, weight calculation, and the like; image preprocessing includes image transformation, enhancement, edge detection, restoration, segmentation, and the like; video preprocessing includes feature extraction, building a video library, multidimensional analysis of the video data, and the like; audio preprocessing includes front-end processing, feature extraction, recognition, and the like;
Step 3: Classify the text information preprocessed in Step 2;
Step 4: Classify the image information preprocessed in Step 2;
Step 5: Classify the video information preprocessed in Step 2;
Step 6: Classify the audio information preprocessed in Step 2;
Step 7: Collect the classification results of Steps 3 to 6 and apply a decision-level fusion algorithm to them for reasoning and calculation, obtaining the final classification result.
2. The automatic document classifying method as claimed in claim 1, characterized in that the classification methods used in Step 3 to classify the preprocessed text information include, but are not limited to, KNN, SVM, and Bayes.
3. The automatic document classifying method as claimed in claim 1 or 2, characterized in that the classification methods used in Step 4 to classify the preprocessed image information include, but are not limited to, SVM, Bayesian networks, and BP neural networks.
4. The automatic document classifying method as claimed in claim 1 or 2, characterized in that the classification methods used in Step 5 to classify the preprocessed video information include, but are not limited to, KNN, SVM, and the Boosting algorithm.
5. The automatic document classifying method as claimed in claim 1 or 2, characterized in that the classification methods used in Step 6 to classify the preprocessed audio information include, but are not limited to, SVM and the GMM algorithm.
6. The automatic document classifying method as claimed in claim 1 or 2, characterized in that the decision-level fusion algorithm in Step 7 includes, but is not limited to, the Bayesian network algorithm, the D-S evidence theory algorithm, and the voting algorithm.
CN 201010179678 2010-05-24 2010-05-24 Automatic document classifying method Pending CN101923561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010179678 CN101923561A (en) 2010-05-24 2010-05-24 Automatic document classifying method

Publications (1)

Publication Number Publication Date
CN101923561A true CN101923561A (en) 2010-12-22

Family

ID=43338497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010179678 Pending CN101923561A (en) 2010-05-24 2010-05-24 Automatic document classifying method

Country Status (1)

Country Link
CN (1) CN101923561A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588879A (en) * 2004-08-12 2005-03-02 复旦大学 Internet content filtering system and method
CN1920818A (en) * 2006-09-14 2007-02-28 浙江大学 Transmedia search method based on multi-mode information convergence analysis
CN1945581A (en) * 2005-09-30 2007-04-11 通用电气公司 Computer assisted domain specific entity mapping method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426585A (en) * 2011-08-09 2012-04-25 中国科学技术信息研究所 Webpage automatic classification method based on Bayesian network
CN103136247A (en) * 2011-11-29 2013-06-05 阿里巴巴集团控股有限公司 Attribute data interval partition method and attribute data interval partition device
CN103136247B (en) * 2011-11-29 2015-12-02 阿里巴巴集团控股有限公司 Attribute data interval division method and device
CN105512131A (en) * 2014-09-25 2016-04-20 中国科学技术信息研究所 Method and device for classification method category mapping based on category similarity calculation
CN104731979A (en) * 2015-04-16 2015-06-24 广东欧珀移动通信有限公司 Method and device for storing all exclusive information resources of specific user
CN105260398A (en) * 2015-09-17 2016-01-20 中国科学院自动化研究所 Quick sorting method for movie types based on poster and plot summary
CN106372182A (en) * 2016-08-30 2017-02-01 浪潮(北京)电子信息产业有限公司 File management method and system and cloud platform
CN107958289A (en) * 2016-10-18 2018-04-24 深圳光启合众科技有限公司 Data processing method and device, robot for robot
CN107958289B (en) * 2016-10-18 2022-02-01 深圳市中吉电气科技有限公司 Data processing method and device for robot and robot
CN106897454A (en) * 2017-02-15 2017-06-27 北京时间股份有限公司 A kind of file classifying method and device
CN106897454B (en) * 2017-02-15 2020-07-03 北京时间股份有限公司 File classification method and device
CN107391751A (en) * 2017-08-15 2017-11-24 郑州云海信息技术有限公司 A kind of file classifying method and device
CN109460467A (en) * 2018-09-28 2019-03-12 中国科学院电子学研究所苏州研究院 A kind of network information classification system construction method
CN109460467B (en) * 2018-09-28 2020-02-14 中国科学院电子学研究所苏州研究院 Method for constructing network information classification system
CN117668333A (en) * 2024-02-01 2024-03-08 北京宽客进化科技有限公司 File classification method, system, equipment and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20101222