CN201796362U - Automatic file classifying system - Google Patents

Automatic file classifying system

Info

Publication number
CN201796362U
Authority
CN
China
Prior art keywords
module
text
image
audio
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010202000431U
Other languages
Chinese (zh)
Inventor
张晓丹
乔晓东
姚长青
朱礼军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2010-05-24
Filing date
2010-05-24
Publication date
2011-04-13
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN2010202000431U
Application granted
Publication of CN201796362U
Status: Expired - Fee Related

Abstract

The utility model relates to an automatic file classifying system and belongs to the technical field of data mining. The system comprises an input module, an information extraction module, a text preprocessing module, an image preprocessing module, a video preprocessing module, an audio preprocessing module, a text classifying module, an image classifying module, a video classifying module, an audio classifying module, a fusion module and an output module. The information extraction module extracts the text, image, video and audio information from a file. After preprocessing by the text, image, video and audio preprocessing modules, each kind of information is sent to the corresponding text, image, video or audio classifying module to be classified. The fusion module then processes the individual classification results comprehensively to obtain a refined classification result. The automatic file classifying system can obtain text classification results with higher accuracy.

Description

An automatic file classifying system
Technical field
The utility model relates to an automatic file classifying system. It belongs to the field of data mining and is applicable to automatic resource categorization, Web content monitoring, spam filtering, digital libraries and similar applications.
Background art
Automatic file classification is a relatively active research topic in data mining. Its purpose is to train a classification function or classifier that maps each file to be classified onto one of the given categories. Its goal is to find methods that classify faster and manage text information more accurately.
At present, a large body of research concentrates on text classification. For example, Zhang Xiaodan et al., in "Decision level text automatic classified fusion method" (Chinese patent, application number 2009100878443), disclose a decision-level fusion method for automatic text classification whose classification model is shown in Fig. 1. The method takes information fusion as its theoretical basis and automatic classification algorithms of relatively high accuracy, such as SVM, KNN and Bayes, as its objects of study. It adopts a multi-layer fusion structure that mixes serial and parallel connections to build a decision-level fusion model for automatic classification. The shortcoming of the method is that it processes only the text information in a file and ignores the image, video and audio information in the file to be classified, so its classification accuracy is unsatisfactory. The main reason is that network data now contains a large amount of multimedia data, such as video, images and audio, so text-based classification methods alone cannot meet users' needs.
Judging from the published literature and practical applications, no file classification method that handles multiple media simultaneously has yet been reported.
Summary of the utility model
To address the low accuracy of existing automatic text classification systems, the utility model builds on the existing decision-level fusion model for automatic text classification and proposes an automatic file classifying system based on multiple media (image, audio, video and text information) that obtains classification results of higher accuracy.
The utility model is achieved through the following technical solution.
An automatic file classifying system comprises: an input module, an information extraction module, a text preprocessing module, an image preprocessing module, a video preprocessing module, an audio preprocessing module, a text classification module, an image classification module, a video classification module, an audio classification module, a fusion module and an output module.
The modules are connected as follows: the input module is connected to the inputs of the information extraction module, the text preprocessing module, the image preprocessing module, the audio preprocessing module and the video preprocessing module; the output of the information extraction module is connected to the inputs of the text preprocessing module, the image preprocessing module, the audio preprocessing module and the video preprocessing module; the output of the text preprocessing module is connected to the input of the text classification module; the output of the image preprocessing module is connected to the input of the image classification module; the output of the audio preprocessing module is connected to the input of the audio classification module; the output of the video preprocessing module is connected to the input of the video classification module; the outputs of the text, image, audio and video classification modules are connected to the input of the fusion module; the output of the fusion module is connected to the output module.
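The connection relations above form a directed graph, and writing them down as data makes it easy to check that every module eventually feeds the output module. The following is a minimal sketch in Python; the short module identifiers (`text_pre`, `text_clf` and so on) and the function name `processing_order` are illustrative assumptions, not names used in the patent.

```python
from graphlib import TopologicalSorter

# Directed connections as listed in the connection paragraph.
# Keys and values are illustrative identifiers, not names from the patent.
CONNECTIONS = {
    "input": ["extraction", "text_pre", "image_pre", "audio_pre", "video_pre"],
    "extraction": ["text_pre", "image_pre", "audio_pre", "video_pre"],
    "text_pre": ["text_clf"],
    "image_pre": ["image_clf"],
    "audio_pre": ["audio_clf"],
    "video_pre": ["video_clf"],
    "text_clf": ["fusion"],
    "image_clf": ["fusion"],
    "audio_clf": ["fusion"],
    "video_clf": ["fusion"],
    "fusion": ["output"],
    "output": [],
}


def processing_order(connections):
    """Return one valid processing order in which sources come first."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {node: set() for node in connections}
    for source, targets in connections.items():
        for target in targets:
            predecessors[target].add(source)
    return list(TopologicalSorter(predecessors).static_order())


order = processing_order(CONNECTIONS)
print(order)
```

The direct links from the input module to the four preprocessing modules correspond to the training stage, where the corpora bypass the information extraction module.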
The functions of the main modules are as follows:
The main function of the input module is to provide an input interface for data.
The main function of the information extraction module is to extract the text, image, video and audio information from the input file to be classified.
The main function of the text preprocessing module is to preprocess the text information, including word segmentation, feature extraction and weight calculation.
The main function of the image preprocessing module is to preprocess the image information, including image transformation, enhancement, edge detection, restoration and segmentation.
The main function of the video preprocessing module is to preprocess the video information, including feature extraction, construction of a video library and multidimensional analysis of the video data.
The main function of the audio preprocessing module is to preprocess the audio information, including front-end preprocessing, feature extraction and recognition.
The function of the text classification module is to use the text training corpus to determine the features of each predefined category in the system training stage, and to classify the preprocessed text information in the system classification stage. The text classification module may be, but is not limited to, one of the following devices: KNN classifier, SVM classifier, Bayes classifier.
The function of the image classification module is to use the image training corpus to determine the features of each predefined category in the system training stage, and to classify the preprocessed image information in the system classification stage. The image classification module may be, but is not limited to, one of the following devices: SVM classifier, classifier based on a Bayesian network algorithm, classifier based on a BP neural network algorithm.
The function of the video classification module is to use the video training corpus to determine the features of each predefined category in the system training stage, and to classify the preprocessed video information in the system classification stage. The video classification module may be, but is not limited to, one of the following devices: KNN classifier, SVM classifier, classifier based on a Boosting algorithm.
The function of the audio classification module is to use the audio training corpus to determine the features of each predefined category in the system training stage, and to classify the preprocessed audio information in the system classification stage. The audio classification module may be, but is not limited to, one of the following devices: SVM classifier, classifier based on a GMM algorithm.
The main function of the fusion module is to apply a decision-level fusion algorithm to the input classification results and compute, by inference, the final classification result. Decision-level fusion algorithms include, but are not limited to: the Bayesian network algorithm, the D-S evidence theory algorithm and the voting algorithm.
The main function of the output module is to provide an output function for data. The output module may be, but is not limited to, one or a combination of the following devices: display, projector, printer.
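Of the decision-level fusion algorithms named above, the voting algorithm is the simplest to illustrate. The sketch below assumes that each of the four classification modules emits a single category label and that ties are broken in favour of the label that appears first; the patent does not specify either detail, and the function name `vote_fusion` is hypothetical.

```python
from collections import Counter


def vote_fusion(labels):
    """Majority vote over per-modality labels; ties go to the label seen first."""
    if not labels:
        raise ValueError("at least one classification result is required")
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:  # scan in input order so that ties are deterministic
        if counts[label] == best:
            return label


# Hypothetical outputs of the text, image, video and audio classifiers.
print(vote_fusion(["sports", "sports", "news", "sports"]))
```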
The working process is divided into a system training stage and a system classification stage.
The working process of the system training stage is:
Step 1: the text training corpus is input to the text preprocessing module through the input module, and the text preprocessing module preprocesses the text information, including word segmentation, feature extraction and weight calculation; the preprocessed text information is then transferred to the text classification module.
Step 2 (may run concurrently with step 1): the image training corpus is input to the image preprocessing module through the input module, and the image preprocessing module preprocesses the image information, including image transformation, enhancement, edge detection, restoration and segmentation; the preprocessed image information is then transferred to the image classification module.
Step 3 (may run concurrently with step 1): the video training corpus is input to the video preprocessing module through the input module, and the video preprocessing module preprocesses the video information, including feature extraction, construction of a video library and multidimensional analysis of the video data; the preprocessed video information is then transferred to the video classification module.
Step 4 (may run concurrently with step 1): the audio training corpus is input to the audio preprocessing module through the input module, and the audio preprocessing module preprocesses the audio information, including front-end preprocessing, feature extraction and recognition; the preprocessed audio information is then transferred to the audio classification module.
Step 5: the text classification module extracts category features from the preprocessed text information; the image classification module extracts category features from the preprocessed image information; the video classification module extracts category features from the preprocessed video information; the audio classification module extracts category features from the preprocessed audio information.
Step 6: training ends, and the output module outputs a message that system training is complete.
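Because steps 2 to 4 may run at the same time as step 1, the four preprocessing steps can be dispatched concurrently before the classifiers are trained. The sketch below shows one way to do this with a thread pool; the `preprocess` placeholder simply tags each item and stands in for the real modality-specific processing, which the patent describes only at a high level.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(modality, corpus):
    # Placeholder for the modality-specific preprocessing of steps 1 to 4.
    return [f"{modality}:{item}" for item in corpus]


def run_training_preprocessing(corpora):
    """Run the preprocessing steps concurrently and collect results by modality."""
    with ThreadPoolExecutor(max_workers=len(corpora)) as pool:
        futures = {m: pool.submit(preprocess, m, c) for m, c in corpora.items()}
        return {m: f.result() for m, f in futures.items()}


# Tiny stand-in corpora; the item names are invented for illustration.
corpora = {"text": ["t1", "t2"], "image": ["i1"], "video": ["v1"], "audio": ["a1"]}
prepared = run_training_preprocessing(corpora)
print(prepared)
```

Step 5 would then train each classifier on its own entry of `prepared`.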
The working process of the system classification stage is:
Step 1: the file to be classified is input to the information extraction module through the input module.
Step 2: the information extraction module extracts the text, image, video and audio information from the file to be classified and passes each to the corresponding text, image, video or audio preprocessing module.
Step 3: following step 2, the text preprocessing module preprocesses the text information, including word segmentation, feature extraction and weight calculation.
Step 4: following step 2, the image preprocessing module preprocesses the image information, including image transformation, enhancement, edge detection, restoration and segmentation.
Step 5: following step 2, the video preprocessing module preprocesses the video information, including feature extraction, construction of a video library and multidimensional analysis of the video data.
Step 6: following step 2, the audio preprocessing module preprocesses the audio information, including front-end preprocessing, feature extraction and recognition.
Step 7: following step 3, the text classification module classifies the preprocessed text information and outputs the classification result to the fusion module.
Step 8: following step 4, the image classification module classifies the preprocessed image information and outputs the classification result to the fusion module.
Step 9: following step 5, the video classification module classifies the preprocessed video information and outputs the classification result to the fusion module.
Step 10: following step 6, the audio classification module classifies the preprocessed audio information and outputs the classification result to the fusion module.
Step 11: the fusion module applies the decision-level fusion algorithm to the input classification results and computes, by inference, the final classification result.
Step 12: the classification result is output through the output module.
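The twelve steps above amount to an extract, preprocess, classify and fuse pipeline. The sketch below wires that flow together from interchangeable callables; every component in it (`demo_extract`, the lambda preprocessors and classifiers, `demo_fuse`) is a stand-in invented for illustration, since the patent leaves the concrete implementations open.

```python
MODALITIES = ("text", "image", "video", "audio")


def classify_file(file, extract, preprocessors, classifiers, fuse):
    """Run one file through steps 1 to 11 and return the fused label."""
    parts = extract(file)                                    # steps 1-2
    labels = []
    for modality in MODALITIES:
        features = preprocessors[modality](parts[modality])  # steps 3-6
        labels.append(classifiers[modality](features))       # steps 7-10
    return fuse(labels)                                      # step 11


# Stand-in components, for illustration only.
def demo_extract(file):
    return file  # the demo file is already split by modality


demo_preprocessors = {m: (lambda part: part.strip().lower()) for m in MODALITIES}
demo_classifiers = {
    m: (lambda features: "news" if "news" in features else "other")
    for m in MODALITIES
}


def demo_fuse(labels):
    # Simple majority vote; sorting makes the tie-break deterministic.
    return max(sorted(set(labels)), key=labels.count)


demo_file = {"text": " Evening NEWS ", "image": "news photo",
             "video": "clip", "audio": "news jingle"}
result = classify_file(demo_file, demo_extract, demo_preprocessors,
                       demo_classifiers, demo_fuse)
print(result)  # step 12: hand the result to the output module
```

Passing the components in as arguments mirrors the patent's statement that each classification module may be any one of several classifier types.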
Beneficial effects
1. The automatic file classifying system proposed by the utility model classifies the text, image, video and audio information in a file separately and then applies a decision-level fusion algorithm to process the classification results comprehensively, so it can obtain text classification results of higher accuracy.
2. The automatic file classifying system proposed by the utility model not only ensures the correctness of each local classification but also adapts to changes in the classification target, which ensures the efficiency and accuracy of the classification system.
Description of drawings
Fig. 1 is a schematic diagram of the prior-art decision-level fusion model for automatic text classification;
Fig. 2 is a structural diagram of one embodiment of the automatic file classifying system of the utility model.
Embodiment
In accordance with the above technical solution, the utility model is described in detail below with reference to the drawings and an embodiment.
The automatic file classifying system proposed by the utility model is developed on the Java platform with an Oracle database. As shown in Fig. 2, the system comprises: an input module, an information extraction module, a text preprocessing module, an image preprocessing module, an audio preprocessing module, a video preprocessing module, a text classification module (KNN algorithm), an image classification module (SVM algorithm), an audio classification module (GMM algorithm), a video classification module (SVM algorithm), a fusion module (D-S evidence theory algorithm) and an output module (display and printer).
The system is used to classify 21,000 corpus items: 6,000 form the text training corpus, 5,000 the image training corpus, 3,000 the video training corpus, 3,000 the audio training corpus and 4,000 the test corpus. The items are divided into 6 categories.
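As a quick arithmetic check on the corpus split stated above, the four training corpora and the test corpus should add up to the 21,000 items. The dictionary keys below are illustrative labels only.

```python
# Corpus sizes as stated in the embodiment; the keys are illustrative labels.
corpus_sizes = {
    "text_training": 6000,
    "image_training": 5000,
    "video_training": 3000,
    "audio_training": 3000,
    "test": 4000,
}

training_total = sum(n for name, n in corpus_sizes.items() if name != "test")
grand_total = sum(corpus_sizes.values())
print(training_total, grand_total)  # 17000 21000
```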
The workflow is divided into a system training stage and a system classification stage.
The working process of the system training stage is:
Step 1: the 6,000 items of the text training corpus are input to the text preprocessing module through the input module, and the text preprocessing module preprocesses the text information, including word segmentation, feature extraction and weight calculation.
Step 2: the 5,000 items of the image training corpus are input to the image preprocessing module through the input module, and the image preprocessing module preprocesses the image information, including image transformation, enhancement, edge detection, restoration and segmentation.
Step 3: the 3,000 items of the video training corpus are input to the video preprocessing module through the input module, and the video preprocessing module preprocesses the video information, including feature extraction, construction of a video library and multidimensional analysis of the video data.
Step 4: the 3,000 items of the audio training corpus are input to the audio preprocessing module through the input module, and the audio preprocessing module preprocesses the audio information, including front-end preprocessing, feature extraction and recognition.
Step 5: the text classification module uses the KNN algorithm to extract category features from the preprocessed text information; the image classification module uses the SVM algorithm to extract category features from the preprocessed image information; the video classification module uses the SVM algorithm to extract category features from the preprocessed video information; the audio classification module uses the GMM algorithm to extract category features from the preprocessed audio information.
Step 6: training ends, and the output module outputs a message that training is complete.
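For the text branch, the weight calculation and KNN classification can be illustrated with TF-IDF weights and cosine similarity. This is a minimal sketch under stated assumptions: the patent names neither TF-IDF nor cosine similarity, word segmentation is assumed to have been done already (the inputs are token lists), and the smoothed IDF formula, the toy corpus and all function names are illustrative choices.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Weight calculation: term frequency times IDF, for terms seen in training."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) and the IDF table from token lists."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (an assumption; the patent does not give a formula).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in token_lists], idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(query_tokens, vectors, labels, idf, k=3):
    """Label a query by majority vote among its k most similar training items."""
    query = weigh(query_tokens, idf)
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training corpus, invented for illustration.
train_tokens = [["goal", "match", "team"], ["team", "score", "goal"],
                ["stock", "market", "bank"], ["bank", "loan", "market"]]
train_labels = ["sports", "sports", "finance", "finance"]

vectors, idf = tfidf_vectors(train_tokens)
predicted = knn_classify(["team", "goal"], vectors, train_labels, idf, k=3)
print(predicted)
```

KNN has no separate model-fitting step, so "training" here reduces to storing the weighted vectors and their labels.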
The working process of the system classification stage is:
Step 1: the 4,000 items of the test corpus are input to the information extraction module through the input module.
Step 2: the information extraction module extracts the text, image, video and audio information from the 4,000 test items and passes each to the corresponding text, image, audio or video preprocessing module.
Step 3: the text preprocessing module preprocesses the text information, including word segmentation, feature extraction and weight calculation.
Step 4: the image preprocessing module preprocesses the image information, including image transformation, enhancement, edge detection, restoration and segmentation.
Step 5: the video preprocessing module preprocesses the video information, including feature extraction, construction of a video library and multidimensional analysis of the video data.
Step 6: the audio preprocessing module preprocesses the audio information, including front-end preprocessing, feature extraction and recognition.
Step 7: following step 3, the text classification module uses the KNN algorithm to classify the preprocessed text information and outputs the classification result to the fusion module.
Step 8: following step 4, the image classification module uses the SVM algorithm to classify the preprocessed image information and outputs the classification result to the fusion module.
Step 9: following step 5, the video classification module uses the SVM algorithm to classify the preprocessed video information and outputs the classification result to the fusion module.
Step 10: following step 6, the audio classification module uses the GMM algorithm to classify the preprocessed audio information and outputs the classification result to the fusion module.
Step 11: the fusion module applies the D-S evidence theory algorithm to the input classification results and computes, by inference, the final classification result.
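The D-S evidence theory fusion of step 11 can be illustrated with Dempster's rule of combination. The sketch below assumes that each classifier's output has already been turned into a mass function over the category set, with some mass left on the whole set to express uncertainty; how the patented system derives such masses is not described, so the numbers, category names and the helper name `dempster_combine` are purely illustrative.

```python
from functools import reduce
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict: the sources cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Illustrative masses for two categories; THETA is the whole frame (uncertainty).
SPORTS, FINANCE = frozenset({"sports"}), frozenset({"finance"})
THETA = SPORTS | FINANCE
sources = [
    {SPORTS: 0.6, FINANCE: 0.1, THETA: 0.3},  # text classifier
    {SPORTS: 0.5, FINANCE: 0.2, THETA: 0.3},  # image classifier
    {SPORTS: 0.4, FINANCE: 0.3, THETA: 0.3},  # video classifier
    {SPORTS: 0.3, FINANCE: 0.4, THETA: 0.3},  # audio classifier
]

fused = reduce(dempster_combine, sources)
# Final decision: the single category with the largest combined mass.
decision = max((s for s in fused if len(s) == 1), key=fused.get)
print(sorted(decision), fused[decision])
```

Unlike plain voting, this rule lets a confident classifier outweigh several uncertain ones, because the mass left on `THETA` defers to the other sources.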
Through the above operations, the test results shown in Table 1 are obtained.
Meanwhile, to illustrate the classification performance of the utility model, the experiment also classifies the same data under the same conditions (the same training corpus, test corpus and classification scheme) using KNN, SVM and the decision-level fusion method for automatic text classification disclosed in "Decision level text automatic classified fusion method" (Chinese patent, application number 2009100878443). The classification results are shown in Table 1:
Table 1. Comparison of the classification performance of the three algorithms
Conclusion: the automatic file classification method proposed by the utility model handles multiple media and exploits the strengths of multiple classifiers. It achieves higher accuracy and recall than the literature method and the other single classifiers, which verifies its effectiveness.
It should be emphasized that those skilled in the art can make further improvements without departing from the principle of the utility model, and such improvements shall also be regarded as falling within the scope of protection of the utility model.

Claims (6)

1. An automatic file classifying system, characterized in that it comprises: an input module, an information extraction module, a text preprocessing module, an image preprocessing module, a video preprocessing module, an audio preprocessing module, a text classification module, an image classification module, a video classification module, an audio classification module, a fusion module and an output module;
the modules are connected as follows: the input module is connected to the inputs of the information extraction module, the text preprocessing module, the image preprocessing module, the audio preprocessing module and the video preprocessing module; the output of the information extraction module is connected to the inputs of the text preprocessing module, the image preprocessing module, the audio preprocessing module and the video preprocessing module; the output of the text preprocessing module is connected to the input of the text classification module; the output of the image preprocessing module is connected to the input of the image classification module; the output of the audio preprocessing module is connected to the input of the audio classification module; the output of the video preprocessing module is connected to the input of the video classification module; the outputs of the text, image, audio and video classification modules are connected to the input of the fusion module; the output of the fusion module is connected to the output module.
2. The automatic file classifying system as claimed in claim 1, characterized in that the text classification module is one of the following devices: KNN classifier, SVM classifier, Bayes classifier.
3. The automatic file classifying system as claimed in claim 1 or 2, characterized in that the image classification module is one of the following devices: SVM classifier, classifier based on a Bayesian network algorithm, classifier based on a BP neural network algorithm.
4. The automatic file classifying system as claimed in claim 1 or 2, characterized in that the video classification module is one of the following devices: KNN classifier, SVM classifier, classifier based on a Boosting algorithm.
5. The automatic file classifying system as claimed in claim 1 or 2, characterized in that the audio classification module is one of the following devices: SVM classifier, classifier based on a GMM algorithm.
6. The automatic file classifying system as claimed in claim 1 or 2, characterized in that the output module is one or a combination of the following devices: display, projector, printer.
CN2010202000431U 2010-05-24 2010-05-24 Automatic file classifying system Expired - Fee Related CN201796362U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010202000431U CN201796362U (en) 2010-05-24 2010-05-24 Automatic file classifying system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010202000431U CN201796362U (en) 2010-05-24 2010-05-24 Automatic file classifying system

Publications (1)

Publication Number Publication Date
CN201796362U true CN201796362U (en) 2011-04-13

Family

ID=43851263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010202000431U Expired - Fee Related CN201796362U (en) 2010-05-24 2010-05-24 Automatic file classifying system

Country Status (1)

Country Link
CN (1) CN201796362U (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897454A (en) * 2017-02-15 2017-06-27 北京时间股份有限公司 A kind of file classifying method and device
CN106897454B (en) * 2017-02-15 2020-07-03 北京时间股份有限公司 File classification method and device
CN109189950A (en) * 2018-09-03 2019-01-11 腾讯科技(深圳)有限公司 Multimedia resource classification method, device, computer equipment and storage medium
CN109189950B (en) * 2018-09-03 2023-04-07 腾讯科技(深圳)有限公司 Multimedia resource classification method and device, computer equipment and storage medium
CN111428088A (en) * 2018-12-14 2020-07-17 腾讯科技(深圳)有限公司 Video classification method and device and server
CN111428088B (en) * 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 Video classification method and device and server

Similar Documents

Publication Publication Date Title
CN101937445B (en) Automatic file classification system
CN101604322B (en) Decision level text automatic classified fusion method
CN101923561A (en) Automatic document classifying method
CN101329734B (en) License plate character recognition method based on K-L transform and LS-SVM
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN110598800A (en) Garbage classification and identification method based on artificial intelligence
CN106203492A (en) The system and method that a kind of image latent writing is analyzed
CN102915453B (en) Real-time feedback and update vehicle detection method
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN110689085B (en) Garbage classification method based on deep cross-connection network and loss function design
CN108764302B (en) Bill image classification method based on color features and bag-of-words features
CN110717426A (en) Garbage classification method based on domain adaptive learning, electronic equipment and storage medium
CN104050361A (en) Intelligent analysis early warning method for dangerousness tendency of prison persons serving sentences
CN109271523A (en) A kind of government document subject classification method based on information retrieval
CN103020645A (en) System and method for junk picture recognition
CN110442842A (en) The extracting method and device of treaty content, computer equipment, storage medium
CN104142960A (en) Internet data analysis system
CN201796362U (en) Automatic file classifying system
CN105516941A (en) Interception method and device of spam messages
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN112328792A (en) Optimization method for recognizing credit events based on DBSCAN clustering algorithm
CN113407644A (en) Enterprise industry secondary industry multi-label classifier based on deep learning algorithm
CN101719924B (en) Unhealthy multimedia message filtering method based on groupware comprehension
CN104866606A (en) MapReduce parallel big data text classification method
CN101594314A (en) A kind of spam image-recognizing method and device based on high-order autocorrelation characteristic

Legal Events

Date Code Title Description
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110413

Termination date: 20130524