CN107391751A - File classification method and device - Google Patents
- Publication number: CN107391751A
- Application number: CN201710697846.9A
- Authority: CN (China)
- Prior art keywords: file, sorted, classification, categories, feature
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses a file classification method and device. A machine learning algorithm is used to extract, for each category, the common features of the files of that category in a sample training set; for each category, the number of those common features that are identical to features of the file to be classified is counted; and the file to be classified is assigned to a category according to the counted numbers. The embodiments of the invention classify files automatically, avoid spending human resources on file classification, and improve the efficiency of file classification.
Description
Technical field
The present invention relates to, but is not limited to, data processing technology, and in particular to a file classification method and device.
Background
Cloud platforms are now widely used, and users can upload files to a cloud platform. When uploading, however, the user has to classify the uploaded files manually, which is tedious. With the arrival of the big data era in particular, the number of uploaded files is very large, so manual classification consumes substantial human resources and is inefficient.
Summary of the invention
To solve the above technical problem, the invention provides a file classification method and device that classify files automatically and avoid spending human resources on file classification.
To achieve this object, the invention provides a file classification method, including:
Using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set;
Counting, for each category, the number of common features of that category that are identical to features of the file to be classified;
Assigning the file to be classified to a category according to the counted numbers.
Further, before the common features of the files of each category in the sample training set are extracted, the method also includes:
Dividing files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
Further, assigning the file to be classified to a category according to the counted numbers includes:
Assigning the file to be classified to the category whose common features yield the largest counted number.
The invention also provides a file classification device, including:
An extraction module, for using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set;
A statistics module, for counting, for each category, the number of common features of that category that are identical to features of the file to be classified;
A first division module, for assigning the file to be classified to a category according to the counted numbers.
Further, the device also includes:
A second division module, for dividing files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
Further, the first division module is specifically used for:
Assigning the file to be classified to the category whose common features yield the largest counted number.
The invention at least includes: using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set; counting, for each category, the number of common features that are identical to features of the file to be classified; and assigning the file to be classified to a category according to the counted numbers. In the embodiments of the invention, the common features of the files of each category are learned by the machine learning algorithm and then used to classify the file automatically. This avoids spending human resources on file classification and effectively improves classification efficiency, particularly when the number of files to be classified is very large.
Other features and advantages of the invention are set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.
Brief description of the drawings
The drawings provide a further understanding of the technical solution of the invention and form part of the specification. Together with the embodiments of this application they serve to explain the technical solution, and they do not limit it.
Fig. 1 is a schematic flowchart of a file classification method provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of a file classification device provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of another file classification device provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the embodiments of the invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another.
The steps illustrated in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
An embodiment of the invention provides a file classification method. As shown in Fig. 1, the method includes:
Step 101: use a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set.
Specifically, the machine learning algorithm learns from the files of each category in the sample training set to obtain the classification rule of each category; here the classification rule means the common features of the files of that category. Learning can be based on the format of the files. For example, if all Word documents in the sample training set belong to one category, the common feature learned for that category is that the file extension is doc or docx; if all ISO-format files belong to one category, the common feature learned is that the file extension is iso; if all txt-format files belong to one category, the common feature learned is that the file extension is txt. Learning can also be based on the content of the files. For example, if all mail in the sample training set belongs to one category, the common feature learned for that category is that the file content includes a recipient and a sender. What exactly is learned for each category can be determined according to the user's needs.
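The extraction in step 101 can be illustrated with a minimal Python sketch. It is not the patented implementation: a plain set intersection stands in for the machine learning algorithm, and the feature names (`ext:...`, `has:recipient`, `has:sender`), the helper functions, and the sample files are all hypothetical.

```python
from pathlib import PurePath


def file_features(name: str, content: str = "") -> set[str]:
    """Describe one file as a set of simple features (hypothetical scheme)."""
    # Format feature: the lower-cased file extension without the dot.
    features = {"ext:" + PurePath(name).suffix.lower().lstrip(".")}
    # Content features from the mail example: recipient and sender fields.
    lowered = content.lower()
    if "to:" in lowered:
        features.add("has:recipient")
    if "from:" in lowered:
        features.add("has:sender")
    return features


def common_features(training_set: dict[str, list[tuple[str, str]]]) -> dict[str, set[str]]:
    """For each category, keep only the features present in every file of it."""
    result = {}
    for category, files in training_set.items():
        feature_sets = [file_features(name, content) for name, content in files]
        result[category] = set.intersection(*feature_sets) if feature_sets else set()
    return result


# Hypothetical training set: category -> list of (file name, file content).
training_set = {
    "text": [("notes.txt", "hello"), ("todo.TXT", "buy milk")],
    "mail": [("a.eml", "From: x\nTo: y\nhi"), ("b.eml", "To: z\nFrom: w\nbye")],
}
shared = common_features(training_set)
print(shared)
```

A strict intersection would leave the Word category of the example without an extension feature, because doc and docx differ; a real learner would need to represent such alternatives, which this sketch does not attempt.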
The files in the sample training set include, but are not limited to, one or any combination of the following: audio, pictures, video, documents, mail, software installation packages, system files, and files for issuing distribution commands. The number of files in the sample training set exceeds a predetermined number, for example more than 10,000. A larger sample training set helps ensure that the common features extracted for each category are more accurate.
Step 102: count, for each category, the number of common features of that category that are identical to features of the file to be classified.
Specifically, the features of the file to be classified are extracted. These features are then compared one by one with the common features of each category to determine which common features of that category are identical to features of the file to be classified. This gives, for each category, the number of common features that the file to be classified matches.
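The counting in step 102 amounts to a set intersection per category. The sketch below assumes features are represented as strings in Python sets; the function name `count_matches`, the feature names, and the sample data are hypothetical.

```python
def count_matches(file_feats: set[str], shared: dict[str, set[str]]) -> dict[str, int]:
    """For each category, count common features that the file also has."""
    return {category: len(feats & file_feats) for category, feats in shared.items()}


# Hypothetical common features per category and features of one uploaded file.
shared = {
    "word": {"ext:docx"},
    "mail": {"ext:eml", "has:recipient", "has:sender"},
}
file_feats = {"ext:eml", "has:sender", "word:invoice"}
counts = count_matches(file_feats, shared)
print(counts)  # {'word': 0, 'mail': 2}
```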
The extraction of features from the file to be classified is described below. The extracted features include both the format features and the content features of the file. For example, if the file to be classified is a document, its content features include feature words. If it is an image, its content features include texture features and the gray-level co-occurrence matrix. If it is audio, its content features include sound intensity, sound intensity level, loudness, pitch, pitch period, fundamental frequency, and signal-to-noise ratio. If it is a video, its content features include the content features of audio and the content features of images, both described above and not repeated here. If it is mail, its content features include the recipient, the sender, and feature words. If it is a software installation package, its content features include the files in the package and the code of each file. If it is a system file or a file for issuing distribution commands, its content features include the code of the file.
Step 103: assign the file to be classified to a category according to the counted numbers.
Specifically, the larger the count for a category, the more similar the files of that category are to the file to be classified, and the more likely it is that the file to be classified belongs to that category.
Further, on the basis of the embodiment corresponding to Fig. 1, after step 101 the method also includes: generating a classifier from the extracted common features of the files of each category, the classifier containing those common features. The extraction of the common features and the generation of the classifier can take place on the cloud platform or outside it. If the classifier is generated outside the cloud platform, it is deployed to the cloud platform afterwards. When a user uploads a file to be classified to the cloud platform, the classifier classifies it according to the common features of each category, so the user does not have to classify it manually. This simplifies operation, improves the user experience, and addresses the difficulty a cloud platform has in classifying files uploaded in batches. The category of a file to be classified can of course still be changed according to the user's needs, which allows manual classification.
Further, on the basis of the embodiment corresponding to Fig. 1, before step 101 the method also includes:
Dividing files with the same function into one category, according to the function of each file in the sample training set.
Specifically, each file in the sample training set carries a mark that identifies the function of the file, and files with the same function carry the same mark. Files with the same mark are divided into one category, so that files with the same function end up in one category. For example, software installation packages used to install software form one category, system files used to deploy a system form one category, and files used to issue distribution commands form one category.
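Grouping the training files by their function mark can be sketched as follows. This is only an illustration: the patent does not say how the mark is stored, so the `(file name, mark)` pairs and the mark strings are assumptions.

```python
from collections import defaultdict


def group_by_function(samples: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Put files that carry the same function mark into the same category."""
    categories = defaultdict(list)
    for name, mark in samples:
        categories[mark].append(name)
    return dict(categories)


# Hypothetical (file name, function mark) pairs.
samples = [
    ("office_setup.exe", "install-software"),
    ("kernel.sys", "deploy-system"),
    ("player_setup.exe", "install-software"),
    ("push_job.cmd", "issue-distribution-command"),
]
groups = group_by_function(samples)
print(groups)
```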
Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
A support vector machine (SVM), like a neural network, is a learning mechanism, but unlike a neural network an SVM is derived from mathematical methods and optimization techniques. The support vectors are the training sample points that lie on the edge of the margin. The "machine" here is in fact an algorithm; in the field of machine learning, some algorithms are commonly regarded as machines.
The K-Means algorithm is an indirect clustering method based on a similarity measure between samples and is an unsupervised learning method. It takes k as a parameter and partitions n objects (the n files in the sample training set) into k clusters, so that similarity within a cluster is high and similarity between clusters is low. Similarity is computed with respect to the mean of the objects in a cluster, which is regarded as the cluster's center of gravity (centroid). The algorithm first selects k objects at random, each representing the centroid of one cluster. Each remaining object is assigned to the most similar cluster according to its distance to each cluster centroid. The new centroid of each cluster is then computed. This process is repeated until the criterion function converges. K-Means is a fairly typical dynamic clustering algorithm that iterates by point-by-point modification. When clustering with K-Means, common features need to be extracted, and clustering is performed according to the extracted common features.
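The K-Means loop described above can be sketched in plain Python. The sketch assumes each file has already been turned into a numeric feature vector, which the patent does not detail; the points, the function names, and the optional `initial` argument (used here only to make the example reproducible instead of choosing the k starting objects at random) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(component) / len(points) for component in zip(*points))


def k_means(points, k, initial=None, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    # Start from k objects: given explicitly, or picked at random as in the text.
    if initial is not None:
        centroids = list(initial)
    else:
        centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2, initial=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With random starting objects the result depends on the draw, and K-Means can settle on a poor partition for an unlucky start, which is why the example fixes the starting centroids.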
The Bayesian algorithm is a probability-based algorithm. Its basic idea is, for the files in the given sample training set, to compute the probability of each file occurring in each category; the file belongs to the category with the largest probability.
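One common concrete form of this idea is a naive Bayes classifier. The sketch below uses a Bernoulli naive Bayes model with Laplace smoothing over sets of features; the patent does not specify this variant, so the model choice, the function names, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (feature set, category). Returns the counts the model needs."""
    category_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, category in samples:
        category_counts[category] += 1
        for feature in features:
            feature_counts[category][feature] += 1
            vocabulary.add(feature)
    return category_counts, feature_counts, vocabulary


def classify(features, model):
    """Return the category with the highest posterior log-probability."""
    category_counts, feature_counts, vocabulary = model
    total = sum(category_counts.values())
    scores = {}
    for category, n in category_counts.items():
        log_p = math.log(n / total)  # prior probability of the category
        for feature in vocabulary:
            # Smoothed probability that a file of this category has the feature.
            p = (feature_counts[category][feature] + 1) / (n + 2)
            log_p += math.log(p if feature in features else 1 - p)
        scores[category] = log_p
    return max(scores, key=scores.get)


# Hypothetical training samples: (features of a file, its category).
samples = [
    ({"ext:eml", "has:sender"}, "mail"),
    ({"ext:eml", "has:sender", "has:recipient"}, "mail"),
    ({"ext:txt"}, "text"),
    ({"ext:txt", "word:todo"}, "text"),
]
model = train(samples)
predicted = classify({"ext:eml", "has:recipient"}, model)
print(predicted)  # mail
```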
Further, on the basis of the embodiment corresponding to Fig. 1, step 103 includes:
Assigning the file to be classified to the category whose common features yield the largest counted number.
In other words, the method counts how many common features of each category are identical to features of the file to be classified. For example, suppose the sample training set contains three categories of files and the file to be classified has three features. The 3 common features of the first category are identical to the 3 features of the file to be classified, so the count for the first category is 3. Of the 5 common features of the second category, only 1 is identical to a feature of the file to be classified, so the count for the second category is 1. None of the 4 common features of the third category is identical to a feature of the file to be classified, so the count for the third category is 0. The largest count is 3, so the file to be classified is assigned to the first category.
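The worked example with three categories can be reproduced in a few lines of Python. The feature names `f1` to `f3`, `g1` to `g4`, and `h1` to `h4` are placeholders invented for the sketch; only the counts 3, 1, and 0 come from the text.

```python
def classify_by_count(file_feats, shared):
    """Pick the category whose common features match the file most often."""
    counts = {category: len(feats & file_feats) for category, feats in shared.items()}
    best = max(counts, key=counts.get)
    return best, counts


file_feats = {"f1", "f2", "f3"}
shared = {
    "category 1": {"f1", "f2", "f3"},              # 3 common features, all match
    "category 2": {"f1", "g1", "g2", "g3", "g4"},  # 5 common features, 1 matches
    "category 3": {"h1", "h2", "h3", "h4"},        # 4 common features, none match
}
best, counts = classify_by_count(file_feats, shared)
print(best, counts)
```

The text does not say what happens when two categories tie for the largest count; in this sketch `max` simply returns the first such category in dictionary order, which is an arbitrary choice.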
The file classification method provided by the embodiment of the invention uses a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set; counts, for each category, the number of common features that are identical to features of the file to be classified; and assigns the file to be classified to a category according to the counted numbers. The common features of the files of each category are learned by the machine learning algorithm and then used to classify the file automatically. This avoids spending human resources on file classification and effectively improves classification efficiency, particularly when the number of files to be classified is very large.
An embodiment of the invention provides a file classification device. As shown in Fig. 2, the file classification device 2 includes:
An extraction module 21, for using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set.
A statistics module 22, for counting, for each category, the number of common features of that category that are identical to features of the file to be classified.
A first division module 23, for assigning the file to be classified to a category according to the counted numbers.
Further, on the basis of the embodiment corresponding to Fig. 2, the invention provides another file classification device. As shown in Fig. 3, the file classification device 2 also includes:
A second division module 24, for dividing files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
Further, on the basis of the embodiment corresponding to Fig. 2, the first division module 23 is specifically used for:
Assigning the file to be classified to the category whose common features yield the largest counted number.
In practical applications, the extraction module 21, the statistics module 22, the first division module 23, and the second division module 24 can be implemented by a CPU, a microprocessor (Micro Processor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), or a field programmable gate array (Field Programmable Gate Array, FPGA) in the file classification device 2.
The file classification device provided by the embodiment of the invention uses a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set; counts, for each category, the number of common features that are identical to features of the file to be classified; and assigns the file to be classified to a category according to the counted numbers. The common features of the files of each category are learned by the machine learning algorithm and then used to classify the file automatically. This avoids spending human resources on file classification and effectively improves classification efficiency, particularly when the number of files to be classified is very large.
An embodiment of the invention provides another file classification device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the following steps are implemented:
Using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set;
Counting, for each category, the number of common features of that category that are identical to features of the file to be classified;
Assigning the file to be classified to a category according to the counted numbers.
Further, the steps implemented when the processor executes the computer program also include:
Dividing files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
Further, the steps implemented when the processor executes the computer program specifically include:
Assigning the file to be classified to the category whose common features yield the largest counted number.
Although the embodiments of the invention are disclosed above, the content described is only an implementation adopted to make the invention easier to understand and is not intended to limit the invention. Any person skilled in the art to which the invention belongs may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention is still defined by the appended claims.
Claims (8)
- 1. A file classification method, characterized by including: using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set; counting, for each category, the number of common features of that category that are identical to features of a file to be classified; and assigning the file to be classified to a category according to the counted numbers.
- 2. The file classification method according to claim 1, characterized in that, before the common features of the files of each category in the sample training set are extracted, the method also includes: dividing files with the same function into one category, according to the function of each file in the sample training set.
- 3. The file classification method according to claim 1 or 2, characterized in that the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
- 4. The file classification method according to claim 1 or 2, characterized in that assigning the file to be classified to a category according to the counted numbers includes: assigning the file to be classified to the category whose common features yield the largest counted number.
- 5. A file classification device, characterized by including: an extraction module, for using a machine learning algorithm to extract, for each category, the common features of the files of that category in a sample training set; a statistics module, for counting, for each category, the number of common features of that category that are identical to features of a file to be classified; and a first division module, for assigning the file to be classified to a category according to the counted numbers.
- 6. The file classification device according to claim 5, characterized by also including: a second division module, for dividing files with the same function into one category, according to the function of each file in the sample training set.
- 7. The file classification device according to claim 5 or 6, characterized in that the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
- 8. The file classification device according to claim 5 or 6, characterized in that the first division module is specifically used for: assigning the file to be classified to the category whose common features yield the largest counted number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710697846.9A CN107391751A (en) | 2017-08-15 | 2017-08-15 | A kind of file classifying method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710697846.9A CN107391751A (en) | 2017-08-15 | 2017-08-15 | A kind of file classifying method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107391751A (en) | 2017-11-24 |
Family
ID=60355974
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710697846.9A Pending CN107391751A (en) | 2017-08-15 | 2017-08-15 | A kind of file classifying method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391751A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1378158A (en) * | 2001-03-29 | 2002-11-06 | 国际商业机器公司 | File classifying management system and method for operation system |
CN101923561A (en) * | 2010-05-24 | 2010-12-22 | 中国科学技术信息研究所 | Automatic document classifying method |
CN103064855A (en) * | 2011-10-21 | 2013-04-24 | 铭传大学 | Method and system for classifying file |
US20130268535A1 (en) * | 2011-09-15 | 2013-10-10 | Kabushiki Kaisha Toshiba | Apparatus and method for classifying document, and computer program product |
CN106233286A (en) * | 2014-04-24 | 2016-12-14 | 谷歌公司 | Files passe is carried out the system and method for priorization |
CN106897454A (en) * | 2017-02-15 | 2017-06-27 | 北京时间股份有限公司 | A kind of file classifying method and device |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190001A (en) * | 2018-09-19 | 2019-01-11 | 广东电网有限责任公司 | office document management method |
CN109190001B (en) * | 2018-09-19 | 2022-02-11 | 广东电网有限责任公司 | Office file management method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20171124 |