CN107391751A - A kind of file classifying method and device - Google Patents


Info

Publication number
CN107391751A
Authority
CN
China
Prior art keywords
file
sorted
classification
categories
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710697846.9A
Other languages
Chinese (zh)
Inventor
杨瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710697846.9A
Publication of CN107391751A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a file classification method and device. A machine learning algorithm is used to extract the common features of the files of each category in a sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. According to the embodiments of the present invention, files are classified automatically, which avoids occupying human resources for file classification and improves classification efficiency.

Description

File classification method and device
Technical field
The present invention relates to, but is not limited to, data processing technology, and in particular to a file classification method and device.
Background
Cloud platforms are now widely used, and users can upload files to them. When uploading, however, the user has to classify the uploaded files manually, which is tedious. With the arrival of the big data era in particular, the number of uploaded files is very large: manual classification consumes substantial human resources and is inefficient.
Summary of the invention
To solve the above technical problem, the present invention provides a file classification method and device that classify files automatically and avoid occupying human resources for file classification.
To achieve the object of the invention, the present invention provides a file classification method, including:
extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
counting, for each category, the number of common features that are identical to features of the file to be classified;
determining the category of the file to be classified according to the counted numbers.
Further, before extracting the common features of the files of each category in the sample training set, the method also includes:
grouping files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
Further, determining the category of the file to be classified according to the counted numbers includes:
assigning the file to be classified to the category whose common features yield the largest counted number.
The present invention also provides a file classification device, including:
an extraction module, for extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
a statistics module, for counting, for each category, the number of common features that are identical to features of the file to be classified;
a first division module, for determining the category of the file to be classified according to the counted numbers.
Further, the device also includes:
a second division module, for grouping files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
Further, the first division module is specifically used for:
assigning the file to be classified to the category whose common features yield the largest counted number.
The present invention at least includes: extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set; counting, for each category, the number of common features that are identical to features of the file to be classified; and determining the category of the file to be classified according to the counted numbers. According to the embodiments of the present invention, the common features of the files of each category are learned by a machine learning algorithm and then used to classify files automatically. This avoids occupying human resources for file classification and, particularly when the number of files to be classified is very large, effectively improves classification efficiency.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the specification or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the specification, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution and do not limit it.
Fig. 1 is a schematic flowchart of a file classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a file classification device provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of another file classification device provided by an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with one another.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
An embodiment of the present invention provides a file classification method. As shown in Fig. 1, the method includes:
Step 101: extract, using a machine learning algorithm, the common features of the files of each category in a sample training set.
Specifically, a machine learning algorithm is used to learn from the files of each category in the sample training set in order to learn the classification rule of each category, where the classification rule is the set of common features of the files of that category. Learning can be based on the format of the files. For example, if all Word documents in the sample training set belong to the same category, the common feature learned for that category is that the file extension is doc or docx; if all ISO-format files belong to the same category, the common feature learned is that the file extension is iso; if all txt-format files belong to the same category, the common feature learned is that the file extension is txt. Learning can also be based on the content of the files. For example, if all mails in the sample training set belong to the same category, the common feature learned for that category is that the file content includes a recipient and a sender. What exactly is learned for each category can be determined according to the user's needs.
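The learning step can be illustrated with a minimal Python sketch. It is not the algorithm named in the patent (SVM, K-Means, or Bayesian): it is a simple frequency threshold used as a stand-in, in which a feature counts as "common" when it appears in at least `min_share` of a category's files. The feature strings (`ext:...`, `has:...`), the function name `common_features`, and the `min_share` parameter are all assumptions made for illustration.

```python
from collections import Counter

def common_features(training_set, min_share=1.0):
    """Map each category to the features found in at least min_share of its files.

    training_set: {category: [set of feature strings, one set per file]}
    (hypothetical representation; the source does not prescribe one).
    """
    result = {}
    for category, files in training_set.items():
        # Count in how many files of this category each feature occurs.
        counts = Counter(f for feats in files for f in feats)
        result[category] = {f for f, c in counts.items()
                            if c >= min_share * len(files)}
    return result

training = {
    "mail": [{"ext:eml", "has:sender", "has:recipient"},
             {"ext:eml", "has:sender", "has:recipient", "has:cc"}],
    "text": [{"ext:txt"}, {"ext:txt"}],
}
shared = common_features(training)
print(shared)
```

With `min_share=1.0` a feature must occur in every file of the category, so `has:cc` is dropped for mail. A category such as Word documents, where the extension is either doc or docx, would need a lower threshold or a combined feature.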
The files in the sample training set include, but are not limited to, one or any combination of the following: audio, pictures, video, documents, mail, software installation packages, system files, and files used to issue distribution commands. The number of files in the sample training set exceeds a predetermined number, for example more than 10000. A larger sample training set helps ensure that the common features extracted for each category are more accurate.
Step 102: for each category, count the number of common features that are identical to features of the file to be classified.
Specifically, the features of the file to be classified are extracted. These features are then compared one by one with the common features of each category to determine which common features of each category are identical to features of the file to be classified, and thereby to count, for each category, the number of identical features.
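The one-by-one comparison in step 102 can be sketched as follows. This is only an illustration under the assumption that features are represented as strings and that "identical" means string equality; the function name and the sample data are hypothetical.

```python
def count_matching_features(file_features, common_features):
    """For each category, count the common features that the file also has."""
    return {
        category: sum(1 for feature in feats if feature in file_features)
        for category, feats in common_features.items()
    }

# Hypothetical common features learned in step 101.
learned = {
    "mail": {"ext:eml", "has:sender", "has:recipient"},
    "text": {"ext:txt"},
}
counts = count_matching_features({"ext:eml", "has:sender"}, learned)
print(counts)
```

Each common feature is checked individually against the file's features, which mirrors the comparison described above; with set-valued inputs the same count could be obtained from the size of the set intersection.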
The extraction of features from the file to be classified is described below. The extracted features include not only the format features of the file but also its content features. For example, if the file to be classified is a document, its content features include feature words. If it is an image, its content features include texture features and the gray-level co-occurrence matrix. If it is audio, its content features include sound intensity, sound intensity level, loudness, pitch, pitch period, fundamental frequency, and signal-to-noise ratio. If it is video, its content features include the content features of audio and the content features of images, both described above and not repeated here. If it is a mail, its content features include the recipient, the sender, and feature words. If it is a software installation package, its content features include the files in the package and the code of each file. If it is a system file or a file used to issue distribution commands, its content features include the code of the file.
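For the document case, feature extraction might look like the sketch below, which combines one format feature (the file extension) with a few content features (frequent words standing in for "feature words"). The source does not say how feature words are chosen, so the frequency-based selection, the `top_k` and `min_len` parameters, and the feature naming are assumptions.

```python
import re
from collections import Counter
from pathlib import PurePath

def extract_features(name, text="", top_k=3, min_len=4):
    """Return format and content features of one document as a set of strings."""
    features = set()
    # Format feature: the lower-cased file extension without the dot.
    ext = PurePath(name).suffix.lower().lstrip(".")
    if ext:
        features.add("ext:" + ext)
    # Content features: the top_k most frequent words of at least min_len letters.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]
    for word, _ in Counter(words).most_common(top_k):
        features.add("word:" + word)
    return features

doc_features = extract_features(
    "bill.docx", "Invoice total, invoice due: invoice total", top_k=2)
print(doc_features)
```

Images, audio, and the other file types listed above would need their own extractors (for example a gray-level co-occurrence matrix for images), which are outside the scope of this sketch.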
Step 103: determine the category of the file to be classified according to the counted numbers.
Specifically, the larger the counted number for a category, the more similar the files of that category are to the file to be classified, and the more likely it is that the file to be classified belongs to that category.
Further, on the basis of the embodiment corresponding to Fig. 1, after step 101 the method also includes: generating a classifier from the extracted common features of the files of each category, the classifier containing the common features of the files of each category. The extraction of the common features and the generation of the classifier can be performed on the cloud platform or outside it. If they are performed outside the cloud platform, the generated classifier is then deployed to the cloud platform. When a user uploads a file to be classified to the cloud platform, the classifier classifies it according to the common features of each category, so the user does not need to classify it manually. This simplifies user operation, improves the user experience, and solves the difficulty of classifying files uploaded in batches to the cloud platform. Of course, the category of a file to be classified can also be changed according to the user's needs, so that manual classification remains possible.
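One way to picture a classifier that is generated outside the cloud platform and then applied on it is to serialize the common features and load them again on the platform side. The sketch below uses JSON purely as an assumed exchange format; the function names and the `user_choice` override parameter are hypothetical and only illustrate the manual reclassification mentioned above.

```python
import json

def export_classifier(common_features):
    """Serialize {category: set of features} to JSON (sets become sorted lists)."""
    return json.dumps({cat: sorted(feats) for cat, feats in common_features.items()})

def load_classifier(payload):
    """Rebuild the {category: set of features} mapping on the platform side."""
    return {cat: set(feats) for cat, feats in json.loads(payload).items()}

def classify_upload(classifier, file_features, user_choice=None):
    """Return the user's category if given, else the category with most matches."""
    if user_choice is not None:
        return user_choice
    return max(classifier, key=lambda cat: len(classifier[cat] & file_features))

payload = export_classifier({"mail": {"ext:eml", "has:sender"}, "text": {"ext:txt"}})
platform_classifier = load_classifier(payload)
auto = classify_upload(platform_classifier, {"ext:txt"})
manual = classify_upload(platform_classifier, {"ext:txt"}, user_choice="mail")
print(auto, manual)
```

Keeping the classifier as plain data makes it easy to generate in one place and apply in another, but any storage or transport mechanism would serve the same purpose.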
Further, on the basis of the embodiment corresponding to Fig. 1, before step 101 the method also includes:
grouping files with the same function into one category, according to the function of each file in the sample training set.
Specifically, each file in the sample training set carries a mark that identifies the function of the file, and files with the same function carry the same mark. Files with the same mark are grouped into one category, so that files with the same function end up in one category. For example, software installation packages used to install software are grouped into one category, system files used to deploy a system are grouped into one category, and files used to issue distribution commands are grouped into one category.
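Grouping by mark can be sketched in a few lines. The file names and mark values below are invented for illustration; the source only states that files with the same function carry the same mark.

```python
from collections import defaultdict

def group_by_function(marked_files):
    """Group (file name, function mark) pairs into {mark: [file names]}."""
    categories = defaultdict(list)
    for name, mark in marked_files:
        categories[mark].append(name)
    return dict(categories)

# Hypothetical marked training files.
sample = [
    ("setup.exe", "install"),
    ("kernel.sys", "system"),
    ("deploy.cmd", "command"),
    ("app.msi", "install"),
]
groups = group_by_function(sample)
print(groups)
```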
Further, the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
A support vector machine (SVM), like a neural network, is a learning mechanism, but unlike a neural network it is built on mathematical methods and optimization techniques. The support vectors are the training sample points that lie on the boundary of the margin. The word "machine" here actually refers to an algorithm; in the field of machine learning, algorithms are often regarded as machines.
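The optimization view of an SVM can be made concrete with the standard hard-margin formulation for two linearly separable classes. The symbols are not from the source and are assumptions for illustration: $x_i$ is the feature vector of training file $i$, $y_i \in \{-1, +1\}$ is its class label, and $w$, $b$ define the separating hyperplane.

```latex
\min_{w,\,b}\ \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

The support vectors are exactly the training points for which the constraint holds with equality, $y_i (w^{\top} x_i + b) = 1$; they lie on the margin boundary, and the margin width is $2 / \lVert w \rVert$.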
The K-Means algorithm is an indirect clustering method based on a similarity measure between samples and is an unsupervised learning method. The algorithm takes k as a parameter and divides n objects (the n files in the sample training set) into k clusters, so that similarity within a cluster is high and similarity between clusters is low. Similarity is computed with respect to the mean of the objects in a cluster, which is regarded as the cluster's center of gravity. The algorithm first randomly selects k objects, each representing the centroid of a cluster. Each remaining object is assigned to the most similar cluster according to its distance to each cluster centroid. The new centroid of each cluster is then computed. This process is repeated until the criterion function converges. K-Means is a typical dynamic clustering algorithm that iterates by point-by-point modification. When clustering with K-Means, common features need to be extracted, and clustering is performed according to the extracted common features.
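The iteration described above can be sketched in plain Python. The sketch assumes each file has already been turned into a numeric feature vector (a tuple of floats), uses squared Euclidean distance as the similarity measure, and stops when the centroids no longer change; the `seed` and `iterations` parameters and the sample points are assumptions for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster tuples of floats into k clusters; return (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly chosen objects as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

vectors = [(1.0, 1.0), (1.5, 1.5), (2.0, 2.0),
           (10.0, 10.0), (10.5, 10.5), (11.0, 11.0)]
centers, groups = kmeans(vectors, k=2)
print(sorted(centers))
```

Because the starting centroids are random, different seeds can label the clusters in a different order, and on less well-separated data the result may be a local optimum.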
A Bayesian algorithm is a probability-based algorithm. Its basic idea is, given the files in the sample training set, to compute the probability that a file belongs to each category; the file is assigned to the category with the highest probability.
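A small Bernoulli naive Bayes classifier over feature sets illustrates this idea. The source does not specify which Bayesian model is used, so the naive independence assumption, the Laplace smoothing, and all names below are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (category, set of features). Returns the counts needed to predict."""
    cat_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for category, feats in samples:
        cat_counts[category] += 1
        feat_counts[category].update(feats)  # each feature counted once per file
        vocab |= feats
    return cat_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Return the category with the highest log posterior probability."""
    cat_counts, feat_counts, vocab = model
    total = sum(cat_counts.values())
    best, best_score = None, -math.inf
    for category, n in cat_counts.items():
        score = math.log(n / total)  # log prior of the category
        for f in vocab:
            # Laplace-smoothed probability that a file of this category has f.
            p = (feat_counts[category][f] + 1) / (n + 2)
            score += math.log(p if f in feats else 1 - p)
        if score > best_score:
            best, best_score = category, score
    return best

model = train_nb([
    ("mail", {"ext:eml", "has:sender"}),
    ("mail", {"ext:eml", "has:sender", "has:cc"}),
    ("text", {"ext:txt"}),
    ("text", {"ext:txt"}),
])
predicted = predict_nb(model, {"ext:eml", "has:sender"})
print(predicted)
```

Working with log probabilities avoids numerical underflow when many features are multiplied together.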
Further, on the basis of the embodiment corresponding to Fig. 1, step 103 includes:
assigning the file to be classified to the category whose common features yield the largest counted number.
In other words, the method counts how many of the common features of each category are identical to features of the file to be classified. For example, suppose the sample training set contains files of three categories and the file to be classified has three features. The 3 common features of the first category are identical to the 3 features of the file to be classified, so the count for the first category is 3. Of the 5 common features of the second category, only 1 is identical to a feature of the file to be classified, so the count for the second category is 1. None of the 4 common features of the third category is identical to a feature of the file to be classified, so the count for the third category is 0. The largest count is 3, so the file to be classified is assigned to the first category.
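The worked example can be reproduced with a short sketch. The single-letter feature names are placeholders chosen so that the three categories have 3, 5, and 4 common features and match 3, 1, and 0 of the file's features; how ties are broken is not stated in the source, so the behavior noted in the comment is an assumption of this sketch.

```python
def classify(file_features, common_features):
    """Return (best category, per-category match counts)."""
    counts = {cat: len(feats & file_features)
              for cat, feats in common_features.items()}
    # On a tie, max() keeps the first category in dictionary order (assumption).
    best = max(counts, key=counts.get)
    return best, counts

common = {
    "category_1": {"a", "b", "c"},
    "category_2": {"a", "d", "e", "f", "g"},
    "category_3": {"h", "i", "j", "k"},
}
best, counts = classify({"a", "b", "c"}, common)
print(best, counts)
```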
In the file classification method provided by the embodiment of the present invention, a machine learning algorithm is used to extract the common features of the files of each category in a sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. The common features of the files of each category are thus learned by a machine learning algorithm and then used to classify files automatically. This avoids occupying human resources for file classification and, particularly when the number of files to be classified is very large, effectively improves classification efficiency.
An embodiment of the present invention provides a file classification device. As shown in Fig. 2, the file classification device 2 includes:
an extraction module 21, for extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
a statistics module 22, for counting, for each category, the number of common features that are identical to features of the file to be classified;
a first division module 23, for determining the category of the file to be classified according to the counted numbers.
Further, on the basis of the embodiment corresponding to Fig. 2, the present invention provides another file classification device. As shown in Fig. 3, the file classification device 2 also includes:
a second division module 24, for grouping files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
Further, on the basis of the embodiment corresponding to Fig. 2, the first division module 23 is specifically used for:
assigning the file to be classified to the category whose common features yield the largest counted number.
In practical applications, the extraction module 21, the statistics module 22, the first division module 23, and the second division module 24 can each be implemented by a CPU, a microprocessor (Micro Processor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like located in the file classification device 2.
In the file classification device provided by the embodiment of the present invention, a machine learning algorithm is used to extract the common features of the files of each category in a sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. The common features of the files of each category are thus learned by a machine learning algorithm and then used to classify files automatically. This avoids occupying human resources for file classification and, particularly when the number of files to be classified is very large, effectively improves classification efficiency.
An embodiment of the present invention provides another file classification device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The steps implemented when the processor executes the computer program include:
extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
counting, for each category, the number of common features that are identical to features of the file to be classified;
determining the category of the file to be classified according to the counted numbers.
Further, the steps implemented when the processor executes the computer program also include:
grouping files with the same function into one category, according to the function of each file in the sample training set.
Further, the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
Further, the steps implemented when the processor executes the computer program specifically include:
assigning the file to be classified to the category whose common features yield the largest counted number.
Although the embodiments of the present invention are disclosed as above, the content described is only an embodiment adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the present invention belongs may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (8)

  1. A file classification method, characterized by including:
    extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
    counting, for each category, the number of common features that are identical to features of the file to be classified;
    determining the category of the file to be classified according to the counted numbers.
  2. The file classification method according to claim 1, characterized in that, before extracting the common features of the files of each category in the sample training set, the method also includes:
    grouping files with the same function into one category, according to the function of each file in the sample training set.
  3. The file classification method according to claim 1 or 2, characterized in that
    the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
  4. The file classification method according to claim 1 or 2, characterized in that determining the category of the file to be classified according to the counted numbers includes:
    assigning the file to be classified to the category whose common features yield the largest counted number.
  5. A file classification device, characterized by including:
    an extraction module, for extracting, using a machine learning algorithm, the common features of the files of each category in a sample training set;
    a statistics module, for counting, for each category, the number of common features that are identical to features of the file to be classified;
    a first division module, for determining the category of the file to be classified according to the counted numbers.
  6. The file classification device according to claim 5, characterized by also including:
    a second division module, for grouping files with the same function into one category, according to the function of each file in the sample training set.
  7. The file classification device according to claim 5 or 6, characterized in that
    the machine learning algorithm is a support vector machine (SVM), the K-Means algorithm, or a Bayesian algorithm.
  8. The file classification device according to claim 5 or 6, characterized in that the first division module is specifically used for:
    assigning the file to be classified to the category whose common features yield the largest counted number.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710697846.9A CN107391751A (en) 2017-08-15 2017-08-15 A kind of file classifying method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710697846.9A CN107391751A (en) 2017-08-15 2017-08-15 A kind of file classifying method and device

Publications (1)

Publication Number Publication Date
CN107391751A true CN107391751A (en) 2017-11-24

Family

ID=60355974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710697846.9A Pending CN107391751A (en) 2017-08-15 2017-08-15 A kind of file classifying method and device

Country Status (1)

Country Link
CN (1) CN107391751A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1378158A (en) * 2001-03-29 2002-11-06 国际商业机器公司 File classifying management system and method for operation system
CN101923561A (en) * 2010-05-24 2010-12-22 中国科学技术信息研究所 Automatic document classifying method
US20130268535A1 (en) * 2011-09-15 2013-10-10 Kabushiki Kaisha Toshiba Apparatus and method for classifying document, and computer program product
CN103064855A (en) * 2011-10-21 2013-04-24 铭传大学 Method and system for classifying file
CN106233286A (en) * 2014-04-24 2016-12-14 谷歌公司 Files passe is carried out the system and method for priorization
CN106897454A (en) * 2017-02-15 2017-06-27 北京时间股份有限公司 A kind of file classifying method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190001A (en) * 2018-09-19 2019-01-11 广东电网有限责任公司 office document management method
CN109190001B (en) * 2018-09-19 2022-02-11 广东电网有限责任公司 Office file management method

Similar Documents

Publication Publication Date Title
Kalash et al. Malware classification with deep convolutional neural networks
Li et al. Multi-window based ensemble learning for classification of imbalanced streaming data
US9923912B2 (en) Learning detector of malicious network traffic from weak labels
CN107636665A (en) Cascade classifier for computer security applications program
CN106156163B (en) Text classification method and device
CN108416213A (en) A kind of malicious code sorting technique based on image texture fingerprint
CN112733936A (en) Recyclable garbage classification method based on image recognition
WO2021189830A1 (en) Sample data optimization method, apparatus and device, and storage medium
Purbolaksono et al. Implementation of mutual information and bayes theorem for classification microarray data
CN108241662A (en) The optimization method and device of data mark
CN103902706B (en) Method for classifying and predicting big data on basis of SVM (support vector machine)
CN110163206B (en) License plate recognition method, system, storage medium and device
CN104966109A (en) Medical laboratory report image classification method and apparatus
CN107886130A (en) A kind of kNN rapid classification methods based on cluster and Similarity-Weighted
EP3588352A1 (en) Byte n-gram embedding model
Liu et al. Less is more: Culling the training set to improve robustness of deep neural networks
CN113962199A (en) Text recognition method, text recognition device, text recognition equipment, storage medium and program product
CN106844596A (en) One kind is based on improved SVM Chinese Text Categorizations
CN112884061A (en) Malicious software family classification method based on parameter optimization meta-learning
CN107391751A (en) A kind of file classifying method and device
Anyaso-Samuel et al. Metagenomic geolocation prediction using an adaptive ensemble classifier
CN109101487A (en) Conversational character differentiating method, device, terminal device and storage medium
US8645290B2 (en) Apparatus and method for improved classifier training
CN108108371A (en) A kind of file classification method and device
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171124