CN107391751A

CN107391751A - File classification method and device

Info

Publication number: CN107391751A
Application number: CN201710697846.9A
Authority: CN
Inventors: 杨瑞
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-08-15
Filing date: 2017-08-15
Publication date: 2017-11-24

Abstract

The invention discloses a file classification method and device. A machine learning algorithm is used to extract the common features of the files of each category in a sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. The embodiments of the present invention thus classify files automatically, avoid spending human resources on file classification, and improve the efficiency of file classification.

Description

File classification method and device

Technical field

The present invention relates to, but is not limited to, data processing technology, and in particular to a file classification method and device.

Background

Cloud platforms are now widely used, and users can upload files to them. When a file is uploaded, however, the user has to classify it manually, which is tedious. With the arrival of the big data era in particular, the number of uploaded files is very large, so manual classification not only consumes substantial human resources but is also inefficient.

Summary of the invention

To solve the above technical problem, the present invention provides a file classification method and device that classify files automatically and avoid spending human resources on file classification.

To achieve the object of the invention, the present invention provides a file classification method, including:

using a machine learning algorithm to extract the common features of the files of each category in a sample training set;

counting, for each category, the number of common features that are identical to features of the file to be classified;

determining the category of the file to be classified according to the counted numbers.

Further, before the common features of the files of each category in the sample training set are extracted, the method also includes:

dividing the files that have the same function into one category, according to the function of each file in the sample training set.

Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.

Further, determining the category of the file to be classified according to the counted numbers includes:

assigning the file to be classified to the category whose common features yield the largest counted number.

The present invention also provides a file classification device, including:

an extraction module, configured to use a machine learning algorithm to extract the common features of the files of each category in a sample training set;

a counting module, configured to count, for each category, the number of common features that are identical to features of the file to be classified;

a first dividing module, configured to determine the category of the file to be classified according to the counted numbers.

Further, the device also includes:

a second dividing module, configured to divide the files that have the same function into one category, according to the function of each file in the sample training set.

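Further, the first dividing module is specifically configured to assign the file to be classified to the category whose common features yield the largest counted number.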

The present invention at least includes: using a machine learning algorithm to extract the common features of the files of each category in a sample training set; counting, for each category, the number of common features that are identical to features of the file to be classified; and determining the category of the file to be classified according to the counted numbers. In the embodiments of the present invention, the common features of the files of each category are learned by a machine learning algorithm and are then used to classify the file to be classified automatically. This avoids spending human resources on file classification and, particularly when the number of files to be classified is very large, effectively improves the efficiency of file classification.

Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.

Brief description of the drawings

The accompanying drawings provide a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the invention and do not limit it.

Fig. 1 is a schematic flowchart of a file classification method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a file classification device provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of another file classification device provided by an embodiment of the present invention.

Embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments can be combined with one another.

The steps illustrated in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be performed in an order different from the one given here.

An embodiment of the present invention provides a file classification method. As shown in Fig. 1, the method includes:

Step 101: use a machine learning algorithm to extract the common features of the files of each category in the sample training set.

Specifically, a machine learning algorithm is used to learn from the files of each category in the sample training set in order to obtain the classification rule of each category, where the classification rule refers to the common features of the files of that category. Learning can be based on the format of the files of each category. For example, if all Word documents in the sample training set belong to the same category, the common feature learned for that category is that the file suffix is doc or docx; if all files in ISO format belong to the same category, the common feature learned is that the file suffix is iso; if all files in txt format belong to the same category, the common feature learned is that the file suffix is txt. Learning can also be based on the content of the files of each category. For example, if all mails in the sample training set belong to the same category, the common feature learned for that category is that the file content includes a recipient and a sender. What exactly is learned for each category can be determined according to the user's needs.
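As a minimal sketch of Step 101, the common features of a category can be approximated by intersecting the feature sets of the training files in that category. Plain set intersection stands in here for the machine learning algorithm; the function name `extract_common_features`, the `suffix:` and `has:` feature strings, and the sample data are assumptions made for illustration and do not come from the embodiment.

```python
from collections import defaultdict


def extract_common_features(training_set):
    """Return {category: set of features shared by every file in that category}.

    training_set is an iterable of (category, feature_set) pairs.
    Set intersection is a simple stand-in for the learning step.
    """
    grouped = defaultdict(list)
    for category, features in training_set:
        grouped[category].append(set(features))
    return {
        category: set.intersection(*feature_sets)
        for category, feature_sets in grouped.items()
    }


# Hypothetical training samples: (category, features of one file)
training_set = [
    ("iso", {"suffix:iso", "size:large"}),
    ("iso", {"suffix:iso"}),
    ("mail", {"has:recipient", "has:sender", "suffix:eml"}),
    ("mail", {"has:recipient", "has:sender", "suffix:msg"}),
]

common = extract_common_features(training_set)
print(common["iso"])
print(common["mail"])
```

A strict intersection would not capture an either-or rule such as "suffix is doc or docx"; a real learner would need a richer rule representation for that case.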

The files in the sample training set include, but are not limited to, one of the following or any combination of them: audio, pictures, video, documents, mail, software installation packages, system files, and files for issuing distribution commands. The number of files in the sample training set exceeds a predetermined number, for example more than 10000. A large number of files in the sample training set helps ensure that the common features extracted for each category are accurate.

Step 102: count, for each category, the number of common features that are identical to features of the file to be classified.

Specifically, the features of the file to be classified are extracted. The features of the file to be classified are then compared one by one with the common features of the files of each category, to determine which common features of each category are identical to features of the file to be classified. In this way, the number of common features identical to features of the file to be classified is counted separately for each category.
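The comparison in Step 102 can be sketched as a set intersection per category. This is an illustrative sketch only: the function name `count_matching_features` and the feature strings are hypothetical, and features are assumed to be represented as hashable values so that "identical" means equal.

```python
def count_matching_features(common_features, file_features):
    """Count, per category, the common features also found in the file.

    common_features: {category: set of common features}
    file_features:   features extracted from the file to be classified
    """
    file_features = set(file_features)
    return {
        category: len(features & file_features)
        for category, features in common_features.items()
    }


# Hypothetical common features and an uploaded mail file
common_features = {
    "word": {"suffix:docx"},
    "mail": {"has:recipient", "has:sender"},
}
file_features = {"suffix:eml", "has:recipient", "has:sender"}

counts = count_matching_features(common_features, file_features)
print(counts)
```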

The extraction of the features of the file to be classified is described in detail below. The extracted features include not only the format features of the file to be classified but also its content features. For example, if the file to be classified is a document, its content features include feature words; if it is an image, its content features include texture features and the gray-level co-occurrence matrix; if it is audio, its content features include sound intensity, sound intensity level, loudness, pitch, pitch period, fundamental frequency, and signal-to-noise ratio; if it is video, its content features include the content features of audio and of images, which have been described above and are not repeated here; if it is a mail, its content features include the recipient, the sender, and feature words; if it is a software installation package, its content features include the files in the package and the code of each file; if it is a system file or a file for issuing distribution commands, its content features include the code of the file.

Step 103: determine the category of the file to be classified according to the counted numbers.

Specifically, the larger the number counted for a category, the more similar the files of that category are to the file to be classified, and the more likely the file to be classified is to belong to that category.
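The sketch below illustrates feature extraction for text-like files only: the suffix as a format feature, and mail header fields and longer words as content features. The function name `extract_features`, the feature string format, the `To:`/`From:` header convention, and the five-letter threshold for feature words are all assumptions for illustration; image, audio, and video features would require dedicated signal processing that is not shown.

```python
import re
from pathlib import PurePath


def extract_features(filename, text=""):
    """Return a set of format and content features for one file."""
    features = set()

    # Format feature: lower-cased suffix without the leading dot
    suffix = PurePath(filename).suffix.lower().lstrip(".")
    if suffix:
        features.add(f"suffix:{suffix}")

    # Content features for mail: presence of recipient and sender headers
    if re.search(r"^To:", text, flags=re.MULTILINE):
        features.add("has:recipient")
    if re.search(r"^From:", text, flags=re.MULTILINE):
        features.add("has:sender")

    # Content features for documents: words of five or more letters,
    # used here as a crude stand-in for feature words
    for word in re.findall(r"[a-z]{5,}", text.lower()):
        features.add(f"word:{word}")

    return features


feats = extract_features(
    "note.TXT",
    "From: a@example.com\nTo: b@example.com\n\nquarterly report",
)
print(sorted(feats))
```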

Further, on the basis of the embodiment corresponding to Fig. 1, after Step 101 the method also includes: generating a classifier according to the extracted common features of the files of each category, where the classifier includes the common features of the files of each category. The extraction of the common features from the sample training set and the generation of the classifier can take place either in the cloud platform or outside it. If the classifier is generated outside the cloud platform, it is applied to the cloud platform after it has been generated. When a user uploads a file to be classified to the cloud platform, the classifier classifies the file according to the common features of the files of each category, so the user does not need to classify it manually. This simplifies the user's operation, improves the user experience, and solves the problem that files uploaded to a cloud platform in batches are hard to classify. Of course, the category of the file to be classified can also be changed according to the user's needs, so that manual classification remains possible.
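One way to picture a classifier that is generated outside the cloud platform and then applied inside it is to serialize the common features and load them again on the platform side. The sketch below uses JSON for this purpose; the function names `export_classifier` and `import_classifier` and the choice of JSON are assumptions for illustration, since the embodiment does not specify a transfer format.

```python
import json


def export_classifier(common_features):
    """Serialize {category: set of features} to a JSON string."""
    return json.dumps(
        {category: sorted(features) for category, features in common_features.items()},
        sort_keys=True,
    )


def import_classifier(text):
    """Rebuild {category: set of features} from the JSON string."""
    return {category: set(features) for category, features in json.loads(text).items()}


# Hypothetical classifier built outside the cloud platform
common_features = {"iso": {"suffix:iso"}, "mail": {"has:recipient", "has:sender"}}

payload = export_classifier(common_features)  # transferred to the cloud platform
restored = import_classifier(payload)         # loaded inside the cloud platform
print(restored == common_features)
```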

Further, on the basis of the embodiment corresponding to Fig. 1, before Step 101 the method also includes:

dividing the files that have the same function into one category, according to the function of each file in the sample training set.

Specifically, each file in the sample training set carries a label that identifies the function of the file. Files with the same function carry the same label, and files with the same label are divided into one category, so that files with the same function end up in one category. For example, software installation packages used to install software are grouped into one category, system files used to deploy a system are grouped into one category, and files used to issue distribution commands are grouped into one category.
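Grouping by label can be sketched as follows. The function name `group_by_function_label`, the label strings, and the file names are hypothetical; the embodiment does not state how the label is stored.

```python
from collections import defaultdict


def group_by_function_label(files):
    """Group file names by their function label.

    files is an iterable of (file_name, label) pairs.
    """
    categories = defaultdict(list)
    for name, label in files:
        categories[label].append(name)
    return dict(categories)


# Hypothetical labeled training files
files = [
    ("editor-setup.exe", "install"),
    ("kernel.sys", "system"),
    ("player-setup.msi", "install"),
    ("push-config.cmd", "distribute"),
]

groups = group_by_function_label(files)
print(groups)
```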

Further, the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.

A support vector machine (Support Vector Machine, SVM), like a neural network, is a learning mechanism, but unlike a neural network, an SVM uses mathematical methods and optimization techniques. The support vectors are the training sample points that lie on the edge of the margin region. The "machine" here is actually an algorithm; in the field of machine learning, an algorithm is often regarded as a machine.

The K-Means algorithm is an indirect clustering method based on a similarity measure between samples, and it is an unsupervised learning method. The algorithm takes k as a parameter and divides n objects (the n files in the sample training set) into k clusters, so that similarity within a cluster is high and similarity between clusters is low. Similarity is computed from the mean of the objects in a cluster, which is regarded as the cluster's center of gravity. The algorithm first selects k objects at random, each of which represents the centroid of a cluster. Each remaining object is assigned to the most similar cluster according to its distance from each cluster centroid. The new centroid of each cluster is then computed. This process is repeated until the criterion function converges. K-Means is a typical iterative dynamic clustering algorithm with point-by-point modification. When clustering with the K-Means algorithm, common features need to be extracted, and clustering is performed according to the extracted common features.
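The iteration described above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the embodiment's code: files are assumed to be already mapped to numeric feature vectors (tuples), squared Euclidean distance serves as the dissimilarity measure, and the `initial` parameter is an addition that allows fixed starting centroids so that the example is reproducible. Random initialization, as described in the text, is used when `initial` is omitted.

```python
import random


def kmeans(points, k, initial=None, max_iter=100, seed=None):
    """Cluster points (tuples of numbers) into k clusters.

    Returns (labels, centroids). Stops when assignments no longer change.
    """
    if initial is not None:
        centroids = list(initial)
    else:
        centroids = random.Random(seed).sample(points, k)

    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest centroid
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels

        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated groups of hypothetical 2-D feature vectors
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 9), (9, 10)]
labels, centroids = kmeans(points, 2, initial=[(0, 0), (10, 10)])
print(labels)
```

With random initialization the result depends on the starting objects, and K-Means can settle in a poor local optimum, so in practice it is common to run it several times and keep the best result.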

The Bayesian algorithm is a probability-based algorithm. Its basic idea is to compute, for the files in the given sample training set, the probability that each file belongs to each category; a file is assigned to the category for which this probability is largest.
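As one concrete instance of a probability-based classifier, the sketch below implements a small Bernoulli naive Bayes model with Laplace smoothing over binary features. The embodiment does not say which Bayesian variant is intended, so the naive independence assumption, the smoothing, and the names `train_naive_bayes` and `classify` are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (category, feature_set). Returns the model tuple."""
    category_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for category, features in samples:
        category_counts[category] += 1
        for feature in features:
            feature_counts[category][feature] += 1
            vocabulary.add(feature)
    return category_counts, feature_counts, vocabulary


def classify(model, features):
    """Return the category with the highest log posterior score."""
    category_counts, feature_counts, vocabulary = model
    total = sum(category_counts.values())
    best, best_score = None, -math.inf
    for category, n in category_counts.items():
        score = math.log(n / total)  # log prior
        for feature in vocabulary:
            # Laplace-smoothed probability that the feature is present
            p = (feature_counts[category][feature] + 1) / (n + 2)
            score += math.log(p if feature in features else 1 - p)
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training samples
samples = [
    ("doc", {"suffix:docx"}),
    ("doc", {"suffix:doc"}),
    ("mail", {"has:sender", "has:recipient"}),
    ("mail", {"has:sender", "has:recipient"}),
]
model = train_naive_bayes(samples)
print(classify(model, {"has:sender", "has:recipient"}))
print(classify(model, {"suffix:docx"}))
```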

Further, on the basis of the embodiment corresponding to Fig. 1, Step 103 includes:

assigning the file to be classified to the category whose common features yield the largest counted number.

In other words, for each category, the number of common features that are identical to features of the file to be classified is counted. For example, suppose the sample training set contains files of three categories and the file to be classified has three features. The 3 common features of the files of the first category are identical to the 3 features of the file to be classified, so the count for the first category is 3. Of the 5 common features of the files of the second category, only 1 is identical to a feature of the file to be classified, so the count for the second category is 1. None of the 4 common features of the files of the third category is identical to a feature of the file to be classified, so the count for the third category is 0. The largest counted number is 3, so the file to be classified is assigned to the first category.

In the file classification method provided by the embodiment of the present invention, a machine learning algorithm is used to extract the common features of the files of each category in the sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. In the embodiment of the present invention, the common features of the files of each category are learned by a machine learning algorithm and are then used to classify the file to be classified automatically. This avoids spending human resources on file classification and, particularly when the number of files to be classified is very large, effectively improves the efficiency of file classification.
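The decision in the worked example reduces to picking the category with the largest count. The sketch below uses the counts 3, 1, and 0 from the example; the function name `choose_category` and the category names are hypothetical. How ties are resolved is not stated in the text; here Python's `max` simply returns the first category with the largest count.

```python
def choose_category(match_counts):
    """Return the category with the largest number of matching features."""
    return max(match_counts, key=match_counts.get)


# Counts from the worked example: 3, 1 and 0 matching common features
match_counts = {"category 1": 3, "category 2": 1, "category 3": 0}
print(choose_category(match_counts))
```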

An embodiment of the present invention provides a file classification device. As shown in Fig. 2, the file classification device 2 includes:

an extraction module 21, configured to use a machine learning algorithm to extract the common features of the files of each category in the sample training set;

a counting module 22, configured to count, for each category, the number of common features that are identical to features of the file to be classified;

a first dividing module 23, configured to determine the category of the file to be classified according to the counted numbers.

Further, on the basis of the embodiment corresponding to Fig. 2, the present invention provides another file classification device. As shown in Fig. 3, the file classification device 2 also includes:

a second dividing module 24, configured to divide the files that have the same function into one category, according to the function of each file in the sample training set.

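Further, on the basis of the embodiment corresponding to Fig. 2, the first dividing module 23 is specifically configured to assign the file to be classified to the category whose common features yield the largest counted number.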

In practical applications, the extraction module 21, the counting module 22, the first dividing module 23, and the second dividing module 24 can each be implemented by a central processing unit (CPU), a microprocessor (Micro Processor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like in the file classification device 2.

In the file classification device provided by the embodiment of the present invention, a machine learning algorithm is used to extract the common features of the files of each category in the sample training set; for each category, the number of common features that are identical to features of the file to be classified is counted; and the category of the file to be classified is determined according to the counted numbers. In the embodiment of the present invention, the common features of the files of each category are learned by a machine learning algorithm and are then used to classify the file to be classified automatically. This avoids spending human resources on file classification and, particularly when the number of files to be classified is very large, effectively improves the efficiency of file classification.

An embodiment of the present invention provides another file classification device, which includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor. The steps implemented when the processor executes the computer program include:

Further, the steps implemented when the processor executes the computer program also include:

Further, the steps implemented when the processor executes the computer program specifically include:

Although the embodiments disclosed herein are as described above, the described content is only intended to facilitate understanding of the present invention and does not limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains subject to the scope defined by the appended claims.

Claims

1. A file classification method, characterized by including:

using a machine learning algorithm to extract the common features of the files of each category in a sample training set;

counting, for each category, the number of common features that are identical to features of the file to be classified;

determining the category of the file to be classified according to the counted numbers.
2. The file classification method according to claim 1, characterized in that, before the common features of the files of each category in the sample training set are extracted, the method also includes:

dividing the files that have the same function into one category, according to the function of each file in the sample training set.
3. The file classification method according to claim 1 or 2, characterized in that

the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
4. The file classification method according to claim 1 or 2, characterized in that determining the category of the file to be classified according to the counted numbers includes:

assigning the file to be classified to the category whose common features yield the largest counted number.
5. A file classification device, characterized by including:

an extraction module, configured to use a machine learning algorithm to extract the common features of the files of each category in a sample training set;

a counting module, configured to count, for each category, the number of common features that are identical to features of the file to be classified;

a first dividing module, configured to determine the category of the file to be classified according to the counted numbers.
6. The file classification device according to claim 5, characterized by also including:

a second dividing module, configured to divide the files that have the same function into one category, according to the function of each file in the sample training set.
7. The file classification device according to claim 5 or 6, characterized in that

the machine learning algorithm is a support vector machine, the K-Means algorithm, or a Bayesian algorithm.
8. The file classification device according to claim 5 or 6, characterized in that the first dividing module is specifically configured to

assign the file to be classified to the category whose common features yield the largest counted number.