CN107679244A - File classification method and device


Info

Publication number
CN107679244A
Authority
CN
China
Prior art keywords
sample
text
classification
sample set
sample text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711042851.2A
Other languages
Chinese (zh)
Inventor
许丹丹
刘静沙
刘颖慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN201711042851.2A priority Critical patent/CN107679244A/en
Publication of CN107679244A publication Critical patent/CN107679244A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present invention provide a text classification method and device. The method includes: preprocessing each sample text in a sample set to obtain the classification information and segmented corpus of each sample text; aggregating the classification information and segmented corpus of the sample texts in the sample set and dividing the sample set into a test set and a training set; and performing text classification on the sample texts in the sample set using the test set and the training set. By preprocessing each sample text in the sample set to obtain its classification information and segmented corpus, aggregating that information and corpus to divide the sample set into a test set and a training set, and classifying the sample texts with the test set and the training set, the embodiments combine the R language with Hadoop to classify big-data text.

Description

File classification method and device
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a text classification method and device.
Background art
Text classification is the automatic labeling of a text collection (or other entities or objects) by computer according to a given taxonomy or standard. It is a form of automatic classification based on a taxonomy; the Naive Bayes classifier is a representative method.
Text classification generally comprises text representation, classifier selection and training, and evaluation of and feedback on the classification results. Text representation can be further subdivided into steps such as text preprocessing, indexing, statistics, and feature extraction. The general functional modules of a text classification system are: (1) preprocessing: formatting the raw corpus into a uniform form to simplify subsequent processing; (2) indexing: decomposing documents into basic processing units while reducing the overhead of later steps; (3) statistics: word frequency counts and the probabilities relating terms (words, concepts) to categories; (4) feature extraction: extracting from each document the features that reflect its subject matter; (5) classifier: training the classifier; (6) evaluation: analyzing the classifier's test results.
With the rapid growth of data volumes in recent years and the arrival of the big-data era, traditional statistical analysis tools and data processing capabilities can no longer meet the requirements of large-scale data processing and cannot classify big-data text.
Summary of the invention
Embodiments of the present invention provide a text classification method and device, so as to classify big-data text.
One aspect of the embodiments of the present invention provides a text classification method, including:
preprocessing each sample text in a sample set to obtain the classification information and segmented corpus of each sample text;
aggregating the classification information and segmented corpus of the sample texts in the sample set, and dividing the sample set into a test set and a training set;
performing text classification on the sample texts in the sample set using the test set and the training set.
Another aspect of the embodiments of the present invention provides a text classification device, including:
a preprocessing module, configured to preprocess each sample text in a sample set to obtain the classification information and segmented corpus of each sample text;
an aggregation module, configured to aggregate the classification information and segmented corpus of the sample texts in the sample set;
a division module, configured to divide the sample set into a test set and a training set;
a classification module, configured to perform text classification on the sample texts in the sample set using the test set and the training set.
With the text classification method and device provided by the embodiments of the present invention, each sample text in the sample set is preprocessed to obtain its classification information and segmented corpus; the classification information and segmented corpus of the sample texts are aggregated and the sample set is divided into a test set and a training set; and the sample texts in the sample set are classified using the test set and the training set. Big-data text classification is thereby achieved by combining the R language with Hadoop.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification; they illustrate embodiments consistent with the disclosure and serve, together with the specification, to explain its principles.
Fig. 1 is a schematic diagram of the system architecture provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the text classification method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the text classification method provided by another embodiment of the present invention;
Fig. 4 is a schematic diagram of the MapReduce task chain structure provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of the text classification device provided by an embodiment of the present invention.
The above drawings show specific embodiments of the disclosure, which are described in more detail below. The drawings and the accompanying description are not intended to limit the scope of the disclosed concepts in any way, but to illustrate those concepts to persons skilled in the art by reference to specific embodiments.
Embodiments
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the disclosure as recited in the appended claims.
The terms used in the present invention are first explained:
R language: R is a free, open-source statistical analysis tool and an efficient programming language with a rich set of statistical models and data analysis methods. However, its data processing scalability is poor: its core engine is memory-based, so the volume of data it can handle is very limited and memory overflow is common. It therefore cannot effectively process big data or real-time data streams at the GB, TB, or even PB scale.
Hadoop: a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the low-level details of the distributed system, exploiting the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant, is designed to be deployed on inexpensive (low-cost) hardware, and provides high-throughput access to application data, making it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements and allows streaming access to data in the file system. The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over that data.
MapReduce: a programming model for parallel computation over large data sets (larger than 1 TB). Its main ideas, the concepts "Map" and "Reduce", are borrowed from functional programming, together with features of vector programming languages. It allows programmers to run their programs on a distributed system without knowing how to write distributed parallel code. Current software implementations specify a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures all mapped key-value pairs sharing the same key are grouped together.
RHadoop: the product of combining Hadoop with the R language, enabling big-data analysis in R. RHadoop consists of three R packages: the rmr package, the rhdfs package, and the rhbase package. The rmr package corresponds to MapReduce in the Hadoop framework, the rhdfs package corresponds to HDFS, and the rhbase package corresponds to HBase.
Under normal circumstances, R cannot effectively process big data or real-time data streams at the GB, TB, or even PB scale. The Hadoop-based distributed processing framework provides massive data storage and processing through HDFS and MapReduce; its Mahout component offers a machine learning algorithm library, but development is difficult, cycles are long, and model and algorithm support is incomplete. R and Hadoop can therefore be combined into RHadoop: the Map/Reduce modules are written in R, and data processing and modeling are carried out with Hadoop's Map/Reduce parallel processing mechanism.
The text classification method provided by the present invention is applicable to the system architecture shown in Fig. 1. As shown in Fig. 1, the architecture comprises an R console, the RHadoop packages, and the Hadoop platform. The RHadoop packages include the rmr, rhdfs, and rhbase packages; in this embodiment, the rmr and rhdfs packages are used. The rmr and rhdfs packages provide big-data operations in the R environment, and each package exposes different Hadoop features: as shown in Fig. 1, the rhdfs package provides HDFS access from the R environment, and the rmr package provides the ability to invoke MapReduce from the R environment.
Specifically, the rhdfs package provides the interface through which the R console accesses HDFS. It can call the HDFS API of the Hadoop back end to operate on data stored in HDFS, in particular to inspect the results of MapReduce jobs, and it offers methods for manipulating HDFS data.
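As an illustration of this kind of HDFS access, the following minimal sketch uses a few documented rhdfs calls (hdfs.init, hdfs.ls, hdfs.put, hdfs.file, hdfs.read); the paths are hypothetical, and the snippet assumes a working RHadoop installation rather than reproducing the patent's own code.

```r
# Minimal sketch of HDFS access from the R console via rhdfs (paths are hypothetical).
library(rhdfs)

hdfs.init()                              # connect to the cluster's HDFS

hdfs.ls("/user/classifier")              # list the working directory
hdfs.put("samples/law_001.txt",          # copy a local sample text into HDFS
         "/user/classifier/raw/law_001.txt")

f <- hdfs.file("/user/classifier/raw/law_001.txt", "r")
bytes <- hdfs.read(f)                    # read the raw bytes back for inspection
hdfs.close(f)
cat(rawToChar(bytes))
```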
The rmr package provides the interface to Hadoop's MapReduce functionality from within R. Its main purpose is to decompose the logic of an R program into a map stage and a reduce stage and then submit the job through rmr's methods. Based on parameters such as the input directory, map function, reduce function, and output directory, rmr calls the Hadoop Streaming MapReduce API and runs the MapReduce job on the Hadoop cluster.
The RStudio editor is an integrated, visual development environment (IDE) for the R language.
The Hadoop platform is an open-source data analysis platform that solves the reliable storage and processing of big data, i.e. data too large for a single computer to store or to process within the required time. It is suited to processing unstructured data and includes the basic HDFS and MapReduce modules.
MapReduce provides a standardized, data-location-aware workflow: read the data, map it (Map), shuffle and sort it by key-value pairs, and then reduce it (Reduce) to obtain the final output.
The text classification method provided by the present invention is intended to solve the above technical problems of the prior art.
The technical solution of the present application and how it solves the above technical problems are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the invention are described below with reference to the drawings.
Fig. 2 is a flowchart of the text classification method provided by an embodiment of the present invention. To address the above technical problems of the prior art, the embodiment provides a text classification method comprising the following steps:
Step 201: preprocess each sample text in the sample set to obtain the classification information and segmented corpus of each sample text.
In this embodiment, on top of an existing Hadoop cluster, R and RHadoop are installed on every node, and in particular the R IDE RStudio is installed on the master node.
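A minimal setup sketch along those lines is shown below; the HADOOP_CMD and HADOOP_STREAMING paths are hypothetical, and the package name rmr2 (the current name of RHadoop's rmr package) is an assumption about the installed version.

```r
# Hypothetical per-node R setup for RHadoop (paths depend on the actual cluster).
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

library(rhdfs)   # HDFS access from R
library(rmr2)    # MapReduce from R (RHadoop's "rmr" package)

hdfs.init()
rmr.options(backend = "hadoop")   # run jobs on the cluster, not in local mode
```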
The sample set in this embodiment contains sample texts of multiple types; optionally, the sample texts are files in txt format. The sample texts may specifically cover 25 categories such as law, sports, and medicine, and the total data volume of the sample set is at the TB level. Optionally, a training set and a test set are built from the sample set. The training set is used to train a model that can classify text, and the test set is used to test that model, for example to measure the accuracy and speed of its text classification.
Optionally, the sample texts may be stored on different servers; that is, the sample texts in the sample set may be stored across servers. Specifically, the sample texts may be unstructured HDFS files, for example txt-format unstructured HDFS files elastically stored across servers. Large numbers of sample texts stored across servers constitute big-data storage.
Furthermore, data reconstruction can be performed on the sample texts, for example automatic multi-file reading, semi-structuring, and resolution of encoding garbling. This process includes two steps: first, setup, which introduces and calls the interface to the Hadoop cluster resources; second, parallelized data processing, in which the user defines and executes a MapReduce job. The job is created according to the structure mapreduce(input, output, map, reduce, combine, input.format, output.format, verbose), and key-value pairs are created and extracted with keyval(key, val).
The data reconstruction of the sample texts is completed by one map process and one reduce process. The map stage reads multiple txt files in parallel through the file directory and labels their categories; for each file the key is the file's category (for example "Law") and the value is the file's content. The reduce stage aggregates the files into a sample data set; before data analysis, the source texts are reconstructed and a new csv-format file is created and loaded into HDFS.
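A sketch of such a reconstruction job, written against the mapreduce()/keyval() structure named above, is given below. The HDFS directory layout, the use of the mapreduce_map_input_file environment variable to recover the category from the file path, and the omission of the csv output are assumptions for illustration rather than the patent's exact code.

```r
library(rmr2)

# Map: rmr2's "text" input format hands the mapper the lines of one input split.
# The category label is recovered from the split's file path, which Hadoop
# Streaming exposes to the task as an environment variable (an assumption; the
# patent does not say how the label reaches the mapper).
reconstruct.map <- function(k, lines) {
  path     <- Sys.getenv("mapreduce_map_input_file")  # e.g. .../raw/Law/0001.txt
  category <- basename(dirname(path))                 # -> "Law"
  keyval(category, paste(lines, collapse = " "))      # key = class, value = file content
}

# Reduce: aggregate the labeled documents of each category into the sample set;
# writing the reconstructed set back to HDFS as csv is left out of the sketch.
reconstruct.reduce <- function(category, texts) {
  keyval(category, texts)
}

job <- mapreduce(
  input        = "/user/classifier/raw",   # hypothetical layout: raw/<Category>/*.txt
  output       = "/user/classifier/samples",
  input.format = "text",
  map          = reconstruct.map,
  reduce       = reconstruct.reduce
)
```

from.dfs(job) can then bring the reconstructed (category, text) records back into the R session for the analysis steps described below.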
Data analysis can then be performed on each sample text in the sample set; this analysis uses the R language to implement Hadoop map/reduce parallelized data processing and model application. During the data analysis, each sample text in the sample set needs to be preprocessed.
Specifically, preprocessing each sample text in the sample set includes at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing stop words from each sample text in the sample set.
The dictionary may specifically be a custom dictionary containing multiple predefined terms. When the sample texts are segmented, the custom dictionary can be used for the segmentation, for example by extracting from each sample text the terms that match entries in the custom dictionary, so that the sample text is decomposed into terms matching the custom dictionary. This improves the accuracy of the segmentation.
After the sample texts have been segmented, stop-word removal can strip meaningless words such as modal particles and interjections. Segmenting each sample text yields keywords that characterize it, and the feature information of the sample text can be determined from these keywords. By preprocessing each sample text in the sample set, the classification information and segmented corpus of each sample text are obtained; optionally, the key is the text category value and the value is the segmented corpus of the individual sample text.
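The patent does not name a segmentation library; the sketch below uses the jiebaR package purely as one plausible way to realize "load a custom dictionary, segment, remove stop words" in R, with hypothetical dictionary and stop-word file names.

```r
library(jiebaR)

# Hypothetical dictionary and stop-word files; jiebaR stands in for whichever
# segmenter is actually used.
seg <- worker(user      = "custom_dict.utf8",   # custom dictionary of predefined terms
              stop_word = "stop_words.utf8")    # modal particles, interjections, etc.

preprocess_sample <- function(category, text) {
  tokens <- segment(text, seg)          # word segmentation with the custom dictionary
  # key = text category value, value = the segmented corpus of this sample text
  list(key = category, value = paste(tokens, collapse = " "))
}

preprocess_sample("Law", "本合同自双方签字之日起生效。")
```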
Step 202: aggregate the classification information and segmented corpus of the sample texts in the sample set, and divide the sample set into a test set and a training set.
In this embodiment, the reduce process mainly aggregates the classification information and segmented corpus of the sample texts in the sample set and performs a document-term matrix conversion, producing keys that are text category values and values that form the word-frequency matrix. The training set and test set are then divided at random, subject to the training set making up more than 60% of the samples.
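As one way to picture this step, the sketch below builds the document-term (word-frequency) matrix with the tm package and draws a random 70/30 split, which satisfies the "training set above 60%" premise; the 70% figure and the tm-based conversion are illustrative assumptions, not taken from the patent.

```r
library(tm)

# samples: a data.frame with columns `category` and `text` (segmented, space-separated),
# e.g. the output of the reconstruction and preprocessing steps above.
build_dtm <- function(samples) {
  corpus <- VCorpus(VectorSource(samples$text))
  dtm    <- DocumentTermMatrix(corpus)          # document-term (word-frequency) matrix
  list(x = as.matrix(dtm), y = factor(samples$category))
}

split_train_test <- function(x, y, train_ratio = 0.7) {   # > 60% goes to training
  idx <- sample(seq_len(nrow(x)), size = floor(train_ratio * nrow(x)))
  list(train_x = x[idx, ],  train_y = y[idx],
       test_x  = x[-idx, ], test_y  = y[-idx])
}
```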
Step 203: perform text classification on the sample texts in the sample set using the test set and the training set.
In this embodiment, text classification is performed with the K-nearest-neighbor classification algorithm, one of the machine learning classification algorithms. It mainly uses the Euclidean distance between the training-set and test-set feature vectors and a chosen number of nearest neighbors to judge similarity; finally, the predicted category is stored as the value and compared against the true category key for correctness, completing the classification process.
The K-nearest-neighbor (kNN) classification algorithm is one of the simplest methods in data mining classification. "K nearest neighbors" means the k closest neighbors: each sample can be represented by its k nearest neighbors.
The core idea of kNN is that if the majority of the k most similar samples (i.e. the nearest neighbors in feature space) of a sample belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. In making classification decisions, the method determines the category of the sample to be classified based only on the categories of the one or few closest samples; kNN depends only on a small number of neighboring samples. Because it relies on the surrounding neighbors rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily. kNN can be used not only for classification but also for regression: by finding a sample's k nearest neighbors and assigning the average of their attribute values to the sample, the sample's attribute value can be obtained. A more useful variant assigns different weights to neighbors at different distances, for example weights inversely proportional to distance.
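A minimal version of this classification step in R is sketched below, assuming that class::knn (which uses Euclidean distance, matching the description above) stands in for the classifier; the value of k and the accuracy check are illustrative.

```r
library(class)

# train_x/test_x: word-frequency matrices; train_y/test_y: true category labels,
# e.g. produced by split_train_test() in the sketch above.
run_knn <- function(train_x, train_y, test_x, test_y, k = 5) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  # compare the predicted category (value) against the true category (key)
  accuracy <- mean(pred == test_y)
  list(predictions = pred, accuracy = accuracy)
}
```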
In this embodiment, each sample text in the sample set is preprocessed to obtain its classification information and segmented corpus; the classification information and segmented corpus of the sample texts are aggregated and the sample set is divided into a test set and a training set; the sample texts in the sample set are classified using the test set and the training set; and big-data text classification is achieved by combining the R language with Hadoop.
Fig. 3 is a flowchart of the text classification method provided by another embodiment of the present invention. Fig. 4 is a schematic diagram of the MapReduce task chain structure provided by an embodiment of the present invention. To address the above technical problems of the prior art, this embodiment provides a text classification method comprising the following steps:
Step 301: preprocess each sample text in the sample set to obtain the classification information and segmented corpus of each sample text.
The data analysis of the sample texts in the sample set uses the R language to implement Hadoop map/reduce parallelized data processing and model application. During the data analysis, each sample text in the sample set needs to be preprocessed.
Specifically, preprocessing each sample text in the sample set includes at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing stop words from each sample text in the sample set.
The dictionary may specifically be a custom dictionary containing multiple predefined terms. When the sample texts are segmented, the custom dictionary can be used for the segmentation, for example by extracting from each sample text the terms that match entries in the custom dictionary, so that the sample text is decomposed into terms matching the custom dictionary; this improves the accuracy of the segmentation. After segmentation, stop-word removal can strip meaningless words such as modal particles and interjections. Segmenting each sample text yields keywords that characterize it, and the feature information of the sample text can be determined from these keywords. By preprocessing each sample text in the sample set, the classification information and segmented corpus of each sample text are obtained; optionally, the key is the text category value and the value is the segmented corpus of the individual sample text.
Step 302: aggregate the classification information and segmented corpus of the sample texts in the sample set and perform a document-term matrix conversion to obtain the text category values and the word-frequency matrix.
In this embodiment, the reduce process mainly aggregates the classification information and segmented corpus of the sample texts in the sample set and performs a document-term matrix conversion, producing keys that are text category values and values that form the word-frequency matrix.
Step 303: on the premise that the training set accounts for a preset proportion of the sample set, randomly divide the sample set to obtain the test set and the training set.
The training set and test set are divided at random, subject to the training set making up more than 60% of the samples.
Step 304: perform text classification on each sample text in the sample set using a machine learning classification algorithm.
In this embodiment, a machine learning classification algorithm is used to classify the sample texts in the sample set. Optionally, the machine learning classification algorithm includes the K-nearest-neighbor classification algorithm. For example, text classification is performed with the K-nearest-neighbor algorithm, one of the machine learning classification algorithms, which mainly uses the Euclidean distance between the training-set and test-set feature vectors and a chosen number of nearest neighbors to judge similarity; finally, the predicted category is stored as the value and compared against the true category key for correctness, completing the classification process.
In this embodiment, the MapReduce job is written according to the actual mapper and reducer logic and then executed. Within the whole task chain, it is only necessary to call another mapreduce function inside the original mapreduce function and pass the required parameters; this process is therefore completed by one map process and one reduce process. The whole MapReduce task chain structure is shown in Fig. 4.
As shown in Fig. 4, the input may specifically be the text to be classified, and the input is read in as text. The text data is then aggregated, for example by aggregating the classification information and segmented corpus of the sample texts in the sample set. The aggregated text data is segmented and deduplicated, and the repeated terms are output. The corpus matrices are then aggregated, the document-term matrix conversion and the kNN text classification are performed, and finally the text classification result is obtained and output.
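Tying the earlier sketches together, one hypothetical driver for the Fig. 4 chain might look like the following; every function and object name here comes from the illustrative sketches above, not from the patent itself.

```r
# Hypothetical end-to-end driver for the Fig. 4 chain, reusing only the
# illustrative helpers defined in the earlier sketches.
samples   <- from.dfs(job)                      # keys: categories, values: document texts
corpus_df <- data.frame(category = samples$key,
                        text     = samples$val,
                        stringsAsFactors = FALSE)

# segmentation and deduplication step, using the jiebaR worker from the earlier sketch
corpus_df$text <- vapply(corpus_df$text,
                         function(t) paste(unique(segment(t, seg)), collapse = " "),
                         character(1))

dataset <- build_dtm(corpus_df)                            # document-term matrix
sets    <- split_train_test(dataset$x, dataset$y)          # > 60% to training
result  <- run_knn(sets$train_x, sets$train_y,
                   sets$test_x,  sets$test_y, k = 5)       # kNN text classification
result$accuracy                                            # check against true categories
```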
This embodiment exploits R's powerful data mining algorithms and Hadoop's distributed storage and data processing capabilities, realizing big-data text classification with R + Hadoop and completing storage of the model's classifications. Compared with heavyweight code development, it is easier to deploy and port quickly.
Fig. 5 is a structural diagram of the text classification device provided by an embodiment of the present invention. The text classification device provided by this embodiment can execute the processing flow provided by the text classification method embodiments. As shown in Fig. 5, the text classification device 50 includes a preprocessing module 51, an aggregation module 52, a division module 53, and a classification module 54. The preprocessing module 51 is configured to preprocess each sample text in a sample set to obtain the classification information and segmented corpus of each sample text; the aggregation module 52 is configured to aggregate the classification information and segmented corpus of the sample texts in the sample set; the division module 53 is configured to divide the sample set into a test set and a training set; and the classification module 54 is configured to perform text classification on the sample texts in the sample set using the test set and the training set.
The text classification device provided by this embodiment can specifically be used to execute the method embodiment of Fig. 2 described above; its specific functions are not repeated here.
In this embodiment, each sample text in the sample set is preprocessed to obtain its classification information and segmented corpus; the classification information and segmented corpus of the sample texts are aggregated and the sample set is divided into a test set and a training set; the sample texts in the sample set are classified using the test set and the training set; and big-data text classification is achieved by combining the R language with Hadoop.
On the basis of the above embodiment, the preprocessing module 51 is specifically configured for at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing stop words from each sample text in the sample set.
Optionally, the aggregation module 52 is specifically configured to aggregate the classification information and segmented corpus of the sample texts in the sample set and perform a document-term matrix conversion to obtain the text category values and the word-frequency matrix. The division module 53 is specifically configured to, on the premise that the training set accounts for a preset proportion of the sample set, randomly divide the sample set to obtain the test set and the training set.
In addition, when performing text classification on the sample texts in the sample set using the test set and the training set, the classification module 54 is specifically configured to classify the sample texts in the sample set with a machine learning classification algorithm. Optionally, the machine learning classification algorithm includes the K-nearest-neighbor classification algorithm.
The text classification device provided by this embodiment can specifically be used to execute the method embodiment of Fig. 3 described above; its specific functions are not repeated here.
The embodiments of the present invention exploit R's powerful data mining algorithms and Hadoop's distributed storage and data processing capabilities, realizing big-data text classification with R + Hadoop and completing storage of the model's classifications. Compared with heavyweight code development, it is easier to deploy and port quickly.
In summary, the embodiments of the present invention preprocess each sample text in the sample set to obtain its classification information and segmented corpus; aggregate the classification information and segmented corpus of the sample texts and divide the sample set into a test set and a training set; classify the sample texts in the sample set using the test set and the training set; and achieve big-data text classification by combining the R language with Hadoop. Exploiting R's powerful data mining algorithms and Hadoop's distributed storage and data processing capabilities, big-data text classification is realized with R + Hadoop and storage of the model's classifications is completed. Compared with heavyweight code development, it is easier to deploy and port quickly.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist separately and physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional modules described above is only an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some or all of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A text classification method, characterized by comprising:
    preprocessing each sample text in a sample set to obtain the classification information and segmented corpus of each sample text;
    aggregating the classification information and segmented corpus of the sample texts in the sample set, and dividing the sample set into a test set and a training set; and
    performing text classification on the sample texts in the sample set using the test set and the training set.
  2. The method according to claim 1, characterized in that the preprocessing of each sample text in the sample set comprises at least one of the following:
    loading a dictionary;
    performing word segmentation on each sample text in the sample set; and
    removing stop words from each sample text in the sample set.
  3. The method according to claim 2, characterized in that aggregating the classification information and segmented corpus of the sample texts in the sample set and dividing the sample set into a test set and a training set comprises:
    aggregating the classification information and segmented corpus of the sample texts in the sample set and performing a document-term matrix conversion to obtain the text category values and the word-frequency matrix; and
    on the premise that the training set accounts for a preset proportion of the sample set, randomly dividing the sample set to obtain the test set and the training set.
  4. The method according to claim 3, characterized in that performing text classification on the sample texts in the sample set using the test set and the training set comprises:
    performing text classification on each sample text in the sample set using a machine learning classification algorithm.
  5. The method according to claim 4, characterized in that the machine learning classification algorithm comprises the K-nearest-neighbor classification algorithm.
  6. A text classification device, characterized by comprising:
    a preprocessing module, configured to preprocess each sample text in a sample set to obtain the classification information and segmented corpus of each sample text;
    an aggregation module, configured to aggregate the classification information and segmented corpus of the sample texts in the sample set;
    a division module, configured to divide the sample set into a test set and a training set; and
    a classification module, configured to perform text classification on the sample texts in the sample set using the test set and the training set.
  7. The text classification device according to claim 6, characterized in that the preprocessing module is specifically configured for at least one of the following:
    loading a dictionary;
    performing word segmentation on each sample text in the sample set; and
    removing stop words from each sample text in the sample set.
  8. The text classification device according to claim 7, characterized in that the aggregation module is specifically configured to: aggregate the classification information and segmented corpus of the sample texts in the sample set and perform a document-term matrix conversion to obtain the text category values and the word-frequency matrix;
    and the division module is specifically configured to: on the premise that the training set accounts for a preset proportion of the sample set, randomly divide the sample set to obtain the test set and the training set.
  9. The text classification device according to claim 8, characterized in that, when performing text classification on the sample texts in the sample set using the test set and the training set, the classification module is specifically configured to:
    perform text classification on each sample text in the sample set using a machine learning classification algorithm.
  10. The text classification device according to claim 9, characterized in that the machine learning classification algorithm comprises the K-nearest-neighbor classification algorithm.
CN201711042851.2A 2017-10-30 2017-10-30 File classification method and device Pending CN107679244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711042851.2A CN107679244A (en) 2017-10-30 2017-10-30 File classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711042851.2A CN107679244A (en) 2017-10-30 2017-10-30 File classification method and device

Publications (1)

Publication Number Publication Date
CN107679244A true CN107679244A (en) 2018-02-09

Family

ID=61143057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711042851.2A Pending CN107679244A (en) 2017-10-30 2017-10-30 File classification method and device

Country Status (1)

Country Link
CN (1) CN107679244A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294223A1 (en) * 2006-06-16 2007-12-20 Technion Research And Development Foundation Ltd. Text Categorization Using External Knowledge
CN106202116A (en) * 2015-05-08 2016-12-07 北京信息科技大学 A kind of file classification method based on rough set and KNN and system
CN105760471A (en) * 2016-02-06 2016-07-13 北京工业大学 Classification method for two types of texts based on multiconlitron
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵文娟: "基于Hadoop的Web文本分类系统设计研究", 《兰州大学学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681607A (en) * 2018-05-25 2018-10-19 马鞍山市润启新材料科技有限公司 A kind of accounting voucher sorting technique and device
CN109189883A (en) * 2018-08-09 2019-01-11 中国银行股份有限公司 A kind of intelligent distributing method and device of electronic document
CN109189883B (en) * 2018-08-09 2022-01-28 中国银行股份有限公司 Intelligent distribution method and device for electronic files
CN110895562A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Feedback information processing method and device
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180209