CN107679244A - File classification method and device - Google Patents
- Publication number
- CN107679244A (application CN201711042851.2A)
- Authority
- CN
- China
- Prior art keywords
- sample
- text
- classification
- sample set
- sample text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present invention provide a text classification method and device. The method includes: preprocessing each sample text in a sample set to obtain the classification information and word-segmentation corpus of each sample text; aggregating the classification information and word-segmentation corpus of each sample text in the sample set, and dividing the sample set into a test set and a training set; and performing text classification on each sample text in the sample set using the test set and the training set. By combining the R language with Hadoop, the embodiments of the present invention achieve classification of big-data text.
Description
Technical field
The embodiments of the present invention relate to the field of computer technology, and in particular to a text classification method and device.
Background technology
Text classification is the computer-based automatic assignment of the texts in a text collection (or of other entities or objects) to predefined categories according to a given taxonomy or standard. It is a form of automatic classification based on a category system; Naive Bayes classification is one such method.
Text classification generally comprises processes such as text representation, classifier selection and training, and evaluation of and feedback on the classification results, where text representation can be further subdivided into steps such as text preprocessing, indexing and statistics, and feature extraction. The general functional modules of a text classification system are: (1) preprocessing: formatting the raw corpus into a uniform form for convenient subsequent processing; (2) indexing: decomposing documents into basic processing units while reducing the overhead of subsequent processing; (3) statistics: word-frequency statistics and the correlation probabilities between items (words, concepts) and categories; (4) feature extraction: extracting from documents the features that reflect their subject matter; (5) classifier: training the classifier; (6) evaluation: analysis of the classifier's test results.
Owing to the rapid growth of data volume in recent years and the arrival of the big-data era, traditional statistical analysis tools and data-processing capabilities cannot meet the requirements of large-scale data processing and cannot classify big-data text.
Summary of the invention
The embodiments of the present invention provide a text classification method and device so as to classify big-data text.
One aspect of the embodiments of the present invention provides a text classification method, including:
preprocessing each sample text in a sample set to obtain the classification information and word-segmentation corpus of each sample text;
aggregating the classification information and word-segmentation corpus of each sample text in the sample set, and dividing the sample set into a test set and a training set;
performing text classification on each sample text in the sample set using the test set and the training set.
Another aspect of the embodiments of the present invention provides a text classification device, including:
a preprocessing module, configured to preprocess each sample text in a sample set to obtain the classification information and word-segmentation corpus of each sample text;
an aggregation module, configured to aggregate the classification information and word-segmentation corpus of each sample text in the sample set;
a division module, configured to divide the sample set into a test set and a training set;
a classification module, configured to perform text classification on each sample text in the sample set using the test set and the training set.
In the text classification method and device provided by the embodiments of the present invention, each sample text in a sample set is preprocessed to obtain its classification information and word-segmentation corpus; the classification information and word-segmentation corpus of each sample text in the sample set are aggregated, and the sample set is divided into a test set and a training set; text classification is then performed on each sample text in the sample set using the test set and the training set, so that classification of big-data text is achieved by combining the R language with Hadoop.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of the system architecture provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the text classification method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the text classification method provided by another embodiment of the present invention;
Fig. 4 is a schematic diagram of the MapReduce task chain structure provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of the text classification device provided by an embodiment of the present invention.
The above drawings show explicit embodiments of the disclosure, described in more detail hereinafter. The drawings and the accompanying descriptions are not intended to limit the scope of the disclosed concept in any way, but rather to illustrate the concept of the disclosure to those skilled in the art by reference to specific embodiments.
Embodiment
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. In the following description, where reference is made to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the disclosure as detailed in the appended claims.
Terms involved in the present invention are explained first:
R language: R is a free, open-source statistical analysis tool and an efficient programming language with a rich set of statistical models and data-analysis methods. However, its data-processing scalability is poor: because its core engine is memory-based, the amount of data it can process is very limited and memory overflow commonly occurs, so it cannot effectively handle big data or real-time data streams at the GB, TB, or even PB scale.
Hadoop: a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the low-level details of distribution, making full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suited to applications with very large data sets. HDFS relaxes certain POSIX requirements and allows streaming access to the data in the file system. The core of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it.
MapReduce: a programming model for parallel computation over large-scale data sets (greater than 1 TB). Its main ideas, the concepts "Map" and "Reduce", are borrowed from functional programming, along with features borrowed from vector programming languages. It makes it easy for programmers to run their own programs on a distributed system without knowing how to do distributed parallel programming. Current software implementations specify a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that each group of mapped key-value pairs shares the same key.
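The Map/Reduce model just described can be sketched in a few lines. The patent's implementation uses R with the rmr package; the following is a language-neutral Python sketch (all names here are illustrative, not from the patent) of the canonical word-count example: a mapper emits key-value pairs, equal keys are grouped together (the shuffle), and a reducer folds each group into a result.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce model: map each record to
    (key, value) pairs, group the pairs by key, then reduce each group."""
    mapped = [pair for rec in records for pair in mapper(rec)]
    mapped.sort(key=itemgetter(0))  # the "shuffle": bring equal keys together
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))}

# Word count, the canonical MapReduce example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

result = map_reduce(["a b a", "b c"], mapper, reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

In a real cluster the shuffle happens across machines; here a sort plus `groupby` stands in for it.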
RHadoop: RHadoop is the product of combining Hadoop with the R language; once combined, the two can be used for big-data analysis. RHadoop comprises three R packages: the rmr package, the rhdfs package, and the rhbase package. The rmr package corresponds to MapReduce in the Hadoop architecture, the rhdfs package corresponds to HDFS, and the rhbase package corresponds to HBase.
Under normal circumstances, the R language cannot effectively handle big data or real-time data streams at the GB, TB, or even PB scale. The Hadoop-based distributed big-data processing framework, in contrast, achieves mass data storage and processing through HDFS and MapReduce, and its Mahout component provides a machine-learning algorithm library; however, development with it is difficult, development cycles are long, and model-algorithm support is incomplete. Therefore, the R language and Hadoop can be combined into RHadoop: the Map/Reduce modules are written in R, while data processing and modeling are carried out with Hadoop's Map/Reduce parallel-processing mechanism.
The text classification method provided by the present invention is applicable to the system architecture shown in Fig. 1. As shown in Fig. 1, the system architecture includes: the R console, the RHadoop suite, and the Hadoop platform. The RHadoop suite includes the rmr, rhdfs, and rhbase packages; in this embodiment, the RHadoop suite includes the rmr and rhdfs packages. The rmr and rhdfs packages in the RHadoop suite provide big-data operations in the R environment, each package providing different Hadoop capabilities: as shown in Fig. 1, the rhdfs package provides access to HDFS from the R environment, and the rmr package provides MapReduce invocation from the R environment.
Specifically, the rhdfs package provides the interface through which the R console accesses HDFS. It can call the back-end Hadoop HDFS API to operate on the data stored on HDFS, in particular on the results of executed MapReduce programs, and it provides operations on HDFS data.
The rmr package is the interface to Hadoop's MapReduce functionality in the R environment. Its main purpose is to decompose the logic of an R program into a map stage and a reduce stage and then submit the task through the rmr methods. rmr then calls the Hadoop streaming MapReduce API according to parameters such as the input directory, the map execution unit, the reduce execution unit, and the output directory, and executes the MapReduce task on the Hadoop cluster.
The RStudio editor is an integrated, visual IDE for the R language.
The Hadoop platform is an open-source data analysis platform that solves the reliable storage and processing of big data (too large for one computer to store, or impossible for one computer to process within the required time). It is suited to handling unstructured data and includes the basic HDFS and MapReduce modules.
MapReduce technology provides a standardized, data-locality-aware workflow: read the data, map (Map) the data, shuffle and sort by key-value pairs, then reduce (Reduce) the data to obtain the final output.
The text classification method provided by the present invention is intended to solve the above technical problems of the prior art.
The technical solution of the present application, and how it solves the above technical problems, are described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the invention are described below with reference to the drawings.
Fig. 2 is a flowchart of the text classification method provided by an embodiment of the present invention. For the above technical problems of the prior art, an embodiment of the present invention provides a text classification method comprising the following specific steps:
Step 201: preprocess each sample text in the sample set to obtain the classification information and word-segmentation corpus of each sample text.
In this embodiment, on the basis of an existing Hadoop cluster, R and RHadoop are installed on every node, and the R IDE RStudio is additionally installed on the master node.
The sample set described in this embodiment includes multiple types of sample text; optionally, the sample texts are in txt format. The multiple types may specifically include 25 classes of text file in total, such as law, sports, and medicine, with the data volume of all sample texts in the sample set at the TB level. Optionally, a training set and a test set are built from the sample set. The training set is used to train a model that can classify text; the test set is used to test that model, for example to test indices such as the accuracy and speed with which the model classifies text.
Optionally, the sample texts may be stored on different servers. That is, all the sample texts in the sample set may be stored across servers; specifically, the sample texts may be unstructured HDFS files, elastically stored across servers as txt-format unstructured HDFS files. The sample texts stored across servers constitute big data: large numbers of sample texts stored across servers form big-data-scale storage.
In addition, data reconstruction can be performed on the sample texts, for example automatic multi-file reading and semi-structured processing, which resolves encoding garbling. This process includes two steps: first, configuring, importing, and calling the interfaces to the Hadoop cluster resources; second, parallelized data processing, in which the user defines and executes MapReduce tasks following the structure mapreduce(input, output, map, reduce, combine, input.format, output.format, verbose), creating and extracting key-value pairs with keyval(key, val).
The data reconstruction of the sample texts is completed by one map process and one reduce process. The map process reads multiple txt files in parallel, one file at a time, by reading the file directory, and labels their classes: the key is the file class, e.g. Law, and the value is the file content. The reduce process aggregates the file collection to obtain the sample data set; before data analysis is carried out, the text source files are reconstructed and a new csv-format file is created and loaded into HDFS.
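As a hedged illustration of this reconstruction step (the patent performs it in R on a Hadoop cluster; the helper names and toy paths below are hypothetical), a map stage can tag each file's content with the class taken from its directory, and a reduce stage can aggregate the pairs into a csv-format sample set:

```python
import csv
import io

# Map stage: each (path, text) record becomes a key-value pair whose key is
# the class label taken from the directory name (e.g. "Law") and whose value
# is the file content.
def map_sample(path, text):
    category = path.split("/")[-2]  # e.g. "corpus/Law/0001.txt" -> "Law"
    return (category, text)

# Reduce stage: collect all pairs into one sample set and write it out as
# csv rows of (category, text), mirroring the new csv file loaded into HDFS.
def reduce_to_csv(pairs):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["category", "text"])
    for category, text in sorted(pairs):
        writer.writerow([category, text])
    return buf.getvalue()

files = {"corpus/Law/0001.txt": "contract law text",
         "corpus/Sports/0001.txt": "football match report"}
pairs = [map_sample(p, t) for p, t in files.items()]
csv_out = reduce_to_csv(pairs)
```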
Data analysis can then be performed on each sample text in the sample set. The data analysis of each sample text in the sample set uses the R language to realize Hadoop map/reduce parallelized data processing and model application. During this data analysis, each sample text in the sample set needs to be preprocessed.
Specifically, preprocessing each sample text in the sample set includes at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing the stop words from each sample text in the sample set.
The dictionary may specifically be a custom dictionary containing multiple pre-customized segments. When segmenting each sample text in the sample set, this custom dictionary can be used: for example, the segments matching the segments in the custom dictionary are obtained from the sample text, so that the sample text is decomposed into multiple segments that match segments in the custom dictionary, improving the accuracy of the segmentation of the sample text.
After each sample text in the sample set has been segmented, meaningless words such as modal particles and interjections can be removed by stop-word removal. By segmenting each sample text, keywords that characterize the sample text can be obtained, and the feature information of the sample text can be determined from those keywords. By preprocessing each sample text in the sample set, the classification information and word-segmentation corpus of each sample text are obtained; optionally, the key represents the text-category value, and the value represents the word-segmentation corpus of a single sample text.
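A minimal sketch of dictionary-based segmentation plus stop-word removal: the patent does not name a segmentation algorithm, so forward maximum matching is assumed here for illustration, and the dictionary and stop-word lists are toy examples.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching against a custom dictionary: at each position,
    take the longest dictionary word starting there; fall back to one character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary or n == 1:
                tokens.append(text[i:i + n])
                i += n
                break
    return tokens

def remove_stopwords(tokens, stopwords):
    """Drop meaningless words such as modal particles and interjections."""
    return [t for t in tokens if t not in stopwords]

dictionary = {"文本", "分类", "方法", "样本"}  # toy custom dictionary
stopwords = {"的", "了"}                      # toy stop-word list
tokens = remove_stopwords(segment("样本的文本分类方法", dictionary), stopwords)
# tokens == ["样本", "文本", "分类", "方法"]
```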
Step 202: aggregate the classification information and word-segmentation corpus of each sample text in the sample set, and divide the sample set into a test set and a training set.
In this embodiment, the reduce process mainly aggregates the classification information and word-segmentation corpus of each sample text in the sample set and performs document-term matrix conversion, yielding a key that is the text-category value and a value that is the word-frequency matrix. The training set and test set are then divided randomly, on the premise that the training-set proportion exceeds 60%.
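The aggregation of Step 202 — document-term matrix conversion followed by a random split that keeps the training proportion at about 60% or more — can be sketched as follows (illustrative Python; the patent performs this inside the reduce process in R):

```python
import random
from collections import Counter

def term_matrix(samples):
    """Turn (category, tokens) samples into a document-term frequency matrix:
    one row per document, one column per vocabulary term."""
    vocab = sorted({t for _, tokens in samples for t in tokens})
    rows = []
    for category, tokens in samples:
        counts = Counter(tokens)
        rows.append((category, [counts[t] for t in vocab]))
    return vocab, rows

def split(rows, train_ratio=0.6, seed=42):
    """Randomly divide the sample set, keeping train_ratio in the training set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

samples = [("Law", ["court", "law"]), ("Sports", ["goal", "match"]),
           ("Law", ["law", "contract"]), ("Sports", ["match", "team"]),
           ("Law", ["court", "judge"])]
vocab, rows = term_matrix(samples)
train, test = split(rows)  # 3 training rows, 2 test rows
```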
Step 203: perform text classification on each sample text in the sample set using the test set and the training set.
In this embodiment, text classification is performed with the K-nearest-neighbor classification algorithm, one of the machine-learning classification algorithms. It mainly uses the Euclidean distance between the training-set and test-set data features, together with a chosen number of nearest neighbors, to judge similarity; finally, the classification result is saved to the value and compared for correctness against the true class key, completing the whole classification process.
The K-nearest-neighbor (kNN) classification algorithm is one of the simplest methods in data-mining classification technology. "K nearest neighbors" means the k closest neighbors: each sample can be represented by its k closest neighbors.
The core idea of the kNN algorithm is that if the majority of the k nearest samples of a sample in feature space belong to some category, then the sample also belongs to that category and has the characteristics of the samples in it. In making the classification decision, the method determines the category of the sample to be classified from the categories of only the one or several nearest samples; the kNN method thus depends on only a very small number of neighboring samples. Since kNN determines the category mainly from the limited number of surrounding neighboring samples rather than by discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily. The kNN algorithm can be used not only for classification but also for regression: by finding the k nearest neighbors of a sample and assigning the average of those neighbors' attribute values to the sample, the sample's attribute value can be obtained. A more useful method is to give the neighbors at different distances different weights on their influence over the sample, for example weights inversely proportional to distance.
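A compact sketch of the kNN classification described above, with Euclidean distance and inverse-distance weighting, followed by the correctness comparison against the true class keys (illustrative Python with toy two-term frequency vectors; the patent's implementation runs in R):

```python
import math
from collections import defaultdict

def knn_predict(train, vec, k=3):
    """Classify one frequency vector by its k nearest training vectors
    (Euclidean distance), weighting each neighbor's vote by 1/(distance+eps)
    so that closer neighbors count more."""
    dists = sorted((math.dist(v, vec), cat) for cat, v in train)
    votes = defaultdict(float)
    for d, cat in dists[:k]:
        votes[cat] += 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Toy vectors: [count of law-ish terms, count of sports-ish terms]
train = [("Law", [3, 0]), ("Law", [2, 1]), ("Sports", [0, 3]), ("Sports", [1, 2])]
test = [("Law", [2, 0]), ("Sports", [0, 2])]
preds = [knn_predict(train, v, k=3) for _, v in test]
# Correctness comparison: predicted result vs. true class key.
accuracy = sum(p == cat for p, (cat, _) in zip(preds, test)) / len(test)
# preds == ["Law", "Sports"], accuracy == 1.0
```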
In this embodiment, each sample text in the sample set is preprocessed to obtain its classification information and word-segmentation corpus; the classification information and word-segmentation corpus of each sample text in the sample set are aggregated, and the sample set is divided into a test set and a training set; text classification is performed on each sample text in the sample set using the test set and the training set; classification of big-data text is thereby achieved by combining the R language with Hadoop.
Fig. 3 is a flowchart of the text classification method provided by another embodiment of the present invention. Fig. 4 is a schematic diagram of the MapReduce task chain structure provided by an embodiment of the present invention. For the above technical problems of the prior art, an embodiment of the present invention provides a text classification method comprising the following specific steps:
Step 301: preprocess each sample text in the sample set to obtain the classification information and word-segmentation corpus of each sample text.
The data analysis of each sample text in the sample set uses the R language to realize Hadoop map/reduce parallelized data processing and model application. During this data analysis, each sample text in the sample set needs to be preprocessed.
Specifically, preprocessing each sample text in the sample set includes at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing the stop words from each sample text in the sample set.
The dictionary may specifically be a custom dictionary containing multiple pre-customized segments. When segmenting each sample text in the sample set, this custom dictionary can be used: for example, the segments matching the segments in the custom dictionary are obtained from the sample text, so that the sample text is decomposed into multiple segments that match segments in the custom dictionary, improving the accuracy of the segmentation of the sample text. After each sample text in the sample set has been segmented, meaningless words such as modal particles and interjections can be removed by stop-word removal. By segmenting each sample text, keywords that characterize the sample text can be obtained, and the feature information of the sample text can be determined from those keywords. By preprocessing each sample text in the sample set, the classification information and word-segmentation corpus of each sample text are obtained; optionally, the key represents the text-category value, and the value represents the word-segmentation corpus of a single sample text.
Step 302: aggregate the classification information and word-segmentation corpus of each sample text in the sample set, and perform document-term matrix conversion to obtain the text-category values and the word-frequency matrix.
In this embodiment, the reduce process mainly aggregates the classification information and word-segmentation corpus of each sample text in the sample set and performs document-term matrix conversion, yielding a key that is the text-category value and a value that is the word-frequency matrix.
Step 303: on the premise that the training set accounts for a preset proportion of the sample set, randomly divide the sample set to obtain the test set and the training set.
The training set and test set are divided randomly on the premise that the training-set proportion exceeds 60%.
Step 304: perform text classification on each sample text in the sample set using a machine-learning classification algorithm.
In this embodiment, text classification is performed on each sample text in the sample set using a machine-learning classification algorithm. Optionally, the machine-learning classification algorithm includes the K-nearest-neighbor classification algorithm. For example, text classification is performed with the K-nearest-neighbor classification algorithm, one of the machine-learning classification algorithms, which mainly uses the Euclidean distance between the training-set and test-set data features, together with a chosen number of nearest neighbors, to judge similarity; finally, the classification result is saved to the value and compared for correctness against the true class key, completing the whole classification process.
In this embodiment, MapReduce is completed according to the logic written in the mapper and reducer, and the task is executed; within the whole task chain, it is only necessary to call another mapreduce function inside the original mapreduce function and pass the required parameters. This process is therefore completed by one map process and one reduce process. The whole MapReduce task chain structure is shown in Fig. 4.
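The task-chain idea — invoking one mapreduce job from inside another and passing it the required parameters — can be sketched with a toy in-memory stand-in for job submission (illustrative Python; all names are hypothetical, and rmr's real job submission runs on the Hadoop cluster):

```python
def mapreduce(records, mapper, reducer):
    """Toy stand-in for a job submission: map, group by key, reduce."""
    groups = {}
    for rec in records:
        for k, v in mapper(rec):
            groups.setdefault(k, []).append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}

def tokenize_job(docs):
    # First job: split each (category, text) record into (category, token) pairs.
    return mapreduce(docs,
                     mapper=lambda rec: [(rec[0], t) for t in rec[1].split()],
                     reducer=lambda cat, toks: toks)

def count_job(docs):
    # Chained job: calls the first job from inside, then counts tokens per class —
    # the nested-call pattern of the task chain.
    tokens = tokenize_job(docs)
    return mapreduce(tokens.items(),
                     mapper=lambda kv: [(kv[0], len(kv[1]))],
                     reducer=lambda cat, ns: sum(ns))

docs = [("Law", "court law"), ("Law", "contract"), ("Sports", "goal")]
counts = count_job(docs)
# counts == {"Law": 3, "Sports": 1}
```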
As shown in Fig. 4, the input may specifically be text, which may be the text to be classified; specifically, the text is input by reading it in. The text data are then aggregated, for example by aggregating the classification information and word-segmentation corpus of each sample text in the sample set. After the text data are aggregated, the segments are deduplicated and the deduplicated segments are output. Corpus-matrix aggregation, document-term matrix conversion, and kNN text classification are then carried out to finally obtain and output the text classification result.
This embodiment uses the powerful data-mining algorithm capability of the R language and the distributed storage and data-processing capability of Hadoop to realize big-data text classification with R+Hadoop, and completes model classification storage. Compared with heavyweight code development, it is easier to deploy and port rapidly.
Fig. 5 is a structural diagram of the text classification device provided by an embodiment of the present invention. The text classification device provided by the embodiment of the present invention can execute the processing flow provided by the embodiments of the text classification method. As shown in Fig. 5, the text classification device 50 includes: a preprocessing module 51, an aggregation module 52, a division module 53, and a classification module 54. The preprocessing module 51 is configured to preprocess each sample text in the sample set to obtain the classification information and word-segmentation corpus of each sample text; the aggregation module 52 is configured to aggregate the classification information and word-segmentation corpus of each sample text in the sample set; the division module 53 is configured to divide the sample set into a test set and a training set; the classification module 54 is configured to perform text classification on each sample text in the sample set using the test set and the training set.
The text classification device provided by the embodiment of the present invention can specifically be used to execute the method embodiment provided by Fig. 2 above; its specific functions are not repeated here.
In the embodiment of the present invention, each sample text in the sample set is preprocessed to obtain its classification information and word-segmentation corpus; the classification information and word-segmentation corpus of each sample text in the sample set are aggregated, and the sample set is divided into a test set and a training set; text classification is performed on each sample text in the sample set using the test set and the training set; classification of big-data text is thereby achieved by combining the R language with Hadoop.
On the basis of the above embodiment, the preprocessing module 51 is specifically configured for at least one of the following: loading a dictionary; performing word segmentation on each sample text in the sample set; and removing the stop words from each sample text in the sample set.
Optionally, the aggregation module 52 is specifically configured to aggregate the classification information and word-segmentation corpus of each sample text in the sample set and perform document-term matrix conversion to obtain the text-category values and the word-frequency matrix; the division module 53 is specifically configured to randomly divide the sample set to obtain the test set and the training set, on the premise that the training set accounts for a preset proportion of the sample set.
In addition, when performing text classification on each sample text in the sample set using the test set and the training set, the classification module 54 is specifically configured to perform text classification on each sample text in the sample set using a machine-learning classification algorithm. Optionally, the machine-learning classification algorithm includes the K-nearest-neighbor classification algorithm.
The text classification device provided by the embodiment of the present invention can specifically be used to execute the method embodiment provided by Fig. 3 above; its specific functions are not repeated here.
The embodiment of the present invention uses the powerful data-mining algorithm capability of the R language and the distributed storage and data-processing capability of Hadoop to realize big-data text classification with R+Hadoop, and completes model classification storage. Compared with heavyweight code development, it is easier to deploy and port rapidly.
In summary, in this embodiment of the present invention, each sample text in the sample set is preprocessed to obtain the classification information and segmented corpus of each sample text; the classification information and segmented corpus of each sample text in the sample set are summarized, and the sample set is divided into a test set and a training set; text classification is performed on each sample text in the sample set through the test set and the training set; classification of big-data text is realized by combining the R language with Hadoop. By making use of the powerful data-mining algorithms of the R language and the distributed storage and data-processing capability of Hadoop, big-data text classification is realized with R + Hadoop and the storage of the classification model is completed. Compared with heavyweight code development, the solution is easier to deploy and port rapidly.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and there may be other ways of dividing them in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the functional modules described above is taken as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
- 1. A text classification method, characterized by comprising:
  preprocessing each sample text in a sample set to obtain classification information and a segmented corpus of each sample text;
  summarizing the classification information and segmented corpus of each sample text in the sample set, and dividing the sample set into a test set and a training set;
  performing text classification on each sample text in the sample set through the test set and the training set.
- 2. The method according to claim 1, characterized in that the preprocessing of each sample text in the sample set includes at least one of the following:
  loading a dictionary;
  performing word segmentation on each sample text in the sample set;
  removing stop words from each sample text in the sample set.
- 3. The method according to claim 2, characterized in that summarizing the classification information and segmented corpus of each sample text in the sample set and dividing the sample set into a test set and a training set includes:
  summarizing the classification information and segmented corpus of each sample text in the sample set, and performing document-term matrix conversion to obtain text category values and a word-frequency statistics matrix;
  randomly dividing the sample set to obtain the test set and the training set, on the premise that the training set accounts for a preset proportion of the sample set.
- 4. The method according to claim 3, characterized in that performing text classification on each sample text in the sample set through the test set and the training set includes:
  performing text classification on each sample text in the sample set using a machine-learning classification algorithm.
- 5. The method according to claim 4, characterized in that the machine-learning classification algorithm includes a K-nearest-neighbor classification algorithm.
- 6. A text classification apparatus, characterized by comprising:
  a preprocessing module, configured to preprocess each sample text in a sample set to obtain classification information and a segmented corpus of each sample text;
  a summarizing module, configured to summarize the classification information and segmented corpus of each sample text in the sample set;
  a division module, configured to divide the sample set into a test set and a training set;
  a classification module, configured to perform text classification on each sample text in the sample set through the test set and the training set.
- 7. The text classification apparatus according to claim 6, characterized in that the preprocessing module is specifically configured to perform at least one of the following:
  loading a dictionary;
  performing word segmentation on each sample text in the sample set;
  removing stop words from each sample text in the sample set.
- 8. The text classification apparatus according to claim 7, characterized in that the summarizing module is specifically configured to: summarize the classification information and segmented corpus of each sample text in the sample set, and perform document-term matrix conversion to obtain text category values and a word-frequency statistics matrix; and the division module is specifically configured to: randomly divide the sample set to obtain the test set and the training set, on the premise that the training set accounts for a preset proportion of the sample set.
- 9. The text classification apparatus according to claim 8, characterized in that, when performing text classification on each sample text in the sample set through the test set and the training set, the classification module is specifically configured to: perform text classification on each sample text in the sample set using a machine-learning classification algorithm.
- 10. The text classification apparatus according to claim 9, characterized in that the machine-learning classification algorithm includes a K-nearest-neighbor classification algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711042851.2A CN107679244A (en) | 2017-10-30 | 2017-10-30 | File classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679244A true CN107679244A (en) | 2018-02-09 |
Family
ID=61143057
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711042851.2A Pending CN107679244A (en) | 2017-10-30 | 2017-10-30 | File classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679244A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070294223A1 (en) * | 2006-06-16 | 2007-12-20 | Technion Research And Development Foundation Ltd. | Text Categorization Using External Knowledge |
CN105760471A (en) * | 2016-02-06 | 2016-07-13 | 北京工业大学 | Classification method for two types of texts based on multiconlitron |
CN106202116A (en) * | 2015-05-08 | 2016-12-07 | 北京信息科技大学 | A kind of file classification method based on rough set and KNN and system |
CN106886569A (en) * | 2017-01-13 | 2017-06-23 | 重庆邮电大学 | A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI |
Non-Patent Citations (1)
Title |
---|
Zhao Wenjuan: "Design and Research of a Hadoop-based Web Text Classification System", Journal of Lanzhou University * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681607A (en) * | 2018-05-25 | 2018-10-19 | 马鞍山市润启新材料科技有限公司 | A kind of accounting voucher sorting technique and device |
CN109189883A (en) * | 2018-08-09 | 2019-01-11 | 中国银行股份有限公司 | A kind of intelligent distributing method and device of electronic document |
CN109189883B (en) * | 2018-08-09 | 2022-01-28 | 中国银行股份有限公司 | Intelligent distribution method and device for electronic files |
CN110895562A (en) * | 2018-09-13 | 2020-03-20 | 阿里巴巴集团控股有限公司 | Feedback information processing method and device |
CN111767718A (en) * | 2020-07-03 | 2020-10-13 | 北京邮电大学 | Chinese grammar error correction method based on weakened grammar error feature representation |
CN111767718B (en) * | 2020-07-03 | 2021-12-07 | 北京邮电大学 | Chinese grammar error correction method based on weakened grammar error feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180209 |