CN110377735A - A corpus text classification method based on KNN technology - Google Patents
- Publication number
- CN110377735A (application CN201910587778.XA)
- Authority
- CN
- China
- Prior art keywords
- text data
- data
- target
- training
- target text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
A corpus text classification method based on KNN technology comprises the following steps: obtaining text data from a corpus and dividing the obtained text data into test text data and text data to be classified; dividing the test text data into target text data and training text data; determining the class to which the target text data belongs; selecting from the training text data, in descending order of distance priority relative to the target text data, successively decreasing sets of M, M-1, M-2, ..., 1 neighboring text data; obtaining, among the M target text data, the N target text data whose class is the same as the correct class of the target text data, and thereby the N training text data corresponding to these N target text data; and classifying the text data to be classified by the KNN classification algorithm. With the derived K value, classifying the text data to be classified by the KNN classification algorithm involves the smallest data volume, the shortest data-processing time and the highest data-processing efficiency.
Description
Technical field
The present invention relates to the technical field of data processing, and more particularly to a corpus text classification method based on KNN technology.
Background art
To make the large volume of text data in a corpus quick and convenient to use, the text data needs to be classified so that it can be called up easily. The KNN algorithm can be used to classify corpus text. Its core idea is that if most of the K nearest samples of a given sample in the feature space belong to a certain class, then the sample also belongs to that class and shares the characteristics of the samples in that class. In making a classification decision, the method determines the class of the sample to be classified only from the classes of the one or several nearest samples, so the decision involves only a very small number of neighboring samples. Because the KNN method relies mainly on the limited number of surrounding neighboring samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
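The majority-vote idea described above can be illustrated with a minimal sketch. The Euclidean distance, the two-dimensional feature vectors and the class names below are assumptions made for illustration only; the patent does not specify a distance measure or a feature representation.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, samples, labels, k):
    """Return the majority class among the k samples nearest to query."""
    # Indices of all samples, ordered from nearest to farthest.
    order = sorted(range(len(samples)), key=lambda i: euclidean(query, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors and classes, for illustration only.
samples = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["sports", "sports", "finance", "finance", "finance"]
prediction = knn_predict((0.2, 0.1), samples, labels, k=3)
```

The result depends on K: in this toy data the query is assigned to one class with K = 3 and to the other with K = 5, which is why the choice of K matters.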
At present, in text classification methods based on KNN technology, the K value is chosen without rigorous calculation, so the K value selected in the feature space is not the optimal K value. When KNN classification is performed with such a K value, the amount of data processed during text classification is large, the processing time is long, and the data-processing efficiency is low.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background art, the present invention proposes a corpus text classification method based on KNN technology. With the derived K value, classifying the text data to be classified by the KNN classification algorithm involves the smallest data volume, the shortest data-processing time and the highest data-processing efficiency.
(2) Technical solution
To solve the above problems, the present invention provides a corpus text classification method based on KNN technology, comprising the following steps:
S1. Obtain text data from a corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the class to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select from the training text data decreasing sets of M, M-1, M-2, ..., 1 neighboring text data;
S4. Derive the classes of the M target text data obtained by applying the KNN classification algorithm to the M kinds of neighboring text data;
S5. Compare the M derived target text data classes with the already determined class of the target text data, obtain the N target text data among the M whose class is the same as the correct class of the target text data, and thereby obtain the N training text data corresponding to these N target text data;
S6. Select, among the N training text data, the training text data at the shortest distance from the target text data, and denote the amount of text data it contains as K;
S7. Classify the text data to be classified by the KNN classification algorithm to obtain the class to which the text data to be classified belongs.
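Steps S3 to S6 can be sketched as follows. This is one possible reading, not the applicant's implementation: it assumes Euclidean distance, takes the neighbors nearest first, breaks voting ties in favor of the nearest neighbor, and interprets "the training text data at the shortest distance" in S6 as the smallest of the nested neighbor sets that still yields the correct class. The function names `select_k` and `vote` are hypothetical.

```python
from collections import Counter
import math

def distance(a, b):
    """Euclidean distance (assumed; the patent does not name a metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vote(labels_nearest_first):
    """Majority class; ties go to the class that appears nearest to the target."""
    counts = Counter(labels_nearest_first)
    best = max(counts.values())
    return next(label for label in labels_nearest_first if counts[label] == best)

def select_k(target, target_label, train, train_labels, m):
    """Try k = M, M-1, ..., 1 on a target of known class and return the chosen K."""
    # S3: order the training items by distance to the target, nearest first.
    order = sorted(range(len(train)), key=lambda i: distance(target, train[i]))
    correct_ks = []
    for k in range(m, 0, -1):
        # S4: classify the target from its k nearest training items.
        predicted = vote([train_labels[i] for i in order[:k]])
        # S5: keep only the k values that reproduce the known class.
        if predicted == target_label:
            correct_ks.append(k)
    if not correct_ks:
        return None  # no candidate set classified the target correctly
    # S6 (assumed reading): the nested sets shrink toward the target,
    # so the closest correct set is the smallest one.
    return min(correct_ks)

# Hypothetical one-dimensional data, for illustration only.
train = [(0.0,), (1.0,), (1.1,), (1.2,), (5.0,)]
train_labels = ["b", "a", "a", "a", "b"]
k_value = select_k((0.4,), "a", train, train_labels, m=5)
```

In this toy data K = 5, 4 and 3 classify the target correctly while K = 2 and 1 do not, so the sketch returns 3. The returned value would then be used as K in step S7.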
Preferably, the method further comprises:
S8. Preset a storage module for storing text data of different classes, and store the text data to be classified in the storage module according to the obtained class.
Preferably, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification result.
Preferably, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
Preferably, in S2, the target text data and the training text data are selected at random; the target text data is a single text datum, and the training text data comprises multiple text data.
Preferably, in S4, the M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.
Preferably, in S4, the M obtained target text data classes are stored by class.
Preferably, in S6, the N training text data are sorted in order of increasing distance from the target text data.
The above technical solution of the present invention has the following beneficial technical effects: the text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the data-processing load when the text data to be classified is classified by the KNN classification algorithm. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is then processed with the KNN classification algorithm to obtain M target text data classes, which are compared with the predetermined class of the target text data. The N classes identical to the class of the target text data are screened out of the M, and the N training text data corresponding to these N classes are traced back. Among the N training text data, the one at the shortest distance from the target text data is selected, and the amount of text data it contains is denoted as K. With this K value, classifying the text data to be classified by the KNN classification algorithm involves the smallest data volume, the shortest data-processing time and the highest data-processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of a first flow structure of the corpus text classification method based on KNN technology proposed by the present invention.
Fig. 2 is a schematic diagram of a second flow structure of the corpus text classification method based on KNN technology proposed by the present invention.
Fig. 3 is a partial flow diagram of the corpus text classification method based on KNN technology proposed by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the present invention.
As shown in Figs. 1-3, the corpus text classification method based on KNN technology proposed by the present invention comprises the following steps:
S1. Obtain text data from a corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the class to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select from the training text data decreasing sets of M, M-1, M-2, ..., 1 neighboring text data;
S4. Derive the classes of the M target text data obtained by applying the KNN classification algorithm to the M kinds of neighboring text data;
S5. Compare the M derived target text data classes with the already determined class of the target text data, obtain the N target text data among the M whose class is the same as the correct class of the target text data, and thereby obtain the N training text data corresponding to these N target text data;
S6. Select, among the N training text data, the training text data at the shortest distance from the target text data, and denote the amount of text data it contains as K;
S7. Classify the text data to be classified by the KNN classification algorithm to obtain the class to which the text data to be classified belongs.
In the present invention, the text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the data-processing load when the text data to be classified is classified by the KNN classification algorithm. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is then processed with the KNN classification algorithm to obtain M target text data classes, which are compared with the predetermined class of the target text data. The N classes identical to the class of the target text data are screened out of the M, and the N training text data corresponding to these N classes are traced back. Among the N training text data, the one at the shortest distance from the target text data is selected, and the amount of text data it contains is denoted as K. With this K value, classifying the text data to be classified by the KNN classification algorithm involves the smallest data volume, the shortest data-processing time and the highest data-processing efficiency.
In an optional embodiment, the method further comprises:
S8. Preset a storage module for storing text data of different classes, and store the text data to be classified in the storage module according to the obtained class.
It should be noted that after the class of the text data to be classified is obtained with the KNN classification algorithm, the text data to be classified is stored by class, so that the data is handled in a more organized way.
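As an illustration of S8, the storage module can be pictured as a mapping from class to the text data assigned to that class. The in-memory dictionary and the name `store_by_class` are assumptions; the patent does not describe how the storage module is implemented.

```python
from collections import defaultdict

def store_by_class(items, predicted_classes):
    """Group classified items by their predicted class (stand-in for the storage module)."""
    storage = defaultdict(list)
    for item, cls in zip(items, predicted_classes):
        storage[cls].append(item)
    return dict(storage)

# Hypothetical document identifiers and classes, for illustration only.
stored = store_by_class(["doc1", "doc2", "doc3"], ["sports", "finance", "sports"])
```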
In an optional embodiment, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification result.
It should be noted that the text data obtained from the corpus has many attributes. When the text data is classified, the attributes with a large influence weight on the classification are chosen as the main attributes for judging the class of the text data, and the attributes with little influence on the classification result are deleted. This reduces the amount of data the KNN classification algorithm processes when classifying the text data, which in turn saves classification time and improves the efficiency of classifying the text data obtained from the corpus.
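The attribute reduction can be sketched as a simple filter over per-attribute influence weights. How those weights are computed is not stated in the patent, so the sketch takes them as given; the threshold `min_weight` and the function name `reduce_attributes` are hypothetical.

```python
def reduce_attributes(vectors, weights, min_weight):
    """Keep only the attributes whose influence weight is at least min_weight."""
    # Positions of the attributes that are retained.
    keep = [i for i, w in enumerate(weights) if w >= min_weight]
    # Project every feature vector onto the retained positions.
    reduced = [tuple(v[i] for i in keep) for v in vectors]
    return reduced, keep

# Hypothetical vectors and weights, for illustration only.
reduced, kept = reduce_attributes([(1, 2, 3), (4, 5, 6)], [0.5, 0.01, 0.3], min_weight=0.1)
```

Shorter vectors mean fewer terms in every distance computation, which is where the saving in KNN processing comes from.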
In an optional embodiment, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
It should be noted that the test text data and the text data to be classified are both chosen by random selection; randomly selected data is more representative, which benefits classification by the KNN classification algorithm.
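A random division of the obtained text data might look like the sketch below. The split fraction and the optional seed are assumptions added so that the example is reproducible; the patent only states that the selection is random.

```python
import random

def random_split(items, test_fraction, seed=None):
    """Randomly divide items into test data and data to be classified."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    # The first part serves as test text data, the rest as text data to be classified.
    return shuffled[:cut], shuffled[cut:]

test_data, to_classify = random_split(list(range(10)), test_fraction=0.3, seed=0)
```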
In an optional embodiment, in S2, the target text data and the training text data are selected at random; the target text data is a single text datum, and the training text data comprises multiple text data.
It should be noted that the target text data and the training text data are selected at random, the target text data being a single datum and the training text data comprising multiple text data. By analyzing the multiple text data in the training text data and comparing them with the single selected target text datum, the K value to be used for KNN classification is obtained more quickly.
In an optional embodiment, in S4, the M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.
It should be noted that the M, M-1, M-2, ..., 1 text data constitute M kinds of neighboring text data in total, where the M text data include the M-1 text data, the M-1 text data include the M-2 text data, and so on, down to the 2 text data including the 1 text datum. By using text data of gradually shrinking range, the M kinds of neighboring text data are obtained in turn, from which the classes of the M target text data are obtained.
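The nesting of the M kinds of neighboring text data can be shown with a short sketch. It assumes the neighbors have already been ordered nearest first; the function name `nested_neighbor_sets` is hypothetical.

```python
def nested_neighbor_sets(neighbors_nearest_first, m):
    """Return the M nested neighbor sets of sizes M, M-1, ..., 1."""
    # Each set is a prefix of the previous one, so it is contained in it.
    return [neighbors_nearest_first[:k] for k in range(m, 0, -1)]

# Hypothetical neighbor identifiers, for illustration only.
sets = nested_neighbor_sets(["n1", "n2", "n3", "n4"], m=3)
```

Each of these sets is then passed to the KNN vote once, producing the M candidate classes for the target text datum.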
In an optional embodiment, in S4, the M obtained target text data classes are stored by class.
It should be noted that the M target text data classes obtained in S4 necessarily include target text data of the same class; storing text data of the same class and of different classes separately by class allows the text data to be processed in a more organized way.
In an optional embodiment, in S6, the N training text data are sorted in order of increasing distance from the target text data.
It should be noted that sorting the N training text data in order of increasing distance from the target text data makes it faster to find, among them, the training text data at the shortest distance from the target text data, which saves data-acquisition time and improves data-processing efficiency.
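The sorting in S6 can be sketched as follows. The patent does not define the distance between a set of training text data and the target, so this sketch assumes it is the distance of the set's farthest member, measured with Euclidean distance; the names `set_distance` and `sort_candidates` are hypothetical.

```python
import math

def point_distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def set_distance(target, neighbor_set):
    """Distance of a set from the target, taken as that of its farthest member (assumption)."""
    return max(point_distance(target, p) for p in neighbor_set)

def sort_candidates(target, candidate_sets):
    """Sort candidate sets by increasing distance from the target."""
    return sorted(candidate_sets, key=lambda s: set_distance(target, s))

# Hypothetical candidate sets, for illustration only.
candidates = [[(1, 0), (3, 0)], [(1, 0)], [(1, 0), (3, 0), (0, 5)]]
ordered = sort_candidates((0, 0), candidates)
k_value = len(ordered[0])  # size of the closest set is taken as K
```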
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement or the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.
Claims (8)
1. A corpus text classification method based on KNN technology, characterized by comprising the following steps:
S1. Obtain text data from a corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the class to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select from the training text data decreasing sets of M, M-1, M-2, ..., 1 neighboring text data;
S4. Derive the classes of the M target text data obtained by applying the KNN classification algorithm to the M kinds of neighboring text data;
S5. Compare the M derived target text data classes with the already determined class of the target text data, obtain the N target text data among the M whose class is the same as the correct class of the target text data, and thereby obtain the N training text data corresponding to these N target text data;
S6. Select, among the N training text data, the training text data at the shortest distance from the target text data, and denote the amount of text data it contains as K;
S7. Classify the text data to be classified by the KNN classification algorithm to obtain the class to which the text data to be classified belongs.
2. The corpus text classification method based on KNN technology according to claim 1, characterized by further comprising:
S8. Preset a storage module for storing text data of different classes, and store the text data to be classified in the storage module according to the obtained class.
3. The corpus text classification method based on KNN technology according to claim 1 or 2, characterized in that, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification result.
4. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
5. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S2, the target text data and the training text data are selected at random, the target text data is a single text datum, and the training text data comprises multiple text data.
6. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.
7. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the M obtained target text data classes are stored by class.
8. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S6, the N training text data are sorted in order of increasing distance from the target text data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910587778.XA CN110377735A (en) | 2019-07-02 | 2019-07-02 | A kind of corpus file classification method based on KNN technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910587778.XA CN110377735A (en) | 2019-07-02 | 2019-07-02 | A kind of corpus file classification method based on KNN technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110377735A true CN110377735A (en) | 2019-10-25 |
Family
ID=68251510
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910587778.XA Withdrawn CN110377735A (en) | 2019-07-02 | 2019-07-02 | A kind of corpus file classification method based on KNN technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377735A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130230255A1 (en) * | 2012-03-02 | 2013-09-05 | Microsoft Corporation | Image Searching By Approximate k-NN Graph |
CN103345528A (en) * | 2013-07-24 | 2013-10-09 | 南京邮电大学 | Text classification method based on correlation analysis and KNN |
CN108167133A (en) * | 2017-12-20 | 2018-06-15 | 北京金风慧能技术有限公司 | Training method, device and the wind power generating set of air speed error model |
CN109816033A (en) * | 2019-01-31 | 2019-05-28 | 清华四川能源互联网研究院 | A method of the supervised learning based on optimization carries out area user identification zone |
Non-Patent Citations (2)
Title |
---|
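SHICHAO ZHANG et al.: "Efficient kNN Classification With Different Numbers of Nearest Neighbors", IEEE * |
XU YANQING: "k-Nearest-Neighbor Classification Algorithm Based on Attribute Value Contribution", Journal of Ningde Normal University (Natural Science Edition) * |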
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6931090B2 (en) | Method of establishing a nuclear reactor core fuel assembly loading pattern | |
US9292550B2 (en) | Feature generation and model selection for generalized linear models | |
CN102141978A (en) | Method and system for classifying texts | |
CN110147321A (en) | A kind of recognition methods of the defect high risk module based on software network | |
CN106295502A (en) | A kind of method for detecting human face and device | |
CN107122382A (en) | A kind of patent classification method based on specification | |
CN105871879A (en) | Automatic network element abnormal behavior detection method and device | |
CN103164709A (en) | Method for optimizing support vector machine based on tabu search algorithm | |
CN112860917A (en) | Method, device and equipment for processing data of goods to be warehoused and storage medium | |
CN109448366A (en) | A kind of space domain sector degree of crowding prediction technique based on random forest | |
CN111143685A (en) | Recommendation system construction method and device | |
CN116109139A (en) | Wind control strategy generation method, decision method, server and storage medium | |
CN114078189A (en) | Lattice model additive manufacturing self-adaptive filling method based on machine learning method | |
CN110377496B (en) | Test case priority determining method based on intelligent water drops in software regression testing process | |
CN102929587A (en) | Data processing system and data processing method | |
CN103490974A (en) | Junk mail detection method and device | |
CN115148299A (en) | XGboost-based ore deposit type identification method and system | |
CN111767216A (en) | Cross-version depth defect prediction method capable of relieving class overlap problem | |
CN110377735A (en) | A kind of corpus file classification method based on KNN technology | |
CN107193940A (en) | Big data method for optimization analysis | |
CN106503144A (en) | Design patent retrieval analysis system and its analysis method | |
CN104077361B (en) | A kind of sort method and system for big data | |
CN105787004A (en) | Text classification method and device | |
Lin et al. | A new density-based scheme for clustering based on genetic algorithm | |
CN113987240B (en) | Customs inspection sample tracing method and system based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20191025 |