CN110377735A - A corpus text classification method based on KNN technology - Google Patents

A corpus text classification method based on KNN technology

Info

Publication number
CN110377735A
Authority
CN
China
Prior art keywords
text data
data
target
training
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910587778.XA
Other languages
Chinese (zh)
Inventor
肖清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Knight Source Information Technology Co Ltd
Original Assignee
Xiamen Knight Source Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Knight Source Information Technology Co Ltd
Priority to CN201910587778.XA
Publication of CN110377735A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

A corpus text classification method based on KNN technology comprises the following steps: obtaining text data from a corpus and dividing the obtained text data into test text data and text data to be classified; dividing the test text data into target text data and training text data; determining the category to which the target text data belongs; in descending order of distance priority relative to the target text data, successively selecting decreasing sets of M, M-1, M-2, ..., 1 neighboring text data items from the training text data; obtaining the N of the M target text data results whose category is identical to the correct category of the target text data, and obtaining the N training text data sets corresponding to those N results; and classifying the text data to be classified with the KNN classification algorithm. With the derived K value, the KNN classification algorithm processes the least data when classifying the text data to be classified, so the data processing time is shortest and the data processing efficiency is highest.

Description

A corpus text classification method based on KNN technology
Technical field
The present invention relates to the technical field of data processing, and in particular to a corpus text classification method based on KNN technology.
Background
To make quick use of the large amount of text data in a corpus, the text data must be classified so that it can be retrieved easily. Corpus text can be classified with the KNN algorithm. The core idea of KNN is that if most of the K nearest samples to a given sample in feature space belong to some category, then that sample also belongs to that category and shares the characteristics of the samples in it. The classification decision therefore depends only on the categories of the one or several nearest samples, so only a very small number of neighboring samples are involved. Because KNN determines the category mainly from a limited number of surrounding neighbors rather than by discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
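The majority-vote idea described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `knn_predict`, the use of Euclidean distance, and the two-dimensional toy feature vectors are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Rank sample indices by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    # Count the labels of the k nearest samples and take the most frequent one.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors standing in for vectorized corpus texts (hypothetical data).
samples = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["sports", "sports", "finance", "finance", "finance"]
prediction = knn_predict((0.2, 0.1), samples, labels, k=3)
```

With k=3 the two nearby "sports" samples outvote the single "finance" neighbor; with k=5 every sample votes and "finance" wins, which shows how strongly the result depends on the choice of K.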
At present, in text classification methods based on KNN technology, the K value is chosen without rigorous calculation, so the K value selected in feature space is not the optimal one. When KNN classification is performed with such a K value, the amount of data processed during text classification is large, the processing time is long, and the processing efficiency is low.
Summary of the invention
(1) Purpose of the invention
To solve the technical problem described in the background, the present invention proposes a corpus text classification method based on KNN technology. With the derived K value, the KNN classification algorithm processes the least data when classifying the text data to be classified, so the data processing time is shortest and the data processing efficiency is highest.
(2) Technical solution
To solve the above problem, the present invention provides a corpus text classification method based on KNN technology, comprising the following steps:
S1. Obtain text data from the corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the category to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select decreasing sets of M, M-1, M-2, ..., 1 neighboring text data items from the training text data;
S4. Derive the M categories of the target text data obtained by applying the KNN classification algorithm to the M sets of neighboring text data;
S5. Compare the M derived categories of the target text data with the category already determined for the target text data, obtain the N of the M results whose category is identical to the correct category of the target text data, and obtain the N training text data sets corresponding to those N results;
S6. From the N training text data sets, select the one at the shortest distance from the target text data, and denote the number of text data items it contains as K;
S7. Classify the text data to be classified with the KNN classification algorithm to obtain the category to which it belongs.
Preferably, the method further comprises:
S8. Preset a storage module for storing text data of different categories, and store the text data to be classified in the storage module according to the category obtained for it.
Preferably, in S1, the attributes of the text data are reduced by deleting attributes that have little influence on the classification result.
Preferably, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
Preferably, in S2, the target text data and the training text data are selected at random; the target text data is a single text data item, and the training text data comprises multiple text data items.
Preferably, in S4, the M, M-1, M-2, ..., 1 text data items selected from the training text data each correspond to one set of neighboring text data, giving M sets of neighboring text data in total.
Preferably, in S4, the M categories obtained for the target text data are stored by category.
Preferably, in S6, the N training text data sets are sorted in order of increasing distance from the target text data.
The above technical solution of the present invention has the following beneficial technical effects. The text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the amount of data processed when the text data to be classified is classified by the KNN classification algorithm. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is processed by the KNN classification algorithm to obtain M categories for the target text data. These M categories are compared with the predetermined category of the target text data, and the N categories identical to it are selected from the M. The N training text data sets corresponding to these N categories are then derived in reverse. From these N training text data sets, the one at the shortest distance from the target text data is selected, and the number of text data items it contains is denoted K. With this K value, the KNN classification algorithm processes the least data when classifying the text data to be classified, so the data processing time is shortest and the data processing efficiency is highest.
Brief description of the drawings
Fig. 1 is a first flow diagram of the corpus text classification method based on KNN technology proposed by the present invention.
Fig. 2 is a second flow diagram of the corpus text classification method based on KNN technology proposed by the present invention.
Fig. 3 is a partial flow diagram of the corpus text classification method based on KNN technology proposed by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely illustrative and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the invention.
As shown in Figs. 1-3, the corpus text classification method based on KNN technology proposed by the present invention comprises the following steps:
S1. Obtain text data from the corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the category to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select decreasing sets of M, M-1, M-2, ..., 1 neighboring text data items from the training text data;
S4. Derive the M categories of the target text data obtained by applying the KNN classification algorithm to the M sets of neighboring text data;
S5. Compare the M derived categories of the target text data with the category already determined for the target text data, obtain the N of the M results whose category is identical to the correct category of the target text data, and obtain the N training text data sets corresponding to those N results;
S6. From the N training text data sets, select the one at the shortest distance from the target text data, and denote the number of text data items it contains as K;
S7. Classify the text data to be classified with the KNN classification algorithm to obtain the category to which it belongs.
In the present invention, the text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the amount of data processed when the text data to be classified is classified by the KNN classification algorithm. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is processed by the KNN classification algorithm to obtain M categories for the target text data. These M categories are compared with the predetermined category of the target text data, and the N categories identical to it are selected from the M. The N training text data sets corresponding to these N categories are then derived in reverse. From these N training text data sets, the one at the shortest distance from the target text data is selected, and the number of text data items it contains is denoted K. With this K value, the KNN classification algorithm processes the least data when classifying the text data to be classified, so the data processing time is shortest and the data processing efficiency is highest.
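Steps S3 to S6 can be sketched as below under one possible reading of the text, which leaves several details open. The sketch assumes that the nearest neighbors have the highest priority, that the M neighbor sets are the nested sets of the M, M-1, ..., 1 nearest training items, and that the set "at the shortest distance" is the smallest set that still yields the correct category. The name `select_k`, the Euclidean metric, and the tie-breaking rule are illustrative assumptions as well.

```python
from collections import Counter
import math


def select_k(target, target_label, train, train_labels, m):
    """Return the smallest neighbor-set size that classifies target correctly."""
    # Rank training items from nearest to farthest from the target (assumed metric).
    order = sorted(range(len(train)), key=lambda i: math.dist(target, train[i]))
    correct_sizes = []
    for size in range(m, 0, -1):  # S3: sets of M, M-1, ..., 1 neighbors
        votes = Counter(train_labels[i] for i in order[:size])
        # S4: majority vote; on a tie, the label seen first (nearest) wins,
        # because max() returns the first maximal key in insertion order.
        predicted = max(votes, key=votes.get)
        if predicted == target_label:  # S5: keep sets that give the known category
            correct_sizes.append(size)
    if not correct_sizes:
        return None  # no candidate set reproduced the known category
    return min(correct_sizes)  # S6: smallest (closest) correct set gives K


# Hypothetical toy data: one labeled target and five labeled training items.
train = [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0), (2.0, 0.0)]
train_labels = ["b", "a", "a", "a", "b"]
k = select_k((0.0, 0.0), "a", train, train_labels, m=5)  # k is 3 for this data
```

In this toy run the sets of size 5, 4, and 3 vote "a", while the sets of size 2 and 1 vote "b", so K is 3. The resulting K would then be passed to the KNN classifier in S7.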
In an alternative embodiment, the method further comprises:
S8. Preset a storage module for storing text data of different categories, and store the text data to be classified in the storage module according to the category obtained for it.
It should be noted that once the KNN classification algorithm has produced the category of the text data to be classified, that text data is stored by category, so that subsequent processing of the data is more orderly.
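As a rough illustration of S8, classified storage can be modeled as a mapping from category to the texts assigned to it. The in-memory dictionary and the name `store_by_category` are assumptions; the patent does not specify how the storage module is implemented.

```python
from collections import defaultdict


def store_by_category(texts, categories):
    """Group each text under the category predicted for it."""
    storage = defaultdict(list)
    for text, category in zip(texts, categories):
        storage[category].append(text)
    return dict(storage)


# Hypothetical texts and the categories a KNN classifier assigned to them.
texts = ["match report", "quarterly earnings", "transfer rumor"]
categories = ["sports", "finance", "sports"]
storage = store_by_category(texts, categories)
```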
In an alternative embodiment, in S1, the attributes of the text data are reduced by deleting attributes that have little influence on the classification result.
It should be noted that the text data obtained from the corpus has many attributes. When classifying the text data, the attributes that carry a large weight in determining the category are kept as the main attributes for judging the category, and the attributes with little influence on the classification result are deleted. This reduces the amount of data the KNN classification algorithm must process, which saves classification time and improves the efficiency of classifying the text data obtained from the corpus.
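The attribute reduction described above might look like the following sketch. The text does not say how the influence weights are computed, so the per-attribute `weights`, the `min_weight` threshold, and the attribute names are all hypothetical inputs chosen for illustration.

```python
def reduce_attributes(records, weights, min_weight):
    """Keep only attributes whose influence weight reaches min_weight."""
    # Attributes below the threshold are treated as having little influence.
    kept = [name for name, weight in weights.items() if weight >= min_weight]
    reduced = [{name: record[name] for name in kept} for record in records]
    return reduced, kept


# Hypothetical attribute values and influence weights for two texts.
records = [
    {"length": 120, "topic_score": 0.9, "font_size": 12},
    {"length": 450, "topic_score": 0.2, "font_size": 12},
]
weights = {"length": 0.05, "topic_score": 0.8, "font_size": 0.01}
reduced, kept = reduce_attributes(records, weights, min_weight=0.1)
```

Only `topic_score` survives in this example, so every later distance computation works on one attribute instead of three.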
In an alternative embodiment, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
It should be noted that both the test text data and the text data to be classified are selected at random. Randomly selected data is more representative, which benefits classification by the KNN classification algorithm.
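A random division of the corpus along these lines can be sketched as follows. The `test_fraction` parameter, the fixed `seed`, and the name `split_corpus` are assumptions introduced for the example; the text only states that the selection is random.

```python
import random


def split_corpus(texts, test_fraction, seed=0):
    """Randomly split texts into test data and data to be classified."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(texts)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical corpus of ten document identifiers.
corpus = [f"doc{i}" for i in range(10)]
test_texts, texts_to_classify = split_corpus(corpus, test_fraction=0.3)
```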
In an alternative embodiment, in S2, the target text data and the training text data are selected at random; the target text data is a single text data item, and the training text data comprises multiple text data items.
It should be noted that the target text data and the training text data are selected at random, the target text data being a single item and the training text data comprising multiple items. Analyzing the multiple items in the training text data and comparing them with the single selected target item makes it faster to determine the K value used for classification by the KNN classification algorithm.
In an alternative embodiment, in S4, the M, M-1, M-2, ..., 1 text data items selected from the training text data each correspond to one set of neighboring text data, giving M sets of neighboring text data in total.
It should be noted that the M, M-1, M-2, ..., 1 text data items form M sets of neighboring text data in total, where the set of M items contains the set of M-1 items, the set of M-1 items contains the set of M-2 items, and so on, down to the set of 2 items containing the set of 1 item. By using text data sets of progressively smaller range, the M sets of neighboring text data are obtained in turn, and from them the M categories of the target text data.
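The nesting of the M neighbor sets can be illustrated with a short sketch. It assumes that each set consists of the nearest items to the target under Euclidean distance, which is one reading of "distance priority"; the name `nested_neighbor_sets` and the toy points are hypothetical.

```python
import math


def nested_neighbor_sets(target, train, m):
    """Return neighbor sets of size m, m-1, ..., 1, each containing the next."""
    # Rank training items from nearest to farthest from the target.
    ranked = sorted(train, key=lambda point: math.dist(target, point))
    # Each smaller set is a prefix of the larger one, so the sets are nested.
    return [ranked[:size] for size in range(m, 0, -1)]


# Hypothetical training items in a two-dimensional feature space.
train = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
sets = nested_neighbor_sets((0.0, 0.0), train, m=3)
```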
In an alternative embodiment, in S4, the M categories obtained for the target text data are stored by category.
It should be noted that the M categories obtained for the target text data in S4 necessarily include identical categories. Storing the identical and the different categories separately, by category, allows the text data to be processed in a more orderly way.
In an alternative embodiment, in S6, the N training text data sets are sorted in order of increasing distance from the target text data.
It should be noted that sorting the N training text data sets in order of increasing distance from the target text data makes it faster to find the training text data set at the shortest distance from the target text data, which saves data retrieval time and improves data processing efficiency.
It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the claims, or within equivalents of such scope and boundaries.

Claims (8)

1. A corpus text classification method based on KNN technology, characterized by comprising the following steps:
S1. Obtain text data from the corpus, and divide the obtained text data into test text data and text data to be classified;
S2. Divide the test text data into target text data and training text data, the training text data being the neighboring data of the target text data;
S3. Determine the category to which the target text data belongs; in descending order of distance priority relative to the target text data, successively select decreasing sets of M, M-1, M-2, ..., 1 neighboring text data items from the training text data;
S4. Derive the M categories of the target text data obtained by applying the KNN classification algorithm to the M sets of neighboring text data;
S5. Compare the M derived categories of the target text data with the category already determined for the target text data, obtain the N of the M results whose category is identical to the correct category of the target text data, and obtain the N training text data sets corresponding to those N results;
S6. From the N training text data sets, select the one at the shortest distance from the target text data, and denote the number of text data items it contains as K;
S7. Classify the text data to be classified with the KNN classification algorithm to obtain the category to which it belongs.
2. The corpus text classification method based on KNN technology according to claim 1, characterized by further comprising:
S8. Preset a storage module for storing text data of different categories, and store the text data to be classified in the storage module according to the category obtained for it.
3. The corpus text classification method based on KNN technology according to claim 1 or 2, characterized in that, in S1, the attributes of the text data are reduced by deleting attributes that have little influence on the classification result.
4. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S1, the test text data and the text data to be classified are both selected at random from the text data obtained from the corpus.
5. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S2, the target text data and the training text data are selected at random, the target text data is a single text data item, and the training text data comprises multiple text data items.
6. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the M, M-1, M-2, ..., 1 text data items selected from the training text data each correspond to one set of neighboring text data, giving M sets of neighboring text data in total.
7. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the M categories obtained for the target text data are stored by category.
8. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S6, the N training text data sets are sorted in order of increasing distance from the target text data.
CN201910587778.XA 2019-07-02 2019-07-02 A kind of corpus file classification method based on KNN technology Withdrawn CN110377735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910587778.XA CN110377735A (en) 2019-07-02 2019-07-02 A kind of corpus file classification method based on KNN technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910587778.XA CN110377735A (en) 2019-07-02 2019-07-02 A kind of corpus file classification method based on KNN technology

Publications (1)

Publication Number Publication Date
CN110377735A true CN110377735A (en) 2019-10-25

Family

ID=68251510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910587778.XA Withdrawn CN110377735A (en) 2019-07-02 2019-07-02 A kind of corpus file classification method based on KNN technology

Country Status (1)

Country Link
CN (1) CN110377735A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130230255A1 (en) * 2012-03-02 2013-09-05 Microsoft Corporation Image Searching By Approximate k-NN Graph
CN103345528A (en) * 2013-07-24 2013-10-09 南京邮电大学 Text classification method based on correlation analysis and KNN
CN108167133A (en) * 2017-12-20 2018-06-15 北京金风慧能技术有限公司 Training method, device and the wind power generating set of air speed error model
CN109816033A (en) * 2019-01-31 2019-05-28 清华四川能源互联网研究院 A method of the supervised learning based on optimization carries out area user identification zone

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHICHAO ZHANG et al.: "Efficient kNN Classification With Different Numbers of Nearest Neighbors", 《IEEE》 *
许燕青: "基于属性值贡献度的k最近邻分类算法", 《宁德师范学院学报(自然科学版)》 *

Similar Documents

Publication Publication Date Title
US6931090B2 (en) Method of establishing a nuclear reactor core fuel assembly loading pattern
US9292550B2 (en) Feature generation and model selection for generalized linear models
CN102141978A (en) Method and system for classifying texts
CN110147321A (en) A kind of recognition methods of the defect high risk module based on software network
CN106295502A (en) A kind of method for detecting human face and device
CN107122382A (en) A kind of patent classification method based on specification
CN105871879A (en) Automatic network element abnormal behavior detection method and device
CN103164709A (en) Method for optimizing support vector machine based on tabu search algorithm
CN112860917A (en) Method, device and equipment for processing data of goods to be warehoused and storage medium
CN109448366A (en) A kind of space domain sector degree of crowding prediction technique based on random forest
CN111143685A (en) Recommendation system construction method and device
CN116109139A (en) Wind control strategy generation method, decision method, server and storage medium
CN114078189A (en) Lattice model additive manufacturing self-adaptive filling method based on machine learning method
CN110377496B (en) Test case priority determining method based on intelligent water drops in software regression testing process
CN102929587A (en) Data processing system and data processing method
CN103490974A (en) Junk mail detection method and device
CN115148299A (en) XGboost-based ore deposit type identification method and system
CN111767216A (en) Cross-version depth defect prediction method capable of relieving class overlap problem
CN110377735A (en) A kind of corpus file classification method based on KNN technology
CN107193940A (en) Big data method for optimization analysis
CN106503144A (en) Design patent retrieval analysis system and its analysis method
CN104077361B (en) A kind of sort method and system for big data
CN105787004A (en) Text classification method and device
Lin et al. A new density-based scheme for clustering based on genetic algorithm
CN113987240B (en) Customs inspection sample tracing method and system based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20191025)