CN110377735A

CN110377735A - A corpus text classification method based on KNN technology

Info

Publication number: CN110377735A
Application number: CN201910587778.XA
Authority: CN
Inventors: 肖清林
Original assignee: Xiamen Knight Source Information Technology Co Ltd
Current assignee: Xiamen Knight Source Information Technology Co Ltd
Priority date: 2019-07-02
Filing date: 2019-07-02
Publication date: 2019-10-25

Abstract

A corpus text classification method based on KNN technology includes the following steps: obtaining text data from a corpus and dividing the obtained text data into test text data and text data to be classified; dividing the test text data into target text data and training text data; determining the category to which the target text data belongs; selecting from the training text data, in descending order of distance priority relative to the target text data, successively decreasing groups of M, M-1, M-2, ..., 1 neighboring text data; obtaining the N target text data among the M target text data whose category is identical to the correct category of the target text data, and obtaining the N training text data corresponding to the N target text data; and classifying the text data to be classified by the KNN classification algorithm. With the derived K value, the amount of data processed when the KNN classification algorithm classifies the text data to be classified is minimized, the data processing time is shortest, and the data processing efficiency is highest.

Description

A corpus text classification method based on KNN technology

Technical field

The present invention relates to the technical field of data processing, and in particular to a corpus text classification method based on KNN technology.

Background art

To make quick use of the large amount of text data in a corpus, the text data needs to be classified so that it can be called up conveniently. Corpus text can be classified using the KNN algorithm. The core idea of the KNN algorithm is that if most of the K nearest samples of a sample in the feature space belong to a certain category, then the sample also belongs to that category and has the characteristics of the samples in that category. In making a classification decision, the method determines the category of the sample to be classified only according to the category of the one or several nearest samples, so the decision depends on only a very small number of adjacent samples. Because the KNN method relies mainly on the limited number of surrounding neighboring samples, rather than on discriminating class domains, to determine the category, it is better suited than other methods to sample sets in which the class domains intersect or overlap considerably.
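As an illustrative sketch only, and not part of the original disclosure, the majority-vote idea described above can be written in a few lines of Python. The function name knn_classify, the use of Euclidean distance, the two-dimensional feature vectors and the category names are all assumptions made for illustration; the source does not specify a distance measure or a feature representation.

```python
from collections import Counter
import math


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples nearest to query."""
    # Rank the sample indices by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    # Count the labels of the k nearest samples and return the most common one.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two samples near the origin, three farther away.
samples = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["sports", "sports", "finance", "finance", "finance"]
predicted = knn_classify((0.2, 0.1), samples, labels, k=3)
```

In this toy data, K = 3 assigns the query to the nearby category, whereas K = 5 lets the more numerous distant samples outvote it, which illustrates why the choice of K matters.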

At present, in text classification methods based on KNN technology, the K value is not chosen through rigorous calculation, so the K value selected in the feature space is not the optimal K value. When KNN classification is performed with such a K value, the amount of data processed during text classification is large, the data processing time is long, and the data processing efficiency is low.

Summary of the invention

(1) Object of the invention

To solve the technical problem described in the background art, the present invention proposes a corpus text classification method based on KNN technology. With the derived K value, the amount of data processed when the KNN classification algorithm classifies the text data to be classified is minimized, the data processing time is shortest, and the data processing efficiency is highest.

(2) Technical solution

To solve the above problem, the present invention provides a corpus text classification method based on KNN technology, including the following steps:

S1, obtaining text data from a corpus, and dividing the obtained text data into test text data and text data to be classified;

S2, dividing the test text data into target text data and training text data, the training text data being data neighboring the target text data;

S3, determining the category to which the target text data belongs; and selecting from the training text data, in descending order of distance priority relative to the target text data, successively decreasing groups of M, M-1, M-2, ..., 1 neighboring text data;

S4, deriving the categories of M target text data by processing the M kinds of neighboring text data with the KNN classification algorithm;

S5, comparing the M derived categories of the target text data with the already determined category of the target text data, obtaining the N target text data among the M target text data whose category is identical to the correct category of the target text data, and obtaining the N training text data corresponding to the N target text data;

S6, selecting, from the above N training text data, the training text data at the shortest distance from the target text data, and denoting the number of text data contained in that training text data as K;

S7, classifying the text data to be classified by the KNN classification algorithm to obtain the category to which the text data to be classified belongs.
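Steps S3 to S6 can be sketched in Python as follows. This is one hypothetical reading of the steps rather than the applicant's implementation: it assumes Euclidean distance on numeric feature vectors, assumes that the highest distance priority means the nearest item, and assumes that the "shortest distance" group in S6 is the matching group whose farthest member lies closest to the target. The names majority_label and select_k and the toy data are illustrative.

```python
from collections import Counter
import math


def majority_label(neighbors):
    """Most common label among (vector, label) pairs; ties go to the label seen first."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def select_k(target_vec, target_label, training, m):
    """Return the derived K, or None if no neighbor group yields the known category."""
    # S3: rank training items by distance to the target (nearest first, an assumption)
    # and form nested neighbor groups of sizes M, M-1, ..., 1.
    ranked = sorted(training, key=lambda item: math.dist(target_vec, item[0]))[:m]
    groups = [ranked[:size] for size in range(m, 0, -1)]
    # S4: derive one category for the target from each of the M groups.
    derived = [majority_label(group) for group in groups]
    # S5: keep the N groups whose derived category matches the known category.
    correct = [g for g, label in zip(groups, derived) if label == target_label]
    if not correct:
        return None
    # S6: pick the matching group whose farthest member (the last one, since the
    # groups are sorted nearest first) is closest to the target; K is its size.
    closest = min(correct, key=lambda g: math.dist(target_vec, g[-1][0]))
    return len(closest)


# Hypothetical toy data: (feature vector, category) pairs.
training = [((0.1, 0.0), "finance"), ((0.2, 0.0), "sports"), ((0.3, 0.0), "sports"),
            ((0.9, 0.0), "finance"), ((1.0, 0.0), "finance")]
k = select_k((0.0, 0.0), "sports", training, m=5)
```

In this toy data only the group of the three nearest neighbors yields the known category, so the sketch returns K = 3; S7 would then run the KNN classification algorithm on the text data to be classified.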

Preferably, the method further includes:

S8, presetting a storage module for storing text data of different categories, and storing the text data to be classified in the storage module, sorted by the obtained category to which the data belongs.

Preferably, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification results.

Preferably, in S1, the test text data and the text data to be classified are both text data randomly selected from the text data obtained from the corpus.

Preferably, in S2, the target text data and the training text data are randomly selected; the target text data is a single text datum, and the training text data includes multiple text data.

Preferably, in S4, the groups of M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.

Preferably, in S4, the M obtained categories of the target text data are stored sorted by type.

Preferably, in S6, the N training text data are sorted in order of gradually increasing distance from the target text data.

The above technical solution of the present invention has the following beneficial technical effects. The text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the amount of data processed when the KNN classification algorithm classifies the text data to be classified. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is then processed according to the KNN classification algorithm to derive M categories for the target text data. These M derived categories are compared with the predetermined category of the target text data, and the N categories identical to the category of the target text data are selected from the M. The N training text data corresponding to those N categories are then deduced in reverse, the training text data at the shortest distance from the target text data is selected from the N, and the number of text data it contains is denoted as K. Using this K value, the amount of data processed when the KNN classification algorithm classifies the text data to be classified is minimized, the data processing time is shortest, and the data processing efficiency is highest.

Brief description of the drawings

Fig. 1 is a schematic diagram of the first flow structure of the corpus text classification method based on KNN technology proposed by the present invention.

Fig. 2 is a schematic diagram of the second flow structure of the corpus text classification method based on KNN technology proposed by the present invention.

Fig. 3 is a partial flow diagram of the corpus text classification method based on KNN technology proposed by the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concept of the present invention.

As shown in Figs. 1-3, the corpus text classification method based on KNN technology proposed by the present invention includes steps S1 to S7 as set out in the technical solution above.

In the present invention, the text data obtained from the corpus is divided into test text data and text data to be classified, and the test text data is processed to obtain the K value that minimizes the amount of data processed when the KNN classification algorithm classifies the text data to be classified. To obtain this K value, the test text data is first divided into target text data and training text data. The training text data is then processed according to the KNN classification algorithm to derive M categories for the target text data. These M derived categories are compared with the predetermined category of the target text data, and the N categories identical to the category of the target text data are selected from the M. The N training text data corresponding to those N categories are then deduced in reverse, the training text data at the shortest distance from the target text data is selected from the N, and the number of text data it contains is denoted as K. Using this K value, the amount of data processed when the KNN classification algorithm classifies the text data to be classified is minimized, the data processing time is shortest, and the data processing efficiency is highest.

In an optional embodiment, the method further includes step S8 as set out in the technical solution above.

It should be noted that after the category of the text data to be classified is obtained with the KNN classification algorithm, the text data to be classified is stored by category, so that the data is handled in a more logically organized way.
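A minimal sketch of such category-wise storage is given below. It assumes an in-memory dictionary as a stand-in for the storage module, which the source does not describe further; the function name store_by_category and the sample texts are hypothetical.

```python
from collections import defaultdict


def store_by_category(classified):
    """Group (text, category) pairs into a mapping from category to list of texts."""
    storage = defaultdict(list)
    for text, category in classified:
        # Each text is appended to the bucket of the category assigned to it.
        storage[category].append(text)
    return dict(storage)


# Hypothetical classification results produced by S7.
classified = [("match report", "sports"), ("stock update", "finance"),
              ("league table", "sports")]
storage = store_by_category(classified)
```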

In an optional embodiment, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification results.

It should be noted that the text data obtained from the corpus has multiple attributes. When the text data is classified, the attributes that weigh heavily on the classification of the data are chosen as the main attributes for judging the category to which the text data belongs, and the attributes with little influence on the classification results are deleted. This reduces the amount of data the KNN classification algorithm processes when classifying the text data, which saves classification time and improves the efficiency of classifying the text data obtained from the corpus.
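The source does not say how the influence of an attribute is measured. Purely as an assumption for illustration, the sketch below scores each attribute by the variance of its per-category mean values and keeps only the attributes whose score reaches a threshold; the names attribute_scores and reduce_attributes, the threshold and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def attribute_scores(vectors, labels):
    """Score each attribute by the variance of its per-category means (assumed measure)."""
    by_category = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_category[label].append(vec)
    scores = []
    for j in range(len(vectors[0])):
        category_means = [mean(v[j] for v in group) for group in by_category.values()]
        scores.append(pvariance(category_means))
    return scores


def reduce_attributes(vectors, labels, threshold):
    """Return the kept attribute indices and the vectors restricted to them."""
    scores = attribute_scores(vectors, labels)
    keep = [j for j, score in enumerate(scores) if score >= threshold]
    return keep, [tuple(vec[j] for j in keep) for vec in vectors]


# Hypothetical data: attribute 1 is identical in every text, so it cannot
# help to separate the two categories and is dropped.
vectors = [(5, 1, 0), (4, 1, 1), (0, 1, 5), (1, 1, 4)]
labels = ["sports", "sports", "finance", "finance"]
keep, reduced = reduce_attributes(vectors, labels, threshold=1.0)
```

Dropping the uninformative attribute shortens every vector, so each later distance computation in the KNN classification touches less data.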

In an optional embodiment, in S1, the test text data and the text data to be classified are both text data randomly selected from the text data obtained from the corpus.

It should be noted that the test text data and the text data to be classified are both chosen by random selection. Randomly selected data is more general, which favors classification by the KNN classification algorithm.
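The random division in S1 might look like the following sketch. The proportion of test text data is not given in the source, so the test_fraction parameter, like the function name split_corpus and the optional seed, is an assumption made for illustration.

```python
import random


def split_corpus(texts, test_fraction=0.2, seed=None):
    """Randomly split texts into (test_texts, texts_to_classify)."""
    rng = random.Random(seed)
    shuffled = list(texts)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    # Keep at least one text for testing; the fraction itself is an assumed parameter.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:n_test], shuffled[n_test:]


corpus = [f"document {i}" for i in range(10)]
test_texts, to_classify = split_corpus(corpus, test_fraction=0.3, seed=0)
```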

In an optional embodiment, in S2, the target text data and the training text data are randomly selected; the target text data is a single text datum, and the training text data includes multiple text data.

It should be noted that the target text data and the training text data are randomly selected; the target text data is a single datum, and the training text data includes multiple text data. By analyzing the multiple text data in the training text data and comparing them with the single selected target text datum, the K value to be used for classification with the KNN classification algorithm can be found more quickly.
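A short sketch of the division in S2 follows, assuming that every item of test text data other than the randomly chosen target serves as training text data. That choice, the function name split_test_data and the sample items are hypothetical.

```python
import random


def split_test_data(test_items, seed=None):
    """Randomly pick one item as the target; the remaining items become training data."""
    rng = random.Random(seed)
    index = rng.randrange(len(test_items))
    target = test_items[index]
    training = test_items[:index] + test_items[index + 1:]
    return target, training


# Hypothetical test text data with known categories.
test_items = [("match report", "sports"), ("stock update", "finance"),
              ("league table", "sports"), ("bond yields", "finance")]
target, training = split_test_data(test_items, seed=1)
```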

In an optional embodiment, in S4, the groups of M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.

It should be noted that the groups of M, M-1, M-2, ..., 1 text data constitute M kinds of neighboring text data in total, where the group of M text data contains the group of M-1 text data, the group of M-1 text data contains the group of M-2 text data, and so on, down to the group of 2 text data containing the group of 1 text datum. By using groups of text data of gradually shrinking range, the M kinds of neighboring text data are obtained in turn, so as to derive the M categories of the target text data.
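The nesting described above can be sketched as follows, assuming Euclidean distance on numeric feature vectors and assuming that each group keeps the items nearest to the target. The function name nested_neighbor_groups and the data are illustrative.

```python
import math


def nested_neighbor_groups(target, training, m):
    """Return M nested groups of sizes M, M-1, ..., 1, nearest items first."""
    ranked = sorted(training, key=lambda vec: math.dist(target, vec))[:m]
    # Each smaller group is a prefix of the larger one, so it is contained in it.
    return [ranked[:size] for size in range(m, 0, -1)]


training = [(0.9, 0.0), (0.1, 0.0), (0.3, 0.0), (0.2, 0.0)]
groups = nested_neighbor_groups((0.0, 0.0), training, m=3)
```

Because every group is a prefix of the same distance ranking, the training text data needs to be sorted only once to produce all M groups.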

In an optional embodiment, in S4, the M obtained categories of the target text data are stored sorted by type.

It should be noted that the M categories derived for the target text data in S4 necessarily include target text data of the same category. Storing the categories of same-category and different-category text data sorted by type allows the text data to be handled in a more logically organized way.
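One way to store the M derived categories by type is sketched below. It assumes that each derived category is recorded together with the size of the neighbor group that produced it; that bookkeeping, the function name group_sizes_by_category and the values are hypothetical.

```python
from collections import defaultdict


def group_sizes_by_category(derived):
    """Map each derived category to the neighbor-group sizes that produced it."""
    grouped = defaultdict(list)
    for size, category in derived.items():
        grouped[category].append(size)
    return dict(grouped)


# Hypothetical result of S4 for M = 5: group size -> derived category.
derived = {5: "finance", 4: "finance", 3: "sports", 2: "finance", 1: "finance"}
grouped = group_sizes_by_category(derived)
```

With the results grouped this way, the comparison in S5 reduces to looking up the known category of the target text data.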

In an optional embodiment, in S6, the N training text data are sorted in order of gradually increasing distance from the target text data.

It should be noted that sorting the N training text data in order of gradually increasing distance from the target text data makes it faster to find, among the N training text data, the training text data at the shortest distance from the target text data, which saves data acquisition time and improves data processing efficiency.
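The sorting in S6 can be sketched as follows. The source does not define the distance between a group of training text data and the target text data; this sketch assumes it is the Euclidean distance of the group's farthest member. The function name sort_by_distance and the data are hypothetical.

```python
import math


def sort_by_distance(target, candidates):
    """Sort candidate neighbor groups by the distance of their farthest member."""
    def reach(group):
        # Assumed group-to-target distance: the farthest member of the group.
        return max(math.dist(target, vec) for vec in group)
    return sorted(candidates, key=reach)


# Hypothetical N = 3 groups of training text data kept by S5.
candidates = [
    [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.9, 0.0)],
    [(0.1, 0.0), (0.2, 0.0)],
    [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0)],
]
ordered = sort_by_distance((0.0, 0.0), candidates)
k = len(ordered[0])  # K is the number of items in the closest group
```

After sorting, the group at the shortest distance is simply the first element, so K can be read off without a further search.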

It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention shall fall within the protection scope of the present invention. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.

Claims

1. A corpus text classification method based on KNN technology, characterized by comprising the following steps:

S1, obtaining text data from a corpus, and dividing the obtained text data into test text data and text data to be classified;

S2, dividing the test text data into target text data and training text data, the training text data being data neighboring the target text data;

S3, determining the category to which the target text data belongs; and selecting from the training text data, in descending order of distance priority relative to the target text data, successively decreasing groups of M, M-1, M-2, ..., 1 neighboring text data;

S4, deriving the categories of M target text data by processing the M kinds of neighboring text data with the KNN classification algorithm;

S5, comparing the M derived categories of the target text data with the already determined category of the target text data, obtaining the N target text data among the M target text data whose category is identical to the correct category of the target text data, and obtaining the N training text data corresponding to the N target text data;

S6, selecting, from the above N training text data, the training text data at the shortest distance from the target text data, and denoting the number of text data contained in that training text data as K;

S7, classifying the text data to be classified by the KNN classification algorithm to obtain the category to which the text data to be classified belongs.

2. The corpus text classification method based on KNN technology according to claim 1, characterized by further comprising:

S8, presetting a storage module for storing text data of different categories, and storing the text data to be classified in the storage module, sorted by the obtained category to which the data belongs.

3. The corpus text classification method based on KNN technology according to claim 1 or 2, characterized in that, in S1, the attributes of the text data are reduced by deleting the attributes that have little influence on the classification results.

4. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S1, the test text data and the text data to be classified are both text data randomly selected from the text data obtained from the corpus.

5. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S2, the target text data and the training text data are randomly selected, the target text data is a single text datum, and the training text data includes multiple text data.

6. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the groups of M, M-1, M-2, ..., 1 text data selected from the training text data each correspond to one kind of neighboring text data, giving M kinds of neighboring text data in total.

7. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S4, the M obtained categories of the target text data are stored sorted by type.

8. The corpus text classification method based on KNN technology according to claim 1, characterized in that, in S6, the N training text data are sorted in order of gradually increasing distance from the target text data.