CN110413791A - Text classification method based on a CNN-SVM-KNN combined model - Google Patents

Text classification method based on a CNN-SVM-KNN combined model

Info

Publication number
CN110413791A
CN110413791A (application CN201910718426.3A)
Authority
CN
China
Prior art keywords
cnn
text
svm
model
knn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910718426.3A
Other languages
Chinese (zh)
Inventor
郑文斌
凤雷
刘冰
付平
孙媛媛
石金龙
叶俊涛
王天城
魏明晨
徐明珠
吴瑞东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201910718426.3A
Publication of CN110413791A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text classification method based on a CNN-SVM-KNN combined model; the present invention relates to text classification methods based on combined models. The purpose of the present invention is to solve the low text classification accuracy of existing methods. The detailed process is: 1: text preprocessing; 2: perform feature extraction on the text preprocessed in step 1 to obtain the text after feature extraction; 3: establish a CNN model based on step 2; 4: establish a CNN-SVM model; 5: establish a CNN-KNN model; 6: set a distinguishing threshold d; 7: calculate the distance: calculate the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier; 8: compare the distances: when tmp > d, select the CNN-SVM classifier; otherwise, select the CNN-KNN classifier; 9: repeat steps 6 to 9 to find the d value that optimizes the evaluation indices. The present invention is used in the field of text classification.

Description

Text classification method based on a CNN-SVM-KNN combined model
Technical field
The present invention relates to text classification methods based on combined models, and is used in the field of text classification.
Background technique
With the rapid development of network technology, information on the Internet emerges endlessly, and relying solely on manual classification of the massive information on the Internet is unrealistic. Manual classification consumes a great deal of time and resources, and because different people classify differently, it is also difficult to reach consistent classification results. Since the 1990s, automatic classification techniques based on statistics and machine learning have therefore been a persistent research focus and the main applied technology. However, as text resources continue to expand, these techniques find it increasingly difficult to meet practical needs, which poses a severe test for text classification technology.
Text classification determines the category of unknown documents according to predefined subject categories, so that texts can be processed objectively and classification accuracy can be improved. As a tool for processing information, it provides efficient sorting and accurate query results in the field of information retrieval. Its efficiency shows in that the general category of a text can be determined at first glance, reducing the number of articles that need matching; its validity shows in that identical classification schemes lead to agreement on the standard of a text, thereby ensuring that the evaluation indices are unified. Text classification is a kind of supervised learning: it requires a training set in which each text is marked with its category, learns a classification model from the training data, and then predicts the categories of unknown texts. Text classification tasks can be divided into single-label and multi-label tasks: single-label means each text belongs to exactly one category, while multi-label means a text may belong to one or more categories simultaneously. This work concerns only the single-label text classification task.
At present, text classification has been applied in many fields, such as intelligence analysis, sentiment analysis, topic classification and spam detection. These application fields all have broad development prospects, so the exploration of text classification is particularly important, and finding more efficient and superior text classification models is a top priority. Deep learning has made great progress in fields such as image recognition, image classification, data compression, object detection and tracking, information retrieval and speech recognition; because of its strong adaptability, many researchers have also turned their attention to natural language processing. For example, Google has achieved good results in machine translation using the long short-term memory network (LSTM) deep learning model. These learning models have been shown by scholars to be more effective than traditional machine learning methods, so in the natural language field, solving text classification with deep learning models has attracted increasing attention and has become a new challenge for text classification technology. Such learning models can better mine the complex semantic relationships contained in texts and fit closely with specific tasks. Meanwhile, with the large-scale expansion of the Internet, the development of multimedia, the availability of large-scale data and the significant improvement of hardware performance, high-performance GPU and CPU clusters provide powerful computing capability and a broad platform for these learning models.
In summary, given the wide application of text classification technology in various fields and the flourishing of deep learning technology, studying text classification based on deep learning models in order to mine deeper features, obtain accurate text categories and improve classification accuracy in various fields has important practical value and significance.
Summary of the invention
The purpose of the present invention is to solve the low text classification accuracy of existing methods, and to propose a text classification method based on a CNN-SVM-KNN combined model.
The detailed process of the text classification method based on the CNN-SVM-KNN combined model is as follows:
Step 1: text preprocessing;
Step 2: perform feature extraction on the text preprocessed in step 1 to obtain the text after feature extraction;
Step 3: establish a CNN model based on step 2;
Step 4: establish a CNN-SVM model;
Step 5: establish a CNN-KNN model;
Step 6: set a distinguishing threshold d;
Step 7: calculate the distance:
calculate the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier;
Step 8: compare the distances:
when tmp > d, select the CNN-SVM classifier; otherwise, select the CNN-KNN classifier;
Step 9: repeat steps 6 to 9 to find the d value that optimizes the evaluation indices.
The beneficial effects of the invention are as follows:
The present invention preprocesses the text, performs feature extraction on the preprocessed text to obtain the text after feature extraction, establishes a CNN model, a CNN-SVM model and a CNN-KNN model, manually sets a distinguishing threshold d, calculates the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier, and compares d and tmp: when tmp > d, the CNN-SVM classifier is selected; otherwise, the CNN-KNN classifier is selected. By cycling over settings of the d parameter and repeating the above steps, the d value that optimizes the evaluation indices is found. The evaluation indices are precision (P), recall (R), F1 value and accuracy (ACC); the F1 value combines the two indices precision P and recall R, and the larger this value, the better the classification effect, with a maximum value of 1. The invention effectively improves the accuracy of text classification, obtains more accurate article categories, and guarantees the correctness of article classification.
As shown in Figure 11, in terms of accuracy and the other three indices, the results of the combined models CNN-SVM, CNN-KNN and CNN-SVM-KNN are all better than those of the single CNN model; among them the CNN-SVM-KNN model classifies best, with an improvement of 0.133%.
The Chinese dataset of this experiment was classified with the combined models. The experimental results show that the classification accuracy of the CNN-SVM-KNN model is 0.133% higher than that of the single CNN model; the classification accuracy of the CNN-KNN model is 0.117% higher than that of the single CNN model; the classification accuracy of the CNN-SVM model is 0.100% higher than that of the single CNN model; the classification accuracy of the CNN-NBayes model is not as good as the single CNN model, but applying Naive Bayes and KNN within combined models still yields a great improvement.
Detailed description of the invention
Fig. 1 is the flowchart of the present invention;
Fig. 2 is the CNN network structure; Input is the input and Output is the output;
Fig. 3a is the full-connection diagram;
Fig. 3b is the local-connection diagram;
Fig. 4 is a schematic diagram of the convolution operation; a, b, c, d, e, f, g, h, i represent the values of the input-layer word vectors, j, k, l, m, n, o represent the values of the convolution kernel, and Kernel represents the convolution kernel;
Fig. 5 is a schematic diagram of the activation functions; sigmoid is the sigmoid activation function, tanh is the tanh activation function, and relu is the ReLu activation function;
Fig. 6a is the multi-kernel local-connection calculation diagram without weight sharing;
Fig. 6b is the multi-kernel local-connection calculation diagram with weight sharing;
Fig. 7 is a schematic diagram of max pooling and average pooling;
Fig. 8 is the Chinese text classification flow based on the combined model;
Fig. 9 is the English text classification flow based on the combined model;
Fig. 10 is the SVM-KNN schematic diagram; tmp indicates the distance from the sample point to be classified to the optimal classification surface H of the SVM, d indicates the distinguishing threshold, H1 and H2 are classification surfaces, S is the sports-category region, E is the entertainment-category region, and M is the M region;
Figure 11 is the relationship between the Chinese text classification results and accuracy based on the combined models;
Figure 12 is the relationship between the English text classification results and accuracy based on CNN and the combined models.
Specific embodiment
Specific embodiment 1: this embodiment is described with reference to Fig. 1. In this embodiment, the detailed process of the text classification method based on the CNN-SVM-KNN combined model is as follows:
The general flow of text classification can be divided into the following processes: text preprocessing, feature selection, training and testing, and index evaluation. A classifier model is first established with the training set, the model is then used to classify the test set, and finally the predicted class labels are compared with the true labels to judge the quality of the classifier by the indices.
Step 1: text preprocessing;
Step 2: perform feature extraction on the text preprocessed in step 1 to obtain the text after feature extraction;
Step 3: establish a CNN model based on step 2;
Step 4: establish a CNN-SVM model;
Step 5: establish a CNN-KNN model;
Step 6: manually set a distinguishing threshold d;
Step 7: calculate the distance:
calculate the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier;
Step 8: compare the distances:
when tmp > d, select the CNN-SVM classifier; otherwise, select the CNN-KNN classifier;
The SVM-KNN classification algorithm
This algorithm combines the advantages of the SVM and KNN algorithms; its principle is shown in Figure 10:
Here tmp indicates the distance from the sample point to be classified to the optimal classification surface H of the SVM, and d indicates the distinguishing threshold; the optimal classification surface H separates the sports category from the entertainment category well.
When the text to be classified lies in region S or E, that is, when tmp > d, SVM classifies it well; but when the text to be classified lies in region M, the SVM algorithm cannot easily distinguish the categories, so the KNN algorithm is used instead. The distances between the sample to be classified and the support vectors of the sports-category and entertainment-category samples are calculated to obtain the K sample points nearest to the sample to be classified; among these K points, the numbers of sports-category and entertainment-category samples are counted and compared, and the category of the sample to be classified is determined accordingly.
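As an illustration of this decision rule, the following Python sketch (not part of the original disclosure) applies it to already-extracted CNN feature vectors; the toy features, labels and threshold d are assumed for display, and the absolute SVM decision value is used as a proxy for the distance tmp:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    def svm_knn_predict(x, svm, knn, d):
        """Classify one CNN feature vector x with the SVM-KNN rule."""
        x = x.reshape(1, -1)
        # tmp: |decision value| as a proxy for the distance to the optimal
        # classification surface H (for a linear kernel, divide by ||w||).
        tmp = abs(svm.decision_function(x)[0])
        if tmp > d:                # regions S and E: far from H, trust the SVM
            return svm.predict(x)[0]
        return knn.predict(x)[0]   # region M: ambiguous, fall back to KNN

    # Toy training data: 2-D stand-ins for CNN features of two classes.
    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    y = np.array([0, 0, 1, 1])     # 0 = sports, 1 = entertainment
    svm = SVC(kernel="linear").fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(svm_knn_predict(np.array([0.45, 0.5]), svm, knn, d=0.4))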
Step 9: repeat steps 6 to 9 to find the d value that optimizes the evaluation indices.
Specific embodiment 2: this embodiment differs from specific embodiment 1 in the text preprocessing in step 1. The detailed process is as follows:
Text information usually consists of words and sentences, which a computer cannot recognize directly; the text therefore needs to be preprocessed, removing useless information and converting it into a form the computer can distinguish. Since the preprocessing of Chinese and English differs, they are handled separately.
The words of an English text are connected by spaces, so its word segmentation can be completed simply by splitting on spaces, as in Fig. 9.
The English text preprocessing process is as follows:
(1) convert uppercase letters to lowercase;
(2) remove stop words, i.e. words with no practical meaning such as a, an, the;
(3) lemmatization: reduce all English words with the same root to a unified form;
(4) segmentation: replace all punctuation marks with spaces, completing the word segmentation and punctuation removal.
The Chinese text preprocessing process is as follows:
Segmentation: the segmentation tool jieba is chosen to complete the word segmentation.
Chinese segmentation is slightly harder than English segmentation and is generally implemented with a corresponding library, such as the HanLP toolkit and IKAnalyzer toolkit developed in Java, the NLPIR segmentation system of the Chinese Academy of Sciences, or the jieba library developed in Python. The Chinese segmentation tool of this experiment is jieba; the remaining preprocessing operations are the same as for English. See Fig. 8.
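A minimal sketch of both preprocessing pipelines is given below; it assumes NLTK's stop-word list and WordNet lemmatizer as stand-ins for resources the patent does not specify, and the sample sentences are illustrative:

    import re
    import jieba                              # Chinese word segmentation
    from nltk.corpus import stopwords         # needs nltk.download("stopwords")
    from nltk.stem import WordNetLemmatizer   # needs nltk.download("wordnet")

    def preprocess_english(text):
        text = text.lower()                               # (1) lowercase
        text = re.sub(r"[^\w\s]", " ", text)              # (4) punctuation -> spaces
        lemmatizer = WordNetLemmatizer()
        stops = set(stopwords.words("english"))
        return [lemmatizer.lemmatize(w)                   # (3) lemmatize
                for w in text.split() if w not in stops]  # (2) drop stop words

    def preprocess_chinese(text):
        text = re.sub(r"[^\w\s]", " ", text)              # same punctuation removal
        return [w for w in jieba.lcut(text) if w.strip()] # jieba segmentation

    print(preprocess_english("The CNN models are trained quickly."))
    print(preprocess_chinese("中国篮协举行新闻发布会"))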
Other steps and parameters are the same as in specific embodiment 1.
Specific embodiment 3: this embodiment differs from specific embodiments 1 and 2 in that in step 2 feature extraction is performed on the text preprocessed in step 1 to obtain the text after feature extraction. The process is as follows:
Feature selection selects b features (b < B) out of B features and rejects the other B - b features.
The new features are therefore a subset of the original features, and the discarded features are regarded as unimportant, unable to represent the theme of the article. After preprocessing, the feature matrix is usually very large and of very high dimensionality, causing problems such as excessive computation, overly long training time and low classification precision. Feature selection keeps the features that highlight the theme of the article and rejects the unimportant noise, thereby reducing the dimensionality and alleviating these problems.
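The patent does not name a selection criterion; purely as an illustration, the following sketch keeps the b highest-scoring of B raw term features using a chi-squared test on a stand-in corpus:

    from sklearn.datasets import fetch_20newsgroups            # stand-in corpus
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    data = fetch_20newsgroups(subset="train",
                              categories=["sci.space", "rec.sport.hockey"])
    X = CountVectorizer().fit_transform(data.data)  # B raw term features
    b = 1000                                        # keep b < B features
    X_sel = SelectKBest(chi2, k=b).fit_transform(X, data.target)
    print(X.shape, "->", X_sel.shape)               # (n_docs, B) -> (n_docs, b)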
Other steps and parameters are the same as in specific embodiment 1 or 2.
Specific embodiment 4: this embodiment differs from one of specific embodiments 1 to 3 in that in step 3 the CNN model is established based on step 2. The detailed process is as follows:
For the chosen Chinese and English datasets, the optimal parameters are selected through repeated experiments. The main parameters involved are the dimension of the word vectors, the maximum sequence length, the way the dataset is divided, and the structure of the model.
A CNN model usually consists of an input layer, convolutional layers, pooling layers, fully connected layers and an output layer, as shown in Figure 2. The design of the convolutional layers is the key factor deciding the success or failure of the CNN model. In a traditional deep neural network, adjacent layers are fully connected, so the number of weight parameters computed between every two layers is extremely large, which slows the overall operation of the network; moreover, every parameter is updated accordingly, so slight fluctuations lead to different outputs and a good training effect cannot be obtained. For this reason, the CNN model replaces part of the fully connected layers with convolutional and pooling layers, and solves the above problems through its characteristic weight sharing and local receptive fields.
The text after feature extraction in step 2 enters from the input layer; features are extracted by the first convolutional layer, the first pooling layer, the second convolutional layer and the second pooling layer stacked on each other; the features are integrated by the fully connected layer; and the result is finally output at the output layer.
As can be seen from Figure 2, the data enters from the input layer (Input), features are extracted through the stacked convolutional layers (C1, C2) and pooling layers (S1, S2), the features are integrated by the fully connected layer, and the result is finally output at the output layer (Output).
A. The specific construction process of the input layer of the CNN model is as follows:
When the CNN model is applied in the image domain, the elements of the image's pixel matrix have an inherent density, so the image can be input into the CNN model directly. Text, however, is highly high-dimensional and sparse, so using the preprocessed text directly as the input of the CNN model would be inappropriate. After continuous research, a language model suitable for neural networks was proposed, called the word vector method: through a mapping in space, each word is converted into a low-dimensional, dense word vector. In this way each word can be represented by a one-dimensional vector, and each sentence can then be converted into several one-dimensional vectors. The trained word vectors are used as input and embedded into the embedding layer of the CNN model. The specific construction process is as follows:
(1) First, map the text after feature extraction in step 2 to its codes, forming sentence vectors;
Assume the dimension of the word vectors is Dim and the maximum sequence length is M. Since a unified sequence length is required when inputting to the CNN model, the maximum sequence length of each sentence is M. Suppose the dictionary dic is encoded as dic = {'Chinese Basketball Association': 1, 'hold': 2, 'news': 3, 'press conference': 4, ..., 'basketball': M}. Taking the sentence 'the Chinese Basketball Association holds a press conference' as an example, after preprocessing it becomes 'Chinese Basketball Association / hold / news / press conference'. Matching the words appearing in this sentence against dic converts it into a sentence vector, and the final vector of the sentence is [1, 2, 3, 4, ..., 0].
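A minimal sketch of this encoding, with an assumed small dictionary and M chosen small for display:

    def encode(tokens, dic, M):
        """Map tokens to dictionary codes and pad with 0 up to length M."""
        ids = [dic.get(t, 0) for t in tokens]   # unknown words -> 0
        return (ids + [0] * M)[:M]              # pad/truncate to length M

    dic = {"Chinese Basketball Association": 1, "hold": 2,
           "news": 3, "press conference": 4}
    tokens = ["Chinese Basketball Association", "hold", "news", "press conference"]
    print(encode(tokens, dic, M=8))             # [1, 2, 3, 4, 0, 0, 0, 0]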
(2) The words in the sentence are converted into a word vector matrix through training; the process is:
the words in the sentence are converted into an M × Dim matrix;
where each row of the matrix represents one word, Dim represents the dimension of the word vectors, and M represents the maximum sequence length of each sentence; the code of each word is mapped to its corresponding word vector, constituting the word vector matrix corresponding to each sentence.
For example, if M is set to 800 and Dim is set to 64, the codes [1, 2, 3, 4, ..., 0] in (1) are mapped to the corresponding word vector matrix, as shown below:
1 --> [1.1687614, -0.060387988, ..., 0.1394098, 4.461792]
2 --> [-2.207607, 1.5443151, ..., 1.2688257, 6.298082]
3 --> [-2.5072408, 0.25763997, ..., 1.9246347, 1.8763151]
4 --> [-2.8602438, -0.573817, ..., -1.4370164, 1.0429095]
0 --> [-0.0574601, -0.06492017, ..., -0.00432132, -0.0805641]
...
0 --> [-0.0574601, -0.06492017, ..., -0.00432132, -0.0805641]
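The mapping itself is a row lookup in an embedding table; the sketch below uses a randomly initialized table as a stand-in for the trained word vectors:

    import numpy as np

    vocab_size, Dim, M = 10000, 64, 800
    rng = np.random.default_rng(0)
    embedding = rng.normal(size=(vocab_size, Dim))  # row i = word vector of code i

    sentence_ids = np.array([1, 2, 3, 4] + [0] * (M - 4))
    matrix = embedding[sentence_ids]                # the M x Dim sentence matrix
    print(matrix.shape)                             # (800, 64)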
B. The specific construction process of the convolutional layers of the CNN model is as follows:
The convolutional layer is the most important layer in the CNN model; it extracts features through local receptive fields and weight sharing, and it determines the quality of the extracted features.
(1) Local receptive fields
The convolutional layer restricts the connections between hidden units and input units; that is, each hidden unit can connect only to a part of the input units. The input region connected to a hidden unit is called the receptive field of that neuron. The details of full connection and local connection are shown in Figs. 3a and 3b.
As can be seen from Figs. 3a and 3b, the left figure is the full-connection diagram, in which a hidden unit is connected to all neurons of the input layer, so its computational efficiency is very low. The right figure is the local-connection diagram, in which a hidden unit is connected only to some of the neurons of the input layer; when the convolutional layer uses local connection, the number of computed parameters is effectively reduced and training efficiency improves. The calculation process is shown in Figure 4.
When the CNN model is applied to text classification, the width of the convolution kernel should equal the dimension of the input-layer word vectors; it can be seen in Fig. 4 that the width of the convolution kernel and the width of the input layer are consistent. If the moving stride is set to 1 and the length of the convolution kernel is set to f, the convolution computation slides from top to bottom and finally yields M - (f - 1) neurons. The calculation formula of the convolution is shown in formula (1):
x_r^h = g(W_r^h * x^(h-1) + b), r = 1, ..., F (1)
where r is the index of the convolution kernel, F is the total number of convolution kernels, x_r^h denotes the output of the h-th layer of the convolutional layer, W_r^h denotes the weights connecting the h-th layer with the (h-1)-th layer, x^(h-1) denotes the feature vector of the (h-1)-th layer, b is the bias term of the convolutional layer, and g is a nonlinear activation function. Common activation functions are sigmoid, tanh and ReLu; they effectively accelerate convergence and, to a certain extent, alleviate gradient explosion and gradient vanishing, thereby improving the performance of the CNN model and its training efficiency. The curves of the activation functions are shown in Figure 5.
In Fig. 5, the solid line represents the sigmoid activation function, which maps any input to an output between 0 and 1 but is prone to gradient vanishing; the dashed line represents the tanh activation function, whose curve is symmetric about the origin, which simplifies the computation, but the gradient vanishing problem remains; the dotted line represents the ReLu activation function, which solves the gradient vanishing problem in the positive interval and is very fast to compute, but easily falls into the dead-neuron problem.
The number of parameters par_conv participating in the calculation in the convolutional layer is shown in formula (2):
par_conv = (in_conv × f + 1) × out_conv (2)
where par_conv denotes the number of parameters participating in the calculation in the convolutional layer, in_conv denotes the number of input features of the convolutional layer, out_conv denotes the number of output features of the convolutional layer, and 1 denotes the bias of the convolutional layer.
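A numpy sketch of the sliding convolution of formula (1) over a sentence matrix, with assumed toy sizes; the kernel width equals the word-vector dimension Dim, so each of the F kernels yields M - (f - 1) outputs, and the parameter count follows formula (2):

    import numpy as np

    M, Dim, f, F = 8, 4, 3, 2                    # toy sizes for display
    rng = np.random.default_rng(0)
    X = rng.normal(size=(M, Dim))                # sentence matrix from the input layer
    W = rng.normal(size=(F, f, Dim))             # F kernels of length f, width Dim
    b = np.zeros(F)
    relu = lambda z: np.maximum(z, 0.0)          # g: nonlinear activation

    out = np.array([[relu(np.sum(W[r] * X[i:i + f]) + b[r])
                     for i in range(M - f + 1)]  # slide from top to bottom
                    for r in range(F)])
    print(out.shape)                             # (F, M - (f - 1)) = (2, 6)

    par_conv = (Dim * f + 1) * F                 # formula (2) with in_conv = Dim
    print(par_conv)                              # 26 parameters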
(2) Weight sharing
Local connection reduces the computation of parameters, but the parameter amount is still huge and the effect is not ideal. On this basis, weight sharing is therefore introduced to accelerate the learning of the network and improve the training rate. The details are shown in Figs. 6a and 6b:
Fig. 6a shows the multi-kernel local-connection calculation without weight sharing: each color represents one convolution kernel, and each feature is convolved with a different kernel. Fig. 6b shows that, on the basis of local connection, each feature is connected to the neurons of the next layer through the same convolution kernel; in this way the amount of computation is reduced as much as possible, and the training speed and accuracy are improved.
Thanks to these two advantages, the CNN model has achieved good results in images and other fields. The more convolutional layers there are, the better the feature extraction and the stronger the learning ability, but this easily causes the model to overfit, resulting in poor generalization ability and unsatisfactory prediction in practice. In actual use, therefore, parameters such as the number of convolutional layers, the length of the convolution kernels and the sliding stride all require repeated experiments and continuous attempts to reach the optimal result.
C. The specific construction process of the pooling layers of the CNN model is as follows:
Pooling, also called the down-sampling layer, mainly reduces the amount of data; it further extracts the features obtained by the convolutional layer using methods such as max pooling, average pooling, global max pooling or global average pooling. Max pooling and average pooling are shown in Figure 7:
The sliding kernel in pooling can be called the sliding window. Suppose the sliding window is 2 × 2 with a stride of 2. Max pooling selects the maximum value as the final output every time the window slides into a new region; in Fig. 7, the final outputs are 4 for the upper-left region, 7 for the upper-right region, 8 for the lower-left region and 5 for the lower-right region. Average pooling instead outputs the final average value, adding every four values and dividing by 4.
Global pooling differs from the above local pooling in that the size of its sliding window equals the size of the entire feature map; that is, a one-dimensional output is finally obtained. The computation of the maximum or average value is otherwise the same as in Fig. 7. Since the operation of local pooling is relatively complicated, global pooling is used in most practical applications nowadays; the pooling method used in this experiment is global max pooling.
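A numpy sketch of 2 × 2 max and average pooling with stride 2, plus the global max pooling used in this experiment; the 4 × 4 input is chosen to reproduce the outputs 4, 7, 8, 5 described for Fig. 7:

    import numpy as np

    X = np.array([[1., 4., 7., 2.],
                  [3., 2., 5., 6.],
                  [8., 1., 0., 5.],
                  [6., 2., 3., 1.]])             # illustrative feature map

    def pool(X, size=2, stride=2, op=np.max):
        rows = range(0, X.shape[0] - size + 1, stride)
        cols = range(0, X.shape[1] - size + 1, stride)
        return np.array([[op(X[i:i + size, j:j + size]) for j in cols]
                         for i in rows])

    print(pool(X, op=np.max))    # max pooling: [[4. 7.] [8. 5.]]
    print(pool(X, op=np.mean))   # average pooling
    print(X.max())               # global max pooling: window = whole map -> 8.0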
D. The specific construction process of the fully connected layer of the CNN model is as follows:
The connection mode of the fully connected layer is the full connection, stated above, between each neuron and all neurons of the previous layer; its main function is to map the distributed feature representation into the sample label space, integrating the features from the convolutional and pooling layers. The fully connected layer selects ReLu as the activation function.
When the output values of the fully connected layer are delivered to an output, the fully connected layer generally serves as the last layer of the CNN model, also called the softmax layer, whose main function is to judge the final category through normalized probability outputs. If the input is a, a_i denotes the i-th element of a, a_j denotes the j-th element of a, n denotes the total number of elements of a, and c ranges over all output elements, then the softmax value of an element is:
S_i = e^(a_i) / Σ_(j=1)^n e^(a_j) (3)
To measure the closeness of the actual output and the predicted output, the cross-entropy loss function is used, as shown in formula (4):
E = -Σ_i y_i · log R_i (4)
where E denotes the cross-entropy loss function, y_i denotes the label, and R_i denotes the input of the cross-entropy loss function.
When the input of the cross-entropy loss function is R_i = S_i, the cross-entropy function is the softmax loss. The label is then y_i = [0, 0, 1, 0, ..., 0]; since most values in this vector are 0, the corresponding products with log R_i are 0, so formula (4) is determined by the position where the label's effective value is 1, and formula (4) simplifies to formula (5):
E = -log R_i (5)
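A numpy sketch of formulas (3) to (5); the fully connected outputs a and the one-hot label are illustrative:

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())          # shift for numerical stability
        return e / e.sum()               # formula (3): S_i = e^(a_i) / sum_j e^(a_j)

    a = np.array([2.0, 1.0, 0.1])        # illustrative fully connected outputs
    y = np.array([1.0, 0.0, 0.0])        # one-hot label
    S = softmax(a)
    E = -np.sum(y * np.log(S))           # formula (4) with R_i = S_i
    print(S, E, -np.log(S[0]))           # formula (5): E = -log R_i at the true class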
Other steps and parameters are the same as in one of specific embodiments 1 to 3.
Specific embodiment 5: this embodiment differs from one of specific embodiments 1 to 4 in that in step 4 the CNN-SVM model is established. The detailed process is as follows:
The CNN model is used as the feature extractor, and the SVM, KNN and SVM-KNN algorithms are selected as the classifier models.
The CNN-SVM model is established, and its relevant optimal parameters are determined by grid search with 5-fold cross-validation; the relevant optimal parameters of the CNN-SVM model are the penalty factor and the kernel function parameter.
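A sketch of this tuning step with scikit-learn, assuming the CNN-extracted features and labels are already available as arrays; the synthetic data and the grid values for the penalty factor C and kernel parameter gamma are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Stand-in for CNN-extracted features and their labels.
    X, y = make_classification(n_samples=200, n_features=64, random_state=0)

    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10, 100],        # penalty factor
                    "gamma": [1e-3, 1e-2, 1e-1]},  # kernel function parameter
        cv=5)                                      # 5-fold cross-validation
    grid.fit(X, y)
    print(grid.best_params_)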
Other steps and parameters are the same as in one of specific embodiments 1 to 4.
Specific embodiment 6: this embodiment differs from one of specific embodiments 1 to 5 in that in step 5 the CNN-KNN model is established. The detailed process is as follows:
The CNN-KNN model is established; the initial value of K is 1, K is incremented by 1 each time, and the loop iterates to 50; the K value with the best result is selected as the final value of K, where K is the parameter of the KNN model.
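A sketch of the K search, again on stand-in CNN features; the hold-out validation split is an assumption, since the patent does not specify how the best K is scored:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=64, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    best_k, best_acc = 1, 0.0
    for k in range(1, 51):                   # K starts at 1, +1 each time, up to 50
        acc = (KNeighborsClassifier(n_neighbors=k)
               .fit(X_tr, y_tr).score(X_val, y_val))
        if acc > best_acc:
            best_k, best_acc = k, acc
    print(best_k, best_acc)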
Other steps and parameters are the same as in one of specific embodiments 1 to 5.
Specific embodiment 7: this embodiment differs from one of specific embodiments 1 to 6 in the evaluation indices in step 9, which are as follows:
Four common indices describe the experimental results of text classification: precision (P), recall (R), F1 value and accuracy (ACC). Table 1 briefly describes these indices.
Table 1: text classification evaluation indices
In Table 1, TP denotes the number of positive-class samples correctly classified by the classifier; TN denotes the number of negative-class samples correctly classified by the classifier; FN denotes the number of positive-class samples mistakenly labeled as negative; FP denotes the number of negative-class samples mistakenly labeled as positive.
Accuracy ACC, precision P and recall R are expressed as formulas (6), (7) and (8):
ACC = (TP + TN) / (TP + TN + FP + FN) (6)
P = TP / (TP + FP) (7)
R = TP / (TP + FN) (8)
Precision P denotes the proportion of correctly classified positive-class texts among all texts predicted to be positive;
recall R denotes the proportion of correctly classified positive-class texts among all actually positive samples;
accuracy ACC denotes the proportion of correctly classified texts among all texts, indicating the overall classification precision on the dataset;
F1 combines the two indices precision P and recall R, producing a new assessment mode: F1 = 2 × P × R / (P + R);
the F1 value combines the effects of precision P and recall R; the larger the F1 value, the better the classification effect, with a maximum value of 1.
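A small worked sketch of formulas (6) to (8) and the F1 value, from assumed confusion-matrix counts:

    TP, TN, FP, FN = 90, 80, 10, 20              # illustrative counts

    ACC = (TP + TN) / (TP + TN + FP + FN)        # formula (6): 0.85
    P = TP / (TP + FP)                           # formula (7): 0.90
    R = TP / (TP + FN)                           # formula (8): 0.818...
    F1 = 2 * P * R / (P + R)                     # combines P and R: 0.857...
    print(ACC, P, R, F1)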
Other steps and parameters are the same as in one of specific embodiments 1 to 6.
The beneficial effects of the present invention are verified by the following embodiment:
Embodiment one:
This embodiment is specifically prepared according to the following steps:
As shown in Figure 11, in terms of accuracy and the other three indices, the results of the combined models CNN-SVM, CNN-KNN and CNN-SVM-KNN are all better than those of the single CNN model; among them the CNN-SVM-KNN model classifies best, with an improvement of 0.133%.
The Chinese dataset of this experiment was classified with the combined models. The experimental results show that the classification accuracy of the CNN-SVM-KNN model is 0.133% higher than that of the single CNN model; the classification accuracy of the CNN-KNN model is 0.117% higher than that of the single CNN model; the classification accuracy of the CNN-SVM model is 0.100% higher than that of the single CNN model; the classification accuracy of the CNN-NBayes model is not as good as the single CNN model, but applying Naive Bayes and KNN within combined models still yields a great improvement.
As shown in Figure 12, in terms of accuracy and the other three indices, the results of the combined models CNN-SVM, CNN-KNN and CNN-SVM-KNN are likewise better than those of the single CNN model; among them the CNN-SVM-KNN model classifies best, with an improvement of 0.475%. The experimental results further confirm the accuracy and applicability of the combined models of the present invention.
The English dataset of this experiment was classified with the combined models. The experimental results show that the classification accuracy of the CNN-SVM-KNN model is 0.475% higher than that of the single CNN model; the classification accuracy of the CNN-KNN model is 0.025% higher than that of the single CNN model; the classification accuracy of the CNN-SVM model is 0.450% higher than that of the single CNN model; the classification accuracy of the CNN-NBayes model is not as good as the single CNN model, but applying Naive Bayes and KNN within combined models likewise yields a great improvement.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications in accordance with the present invention, and these corresponding changes and modifications shall all fall within the protection scope of the appended claims of the present invention.

Claims (7)

1. A text classification method based on a CNN-SVM-KNN combined model, characterized in that the detailed process of the method is as follows:
Step 1: text preprocessing;
Step 2: perform feature extraction on the text preprocessed in step 1 to obtain the text after feature extraction;
Step 3: establish a CNN model based on step 2;
Step 4: establish a CNN-SVM model;
Step 5: establish a CNN-KNN model;
Step 6: set a distinguishing threshold d;
Step 7: calculate the distance:
calculate the distance tmp from the sample point to be classified to the optimal classification surface of the CNN-SVM classifier;
Step 8: compare the distances:
when tmp > d, select the CNN-SVM classifier; otherwise, select the CNN-KNN classifier;
Step 9: repeat steps 6 to 9 to find the d value that optimizes the evaluation indices.
2. The text classification method based on the CNN-SVM-KNN combined model according to claim 1, characterized in that the detailed process of the text preprocessing in step 1 is as follows:
the English text preprocessing process is:
(1) convert uppercase letters to lowercase;
(2) remove stop words;
(3) lemmatization;
(4) segmentation: replace all punctuation marks with spaces, completing the word segmentation and punctuation removal;
the Chinese text preprocessing process is:
segmentation: the segmentation tool jieba is chosen to complete the word segmentation.
3. The text classification method based on the CNN-SVM-KNN combined model according to claim 1 or 2, characterized in that in step 2 feature extraction is performed on the text preprocessed in step 1 to obtain the text after feature extraction; the detailed process is:
feature selection selects b features out of B features, b < B, and rejects the other B - b features.
4. The text classification method based on the CNN-SVM-KNN combined model according to claim 3, characterized in that in step 3 the CNN model is established based on step 2; the detailed process is as follows:
the CNN model consists of an input layer, convolutional layers, pooling layers, a fully connected layer and an output layer;
the text after feature extraction in step 2 enters from the input layer; features are extracted by the first convolutional layer, the first pooling layer, the second convolutional layer and the second pooling layer stacked on each other; the features are integrated by the fully connected layer; and the result is finally output at the output layer;
A. the specific construction process of the input layer of the CNN model is as follows:
(1) first, map the text after feature extraction in step 2 to its codes, forming sentence vectors;
(2) convert the words in the sentence into a word vector matrix through training; the process is:
the words in the sentence are converted into an M × Dim matrix;
where each row of the matrix represents one word, Dim represents the dimension of the word vectors, and M represents the maximum sequence length of each sentence; the code of each word is mapped to its corresponding word vector, constituting the word vector matrix corresponding to each sentence;
B. the specific construction process of the convolutional layers of the CNN model is as follows:
(1) local receptive fields
if the moving stride is set to 1 and the length of the convolution kernel is set to f, the convolution computation slides from top to bottom and finally yields M - (f - 1) neurons; the calculation formula of the convolution is shown in formula (1):
x_r^h = g(W_r^h * x^(h-1) + b), r = 1, ..., F (1)
where r is the index of the convolution kernel, F is the total number of convolution kernels, x_r^h denotes the output of the h-th layer of the convolutional layer, W_r^h denotes the weights connecting the h-th layer with the (h-1)-th layer, x^(h-1) denotes the feature vector of the (h-1)-th layer, b is the bias term of the convolutional layer, and g is a nonlinear activation function;
the number of parameters par_conv participating in the calculation in the convolutional layer is shown in formula (2):
par_conv = (in_conv × f + 1) × out_conv (2)
where par_conv denotes the number of parameters participating in the calculation in the convolutional layer, in_conv denotes the number of input features of the convolutional layer, out_conv denotes the number of output features of the convolutional layer, and 1 denotes the bias of the convolutional layer;
(2) weight sharing
on the basis of local connection, each feature is connected to the neurons of the next layer through the same convolution kernel;
C. the specific construction process of the pooling layers of the CNN model is as follows:
pooling, also called down-sampling, further extracts the features obtained by the convolutional layer using max pooling, average pooling, global max pooling or global average pooling;
D. the specific construction process of the fully connected layer of the CNN model is as follows:
the fully connected layer integrates the features from the convolutional and pooling layers; the fully connected layer selects ReLu as the activation function;
the fully connected layer serves as the last layer of the CNN model, also called the softmax layer; if the input is a, a_i denotes the i-th element of a, a_j denotes the j-th element of a, n denotes the total number of elements of a, and c ranges over all output elements, then the softmax value of an element is:
S_i = e^(a_i) / Σ_(j=1)^n e^(a_j) (3)
to measure the closeness of the actual output and the predicted output, the cross-entropy loss function is used, as shown in formula (4):
E = -Σ_i y_i · log R_i (4)
where E denotes the cross-entropy loss function, y_i denotes the label, and R_i denotes the input of the cross-entropy loss function;
when the input of the cross-entropy loss function is R_i = S_i, the cross-entropy function is the softmax loss; the label is then y_i = [0, 0, 1, 0, ..., 0], and when the effective value of the label is 1, formula (4) simplifies to formula (5):
E = -log R_i (5)
5. The text classification method based on the CNN-SVM-KNN combined model according to claim 4, characterized in that in step 4 the CNN-SVM model is established; the detailed process is:
the CNN-SVM model is established, and its relevant optimal parameters are determined by grid search with 5-fold cross-validation; the relevant optimal parameters of the CNN-SVM model are the penalty factor and the kernel function parameter.
6. The text classification method based on the CNN-SVM-KNN combined model according to claim 5, characterized in that in step 5 the CNN-KNN model is established; the detailed process is:
the CNN-KNN model is established; the initial value of K is 1, K is incremented by 1 each time, and the loop iterates to 50; the K value with the best result is selected as the final value of K, where K is the parameter of the KNN model.
7. The text classification method based on the CNN-SVM-KNN combined model according to claim 6, characterized in that the evaluation indices in step 9 are:
precision, recall, F1 value and accuracy;
TP denotes the number of positive-class samples correctly classified by the classifier; TN denotes the number of negative-class samples correctly classified by the classifier; FN denotes the number of positive-class samples mistakenly labeled as negative; FP denotes the number of negative-class samples mistakenly labeled as positive;
accuracy ACC, precision P and recall R are expressed as formulas (6), (7) and (8):
ACC = (TP + TN) / (TP + TN + FP + FN) (6)
P = TP / (TP + FP) (7)
R = TP / (TP + FN) (8)
precision P denotes the proportion of correctly classified positive-class texts among all texts predicted to be positive;
recall R denotes the proportion of correctly classified positive-class texts among all actually positive samples;
accuracy ACC denotes the proportion of correctly classified texts among all texts, indicating the overall classification precision on the dataset;
F1 combines the two indices precision P and recall R;
the F1 value combines the effects of precision P and recall R; the larger the F1 value, the better the classification effect, with a maximum value of 1.
CN201910718426.3A 2019-08-05 2019-08-05 Text classification method based on CNN-SVM-KNN combined model Pending CN110413791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910718426.3A CN110413791A (en) 2019-08-05 2019-08-05 Text classification method based on CNN-SVM-KNN combined model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910718426.3A CN110413791A (en) 2019-08-05 2019-08-05 Text classification method based on CNN-SVM-KNN combined model

Publications (1)

Publication Number Publication Date
CN110413791A true CN110413791A (en) 2019-11-05

Family

ID=68366020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910718426.3A Pending CN110413791A (en) 2019-08-05 2019-08-05 Text classification method based on CNN-SVM-KNN combined model

Country Status (1)

Country Link
CN (1) CN110413791A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688576A (en) * 2016-08-04 2018-02-13 中国科学院声学研究所 The structure and tendentiousness sorting technique of a kind of CNN SVM models
CN108920586A (en) * 2018-06-26 2018-11-30 北京工业大学 A kind of short text classification method based on depth nerve mapping support vector machines

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴展云: "Research and application of an automatic enterprise intelligence classification system based on text mining", China Masters' Theses Full-text Database, Information Science and Technology *
李超琪 et al.: "Face age classification based on deep convolutional neural networks", Intelligent Computer and Applications *
殷亚博: "Research on a short text classification algorithm based on convolutional neural networks and KNN", Computer Engineering *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021098618A1 (en) * 2019-11-21 2021-05-27 中国科学院深圳先进技术研究院 Data classification method and apparatus, terminal device and readable storage medium
CN111125359A (en) * 2019-12-17 2020-05-08 东软集团股份有限公司 Text information classification method, device and equipment
CN111125359B (en) * 2019-12-17 2023-12-15 东软集团股份有限公司 Text information classification method, device and equipment
CN111464707A (en) * 2020-03-30 2020-07-28 中国建设银行股份有限公司 Outbound call processing method, device and system
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
WO2021139329A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Entity recognition method and apparatus, computer device, and computer readable storage medium
CN112749757A (en) * 2021-01-21 2021-05-04 厦门大学 Paper classification model construction method and system based on gated graph attention network
CN112749757B (en) * 2021-01-21 2023-09-12 厦门大学 Thesis classification model construction method and system based on gating graph annotation force network
CN114781555A (en) * 2022-06-21 2022-07-22 深圳市鼎合丰科技有限公司 Electronic component data classification method by improving KNN method

Similar Documents

Publication Publication Date Title
CN110413791A (en) File classification method based on CNN-SVM-KNN built-up pattern
He et al. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features
CN106126581B (en) Cartographical sketching image search method based on deep learning
CN104866810B (en) A kind of face identification method of depth convolutional neural networks
CN106649275A (en) Relation extraction method based on part-of-speech information and convolutional neural network
CN111476294A (en) Zero sample image identification method and system based on generation countermeasure network
CN109241530A (en) A kind of more classification methods of Chinese text based on N-gram vector sum convolutional neural networks
CN107291688A (en) Judgement document&#39;s similarity analysis method based on topic model
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN108595602A (en) The question sentence file classification method combined with depth model based on shallow Model
CN109766277A (en) A kind of software fault diagnosis method based on transfer learning and DNN
CN105389326B (en) Image labeling method based on weak matching probability typical relevancy models
CN108776774A (en) A kind of human facial expression recognition method based on complexity categorization of perception algorithm
CN107145514B (en) Chinese sentence pattern classification method based on decision tree and SVM mixed model
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
CN111539452B (en) Image recognition method and device for multi-task attribute, electronic equipment and storage medium
CN112732916A (en) BERT-based multi-feature fusion fuzzy text classification model
CN109886161A (en) A kind of road traffic index identification method based on possibility cluster and convolutional neural networks
CN111506728B (en) Hierarchical structure text automatic classification method based on HD-MSCNN
CN108446334A (en) A kind of content-based image retrieval method of unsupervised dual training
CN110414587A (en) Depth convolutional neural networks training method and system based on progressive learning
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
CN110909785B (en) Multitask Triplet loss function learning method based on semantic hierarchy
CN105608443B (en) A kind of face identification method of multiple features description and local decision weighting
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191105